Michał Szczepaniak

Self-healing for cloud-native high-performance query processing Pt. 3: System model July 16, 2024

Introduction

As we saw in the previous parts (part 1 and part 2), checkpointing looks promising from the perspective of a single query, especially for queries with a low output/input factor. However, the number of parameters makes a detailed analysis difficult. Additionally, we concentrated on cases where a failure occurred, whereas in a real system failures are rare events. For this reason, in this part we evaluate how both methods would behave in systems with different query types, sizes, and failure probabilities.
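To illustrate why rare failures change the comparison, here is a minimal sketch that weights each method's run time by the failure probability. The function name `expected_time` and all timings are hypothetical, chosen only for illustration; they are not results from this series.

```python
def expected_time(p_failure: float, time_without_failure: float,
                  time_with_failure: float) -> float:
    """Expected run time of a query that fails with probability p_failure."""
    if not 0.0 <= p_failure <= 1.0:
        raise ValueError("p_failure must be in [0, 1]")
    return (1.0 - p_failure) * time_without_failure + p_failure * time_with_failure


# Hypothetical timings in seconds (assumptions, not measurements):
# recomputation has no overhead without a failure but is slow after one;
# checkpointing always pays an overhead but recovers faster.
RECOMPUTE = {"time_without_failure": 100.0, "time_with_failure": 150.0}
CHECKPOINT = {"time_without_failure": 110.0, "time_with_failure": 120.0}

for p in (0.01, 0.5):
    print(p, expected_time(p, **RECOMPUTE), expected_time(p, **CHECKPOINT))
```

With these made-up numbers, recomputation has the lower expected time when failures are rare, and checkpointing pulls ahead only when failures become frequent.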

Self-healing for cloud-native high-performance query processing Pt. 2: Cost model July 2, 2024

Introduction

In the previous part, we focused on creating time models for two self-healing methods: recomputation and checkpointing. We stated that the recomputation model can be described using the following parameters:

Input size
Processing speed
Failure point

The checkpointing model required the following additional parameters:

Output/input factor
Network connection
Checkpoint file size
CPU overhead for data transfer
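The two parameter lists above can be grouped into plain data structures, sketched below. All field names and units are assumptions made for illustration. The `recomputation_time` formula is also an assumed simplification (work done before the failure is lost, then the query reruns in full), not necessarily the exact model from part 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecomputationParams:
    input_size_gb: float              # amount of data the query reads
    processing_speed_gb_per_s: float  # data processed per second
    failure_point: float              # fraction of the query done at failure, in [0, 1]


@dataclass(frozen=True)
class CheckpointingParams(RecomputationParams):
    output_input_factor: float        # output size divided by input size
    network_gb_per_s: float           # bandwidth to the checkpoint storage
    checkpoint_file_size_gb: float    # size of a single checkpoint file
    cpu_transfer_overhead: float      # fraction of CPU time spent on data transfer


def recomputation_time(p: RecomputationParams) -> float:
    """Assumed model: time lost before the failure plus one full rerun."""
    full_run = p.input_size_gb / p.processing_speed_gb_per_s
    return p.failure_point * full_run + full_run


example = RecomputationParams(input_size_gb=100.0,
                              processing_speed_gb_per_s=2.0,
                              failure_point=0.5)
print(recomputation_time(example))
```

Keeping the checkpointing parameters in a subclass mirrors the text: the checkpointing model needs everything the recomputation model needs, plus four more values.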

However, in the cloud we can often push processing times close to zero by scaling. Doing so can generate a high cost and be unreasonable from a maintenance perspective. For this reason, we should always consider performance in the context of price and look for a reasonable trade-off.
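The trade-off between time and price can be sketched with a simple scaling model. The sketch below uses Amdahl's law as a stand-in for imperfect scaling; this is an assumption for illustration, not the cost model developed in this series, and the function names, the serial fraction, and the price are all hypothetical.

```python
def run_time_s(single_node_time_s: float, nodes: int, serial_fraction: float) -> float:
    """Run time on `nodes` machines under Amdahl's law (assumed scaling model)."""
    return single_node_time_s * (serial_fraction + (1.0 - serial_fraction) / nodes)


def run_cost(single_node_time_s: float, nodes: int, serial_fraction: float,
             price_per_node_s: float) -> float:
    """Total price: every node is billed for the whole run."""
    return nodes * run_time_s(single_node_time_s, nodes, serial_fraction) * price_per_node_s


# Hypothetical query: 1000 s on one node, 5% non-parallelizable work,
# and a made-up price of 0.001 per node-second.
for n in (1, 10, 100):
    print(n, run_time_s(1000.0, n, 0.05), run_cost(1000.0, n, 0.05, 0.001))
```

Under these assumptions, each added node shortens the run by less than the one before it, while the bill keeps growing, which is why time and price have to be considered together.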

Self-healing for cloud-native high-performance query processing Pt. 1: Time model June 18, 2024

Introduction

The cloud is one of the most impactful innovations of recent years. More and more companies are moving to the cloud instead of using on-premise servers. As a consequence, the amount of data processed in the cloud is increasing. For these reasons, new services for query execution are being developed, e.g., Snowflake, Amazon Redshift, and Google BigQuery. However, the cloud is built on unreliable hardware, and failures are common. Consequently, high-performance query processing has to be able to detect such failures and execute self-healing operations to hide them from customers in order to provide a high-quality service. Additionally, hardware failures are not the only reason for stopping processing and resuming it on another machine. Other examples include spot-instance interruptions and scaling up.