The Importance of Avoiding Unnecessary Work
- tags
- #Coding-Challenge #Cloud #Data-Structures #Data-Skipping
- level
- Beginner
- author
- Pascal Ginter
- published
- reading time
- 5
Reading the Right Data Dominates Query Runtime
Analytical query processing powers modern data-driven decision-making, turning massive datasets into timely, trustworthy insights. To keep up with ever-growing data volumes, many systems turn to cloud object storage, which decouples compute from storage and promises virtually infinite scalability. Large datasets can be subdivided into smaller “blocks”, each containing a subset of tuples and stored as a separate object.
When analyzing query performance, most would expect joins or aggregations to dominate runtime, especially since analytical queries are often very complex. Surprisingly, that is not the case. Instead, the most time-consuming operator is a seemingly simple one: scanning and filtering data, which accounts for roughly 50% of the total runtime.
Why does scanning dominate runtime? First, transferring large amounts of data over the HTTP/TCP interface of object stores consumes substantial CPU resources. Second, the data is typically stored in compressed formats such as Apache Parquet, which must be decompressed before processing, a step that has become the bottleneck in modern cloud analytics.
To reduce execution time, it is thus paramount to reduce the amount of data that needs to be accessed by a particular query.
Data Skipping Index Structures
Unlike transactional database systems, cloud data warehouses lack indexes that point directly to the relevant tuples. The reason lies in the characteristics of cloud object storage, which exhibits high access latency and only allows storing immutable objects. Indexes would incur additional storage costs, and every update would require rewriting a significant part of the index.
Instead, cloud data warehouses still rely on full table scans, though these are accelerated through data skipping techniques. Before reading a block of data, the system checks lightweight metadata to determine whether the block could possibly contain tuples that satisfy the filter predicate. If it cannot, there is no need to access the block, and it is skipped. Using these techniques, production-grade systems can avoid accessing over 75% of data blocks for the median query.
Most systems implement data skipping through min/max metadata. For each block and each column, they store just two values: the minimum and maximum of that column within the block. At query time, a block is only read if the predicate value falls between the block’s minimum and maximum. Background services cluster similar data together to keep this check selective.
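Conceptually, the check is only a few lines of code. Here is a minimal sketch in Python (the `BlockStats` layout and function names are hypothetical; real systems keep this metadata in catalogs or file footers):

```python
from dataclasses import dataclass

@dataclass
class BlockStats:
    """Hypothetical per-block, per-column min/max metadata."""
    block_id: str
    min_value: int
    max_value: int

def may_contain(stats: BlockStats, predicate_value: int) -> bool:
    """True if the block could contain the predicate value.

    False positives are possible (the value can lie inside the range yet be
    absent from the block); false negatives are not, as long as the stored
    min/max values are correct.
    """
    return stats.min_value <= predicate_value <= stats.max_value

def blocks_to_scan(all_stats: list[BlockStats], predicate_value: int) -> list[str]:
    """Consult the lightweight metadata first; only surviving blocks are fetched."""
    return [s.block_id for s in all_stats if may_contain(s, predicate_value)]
```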
Consider the illustrative example depicted in the figure below: we scan a table consisting of a single column x for the value 42. The table consists of three blocks of data: A, B, and C. The query engine first iterates over the three blocks and accesses the min/max metadata for each. For blocks A and C, the predicate value 42 lies between the block’s minimum and maximum values. These blocks could thus contain the value we are looking for and need to be retrieved. The maximum value of block B, on the other hand, is smaller than the predicate value, so the block cannot possibly contain the search value and can safely be skipped.
After fetching, decompressing, and scanning blocks A and C, we find the predicate value only in block C, meaning it was not necessary to access block A. We call block A a false positive.
- False positives lead to accessing more data blocks than necessary. While they increase query latency, the result remains correct, as all rows of the block will still be filtered individually by the query engine after decompression.
- False negatives, however, mean that a block containing the predicate value will not be accessed during query execution, leading to incorrect results.
While false positives can be tolerated in the context of skipping data structures, false negatives cannot.
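To make the example concrete, the sketch below replays it end to end. The block contents are invented to match the figure’s behavior (A and C pass the min/max check, but only C actually holds 42):

```python
# Hypothetical block contents consistent with the example: the ranges of
# A and C both cover 42, but only C actually contains it.
blocks = {
    "A": [35, 40, 47, 50],  # range [35, 50] covers 42 -> fetched, no match
    "B": [10, 17, 23, 30],  # max 30 < 42              -> safely skipped
    "C": [38, 42, 55, 60],  # range [38, 60] covers 42 -> fetched, match
}

predicate_value = 42
for block_id, rows in blocks.items():
    lo, hi = min(rows), max(rows)  # stands in for the stored min/max metadata
    if not (lo <= predicate_value <= hi):
        print(f"block {block_id}: skipped (true negative)")
        continue
    # In a real system, fetching and decompressing the object happens here;
    # afterwards every row is filtered exactly, so correctness is preserved.
    if predicate_value in rows:
        print(f"block {block_id}: fetched, match found")
    else:
        print(f"block {block_id}: fetched, no match (false positive)")
```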
The Trade-Off Between Accuracy and Storage Size
In the cloud, analytical data typically resides in object stores such as AWS S3 or Azure Blob Storage. These managed services automatically handle replication and scaling, but users are billed both for storing the data (around $23 per TB-month) and for each access to it ($0.40 per million requests). Each true negative produced by the skipping data structure avoids a request to object storage as well as the CPU cost of actually reading and decompressing the data. Yet the skipping metadata itself needs to be stored and thus incurs storage cost.
The trade-off between access and storage cost is illustrated in the figure above. Minimizing the storage footprint reduces storage costs, but the unavoidable false positives of such a compact, and thus approximate, data structure drive up the actual scan cost, which then dominates the total workload cost. At the other extreme, a very detailed data structure avoids false positives, but its storage costs begin to dominate. The challenge is to find the balance between storage size and accuracy that minimizes the total workload cost. Where that balance lies depends on how often the data is scanned: if queries arrive multiple times a second, storage costs are an afterthought; if months pass between queries, every byte begins to count.
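A back-of-the-envelope model makes the break-even point tangible. The sketch below is purely illustrative: it uses the prices quoted above, ignores CPU and egress costs, and all workload parameters are invented.

```python
STORAGE_PER_TB_MONTH = 23.0       # $ per TB-month, approximate figure from above
COST_PER_MILLION_REQUESTS = 0.40  # $ per million requests

def monthly_cost(index_size_tb: float, scans_per_month: float,
                 blocks_per_table: int, false_positive_rate: float,
                 true_match_fraction: float) -> float:
    """Illustrative total: metadata storage plus object-store requests.

    Each scan fetches the truly matching blocks plus the false positives;
    CPU time for decompression is ignored for simplicity.
    """
    accessed_per_scan = blocks_per_table * (true_match_fraction + false_positive_rate)
    requests = scans_per_month * accessed_per_scan
    return (index_size_tb * STORAGE_PER_TB_MONTH
            + requests / 1e6 * COST_PER_MILLION_REQUESTS)

# Hot workload (a million scans per month): the detailed index wins.
print(monthly_cost(0.001, 1e6, 10_000, 0.25, 0.01))   # compact index: ~$1040
print(monthly_cost(0.5, 1e6, 10_000, 0.001, 0.01))    # detailed index: ~$56
# Cold workload (one scan per month): the compact index wins.
print(monthly_cost(0.001, 1, 10_000, 0.25, 0.01))     # ~$0.02
print(monthly_cost(0.5, 1, 10_000, 0.001, 0.01))      # ~$11.50
```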
Currently, most systems simply use min/max statistics for all types of workloads and rely on background reorganization of the data to maximize the effectiveness of data skipping.
The TUMuchData 2025 Coding Challenge
If you’ve read this far and think you can design something smarter than simple min/max statistics, now is your chance. TUMuchData is hosting a coding challenge in which you can design and implement your own data skipping index structure. The goal is to optimize the total workload cost through the right choice of skipping index. On three representative workloads, you can compete for the top spot on the leaderboard with participants from all over the world.
The challenge runs from Friday, October 17th 2025 to Sunday, November 9th. Participation is free and open to everyone. We can’t wait to see what you come up with.
For more information, visit our contest website.