Pascal Ginter
Reading the Right Data Dominates Query Runtime
Analytical query processing powers modern data-driven decision-making, turning massive datasets into timely, trustworthy insights. To keep up with ever-growing data volumes, many systems turn to cloud object storage, which decouples compute from storage and promises near-infinite scalability. Large datasets can be subdivided into smaller “blocks”, each containing a subset of tuples and stored as a separate object.
When analyzing query performance, most would expect joins or aggregations to dominate runtime, especially since analytical queries are often very complex. Surprisingly, that is not the case. Instead, the most time-consuming operator is a seemingly simple one: scanning and filtering data, which accounts for roughly 50% of the total runtime.
Over the past 12 years, just-in-time (JIT) compilation of SQL query plans (pioneered by Thomas Neumann at TUM) has gained popularity for building high-performance analytical database management systems. The main idea sounds simple: the system generates specialized code for each individual query, avoiding the interpretation overhead of traditional query engines.
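To make the interpretation-vs-compilation contrast concrete, here is a minimal, hypothetical sketch in Python (not LingoDB's or any real engine's implementation; all names are illustrative). The interpreted path re-dispatches on the operator for every tuple, while the "compiled" path emits source code specialized for the one query at hand and compiles it once:

```python
def interpret_filter(rows, column, op, value):
    # Interpreted engine: the operator is re-examined for every single tuple.
    out = []
    for row in rows:
        v = row[column]
        if op == ">":
            keep = v > value
        elif op == "<":
            keep = v < value
        else:
            raise ValueError(f"unsupported operator: {op}")
        if keep:
            out.append(row)
    return out


def compile_filter(column, op, value):
    # "JIT" sketch: generate source code specialized for this exact query,
    # compile it once, and return the resulting function.
    src = (
        "def scan(rows):\n"
        f"    return [r for r in rows if r[{column!r}] {op} {value!r}]\n"
    )
    namespace = {}
    exec(compile(src, "<generated>", "exec"), namespace)
    return namespace["scan"]


rows = [{"price": p} for p in (5, 20, 35)]
scan = compile_filter("price", ">", 10)
# Both paths produce the same result; the compiled one avoids
# per-tuple dispatch on the operator.
assert scan(rows) == interpret_filter(rows, "price", ">", 10)
```

Real engines such as HyPer generate low-level code (e.g., LLVM IR) rather than Python source, but the principle is the same: the branch structure of the query plan is burned into the generated code instead of being re-interpreted per tuple.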
LingoDB, a new research project by Michael Jungmair at TUM, aims to drastically enhance the flexibility and extensibility of this approach. It pursues this goal in two ways: