LLMs Shouldn't Replace the Query Optimizer — They Should Sit Behind It
Putting the LLM after the optimizer, emitting JSON patches for local plan tuning, is easier to reason about as engineering than asking it to replace the cost-based optimizer.
An ETH paper finds 21 previously unknown performance bugs in PostgreSQL, MySQL, CockroachDB and MariaDB by flipping optimization branches on and off. The technique is conceptually simple, the surface in Spark is unusually inviting, and the open-source engine community already ships one of the building blocks.
Databricks and UPenn put an LLM agent to work as an offline join-order tuner and cut P90 latency by 41% (geomean speedup 1.288×) on the 113 queries of the Join Order Benchmark (JOB), beating even perfect cardinality estimates. From the trenches of an open-source query engine, here is what that result does and does not prove.
Part 5 of the SQL Metrics deep dive. How Gluten maps Substrait plan nodes to Velox operators, aggregates metrics across pipelines, walks the MetricsUpdaterTree, and handles aggregation sub-phases and shuffle metrics.
Part 6 of the SQL Metrics series. A real-world walkthrough of TPC-DS q99 at SF10000 with Gluten/Velox, reading every metric to understand what happened during execution.
Part 1 of a 3-part deep dive into Apache Spark’s SQL metrics system. Covers the 5 metric types, a complete reference of 100+ metrics across all operators, and how to read the numbers in the Spark UI.
Part 2 of the SQL Metrics deep dive. How metrics flow from tasks to driver, and how Adaptive Query Execution uses shuffle statistics to rewrite plans at runtime.
Part 3 of the SQL Metrics deep dive. How to extend Spark with custom metrics via the DataSource V2 API, how the UI renders them, and how to query metrics programmatically.
Part 4 of the SQL Metrics deep dive. How Apache Gluten bridges native Velox/ClickHouse metrics back to Spark’s SQL Metrics framework, adding 60+ metrics that vanilla Spark doesn’t have.
Apache Spark 4.1 introduces Spark Declarative Pipelines (SDP) — a declarative framework that lets you define what your data should look like, not how to compute it. As a Spark PMC Member, here’s my take on what this means for data engineering.