Performance

LLMs as Cardinality Estimators: Accurate, But Only If You Don't Call Them Every Time

Cardinality estimation is the heart of the optimizer. A team from Peking University and ByteDance fine-tunes Llama-3 8B for cardinality estimation (CardEst), and on workloads like IMDB and STATS the 99th-percentile Q-error drops by up to 74.1% versus the strongest baseline, PRICE. The accuracy win is real. End to end, though, it backfires: on JOB-light and ErgastF1 the LLM’s inference latency outweighs the gains from its more accurate plans, and total time ends up higher than PRICE’s. The real engineering contribution isn’t the model; it’s the gate that uses the optimizer’s own cost model as a bouncer: call the LLM only for high-cost sub-queries and leave the rest to the old methods.
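The gating idea can be sketched in a few lines. This is only an illustration of the routing logic described above, not the paper's implementation: `SubQuery`, `gated_estimate`, the `cost_threshold` parameter, and both estimator callables are hypothetical names, and how the threshold is actually chosen is not covered here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SubQuery:
    """Hypothetical container for one sub-query seen by the optimizer."""
    sql: str
    optimizer_cost: float  # cost assigned by the optimizer's own cost model


# An estimator maps a sub-query to an estimated row count.
Estimator = Callable[[SubQuery], float]


def gated_estimate(
    sub: SubQuery,
    cost_threshold: float,       # assumed tuning knob, not from the source
    cheap_estimator: Estimator,  # the "old methods", e.g. histograms
    llm_estimator: Estimator,    # slow but more accurate LLM call
) -> float:
    """Pay for LLM inference only when the sub-query is expensive enough."""
    if sub.optimizer_cost >= cost_threshold:
        return llm_estimator(sub)
    return cheap_estimator(sub)
```

The point of the design is that the gate itself is nearly free: it reuses a cost the optimizer has already computed, so cheap sub-queries never incur inference latency.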

When the Index Tuner's Cost Model Lies: Where LLMs See What DTA Can't

A Microsoft team evaluates LLM-driven index tuning on real enterprise customer workloads. On query 22 of Real-R, the SOTA commercial tuner DTA recommends indexes that cause a near-10x regression; on the same query, GPT-5 cuts execution time from 10 seconds to 4. The LLM wins precisely where the what-if cost model is wrong. But that intuition is high-variance, can’t be bolted into the existing architecture, and can’t be validated cheaply — it’s not a replacement for DTA today, it’s a source of the candidate indexes DTA can’t see.

−46% or −2%? Rule-Based Rewriters Only Work at Home

On TPC-H 10GB, a state-of-the-art learned rewriter cuts mean execution time from 69.84s to 37.57s — a 46% win. On DSB 10GB, the same rewriter takes 32.62s to 31.93s — a 2.1% non-event. The gap isn’t query difficulty; it’s whether the benchmark is in the rewriter’s training distribution. “Rule-based systems are stable and reliable” is often a benchmark artifact, not an engineering fact.

Branch Flip Analysis: A White-Box Way to Find Performance Bugs, and What It Means for Spark

An ETH paper finds 21 previously unknown performance bugs in PostgreSQL, MySQL, CockroachDB and MariaDB by flipping optimization branches on and off. The technique is conceptually simple, the surface in Spark is unusually inviting, and the open-source engine community already ships one of the building blocks.

Just Asking an LLM to Rewrite SQL Does Almost Nothing

On TPC-H 10GB, asking GPT-4o to rewrite SQL takes mean execution time from 78.81s down to 74.92s — almost nothing. Swap in an open 14B model, feed it plans, add a reward, fine-tune once, and the same workload drops to 29.67s. Whether LLMs can help SQL rewriting is not a question about model strength; it’s a question about whether you’re willing to give the model the signals it actually needs.

LLMs for Join Order: An Apache Spark Perspective on the Three-Tier Ladder

Databricks and UPenn put an LLM agent to work as an offline join-order tuner and got P90 latency down 41% / geomean 1.288× speedup on JOB’s 113 queries — beating even perfect cardinality estimates. From the trenches of an open-source query engine, here is what that result does and does not prove.

Deep Dive into Spark SQL Metrics (Part 6): Metrics In Action — TPC-DS q99 with Gluten

Part 6 of the SQL Metrics series. A real-world walkthrough of TPC-DS q99 at SF10000 with Gluten/Velox, reading every metric to understand what happened during execution.

Deep Dive into Spark SQL Metrics (Part 1): Types, Full Reference, and What They Mean

Part 1 of the SQL Metrics series, a deep dive into Apache Spark’s SQL metrics system. Covers the 5 metric types, a complete reference of 100+ metrics across all operators, and how to read the numbers in the Spark UI.

Introducing spark-advisor: An AI-Powered Spark Performance Engineer

spark-advisor is an agent skill that turns your AI coding assistant into a Spark performance engineer — diagnosing slow jobs, detecting skew, comparing benchmark runs, and producing actionable tuning recommendations.