LLMs as Cardinality Estimators: Accurate, But Only If You Don't Call Them Every Time

The earlier posts in this series circled the optimizer’s execution axis: query rewriting (just asking an LLM to rewrite SQL does almost nothing, the blind spot of rule-based rewriting on DSB), plan tuning (treating the LLM as a plan-tuner rather than an optimizer), and physical design (when the index tuner’s cost model lies). This one drills down to the lowest input the optimizer has — cardinality estimation (CardEst).

The source is an empirical study from Peking University and ByteDance, released in late March 2026 (arXiv:2603.28080), with a title posed as an honest question: Can Large Language Models be a Cardinality Estimator? An Empirical Study. It doesn’t presume the answer; instead it drags Llama-3 8B into PostgreSQL’s optimizer and carefully measures both the accuracy and the end-to-end bill.

The conclusion comes in two halves, and it’s the two halves together that make this paper worth reading: on accuracy the LLM wins cleanly, end-to-end it loses badly — at first.

Why cardinality estimation is the optimizer’s vital organ

The optimizer picks an execution plan via the cost model; whether the cost model is right rests almost entirely on cardinality estimation — how many rows each sub-query (especially after multi-table joins) will actually produce. Over-estimate and the optimizer thinks it should build a hash join on a huge table; under-estimate and it thinks a few rows can be nested-loop’d freely. A cardinality off by an order of magnitude can knock the downstream join order, join algorithm, and parallelism all askew, and the plan degrades outright.

Traditional estimators rely either on histograms plus independence assumptions (PostgreSQL’s default) or on learned models (MSCN, DeepDB, NeuroCard, and the newer PRICE). Their shared headache is correlation under multi-table joins: the more joins, the more distorted the distribution, the wilder the estimate. That is exactly the terrain this paper wants to see whether an LLM can handle.

Evidence one: on accuracy, the LLM genuinely wins

On method, the team did not cram the schema into some MLP encoder — they tried, and it overfit badly. The final recipe is a triple: a carefully designed prompt (fed coarse-grained statistics, and even the outputs of other estimators like PRICE stuffed into the context), LoRA fine-tuning (the key being to generate the number digit by digit as tokens rather than regress a scalar), and a self-correction loop at inference time. The default model is Llama-3 8B, the underlying engine PostgreSQL, with PilotScope injecting the estimated cardinalities back into the optimizer.

Accuracy is measured by Q-error (the ratio of estimate to truth, taken in the larger direction; 1 is perfect). One setting captures the point best: train only on queries with fewer than 3 joins, test entirely on queries with more than 3 joins — forcing the model to extrapolate to harder correlations. The 99th-percentile Q-error column:

Estimator	99th-pct Q-error
PostgreSQL	52594
MSCN	17239
DeepDB	20236
PRICE (best baseline)	10672
Llama-3 + fine-tuning	2581.73

The gap isn’t a few percent, it’s an order-of-magnitude tightening of the tail. On this IMDB table, the 99th percentile tightens from PRICE’s 10672.53 to 2581.73 — a 75.81% reduction. On the standard STATS setting, the 99th-pct Q-error drops by up to 74.1% relative to the best baseline. Across IMDB, STATS, ErgastF1, and Genome, the LLM beats the strongest baseline PRICE on nearly every setting. A tight tail means the very queries most likely to drive the optimizer off a cliff get rescued.

If the paper stopped here, it’d be another “the LLM won again” piece. But the authors went on to measure end-to-end.

Evidence two: end-to-end, it backfires — at first

Feed more accurate cardinalities back into the optimizer, the plan should be better, the query faster — that’s the intuition. The measurements flipped on several workloads.

On JOB-light and ErgastF1, plain Llama+PFT’s end-to-end total time actually exceeds the strongest baseline PRICE: it estimates more accurately, and on query-execution time alone it beats PG, PRICE, and ASM by a wide margin — but every estimate requires one inference pass of an 8B model, and that latency, accumulated into the total, eats up all the plan gain the accuracy bought on these two workloads. “More accurate” has, for the first time here, decoupled from “faster.”

This is the crux of the whole paper, and the one sentence the perf lens should keep: accuracy is the benefit, the cost of one estimate is the price, and end-to-end is the two subtracted. On the accuracy board alone, the LLM sweeps; count the wall-clock of every inference, and the bill changes instantly.

It doesn’t lose everywhere, of course. On more complex, larger-plan-space workloads like STATS and Genome, the plan improvement from the accuracy advantage far outweighs that inference overhead, and end-to-end still nets out ahead. The question shifts from “can the LLM do it” to “when is calling the LLM worth it.”

Evidence three: cost-aware gating, a very old wisdom

The authors’ fix isn’t in the model, it’s in the scheduling — and it’s an old trick anyone from query engines recognizes on sight: use the cost model as a bouncer.

Concretely (the paper calls it LLMs+PFT+cost): for each sub-query, first use PostgreSQL’s own cost model to estimate its execution cost; only the high-cost sub-queries above a threshold get to spend on an LLM call to refine the cardinality; very low-cost sub-queries just use PostgreSQL’s native estimate. The threshold is calibrated from the empirical relationship between “estimated cost vs actual runtime.”

The logic is clean: high-cost sub-queries tend to have many joins and strong correlation — precisely where the baselines degrade worst and the LLM’s edge is largest, so spending one inference here to buy back a plan that doesn’t degrade several-fold is worth it; low-cost sub-queries are already estimated accurately enough by the baseline, and calling the LLM is pure wasted latency. The gate spends the expensive compute exactly on the sub-queries where it can change the outcome.

This is the same sentence this series keeps returning to: the LLM doesn’t replace the optimizer, it’s used under the optimizer’s discipline. Last post, the cost model was the thing the LLM patched the blind spots of; this post, it flips into the gate that constrains the LLM — the cost model is on both ends.

Evidence four: laying the cost bill out flat

Worth being clear about the cost side too, so accuracy isn’t all that’s remembered.

Pre-fine-tuning (PFT): one-time, offline, about 12 hours, but the product is reusable across any database instance, so amortized it’s cheap.
Target-DB fine-tuning: LoRA, about 50–60 minutes, on par with most baselines, and even shorter than DeepDB, FactorJoin, and ALECE.
Single-inference latency: the structural weak point, and the root of the end-to-end backfire in evidence two; the gate exists for it.
Self-correction: iterative correction reduces hallucination, but the authors measured that on IMDB 85.03% of queries are accurate enough with zero iterations, and only 9.24% need up to 5 rounds — the bulk doesn’t actually iterate, and the overhead is less dramatic than feared.

The hardware is 8× A-800 (80GB), 512GB RAM, trained with LLaMA-Factory. The code is open-source (PKU-SDS-lab/EA-LLM-IN-DB). The scale isn’t small, but the training-side bill amortizes; what truly bottlenecks is always that one inference pass.

Closing: the cost model as bouncer

String the four pieces of evidence together and the paper answers its title’s question, with a caveat: a large model can be a cardinality estimator, provided you don’t call it for every sub-query.

On accuracy it genuinely won, and won on the hardest multi-join tail; but amortize inference latency into end-to-end and calling it blindly backfires. The real engineering contribution isn’t the fine-tuned Llama, it’s that plain gate — letting the optimizer’s own cost model decide which sub-queries deserve an LLM inference. Spending expensive compute on the cutting edge isn’t unique to CardEst; it’s most likely the general pattern for LLMs entering every stage of the query engine: the model is responsible for being more accurate where it’s hard, the cost model for constraining it to be called where it pays.

The next thing that should be examined this way is which stage left in the optimizer can both withstand the LLM’s accuracy dividend and bear the latency bill of every call.

Sources

arXiv:2603.28080 — Can Large Language Models be a Cardinality Estimator? An Empirical Study, Peking University + ByteDance + UPenn, 2026-03-31.
Code: https://github.com/PKU-SDS-lab/EA-LLM-IN-DB
Accessed: 2026-05-31.

Why cardinality estimation is the optimizer’s vital organ#

Evidence one: on accuracy, the LLM genuinely wins#

Evidence two: end-to-end, it backfires — at first#

Evidence three: cost-aware gating, a very old wisdom#

Evidence four: laying the cost bill out flat#

Closing: the cost model as bouncer#

Sources#