Deep Dive into Spark SQL Metrics (Part 4): How Gluten Extends the Metrics System

This is a bonus Part 4 of the Spark SQL Metrics series:

Part 1: Metric types, complete reference, and what they mean
Part 2: How metrics work internally, and how AQE uses them for runtime decisions
Part 3: Extension APIs, UI rendering, and REST API
Part 4 (this post): How Gluten extends the metrics system

How Gluten’s Native Engine Produces Metrics

Apache Gluten replaces the JVM execution engine with a native C++ engine — either Velox or ClickHouse. Because native operators execute independently (not fused by JVM codegen), each C++ operator is a separate function call with its own timing infrastructure. As a natural consequence, Gluten surfaces 60+ metrics per operator, including wall clock time, per-phase join metrics, native spill tracking, dynamic filter statistics, and I/O breakdowns by storage tier.

The 3-Layer Architecture

Gluten’s metrics system bridges two worlds: Spark’s JVM-based SQLMetric framework and the native C++ execution engine. The architecture has three layers:

Spark SQLMetric (JVM)     ←── MetricsUpdater (bridge)  ←── Velox/CH (C++)
Map[String, SQLMetric]        updateNativeMetrics()         long[] arrays via JNI

Layer 1: Spark SQLMetric (unchanged)

Each *ExecTransformer — Gluten’s replacement for vanilla Spark’s *Exec operators — overrides lazy val metrics using the same pattern as vanilla Spark. But instead of hardcoding the metric set, it delegates to the backend:

BackendsApiManager.getMetricsApiInstance
  .genFilterTransformerMetrics(sparkContext)

This means the Velox backend and the ClickHouse backend can define completely different metrics for the same logical operator. A FilterExecTransformer running on Velox might expose wallNanos and peakMemoryBytes, while the same operator running on ClickHouse could expose different internal counters. The metric definitions are backend-specific, but they all end up as standard SQLMetric objects that Spark’s UI and REST API can display.

Layer 2: MetricsUpdater (Gluten’s bridge abstraction)

The MetricsUpdater trait is Gluten’s central bridging abstraction. It defines a single method:

trait MetricsUpdater extends Serializable {
  def updateNativeMetrics(opMetrics: IOperatorMetrics): Unit
}

Each operator has a corresponding MetricsUpdater implementation. These updaters are organized into a MetricsUpdaterTree that mirrors the plan DAG — one updater per operator, connected in the same parent-child structure as the physical plan.

Why a separate tree? Because the MetricsUpdaterTree is Serializable — it can be sent to executors without serializing the full SparkPlan (which contains non-serializable objects like SparkContext). On the executor, after native execution completes, the tree walks the native metrics and updates the SQLMetric accumulators.

Three special sentinel instances handle edge cases:

MetricsUpdater.None — operator has no metrics to update
MetricsUpdater.Todo — metrics support not yet implemented for this operator
MetricsUpdater.Terminate — the branch ends here (no children to recurse into)

Here’s a concrete example — the FilterMetricsUpdater:

class FilterMetricsUpdater(val metrics: Map[String, SQLMetric]) extends MetricsUpdater {
  override def updateNativeMetrics(opMetrics: IOperatorMetrics): Unit = {
    val m = opMetrics.asInstanceOf[OperatorMetrics]
    metrics("numOutputRows") += m.outputRows
    metrics("outputVectors") += m.outputVectors
    metrics("outputBytes") += m.outputBytes
    metrics("cpuCount") += m.cpuCount
    metrics("wallNanos") += m.wallNanos
    metrics("peakMemoryBytes") += m.peakMemoryBytes
    metrics("numMemoryAllocations") += m.numMemoryAllocations
  }
}

Notice how each native metric field (e.g., m.wallNanos) maps directly to a SQLMetric key. The updater is the translation layer between native C++ naming and Spark’s metric namespace.

Layer 3: Native metrics via JNI

On the C++ side, the Velox engine collects metrics in arrays during execution — one entry per operator index. When a task completes, Gluten transfers these metrics across the JNI boundary as a Metrics object containing long[] arrays:

inputRows[]       — rows consumed by each operator
outputRows[]      — rows produced by each operator
wallNanos[]       — wall clock nanoseconds per operator
cpuCount[]        — CPU time per operator
peakMemoryBytes[] — peak memory per operator
...               — 20+ more arrays

The MetricsUpdatingFunction walks the MetricsUpdaterTree, extracting per-operator values from the arrays by operator index. This is a bulk transfer — one JNI call per task, not per row — keeping overhead minimal.

What Gluten Adds — 60+ Metrics

Let’s look at the specific metrics Gluten introduces, organized by category.

Per-Operator Execution Metrics

In vanilla Spark, most operators report only numOutputRows. In Gluten, every operator gets these:

Metric	Display Name	Type	What It Measures
`wallNanos`	time of {operator}	nsTiming	Wall clock time per operator
`cpuCount`	cpu wall time count	sum	Number of `getOutput()` invocations (batch count)
`peakMemoryBytes`	peak memory bytes	size	Peak memory usage
`numMemoryAllocations`	number of memory allocations	sum	Memory allocation count
`outputRows`	number of output rows	sum	Output row count
`outputVectors`	number of output vectors	sum	Output vector (batch) count
`outputBytes`	number of output bytes	size	Output data volume in columnar format
`loadLazyVectorTime`	time to load lazy vectors	timing	Time loading lazy-evaluated vectors

Note: wallNanos uses an operator-specific display name — “time of filter”, “time of sort”, “time of scan and filter”, “time of project”, etc.

Having wallNanos on every operator makes it straightforward to identify bottleneck operators in native execution.

Understanding wallNanos and cpuCount

These two metrics deserve special attention because they are the most important for performance analysis.

Both originate from Velox’s CpuWallTiming structure, which is collected via RAII timers (DeltaCpuWallTimer) wrapping each operator’s getOutput() call:

struct CpuWallTiming {
  uint64_t count;      // Number of getOutput() invocations (batch count)
  uint64_t wallNanos;  // Total wall-clock time (steady_clock, nanoseconds)
  uint64_t cpuNanos;   // Total CPU time (CLOCK_THREAD_CPUTIME_ID, nanoseconds)
};

wallNanos — measured with std::chrono::steady_clock. Captures total real elapsed time, including any time the operator spends blocked waiting for its child to produce data, I/O waits, or thread scheduling delays.

cpuCount — despite the name, this is actually the invocation count (number of getOutput() calls = number of batches processed), not CPU time. The Gluten JNI bridge maps CpuWallTiming.count to the cpuCount metric.

How to interpret:

Scenario	wallNanos	cpuCount	What It Means
Large data, even work	High	High	Many batches processed, expected
Few batches, each slow	High	Low	Possible skew or complex per-batch work
Leaf operator (scan)	High	—	Mostly I/O time (check `ioWaitTime` separately)
Middle operator (filter)	High	—	Includes wait for child — compare with child’s wallNanos

Important caveat — wallNanos includes child waiting:

Because wallNanos wraps the entire getOutput() call, a parent operator’s wallNanos includes time spent blocked waiting for its child to produce data. This means:

For a leaf operator (scan): wallNanos ≈ I/O + compute time
For a middle operator (filter above a scan): wallNanos = own compute + child’s scan time
You cannot simply sum wallNanos across all operators — that would double-count

To isolate an operator’s own contribution, compare its wallNanos with its child’s wallNanos. The difference is the operator’s own processing time. Velox also tracks some I/O-specific metrics separately (ioWaitTime, dataSourceReadTime) to help separate pure I/O from compute.

Scan-Specific Metrics

Vanilla Spark’s scan operators have scanTime and numFiles. Gluten goes much deeper:

Metric	Display Name	What It Measures
`skippedSplits` / `processedSplits`	number of skipped/processed splits	File split pruning effectiveness
`skippedStrides` / `processedStrides`	number of skipped/processed row groups	Row group/stripe pruning within files
`ioWaitTime`	io wait time	Time waiting for I/O operations
`storageReadBytes`	storage read bytes	Bytes read from remote storage
`localReadBytes`	Bytes read from local SSD cache
`ramReadBytes`	Bytes read from in-memory cache
`preloadSplits`	Pre-loaded splits (prefetching)
`dataSourceAddSplitTime`	Time managing split assignments
`dataSourceReadTime`	Time reading data from the source

The storageReadBytes / localReadBytes / ramReadBytes breakdown is particularly valuable for cloud environments. If you see most reads coming from storageReadBytes, your cache isn’t warm. If ioWaitTime dominates wallNanos, the bottleneck is network I/O, not CPU.

Spill Metrics

Vanilla Spark tracks spill at the stage level. Gluten tracks it per operator, per phase:

Metric	Display Name	What It Measures
`spilledBytes`	bytes written for spilling	Volume of data spilled to disk
`spilledRows`	total rows written for spilling	Number of rows spilled
`spilledPartitions`	total spilled partitions	Number of partitions involved in spill
`spilledFiles`	total spilled files	Number of spill files created

For join operators, spill is tracked separately for the build and probe phases (see next section), so you can pinpoint exactly which phase is under memory pressure.

Dynamic Filter Metrics

Dynamic filters (also called runtime filters) are generated by join operators to prune scan results at runtime. Vanilla Spark has no metrics for this. Gluten tracks the full lifecycle:

Metric	Display Name	What It Measures
`numDynamicFiltersProduced`	number of dynamic filters produced	Runtime filters generated by join build sides
`numDynamicFiltersAccepted`	number of dynamic filters accepted	Runtime filters applied to scan operators
`numReplacedWithDynamicFilterRows`	number of replaced with dynamic filter rows	Rows eliminated before reaching the join

If numDynamicFiltersProduced > 0 but numDynamicFiltersAccepted = 0, the filters were generated but not applied — a sign that the scan and join aren’t connected in the way the optimizer expected. If numReplacedWithDynamicFilterRows is a large number, runtime filters are saving significant work.

Join Phase Separation — 20+ Metrics per Join

This is arguably Gluten’s most powerful metric enhancement. Vanilla Spark’s join operators report a single buildTime and numOutputRows. Gluten splits every join into its constituent phases with separate metrics for each:

Build phase:

Metric	Display Name	What It Measures
`hashBuildInputRows`	number of hash build input rows	Rows consumed by the build side
`hashBuildOutputRows`	number of hash build output rows	Rows in the hash table
`hashBuildWallNanos`	time of hash build	Wall clock time for building
`hashBuildPeakMemoryBytes`	hash build peak memory bytes	Peak memory during build
`hashBuildSpilledBytes`	hash build spilled bytes	Data spilled during build
`hashBuildSpilledRows`	hash build spilled rows	Rows spilled during build
`hashBuildSpilledPartitions`	hash build spilled partitions	Partitions spilled during build
`hashBuildSpilledFiles`	hash build spilled files	Spill files created during build

Probe phase:

| Metric | What It Measures |

Metric	Display Name	What It Measures
`hashProbeInputRows`	number of hash probe input rows	Rows consumed by the probe side
`hashProbeOutputRows`	number of hash probe output rows	Rows output after probing
`hashProbeWallNanos`	time of hash probe	Wall clock time for probing
`hashProbePeakMemoryBytes`	hash probe peak memory bytes	Peak memory during probe
`hashProbeSpilledBytes`	hash probe spilled bytes	Data spilled during probe
`hashProbeSpilledRows`	hash probe spilled rows	Rows spilled during probe
`hashProbeSpilledPartitions`	hash probe spilled partitions	Partitions spilled during probe
`hashProbeSpilledFiles`	hash probe spilled files	Spill files created during probe

Pre/post projection:

Metric	Display Name	What It Measures
`streamPreProjectionWallNanos`	time of stream preProjection	Expression evaluation time on the stream (probe) side before join
`streamPreProjectionCpuCount`	stream preProject cpu wall time count	Batch count for stream pre-projection
`buildPreProjectionWallNanos`	time to build preProjection	Expression evaluation time on the build side before join
`buildPreProjectionCpuCount`	preProject cpu wall time count	Batch count for build pre-projection
`postProjectionWallNanos`	time of postProjection	Expression evaluation time after join
`postProjectionCpuCount`	postProject cpu wall time count	Batch count for post-projection

In vanilla Spark, a slow join gives you almost nothing to work with — you know it’s slow, but not why. With Gluten, you can immediately see: is the build phase slow (maybe the build side is too large)? Is the probe phase slow (maybe hash collisions are causing excessive probing)? Is the build phase spilling (memory pressure)? This level of detail changes how you diagnose join performance.

Write Metrics

Metric	Display Name	What It Measures
`physicalWrittenBytes`	number of written bytes	Actual bytes written to storage
`writeIOTime` / `writeIONanos`	time of write IO	I/O time during writes
`numWrittenFiles`	number of written files	Number of files produced

Reading Gluten Metrics in the Spark UI

Gluten metrics appear in the same Spark SQL tab because they use the same SQLMetric framework. The operator names change (e.g., HashAggregateExecTransformer instead of HashAggregateExec) but metrics appear in the same side panel when you click on an operator node.

What to Look For

Here are the key patterns to watch for when reading Gluten metrics:

Identify the bottleneck operator:

Look at wallNanos on each operator. In a healthy query, scan and join operators dominate. If a FilterExecTransformer or ProjectExecTransformer has high wallNanos, the filter or projection expression itself is expensive — consider simplifying it.

Diagnose slow joins:

Compare hashBuildWallNanos vs hashProbeWallNanos. If the build side dominates, the build input is too large — consider changing the join order or adding a filter to reduce the build side. If the probe side dominates, look at hashProbeInputRows — too many probe rows or hash collisions could be the cause.

Check native predicate pushdown:

If skippedSplits > 0, native file-level pruning is working. If skippedStrides > 0, row group or stripe-level pruning within files is working. If both are zero, your predicate isn’t being pushed down into the native scan — check if the column type supports pushdown.

Verify runtime filter effectiveness:

If numDynamicFiltersAccepted > 0, runtime filters from join build sides are being applied to scans. Check numReplacedWithDynamicFilterRows to see how many rows were eliminated — a large number means significant I/O savings.

Detect memory pressure in native engine:

If spilledBytes > 0 on any operator, the native engine is spilling to disk. For joins, check whether the build phase or probe phase is spilling. For aggregations, spill means the grouping cardinality is high. Consider increasing native memory allocation or reducing data volume.

I/O tier analysis:

Compare storageReadBytes, localReadBytes, and ramReadBytes on scan operators. In a well-cached environment, you want most reads from ramReadBytes or localReadBytes. High storageReadBytes means you’re reading from remote storage (S3, HDFS) — check if your caching layer is configured correctly.

Accessing via spark-history-cli

Gluten metrics are also available through the REST API and spark-history-cli, since they’re stored as standard SQLMetric values:

spark-history-cli --json -a <app> sql <id>  # includes Gluten metrics

The JSON output will contain all the Gluten-specific metrics alongside vanilla Spark metrics, using the same {name, value} format described in Part 3.

Architectural Implications

Gluten’s metrics system offers several insights about extending Spark’s observability:

Engine replacement provides comprehensive metrics naturally. When the engine controls every operator’s execution, it can measure every boundary. Each C++ operator is a separate function call with its own start and end timestamps — per-operator timing on every operator is achievable without any workarounds.

The MetricsUpdater pattern is reusable. Any native backend can adopt this pattern: define a tree of lightweight, serializable updater objects that mirror the plan, transfer bulk metric arrays via JNI, and walk the tree to update SQLMetric accumulators.

JNI array-based transfer minimizes overhead. Instead of calling back into the JVM for every metric update, Gluten batches all metrics into long[] arrays — one bulk JNI transfer per task. This keeps the metrics overhead negligible even with 60+ metrics per operator.

Backend-agnostic design through MetricsApi. The MetricsApi abstraction means the Velox backend and ClickHouse backend can define completely different metrics for the same operator type. Adding a new backend (say, DataFusion) would only require implementing the MetricsApi interface — no changes to the core bridging code.

In Part 1, we covered the five metric types and the complete reference. In Part 2, we traced the internal lifecycle and AQE’s use of shuffle statistics. In Part 3, we explored extension APIs, UI rendering, and the REST API. This bonus Part 4 examined how Apache Gluten extends the metrics system by bridging native engine metrics back to Spark’s framework.

How Gluten’s Native Engine Produces Metrics#

The 3-Layer Architecture#

Layer 1: Spark SQLMetric (unchanged)#

Layer 2: MetricsUpdater (Gluten’s bridge abstraction)#

Layer 3: Native metrics via JNI#

What Gluten Adds — 60+ Metrics#

Per-Operator Execution Metrics#

Understanding wallNanos and cpuCount#

Scan-Specific Metrics#

Spill Metrics#

Dynamic Filter Metrics#

Join Phase Separation — 20+ Metrics per Join#

Write Metrics#

Reading Gluten Metrics in the Spark UI#

What to Look For#

Accessing via spark-history-cli#

Architectural Implications#