This is a bonus Part 4 of the Spark SQL Metrics series:

How Gluten’s Native Engine Produces Metrics

Apache Gluten replaces the JVM execution engine with a native C++ engine — either Velox or ClickHouse. Because native operators execute independently (not fused by JVM codegen), each C++ operator is a separate function call with its own timing infrastructure. As a natural consequence, Gluten surfaces 60+ metrics per operator, including wall clock time, per-phase join metrics, native spill tracking, dynamic filter statistics, and I/O breakdowns by storage tier.

The 3-Layer Architecture

Gluten’s metrics system bridges two worlds: Spark’s JVM-based SQLMetric framework and the native C++ execution engine. The architecture has three layers:

Spark SQLMetric (JVM)     ←── MetricsUpdater (bridge)  ←── Velox/CH (C++)
Map[String, SQLMetric]        updateNativeMetrics()         long[] arrays via JNI

Layer 1: Spark SQLMetric (unchanged)

Each *ExecTransformer — Gluten’s replacement for vanilla Spark’s *Exec operators — overrides lazy val metrics using the same pattern as vanilla Spark. But instead of hardcoding the metric set, it delegates to the backend:

BackendsApiManager.getMetricsApiInstance
  .genFilterTransformerMetrics(sparkContext)

This means the Velox backend and the ClickHouse backend can define completely different metrics for the same logical operator. A FilterExecTransformer running on Velox might expose wallNanos and peakMemoryBytes, while the same operator running on ClickHouse could expose different internal counters. The metric definitions are backend-specific, but they all end up as standard SQLMetric objects that Spark’s UI and REST API can display.
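The backend-dispatch idea can be sketched as follows. This is a minimal illustration, not Gluten's actual code: `SQLMetric` here is a stand-in for Spark's accumulator-backed class, the `MetricsApi` shape is simplified, and the ClickHouse metric names are illustrative assumptions.

```scala
// Stand-in for Spark's SQLMetric (the real class is an AccumulatorV2
// tied to a SparkContext); display names follow the article's tables.
final case class SQLMetric(displayName: String, metricType: String, var value: Long = 0L)

// Simplified sketch of the backend-facing metrics API.
trait MetricsApi {
  def genFilterTransformerMetrics(): Map[String, SQLMetric]
}

// Each backend returns its own metric set for the same logical operator.
object VeloxMetricsApi extends MetricsApi {
  def genFilterTransformerMetrics(): Map[String, SQLMetric] = Map(
    "numOutputRows"   -> SQLMetric("number of output rows", "sum"),
    "wallNanos"       -> SQLMetric("time of filter", "nsTiming"),
    "peakMemoryBytes" -> SQLMetric("peak memory bytes", "size"))
}

object ClickHouseMetricsApi extends MetricsApi {
  // Illustrative names only; the real ClickHouse backend defines its own.
  def genFilterTransformerMetrics(): Map[String, SQLMetric] = Map(
    "numOutputRows" -> SQLMetric("number of output rows", "sum"),
    "totalTime"     -> SQLMetric("total time of filter", "timing"))
}
```

Both maps end up holding ordinary `SQLMetric` objects, which is why Spark's UI can render them without knowing which backend produced them.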

Layer 2: MetricsUpdater (Gluten’s bridge abstraction)

The MetricsUpdater trait is Gluten’s central bridging abstraction. It defines a single method:

trait MetricsUpdater extends Serializable {
  def updateNativeMetrics(opMetrics: IOperatorMetrics): Unit
}

Each operator has a corresponding MetricsUpdater implementation. These updaters are organized into a MetricsUpdaterTree that mirrors the plan DAG — one updater per operator, connected in the same parent-child structure as the physical plan.

Why a separate tree? Because the MetricsUpdaterTree is Serializable — it can be shipped to executors without serializing the full SparkPlan (which holds non-serializable objects like SparkContext). On the executor, after native execution completes, Gluten walks this tree over the native metrics and updates the SQLMetric accumulators.

Three special sentinel instances handle edge cases:

  • MetricsUpdater.None — operator has no metrics to update
  • MetricsUpdater.Todo — metrics support not yet implemented for this operator
  • MetricsUpdater.Terminate — the branch ends here (no children to recurse into)
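The tree-plus-sentinel structure can be sketched like this. It is a simplified illustration: the real `updateNativeMetrics` takes an `IOperatorMetrics` object rather than a single long, and the node and sentinel shapes here are assumptions.

```scala
// Simplified updater: the real trait consumes IOperatorMetrics.
trait MetricsUpdater extends Serializable {
  def updateNativeMetrics(wallNanos: Long): Unit = ()
}
object NoopUpdater      extends MetricsUpdater // stands in for MetricsUpdater.None
object TerminateUpdater extends MetricsUpdater // stands in for MetricsUpdater.Terminate

// Records what it receives so the walk is observable.
final class RecordingUpdater extends MetricsUpdater {
  var totalNanos = 0L
  override def updateNativeMetrics(wallNanos: Long): Unit = totalNanos += wallNanos
}

// One node per operator, mirroring the plan's parent-child structure.
final case class MetricsUpdaterTree(
    updater: MetricsUpdater,
    children: Seq[MetricsUpdaterTree] = Nil) extends Serializable

// Walk the tree in plan order, feeding each updater its operator's value;
// a Terminate sentinel stops recursion down that branch.
def walk(tree: MetricsUpdaterTree, values: Iterator[Long]): Unit =
  if (tree.updater ne TerminateUpdater) {
    tree.updater.updateNativeMetrics(values.next())
    tree.children.foreach(walk(_, values))
  }

val filterUpdater = new RecordingUpdater
val scanUpdater   = new RecordingUpdater
val tree = MetricsUpdaterTree(filterUpdater, Seq(MetricsUpdaterTree(scanUpdater)))
walk(tree, Iterator(5L, 7L)) // filter receives 5, scan receives 7
```

Because every node is a small serializable object, the whole tree can cross the driver-to-executor boundary cheaply.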

Here’s a concrete example — the FilterMetricsUpdater:

class FilterMetricsUpdater(val metrics: Map[String, SQLMetric]) extends MetricsUpdater {
  override def updateNativeMetrics(opMetrics: IOperatorMetrics): Unit = {
    val m = opMetrics.asInstanceOf[OperatorMetrics]
    metrics("numOutputRows") += m.outputRows
    metrics("outputVectors") += m.outputVectors
    metrics("outputBytes") += m.outputBytes
    metrics("cpuCount") += m.cpuCount
    metrics("wallNanos") += m.wallNanos
    metrics("peakMemoryBytes") += m.peakMemoryBytes
    metrics("numMemoryAllocations") += m.numMemoryAllocations
  }
}

Notice how each native metric field (e.g., m.wallNanos) maps directly to a SQLMetric key. The updater is the translation layer between native C++ naming and Spark’s metric namespace.

Layer 3: Native metrics via JNI

On the C++ side, the Velox engine collects metrics in arrays during execution — one entry per operator index. When a task completes, Gluten transfers these metrics across the JNI boundary as a Metrics object containing long[] arrays:

inputRows[]       — rows consumed by each operator
outputRows[]      — rows produced by each operator
wallNanos[]       — wall clock nanoseconds per operator
cpuCount[]        — getOutput() invocation (batch) count per operator, not CPU time (see below)
peakMemoryBytes[] — peak memory per operator
...               — 20+ more arrays

The MetricsUpdatingFunction walks the MetricsUpdaterTree, extracting per-operator values from the arrays by operator index. This is a bulk transfer — one JNI call per task, not per row — keeping overhead minimal.
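The index-based extraction can be sketched as follows. The container shape and the field subset are assumptions for illustration; the real Metrics object carries 20+ such arrays.

```scala
// Sketch of the bulk JNI payload: parallel long arrays, one slot per operator.
final case class NativeMetrics(
    inputRows: Array[Long],
    outputRows: Array[Long],
    wallNanos: Array[Long])

// Pull one operator's values out of the arrays by its index.
def operatorMetrics(m: NativeMetrics, i: Int): Map[String, Long] = Map(
  "inputRows"  -> m.inputRows(i),
  "outputRows" -> m.outputRows(i),
  "wallNanos"  -> m.wallNanos(i))

// Two operators: index 0 a scan, index 1 a filter (hypothetical values).
val taskMetrics = NativeMetrics(
  inputRows  = Array(1000L, 1000L),
  outputRows = Array(1000L, 420L),
  wallNanos  = Array(3_000_000L, 1_500_000L))

println(operatorMetrics(taskMetrics, 1)) // the filter's slice of the arrays
```

The entire `taskMetrics` object crosses JNI once per task; the per-operator slicing happens on the JVM side while walking the updater tree.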

What Gluten Adds — 60+ Metrics

Let’s look at the specific metrics Gluten introduces, organized by category.

Per-Operator Execution Metrics

In vanilla Spark, most operators report only numOutputRows. In Gluten, every operator gets these:

| Metric | Display Name | Type | What It Measures |
|---|---|---|---|
| wallNanos | time of {operator} | nsTiming | Wall clock time per operator |
| cpuCount | cpu wall time count | sum | Number of getOutput() invocations (batch count) |
| peakMemoryBytes | peak memory bytes | size | Peak memory usage |
| numMemoryAllocations | number of memory allocations | sum | Memory allocation count |
| outputRows | number of output rows | sum | Output row count |
| outputVectors | number of output vectors | sum | Output vector (batch) count |
| outputBytes | number of output bytes | size | Output data volume in columnar format |
| loadLazyVectorTime | time to load lazy vectors | timing | Time loading lazy-evaluated vectors |

Note: wallNanos uses an operator-specific display name — “time of filter”, “time of sort”, “time of scan and filter”, “time of project”, etc.

Having wallNanos on every operator makes it straightforward to identify bottleneck operators in native execution.

Understanding wallNanos and cpuCount

These two metrics deserve special attention because they are the most important for performance analysis.

Both originate from Velox’s CpuWallTiming structure, which is collected via RAII timers (DeltaCpuWallTimer) wrapping each operator’s getOutput() call:

struct CpuWallTiming {
  uint64_t count;      // Number of getOutput() invocations (batch count)
  uint64_t wallNanos;  // Total wall-clock time (steady_clock, nanoseconds)
  uint64_t cpuNanos;   // Total CPU time (CLOCK_THREAD_CPUTIME_ID, nanoseconds)
};

wallNanos — measured with std::chrono::steady_clock. Captures total real elapsed time, including any time the operator spends blocked waiting for its child to produce data, I/O waits, or thread scheduling delays.

cpuCount — despite the name, this is actually the invocation count (number of getOutput() calls = number of batches processed), not CPU time. The Gluten JNI bridge maps CpuWallTiming.count to the cpuCount metric.

How to interpret:

| Scenario | wallNanos | cpuCount | What It Means |
|---|---|---|---|
| Large data, even work | High | High | Many batches processed, expected |
| Few batches, each slow | High | Low | Possible skew or complex per-batch work |
| Leaf operator (scan) | High | | Mostly I/O time (check ioWaitTime separately) |
| Middle operator (filter) | High | | Includes wait for child — compare with child's wallNanos |

Important caveat — wallNanos includes child waiting:

Because wallNanos wraps the entire getOutput() call, a parent operator’s wallNanos includes time spent blocked waiting for its child to produce data. This means:

  • For a leaf operator (scan): wallNanos ≈ I/O + compute time
  • For a middle operator (filter above a scan): wallNanos = own compute + child’s scan time
  • You cannot simply sum wallNanos across all operators — that would double-count

To isolate an operator’s own contribution, compare its wallNanos with its child’s wallNanos. The difference is the operator’s own processing time. Velox also tracks some I/O-specific metrics separately (ioWaitTime, dataSourceReadTime) to help separate pure I/O from compute.
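The self-time calculation described above can be sketched in a few lines. The operator shape, names, and timings are hypothetical.

```scala
// A parent's wallNanos includes time blocked on its child, so its own
// contribution is the difference between the two.
final case class Op(name: String, wallNanos: Long, child: Option[Op] = None)

def selfTimeNanos(op: Op): Long =
  op.wallNanos - op.child.map(_.wallNanos).getOrElse(0L)

val scan   = Op("scan", 9_000_000_000L)                // ~9 s, mostly I/O
val filter = Op("filter", 10_000_000_000L, Some(scan)) // 10 s wall clock

// Of the filter's 10 s, 9 s was spent waiting on the scan:
println(selfTimeNanos(filter)) // 1000000000 (about 1 s of actual filtering)
```

This is also why summing raw `wallNanos` across a plan overstates total time: the scan's 9 seconds would be counted twice.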

Scan-Specific Metrics

Vanilla Spark’s scan operators have scanTime and numFiles. Gluten goes much deeper:

| Metric | Display Name | What It Measures |
|---|---|---|
| skippedSplits / processedSplits | number of skipped/processed splits | File split pruning effectiveness |
| skippedStrides / processedStrides | number of skipped/processed row groups | Row group/stripe pruning within files |
| ioWaitTime | io wait time | Time waiting for I/O operations |
| storageReadBytes | storage read bytes | Bytes read from remote storage |
| localReadBytes | | Bytes read from local SSD cache |
| ramReadBytes | | Bytes read from in-memory cache |
| preloadSplits | | Pre-loaded splits (prefetching) |
| dataSourceAddSplitTime | | Time managing split assignments |
| dataSourceReadTime | | Time reading data from the source |

The storageReadBytes / localReadBytes / ramReadBytes breakdown is particularly valuable for cloud environments. If you see most reads coming from storageReadBytes, your cache isn’t warm. If ioWaitTime dominates wallNanos, the bottleneck is network I/O, not CPU.
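A quick way to quantify this is a cache hit ratio over the three tier counters. A minimal sketch, assuming a hypothetical scan operator's metric map:

```scala
// Fraction of scan bytes served from local SSD or RAM rather than
// remote storage; 0.0 when nothing was read.
def cacheHitRatio(m: Map[String, Long]): Double = {
  val storage = m.getOrElse("storageReadBytes", 0L)
  val local   = m.getOrElse("localReadBytes", 0L)
  val ram     = m.getOrElse("ramReadBytes", 0L)
  val total   = storage + local + ram
  if (total == 0L) 0.0 else (local + ram).toDouble / total
}

val scanMetrics = Map(
  "storageReadBytes" -> 800_000_000L, // most bytes came from remote storage
  "localReadBytes"   -> 150_000_000L,
  "ramReadBytes"     -> 50_000_000L)

println(cacheHitRatio(scanMetrics)) // 0.2, i.e. a cold cache
```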

Spill Metrics

Vanilla Spark tracks spill at the stage level. Gluten tracks it per operator, per phase:

| Metric | Display Name | What It Measures |
|---|---|---|
| spilledBytes | bytes written for spilling | Volume of data spilled to disk |
| spilledRows | total rows written for spilling | Number of rows spilled |
| spilledPartitions | total spilled partitions | Number of partitions involved in spill |
| spilledFiles | total spilled files | Number of spill files created |

For join operators, spill is tracked separately for the build and probe phases (see next section), so you can pinpoint exactly which phase is under memory pressure.

Dynamic Filter Metrics

Dynamic filters (also called runtime filters) are generated by join operators to prune scan results at runtime. Vanilla Spark has no metrics for this. Gluten tracks the full lifecycle:

| Metric | Display Name | What It Measures |
|---|---|---|
| numDynamicFiltersProduced | number of dynamic filters produced | Runtime filters generated by join build sides |
| numDynamicFiltersAccepted | number of dynamic filters accepted | Runtime filters applied to scan operators |
| numReplacedWithDynamicFilterRows | number of replaced with dynamic filter rows | Rows eliminated before reaching the join |

If numDynamicFiltersProduced > 0 but numDynamicFiltersAccepted = 0, the filters were generated but not applied — a sign that the scan and join aren’t connected in the way the optimizer expected. If numReplacedWithDynamicFilterRows is a large number, runtime filters are saving significant work.
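The lifecycle check described above can be expressed as a tiny helper. This is an illustrative sketch over the three counters, not Gluten logic:

```scala
// Classify runtime filter effectiveness from the three lifecycle counters.
def dynamicFilterStatus(produced: Long, accepted: Long, replacedRows: Long): String =
  if (produced == 0L) "no runtime filters generated"
  else if (accepted == 0L) "filters generated but never applied to a scan"
  else s"$accepted filter(s) applied, $replacedRows rows pruned"

println(dynamicFilterStatus(produced = 2L, accepted = 0L, replacedRows = 0L))
```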

Join Phase Separation — 20+ Metrics per Join

This is arguably Gluten’s most powerful metric enhancement. Vanilla Spark’s join operators report a single buildTime and numOutputRows. Gluten splits every join into its constituent phases with separate metrics for each:

Build phase:

| Metric | Display Name | What It Measures |
|---|---|---|
| hashBuildInputRows | number of hash build input rows | Rows consumed by the build side |
| hashBuildOutputRows | number of hash build output rows | Rows in the hash table |
| hashBuildWallNanos | time of hash build | Wall clock time for building |
| hashBuildPeakMemoryBytes | hash build peak memory bytes | Peak memory during build |
| hashBuildSpilledBytes | hash build spilled bytes | Data spilled during build |
| hashBuildSpilledRows | hash build spilled rows | Rows spilled during build |
| hashBuildSpilledPartitions | hash build spilled partitions | Partitions spilled during build |
| hashBuildSpilledFiles | hash build spilled files | Spill files created during build |

Probe phase:

| Metric | Display Name | What It Measures |
|---|---|---|
| hashProbeInputRows | number of hash probe input rows | Rows consumed by the probe side |
| hashProbeOutputRows | number of hash probe output rows | Rows output after probing |
| hashProbeWallNanos | time of hash probe | Wall clock time for probing |
| hashProbePeakMemoryBytes | hash probe peak memory bytes | Peak memory during probe |
| hashProbeSpilledBytes | hash probe spilled bytes | Data spilled during probe |
| hashProbeSpilledRows | hash probe spilled rows | Rows spilled during probe |
| hashProbeSpilledPartitions | hash probe spilled partitions | Partitions spilled during probe |
| hashProbeSpilledFiles | hash probe spilled files | Spill files created during probe |

Pre/post projection:

| Metric | Display Name | What It Measures |
|---|---|---|
| streamPreProjectionWallNanos | time of stream preProjection | Expression evaluation time on the stream (probe) side before join |
| streamPreProjectionCpuCount | stream preProject cpu wall time count | Batch count for stream pre-projection |
| buildPreProjectionWallNanos | time to build preProjection | Expression evaluation time on the build side before join |
| buildPreProjectionCpuCount | preProject cpu wall time count | Batch count for build pre-projection |
| postProjectionWallNanos | time of postProjection | Expression evaluation time after join |
| postProjectionCpuCount | postProject cpu wall time count | Batch count for post-projection |

In vanilla Spark, a slow join gives you almost nothing to work with — you know it’s slow, but not why. With Gluten, you can immediately see: is the build phase slow (maybe the build side is too large)? Is the probe phase slow (maybe hash collisions are causing excessive probing)? Is the build phase spilling (memory pressure)? This level of detail changes how you diagnose join performance.
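That diagnostic flow can be sketched as a simple triage function. The 2x threshold and the metric map are illustrative assumptions, not Gluten logic:

```scala
// Triage a hash join from its per-phase metrics: spill first, then
// whichever phase dominates the wall clock.
def diagnoseJoin(m: Map[String, Long]): String = {
  val build      = m.getOrElse("hashBuildWallNanos", 0L)
  val probe      = m.getOrElse("hashProbeWallNanos", 0L)
  val buildSpill = m.getOrElse("hashBuildSpilledBytes", 0L)
  val probeSpill = m.getOrElse("hashProbeSpilledBytes", 0L)
  if (buildSpill > 0L) "build phase spilling: memory pressure on the build side"
  else if (probeSpill > 0L) "probe phase spilling: memory pressure on the probe side"
  else if (build > 2L * probe) "build-dominated: build input may be too large"
  else if (probe > 2L * build) "probe-dominated: check hashProbeInputRows"
  else "balanced build/probe phases"
}

println(diagnoseJoin(Map(
  "hashBuildWallNanos" -> 8_000_000_000L,
  "hashProbeWallNanos" -> 1_000_000_000L)))
```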

Write Metrics

| Metric | Display Name | What It Measures |
|---|---|---|
| physicalWrittenBytes | number of written bytes | Actual bytes written to storage |
| writeIOTime / writeIONanos | time of write IO | I/O time during writes |
| numWrittenFiles | number of written files | Number of files produced |

Reading Gluten Metrics in the Spark UI

Gluten metrics appear in the same Spark SQL tab because they use the same SQLMetric framework. The operator names change (e.g., HashAggregateExecTransformer instead of HashAggregateExec) but metrics appear in the same side panel when you click on an operator node.

What to Look For

Here are the key patterns to watch for when reading Gluten metrics:

Identify the bottleneck operator:

Look at wallNanos on each operator. In a healthy query, scan and join operators dominate. If a FilterExecTransformer or ProjectExecTransformer has high wallNanos, the filter or projection expression itself is expensive — consider simplifying it.

Diagnose slow joins:

Compare hashBuildWallNanos vs hashProbeWallNanos. If the build side dominates, the build input is too large — consider changing the join order or adding a filter to reduce the build side. If the probe side dominates, look at hashProbeInputRows — too many probe rows or hash collisions could be the cause.

Check native predicate pushdown:

If skippedSplits > 0, native file-level pruning is working. If skippedStrides > 0, row group or stripe-level pruning within files is working. If both are zero, your predicate isn’t being pushed down into the native scan — check if the column type supports pushdown.

Verify runtime filter effectiveness:

If numDynamicFiltersAccepted > 0, runtime filters from join build sides are being applied to scans. Check numReplacedWithDynamicFilterRows to see how many rows were eliminated — a large number means significant I/O savings.

Detect memory pressure in native engine:

If spilledBytes > 0 on any operator, the native engine is spilling to disk. For joins, check whether the build phase or probe phase is spilling. For aggregations, spill means the grouping cardinality is high. Consider increasing native memory allocation or reducing data volume.

I/O tier analysis:

Compare storageReadBytes, localReadBytes, and ramReadBytes on scan operators. In a well-cached environment, you want most reads from ramReadBytes or localReadBytes. High storageReadBytes means you’re reading from remote storage (S3, HDFS) — check if your caching layer is configured correctly.

Accessing via spark-history-cli

Gluten metrics are also available through the REST API and spark-history-cli, since they’re stored as standard SQLMetric values:

spark-history-cli --json -a <app> sql <id>  # includes Gluten metrics

The JSON output will contain all the Gluten-specific metrics alongside vanilla Spark metrics, using the same {name, value} format described in Part 3.

Architectural Implications

Gluten’s metrics system offers several insights about extending Spark’s observability:

Engine replacement provides comprehensive metrics naturally. When the engine controls every operator’s execution, it can measure every boundary. Each C++ operator is a separate function call with its own start and end timestamps — per-operator timing on every operator is achievable without any workarounds.

The MetricsUpdater pattern is reusable. Any native backend can adopt this pattern: define a tree of lightweight, serializable updater objects that mirror the plan, transfer bulk metric arrays via JNI, and walk the tree to update SQLMetric accumulators.

JNI array-based transfer minimizes overhead. Instead of calling back into the JVM for every metric update, Gluten batches all metrics into long[] arrays — one bulk JNI transfer per task. This keeps the metrics overhead negligible even with 60+ metrics per operator.

Backend-agnostic design through MetricsApi. The MetricsApi abstraction means the Velox backend and ClickHouse backend can define completely different metrics for the same operator type. Adding a new backend (say, DataFusion) would only require implementing the MetricsApi interface — no changes to the core bridging code.


In Part 1, we covered the five metric types and the complete reference. In Part 2, we traced the internal lifecycle and AQE’s use of shuffle statistics. In Part 3, we explored extension APIs, UI rendering, and the REST API. This bonus Part 4 examined how Apache Gluten extends the metrics system by bridging native engine metrics back to Spark’s framework.