[VL] Add RDDScanExec support to Velox backend #12077

Open
minni31 wants to merge 1 commit into apache:main from minni31:oss/velox-rdd-scan-support

Conversation

@minni31 minni31 commented May 12, 2026

CONTEXT

RDDScanExec is Spark's physical plan node used when creating DataFrames from in-memory RDDs
via sparkSession.createDataset(rdd) or createDataFrame(rdd, schema). Currently, when Gluten
encounters this node with the Velox backend, it falls back to vanilla Spark's row-based execution.
The ClickHouse backend already supports this via CHRDDScanTransformer.
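
For illustration only (not part of this PR), this is the kind of query that produces the node:

// Illustrative snippet: creating a DataFrame from an in-memory RDD puts an
// RDDScanExec ("Scan ExistingRDD") in the physical plan, which Gluten + Velox
// currently executes via vanilla Spark's row-based path.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val rdd = spark.sparkContext.parallelize(Seq(Row(1, "a"), Row(2, "b")))
val schema = StructType(Seq(StructField("id", IntegerType), StructField("name", StringType)))
val df = spark.createDataFrame(rdd, schema)
df.queryExecution.executedPlan  // contains an RDDScanExec node for the in-memory rows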

WHAT

This PR adds native Velox execution support for RDDScanExec by implementing a
VeloxRDDScanTransformer. The transformer converts RDD[InternalRow] into Velox columnar
batches using the existing RowToVeloxColumnarExec JNI infrastructure, so no new native code
is needed.
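
A rough sketch of the transformer's shape, based on this description and the diff context quoted in the review threads below. The constructor parameters match the patch, but the VeloxValidatorApi.validateSchema signature is an assumption, Gluten-internal imports are omitted, and metrics handling is left out here (the metrics review thread below shows the 7-param overload the final version uses):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Attribute, SortOrder}
import org.apache.spark.sql.catalyst.plans.physical.Partitioning
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.vectorized.ColumnarBatch
// Gluten-internal imports (RDDScanTransformer, ValidationResult, VeloxValidatorApi,
// GlutenConfig, VeloxConfig, RowToVeloxColumnarExec) omitted; package paths vary.

case class VeloxRDDScanTransformer(
    outputAttributes: Seq[Attribute],
    rdd: RDD[InternalRow],
    name: String,
    override val outputPartitioning: Partitioning,
    override val outputOrdering: Seq[SortOrder])
  extends RDDScanTransformer(outputAttributes, outputPartitioning, outputOrdering) {

  // Recursive schema validation; unsupported types fall back to vanilla Spark.
  // The helper name comes from the PR description; its exact signature is assumed.
  override protected def doValidateInternal(): ValidationResult =
    VeloxValidatorApi.validateSchema(schema)

  // Row-to-columnar conversion delegates to the existing Velox JNI path
  // (SQLMetrics omitted in this sketch; see the metrics review thread below).
  override def doExecuteColumnar(): RDD[ColumnarBatch] = {
    val localSchema = schema
    val batchSize = GlutenConfig.get.maxBatchSize
    val batchBytes = VeloxConfig.get.veloxPreferredBatchBytes
    rdd.mapPartitions { iter =>
      RowToVeloxColumnarExec.toColumnarBatchIterator(iter, localSchema, batchSize, batchBytes)
    }
  }

  // Leaf node: return a structurally new copy, consistent with CHRDDScanTransformer.
  override protected def withNewChildrenInternal(newChildren: IndexedSeq[SparkPlan]): SparkPlan =
    copy(outputAttributes, rdd, name, outputPartitioning, outputOrdering)
}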

Key design decisions:

  • Reuses existing infrastructure: The transformer delegates to RowToVeloxColumnarExec
    for the actual row-to-columnar conversion, keeping the implementation lean and consistent
    with how Velox already handles row-based input.
  • Schema validation: Delegates to VeloxValidatorApi.validateSchema for recursive type
    validation. Supports all Velox-compatible types, including complex types (Array, Map, Struct),
    which are handled via the UnsafeRowFast::deserialize path in the native converter. Rejects
    only truly unsupported types, with a clean fallback to vanilla Spark.
  • Leaf node correctness: withNewChildrenInternal returns copy(...) consistent with
    CHRDDScanTransformer.
  • Follows existing patterns: Mirrors the structure of CHRDDScanTransformer in the
    ClickHouse backend.

Changes

  • VeloxRDDScanTransformer.scala (new) — Columnar execution node wrapping
    RowToVeloxColumnarExec for native row-to-columnar conversion.
  • VeloxSparkPlanExecApi.scala (modified) — Overrides isSupportRDDScanExec and
    getRDDScanTransform to wire up the new transformer (see the sketch after this list).
  • VeloxRDDScanSuite.scala (new) — 7 unit tests covering plan replacement, type coverage,
    aggregation, empty RDD, null values, idempotent reads, and all primitive types.
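
A hypothetical sketch of that wiring: the method names come from the list above, while the parameter and return types are assumptions and may differ from the actual SparkPlanExecApi signatures.

// Sketch of the VeloxSparkPlanExecApi changes (names from the description above;
// signatures assumed). RDDScanExec's own fields supply everything the
// transformer constructor needs.
override def isSupportRDDScanExec(plan: RDDScanExec): Boolean = true

override def getRDDScanTransform(plan: RDDScanExec): RDDScanTransformer =
  VeloxRDDScanTransformer(
    plan.output, plan.rdd, plan.name, plan.outputPartitioning, plan.outputOrdering)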

Test Results

All 7 unit tests passed on the internal CI pipeline (build 218528457):

  • basic RDDScanExec is replaced by VeloxRDDScanTransformer
  • RDDScan with string and numeric types
  • RDDScan with aggregation downstream
  • RDDScan with empty RDD
  • RDDScan preserves data correctness with multiple re-reads
  • RDDScan with null values
  • RDDScan with all supported primitive types

@github-actions github-actions Bot added the VELOX label May 12, 2026
@minni31 minni31 force-pushed the oss/velox-rdd-scan-support branch 2 times, most recently from 0546b9d to b70bf18 on May 12, 2026 at 10:49
@minni31 minni31 changed the title from "[GLUTEN-8629] Add RDDScanExec support to Velox backend" to "[VL] Add RDDScanExec support to Velox backend" on May 12, 2026
override val outputPartitioning: Partitioning,
override val outputOrdering: Seq[SortOrder]
) extends RDDScanTransformer(outputAttributes, outputPartitioning, outputOrdering) {

Contributor

PR description contradicts validation logic for complex types

Problem: The PR description states "rejects complex types (ARRAY, MAP, STRUCT)" but doValidateInternal() explicitly accepts these types. The code is correct — Velox does support complex types via UnsafeRowFast::deserialize. The PR description should be updated to avoid misleading reviewers.

Evidence:

case _: org.apache.spark.sql.types.ArrayType =>
case _: org.apache.spark.sql.types.MapType =>
case _: org.apache.spark.sql.types.StructType =>

These cases fall through to ValidationResult.succeeded, meaning complex types are accepted.

Suggested Fix: Update the PR description to remove the claim that complex types are rejected, e.g.:

Supports all Velox-compatible types including complex types (Array, Map, Struct). Rejects only truly unsupported types (e.g., CalendarIntervalType) with clean fallback to vanilla Spark.

Author

Good catch — updated the PR description. It now correctly states that complex types (Array, Map, Struct) are supported via the UnsafeRowFast::deserialize path, and only truly unsupported types trigger fallback.

rdd: RDD[InternalRow],
name: String,
override val outputPartitioning: Partitioning,
override val outputOrdering: Seq[SortOrder]
Contributor

Validation does not recurse into complex type element types

Problem: The type allowlist checks top-level types only. An ArrayType(UnsupportedType) or MapType(StringType, UnsupportedType) would pass validation but could fail at native execution time. The CH backend avoids this by delegating to ConverterUtils.getTypeNode() which recursively validates.

Evidence:

case _: org.apache.spark.sql.types.ArrayType =>   // passes any ArrayType, no element check
case _: org.apache.spark.sql.types.MapType =>      // passes any MapType, no key/value check
case _: org.apache.spark.sql.types.StructType =>   // passes any StructType, no field check

Suggested Fix:

case a: org.apache.spark.sql.types.ArrayType =>
  validateType(a.elementType)
case m: org.apache.spark.sql.types.MapType =>
  validateType(m.keyType)
  validateType(m.valueType)
case s: org.apache.spark.sql.types.StructType =>
  s.fields.foreach(f => validateType(f.dataType))

Alternatively, delegate to VeloxValidatorApi for centralized type validation.

Author

Thanks, this is a great point. Replaced the manual allowlist with VeloxValidatorApi.validateSchema, which handles recursive validation for complex type elements and also catches variant shredded structs. This keeps validation logic centralized.

case org.apache.spark.sql.types.YearMonthIntervalType.DEFAULT =>
case _: org.apache.spark.sql.types.NullType =>
case dt
if !VeloxConfig.get.enableTimestampNtzValidation &&
Contributor

withNewChildrenInternal returns this instead of copy()

Problem: Returning this from a case class's withNewChildrenInternal breaks Spark's convention that tree transformations produce structurally new nodes. The CH backend returns copy(...) for the equivalent transformer. While this is functionally safe for a leaf node today, it's inconsistent with the project pattern.

Evidence:

// Velox (this PR):
override protected def withNewChildrenInternal(newChildren: IndexedSeq[SparkPlan]): SparkPlan =
  this

// CH backend (CHRDDScanTransformer):
override protected def withNewChildrenInternal(newChildren: IndexedSeq[SparkPlan]): SparkPlan =
  copy(outputAttributes, rdd, name, outputPartitioning, outputOrdering)

Suggested Fix:

override protected def withNewChildrenInternal(newChildren: IndexedSeq[SparkPlan]): SparkPlan =
  copy()

Author

Fixed — now uses copy(outputAttributes, rdd, name, outputPartitioning, outputOrdering), consistent with CHRDDScanTransformer.

case _: org.apache.spark.sql.types.StringType =>
case _: org.apache.spark.sql.types.TimestampType =>
case _: org.apache.spark.sql.types.DateType =>
case _: org.apache.spark.sql.types.BinaryType =>
Contributor

No SQLMetrics propagation — Spark UI won't show conversion metrics

Problem: The 4-param overload of toColumnarBatchIterator creates throwaway SQLMetric instances not attached to this plan's metrics map. As a result, numInputRows, numOutputBatches, and convertTime won't appear in the Spark UI for this operator, making production debugging harder.

Evidence:

// Current (4-param overload creates throwaway metrics):
RowToVeloxColumnarExec.toColumnarBatchIterator(iter, localSchema, batchSize, batchBytes)

Suggested Fix: Define plan-level metrics and use the 7-param overload:

override lazy val metrics = Map(
  "numInputRows" -> SQLMetrics.createMetric(sparkContext, "number of input rows"),
  "numOutputBatches" -> SQLMetrics.createMetric(sparkContext, "number of output batches"),
  "convertTime" -> SQLMetrics.createTimingMetric(sparkContext, "time to convert"))

override def doExecuteColumnar(): RDD[ColumnarBatch] = {
  val numInputRows = longMetric("numInputRows")
  val numOutputBatches = longMetric("numOutputBatches")
  val convertTime = longMetric("convertTime")
  val localSchema = this.schema
  val batchSize = GlutenConfig.get.maxBatchSize
  val batchBytes = VeloxConfig.get.veloxPreferredBatchBytes
  rdd.mapPartitions { iter =>
    RowToVeloxColumnarExec.toColumnarBatchIterator(
      iter, localSchema, numInputRows, numOutputBatches, convertTime, batchSize, batchBytes)
  }
}

Author

Added numInputRows, numOutputBatches, and convertTime as plan-level metrics and switched to the 7-param toColumnarBatchIterator overload. These will now show up in the Spark UI.


checkAnswer(df, expectedAnswer)
}
}
Contributor

Missing test coverage for complex types and unsupported-type fallback

Problem: The 7 tests cover primitives, nulls, empty RDD, and aggregation — but two important scenarios are untested:

  1. Complex types (ArrayType, MapType, StructType) — validation explicitly accepts them, but no test exercises the full row-to-columnar JNI path with nested data.
  2. Unsupported type fallback — no test verifies that a truly unsupported type (e.g., CalendarIntervalType) triggers graceful fallback to vanilla Spark instead of a runtime crash.

Suggested Fix: Add at least these two tests:

test("RDDScan with array type") {
  val rdd = spark.sparkContext.parallelize(Seq(Row(Seq(1, 2, 3)), Row(Seq(4, 5))))
  val schema = StructType(Seq(StructField("arr", ArrayType(IntegerType))))
  val data = spark.createDataFrame(rdd, schema)
  val expectedAnswer = data.collect()
  val node = LogicalRDD.fromDataset(
    rdd = data.queryExecution.toRdd, originDataset = data, isStreaming = false)
  val df = ClassicDataset.ofRows(spark, node).toDF()
  checkAnswer(df, expectedAnswer)
}

test("RDDScan falls back for unsupported types") {
  // Create RDD with CalendarIntervalType or another unsupported type
  // Verify plan does NOT contain VeloxRDDScanTransformer (i.e., fallback occurred)
}

Author

Added 4 new tests: array type, map type, struct type, and unsupported-type fallback (DayTimeIntervalType → verifies VeloxRDDScanTransformer is absent from plan). Total coverage is now 11 tests.
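
For reference, the fallback check could look roughly like this. This is a hypothetical sketch, not the actual test from VeloxRDDScanSuite: the DayTimeIntervalType construction and the plan inspection (e.g. how AQE is handled) may differ from how the suite's other tests do it.

test("RDDScan falls back for unsupported types") {
  import java.time.Duration
  // DayTimeIntervalType fails validation, so the plan should keep the vanilla
  // RDDScanExec instead of VeloxRDDScanTransformer.
  val schema = StructType(Seq(StructField("d", DayTimeIntervalType())))
  val rdd = spark.sparkContext.parallelize(Seq(Row(Duration.ofHours(1)), Row(Duration.ofHours(2))))
  val data = spark.createDataFrame(rdd, schema)
  val replaced = data.queryExecution.executedPlan.collect {
    case p if p.getClass.getSimpleName == "VeloxRDDScanTransformer" => p
  }
  assert(replaced.isEmpty, "expected fallback to vanilla Spark for DayTimeIntervalType")
  assert(data.count() == 2)
}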

@minni31 minni31 force-pushed the oss/velox-rdd-scan-support branch from 3c6f648 to d816859 on May 12, 2026 at 14:48
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>