[GLUTEN-8629][VL] Add RDDScanExec support to Velox backend #12077
@@ -0,0 +1,119 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.gluten.execution

import org.apache.gluten.backendsapi.velox.VeloxValidatorApi
import org.apache.gluten.config.{GlutenConfig, VeloxConfig}

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Attribute, SortOrder}
import org.apache.spark.sql.catalyst.plans.physical.Partitioning
import org.apache.spark.sql.execution.{RDDScanTransformer, SparkPlan}
import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}
import org.apache.spark.sql.vectorized.ColumnarBatch

/**
 * Velox-backend implementation of RDDScanTransformer.
 *
 * Converts an RDD[InternalRow] into columnar batches using Velox's native row-to-columnar
 * conversion (same JNI path as RowToVeloxColumnarExec).
 */
case class VeloxRDDScanTransformer(
    outputAttributes: Seq[Attribute],
    rdd: RDD[InternalRow],
    name: String,
    // Row-to-columnar conversion preserves data distribution, so we carry through
    // the original partitioning. This differs from CH which uses UnknownPartitioning(0)
    // but is consistent with RowToVeloxColumnarExec's behavior.
    override val outputPartitioning: Partitioning,
    override val outputOrdering: Seq[SortOrder]
) extends RDDScanTransformer(outputAttributes, outputPartitioning, outputOrdering) {
Contributor

PR description contradicts validation logic for complex types

Problem: The PR description states "rejects complex types (ARRAY, MAP, STRUCT)", but the validation logic accepts them.

Evidence:

    case _: org.apache.spark.sql.types.ArrayType =>
    case _: org.apache.spark.sql.types.MapType =>
    case _: org.apache.spark.sql.types.StructType =>

These cases fall through to the supported path rather than triggering a fallback.

Suggested Fix: Update the PR description to remove the claim that complex types are rejected.

Author

Good catch — updated the PR description. It now correctly states that complex types (Array, Map, Struct) are supported via the UnsafeRowFast::deserialize path, and only truly unsupported types trigger fallback.
  @transient override lazy val metrics: Map[String, SQLMetric] = Map(
    "numInputRows" -> SQLMetrics.createMetric(sparkContext, "number of input rows"),
    "numOutputBatches" -> SQLMetrics.createMetric(sparkContext, "number of output batches"),
    "convertTime" -> SQLMetrics.createTimingMetric(sparkContext, "time to convert")
  )

  override protected def doValidateInternal(): ValidationResult = {
    for (field <- schema.fields) {
      val reason = VeloxValidatorApi.validateSchema(field.dataType)
      if (reason.isDefined) {
        return ValidationResult.failed(reason.get)
      }
    }
    ValidationResult.succeeded
Contributor

Metrics gap in BatchCarrierRow unwrap path

Problem: When the RDD contains BatchCarrierRow rows, numInputRows and numOutputBatches are never updated, so the Spark UI shows no metrics for this path.

Evidence:

    case _: BatchCarrierRow =>
      // No metrics updated here
      (Iterator.single(first) ++ iter).flatMap(row => BatchCarrierRow.unwrap(row))

Suggested Fix:

    case _: BatchCarrierRow =>
      (Iterator.single(first) ++ iter).flatMap { row =>
        BatchCarrierRow.unwrap(row).map { batch =>
          numOutputBatches += 1
          numInputRows += batch.numRows()
          batch
        }
      }

Author

Updated the BatchCarrierRow unwrap path to increment numOutputBatches and numInputRows per batch, so the Spark UI now shows correct metrics for checkpointed data. convertTime is intentionally omitted since no row-to-columnar conversion happens in this path.
  }

  override def doExecuteColumnar(): RDD[ColumnarBatch] = {
Member

Author

Great catch — this is a real bug. If the upstream RDD was produced by a Gluten plan ending in VeloxColumnarToCarrierRowExec (e.g., via df.checkpoint()), the rows would be BatchCarrierRow instances and UnsafeProjection.apply() would throw. Fixed by peeking at the first row and branching: carrier rows are unwrapped directly via BatchCarrierRow.unwrap(), skipping row-to-columnar conversion entirely. This mirrors the CH pattern.
    val numInputRows = longMetric("numInputRows")
    val numOutputBatches = longMetric("numOutputBatches")
    val convertTime = longMetric("convertTime")
    val localSchema = this.schema
    val batchSize = GlutenConfig.get.maxBatchSize
    val batchBytes = VeloxConfig.get.veloxPreferredBatchBytes
    rdd.mapPartitions {
      iter =>
        if (iter.hasNext) {
          val first = iter.next()
          first match {
            case _: BatchCarrierRow =>
              // RDD already contains columnar batches wrapped as carrier rows
              // (e.g., from df.checkpoint() on a Gluten plan). Unwrap directly.
              (Iterator.single(first) ++ iter).flatMap {
                row =>
                  BatchCarrierRow.unwrap(row).map {
                    batch =>
                      numOutputBatches += 1
                      numInputRows += batch.numRows()
                      batch
                  }
              }
            case _ =>
              // Standard InternalRow path - convert via native row-to-columnar.
              RowToVeloxColumnarExec.toColumnarBatchIterator(
                Iterator.single(first) ++ iter,
                localSchema,
                numInputRows,
                numOutputBatches,
                convertTime,
                batchSize,
                batchBytes)
          }
        } else {
          Iterator.empty
        }
    }
  }

  override protected def withNewChildrenInternal(
      newChildren: IndexedSeq[SparkPlan]): SparkPlan = {
    assert(newChildren.isEmpty, "VeloxRDDScanTransformer is a leaf node")
    copy(outputAttributes, rdd, name, outputPartitioning, outputOrdering)
  }
}

object VeloxRDDScanTransformer {
  def replace(plan: org.apache.spark.sql.execution.RDDScanExec): RDDScanTransformer =
Member

CH uses UnknownPartitioning(0) here.

Author

Valid concern. Row-to-columnar conversion doesn't change data distribution — it converts row format within each partition, preserving the partition layout. This is consistent with RowToVeloxColumnarExec, which also carries through the child's outputPartitioning. Added an inline comment explaining the rationale and the difference from CH's approach.
| VeloxRDDScanTransformer( | ||
| plan.output, | ||
| plan.inputRDD, | ||
| plan.nodeName, | ||
| plan.outputPartitioning, | ||
| plan.outputOrdering) | ||
| } | ||
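To ground the partitioning discussion above: the following is a minimal standalone sketch, not part of the PR, assuming a local SparkSession and using a plain map as a stand-in for the actual row-to-columnar conversion. It illustrates why a mapPartitions-based conversion can carry through the child's partitioning: the partition count and per-partition contents are untouched.

    import org.apache.spark.sql.SparkSession

    // Hypothetical local session, only for illustration.
    val spark = SparkSession.builder().master("local[2]").appName("partition-check").getOrCreate()

    val rows = spark.sparkContext.parallelize(1 to 100, numSlices = 4)
    // Stand-in for the row-to-columnar conversion: each partition is transformed
    // in place; no data moves between partitions.
    val converted = rows.mapPartitions(iter => iter.map(_.toString))

    // The partition layout is preserved, which is why outputPartitioning can be carried through.
    assert(converted.getNumPartitions == rows.getNumPartitions)
    spark.stop()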
@@ -0,0 +1,254 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.spark.sql.execution

import org.apache.gluten.execution._

import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.classic.ClassicDataset
import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper
import org.apache.spark.sql.types._
import org.apache.spark.util.Utils

class VeloxRDDScanSuite extends VeloxWholeStageTransformerSuite with AdaptiveSparkPlanHelper {

  override protected val resourcePath: String = "/tpch-data-parquet"
  override protected val fileFormat: String = "parquet"

  override protected def sparkConf: SparkConf = {
    super.sparkConf
      .set("spark.sql.ansi.enabled", "false")
  }

  override def beforeAll(): Unit = {
    super.beforeAll()
    createTPCHNotNullTables()
  }

  /** Creates a DataFrame backed by LogicalRDD/RDDScanExec from an existing DataFrame. */
  private def asRDDScanDF(data: DataFrame): DataFrame = {
    val node = LogicalRDD(
      data.queryExecution.logical.output,
      data.queryExecution.toRdd)(data.sparkSession)
    ClassicDataset.ofRows(spark, node).toDF()
  }

  test("basic RDDScanExec is replaced by VeloxRDDScanTransformer") {
    val data = spark.sql("SELECT l_orderkey, l_partkey FROM lineitem LIMIT 10")
    val expectedAnswer = data.collect()
    val df = asRDDScanDF(data)

    checkAnswer(df, expectedAnswer)
    val cnt = collect(df.queryExecution.executedPlan) { case _: VeloxRDDScanTransformer => true }
    assert(cnt.nonEmpty, "Expected VeloxRDDScanTransformer in plan")
  }

  test("RDDScan with string and numeric types") {
    val data = spark.sql("""SELECT l_returnflag, l_linestatus, l_quantity, l_extendedprice
                           |FROM lineitem LIMIT 20""".stripMargin)
    val expectedAnswer = data.collect()
    val df = asRDDScanDF(data)

    checkAnswer(df, expectedAnswer)
    val cnt = collect(df.queryExecution.executedPlan) { case _: VeloxRDDScanTransformer => true }
    assert(cnt.nonEmpty, "Expected VeloxRDDScanTransformer in plan")
  }

  test("RDDScan with aggregation downstream") {
    val query =
      """SELECT l_returnflag, sum(l_quantity) AS sum_qty
        |FROM lineitem
        |WHERE l_shipdate <= date'1998-09-02'
        |GROUP BY l_returnflag""".stripMargin
    val data = spark.sql(query)
    val expectedAnswer = data.collect()
    val df = asRDDScanDF(data)

    checkAnswer(df, expectedAnswer)
Member

This test, and the following ones, do not assert that VeloxRDDScanTransformer is actually present in the plan, so they would still pass if the scan fell back.

Author

You're right — tests 3–11 would silently pass even if offloading stopped working. Added collect { case _: VeloxRDDScanTransformer => true } assertions to all 8 tests that were missing them. The unsupported-type fallback test already asserts the absence of the transformer, so that one was fine as-is.
    val cnt = collect(df.queryExecution.executedPlan) { case _: VeloxRDDScanTransformer => true }
    assert(cnt.nonEmpty, "Expected VeloxRDDScanTransformer in plan")
  }

  test("RDDScan with empty RDD") {
    val data = spark.sql("SELECT l_orderkey FROM lineitem WHERE 1 = 0")
    val expectedAnswer = data.collect()
    val df = asRDDScanDF(data)

    checkAnswer(df, expectedAnswer)
    assert(df.count() == 0)
    val cnt = collect(df.queryExecution.executedPlan) { case _: VeloxRDDScanTransformer => true }
    assert(cnt.nonEmpty, "Expected VeloxRDDScanTransformer in plan")
  }

  test("RDDScan preserves data correctness with multiple re-reads") {
    val data = spark.sql("SELECT l_orderkey, l_partkey FROM lineitem LIMIT 50")
    val expectedAnswer = data.collect()
    val df = asRDDScanDF(data)

    // Read twice to verify idempotency
    checkAnswer(df, expectedAnswer)
    checkAnswer(df, expectedAnswer)
    val cnt = collect(df.queryExecution.executedPlan) { case _: VeloxRDDScanTransformer => true }
    assert(cnt.nonEmpty, "Expected VeloxRDDScanTransformer in plan")
  }

  test("RDDScan with null values") {
    val rdd = spark.sparkContext.parallelize(
      Seq(
        Row(1, "a", null),
        Row(null, "b", 2.0),
        Row(3, null, 3.0)
      ))
    val schema = StructType(
      Seq(
        StructField("id", IntegerType, nullable = true),
        StructField("name", StringType, nullable = true),
        StructField("value", DoubleType, nullable = true)
      ))
    val data = spark.createDataFrame(rdd, schema)
    val expectedAnswer = data.collect()
    val df = asRDDScanDF(data)

    checkAnswer(df, expectedAnswer)
    val cnt = collect(df.queryExecution.executedPlan) { case _: VeloxRDDScanTransformer => true }
    assert(cnt.nonEmpty, "Expected VeloxRDDScanTransformer in plan")
  }

  test("RDDScan with all supported primitive types") {
    val rdd = spark.sparkContext.parallelize(
      Seq(
        Row(
          true,
          1.toByte,
          2.toShort,
          3,
          4L,
          5.0f,
          6.0,
          "hello",
          java.sql.Date.valueOf("2024-01-01"),
          java.sql.Timestamp.valueOf("2024-01-01 12:00:00"),
          Array[Byte](1, 2, 3),
          BigDecimal("123.45").underlying()
        )
      ))
    val schema = StructType(
      Seq(
        StructField("bool", BooleanType),
        StructField("byte", ByteType),
        StructField("short", ShortType),
        StructField("int", IntegerType),
        StructField("long", LongType),
        StructField("float", FloatType),
        StructField("double", DoubleType),
        StructField("string", StringType),
        StructField("date", DateType),
        StructField("timestamp", TimestampType),
        StructField("binary", BinaryType),
        StructField("decimal", DecimalType(10, 2))
      ))
    val data = spark.createDataFrame(rdd, schema)
    val expectedAnswer = data.collect()
    val df = asRDDScanDF(data)

    checkAnswer(df, expectedAnswer)
    val cnt = collect(df.queryExecution.executedPlan) { case _: VeloxRDDScanTransformer => true }
    assert(cnt.nonEmpty, "Expected VeloxRDDScanTransformer in plan")
  }

  test("RDDScan with array type") {
    val rdd = spark.sparkContext.parallelize(
      Seq(
        Row(Seq(1, 2, 3)),
        Row(Seq(4, 5))
      ))
    val schema = StructType(Seq(StructField("arr", ArrayType(IntegerType))))
    val data = spark.createDataFrame(rdd, schema)
    val expectedAnswer = data.collect()
    val df = asRDDScanDF(data)

    checkAnswer(df, expectedAnswer)
    val cnt = collect(df.queryExecution.executedPlan) { case _: VeloxRDDScanTransformer => true }
    assert(cnt.nonEmpty, "Expected VeloxRDDScanTransformer in plan")
  }

  test("RDDScan with map type") {
    val rdd = spark.sparkContext.parallelize(
      Seq(
        Row(Map("a" -> 1, "b" -> 2)),
        Row(Map("c" -> 3))
      ))
    val schema = StructType(Seq(StructField("m", MapType(StringType, IntegerType))))
    val data = spark.createDataFrame(rdd, schema)
    val expectedAnswer = data.collect()
    val df = asRDDScanDF(data)

    checkAnswer(df, expectedAnswer)
    val cnt = collect(df.queryExecution.executedPlan) { case _: VeloxRDDScanTransformer => true }
    assert(cnt.nonEmpty, "Expected VeloxRDDScanTransformer in plan")
  }

  test("RDDScan with struct type") {
    val rdd = spark.sparkContext.parallelize(
      Seq(
        Row(Row("hello", 1)),
        Row(Row("world", 2))
      ))
    val innerSchema = StructType(
      Seq(StructField("name", StringType), StructField("value", IntegerType)))
    val schema = StructType(Seq(StructField("s", innerSchema)))
    val data = spark.createDataFrame(rdd, schema)
    val expectedAnswer = data.collect()
    val df = asRDDScanDF(data)

    checkAnswer(df, expectedAnswer)
    val cnt = collect(df.queryExecution.executedPlan) { case _: VeloxRDDScanTransformer => true }
    assert(cnt.nonEmpty, "Expected VeloxRDDScanTransformer in plan")
  }

  test("RDDScan falls back for unsupported types") {
    val data = spark.sql("SELECT INTERVAL '1' DAY AS di")
    val expectedAnswer = data.collect()
    val result = asRDDScanDF(data)

    // Should still produce correct results via fallback to vanilla Spark
    checkAnswer(result, expectedAnswer)
    val cnt = collect(result.queryExecution.executedPlan) {
      case _: VeloxRDDScanTransformer => true
    }
    assert(cnt.isEmpty, "Expected fallback - VeloxRDDScanTransformer should NOT be in plan")
  }

  test("RDDScan handles BatchCarrierRow from checkpoint") {
    val tempDir = Utils.createTempDir()
    try {
      spark.sparkContext.setCheckpointDir(tempDir.getAbsolutePath)
      val df = spark.range(100).selectExpr("id", "id * 2 as value")
      val checkpointed = df.localCheckpoint()
      val result = asRDDScanDF(checkpointed)

      checkAnswer(result, df.collect())
      val cnt = collect(result.queryExecution.executedPlan) {
        case _: VeloxRDDScanTransformer => true
      }
      assert(cnt.nonEmpty, "Expected VeloxRDDScanTransformer in plan")
    } finally {
      Utils.deleteRecursively(tempDir)
    }
  }
}
Contributor

Missing test coverage for complex types and unsupported-type fallback

Problem: The 7 tests cover primitives, nulls, empty RDD, and aggregation — but two important scenarios are untested: complex types (array/map/struct) and fallback for unsupported types.

Suggested Fix: Add at least these two tests:

    test("RDDScan with array type") {
      val rdd = spark.sparkContext.parallelize(Seq(Row(Seq(1, 2, 3)), Row(Seq(4, 5))))
      val schema = StructType(Seq(StructField("arr", ArrayType(IntegerType))))
      val data = spark.createDataFrame(rdd, schema)
      val expectedAnswer = data.collect()
      val node = LogicalRDD.fromDataset(
        rdd = data.queryExecution.toRdd, originDataset = data, isStreaming = false)
      val df = ClassicDataset.ofRows(spark, node).toDF()
      checkAnswer(df, expectedAnswer)
    }

    test("RDDScan falls back for unsupported types") {
      // Create RDD with CalendarIntervalType or another unsupported type
      // Verify plan does NOT contain VeloxRDDScanTransformer (i.e., fallback occurred)
    }

Author

Added 4 new tests: array type, map type, struct type, and unsupported-type fallback (DayTimeIntervalType → verifies VeloxRDDScanTransformer is absent from plan). Total coverage is now 11 tests.
Contributor

Missing test for BatchCarrierRow unwrap path

Problem: The new BatchCarrierRow branch in doExecuteColumnar is not exercised by any test.

Suggested Fix: Add a test that forces the BatchCarrierRow path:

    test("RDDScan handles BatchCarrierRow from checkpoint") {
      spark.sparkContext.setCheckpointDir(tempPath)
      val df = spark.range(100).selectExpr("id", "id * 2 as value")
      val checkpointed = df.localCheckpoint()
      val result = asRDDScanDF(checkpointed)
      checkAnswer(result, df.collect())
      val cnt = collect(result.queryExecution.executedPlan) {
        case _: VeloxRDDScanTransformer => true
      }
      assert(cnt.nonEmpty, "Expected VeloxRDDScanTransformer in plan")
    }

Author

Added a localCheckpoint() round-trip test that exercises the BatchCarrierRow detection and unwrap logic. It verifies both result correctness and that VeloxRDDScanTransformer is present in the plan.
Contributor

Validation does not recurse into complex type element types

Problem: The type allowlist checks top-level types only. An ArrayType(UnsupportedType) or MapType(StringType, UnsupportedType) would pass validation but could fail at native execution time. The CH backend avoids this by delegating to ConverterUtils.getTypeNode(), which recursively validates.

Suggested Fix: Recurse into element types when validating complex types. Alternatively, delegate to VeloxValidatorApi for centralized type validation.

Author

Thanks, this is a great point. Replaced the manual allowlist with VeloxValidatorApi.validateSchema, which handles recursive validation for complex type elements and also catches variant shredded structs. This keeps validation logic centralized.
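For reference, here is a minimal sketch of what such recursive validation can look like. It is not the actual VeloxValidatorApi or ConverterUtils code; isLeafTypeSupported is a hypothetical predicate standing in for the backend's per-type support check.

    import org.apache.spark.sql.types._

    // Hypothetical recursive validator: returns None if the type is fully supported,
    // or Some(reason) naming the first unsupported (possibly nested) type.
    def validateType(dt: DataType, isLeafTypeSupported: DataType => Boolean): Option[String] =
      dt match {
        case ArrayType(elementType, _) =>
          validateType(elementType, isLeafTypeSupported)
        case MapType(keyType, valueType, _) =>
          validateType(keyType, isLeafTypeSupported)
            .orElse(validateType(valueType, isLeafTypeSupported))
        case StructType(fields) =>
          fields.iterator
            .map(f => validateType(f.dataType, isLeafTypeSupported))
            .collectFirst { case Some(reason) => reason }
        case leaf if isLeafTypeSupported(leaf) => None
        case leaf => Some(s"Schema contains unsupported type: $leaf")
      }

With this shape, a column typed ArrayType of an unsupported element is rejected during validation instead of surfacing as a native-side failure, which is the behavior the centralized validateSchema call is expected to provide.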