[GLUTEN-11915][VL] Support RowBasedChecksum in ColumnarShuffleWriter (SPARK-51756)#12067

Open
jaylisde wants to merge 1 commit into apache:main from jaylisde:fix/issue-11915-shuffle-checksum-v2

jaylisde commented May 11, 2026

Summary

Spark 4.1 introduced RowBasedChecksum (SPARK-51756) for detecting non-deterministic stage retries. When spark.sql.shuffle.orderIndependentChecksum.enabled or spark.sql.shuffle.checksum.mismatchFullRetry.enabled is true, the shuffle writer must compute an order-independent per-row checksum and report it to the driver via MapStatus.checksumValue, so that the driver can compare the values across task attempts.

Problem: Gluten's ColumnarShuffleWriter always returns checksumValue = 0, causing the driver to skip non-deterministic retry detection. If a task retry produces different output (e.g., due to round-robin partitioning), downstream consumers may silently read inconsistent data without triggering a full stage retry.

Fix: Implement native C++ row-based checksum computation in VeloxHashShuffleWriter. For each row in doSplit(), serialize the row via UnsafeRowFast and compute its XXH64 hash. The per-row hashes are aggregated per partition using XOR and SUM, both of which are order-independent. The checksum array is returned via JNI to the Scala layer, which passes the aggregated value to MapStatus.checksumValue.
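The per-partition aggregation described above can be illustrated with a minimal, dependency-free sketch. It is not the code from this PR: `hashRowBytes` uses FNV-1a as a stand-in for XXH64, rows are modelled as plain byte strings instead of UnsafeRowFast output, and the names `PartitionChecksum` and `computeChecksums`, as well as the rotate-then-XOR step that combines the two accumulators, are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in 64-bit hash (FNV-1a). The PR hashes serialized rows with XXH64;
// FNV-1a is used here only to avoid an external dependency.
inline uint64_t hashRowBytes(const std::string& bytes) {
  uint64_t h = 0xcbf29ce484222325ULL;  // FNV-1a 64-bit offset basis
  for (unsigned char c : bytes) {
    h ^= c;
    h *= 0x100000001b3ULL;  // FNV-1a 64-bit prime
  }
  return h;
}

// Hypothetical per-partition accumulator. XOR and wrapping addition are both
// commutative and associative, so the result does not depend on row order.
struct PartitionChecksum {
  uint64_t xorAcc = 0;
  uint64_t sumAcc = 0;

  void update(uint64_t rowHash) {
    xorAcc ^= rowHash;
    sumAcc += rowHash;  // unsigned overflow wraps, which is well defined
  }

  // Assumed combination step: rotate the sum, then XOR it with the XOR
  // accumulator. The rotation amount is illustrative.
  int64_t value() const {
    constexpr unsigned kRotate = 27;
    const uint64_t rotated = (sumAcc << kRotate) | (sumAcc >> (64 - kRotate));
    return static_cast<int64_t>(xorAcc ^ rotated);
  }
};

// rows[i] holds the serialized bytes of row i; partitionIds[i] is the
// partition that row is routed to. Returns one checksum per partition.
inline std::vector<int64_t> computeChecksums(
    const std::vector<std::string>& rows,
    const std::vector<uint32_t>& partitionIds,
    uint32_t numPartitions) {
  assert(rows.size() == partitionIds.size());
  std::vector<PartitionChecksum> acc(numPartitions);
  for (std::size_t i = 0; i < rows.size(); ++i) {
    acc[partitionIds[i]].update(hashRowBytes(rows[i]));
  }
  std::vector<int64_t> out;
  out.reserve(numPartitions);
  for (const auto& a : acc) {
    out.push_back(a.value());
  }
  return out;
}
```

Keeping a SUM next to the XOR matters because XOR alone cancels out duplicate rows: two identical rows XOR to zero, so a partition holding a row twice would be indistinguishable from one not holding it at all. The wrapping sum preserves multiplicity while staying order-independent.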

Changes

  • VeloxHashShuffleWriter.cc: Added computeRowBasedChecksums() using UnsafeRowFast + XXH64 with per-partition XOR+SUM aggregation.
  • Options.h, ShuffleWriter.h/cc: Added rowBasedChecksumEnabled option and rowBasedChecksums() accessor.
  • JniWrapper.cc: Accept boolean config param, return checksum array.
  • GlutenSplitResult.java, ShuffleWriterJniWrapper.java: Added rowBasedChecksums field and param.
  • ColumnarShuffleWriter.scala: Read SQLConf (OR logic), pass to native, use for MapStatus.
  • GlutenMapStatusUtil.scala (shims/spark33-41): Cross-version MapStatus compatibility.
  • RowBasedChecksumTest.cc: C++ unit test for order-independence, null handling, determinism.
  • GlutenMapStatusEndToEndSuite.scala: Integration test with ansiFallback=false.
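
As a rough illustration of the "OR logic" and of how a per-partition checksum array might be reduced to the single value reported in MapStatus.checksumValue, consider the sketch below. The PR does this on the Scala side; the sketch uses C++ only to stay in one language. The struct `ChecksumFlags`, both function names, and the multiply-by-31 fold are hypothetical and are not taken from the PR or from Spark.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical mirror of the two Spark SQL flags named in the summary:
// spark.sql.shuffle.orderIndependentChecksum.enabled and
// spark.sql.shuffle.checksum.mismatchFullRetry.enabled.
struct ChecksumFlags {
  bool orderIndependentChecksumEnabled = false;
  bool checksumMismatchFullRetryEnabled = false;
};

// "OR logic": either flag is enough to turn row-based checksums on.
inline bool rowBasedChecksumEnabled(const ChecksumFlags& flags) {
  return flags.orderIndependentChecksumEnabled ||
         flags.checksumMismatchFullRetryEnabled;
}

// Assumed fold of the per-partition checksums into one value. Unlike the
// per-row aggregation, this fold is order-sensitive on purpose: the position
// in the array is the partition index, which is stable across attempts.
inline int64_t aggregateChecksums(
    const std::vector<int64_t>& perPartition, bool enabled) {
  if (!enabled) {
    return 0;  // the value the writer reported before this change
  }
  uint64_t acc = 0;
  for (int64_t v : perPartition) {
    acc = acc * 31 + static_cast<uint64_t>(v);  // wraps on overflow
  }
  return static_cast<int64_t>(acc);
}
```

Doing the arithmetic on `uint64_t` keeps overflow well defined; the result is converted back to a signed 64-bit value only at the end, matching a JVM `long` on the receiving side.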

Test

  • C++ unit test: 4/4 pass (order-independence, data-change detection, null handling, determinism)
  • GlutenMapStatusEndToEndSuite: 3/3 pass (propagation, determinism, data-change detection)

Partially addresses #11915.

Note

File-based shuffle checksum (.checksum file with ADLER32 for corruption diagnosis) will be addressed in a follow-up PR.

github-actions bot added the CORE and VELOX labels May 11, 2026

wForget (Member) commented May 11, 2026

> Spark 4.1 introduced shuffle checksum end-to-end verification (SPARK-53322), requiring MapStatus.checksumValue to be non-zero and .checksum files to contain valid per-partition checksums.

Do you mean https://issues.apache.org/jira/browse/SPARK-51756?

jaylisde (Author) commented May 11, 2026

Thanks @wForget for catching that. The correct reference should be SPARK-54663. I'll update the PR title and description.

jaylisde changed the title from "[GLUTEN-11915][VL] Support checksum-based shuffle writers for Spark 4.1 (SPARK-53322)" to "[GLUTEN-11915][VL] Support checksum-based shuffle writers for Spark 4.1 (SPARK-54663)" May 11, 2026

wForget (Member) commented May 11, 2026

> Thanks @wForget for catching that. The correct reference should be SPARK-54663. I'll update the PR title and description.

SPARK-54663 proposes a row-based checksum, but the current implementation is based on the data file.

jaylisde force-pushed the fix/issue-11915-shuffle-checksum-v2 branch 2 times, most recently from fa1cfba to 0bbebc7 May 11, 2026 09:21

github-actions bot commented

Run Gluten Clickhouse CI on x86

jaylisde force-pushed the fix/issue-11915-shuffle-checksum-v2 branch from 0bbebc7 to 05d464d May 11, 2026 22:30

jaylisde force-pushed the fix/issue-11915-shuffle-checksum-v2 branch from 05d464d to 0f7239d May 11, 2026 22:33

jaylisde changed the title from "[GLUTEN-11915][VL] Support checksum-based shuffle writers for Spark 4.1 (SPARK-54663)" to "[GLUTEN-11915][VL] Support RowBasedChecksum in ColumnarShuffleWriter (SPARK-51756)" May 11, 2026

jaylisde (Author) commented

Thanks @wForget. Updated to proper RowBasedChecksum (SPARK-51756) with native per-row XXH64 + order-independent aggregation. File-based checksum will be a follow-up PR.

jaylisde marked this pull request as draft May 11, 2026 22:54
jaylisde force-pushed the fix/issue-11915-shuffle-checksum-v2 branch from 0f7239d to f120e77 May 12, 2026 00:22
jaylisde force-pushed the fix/issue-11915-shuffle-checksum-v2 branch from f120e77 to 74427a9 May 12, 2026 04:12
jaylisde force-pushed the fix/issue-11915-shuffle-checksum-v2 branch from 74427a9 to 44e0ac4 May 12, 2026 04:29
jaylisde force-pushed the fix/issue-11915-shuffle-checksum-v2 branch from 44e0ac4 to 7601312 May 12, 2026 04:51
jaylisde force-pushed the fix/issue-11915-shuffle-checksum-v2 branch from 7601312 to ba4f04c May 12, 2026 05:08

jaylisde marked this pull request as ready for review May 12, 2026 08:09

philo-he (Member) commented

@marin-ma, could you take a look when you get a chance?

jaylisde force-pushed the fix/issue-11915-shuffle-checksum-v2 branch from 24d256c to 71ef704 May 13, 2026 02:29

…(SPARK-51756)

Implement order-independent row-based checksum for non-deterministic stage retry detection.

- C++ computeRowBasedChecksums(): UnsafeRowFast + XXH64, per-partition XOR+SUM
- JNI: pass config, return checksum array
- Scala: read SQLConf (OR logic), pass to native, use for MapStatus
- Shim: GlutenMapStatusUtil for Spark 3.3-4.1 compatibility
- Tests: C++ unit (4/4) + Scala integration (3/3)

jaylisde force-pushed the fix/issue-11915-shuffle-checksum-v2 branch from 71ef704 to 04208a3 May 13, 2026 02:44
