zuston
Member

@zuston zuston commented Sep 4, 2025

What changes were proposed in this pull request?

This PR introduces overlapping decompression in the shuffle read path to speed up shuffle reads.
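For readers unfamiliar with the idea, here is a minimal sketch of overlapping decompression, not the code in this PR. It assumes blocks arrive as independent compressed byte arrays, uses `java.util.zip` as a stand-in codec, and the names `OverlappingDecompressionSketch`, `readAll`, and `maxInFlight` are hypothetical. Blocks are decompressed on a background pool while the reader consumes earlier ones, and results are taken from a FIFO queue of futures so the output order matches the input order.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical illustration only; this is not Uniffle's implementation.
public class OverlappingDecompressionSketch {

    // Helper to build test input: compress one block with DEFLATE.
    static byte[] compress(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress one block; stands in for whatever codec the shuffle client uses.
    static byte[] decompress(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new DataFormatException("truncated or invalid block");
                }
                out.write(buf, 0, n);
            }
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Keep up to maxInFlight blocks decompressing in the background while the
    // caller consumes earlier blocks. Futures are dequeued in submission order,
    // so the returned list has the same order as the input list.
    static List<byte[]> readAll(List<byte[]> blocks, int maxInFlight, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Queue<Future<byte[]>> inFlight = new ArrayDeque<>();
        List<byte[]> result = new ArrayList<>();
        Iterator<byte[]> it = blocks.iterator();
        try {
            while (it.hasNext() || !inFlight.isEmpty()) {
                // Top up the window of pending decompression tasks.
                while (it.hasNext() && inFlight.size() < maxInFlight) {
                    byte[] block = it.next();
                    Callable<byte[]> task = () -> decompress(block);
                    inFlight.add(pool.submit(task));
                }
                // Block only on the oldest task; later ones keep running meanwhile.
                // A real reader would deserialize records here instead of collecting.
                result.add(inFlight.remove().get());
            }
        } finally {
            pool.shutdownNow();
        }
        return result;
    }
}
```

Bounding the number of in-flight blocks caps the memory held by decompressed data that has not yet been consumed; the right bound depends on block size and available memory.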

Why are the changes needed?

For #2601.

When applying #2598 to the 100 GB Terasort benchmark, I found that decompression time was a bottleneck in the read phase.

In the same 100 GB Terasort benchmark, applying this PR reduced shuffle read time by 50%.

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

  1. Unit tests.

@zuston zuston linked an issue Sep 4, 2025 that may be closed by this pull request
3 tasks
@zuston zuston requested review from xianjingfeng and jerqi September 4, 2025 10:07

github-actions bot commented Sep 4, 2025

Test Results

 3 105 files  +15   3 105 suites  +15   6h 49m 38s ⏱️ +9s
 1 200 tests + 2   1 199 ✅ + 3   1 💤 ±0  0 ❌ ±0 
15 196 runs  +30  15 181 ✅ +32  15 💤 ±0  0 ❌ ±0 

Results for commit deb3a6c. ± Comparison against base commit 2a32171.

♻️ This comment has been updated with latest results.

@jerqi
Contributor

jerqi commented Sep 4, 2025

Spark is OK, but MR may need ordering.

@zuston
Member Author

zuston commented Sep 4, 2025

> Spark is OK, but MR may need ordering.

Thanks for pointing this out, but the implementation already preserves the order. BTW, this PR currently only takes effect in Spark.

@zuston zuston merged commit 1e48bc6 into apache:master Sep 5, 2025
80 of 81 checks passed
@zuston zuston deleted the readoverlapping branch September 5, 2025 09:02
Development

Successfully merging this pull request may close these issues.

[FEATURE] Overlapping decompression for shuffle read
2 participants