Optimize ParquetRowReader::Impl::skip to avoid io for skipped row groups#531

Open
Weixin-Xu wants to merge 1 commit into
bytedance:mainfrom
Weixin-Xu:parquet_reader_skip

Conversation

@Weixin-Xu
Collaborator

Weixin-Xu commented Apr 28, 2026

What problem does this PR solve?

Issue Number: close #530

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 🚀 Performance improvement (optimization)
  • ⚠️ Breaking change (fix or feature that would cause existing functionality to change)
  • 🔨 Refactoring (no logic changes)
  • 🔧 Build/CI or Infrastructure changes
  • 📝 Documentation only

Description

Describe your changes in detail.
For complex logic, explain the "Why" and "How".

Performance Impact

  • No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).

  • Positive Impact: I have run benchmarks.

    Benchmark Results
    Benchmark: ParquetReader::skip
    
    === Skip Across Row Groups ===
    Case            Old Time    New Time    Speedup    Old IO    New IO
    skip1RG         4.64 ms     4.14 ms     1.1x       18.5 MB   18.5 MB
    skip2RG         7.60 ms     4.76 ms     1.6x       27.8 MB   18.5 MB
    skip3RG         9.94 ms     4.83 ms     2.0x       37.0 MB   18.5 MB
    skip4RG        12.92 ms     4.89 ms     2.6x       46.3 MB   18.5 MB
    
    === Skip Mid Row Groups ===
    (similar pattern: time and IO scale linearly with skip distance in the old implementation and stay constant in the new one)
    
    === Skip Past EOF ===
    Old: 10.74 ms, 46 MB
    New: 44.68 us, 9 MB
    (~240x faster, IO reduced by about 80%)
    
    === Skip Within Row Group ===
    Old: 2.18 ms
    New: 2.11 ms
    (no regression)
    
    === Alternating Next/Skip ===
    Old: 18.46 ms
    New: 18.71 ms
    (no meaningful regression)
    
    Summary:
    - Eliminates redundant IO when skipping across row groups
    - IO no longer scales with skip distance
    - Up to 2.6x speedup for multi-row-group skip
    - Large improvement for skip past EOF (~240x faster)
    
  • Negative Impact: Explained below (e.g., trade-off for correctness).
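
The PR description does not show the implementation, so the following is only a minimal sketch of the idea the benchmark numbers point to: when a skip spans whole row groups, the landing position can be computed from the per-row-group row counts in the file metadata, and only the row group where the skip ends needs to be read. `SkipPlan` and `planSkip` are hypothetical names for illustration, not identifiers from this PR or the codebase.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type; not the actual ParquetRowReader::Impl state.
struct SkipPlan {
  std::size_t rowGroup;       // row group the reader lands in
  std::uint64_t rowInGroup;   // row offset inside that row group
  std::uint64_t rowsSkipped;  // rows actually skipped (fewer than requested at EOF)
  bool atEnd;                 // true when the skip consumed every remaining row
};

// Computes where a skip lands using only row counts from the file metadata.
// Row groups that are skipped whole are never loaded.
//   rowCounts  - number of rows in each row group
//   rowGroup   - index of the current row group
//   rowInGroup - rows already consumed in the current row group
//   toSkip     - number of rows the caller wants to skip
inline SkipPlan planSkip(const std::vector<std::uint64_t>& rowCounts,
                         std::size_t rowGroup,
                         std::uint64_t rowInGroup,
                         std::uint64_t toSkip) {
  assert(rowGroup <= rowCounts.size());
  assert(rowGroup == rowCounts.size() || rowInGroup <= rowCounts[rowGroup]);

  std::uint64_t skipped = 0;
  while (rowGroup < rowCounts.size()) {
    const std::uint64_t remaining = rowCounts[rowGroup] - rowInGroup;
    if (toSkip < remaining) {
      // The skip ends inside this row group: only this group needs IO.
      return {rowGroup, rowInGroup + toSkip, skipped + toSkip, false};
    }
    // The rest of this row group is skipped entirely: advance by metadata alone.
    toSkip -= remaining;
    skipped += remaining;
    ++rowGroup;
    rowInGroup = 0;
  }
  // Ran off the end of the file without touching any further row group.
  return {rowGroup, 0, skipped, true};
}
```

With four row groups of 100 rows each, skipping 250 rows from row 10 of group 0 lands at row 60 of group 2, so only group 2 would need to be read. Skipping 1000 rows from the same position reports the end of the file without loading any row group, which matches the shape of the "Skip Past EOF" result above.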

Release Note

Please describe the changes in this PR

Release Note:
- Fixed a crash in `substr` when input is null.
- Optimized `group by` performance by 20%.

Checklist (For Author)

  • I have added/updated unit tests (ctest).
  • I have verified the code with local build (Release/Debug).
  • I have run clang-format / linters.
  • (Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.
  • No need to test or manual test.

Breaking Changes

  • No

  • Yes (Description: ...)

    Click to view Breaking Changes
    Breaking Changes:
    - Description of the breaking change.
    - Possible solutions or workarounds.
    - Any other relevant information.
    

Weixin-Xu force-pushed the parquet_reader_skip branch 2 times, most recently from c34e254 to d98b401 (May 6, 2026 03:15)
Weixin-Xu force-pushed the parquet_reader_skip branch from d98b401 to 48bf0a2 (May 6, 2026 03:42)

Development

Successfully merging this pull request may close these issues.

[Feature] Optimize ParquetRowReader::Impl::skip to avoid io for skipped row groups
