-
Notifications
You must be signed in to change notification settings - Fork 709
Implement bound truncation in parquet writer #29164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Conversation
With truncation on statistical bounds enabled these bounds won't always be equal to the actual min/max in the page(s). Hence for types that can be truncated the equality check is relaxed to a ordering check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements bound truncation in the parquet writer to prevent excessive file size growth when field sizes exceed target page sizes. The implementation truncates min/max statistical bounds to 64 bytes by default, which addresses an issue where large fields caused files to become ~3x larger than expected due to full bounds being encoded per page.
Key Changes
- Added UTF-8-aware bound truncation logic for binary types (strings, byte arrays)
- Implemented reverse byte iteration support in
iobufto enable efficient bound truncation - Added comprehensive UTF-8 utilities for code point manipulation and validation
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| src/v/test_utils/go/parquet_verifier/main.go | Updated verifier to validate truncated bounds and handle bound comparisons for byte arrays |
| src/v/strings/utf8.h | Added UTF-8 validation, code point manipulation, and reverse iteration utilities |
| src/v/strings/tests/utf8_test.cc | Added comprehensive tests for new UTF-8 utilities |
| src/v/strings/tests/BUILD | Added new test file and bytes dependency |
| src/v/strings/BUILD | Added boost::container dependency |
| src/v/serde/parquet/writer.h | Added max_bound_size configuration option with 64-byte default |
| src/v/serde/parquet/writer.cc | Passed max_bound_size to column writer |
| src/v/serde/parquet/tests/generate_file.cc | Updated test data generation to include larger byte arrays |
| src/v/serde/parquet/tests/column_stats_collector_test.cc | Added tests for bound truncation logic |
| src/v/serde/parquet/tests/BUILD | Added strings:utf8 dependency |
| src/v/serde/parquet/encoding.h | Changed encode_for_stats parameters to non-const references |
| src/v/serde/parquet/encoding.cc | Changed to use share() instead of copy() for efficiency |
| src/v/serde/parquet/column_writer.h | Added max_bound_size_bytes option |
| src/v/serde/parquet/column_writer.cc | Implemented bound truncation with UTF-8 awareness for string/binary types |
| src/v/serde/parquet/column_stats_collector.h | Added truncator template parameter and bound truncation logic |
| src/v/serde/parquet/column_stats_collector.cc | Implemented binary_bound_truncator with UTF-8 support |
| src/v/serde/parquet/BUILD | Added strings:utf8 dependency |
| src/v/bytes/tests/iobuf_tests.cc | Added tests for reverse byte iteration |
| src/v/bytes/iobuf.h | Added reverse iterator support |
| src/v/bytes/iobuf.cc | Fixed comparison bug when right-hand side becomes empty |
| src/v/bytes/details/io_placeholder.h | Made write() method generic for uint8_t and char |
| src/v/bytes/details/io_fragment.h | Added const_reverse_iterator support |
| src/v/bytes/details/io_byte_iterator.h | Implemented reverse byte iteration as io_byte_iterator_base |
b38d59c to
fa914a6
Compare
Larger fields are needed to properly test bound truncation
fa914a6 to
c3837e1
Compare
| if (rhs.empty()) { | ||
| rhs = other_next_view(); | ||
| if (o_it == o.cend()) { | ||
| if (rhs.empty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll open up a separate PR to backport this fix to older RP versions.
Retry command for Build#78626please wait until all jobs are finished before running the slash command |
| [[gnu::always_inline]] void write(const char* src, size_t len) { | ||
| template<typename T> | ||
| [[gnu::always_inline]] void write(const T* src, size_t len) | ||
| requires(sizeof(T) == 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function needs a doc now that it's not entirely obvious what it is doing.
It will do an element-wise copy from the src array to this placeholder, if T can be assigned to a char, right?
|
Except for really trivial commits, commit messages should have a body as well, e.g., why this is being added, etc. |
| // handle an empty fragment | ||
| if (_frag_index == _frag_index_end) { | ||
| continue; | ||
| if constexpr (Forward) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This block of logic should be DRY with the identical block of logic in the ctor, probably called "maybe_next_fragment", called in the ctor and after every increment/decrement type operation.
| } | ||
| } else { | ||
| // NOLINTNEXTLINE(cppcoreguidelines-pro-bounds-pointer-arithmetic) | ||
| if (_frag_index-- == _frag_index_end) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will form a "one before the start" pointer (when the condition is true), which is not allowed (yes, it's quite annoying).
|
|
||
| #pragma once | ||
|
|
||
| #include "bytes/details/io_fragment.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iobuf_fuzz should probably be augemnted with reverse iterator cases
| if ok != !(colMin.IsNull() && colMax.IsNull()) { | ||
| log.Fatalf( | ||
| "❌ : missing column bounds for row group %d column %d (%d, %d)\n", | ||
| i, j, col.NumValues(), col.NullCount(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is allowed in the parquet format? I.e., it is not required that the min/max actually exist in the page, just that they are bounds?
|
After discussing this PR with @travisdowns I've decided to just copy the truncated bounds to a These changes should be up tomorrow. Switching the PR back to draft until they are up. |
This PR fixes an issue I ran into while testing the Iceberg implementation against large messages/fields. During the test the field size exceeded the targeted page size. This resulted in most pages only having a single field. And since the full min/max bound is also encoded per page the same field was encoded 3 times per parquet file. This resulted in the file being ~3x larger than expected. The fix for this is to just truncate the bounds as allowed by parquet.
Backports Required
Release Notes
Improvements