-
Notifications
You must be signed in to change notification settings - Fork 1k
parquet-rewrite: add write_batch_size and compression_level config #8642
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
parquet/src/bin/parquet-rewrite.rs
Outdated
#[clap(long, value_enum)] | ||
compression: Option<CompressionArgs>, | ||
#[clap(long)] | ||
compression: Option<Compression>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, default level would be 1. But now, zstd would be 3 by default. I can also revert this and add a "compression-level" integer config. Both is ok for me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I follow, won't the default level still be 1
? Or is it that now one must provide the level on the command line as --compression zstd(3)
?
|
||
/// Sets write batch size. | ||
#[clap(long)] | ||
write_batch_size: Option<usize>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without batch-size, a small data_page_row_count_limit
might not works
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mapleFU, I think this is a nice improvement. But I think it might be better to keep the current CompressionArgs
and add a compression level command line option as well. Then I don't think this would be a breaking change.
Either that or add to the compression documentation so it's obvious what is now required.
}, | ||
}; | ||
|
||
#[derive(Copy, Clone, PartialEq, Eq, PartialOrd, Ord, ValueEnum, Debug)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One point of this structure is to provide documentation. With your change help now looks like:
% ./target/debug/parquet-rewrite --help
Read and write parquet file with potentially different settings
Usage: parquet-rewrite [OPTIONS] --input <INPUT> --output <OUTPUT>
Options:
-i, --input <INPUT>
Path to input parquet file
-o, --output <OUTPUT>
Path to output parquet file
--compression <COMPRESSION>
Compression used for all columns
where before the compression help was:
--compression <COMPRESSION>
Compression used
Possible values:
- none: No compression
- snappy: Snappy
- gzip: GZip
- lzo: LZO
- brotli: Brotli
- lz4: LZ4
- zstd: Zstd
- lz4-raw: LZ4 Raw
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know, clap might already handle this well?
parquet/src/bin/parquet-rewrite.rs
Outdated
#[clap(long, value_enum)] | ||
compression: Option<CompressionArgs>, | ||
#[clap(long)] | ||
compression: Option<Compression>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I follow, won't the default level still be 1
? Or is it that now one must provide the level on the command line as --compression zstd(3)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mapleFU, looks great! And no longer breaking IMO.
Which issue does this PR close?
Rationale for this change
add write_batch_size config and change compression to use parquet Compression
What changes are included in this PR?
add write_batch_size config and change compression to use parquet Compression
Are these changes tested?
I've try these command by myself.
Are there any user-facing changes?
Yeah