-
Notifications
You must be signed in to change notification settings - Fork 538
feat: add framework for File Format Options #3794
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Corwin Joy <[email protected]>
|
ACTION NEEDED delta-rs follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. |
|
Note that fully supporting Parquet encryption requires being able to get write and read properties per-file, which is why the existing ability to set |
|
I have marked this pull request as draft. This does not compile as is, I can come back to it once it is able to compile and pass unit tests |
Signed-off-by: Corwin Joy <[email protected]>
@rtyler OK. It seems that when I auto-merged the main branch it introduced a build error. I have resolved this and the code is once again building and passing unit tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see the benefit but we really need to reduce the surface of change that are being introduced
|
@corwinjoy - awesome to see this come to fruition! Will find some time to give this a review hopefully tomorrow. At first glance one quick question. Do we see a way to "bundle" the datafusion specific stuff a bit more? It's a bit hard to keep track of all the individual flags while reviewing :) |
What we did to minimize this dependency is define an abstract There might be some ways to refine this further, but in general we've tried to isolate and abstract these file properties where possible and not require datafusion. |
|
@roeap From a user point of view, we've tried hard to make the settings as easy as possible. This can be seen in Calling |
# Conflicts: # crates/core/src/delta_datafusion/table_provider.rs # crates/core/src/operations/delete.rs # crates/core/src/operations/drop_constraints.rs # crates/core/src/operations/filesystem_check.rs # crates/core/src/operations/load.rs # crates/core/src/operations/merge/mod.rs # crates/core/src/operations/mod.rs # crates/core/src/operations/optimize.rs # crates/core/src/operations/restore.rs # crates/core/src/operations/update.rs # crates/core/src/operations/write/mod.rs # crates/core/tests/command_optimize.rs # crates/core/tests/integration_datafusion.rs
Signed-off-by: Corwin Joy <[email protected]>
Signed-off-by: Corwin Joy <[email protected]>
# Conflicts: # crates/core/src/operations/optimize.rs
Signed-off-by: Corwin Joy <[email protected]>
I'll do another review over the weekend |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3794 +/- ##
===========================================
- Coverage 73.76% 25.44% -48.33%
===========================================
Files 152 126 -26
Lines 39524 20084 -19440
Branches 39524 20084 -19440
===========================================
- Hits 29156 5110 -24046
- Misses 9044 14605 +5561
+ Partials 1324 369 -955 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| Ok((operation, metrics)) | ||
| } | ||
|
|
||
| async fn get_file_decryption_properties( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can this return None as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can return None because FileFormatOptions may be set in order to specify a particular parquet formatting. But, this may not include any encryption. FileFormatOptions is for more than just encryption.
| file_format_options: Option<FileFormatRef>, | ||
| ) -> WriterPropertiesFactoryRef { | ||
| build_writer_properties_factory_ffo(file_format_options) | ||
| .unwrap_or_else(|| build_writer_properties_factory_default()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unwrap_or_default is more idiomatic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite sure how to do that since WriterPropertiesFactoryRef is just a typedef and not a true type.
|
@corwinjoy looks already better but I think we can reduce the amount of line changes even more! I still have to take a better look at why we need a WriterPropertiesFactory :s but maybe you can explain it shortly for me? |
Corwin is busy travelling this week so might be slow to reply, but I can help answer this part. For some use cases it might be fine to have a single
This also aligns with the DataFusion |
Signed-off-by: Corwin Joy <[email protected]>
Signed-off-by: Corwin Joy <[email protected]>
…s_scan functions. Signed-off-by: Corwin Joy <[email protected]>
Signed-off-by: Corwin Joy <[email protected]>
…Builder Signed-off-by: Corwin Joy <[email protected]>
Signed-off-by: Corwin Joy <[email protected]>
Signed-off-by: Corwin Joy <[email protected]>
Signed-off-by: Corwin Joy <[email protected]>
# Conflicts: # Cargo.toml # crates/core/src/operations/delete.rs # crates/core/src/operations/optimize.rs # crates/core/src/operations/update.rs # crates/core/src/operations/write/execution.rs # crates/core/src/operations/write/writer.rs # crates/core/src/table/builder.rs
Signed-off-by: Corwin Joy <[email protected]>
Signed-off-by: Corwin Joy <[email protected]>
…ies. Signed-off-by: Corwin Joy <[email protected]>
|
@ion-elgreco OK. I have applied the changes you suggested from two weeks ago and re-merged the latest from main. So, I am ready for another review when you get the chance. Thanks again for the suggestions! |
| let writer_properties_factory = build_writer_properties_factory_wp(writer_properties); | ||
| self.writer_properties_factory = writer_properties_factory; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer the use of extension traits here, example:
pub trait IntoWriterPropertiesFactoryRef {
fn into_factory_ref(self) -> WriterPropertiesFactoryRef;
}
impl IntoWriterPropertiesFactoryRef for WriterProperties {
fn into_factory_ref(self) -> WriterPropertiesFactoryRef {
Arc::new(SimpleWriterPropertiesFactory::new(self))
}
}And this can be applied across the board where we do func(a) -> b
| impl PartitionWriter { | ||
| /// Create a new instance of [`PartitionWriter`] from [`PartitionWriterConfig`] | ||
| pub fn try_with_config( | ||
| pub async fn try_with_config( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why has this been made async?
| // More advanced factory with KMS support | ||
| #[cfg(feature = "datafusion")] | ||
| #[derive(Clone, Debug)] | ||
| pub struct KMSWriterPropertiesFactory { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be moved (all KMS stuff) to your own private crate Since we have the WriterPropetiesFactory trait
|
|
||
| #[cfg(feature = "datafusion")] | ||
| #[derive(Clone, Debug, Default)] | ||
| pub struct SimpleFileFormatOptions { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the purpose for this for other delta-rs users?
Description
This PR adds encryption support and other advanced file options to
delta-rsby implementing a comprehensive framework for file format settings. The changes enable users to configure encryption settings, customize writer properties, and apply file-level formatting options when reading and writing Delta tables.FileFormatOptionstrait and related infrastructure to handle file-specific configurationsIn general, we have added a new trait called
FileFormatOptionsat the rootDeltaTablelevel to unify how files within a delta table are read and written with specific formatting. The idea is that you can apply these settings once, at the top level, and then seamlessly perform any operations with the necessary settings.This PR leverages the DataFusion
TableOptionsstructure to support format options for multiple underlying file formats. (The idea being thatdelta-rsmay eventually want to support storage formats beyond Parquet, such as Vortex or Lance.) Additionally, it centralizes file format options in a single, consistent location. This avoids the current difficulties where one has to separately setWriterProperties; then reader properties as part of theSessionState. (This is in line with comments from @roeap about how file configuration might be improved: #3300 (comment)). We would also like to eventually extend this upgrade to add notations about these file configurations to the delta table properties. For example, if the files are encrypted, one could add a KMS configuration for where to retrieve encryption keys.Review Suggestion
This PR turned out to be larger than we hoped, so apologies for that, but I don't know how to split it into smaller pieces.
When reviewing, we suggest starting with the file
crates/core/src/table/file_format_options.rsto get an overview of the new file format trait that can be applied to delta tables.Related Issue(s)
Support Parquet Modular Encryption:
#3300
Documentation
Parquet Modular Encryption: https://docs.google.com/document/d/1MUg1J7u5VdLkgejJ4ybzfZt1OmwhQkq2iGPxsn4gqLI/edit?tab=t.0#heading=h.34wvmhc1zdch
Attribution
This PR was created in collaboration with @adamreeve