@corwinjoy
Contributor

Description

This PR adds encryption support and other advanced file options to delta-rs by implementing a comprehensive framework for file format settings. The changes enable users to configure encryption settings, customize writer properties, and apply file-level formatting options when reading and writing Delta tables.

  • Introduces a FileFormatOptions trait and related infrastructure to handle file-specific configurations
  • Adds support for both simple property-based encryption and KMS-based encryption through new factory patterns
  • Updates all operation builders to accept and propagate file format options throughout the write/read pipeline

In general, we have added a new trait called FileFormatOptions at the root DeltaTable level to unify how files within a delta table are read and written with specific formatting. The idea is that you can apply these settings once, at the top level, and then seamlessly perform any operations with the necessary settings.

This PR leverages the DataFusion TableOptions structure to support format options for multiple underlying file formats. (The idea is that delta-rs may eventually want to support storage formats beyond Parquet, such as Vortex or Lance.) Additionally, it centralizes file format options in a single, consistent location. This avoids the current difficulty of having to set WriterProperties separately and then set reader properties as part of the SessionState. (This is in line with comments from @roeap about how file configuration might be improved: #3300 (comment).) We would also like to eventually extend this work to record these file configurations in the delta table properties. For example, if the files are encrypted, one could add a KMS configuration describing where to retrieve encryption keys.

Review Suggestion

This PR turned out to be larger than we hoped, so apologies for that, but I don't know how to split it into smaller pieces.
When reviewing, we suggest starting with the file crates/core/src/table/file_format_options.rs to get an overview of the new file format trait that can be applied to delta tables.

Related Issue(s)

Support Parquet Modular Encryption:
#3300

Documentation

Parquet Modular Encryption: https://docs.google.com/document/d/1MUg1J7u5VdLkgejJ4ybzfZt1OmwhQkq2iGPxsn4gqLI/edit?tab=t.0#heading=h.34wvmhc1zdch

Attribution

This PR was created in collaboration with @adamreeve

@github-actions github-actions bot added the binding/python and binding/rust labels Sep 29, 2025
@github-actions

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@corwinjoy
Contributor Author

Note that fully supporting Parquet encryption requires being able to get write and read properties per-file, which is why the existing ability to set WriterProperties isn't sufficient, and why WriterPropertiesFactory::create_writer_properties is called per file and requires a file path. This allows generating new random data encryption keys per file and performing tasks such as specifying a per-file AAD prefix or supporting the external storage of encryption keys that can be looked up using the file path.

@corwinjoy
Contributor Author

@rtyler @roeap @alamb Tagging you here per our previous discussion on adding encryption support to delta-rs.

@corwinjoy corwinjoy changed the title from "feat: Add framework for File Format Options" to "feat: add framework for File Format Options" Sep 29, 2025
@rtyler rtyler self-assigned this Sep 30, 2025
@rtyler rtyler marked this pull request as draft September 30, 2025 13:18
@rtyler
Member

rtyler commented Sep 30, 2025

I have marked this pull request as draft. This does not compile as is; I can come back to it once it is able to compile and pass unit tests.

@corwinjoy
Contributor Author

I have marked this pull request as draft. This does not compile as is; I can come back to it once it is able to compile and pass unit tests.

@rtyler OK. It seems that when I auto-merged the main branch it introduced a build error. I have resolved this and the code is once again building and passing unit tests.

@corwinjoy corwinjoy marked this pull request as ready for review October 1, 2025 01:31
Collaborator

@ion-elgreco ion-elgreco left a comment

I can see the benefit, but we really need to reduce the surface area of the changes being introduced.

@roeap
Collaborator

roeap commented Oct 1, 2025

@corwinjoy - awesome to see this come to fruition! Will find some time to give this a review hopefully tomorrow.

At first glance one quick question. Do we see a way to "bundle" the datafusion specific stuff a bit more? It's a bit hard to keep track of all the individual flags while reviewing :)

@corwinjoy
Contributor Author

@roeap

At first glance one quick question. Do we see a way to "bundle" the datafusion specific stuff a bit more? It's a bit hard to keep track of all the individual flags while reviewing :)

What we did to minimize this dependency is define an abstract FileFormatOptions trait. Everything just passes around a FileFormatRef defined as Arc<dyn FileFormatOptions>. Then, only when needed, do we grab final table options or writer properties. Furthermore, we've gated these instances of getting final details behind three function calls in file_format_options.rs:

pub fn build_writer_properties_factory_ffo(
    file_format_options: Option<FileFormatRef>,
) -> Option<Arc<dyn WriterPropertiesFactory>> {...}

pub fn to_table_parquet_options_from_ffo(
    file_format_options: Option<&FileFormatRef>,
) -> Option<TableParquetOptions> {...}

pub fn state_with_file_format_options(
    state: SessionState,
    file_format_options: Option<&FileFormatRef>,
) -> DeltaResult<SessionState> {...}

There might be some ways to refine this further, but in general we've tried to isolate and abstract these file properties where possible and not require datafusion.
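
The pattern described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the PR's code: `FormatOptions`, `FormatRef`, `ParquetLikeOptions`, `WriterSettings`, `Operation`, and `writer_settings_from` are all hypothetical stand-ins chosen for the example. Intermediate layers only store and forward the opaque handle, and a single helper converts it into concrete settings.

```rust
use std::fmt::Debug;
use std::sync::Arc;

/// Hypothetical stand-in for the concrete settings a writer eventually needs.
#[derive(Debug, Clone, PartialEq)]
pub struct WriterSettings {
    pub compression: String,
}

/// Hypothetical stand-in for a `FileFormatOptions`-style trait.
pub trait FormatOptions: Debug + Send + Sync {
    fn writer_settings(&self) -> WriterSettings;
}

/// Shared opaque handle, similar in spirit to `FileFormatRef`.
pub type FormatRef = Arc<dyn FormatOptions>;

/// One possible implementation of the trait (assumed for illustration).
#[derive(Debug)]
pub struct ParquetLikeOptions {
    pub compression: String,
}

impl FormatOptions for ParquetLikeOptions {
    fn writer_settings(&self) -> WriterSettings {
        WriterSettings {
            compression: self.compression.clone(),
        }
    }
}

/// An operation builder that never inspects the options; it only carries them.
#[derive(Debug, Default)]
pub struct Operation {
    format: Option<FormatRef>,
}

impl Operation {
    pub fn with_format_options(mut self, format: FormatRef) -> Self {
        self.format = Some(format);
        self
    }

    pub fn format_options(&self) -> Option<&FormatRef> {
        self.format.as_ref()
    }
}

/// The one place where the handle is turned into concrete settings.
pub fn writer_settings_from(format: Option<&FormatRef>) -> Option<WriterSettings> {
    format.map(|f| f.writer_settings())
}
```

With this shape, only `writer_settings_from` would need to know about the concrete format; everything else depends on the trait object alone.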

@corwinjoy
Contributor Author

@roeap From a user point of view, we've tried hard to make the settings as easy as possible. This can be seen in crates/deltalake/examples/basic_operations_encryption.rs, where we demonstrate different kinds of operations on tables. (We have a more formal unit test at crates/core/tests/commands_with_encryption.rs.) These code examples all look like ordinary operations; all we needed was a common function call when creating DeltaOps:

async fn ops_with_crypto(
    uri: &str,
    file_format_options: &FileFormatRef,
) -> Result<DeltaOps, DeltaTableError> {
    let prefix_uri = format!("file://{}", uri);
    let url = Url::parse(&*prefix_uri).unwrap();
    let ops = DeltaOps::try_from_uri(url).await?;
    Ok(ops.with_file_format_options(file_format_options.clone()))
}

Calling with_file_format_options is sufficient to apply the needed encryption settings for all operations.

# Conflicts:
#	crates/core/src/delta_datafusion/table_provider.rs
#	crates/core/src/operations/delete.rs
#	crates/core/src/operations/drop_constraints.rs
#	crates/core/src/operations/filesystem_check.rs
#	crates/core/src/operations/load.rs
#	crates/core/src/operations/merge/mod.rs
#	crates/core/src/operations/mod.rs
#	crates/core/src/operations/optimize.rs
#	crates/core/src/operations/restore.rs
#	crates/core/src/operations/update.rs
#	crates/core/src/operations/write/mod.rs
#	crates/core/tests/command_optimize.rs
#	crates/core/tests/integration_datafusion.rs
Signed-off-by: Corwin Joy <[email protected]>
# Conflicts:
#	crates/core/src/operations/optimize.rs
Signed-off-by: Corwin Joy <[email protected]>
@ion-elgreco
Collaborator

@ion-elgreco OK. I have migrated these file options to the config property in DeltaTable. This definitely reduced the changes quite a bit so thanks for the suggestion! I think it looks pretty solid and look forward to your feedback when you are ready. @adamreeve

I'll do another review over the weekend

@codecov

codecov bot commented Oct 15, 2025

Codecov Report

❌ Patch coverage is 2.29358% with 426 lines in your changes missing coverage. Please review.
✅ Project coverage is 25.44%. Comparing base (5ac0629) to head (020d409).

Files with missing lines                              Patch %   Lines
crates/core/src/table/file_format_options.rs          0.00%     118 Missing ⚠️
crates/core/src/operations/optimize.rs                0.00%     70 Missing ⚠️
crates/core/src/operations/encryption.rs              0.00%     44 Missing ⚠️
crates/core/src/operations/delete.rs                  0.00%     30 Missing ⚠️
crates/core/src/operations/write/writer.rs            0.00%     30 Missing ⚠️
crates/core/src/writer/record_batch.rs                0.00%     26 Missing ⚠️
crates/core/src/delta_datafusion/table_provider.rs    0.00%     23 Missing ⚠️
crates/core/src/operations/mod.rs                     15.78%    16 Missing ⚠️
crates/core/src/operations/update.rs                  0.00%     14 Missing ⚠️
crates/core/src/operations/merge/mod.rs               0.00%     13 Missing ⚠️
... and 6 more

❗ There is a different number of reports uploaded between BASE (5ac0629) and HEAD (020d409).

HEAD has 3 uploads less than BASE

Flag    BASE (5ac0629)    HEAD (020d409)
        8                 5
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #3794       +/-   ##
===========================================
- Coverage   73.76%   25.44%   -48.33%     
===========================================
  Files         152      126       -26     
  Lines       39524    20084    -19440     
  Branches    39524    20084    -19440     
===========================================
- Hits        29156     5110    -24046     
- Misses       9044    14605     +5561     
+ Partials     1324      369      -955     

@ion-elgreco ion-elgreco self-assigned this Oct 20, 2025
Ok((operation, metrics))
}

async fn get_file_decryption_properties(
Collaborator

Why can this return None as well?

Contributor Author

This can return None because FileFormatOptions may be set only to specify particular Parquet formatting, without including any encryption. FileFormatOptions is for more than just encryption.
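
As a rough illustration of that point, here is a minimal sketch in which every name (`FormatOptions`, `DecryptionProps`, `decryption_props`) is hypothetical and not taken from the PR. It assumes an options struct that can carry formatting settings, encryption settings, both, or neither, so a lookup of decryption properties naturally yields an `Option`.

```rust
/// Hypothetical decryption settings; real code would hold Parquet decryption properties.
#[derive(Debug, Clone, PartialEq)]
pub struct DecryptionProps {
    pub key_id: String,
}

/// Hypothetical format options: formatting and encryption are independent parts.
#[derive(Debug, Clone, Default)]
pub struct FormatOptions {
    pub compression: Option<String>,
    pub decryption: Option<DecryptionProps>,
}

/// Returns `None` when no options are set, and also when options are set
/// but carry no encryption configuration.
pub fn decryption_props(options: Option<&FormatOptions>) -> Option<DecryptionProps> {
    options.and_then(|o| o.decryption.clone())
}
```

Under these assumptions, a table configured only with a compression codec still goes through the same code path and simply reads its files without decryption.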

    file_format_options: Option<FileFormatRef>,
) -> WriterPropertiesFactoryRef {
    build_writer_properties_factory_ffo(file_format_options)
        .unwrap_or_else(|| build_writer_properties_factory_default())
Collaborator

unwrap_or_default is more idiomatic

Contributor Author

@corwinjoy corwinjoy Oct 29, 2025

Not quite sure how to do that since WriterPropertiesFactoryRef is just a typedef and not a true type.
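
For context, a minimal sketch of the constraint, assuming the alias stands for an `Arc<dyn Trait>` as the signatures quoted earlier in this thread suggest. `Option::unwrap_or_default` requires the inner type to implement `Default`, and `Arc<dyn Trait>` does not. The names below (`PropsFactory`, `PropsFactoryRef`, `DefaultFactory`, `CustomFactory`, `default_factory`, `resolve`) are hypothetical stand-ins. One small simplification that remains available is passing the default-building function directly instead of wrapping it in a closure.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the `WriterPropertiesFactory` trait.
pub trait PropsFactory: Send + Sync {
    fn name(&self) -> &'static str;
}

/// Alias for a shared trait object, mirroring the shape of `WriterPropertiesFactoryRef`.
pub type PropsFactoryRef = Arc<dyn PropsFactory>;

pub struct DefaultFactory;

impl PropsFactory for DefaultFactory {
    fn name(&self) -> &'static str {
        "default"
    }
}

pub struct CustomFactory;

impl PropsFactory for CustomFactory {
    fn name(&self) -> &'static str {
        "custom"
    }
}

/// Plays the role of a `build_..._default()` style constructor.
pub fn default_factory() -> PropsFactoryRef {
    Arc::new(DefaultFactory)
}

/// `unwrap_or_default` is not available because the aliased type has no
/// `Default` impl, so the fallback is supplied as a function instead.
pub fn resolve(factory: Option<PropsFactoryRef>) -> PropsFactoryRef {
    factory.unwrap_or_else(default_factory)
}
```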

@ion-elgreco
Collaborator

@corwinjoy this already looks better, but I think we can reduce the number of changed lines even more!

I still have to take a better look at why we need a WriterPropertiesFactory :s but maybe you can explain it shortly for me?

@adamreeve
Contributor

I still have to take a better look at why we need a WriterPropertiesFactory :s but maybe you can explain it shortly for me?

Corwin is busy travelling this week so might be slow to reply, but I can help answer this part. For some use cases it might be fine to have a single WriterProperties instance for all files, but there are a few reasons why you could want to generate new encryption properties per file, and therefore need a factory:

  • To generate new random data encryption keys per file, as it's good security practice to limit how widely one key is used
  • To set a different AAD prefix per file, which prevents attackers from being able to swap out encrypted modules between files and tamper with data
  • To handle schema changes like adding columns while specifying per-column encryption keys
  • To enable the use of external key material, which is when you write a JSON file alongside each Parquet file containing the key metadata. This allows rotation of master keys without having to rewrite Parquet files; only the JSON files need to be rewritten. We don't implement this in the Rust parquet-key-management crate, but it's supported by the Java and C++/Python Parquet implementations.

This also aligns with the DataFusion EncryptionFactory trait that takes the file path as a parameter when creating encryption and decryption properties.
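
The difference between one shared set of properties and a per-file factory can be sketched as below. This is only an illustration under stated assumptions: `FileProps`, `PerFileFactory`, `FixedFactory`, and `RotatingFactory` are hypothetical names, and the counter-based `key_id` is a placeholder for real key generation, which would use a cryptographically secure random source or a KMS rather than a counter.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical per-file settings; real code would carry Parquet encryption properties.
#[derive(Debug, Clone, PartialEq)]
pub struct FileProps {
    /// Placeholder identifier for the data key used by this file (not a real key).
    pub key_id: u64,
    /// AAD prefix tying encrypted modules to one specific file.
    pub aad_prefix: String,
}

/// Hypothetical trait shaped like a factory that is called once per file.
pub trait PerFileFactory {
    fn create_props(&self, file_path: &str) -> FileProps;
}

/// Returns the same properties for every file; nothing depends on the path.
pub struct FixedFactory(pub FileProps);

impl PerFileFactory for FixedFactory {
    fn create_props(&self, _file_path: &str) -> FileProps {
        self.0.clone()
    }
}

/// Hands out a new key id and a path-derived AAD prefix for each file.
#[derive(Default)]
pub struct RotatingFactory {
    next_key_id: AtomicU64,
}

impl PerFileFactory for RotatingFactory {
    fn create_props(&self, file_path: &str) -> FileProps {
        // Illustration only: a counter stands in for generating a fresh key.
        let key_id = self.next_key_id.fetch_add(1, Ordering::Relaxed);
        FileProps {
            key_id,
            aad_prefix: format!("aad:{file_path}"),
        }
    }
}
```

Because the factory receives the file path, it could also look up or store external key material next to each file, which a single shared properties value cannot express.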

Signed-off-by: Corwin Joy <[email protected]>
Signed-off-by: Corwin Joy <[email protected]>
# Conflicts:
#	Cargo.toml
#	crates/core/src/operations/delete.rs
#	crates/core/src/operations/optimize.rs
#	crates/core/src/operations/update.rs
#	crates/core/src/operations/write/execution.rs
#	crates/core/src/operations/write/writer.rs
#	crates/core/src/table/builder.rs
Signed-off-by: Corwin Joy <[email protected]>
@corwinjoy
Contributor Author

@ion-elgreco OK. I have applied the changes you suggested from two weeks ago and re-merged the latest from main. So, I am ready for another review when you get the chance. Thanks again for the suggestions!

Comment on lines +239 to +240
let writer_properties_factory = build_writer_properties_factory_wp(writer_properties);
self.writer_properties_factory = writer_properties_factory;
Collaborator

@ion-elgreco ion-elgreco Nov 1, 2025

I prefer the use of extension traits here, example:

pub trait IntoWriterPropertiesFactoryRef {
    fn into_factory_ref(self) -> WriterPropertiesFactoryRef;
}

impl IntoWriterPropertiesFactoryRef for WriterProperties {
    fn into_factory_ref(self) -> WriterPropertiesFactoryRef {
        Arc::new(SimpleWriterPropertiesFactory::new(self))
    }
}

And this can be applied across the board where we do func(a) -> b
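
To make the comparison concrete, here is a self-contained sketch of both styles side by side. The types are simplified stand-ins rather than the real delta-rs or parquet types: `Props`, `Factory`, `FactoryRef`, `SimpleFactory`, `build_factory`, and `IntoFactoryRef` are all hypothetical names for this example.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for `WriterProperties`.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct Props {
    pub max_row_group_size: usize,
}

/// Hypothetical stand-in for the factory trait.
pub trait Factory {
    fn create(&self, file_path: &str) -> Props;
}

pub type FactoryRef = Arc<dyn Factory>;

/// Factory that returns the same properties for every file.
pub struct SimpleFactory {
    props: Props,
}

impl Factory for SimpleFactory {
    fn create(&self, _file_path: &str) -> Props {
        self.props.clone()
    }
}

/// Free-function style: `func(a) -> b`.
pub fn build_factory(props: Props) -> FactoryRef {
    Arc::new(SimpleFactory { props })
}

/// Extension-trait style: the conversion becomes a method on `a`.
pub trait IntoFactoryRef {
    fn into_factory_ref(self) -> FactoryRef;
}

impl IntoFactoryRef for Props {
    fn into_factory_ref(self) -> FactoryRef {
        Arc::new(SimpleFactory { props: self })
    }
}
```

At a call site the second style reads as `props.into_factory_ref()` instead of `build_factory(props)`; the behavior is the same, so the choice is mostly about discoverability and naming.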

impl PartitionWriter {
    /// Create a new instance of [`PartitionWriter`] from [`PartitionWriterConfig`]
-   pub fn try_with_config(
+   pub async fn try_with_config(
Collaborator

Why has this been made async?

// More advanced factory with KMS support
#[cfg(feature = "datafusion")]
#[derive(Clone, Debug)]
pub struct KMSWriterPropertiesFactory {
Collaborator

I think all the KMS stuff can be moved to your own private crate, since we have the WriterPropertiesFactory trait.


#[cfg(feature = "datafusion")]
#[derive(Clone, Debug, Default)]
pub struct SimpleFileFormatOptions {
Collaborator

What's the purpose of this for other delta-rs users?
