Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-41246: [Docs][C++][Python] Improve docs on column encryption for nested fields #45411

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

EnricoMi
Copy link
Contributor

@EnricoMi EnricoMi commented Feb 1, 2025

Rationale for this change

Encrypting columns with nested fields with a column key is not trivial since only leaf fields are allowed in the column key map. Documentation emphasizes this fact and provides examples.

What changes are included in this PR?

This amends the documentation on encryption for C++ and Python.

Are these changes tested?

Only documentation.

Are there any user-facing changes?

Only documentation.

Copy link

github-actions bot commented Feb 1, 2025

⚠️ GitHub issue #41246 has been automatically assigned in GitHub to PR creator.

@EnricoMi EnricoMi force-pushed the docs-column-encryption-nested-fields branch from 543bd7b to 56e803d Compare February 1, 2025 16:58
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for noticing and document this @EnricoMi . Here are assorted comments.

Comment on lines 658 to 661
encryption_config->column_keys = "column_key_name: "
"ListColumn.list.element, "
"MapColumn.key_value.key, MapColumn.key_value.value, "
"StructColumn.f1, StructColumn.f2"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the spaces embedded in the string actually supported? Also, it seems to lack a semicolon at the end of the line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it supports extra whitespaces, it trims the strings. The final semicolon is not needed, it would introduce a next empty section:

ColumnPathToEncryptionPropertiesMap CryptoFactory::GetColumnEncryptionProperties(
int dek_length, const std::string& column_keys, FileKeyWrapper* key_wrapper) {
ColumnPathToEncryptionPropertiesMap encrypted_columns;
std::vector<::std::string_view> key_to_columns =
::arrow::internal::SplitString(column_keys, ';');
for (size_t i = 0; i < key_to_columns.size(); ++i) {
std::string cur_key_to_columns =
::arrow::internal::TrimString(std::string(key_to_columns[i]));
if (cur_key_to_columns.empty()) {
continue;
}
std::vector<::std::string_view> parts =
::arrow::internal::SplitString(cur_key_to_columns, ':');
if (parts.size() != 2) {
std::ostringstream message;
message << "Incorrect key to columns mapping in column keys property"
<< ": [" << cur_key_to_columns << "]";
throw ParquetException(message.str());
}
std::string column_key_id = ::arrow::internal::TrimString(std::string(parts[0]));
if (column_key_id.empty()) {
throw ParquetException("Empty key name in column keys property.");
}
std::string column_names_str = ::arrow::internal::TrimString(std::string(parts[1]));
std::vector<::std::string_view> column_names =
::arrow::internal::SplitString(column_names_str, ',');
if (0 == column_names.size()) {
throw ParquetException("No columns to encrypt defined for key: " + column_key_id);
}
for (size_t j = 0; j < column_names.size(); ++j) {
std::string column_name =
::arrow::internal::TrimString(std::string(column_names[j]));
if (column_name.empty()) {
std::ostringstream message;
message << "Empty column name in column keys property for key: " << column_key_id;
throw ParquetException(message.str());
}
if (encrypted_columns.find(column_name) != encrypted_columns.end()) {
throw ParquetException("Multiple keys defined for the same column: " +
column_name);
}
std::string column_key(dek_length, '\0');
RandBytes(reinterpret_cast<uint8_t*>(column_key.data()), column_key.size());
std::string column_key_key_metadata =
key_wrapper->GetEncryptionKeyMetadata(column_key, column_key_id, false);
std::shared_ptr<ColumnEncryptionProperties> cmd =
ColumnEncryptionProperties::Builder(column_name)
.key(column_key)
->key_metadata(column_key_key_metadata)
->build();
encrypted_columns.insert({column_name, cmd});
}
}
if (encrypted_columns.empty()) {
throw ParquetException("No column keys configured in column keys property.");
}
return encrypted_columns;
}


An example for writing a dataset using encrypted Parquet file format:

.. code-block:: cpp
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche @AlenkaF @raulcd What is our preferred policy for code examples? Do we put them inline in the docs? Do we use separate files?

docs/source/cpp/parquet.rst Outdated Show resolved Hide resolved
docs/source/cpp/parquet.rst Outdated Show resolved Hide resolved
docs/source/python/parquet.rst Outdated Show resolved Hide resolved
docs/source/python/parquet.rst Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Feb 6, 2025
@EnricoMi
Copy link
Contributor Author

EnricoMi commented Feb 7, 2025

There is some improvement for this non-intuitive naming scheme: #45462

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants