-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-41246: [Docs][C++][Python] Improve docs on column encryption for nested fields #45411
base: main
Are you sure you want to change the base?
GH-41246: [Docs][C++][Python] Improve docs on column encryption for nested fields #45411
Conversation
|
543bd7b
to
56e803d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for noticing and document this @EnricoMi . Here are assorted comments.
docs/source/cpp/parquet.rst
Outdated
encryption_config->column_keys = "column_key_name: " | ||
"ListColumn.list.element, " | ||
"MapColumn.key_value.key, MapColumn.key_value.value, " | ||
"StructColumn.f1, StructColumn.f2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the spaces embedded in the string actually supported? Also, it seems to lack a semicolon at the end of the line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it supports extra whitespaces, it trims the strings. The final semicolon is not needed, it would introduce a next empty section:
arrow/cpp/src/parquet/encryption/crypto_factory.cc
Lines 101 to 167 in 240ebc9
ColumnPathToEncryptionPropertiesMap CryptoFactory::GetColumnEncryptionProperties( | |
int dek_length, const std::string& column_keys, FileKeyWrapper* key_wrapper) { | |
ColumnPathToEncryptionPropertiesMap encrypted_columns; | |
std::vector<::std::string_view> key_to_columns = | |
::arrow::internal::SplitString(column_keys, ';'); | |
for (size_t i = 0; i < key_to_columns.size(); ++i) { | |
std::string cur_key_to_columns = | |
::arrow::internal::TrimString(std::string(key_to_columns[i])); | |
if (cur_key_to_columns.empty()) { | |
continue; | |
} | |
std::vector<::std::string_view> parts = | |
::arrow::internal::SplitString(cur_key_to_columns, ':'); | |
if (parts.size() != 2) { | |
std::ostringstream message; | |
message << "Incorrect key to columns mapping in column keys property" | |
<< ": [" << cur_key_to_columns << "]"; | |
throw ParquetException(message.str()); | |
} | |
std::string column_key_id = ::arrow::internal::TrimString(std::string(parts[0])); | |
if (column_key_id.empty()) { | |
throw ParquetException("Empty key name in column keys property."); | |
} | |
std::string column_names_str = ::arrow::internal::TrimString(std::string(parts[1])); | |
std::vector<::std::string_view> column_names = | |
::arrow::internal::SplitString(column_names_str, ','); | |
if (0 == column_names.size()) { | |
throw ParquetException("No columns to encrypt defined for key: " + column_key_id); | |
} | |
for (size_t j = 0; j < column_names.size(); ++j) { | |
std::string column_name = | |
::arrow::internal::TrimString(std::string(column_names[j])); | |
if (column_name.empty()) { | |
std::ostringstream message; | |
message << "Empty column name in column keys property for key: " << column_key_id; | |
throw ParquetException(message.str()); | |
} | |
if (encrypted_columns.find(column_name) != encrypted_columns.end()) { | |
throw ParquetException("Multiple keys defined for the same column: " + | |
column_name); | |
} | |
std::string column_key(dek_length, '\0'); | |
RandBytes(reinterpret_cast<uint8_t*>(column_key.data()), column_key.size()); | |
std::string column_key_key_metadata = | |
key_wrapper->GetEncryptionKeyMetadata(column_key, column_key_id, false); | |
std::shared_ptr<ColumnEncryptionProperties> cmd = | |
ColumnEncryptionProperties::Builder(column_name) | |
.key(column_key) | |
->key_metadata(column_key_key_metadata) | |
->build(); | |
encrypted_columns.insert({column_name, cmd}); | |
} | |
} | |
if (encrypted_columns.empty()) { | |
throw ParquetException("No column keys configured in column keys property."); | |
} | |
return encrypted_columns; | |
} |
|
||
An example for writing a dataset using encrypted Parquet file format: | ||
|
||
.. code-block:: cpp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche @AlenkaF @raulcd What is our preferred policy for code examples? Do we put them inline in the docs? Do we use separate files?
Co-authored-by: Antoine Pitrou <[email protected]>
There is some improvement for this non-intuitive naming scheme: #45462 |
Rationale for this change
Encrypting columns with nested fields with a column key is not trivial since only leaf fields are allowed in the column key map. Documentation emphasizes this fact and provides examples.
What changes are included in this PR?
This amends the documentation on encryption for C++ and Python.
Are these changes tested?
Only documentation.
Are there any user-facing changes?
Only documentation.