From 8a9e2f3fdb6e90871d1a05e2a64dec1ee9e84efd Mon Sep 17 00:00:00 2001 From: emkornfield Date: Sat, 25 May 2024 10:33:18 -0700 Subject: [PATCH 1/6] DRAFT: Alternative V3 metadata proposal. Salient points: 1. Introduce a new encoding that allows random access for byte arrays 2. Use page info as a structure for storing lists. 3. Storage pages out of line of thrift. --- README.md | 59 +++++++++++++++ src/main/thrift/parquet.thrift | 134 ++++++++++++++++++++++++++++++--- 2 files changed, 183 insertions(+), 10 deletions(-) diff --git a/README.md b/README.md index 42578c7b..5ebafc8c 100644 --- a/README.md +++ b/README.md @@ -118,6 +118,65 @@ chunks they are interested in. The columns chunks should then be read sequentia ![File Layout](https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif) + ### PAR3 File Footers + + PAR3 file footer footer format designed to better support wider-schemas and more control + over the various footer size vs compute trade-offs. Its format is as follows: + - Data pages containing serialized Thrift metadata objects that were modeled as lists + in PAR1.These are stored contiguously with offsets stored in the FileMetadata. See + parquet.thrift for more details on each. + - Serialized Thrift FileMetadata Structure + - (Optional) 4 byte CRC32 of the serialized Thrift FileMetadata. + - 4-byte length in bytes (little endian) of the serialized FileMetadata structure. + - 4-byte length in bytes (little endian) of all preceding elements in the footer. + - 1 byte flag field to indicate features that require special parsing of the footer. + Readers MUST raise an error if there is an unrecognized flag. Current flags: + + * 0x01 - Footer encryption enabled (when set the encryption information is written before + FileMeta structure as in the PAR1 footer). + * 0x02 - CRC32 of FileMetadata Footer. + + - 4-byte magic number "PAR3" + + When parsing the footer implementations SHOULD read at least the last 10 bytes of the footer. 
Then + read in the entirety of the footer based on the length of all preceding elements. This prevents further + I/O cost for accessing metadata stored in the data pages. PAR3 footers can fully replace PAR1 footers. + If a file is written with only PAR3 footer, implementation MUT write PAR3 as the first four bytes in + they file. PAR3 footers can also be written in a backwards compatible way after PAR1 Metadata + (see next section for details). + + #### Dual Mode PAR1 and PAR3 footers + + There is a desire to gradually rollout PAR3 footers to allow newer readers to take advantage of them, while + older readers can still properly parse the file. This section outlines a strategy to do this. + + As backgroud, Thrift structs are always serialized with a 0 trailing byte do delimit there ending. + Therefore for PAR1 written before PAR3 was introduced are always expect the files to have the following + trailing 9 bytes [0x00, x, x, x, x, P, A, R, 1] (where x can be any value). We also expect all compliant + Thrift parsers to only parse the first available FileMetadata message and stop consuming the stream once read. + Today, we don't believe that any Parquet readers validate that the entire "length in bytes of file metadata" + is consumed. Therefore, to allow both footers to exist simultaneously in the file the following algorithm is used: + + 1. Serialize and write the original (PAR1) FileMetadata thrift structure + 2. Transform the original FileMetadata structure to conform to PAR3 + * Move data elements if necessary + * Generate data pages for elements stored in metadata pages + * Clear the lists that were transferred to metadata pages + 3. Write out metadata pages + 4. Serialize and write the updated Thrift FileMetadata structure. + 5. Write out remainder of PAR3 header (last bytes written are "PAR3"). + 6. Write out the total size in bytes of both the serialized (PAR1) data structure plus the + size of the PAR3 footer as the final 4-byte byte length. + 7. 
Write PAR1 + + When these steps are followed readers wishing to use PAR3 footers SHOULD read the last 12 bytes of the file + and look for "PAR3" written out in step five at the beginning of the 12 bytes. As noted above, there should be + no ambiguity with files generated by Parquet reference implementations, as without PAR3 we expected [x, x, x, 0x00] + for PAR1 files. Any ambiguity can be completely eliminated if the CRC32 is written in PAR3 mode and verified by + readers. + + When embedded into a PAR1 file no modification to the magic number at the beginning of the file is mandated. + ## Metadata There are three types of metadata: file metadata, column (chunk) metadata and page header metadata. All thrift structures are serialized using the TCompactProtocol. diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift index c928ad66..3df09bbf 100644 --- a/src/main/thrift/parquet.thrift +++ b/src/main/thrift/parquet.thrift @@ -537,6 +537,39 @@ enum Encoding { Support for INT32, INT64 and FIXED_LEN_BYTE_ARRAY added in 2.11. */ BYTE_STREAM_SPLIT = 9; + + /** Encoding for variable length binary data that allows random access of values. + * + * This encoding designed for random access of BYTE_ARRAY values. It is mostly useful in cases + * for non-nullable BYTE_ARRAY columns where determining the exact offset of the value does not require + * parsing definition levels. + * + * The layout consists of the following elements elements: + * 1. byte_arrays - Byte Array values layed out contiguously. The BYTE_ARRAYs are immediately contiguous the cumulative + * offsets. + * 2. offsets: A contiguous set of signed N-byte little-endian unsigned integers + * representing the end byte offset (exclusive) of a BYTE_ARRAY value from + * the the beginning of the page. For simplicity of implementation the 0 index is + * always as zero. + * 3. The last byte indicates the number of bytes used for offsets (valid values are 1,2,3 and 4). 
+ * Implementations SHOULD try to use the smallest byte value that meets the length requirements. + * + * Note the order of lengths is reversed from DELTA_BINARY_PACKED to allow for byte array values to + * potentially allow for incremental compression in the case of Data Page V2 or other future data pages + * where values are compressed separately from nesting information. + * + * The beginning offset of the offsets can be determined using the final offset element. + * + * An individual byte array element can be found at an index using the following pseudo-code + * (real implementations SHOULD do bounds checking): + * + * return byte_arrays[offsets[index] : offsets[index+1]] + * + * + * Example encoding of "f", "oo", "bar1" (square brackets delimit the components listed): + * [foobar1][0,1,3,7][1] + */ + RANDOM_ACCESS_BYTE_ARRAY = 10; } /** @@ -779,8 +812,12 @@ struct ColumnMetaData { * whether we can decode those pages. **/ 2: required list encodings - /** Path in schema **/ - 3: required list path_in_schema + /** Path in schema + * Example of deprecated a field for PAR3 + * PAR1 Footer: Required + * PAR3 Footer: Deprecated (don't populate) + */ + 3: optional list path_in_schema /** Compression codec **/ 4: required CompressionCodec codec @@ -792,7 +829,11 @@ struct ColumnMetaData { 6: required i64 total_uncompressed_size /** total byte size of all compressed, and potentially encrypted, pages - * in this column chunk (including the headers) **/ + * in this column chunk (including the headers) + * + * Fetching the range of min(dictionary_page_offset, data_page_offset) + total_compressed_size + * should fetch all data in the the given column chunk + */ 7: required i64 total_compressed_size /** Optional key/value metadata **/ @@ -812,7 +853,7 @@ struct ColumnMetaData { /** Set of all encodings used for pages in this column chunk. 
* This information can be used to determine if all data pages are - * dictionary encoded for example **/ + * dictionary encoded for example **/ 13: optional list encoding_stats; /** Byte offset from beginning of file to Bloom filter data. **/ @@ -881,15 +922,21 @@ struct ColumnChunk { /** Crypto metadata of encrypted columns **/ 8: optional ColumnCryptoMetaData crypto_metadata - /** Encrypted column metadata for this chunk **/ + /** Encrypted column metadata for this chunk + * + * PAR3: Not set see column_metadata_page on FileMetadata struct + **/ 9: optional binary encrypted_column_metadata } struct RowGroup { /** Metadata for each column chunk in this row group. * This list must have the same order as the SchemaElement list in FileMetaData. + * + * PAR1: Required + * PAR3: Not populated. Use columns_page on FileMetadata. **/ - 1: required list columns + 1: optional list columns /** Total byte size of all the uncompressed column data in this row group **/ 2: required i64 total_byte_size @@ -1115,6 +1162,33 @@ union EncryptionAlgorithm { 2: AesGcmCtrV1 AES_GCM_CTR_V1 } +/** + * Description of location of a metadata page. + * + * A metadata page is a data page used to store metadata about + * the data stored in the file. This is a key feature of PAR3 + * footers which allow for deferred decoding of metadata. + * + * For common use cases the current recommendation is to use a + * an encoding that supported random access (e.g. PLAIN for fixed types + * and RANDOM_ACCESS_BYTE_ARRAY for variable sized types). implementations + * SHOULD consider allowing configurability per page to allow for end-users + * to optimize size vs compute trade-offs that make sense for their use-case. + */ +struct MetadataPageLocation { + // Offset from the beginning of the PAR3 footer to the header + // of the data page. + 1: optional i32 footer_offset + + // The length of the serialized page (header + data) in bytes. 
This + // is redundant with information in the header but allow + // for more robust checks before doing any Thrift parsing. + 2: optional i32 full_page_size + + // Optional compression applied to the page. + 3: optional CompressionCodec compression +} + /** * Description for file metadata */ @@ -1127,16 +1201,52 @@ struct FileMetaData { * are flattened to a list by doing a depth-first traversal. * The column metadata contains the path in the schema for that column which can be * used to map columns to nodes in the schema. - * The first element is the root **/ - 2: required list schema; + * The first element is the root + * + * PAR1: Required + * PAR3: Use schema_metadata_page + * + * TODO: This might be too much (i.e. leave as a list for PAR3), but potentially useful for + * wide Schemas if a "schema index" is every added. + **/ + 2: optional list schema; + + /** Required BYTE_ARRAY data where each element is REQUIRED. + * + * Each element is a serialized SchemaElement. The order and content should + * have a one to one correspondence with schema. + * + * If encryption is applied to the footer each element is encrypted individually. + */ + 10: optional MetadataPageLocation schema_page; /** Number of rows in this file **/ 3: required i64 num_rows - /** Row groups in this file **/ + /** Row groups in this file + * + * TODO: Decide if this should be moved to a metadata page. + **/ 4: required list row_groups - /** Optional key/value metadata **/ + /** Required BYTE_ARRAY data where each element is REQUIRED. + * + * Each element is a serialized ColumnChunk. The number of + * elements is M * N, where M is the number row groups in the file + * and N is the number of columns storing data. An columns metadata + * object is stored at `m*N + column index` where m is the row-group + * index. + * + * If encryption applies to the footer each element in page is encrypted + * individually. 
+ * + * PAR1: Don't include + * PAR3: Required **/ + 11: optional MetadataPageLocation columns_page + + /** Optional key/value metadata + * TODO: Consider if this should be moved to use a data page as well + **/ 5: optional list key_value_metadata /** String for application that wrote this file. This should be in the format @@ -1160,6 +1270,10 @@ struct FileMetaData { * * The obsolete min and max fields in the Statistics object are always sorted * by signed comparison regardless of column_orders. + * + * TODO: consider moving to a data page. While fast to decode, this potentially + * compresses/encodes extremely well since it is only a single value at the + * moment. */ 7: optional list column_orders; From 9340c4017cd1243c1c6bda6a505c2a5f553c72f4 Mon Sep 17 00:00:00 2001 From: emkornfield Date: Mon, 27 May 2024 13:26:05 -0700 Subject: [PATCH 2/6] d --- src/main/thrift/parquet.thrift | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift index 3df09bbf..85f01e6d 100644 --- a/src/main/thrift/parquet.thrift +++ b/src/main/thrift/parquet.thrift @@ -1171,9 +1171,11 @@ union EncryptionAlgorithm { * * For common use cases the current recommendation is to use a * an encoding that supported random access (e.g. PLAIN for fixed types - * and RANDOM_ACCESS_BYTE_ARRAY for variable sized types). implementations + * and RANDOM_ACCESS_BYTE_ARRAY for variable sized types). Implementations * SHOULD consider allowing configurability per page to allow for end-users * to optimize size vs compute trade-offs that make sense for their use-case. + * + * Statistics for Metadata pages SHOULD NOT be written. 
*/ struct MetadataPageLocation { // Offset from the beginning of the PAR3 footer to the header From 68048188717cf43b327c36db1b4026ef92a4d0ae Mon Sep 17 00:00:00 2001 From: emkornfield Date: Tue, 28 May 2024 13:03:51 -0700 Subject: [PATCH 3/6] make bit mask 8 bytes --- README.md | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index 5ebafc8c..1eaa2435 100644 --- a/README.md +++ b/README.md @@ -129,7 +129,7 @@ chunks they are interested in. The columns chunks should then be read sequentia - (Optional) 4 byte CRC32 of the serialized Thrift FileMetadata. - 4-byte length in bytes (little endian) of the serialized FileMetadata structure. - 4-byte length in bytes (little endian) of all preceding elements in the footer. - - 1 byte flag field to indicate features that require special parsing of the footer. + - 8-byte little-endian flag field to indicate features that require special parsing of the footer. Readers MUST raise an error if there is an unrecognized flag. Current flags: * 0x01 - Footer encryption enabled (when set the encryption information is written before @@ -138,10 +138,10 @@ chunks they are interested in. The columns chunks should then be read sequentia - 4-byte magic number "PAR3" - When parsing the footer implementations SHOULD read at least the last 10 bytes of the footer. Then + When parsing the footer implementations SHOULD read at least the last 16 bytes of the footer. Then read in the entirety of the footer based on the length of all preceding elements. This prevents further I/O cost for accessing metadata stored in the data pages. PAR3 footers can fully replace PAR1 footers. - If a file is written with only PAR3 footer, implementation MUT write PAR3 as the first four bytes in + If a file is written with only PAR3 footer, implementation MUST write PAR3 as the first four bytes in they file. PAR3 footers can also be written in a backwards compatible way after PAR1 Metadata (see next section for details). 
@@ -172,8 +172,7 @@ chunks they are interested in. The columns chunks should then be read sequentia When these steps are followed readers wishing to use PAR3 footers SHOULD read the last 12 bytes of the file and look for "PAR3" written out in step five at the beginning of the 12 bytes. As noted above, there should be no ambiguity with files generated by Parquet reference implementations, as without PAR3 we expected [x, x, x, 0x00] - for PAR1 files. Any ambiguity can be completely eliminated if the CRC32 is written in PAR3 mode and verified by - readers. + for PAR1 files. Any ambiguity can be completely eliminated if the CRC32 is written in PAR3 mode and verified by readers. When embedded into a PAR1 file no modification to the magic number at the beginning of the file is mandated. From c6f18b22406ef7ae74cc77d827b3c10542f501d9 Mon Sep 17 00:00:00 2001 From: emkornfield Date: Tue, 28 May 2024 13:07:21 -0700 Subject: [PATCH 4/6] fix some grammar --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 1eaa2435..97f994e4 100644 --- a/README.md +++ b/README.md @@ -150,8 +150,8 @@ chunks they are interested in. The columns chunks should then be read sequentia There is a desire to gradually rollout PAR3 footers to allow newer readers to take advantage of them, while older readers can still properly parse the file. This section outlines a strategy to do this. - As backgroud, Thrift structs are always serialized with a 0 trailing byte do delimit there ending. - Therefore for PAR1 written before PAR3 was introduced are always expect the files to have the following + As backgroud, Thrift structs are always serialized with a 0x00 trailing byte to delimit their ending. + Therefore PAR1 files written before PAR3 was introduced should always have a trailing 9 bytes [0x00, x, x, x, x, P, A, R, 1] (where x can be any value). 
We also expect all compliant Thrift parsers to only parse the first available FileMetadata message and stop consuming the stream once read. Today, we don't believe that any Parquet readers validate that the entire "length in bytes of file metadata" From 40769c2e237a0da958249b34a7c8ed99b8c204de Mon Sep 17 00:00:00 2001 From: emkornfield Date: Wed, 29 May 2024 23:55:04 -0700 Subject: [PATCH 5/6] Address comments and make proposal more complete. --- README.md | 51 ++++----- src/main/thrift/parquet.thrift | 198 ++++++++++++++++++--------------- 2 files changed, 130 insertions(+), 119 deletions(-) diff --git a/README.md b/README.md index 97f994e4..2fb3a046 100644 --- a/README.md +++ b/README.md @@ -122,14 +122,10 @@ chunks they are interested in. The columns chunks should then be read sequentia PAR3 file footer footer format designed to better support wider-schemas and more control over the various footer size vs compute trade-offs. Its format is as follows: - - Data pages containing serialized Thrift metadata objects that were modeled as lists - in PAR1.These are stored contiguously with offsets stored in the FileMetadata. See - parquet.thrift for more details on each. - Serialized Thrift FileMetadata Structure - (Optional) 4 byte CRC32 of the serialized Thrift FileMetadata. - - 4-byte length in bytes (little endian) of the serialized FileMetadata structure. - 4-byte length in bytes (little endian) of all preceding elements in the footer. - - 8-byte little-endian flag field to indicate features that require special parsing of the footer. + - 4-byte little-endian flag field to indicate features that require special parsing of the footer. Readers MUST raise an error if there is an unrecognized flag. Current flags: * 0x01 - Footer encryption enabled (when set the encryption information is written before @@ -138,7 +134,7 @@ chunks they are interested in. 
The columns chunks should then be read sequentia - 4-byte magic number "PAR3" - When parsing the footer implementations SHOULD read at least the last 16 bytes of the footer. Then + When parsing the footer implementations SHOULD read at least the last 12 bytes of the footer. Then read in the entirety of the footer based on the length of all preceding elements. This prevents further I/O cost for accessing metadata stored in the data pages. PAR3 footers can fully replace PAR1 footers. If a file is written with only PAR3 footer, implementation MUST write PAR3 as the first four bytes in @@ -147,32 +143,23 @@ chunks they are interested in. The columns chunks should then be read sequentia #### Dual Mode PAR1 and PAR3 footers - There is a desire to gradually rollout PAR3 footers to allow newer readers to take advantage of them, while - older readers can still properly parse the file. This section outlines a strategy to do this. - - As backgroud, Thrift structs are always serialized with a 0x00 trailing byte to delimit their ending. - Therefore PAR1 files written before PAR3 was introduced should always have a - trailing 9 bytes [0x00, x, x, x, x, P, A, R, 1] (where x can be any value). We also expect all compliant - Thrift parsers to only parse the first available FileMetadata message and stop consuming the stream once read. - Today, we don't believe that any Parquet readers validate that the entire "length in bytes of file metadata" - is consumed. Therefore, to allow both footers to exist simultaneously in the file the following algorithm is used: - - 1. Serialize and write the original (PAR1) FileMetadata thrift structure - 2. Transform the original FileMetadata structure to conform to PAR3 - * Move data elements if necessary - * Generate data pages for elements stored in metadata pages - * Clear the lists that were transferred to metadata pages - 3. Write out metadata pages - 4. Serialize and write the updated Thrift FileMetadata structure. - 5. 
Write out remainder of PAR3 header (last bytes written are "PAR3"). - 6. Write out the total size in bytes of both the serialized (PAR1) data structure plus the - size of the PAR3 footer as the final 4-byte byte length. - 7. Write PAR1 - - When these steps are followed readers wishing to use PAR3 footers SHOULD read the last 12 bytes of the file - and look for "PAR3" written out in step five at the beginning of the 12 bytes. As noted above, there should be - no ambiguity with files generated by Parquet reference implementations, as without PAR3 we expected [x, x, x, 0x00] - for PAR1 files. Any ambiguity can be completely eliminated if the CRC32 is written in PAR3 mode and verified by readers. + The following section defines a layout that allows PAR1 + and PAR3 headers to co-exist in a single logical footer + while still allowing legacy readers to read files. + + The layout consists of the following: + - Serialized PAR1 FileMetadata Thrift object + - PAR3 footer as described above + - 4-byte little-endian length in bytes of all + preceding elements. + - 4-byte magic number "PAR1" + + Readers aware of PAR3 can check for the "PAR3" magic number + beginning 12 bytes from the end of the file (this should + be unambiguous because Thrift serialization of structs + uses 0x00 as a field end delimiter). + (TODO: decide if one of the alternatives of embedding + the footer as an unknown field in FileMetadata is desirable, as discussed in [Alkis's doc](https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit)) When embedded into a PAR1 file no modification to the magic number at the beginning of the file is mandated. diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift index 85f01e6d..32f362c5 100644 --- a/src/main/thrift/parquet.thrift +++ b/src/main/thrift/parquet.thrift @@ -537,39 +537,6 @@ enum Encoding { Support for INT32, INT64 and FIXED_LEN_BYTE_ARRAY added in 2.11. 
*/ BYTE_STREAM_SPLIT = 9; - - /** Encoding for variable length binary data that allows random access of values. - * - * This encoding designed for random access of BYTE_ARRAY values. It is mostly useful in cases - * for non-nullable BYTE_ARRAY columns where determining the exact offset of the value does not require - * parsing definition levels. - * - * The layout consists of the following elements elements: - * 1. byte_arrays - Byte Array values layed out contiguously. The BYTE_ARRAYs are immediately contiguous the cumulative - * offsets. - * 2. offsets: A contiguous set of signed N-byte little-endian unsigned integers - * representing the end byte offset (exclusive) of a BYTE_ARRAY value from - * the the beginning of the page. For simplicity of implementation the 0 index is - * always as zero. - * 3. The last byte indicates the number of bytes used for offsets (valid values are 1,2,3 and 4). - * Implementations SHOULD try to use the smallest byte value that meets the length requirements. - * - * Note the order of lengths is reversed from DELTA_BINARY_PACKED to allow for byte array values to - * potentially allow for incremental compression in the case of Data Page V2 or other future data pages - * where values are compressed separately from nesting information. - * - * The beginning offset of the offsets can be determined using the final offset element. 
- * An individual byte array element can be found at an index using the following pseudo-code - * (real implementations SHOULD do bounds checking): - * - * return byte_arrays[offsets[index] : offsets[index+1]] - * - * - * Example encoding of "f", "oo", "bar1" (square brackets delimit the components listed): - * [foobar1][0,1,3,7][1] - */ - RANDOM_ACCESS_BYTE_ARRAY = 10; } /** @@ -803,19 +770,29 @@ struct PageEncodingStats { /** * Description for column metadata + * Next-Id: 20 */ struct ColumnMetaData { - /** Type of this column **/ - 1: required Type type + /** Type of this column + * + * Available from schema via efficient lookup with schema_index. + * + * PAR1: Required. + * PAR3: Don't populate. + **/ + 1: optional Type type /** Set of all encodings used for this column. The purpose is to validate - * whether we can decode those pages. **/ - 2: required list encodings + * whether we can decode those pages. + * + * PAR1: Required. + * PAR3: Don't populate; redundant with column page stats. + **/ + 2: optional list encodings /** Path in schema - * Example of deprecated a field for PAR3 - * PAR1 Footer: Required - * PAR3 Footer: Deprecated (don't populate) + * PAR1 Footer: Required. + * PAR3 Footer: Deprecated (don't populate). Can be inferred from schema element. */ 3: optional list path_in_schema @@ -831,14 +808,21 @@ struct ColumnMetaData { /** total byte size of all compressed, and potentially encrypted, pages * in this column chunk (including the headers) * - * Fetching the range of min(dictionary_page_offset, data_page_offset) + total_compressed_size - * should fetch all data in the the given column chunk + * Fetching the range of min(dictionary_page_offset, data_page_offset) + * + total_compressed_size should fetch all data in the given column + * chunk. */ 7: required i64 total_compressed_size - /** Optional key/value metadata **/ + /** Optional key/value metadata + * PAR1: Optional. + * PAR3: Don't write; use key_value_metadata_page instead. 
+ **/ 8: optional list key_value_metadata + /** See description on FileMetaData.key_value_metadata **/ + 19: optional MetadataPage key_value_metadata_page + /** Byte offset from beginning of file to first data page **/ 9: required i64 data_page_offset @@ -853,8 +837,20 @@ struct ColumnMetaData { /** Set of all encodings used for pages in this column chunk. * This information can be used to determine if all data pages are - * dictionary encoded for example **/ + * dictionary encoded for example + * + * PAR1: Optional. May be deprecated in a future release in favor of + * serialized_encoding_stats. + * PAR3: Don't populate. Write serialized_encoding_stats. + **/ 13: optional list encoding_stats; + /** + * Serialized page encoding stats. + * + * PAR1: Start populating after encoding_stats is deprecated. + * PAR3: Populate instead of encoding_stats. + */ + 17: optional binary serialized_encoding_stats /** Byte offset from beginning of file to Bloom filter data. **/ 14: optional i64 bloom_filter_offset; @@ -872,8 +868,13 @@ struct ColumnMetaData { * representations. The histograms contained in these statistics can * also be useful in some cases for more fine-grained nullability/list length * filter pushdown. + * + * PAR1: Optional. + * PAR3: Populate serialized_size_statistics. */ 16: optional SizeStatistics size_statistics; + /** Thrift serialized SizeStatistics **/ + 18: optional binary serialized_size_statistics; } struct EncryptionWithFooterKey { @@ -895,6 +896,9 @@ union ColumnCryptoMetaData { struct ColumnChunk { /** File where column data is stored. If not set, assumed to be same file as * metadata. This path is relative to the current file. + * + * DEPRECATED. The one known use-case for this is metadata cache files. + * These have been superseded by open source table formats; prefer those. 
**/ 1: optional string file_path @@ -927,6 +931,24 @@ struct ColumnChunk { * PAR3: Not set see column_metadata_page on FileMetadata struct **/ 9: optional binary encrypted_column_metadata + /** + * The column order for this chunk. + * + * If not set readers should check FileMetadata.column_orders + * instead. + * + * Populated in both PAR1 and PAR3 + */ + 10: optional ColumnOrder column_order + /** Set to true if all pages in the column chunk are dictionary + * encoded + */ + 11: optional bool all_pages_dictionary_encoded + /** + * The index to the SchemaElement in FileMetadata for this + * column. + */ + 12: optional i32 schema_index } struct RowGroup { @@ -934,9 +956,17 @@ struct RowGroup { * This list must have the same order as the SchemaElement list in FileMetaData. * * PAR1: Required - * PAR3: Not populated. Use columns_page on FileMetadata. + * PAR3: Not populated. Use columns_page. **/ 1: optional list columns + + /** Page has BYTE_ARRAY data where each element is REQUIRED. + * + * Each element is a Thrift Serialized ColumnChunk + * + * PAR1: Don't include + * PAR3: Required **/ + 8: optional MetadataPage columns_page /** Total byte size of all the uncompressed column data in this row group **/ 2: required i64 total_byte_size @@ -1163,32 +1193,31 @@ union EncryptionAlgorithm { } /** - * Description of location of a metadata page. + * Embedded metadata page. * * A metadata page is a data page used to store metadata about * the data stored in the file. This is a key feature of PAR3 * footers which allow for deferred decoding of metadata. * * For common use cases the current recommendation is to use a - * an encoding that supported random access (e.g. PLAIN for fixed types - * and RANDOM_ACCESS_BYTE_ARRAY for variable sized types). Implementations + * an encoding that supported random access but implementations may choose + * other configuration parameters if necessary. 
Implementations * SHOULD consider allowing configurability per page to allow for end-users * to optimize size vs compute trade-offs that make sense for their use-case. * * Statistics for Metadata pages SHOULD NOT be written. + * + * Structs of this type should never be written in PAR1. */ -struct MetadataPageLocation { - // Offset from the beginning of the PAR3 footer to the header - // of the data page. - 1: optional i32 footer_offset - - // The length of the serialized page (header + data) in bytes. This - // is redundant with information in the header but allow - // for more robust checks before doing any Thrift parsing. - 2: optional i32 full_page_size - +struct MetadataPage { + // A serialized page including metadata thrift header and data. + 1: required binary page // Optional compression applied to the page. - 3: optional CompressionCodec compression + 2: optional CompressionCodec compression + // Number of elements stored. This is duplicated here to help in + // use-cases where knowing the total number of elements up front for + // computation would be useful. + 3: num_values } /** @@ -1206,51 +1235,46 @@ struct FileMetaData { * The first element is the root * * PAR1: Required - * PAR3: Use schema_metadata_page - * - * TODO: This might be too much (i.e. leave as a list for PAR3), but potentially useful for - * wide Schemas if a "schema index" is every added. + * PAR3: Use schema_page **/ 2: optional list schema; - /** Required BYTE_ARRAY data where each element is REQUIRED. + /** Page has BYTE_ARRAY data where each element is REQUIRED. * * Each element is a serialized SchemaElement. The order and content should * have a one to one correspondence with schema. * * If encryption is applied to the footer each element is encrypted individually. 
*/ - 10: optional MetadataPageLocation schema_page; + 10: optional binary schema_page; /** Number of rows in this file **/ 3: required i64 num_rows /** Row groups in this file - * - * TODO: Decide if this should be moved to a metadata page. - **/ - 4: required list row_groups - - /** Required BYTE_ARRAY data where each element is REQUIRED. * - * Each element is a serialized ColumnChunk. The number of - * elements is M * N, where M is the number row groups in the file - * and N is the number of columns storing data. An columns metadata - * object is stored at `m*N + column index` where m is the row-group - * index. - * - * If encryption applies to the footer each element in page is encrypted - * individually. - * - * PAR1: Don't include - * PAR3: Required **/ - 11: optional MetadataPageLocation columns_page + * PAR1: Required + * PAR3: Use row_groups_page + **/ + 4: optional list row_groups + /** Page has BYTE_ARRAY data where each element is REQUIRED. + * Each element is a thrift serialized RowGroup. + */ + 10: optional MetadataPage row_groups_page /** Optional key/value metadata - * TODO: Consider if this should be moved to use a data page as well + * + * PAR1: optional + * PAR3: Use key_value_metadata_page **/ 5: optional list key_value_metadata + /** Page has BYTE_ARRAY data where each element is REQUIRED. + * + * Each element in the page is a serialized KeyValue struct. + */ + 13: optional MetadataPage key_value_metadata_page + /** String for application that wrote this file. This should be in the format * version (build ). * e.g. impala version 1.0 (build 6cf94d29b2b7115df4de2c06e2ab4326d721eb55) @@ -1273,9 +1297,9 @@ struct FileMetaData { * The obsolete min and max fields in the Statistics object are always sorted * by signed comparison regardless of column_orders. * - * TODO: consider moving to a data page. While fast to decode, this potentially - * compresses/encodes extremely well since it is only a single value at the - * moment. 
+ * PAR1: Optional, may be deprecated in the future in favor of + * ColumnChunk.column_order + * PAR3: Not written use ColumnChunk.column_order. */ 7: optional list column_orders; From 150cf3c59353bd6df418bbddcba6a9dd801698c8 Mon Sep 17 00:00:00 2001 From: emkornfield Date: Thu, 30 May 2024 00:54:24 -0700 Subject: [PATCH 6/6] Remove some out of date comments and fix types on SchemaElement --- src/main/thrift/parquet.thrift | 10 +++------- 1 file changed, 3 insertions(+), 7 deletions(-) diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift index 32f362c5..415b966b 100644 --- a/src/main/thrift/parquet.thrift +++ b/src/main/thrift/parquet.thrift @@ -926,10 +926,7 @@ struct ColumnChunk { /** Crypto metadata of encrypted columns **/ 8: optional ColumnCryptoMetaData crypto_metadata - /** Encrypted column metadata for this chunk - * - * PAR3: Not set see column_metadata_page on FileMetadata struct - **/ + /** Encrypted column metadata for this chunk **/ 9: optional binary encrypted_column_metadata /** * The column order for this chunk. @@ -1243,10 +1240,8 @@ struct FileMetaData { * * Each element is a serialized SchemaElement. The order and content should * have a one to one correspondence with schema. - * - * If encryption is applied to the footer each element is encrypted individually. */ - 10: optional binary schema_page; + 10: optional MetadataPage schema_page; /** Number of rows in this file **/ 3: required i64 num_rows @@ -1258,6 +1253,7 @@ struct FileMetaData { **/ 4: optional list row_groups /** Page has BYTE_ARRAY data where each element is REQUIRED. + * * Each element is a thrift serialized RowGroup. */ 10: optional MetadataPage row_groups_page
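
For reviewers who want to sanity-check the trailer layout from PATCH 5/6 (4-byte little-endian length of all preceding footer elements, 4-byte little-endian flag field, then the "PAR3" magic), here is a minimal Python sketch of how a reader might decode the last 12 bytes of a PAR3-only file. The function name `parse_par3_trailer` and the constant names are hypothetical and not part of the proposal, and treating the length and flag fields as unsigned is an assumption, since the README text only says "little-endian".

```python
import struct
from typing import Tuple

PAR3_MAGIC = b"PAR3"
# Flag values as listed in the README; the constant names are illustrative.
FLAG_FOOTER_ENCRYPTED = 0x01  # footer encryption enabled
FLAG_CRC32_PRESENT = 0x02     # CRC32 of the serialized FileMetadata is written
KNOWN_FLAGS = FLAG_FOOTER_ENCRYPTED | FLAG_CRC32_PRESENT


def parse_par3_trailer(tail: bytes) -> Tuple[int, int]:
    """Decode the final 12 bytes of a PAR3-only file.

    Returns (length of all preceding footer elements, flags).
    Assumes both 4-byte fields are unsigned (not stated in the proposal).
    """
    if len(tail) < 12:
        raise ValueError("need at least the last 12 bytes of the file")
    # "<II4s": little-endian, two unsigned 32-bit ints, then 4 raw bytes.
    preceding_len, flags, magic = struct.unpack("<II4s", tail[-12:])
    if magic != PAR3_MAGIC:
        raise ValueError("trailing magic is not PAR3")
    # The README says readers MUST raise an error on an unrecognized flag.
    if flags & ~KNOWN_FLAGS:
        raise ValueError("unrecognized footer flag bits: 0x%x" % (flags & ~KNOWN_FLAGS))
    return preceding_len, flags
```

A reader would then fetch `preceding_len` bytes ending just before this trailer to obtain the serialized FileMetadata and, when flag 0x02 is set, the CRC32. In the dual-mode layout the same 12 bytes sit 8 bytes further from the end of the file, ahead of the combined length and the "PAR1" magic, so the slice passed in would need to be adjusted accordingly.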