DRAFT: Alternative Strawman proposal for a new V3 footer format in Parquet #250
@@ -118,6 +118,51 @@ chunks they are interested in. The columns chunks should then be read sequentia | |
|
||
![File Layout](https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif) | ||
|
||
### PAR3 File Footers | ||
|
||
The PAR3 file footer format is designed to better support wider schemas and more control | ||
over the various footer size vs compute trade-offs. Its format is as follows: | ||
- Serialized Thrift FileMetadata Structure | ||
- (Optional) 4-byte CRC32 of the serialized Thrift FileMetadata. | ||
[Review comment] This should not be optional. All bytes are not equal in a file. In particular, footer bytes are very important because if those are corrupt we can't read any bytes of the file. If anything, footers not having a required checksum for their content is a design flaw of the original Parquet specification.
[Reply] I tend to agree; others had concerns that it was gratuitous. I can update to make it required.
||
- 4-byte length in bytes (little endian) of all preceding elements in the footer. | ||
[Review comment] I recommend a CRC32 of the length itself. This is to detect early that a footer is corrupt and avoid reading some epic amount of garbage from the end of the file. For example, think of a bit flip in one of the top bits of the length: it will cause a reader to read 100s of MBs from the end of the file only to find that the CRC doesn't match.
[Reply] I'm OK adding this. As a counterpoint, 100s of MB would ideally be rejected by reasonable memory limitations on the footer.
[Reply] There's generally an expectation that filesystems do validation, but CRCs are relatively low cost and help find problems in networking and NICs.
||
- 4-byte little-endian flag field to indicate features that require special parsing of the footer. | ||
Readers MUST raise an error if there is an unrecognized flag. Current flags: | ||
|
||
* 0x01 - Footer encryption enabled (when set, the encryption information is written before the | ||
FileMetadata structure as in the PAR1 footer). | ||
* 0x02 - CRC32 of the serialized FileMetadata is present. | ||
[Review comment on lines +128 to +133] I don't find this flag particularly elegant. There is already a magic number. We can employ the same mechanism as PAR1/PARE had to distinguish an encrypted footer.
[Reply] I'm in favor of the feature flags and it in turn provides unified …
[Reply] So for context, I went from 8 bits to 64 bits and I think 32 is a reasonable default. The intent of this particular bitmap is to indicate to consumers that there is a backwards incompatible change. I think if we had this originally we wouldn't have used a different magic number. Forward looking, I imagine things that might go here if we chose to pursue them: … I would hope these would be relatively rare, and the last flag of this bitmap can always be reserved to indicate yet another bitmap. @wgtmac what use-cases were you thinking of?
[Reply] I was thinking to leverage the features flag to continue the effort in #164 which defines core features that the parquet file implements.
[Reply] What are the benefits of feature flags in this place in the file? We have …
[Reply] @wgtmac yeah, I was imagining this for much fewer use-cases, and for features that readers can detect as they read that they don't understand, I think it is fine for that to happen lazily.
[Reply] @alkis at the very least compression. If we switch to flatbuffers I believe they compress quite well (a lot of extra padding in integers)? Would we then have a few more magic footers for the cross product of compression and encryption? Again, I think there are only a handful of imagined use-cases for this, which is why I had it as a single byte originally; IMO it is a small cost to pay for some potential flexibility, and it is useful at least for encryption.
||
|
||
- 4-byte magic number "PAR3" | ||
|
||
When parsing the footer, implementations SHOULD read at least the last 12 bytes of the footer, then | ||
read in the entirety of the footer based on the length of all preceding elements. This prevents further | ||
I/O cost for accessing metadata stored in the data pages. PAR3 footers can fully replace PAR1 footers. | ||
If a file is written with only a PAR3 footer, implementations MUST write PAR3 as the first four bytes of | ||
the file. PAR3 footers can also be written in a backwards compatible way after the PAR1 metadata | ||
(see next section for details). | ||
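For illustration, here is a minimal sketch (not part of the proposal) of how a reader might follow this procedure. The helper name, the CRC byte order, and the error handling are assumptions; handling of the encryption flag (0x01) is omitted.

```python
import struct
import zlib

KNOWN_FLAGS = 0x01 | 0x02  # 0x01: footer encryption, 0x02: FileMetadata CRC32 present

def read_par3_footer(f, file_size):
    # Read the fixed 12-byte tail: length (LE), flags (LE), magic "PAR3".
    f.seek(file_size - 12)
    length, flags, magic = struct.unpack("<II4s", f.read(12))
    if magic != b"PAR3":
        raise ValueError("not a PAR3 footer")
    if flags & ~KNOWN_FLAGS:
        raise ValueError(f"unrecognized PAR3 footer flags: {flags:#x}")

    # `length` covers everything before the 12-byte tail: the serialized
    # FileMetadata plus the optional CRC32.
    f.seek(file_size - 12 - length)
    body = f.read(length)
    if flags & 0x02:
        metadata_bytes, (crc,) = body[:-4], struct.unpack("<I", body[-4:])
        # Assumes a little-endian CRC32, analogous to page-level CRCs.
        if zlib.crc32(metadata_bytes) != crc:
            raise ValueError("FileMetadata CRC mismatch")
    else:
        metadata_bytes = body
    # Thrift deserialization of FileMetadata is left to the caller.
    return metadata_bytes, flags
```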
|
||
#### Dual Mode PAR1 and PAR3 footers | ||
|
||
This section defines a layout that allows PAR1 | ||
and PAR3 footers to co-exist in a single logical footer | ||
while still allowing legacy readers to read such files. | ||
|
||
The layout consists of the following: | ||
- Serialized PAR1 FileMetadata Thrift object | ||
- PAR3 footer as described above | ||
- 4 byte little-endian length in bytes of all | ||
preceding elements. | ||
- 4-byte magic number "PAR1" | ||
|
||
Readers aware of PAR3 can check for the "PAR3" magic number | ||
beginning 12 bytes from the end of the file (this should | ||
be unambiguous because Thrift serialization of structs | ||
uses 0x00 as a field-end delimiter). | ||
(TODO: decide whether one of the alternatives of embedding | ||
the footer as an unknown field in FileMetadata is desirable, as discussed in [Alkis's doc](https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit)) | ||
|
||
When embedded into a PAR1 file, no modification to the magic number at the beginning of the file is mandated. | ||
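As a sketch of the detection logic implied by the layout above (the function name and return values are illustrative), a reader can classify the footer by reading the last 12 bytes once:

```python
import struct

def detect_footer_kind(f, file_size):
    # The last 12 bytes are enough to tell the three layouts apart.
    f.seek(file_size - 12)
    tail = f.read(12)
    if tail[-4:] == b"PAR3":
        return "par3"        # PAR3-only footer: tail is length | flags | "PAR3"
    if tail[-4:] == b"PAR1" and tail[:4] == b"PAR3":
        # Dual-mode footer: the PAR3 footer (ending in "PAR3") sits just before
        # the legacy 4-byte length and "PAR1" magic, i.e. it ends 8 bytes before EOF.
        return "par1+par3"
    if tail[-4:] == b"PAR1":
        return "par1"        # legacy footer only
    raise ValueError("not a Parquet file")
```

In the dual-mode case, a PAR3-aware reader can then parse the PAR3 footer exactly as in the earlier sketch, just with the end offset shifted back by 8 bytes.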
|
||
## Metadata | ||
There are three types of metadata: file metadata, column (chunk) metadata and page | ||
header metadata. All thrift structures are serialized using the TCompactProtocol. | ||
|
|
@@ -770,17 +770,31 @@ struct PageEncodingStats { | |
|
||
/** | ||
* Description for column metadata | ||
* Next-Id: 20 | ||
*/ | ||
struct ColumnMetaData { | ||
/** Type of this column **/ | ||
1: required Type type | ||
/** Type of this column | ||
* | ||
* Available from schema via efficient lookup with schema_index. | ||
* | ||
* PAR1: Required. | ||
* PAR3: Don't populate. | ||
**/ | ||
1: optional Type type | ||
|
||
/** Set of all encodings used for this column. The purpose is to validate | ||
* whether we can decode those pages. **/ | ||
2: required list<Encoding> encodings | ||
* whether we can decode those pages. | ||
* | ||
* PAR1: Required. | ||
* PAR3: Don't populate; redundant with per-page encoding stats. | ||
**/ | ||
2: optional list<Encoding> encodings | ||
|
||
/** Path in schema **/ | ||
3: required list<string> path_in_schema | ||
/** Path in schema | ||
* PAR1 Footer: Required. | ||
* PAR3 Footer: Deprecated (don't populate). Can be inferred from schema element. | ||
*/ | ||
3: optional list<string> path_in_schema | ||
|
||
/** Compression codec **/ | ||
4: required CompressionCodec codec | ||
|
@@ -792,12 +806,23 @@ struct ColumnMetaData { | |
6: required i64 total_uncompressed_size | ||
[Review comment] Can we make … ? This doesn't matter much for Thrift, but if we are happy with such a change, it makes a difference for other encodings like flatbuffers. In addition, num_values can be optional and if left unset it can inherit …
[Reply] No, we've had bugs in the past due to i32 overflow in various implementations and incompatibilities with Arrow's i32 offsets because the data stored is larger. I don't recall which of these fields had issues exactly, but based on that it would indicate that there are in fact some users that overflow at least signed representations, so even unsigned int32 seems like a potential risk. I agree this is a reasonable optimization.
[Review comment] I would think …
||
|
||
/** total byte size of all compressed, and potentially encrypted, pages | ||
* in this column chunk (including the headers) **/ | ||
* in this column chunk (including the headers) | ||
* | ||
* Fetching the range of min(dictionary_page_offset, data_page_offset) | ||
* + total_compressed_size should fetch all data in the given column | ||
* chunk. | ||
*/ | ||
7: required i64 total_compressed_size | ||
|
||
/** Optional key/value metadata **/ | ||
/** Optional key/value metadata | ||
* PAR1: Optional. | ||
* PAR3: Don't write; use key_value_metadata_page instead. | ||
[Review comment] Typo?
[Reply] Yes, will fix.
||
**/ | ||
8: optional list<KeyValue> key_value_metadata | ||
|
||
/** See description on FileMetaData.key_value_metadata **/ | ||
19: optional MetadataPage key_value_metadata_page | ||
[Review comment] This is unlikely to make a huge difference. In case it does, should we consider this also:

    struct KeyVals {
      1: optional list<string> keys;
      2: optional list<binary> vals;
    }

[Reply] Maybe in the weeds, but how would one indicate a null value for a key and distinguish it from an empty string? (value is optional in …)
||
|
||
/** Byte offset from beginning of file to first data page **/ | ||
9: required i64 data_page_offset | ||
[Review comment] This is almost always the same as the start of the column chunk. We should make this optional and imply it is the same as …
[Reply] If we indeed put the ColumnMetaData in front of the chunk, then this would no longer be true.
[Reply] Makes sense.
||
|
||
|
@@ -812,8 +837,20 @@ struct ColumnMetaData { | |
|
||
/** Set of all encodings used for pages in this column chunk. | ||
* This information can be used to determine if all data pages are | ||
* dictionary encoded for example **/ | ||
* dictionary encoded for example | ||
* | ||
* PAR1: Optional. May be deprecated in a future release in favor of | ||
* serialized_encoding_stats. | ||
* PAR3: Don't populate. Write serialized_page_encoding_stats. | ||
**/ | ||
13: optional list<PageEncodingStats> encoding_stats; | ||
/** | ||
* Serialized page encoding stats. | ||
* | ||
* PAR1: Start populating after encoding_stats is deprecated. | ||
* PAR3: Populate instead of encoding_stats. | ||
*/ | ||
17: optional binary serialized_encoding_stats | ||
[Review comment] I think it makes sense to adopt @alkis's idea of using a boolean field here to indicate whether dictionary encoding is applied to all data pages.
[Reply] Sorry, I just saw line 943. I meant field 17 is not required any more.
[Reply] But I just got this implemented 😭 😉. Joking aside, is there any other use for this field beyond optimizing dictionary-based queries? Edit: looking above, encodings is now deprecated, so this field replaces that and serves the purpose of @alkis's boolean field. So if we adopt the latter, encodings will need to remain required, I think.
[Reply] Yes, I think it serves three purposes: … I'm OK removing it completely, but my concern around #2 is that spec changes take a while to propagate; having the information for engines to determine what optimizations they want to do without requiring a new flag is, I think, useful.
[Reply] Can we drop …
[Reply] If …
[Reply] Yes, I thought I had marked encodings as dropped (as the information here duplicates it at the very least); if not I'll update it. I'm also OK with readers discovering unreadable pages lazily.
[Reply] @jhorstmann this seems plausible; however, I still think there is use in this data from a debuggability standpoint. To elaborate on #3, I shouldn't have just limited it to engines. Being able to dump encodings and how many there are quickly allows users to understand a little better how their data is being compressed and lets them potentially fine-tune these aspects. It isn't a frequent occurrence, but there are certainly cases when having a little bit more data helps with debuggability.
||
|
||
/** Byte offset from beginning of file to Bloom filter data. **/ | ||
14: optional i64 bloom_filter_offset; | ||
|
@@ -831,8 +868,13 @@ struct ColumnMetaData { | |
* representations. The histograms contained in these statistics can | ||
* also be useful in some cases for more fine-grained nullability/list length | ||
* filter pushdown. | ||
* | ||
* PAR1: Optional. | ||
* PAR3: Populate serialized_size_statistics. | ||
*/ | ||
16: optional SizeStatistics size_statistics; | ||
/** Thrift serialized SizeStatistics **/ | ||
18: optional binary serialized_size_statistics; | ||
} | ||
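As an illustration of the fetch-range rule noted in the total_compressed_size comment above, a reader might compute the byte range to request for a whole column chunk roughly as follows; the object shape is assumed, with attribute names mirroring the ColumnMetaData fields.

```python
def column_chunk_byte_range(column_meta):
    # Start at the first page of the chunk: the dictionary page if present,
    # otherwise the first data page.
    start = column_meta.data_page_offset
    if column_meta.dictionary_page_offset is not None:
        start = min(start, column_meta.dictionary_page_offset)
    # total_compressed_size covers all pages (including their headers), so one
    # contiguous read of [start, end) fetches the entire column chunk.
    end = start + column_meta.total_compressed_size
    return start, end
```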
|
||
struct EncryptionWithFooterKey { | ||
|
@@ -854,6 +896,9 @@ union ColumnCryptoMetaData { | |
struct ColumnChunk { | ||
/** File where column data is stored. If not set, assumed to be same file as | ||
* metadata. This path is relative to the current file. | ||
* | ||
* DEPRECATED. The one known use-case for this is metadata cache files. | ||
* These have been superseded by open source table formats; prefer those. | ||
**/ | ||
1: optional string file_path | ||
|
||
|
@@ -883,13 +928,42 @@ struct ColumnChunk { | |
|
||
/** Encrypted column metadata for this chunk **/ | ||
9: optional binary encrypted_column_metadata | ||
/** | ||
* The column order for this chunk. | ||
* | ||
* If not set readers should check FileMetadata.column_orders | ||
* instead. | ||
* | ||
* Populated in both PAR1 and PAR3 | ||
*/ | ||
10: optional ColumnOrder column_order | ||
[Review comment] I'm afraid this might complicate the page index because we have to check consistency of ColumnOrder across row groups. At the moment we have only one ColumnOrder, which is not a big issue. It may be a problem if we introduce more orders in the future. This order is important to guide us on how to interpret serialized min_value/max_values in the statistics. Perhaps we can put this in the …
[Reply] Schema element sounds OK to me; we can also leave as is. I think I copied this from #242.
[Reply] I'll move.
[Reply] I think maybe just keeping as is makes sense actually, so I'll remove this field and we can keep the list or move that to a DATA_PAGE …
[Reply] Do we have examples of other orders that might be useful?
[Reply] Different collations are a not-uncommon idea in databases (e.g. case-insensitive comparison). One could argue about whether this should be modeled as a different logical type, a requirement on clients on how they store the data, or using this concept, but it is not unreasonable to model it in this way?
||
/** Set to true if all pages in the column chunk are dictionary | ||
* encoded | ||
*/ | ||
11: optional bool all_pages_dictionary_encoded | ||
/** | ||
* The index to the SchemaElement in FileMetadata for this | ||
* column. | ||
*/ | ||
12: optional i32 schema_index | ||
[Review comment on lines +944 to +948] What use case is there for having column chunks be out of order? Wouldn't it be easier to just specify that the column chunks have to be in the same order as the schema?
[Reply] Also note that without the order being the same as the schema, the whole "random access" idea of this PR goes out of the window. It doesn't help me to have a MetadataPage with random access if I don't know which column I need to access in the first place. Instead, what I want is that if I want to access the third column in the schema, then I need to access the third column chunk.
[Reply] As discussed on the mailing list, how about this alternative design: how about just turning things around? Instead of having a schema_index in the ColumnMetadata, we could have a column_metadata_index in the schema. If that index is missing/-1, then this signifies that the column is empty, so no metadata will be present for it. With this, we would get the best of both worlds: we would always have O(1) random I/O even in case of such empty columns (as we would use the column_metadata_index for the lookup), and we would not need to store any ColumnMetadata for empty columns. After giving this a second thought, this also makes more sense in general. As the navigation direction is usually always from schema to metadata (not vice versa!), the schema should point us to the correct metadata instead of the metadata pointing us to the correct schema entry.
[Reply] @JFinis this would only be useful for columns which are empty throughout all row groups, right? As opposed to skipping some column chunks but not all.
||
} | ||
|
||
struct RowGroup { | ||
/** Metadata for each column chunk in this row group. | ||
* This list must have the same order as the SchemaElement list in FileMetaData. | ||
* | ||
* PAR1: Required | ||
* PAR3: Not populated. Use columns_page. | ||
**/ | ||
1: required list<ColumnChunk> columns | ||
1: optional list<ColumnChunk> columns | ||
|
||
/** Page has BYTE_ARRAY data where each element is REQUIRED. | ||
* | ||
* Each element is a Thrift Serialized ColumnChunk | ||
* | ||
* PAR1: Don't include | ||
* PAR3: Required **/ | ||
8: optional MetadataPage columns_page | ||
|
||
/** Total byte size of all the uncompressed column data in this row group **/ | ||
2: required i64 total_byte_size | ||
|
@@ -1115,6 +1189,34 @@ union EncryptionAlgorithm { | |
2: AesGcmCtrV1 AES_GCM_CTR_V1 | ||
} | ||
|
||
/** | ||
* Embedded metadata page. | ||
* | ||
* A metadata page is a data page used to store metadata about | ||
* the data stored in the file. This is a key feature of PAR3 | ||
* footers which allow for deferred decoding of metadata. | ||
[Review comment on lines +1195 to +1197] So the page will have a PageHeader/DataPageHeader at the top? Will nulls or repetition be allowed, i.e. do we need definition and repetition level data? If not, then should we define a new page type instead so we don't have to encode unused level encoding types? Then we could also drop the language below about not writing statistics.
[Reply] Defining a new page type is also a reasonable approach. I was leaving this open in case we want nulls in the future. Whether nulls are allowed, and the exact structure, is dictated by its use on the field, and this also minimizes spec changes in this draft. The nice thing about this approach is it can work transparently if/when a new page type is added.
||
* | ||
* For common use cases the current recommendation is to use | ||
* an encoding that supports random access, but implementations may choose | ||
* other configuration parameters if necessary. Implementations | ||
* SHOULD consider allowing configurability per page to allow for end-users | ||
* to optimize size vs compute trade-offs that make sense for their use-case. | ||
* | ||
* Statistics for Metadata pages SHOULD NOT be written. | ||
* | ||
* Structs of this type should never be written in PAR1. | ||
*/ | ||
struct MetadataPage { | ||
// A serialized page including metadata thrift header and data. | ||
1: required binary page | ||
// Optional compression applied to the page. | ||
2: optional CompressionCodec compression | ||
[Review comment] My feeling is that most of the pages encoded in this form are going to be smallish, less than 1 KB. For such small sizes, none of the general-purpose compressors will do a good job at compressing. Are there any benchmarks where we can see the effectiveness of compressing the above pages?
[Reply] I think column data is at least 20 bytes per column? This means we get to ~1 KB at 50 columns and 4 KB at ~200, which I don't think are unreasonable. What the relative compressibility of metadata is remains an open question; I will try to measure this for wider schemas. For smaller structures that use this, I think the default would be not to compress.
[Reply] I suggest not doing this at all unless benchmarks show it is a large win. From experience with general-purpose compressors (Lempel-Ziv style), they fail to compress small entities like ints or ULEBs. Since that's the majority of stuff we will put here, I do not expect compression to be profitable.
[Reply] @alkis I'm OK either way; since compression is already a supported concept, I thought this was a low-cost effort that might have value in some use cases. As a litmus test, I'm curious what happens if you run LZ4 on the footers you have been experimenting with?
||
// Number of elements stored. This is duplicated here to help in | ||
// use-cases where knowing the total number of elements up front for | ||
// computation would be useful. | ||
3: num_values | ||
[Review comment] Missing type?
[Reply] Is it possible to add a …
[Reply] For fixed-width types PLAIN encoding gives us this data. For variable-width types, I wanted to solve this in a more general way with the RANDOM_ACCESS_BYTE_ARRAY so that users could choose this to allow for random access. In future iterations, when we tackle improving the "point lookup" use cases, I imagine that would also help or be another option.
||
} | ||
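To make the deferred-decoding idea concrete, here is a minimal sketch of iterating over the elements of a MetadataPage payload. It assumes the page data is PLAIN-encoded BYTE_ARRAY (4-byte little-endian length prefix per element) and that the page header has already been parsed and the payload decompressed; none of this is normative.

```python
import struct

def iter_metadata_elements(values_bytes, num_values):
    # Each yielded element is the raw Thrift-serialized struct (e.g. a RowGroup,
    # ColumnChunk, SchemaElement, or KeyValue). The caller deserializes only the
    # elements it actually needs and defers the rest.
    offset = 0
    for _ in range(num_values):
        (length,) = struct.unpack_from("<I", values_bytes, offset)
        offset += 4
        yield values_bytes[offset:offset + length]
        offset += length
```

A reader projecting a few columns could, for example, skip over the serialized ColumnChunk entries of every other column without ever running them through Thrift deserialization.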
|
||
/** | ||
* Description for file metadata | ||
*/ | ||
|
@@ -1127,18 +1229,48 @@ struct FileMetaData { | |
* are flattened to a list by doing a depth-first traversal. | ||
* The column metadata contains the path in the schema for that column which can be | ||
* used to map columns to nodes in the schema. | ||
* The first element is the root **/ | ||
2: required list<SchemaElement> schema; | ||
* The first element is the root | ||
* | ||
* PAR1: Required | ||
* PAR3: Use schema_page | ||
**/ | ||
2: optional list<SchemaElement> schema; | ||
|
||
/** Page has BYTE_ARRAY data where each element is REQUIRED. | ||
* | ||
* Each element is a serialized SchemaElement. The order and content should | ||
* have a one to one correspondence with schema. | ||
*/ | ||
10: optional MetadataPage schema_page; | ||
[Review comment] Why isn't this …
[Reply] My intent was to introduce an encoding that allows zero-copy random access, which I think would be better than a list, which I would guess might be slightly better. Plain encoding is effectively equivalent to a list on the wire, I believe, and this way we avoid the up-front cost of decoding the list.
[Reply] @emkornfield have you given more thought to a schema index? One use for random access is column projection, but how does one figure out column indexes for the projection without fully decoding the schema?
[Reply] A little bit; I had a few ideas that I haven't had time to write down. Starting from the easiest: … In both cases I think we need to potentially make a decision if this is 1 to 1 or 1 to many (e.g. I've seen parquet files with duplicate column names that differ only by whether they are upper case or lower case).
[Reply] Benchmarks can guide this. From my experiments decoding …
||
|
||
/** Number of rows in this file **/ | ||
3: required i64 num_rows | ||
|
||
/** Row groups in this file **/ | ||
4: required list<RowGroup> row_groups | ||
/** Row groups in this file | ||
* | ||
* PAR1: Required | ||
* PAR3: Use row_groups_page | ||
**/ | ||
4: optional list<RowGroup> row_groups | ||
/** Page has BYTE_ARRAY data where each element is REQUIRED. | ||
* | ||
* Each element is a thrift serialized RowGroup. | ||
*/ | ||
10: optional MetadataPage row_groups_page | ||
|
||
/** Optional key/value metadata **/ | ||
/** Optional key/value metadata | ||
* | ||
* PAR1: optional | ||
* PAR3: Use key_value_metadata_page | ||
**/ | ||
5: optional list<KeyValue> key_value_metadata | ||
|
||
/** Page has BYTE_ARRAY data where each element is REQUIRED. | ||
* | ||
* Each element in the page is a serialized KeyValue struct. | ||
*/ | ||
13: optional MetadataPage key_value_metadata_page | ||
|
||
/** String for application that wrote this file. This should be in the format | ||
* <Application> version <App Version> (build <App Build Hash>). | ||
* e.g. impala version 1.0 (build 6cf94d29b2b7115df4de2c06e2ab4326d721eb55) | ||
|
@@ -1160,6 +1292,10 @@ struct FileMetaData { | |
* | ||
* The obsolete min and max fields in the Statistics object are always sorted | ||
* by signed comparison regardless of column_orders. | ||
* | ||
* PAR1: Optional, may be deprecated in the future in favor of | ||
* ColumnChunk.column_order | ||
* PAR3: Not written; use ColumnChunk.column_order. | ||
*/ | ||
7: optional list<ColumnOrder> column_orders; | ||
|
||
|