[T1] Allow fixed length encoding for min/max and deprecate encoding_stats #252
src/main/thrift/parquet.thrift
Outdated
* Only one pair of max_value/min_value, max1/min1, max2/min2, max4/min4,
* max8/min8 can be set. The pair is determined by the physical type of the
* column. Floating point values are bitcasted to integers. Variable length
* values are set in min_value/max_value.
Could you please update the docs to say that, for backwards compatibility, readers should check min_value/max_value if the non-variable-width field is not set?
I have rewritten this to be clearer.
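The fallback requested in this thread could be sketched on the reader side roughly as follows. This is only an illustration under stated assumptions: `Stats` and `read_min_max_int64` are hypothetical stand-ins for the Thrift-generated `Statistics` struct and a reader helper, the column is assumed to be INT64, and only the field names `min8`, `max8`, `min_value` and `max_value` come from the diff.

```python
import struct
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Stats:
    """Hypothetical stand-in for the Thrift-generated Statistics struct."""
    min_value: Optional[bytes] = None
    max_value: Optional[bytes] = None
    min8: Optional[int] = None
    max8: Optional[int] = None


def read_min_max_int64(stats: Stats) -> Optional[Tuple[int, int]]:
    """Prefer the fixed-width pair; fall back to the binary pair."""
    if stats.min8 is not None and stats.max8 is not None:
        return stats.min8, stats.max8
    if stats.min_value is not None and stats.max_value is not None:
        # Older files: PLAIN-encoded INT64 is 8 bytes, little-endian.
        (lo,) = struct.unpack("<q", stats.min_value)
        (hi,) = struct.unpack("<q", stats.max_value)
        return lo, hi
    # Neither pair is set, so no statistics are available.
    return None
```

A file written before this change would populate only `min_value`/`max_value`, so the second branch is what keeps such files readable.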
*/
5: optional binary max_value;
6: optional binary min_value;
/** If true, max_value is the actual maximum value for a column */
7: optional bool is_max_value_exact;
/** If true, min_value is the actual minimum value for a column */
8: optional bool is_min_value_exact;
9: optional i64 max8;
Did you intentionally elide min1/max1? They are still mentioned above.
Yes, I removed them because they provide little benefit and do not justify the added complexity. In Thrift these are encoded as ULEBs, so it makes no difference on the wire. For FlatBuffers it would make a difference, though.
@@ -810,9 +803,13 @@ struct ColumnMetaData {
/** optional statistics for this column chunk */
12: optional Statistics statistics;

/** Set of all encodings used for pages in this column chunk.
/**
* DEPRECATED: use is_fully_dict_encoded instead
I would suggest making this a separate PR; I think we'd prefer to keep the changes as small and focused as possible.
Agreed. I will create another PR for the Statistics change alone if we are OK merging that now.
…tats
1. Add `min8`/`max8` fields for fixed-length encoding of min/max for physical types of at most 8 bytes.
2. Deprecate `ColumnMetaData.encoding_stats` and replace it with a bool `ColumnMetaData.is_fully_dict_encoded`.
b2caa21 to ec13f34
* the columns ColumnOrder
* max_value/min_value: PLAIN encoded values, sans length prefix if varlen
* max8/min8: up to 8-bytes:
* FLOAT, DOUBLE: bitcasted to INT32 and INT64, respectively
I think we might want to be more specific here about how values narrower than 8 bytes are translated into 8 bytes. In practice it doesn't make a difference for readers, but it would be good to limit ambiguity. I assume we do a normal cast from 1/4-byte integer values to 8-byte values rather than just embedding them?
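To make the question concrete, here is a small sketch of the "normal cast" reading: FLOAT is bitcast to a signed 32-bit integer and then sign-extended into the i64 field, while DOUBLE is bitcast directly to a signed 64-bit integer. The PR does not confirm that sign extension is the intended rule; the helper names are hypothetical and exist only for illustration.

```python
import struct


def float_to_i64(value: float) -> int:
    """Bitcast a FLOAT to signed INT32, then widen to i64 (assumed rule)."""
    (as_i32,) = struct.unpack("<i", struct.pack("<f", value))
    # A signed 32-bit value keeps its numeric value when sign-extended to 64 bits.
    return as_i32


def double_to_i64(value: float) -> int:
    """Bitcast a DOUBLE directly to signed INT64."""
    (as_i64,) = struct.unpack("<q", struct.pack("<d", value))
    return as_i64


def i64_to_float(stored: int) -> float:
    """Reader side: narrow the stored i64 back to INT32 and bitcast to FLOAT."""
    (value,) = struct.unpack("<f", struct.pack("<i", stored))
    return value
```

Under this reading `float_to_i64(-1.0)` is a negative i64. Plain embedding (zero extension) of the same 32 bits would instead give the positive value `0xBF800000`, which is the ambiguity the doc comment would need to rule out.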
Ref: Parquet Metadata evolution