Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[T1] Allow fixed length encoding for min/max and deprecate encoding_stats #252

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from
Draft
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
48 changes: 26 additions & 22 deletions src/main/thrift/parquet.thrift
Original file line number Diff line number Diff line change
Expand Up @@ -243,42 +243,39 @@ struct SizeStatistics {
*/
struct Statistics {
/**
* DEPRECATED: min and max value of the column. Use min_value and max_value.
* max/min: deprecated, used to encoded signed orderable values, ignoring
* the columns ColumnOrder
* max_value/min_value: PLAIN encoded values, sans length prefix if varlen
* max8/min8: up to 8-bytes:
* FLOAT, DOUBLE: bitcasted to INT32 and INT64, respectively
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might want to be more specific here about values less then 8 bytes are translated into 8 bytes. In practice it doesn't make a difference for readers but it would be good to limit ambiguity. I assume we do a normal cast from 1/4 integer byte values to 8 bytes values rather then just embedding them?

* FIXED_LEN_BYTE_ARRAY: up to 8 bytes, little endian
*
* Values are encoded using PLAIN encoding, except that variable-length byte
* arrays do not include a length prefix.
* Only one pair of max_value/min_value or max8/min8 should be set.
*
* These fields encode min and max values determined by signed comparison
* only. New files should use the correct order for a column's logical type
* and store the values in the min_value and max_value fields.
* Min and Max are the lower and upper bound values for the column,
* respectively, as determined by its ColumnOrder.
*
* To support older readers, these may be set when the column order is
* signed.
* These may be the actual minimum and maximum values found on a page or column
* chunk, but can also be (more compact) values that do not exist on a page or
* column chunk. For example, instead of storing "Blart Versenwald III", a writer
* may set min_value="B", max_value="C". Such more compact values must still be
* valid values within the column's logical type.
*/
/* DEPRECATED: do not use */
1: optional binary max;
2: optional binary min;
/** count of null value in the column */
3: optional i64 null_count;
/** count of distinct values occurring */
4: optional i64 distinct_count;
/**
* Lower and upper bound values for the column, determined by its ColumnOrder.
*
* These may be the actual minimum and maximum values found on a page or column
* chunk, but can also be (more compact) values that do not exist on a page or
* column chunk. For example, instead of storing "Blart Versenwald III", a writer
* may set min_value="B", max_value="C". Such more compact values must still be
* valid values within the column's logical type.
*
* Values are encoded using PLAIN encoding, except that variable-length byte
* arrays do not include a length prefix.
*/
5: optional binary max_value;
6: optional binary min_value;
/** If true, max_value is the actual maximum value for a column */
7: optional bool is_max_value_exact;
/** If true, min_value is the actual minimum value for a column */
8: optional bool is_min_value_exact;
9: optional i64 max8;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you intentionally elide min1/max1? (they are still mentioned above).

Copy link
Contributor Author

@alkis alkis May 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I removed them because they provide little benefit and do not justify the added complexity. This is because in thrift these are ulebs so it makes no difference in the wire. For flatbuffers this would make a difference though.

10: optional i64 min8;
}

/** Empty structs to use as logical type annotations */
Expand Down Expand Up @@ -810,9 +807,13 @@ struct ColumnMetaData {
/** optional statistics for this column chunk */
12: optional Statistics statistics;

/** Set of all encodings used for pages in this column chunk.
/**
* DEPRECATED: use is_fully_dict_encoded instead
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest making this a separate PR, I think we'd prefer to keep the changes as small and focused as possible?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I will create another PR for the Statistics change alone if we are OK merging that now.

*
* Set of all encodings used for pages in this column chunk.
* This information can be used to determine if all data pages are
* dictionary encoded for example **/
* dictionary encoded for example
*/
13: optional list<PageEncodingStats> encoding_stats;

/** Byte offset from beginning of file to Bloom filter data. **/
Expand All @@ -833,6 +834,9 @@ struct ColumnMetaData {
* filter pushdown.
*/
16: optional SizeStatistics size_statistics;

/** If true, all data pages are dictionary encoded **/
17: optional bool is_fully_dict_encoded;
}

struct EncryptionWithFooterKey {
Expand Down