PARQUET-2249: Add nan_count to handle NaNs in statistics #196

31 changes: 19 additions & 12 deletions README.md
@@ -163,18 +163,25 @@ following rules:
[Thrift definition](src/main/thrift/parquet.thrift) in the
`ColumnOrder` union. They are summarized here but the Thrift definition
is considered authoritative:
* NaNs should not be written to min or max statistics fields.
* If the computed max value is zero (whether negative or positive),
`+0.0` should be written into the max statistics field.
* If the computed min value is zero (whether negative or positive),
`-0.0` should be written into the min statistics field.

For backwards compatibility when reading files:
* If the min is a NaN, it should be ignored.
* If the max is a NaN, it should be ignored.
* If the min is +0, the row group may contain -0 values as well.
* If the max is -0, the row group may contain +0 values as well.
* When looking for NaN values, min and max should be ignored.
* The following compatibility rules should be applied when reading statistics:
* If the nan_count field is set to > 0 and both min and max are
Member commented:

Seems it's a little strict here? Just ignore min-max seems ok?

@JFinis (Contributor, Author) commented on Mar 23, 2023:


@mapleFU To your general comment (I can't answer there)

> The skeleton LGTM. But I wonder why, if it has min/max/nan_count, it can decide NaN by min-max. Can we just decide it by null_count + nan_count == num_values?

The problem is that the ColumnIndex does not have the num_values field, so using this computation to derive whether there are only NaNs would only be applicable to Statistics, not to the column index. Of course, we could do what I suggested in alternatives and give the column index a num_values list. Then this would indeed work everywhere but at the cost of an additional list.

So I see we have the following options:

  • Do what I did here, i.e., use min/max to determine whether there are only NaNs
  • Add a num_values list to the ColumnIndex
  • Accept the fact that the column index cannot detect only-NaN pages (might lead to fishy semantics)
  • Tell readers to use the min==max==NaN reasoning only in the column index, and use the null_count + nan_count == num_values for the statistics.

Which one would you suggest here?
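
For the Statistics case, the `null_count + nan_count == num_values` check mentioned above can be sketched as follows. This is a minimal illustration, not part of the PR, and the function name is hypothetical:

```python
# Hypothetical helper illustrating the check discussed above: with
# row-group Statistics, num_values is known, so an only-NaN column chunk
# can be detected without consulting min/max.
def only_nans(num_values: int, null_count: int, nan_count: int) -> bool:
    """True if every non-NULL value in the column chunk is NaN."""
    return null_count + nan_count == num_values
```

The ColumnIndex carries no num_values list, so the same check cannot be applied per page, which is exactly the trade-off weighed in the options above.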

Contributor Author (@JFinis) commented:


To this suggestion:

> Seems it's a little strict here? Just ignore min-max seems ok?

Note that the line you mentioned here just tells a reader that they can rely on this information, and therefore could, e.g., skip this page if a predicate like x = 12.34 was used. They can of course also opt to ignore this information and not skip, but rather scan the page. If we removed this, a reader couldn't do the skip here.

I guess this is related to your general suggestion: How do we detect only-NaN pages? Depending on what we do for that, this line will be adapted accordingly.

Contributor Author (@JFinis) commented:


TBH: I would actually love to have a num_values list in the column index. We have the same in the statistics, Iceberg does the same, and not needing min=max=NaN for only-NaN checking would actually be much more elegant IMHO.

I just didn't want to suggest adding another list to each column index because of the added space cost. However, given that these indexes are negligibly small in comparison to the data, I think actually no one would mind the extra space. If the consensus is that this is preferable, I'm happy to adapt the commit to that.

Member commented:


I got it. I think using both min and max is backward-compatible and can represent "all data is NaN". https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L944 — can we introduce a status like that?

@JFinis (Contributor, Author) commented on Mar 23, 2023:


@mapleFU Yes, we could also add a nan_pages bool list in the column index. That would work as well.

My gut feeling is that one day having a value_counts list would be more useful than boolean lists. We already have null_pages and null_counts, and we would then also have nan_pages and nan_counts; both null_pages and nan_pages would be obsolete if there were value_counts. Yes, storing one integer (value_counts) likely takes more space than storing two booleans (null_pages & nan_pages), but knowing the number of values in a page could also be helpful for other purposes.

But yes, we could drop the testing of min=max=NaN if we had a nan_pages list in the column index.

Note though that if we then also stop writing NaN into the min/max here and rather write nothing, then we would be, strictly speaking, not backward compatible, as legacy readers might assume that any min/max for a page in the column index that has null_pages == false holds a valid bound value, which would no longer be the case for an only-NaN page. I'm not sure whether any reader does this or whether that's just a theoretical problem though.

Member commented:


Yes, maybe you are right. My point is that if we write nan_count or even a record count, the approach would work well; however, non-floating-point pages would incur some size overhead. Personally, I'd like to use list<bool>, because it's easy to implement and also lightweight. And we can hear others' ideas.

NaN, a reader can rely on the fact that all non-NULL values are NaN
* Otherwise, if the min or the max is a NaN, it should be ignored.
* When looking for NaN values, min and max should be ignored;
if the nan_count field is set, it should be used to check whether
NaNs are present.
* If the min is +0, the row group may contain -0 values as well.
* If the max is -0, the row group may contain +0 values as well.
* When writing statistics the following rules should be followed:
* The nan_count fields should always be set for FLOAT and DOUBLE columns.
* NaNs should not be written to min or max statistics fields except
when all non-NULL values are NaN, in which case min and max should
both be written as NaN. If the nan_count field is set, these semantics
are mandated and readers may rely on them.
* If the computed max value is zero (whether negative or positive),
`+0.0` should be written into the max statistics field.
* If the computed min value is zero (whether negative or positive),
`-0.0` should be written into the min statistics field.
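
Taken together, the read-side rules above could be applied by a reader roughly like this. This is an illustrative sketch for an equality predicate on a FLOAT/DOUBLE column; `can_skip_page` is a hypothetical helper, not part of parquet-format:

```python
import math

def can_skip_page(min_val, max_val, nan_count, predicate_value):
    """Decide whether a page can be skipped for `x == predicate_value`,
    following the proposed compatibility rules. nan_count may be None
    when the field is not set (legacy writers)."""
    # nan_count > 0 with min == max == NaN means all non-NULL values are NaN.
    if (nan_count is not None and nan_count > 0
            and math.isnan(min_val) and math.isnan(max_val)):
        # Only-NaN page: skip unless we are searching for NaN itself.
        return not math.isnan(predicate_value)
    # When looking for NaN, min/max are ignored; use nan_count if set.
    if math.isnan(predicate_value):
        return nan_count == 0 if nan_count is not None else False
    # Otherwise a NaN min or max must be ignored -> bounds are unusable.
    if math.isnan(min_val) or math.isnan(max_val):
        return False
    # Normal pruning against the (zero-normalized) bounds.
    return predicate_value < min_val or predicate_value > max_val
```

The sketch assumes both bounds are present; a real reader would also consult null_pages before using them.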

* BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY - Lexicographic unsigned byte-wise
comparison.
30 changes: 23 additions & 7 deletions src/main/thrift/parquet.thrift
@@ -211,7 +211,7 @@ struct Statistics {
*/
1: optional binary max;
2: optional binary min;
/** count of null value in the column */
/** count of null values in the column */
3: optional i64 null_count;
/** count of distinct values occurring */
4: optional i64 distinct_count;
@@ -223,6 +223,8 @@
*/
5: optional binary max_value;
6: optional binary min_value;
/** count of NaN values in the column; only present if type is FLOAT or DOUBLE */
7: optional i64 nan_count;
}

/** Empty structs to use as logical type annotations */
@@ -886,16 +888,23 @@ union ColumnOrder {
* FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
*
* (*) Because the sorting order is not specified properly for floating
* point values (relations vs. total ordering) the following
* compatibility rules should be applied when reading statistics:
* - If the min is a NaN, it should be ignored.
* - If the max is a NaN, it should be ignored.
* point values (relations vs. total ordering), the following compatibility
* rules should be applied when reading statistics:
* - If the nan_count field is set to > 0 and both min and max are
* NaN, a reader can rely on the fact that all non-NULL values are NaN
* - Otherwise, if the min or the max is a NaN, it should be ignored.
* - When looking for NaN values, min and max should be ignored;
* if the nan_count field is set, it can be used to check whether
* NaNs are present.
* - If the min is +0, the row group may contain -0 values as well.
* - If the max is -0, the row group may contain +0 values as well.
* - When looking for NaN values, min and max should be ignored.
*
* When writing statistics the following rules should be followed:
* - NaNs should not be written to min or max statistics fields.
* - The nan_count fields should always be set for FLOAT and DOUBLE columns.
* - NaNs should not be written to min or max statistics fields except
Member commented:


I would expect it to explicitly state that NaN values should not be written to the min or max fields in the Statistics of DataPageHeader, DataPageHeaderV2 and ColumnMetaData, but that it is suggested to write NaN to the min_values and max_values fields in the ColumnIndex, where a value has to be written in case of an only-NaN page.

Contributor Author (@JFinis) commented:


I'll update this with my next revision once we have decided on this issue.

* when all non-NULL values are NaN, in which case min and max should
* both be written as NaN. If the nan_count field is set, these semantics
* are mandated and readers may rely on them.
* - If the computed max value is zero (whether negative or positive),
* `+0.0` should be written into the max statistics field.
* - If the computed min value is zero (whether negative or positive),
@@ -952,6 +961,9 @@ struct ColumnIndex {
* Such more compact values must still be valid values within the column's
* logical type. Readers must make sure that list entries are populated before
* using them by inspecting null_pages.
* For columns of type FLOAT and DOUBLE, NaN values are not to be included
* in these bounds unless all non-null values in a page are NaN, in which
* case min and max are to be set to NaN.
*/
2: required list<binary> min_values
3: required list<binary> max_values
@@ -966,6 +978,10 @@

/** A list containing the number of null values for each page **/
5: optional list<i64> null_counts

/** A list containing the number of NaN values for each page. Only present
* for columns of type FLOAT and DOUBLE. **/
6: optional list<i64> nan_counts
}
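
As an illustration of the write-side rules behind the new nan_counts list, a per-page statistics computation might look like the following. This is a hypothetical sketch, not code from this PR:

```python
import math

def page_stats(values):
    """Compute (min, max, nan_count) for one FLOAT/DOUBLE page under the
    proposed rules. `values` may contain None to represent NULL."""
    non_null = [v for v in values if v is not None]
    nan_count = sum(1 for v in non_null if math.isnan(v))
    non_nan = [v for v in non_null if not math.isnan(v)]
    if not non_nan:
        # All non-NULL values are NaN: min/max are written as NaN.
        # (A fully-NULL page would be flagged via null_pages instead.)
        mn = mx = float("nan") if non_null else None
        return mn, mx, nan_count
    mn, mx = min(non_nan), max(non_nan)
    # Zero normalization: -0.0 goes into min, +0.0 into max.
    if mn == 0.0:
        mn = -0.0
    if mx == 0.0:
        mx = 0.0
    return mn, mx, nan_count
```

A writer would collect one such triple per page to populate min_values, max_values, and nan_counts.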

struct AesGcmV1 {