Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-2249: Add nan_count to handle NaNs in statistics #196
base: master
Are you sure you want to change the base?
PARQUET-2249: Add nan_count to handle NaNs in statistics #196
Changes from 1 commit
2f3449e
aecffd7
8d31bff
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems it's a little strict here? Just ingore min-max seems ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mapleFU To your general comment (I can't answer there)
The problem is that the ColumnIndex does not have the
num_values
field, so using this computation to derive whether there are only NaNs would only be applicable to Statistics, not to the column index. Of course, we could do what I suggested in alternatives and give the column index anum_values
list. Then this would indeed work everywhere but at the cost of an additional list.So I see we have the following options:
num_values
list to the ColumnIndexmin==max==NaN
reasoning only in the column index, and use thenull_count + nan_count == num_values
for the statistics.Which one would you suggest here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To this suggestion:
Note that the line you mentioned here just tells a reader that they can rely on this information, and therfore could, e.g., skip this page if a predicate like
x = 12.34
was used. They can of course also opt to ignore this information and not skip but rather scan the page. If we removed this, a reader couldn't do the skip here.I guess this is related to your general suggestion: How do we detect only-NaN pages? Depending on what we do for that, this line will be adapted accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH: I would actually love to have a
num_values
list in the column index. We have the same in the statistics, Iceberg does the same, and not needing min=max=NaN for only-NaN checking would actually be much more elegant IMHO.I just didn't want to suggest adding another list to each column index for the added space cost. However, given that these indexes are negligibly small in comparison to the data, I think actually no one would mind that extra space. If the consensus is that this is preferrable, I'm happy to adapt the commit to that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got it, I think using both min-max is backward-capatible and can represent "all-data-is-nan". https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L944 can we import a status like that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mapleFU Yes, we could also add a
nan_pages
bool list in the column index. That would work as well.My gut feeling is that one day having a
value_counts
count would be more useful than boolean lists. We already havenull_pages
andnull_counts
and we would then also havenan_pages
andnan_counts
, bothnull_pages
andnan_pages
would be obsolete if there werevalue_counts
. Yes, storing one integer (value_counts
) is likely more space than storing two booleans (null_pages
&nan_pages
), but knowing the number of values in a page could also be helpful for other pruposes.But yes, we could drop the testing of
min=max=NaN
if we had anan_pages
list in the column index.Note though that if we then also drop that we write
NaN
into the min/max here and rather write nothing, then we would be strictly speaking not backward compatible, as legacy readers might assume that any min/max for a page in the column index that hasnull_pages == false
has a valid bound value, which would no longer be the case for an only-NaN page. I'm not sure whether any reader does this or whether that's just a theoretical problem though.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, maybe you are right. My point is that, if we write nan_count or even record count, the program would works well. However, non-float point page would have some size-overhead. Personally, I'd like to use
list<bool>
, because it's easy to implement, and also lightweight. And we can hear others idea.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect to explicitly state that
NaN value should not be written to min or max fields in the Statistics of DataPageHeader, DataPageHeaderV2 and ColumnMetaData. But it is suggested to write NaN to min_values and max_values fields in the ColumnIndex where a value has to be written in case of a only-NaN page
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll update this with my next revision once we have decided on this issue.