@mapleFU That's right, that's being introduced in apache/parquet-format#196. It is already in the Iceberg spec (and has been implemented many times); the field is called nan_value_counts.
@Fokko Do you have any comments or suggestions on the proposal you mentioned above? It would be great if it could align with the requirements of Apache Iceberg.
Describe the enhancement requested
Iceberg relies on statistics (called Metrics in Iceberg) to speed up queries. Most of the metrics are available and can be easily extracted using the MetadataCollector, except for the NaN counts. If someone uses an isNaN expression on a FLOAT/DOUBLE field, Iceberg tries to skip Parquet files by looking at the metrics it has stored in the manifest files. It would be awesome if a nan_count could be added next to null_count.

In addition to this, Parquet itself is also looking into this: apache/parquet-format#196
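To illustrate why the NaN count matters for file skipping, here is a minimal sketch in plain Python. It is not Iceberg's or PyArrow's actual code: `ColumnMetrics`, `collect_metrics`, and `can_skip_for_is_nan` are hypothetical names made up for this example. It only models the idea that a file can be skipped for an isNaN filter when its recorded NaN count is zero, and must be read when the count is missing.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ColumnMetrics:
    """Hypothetical per-file metrics for one FLOAT/DOUBLE column."""
    value_count: int          # all values, including nulls
    null_count: int
    nan_count: Optional[int]  # None means the writer did not record it


def collect_metrics(values: Sequence[Optional[float]],
                    with_nan_count: bool = True) -> ColumnMetrics:
    """Compute the metrics a writer could store alongside a data file."""
    null_count = sum(1 for v in values if v is None)
    nan_count = sum(1 for v in values if v is not None and math.isnan(v))
    return ColumnMetrics(
        value_count=len(values),
        null_count=null_count,
        nan_count=nan_count if with_nan_count else None,
    )


def can_skip_for_is_nan(metrics: ColumnMetrics) -> bool:
    """Return True only if the file provably contains no NaN values."""
    # An unknown count (None) never compares equal to 0, so the file is kept.
    return metrics.nan_count == 0


no_nans = collect_metrics([1.0, None, 2.5])
some_nans = collect_metrics([1.0, float("nan")])
unknown = collect_metrics([1.0], with_nan_count=False)

print(can_skip_for_is_nan(no_nans))    # file can be skipped
print(can_skip_for_is_nan(some_nans))  # file must be read
print(can_skip_for_is_nan(unknown))    # no NaN count recorded, file must be read
```

The last case is the situation described above: without a NaN count in the statistics, the reader has to assume the file may contain NaN values and cannot prune it.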
Component(s)
Python