[BUG] Parquet float/double statistic is wrong when float/double column contains NaN #13948
Comments
Statistics behavior in the presence of NaN values is currently in flux. See the discussion in apache/parquet-format#196. From the current parquet thrift specification:
@etseidl You are right, but that is new-version behavior. And there is also an extra field. Actually, CPU Spark 330 returns a wrong result when reading the GPU file with aggregate push down enabled:
# enable agg push down
spark.conf.set("spark.sql.parquet.aggregatePushdown", "true")
# use V2 datasource
spark.conf.set("spark.sql.sources.useV1SourceList", "")
df = spark.read.parquet("/tmp/my-parquet")
df.createOrReplaceTempView("tab")
# agg push down, directly use the max stat, it's wrong.
spark.sql("select max(v) from tab").show()
+------+
|max(v)|
+------+
| 2.0|
+------+
spark.read.parquet("/tmp/my-parquet").show() # show all values
+---+
| v|
+---+
|1.0|
|2.0|
|NaN|
+---+
# adding an additional filter disables this agg push down, which then gives the correct result.
spark.sql("select max(v) from tab where v != 1.0").show()
+------+
|max(v)|
+------+
| NaN|
+------+
In Scala, the min/max values are NaN for the values (1.0, 2.0, NaN).
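A minimal sketch of why the two queries above disagree, assuming the pushed-down aggregate simply returns the footer's max statistic while a full scan applies Spark SQL's ordering, in which NaN is greater than any other value. The function names and the `gpu_footer` dict are hypothetical, not Spark or cuDF APIs, and the footer max of 2.0 is inferred from the pushed-down result shown above.

```python
import math


def spark_like_max(values):
    """Max under an ordering where NaN is greater than every other value."""
    result = None
    for v in values:
        if result is None:
            result = v
        elif math.isnan(v) or (not math.isnan(result) and v > result):
            result = v
    return result


def pushed_down_max(footer_stats):
    """Hypothetical pushdown: trust the footer's max without scanning rows."""
    return footer_stats["max"]


values = [1.0, 2.0, float("nan")]
# Assumption: footer stats of the GPU-written file, inferred from the result above.
gpu_footer = {"min": 1.0, "max": 2.0}

print(pushed_down_max(gpu_footer))  # 2.0, taken straight from the statistics
print(spark_like_max(values))       # nan, computed from the actual rows
```

Under these assumptions, any reader that treats the footer max as the true column max will miss a NaN that the writer left out of the statistics.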
The GPU generated file: parquet-cli meta a.parquet
@res-life I mentioned the new discussion of NaN handling to say things might change at some point in the future. But given the current parquet specification, here's what it has to say about floating point ordering:
I'm no expert, but looking at issues such as https://issues.apache.org/jira/browse/PARQUET-1222 I'd contend that the current GPU behavior is not incorrect. The specification was ambiguous in the past, so different implementations had different (and, it seems, incompatible) orderings and min/max behaviors. I believe the current GPU behavior of not writing NaN is consistent with the current spec. Being different from the parquet-mr implementation is not necessarily a bug.
Thank you all for the discussion. I'll close this for now.
Describe the bug
Parquet float/double statistic is wrong when float/double column contains NaN.
For example, a double column contains 2 values [NaN, 1.0d].
The CPU org.apache.parquet.format.Statistics: min_value 1.0, max_value: NaN.
The GPU org.apache.parquet.format.Statistics: min_value 1.0, max_value: 1.0.
Steps/Code to reproduce bug
CPU:
parquet-tools inspect --detail /tmp/test-001.parquet
GPU:
parquet-tools inspect --detail TestDoubleStatistic.parquet
Expected behavior
Make float/double stat consistent with CPU.
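To make the difference concrete, here is a small sketch of the two writer-side conventions that would produce the CPU and GPU statistics reported in this issue for the column [NaN, 1.0]. Both functions are hypothetical illustrations, not code from cuDF or parquet-mr: one skips NaN when computing min/max, the other uses a total order that places NaN above every other value.

```python
import math


def stats_skipping_nan(values):
    """Min/max over non-NaN values only (matches the GPU result in this issue)."""
    non_nan = [v for v in values if not math.isnan(v)]
    if not non_nan:
        return None  # nothing usable to report
    return min(non_nan), max(non_nan)


def stats_nan_largest(values):
    """Min/max under a total order where NaN sorts above everything else
    (matches the CPU result in this issue)."""
    def key(v):
        # NaN gets (True, 0.0) so it compares greater than any (False, x).
        return (True, 0.0) if math.isnan(v) else (False, v)
    return min(values, key=key), max(values, key=key)


column = [float("nan"), 1.0]
print(stats_skipping_nan(column))  # (1.0, 1.0)
print(stats_nan_largest(column))   # (1.0, nan)
```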
Environment overview (please complete the following information)
Environment details
cuDF: branch-23.10
parquet: apache-parquet-1.12.2
Additional context
parquet-tools link
Parquet converts from org.apache.parquet.format.Statistics to org.apache.parquet.column.statistics here: https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.2/parquet-column/src/main/java/org/apache/parquet/column/statistics/Statistics.java#L123-L126
If there is a NaN, then the org.apache.parquet.column.statistics min/max are converted to 0.0/0.0 and marked as invalid.
So the GPU should make sure min/max contains a NaN.
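The conversion described above can be sketched roughly as follows. This is a Python paraphrase of the behavior as stated in this issue, not the parquet-mr source; `ColumnStats`, `from_footer`, and the `has_non_null_value` flag name are illustrative assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class ColumnStats:
    min_value: float
    max_value: float
    has_non_null_value: bool  # False means readers should not trust min/max


def from_footer(min_value, max_value):
    """Hypothetical reader-side conversion of footer statistics."""
    if math.isnan(min_value) or math.isnan(max_value):
        # A NaN in the footer: reset to 0.0/0.0 and flag min/max as unusable.
        return ColumnStats(0.0, 0.0, False)
    return ColumnStats(min_value, max_value, True)


cpu = from_footer(1.0, float("nan"))  # CPU-written footer from this issue
gpu = from_footer(1.0, 1.0)           # GPU-written footer from this issue

print(cpu)  # ColumnStats(min_value=0.0, max_value=0.0, has_non_null_value=False)
print(gpu)  # ColumnStats(min_value=1.0, max_value=1.0, has_non_null_value=True)
```

In this sketch the CPU-written footer ends up flagged as unusable, so a reader falls back to scanning, while the GPU-written footer is treated as a valid range of [1.0, 1.0] even though the column also holds a NaN.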