
Add test cases for Parquet statistics [databricks] #9090

Merged: 13 commits into NVIDIA:branch-23.10 on Sep 18, 2023

Conversation

res-life (Collaborator)

closes #8762

Add test case for Parquet statistics

@res-life (Collaborator, Author)

@revans2

3 questions about this PR:

  • The row counts per row group are (1,000,000, 48,576) in the cuDF-written Parquet file when writing 1M rows, while the Spark-generated Parquet file has a single row group of (1,048,576). Is this an issue? Even when the Parquet file contains only one boolean column, cuDF always splits the row group at 1,000,000 rows.
  • cuDF writes timestamps as int96, while Spark writes timestamps as int64:
optional int64 c10 (TIMESTAMP(MICROS,true));  // Spark
optional int96 c10; // cuDF 

This seems not to be an issue, because Spark reads an INT96 column back as a timestamp column.

  • We cannot compare the encodings for the ColumnChunk, e.g.:
    The encodings differ for an int32 column:
    encoding set (BIT_PACKED, PLAIN, RLE) on the CPU
    vs.
    encoding set (RLE, PLAIN) on the GPU

res-life requested a review from revans2 on August 22, 2023 at 13:36
@revans2 (Collaborator) commented Aug 22, 2023

* The row counts per row group are (1,000,000, 48,576) in the cuDF-written Parquet file when writing 1M rows, while the Spark-generated Parquet file has a single row group of (1,048,576). Is this an issue? Even when the Parquet file contains only one boolean column, cuDF always splits the row group at 1,000,000 rows.

The CPU splits up row groups based on the size of the data in the group after compression (128 MiB by default). The GPU splits things up by number of rows, or by the size of the batch passed in, whichever is smaller. The number of rows is configurable, but we are never going to match the CPU exactly.
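(For reference, a minimal sketch of steering the CPU-side row-group size in a test; it assumes only the standard Parquet Hadoop key "parquet.block.size", not any plugin-specific config, and the 64 MiB value and output path are illustrative.)

  import org.apache.spark.sql.SparkSession

  // Illustrative sketch: the CPU Parquet writer splits row groups by compressed
  // byte size (128 MiB by default), driven by the Hadoop key "parquet.block.size".
  val spark = SparkSession.builder().appName("rowgroup-sketch").getOrCreate()
  // Ask for smaller (64 MiB) row groups from the CPU writer.
  spark.sparkContext.hadoopConfiguration.set("parquet.block.size", (64 * 1024 * 1024).toString)
  spark.range(1048576L).toDF("c0")
    .write
    .mode("overwrite")
    .parquet("/tmp/rowgroup-sketch")   // hypothetical output path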

* cuDF writes timestamps as int96, while Spark writes timestamps as int64:
optional int64 c10 (TIMESTAMP(MICROS,true));  // Spark
optional int96 c10; // cuDF 

This seems not to be an issue, because Spark reads an INT96 column back as a timestamp column.

We should be matching Spark for this. Spark has a config that controls whether it writes the data out as int96, for backwards compatibility, or as int64. I agree that it is not a huge problem, but we need to file an issue with the plugin and dig down to understand how and where we are messing this up. We should have tests around all of the various configs with int96 vs not.
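(A minimal sketch of the kind of config coverage being suggested; only the documented Spark config spark.sql.parquet.outputTimestampType is assumed, and the output paths are hypothetical.)

  import java.sql.Timestamp
  import org.apache.spark.sql.SparkSession

  // Sketch: write the same timestamp column under each supported output type so a
  // test can check whether the footer ends up with an int96 or int64 physical type.
  val spark = SparkSession.builder().appName("ts-output-type-sketch").getOrCreate()
  import spark.implicits._

  val df = Seq(Timestamp.valueOf("2023-08-22 13:36:00")).toDF("c10")
  Seq("INT96", "TIMESTAMP_MICROS", "TIMESTAMP_MILLIS").foreach { tsType =>
    spark.conf.set("spark.sql.parquet.outputTimestampType", tsType)
    val path = s"/tmp/ts-output-type-sketch/$tsType"   // hypothetical path
    df.write.mode("overwrite").parquet(path)
    // A real test would also open the footer and assert the physical type here.
    assert(spark.read.parquet(path).count() == 1)
  }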

* We cannot compare the encodings for the `ColumnChunk`, e.g.:
  The encodings differ for an int32 column:
  encoding set (BIT_PACKED, PLAIN, RLE) on the CPU
  vs.
  encoding set (RLE, PLAIN) on the GPU

Yes, that is expected. The CPU and the GPU will likely produce slightly different encodings. The things I think we care about are that the size of the data we encode is not too much larger than the size of the data that the CPU encodes (although this can be a follow-on issue), and that we are not including encodings that are for the wrong version of Parquet.

There are two versions of Parquet (v1 and v2). V2 not only added some things to the footer, it also enabled new encoding formats. In versions of Spark that support Parquet V2 (Spark 3.3.0 and above), the default is to output encodings that are version-1 compatible, but setting "parquet.writer.version" to "v2" in the Parquet configs for the writer should enable the V2 encodings.

We really should be checking that we do not use the V2 encodings unless that config is set, and we are in Spark 3.3.0 or later.
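(A minimal sketch of how a test could collect per-column-chunk encodings from the footer via the parquet-mr API; the set of encodings treated as V2-only below is an assumption for illustration, not something stated in this thread.)

  import scala.collection.JavaConverters._
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.Path
  import org.apache.parquet.column.Encoding
  import org.apache.parquet.hadoop.ParquetFileReader
  import org.apache.parquet.hadoop.util.HadoopInputFile

  // Collect every encoding used by every column chunk of one Parquet file.
  def columnChunkEncodings(file: Path, conf: Configuration): Set[Encoding] = {
    val reader = ParquetFileReader.open(HadoopInputFile.fromPath(file, conf))
    try {
      reader.getFooter.getBlocks.asScala
        .flatMap(_.getColumns.asScala)
        .flatMap(_.getEncodings.asScala)
        .toSet
    } finally {
      reader.close()
    }
  }

  // Assumed V2-only encodings to flag when "parquet.writer.version" was not "v2".
  val v2OnlyEncodings: Set[Encoding] = Set(
    Encoding.DELTA_BINARY_PACKED,
    Encoding.DELTA_LENGTH_BYTE_ARRAY,
    Encoding.DELTA_BYTE_ARRAY,
    Encoding.RLE_DICTIONARY)

  // Example assertion for a file written with v1 settings:
  // assert(columnChunkEncodings(new Path("/tmp/some.parquet"), new Configuration())
  //   .intersect(v2OnlyEncodings).isEmpty)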

@res-life (Collaborator, Author)

cuDF writes timestamps as int96, while Spark writes timestamps as int64:

It's because of this code: https://github.com/NVIDIA/spark-rapids/blob/branch-23.10/tests/src/test/scala/com/nvidia/spark/rapids/SparkQueryCompareTestSuite.scala#L201-L207

  def withCpuSparkSession[U](f: SparkSession => U, conf: SparkConf = new SparkConf()): U = {
    val c = conf.clone()
      .set(RapidsConf.SQL_ENABLED.key, "false") // Just to be sure
      // temp work around to unsupported timestamp type
      .set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
    withSparkSession(c, f)
  }

I removed the spark.sql.parquet.outputTimestampType config.

@res-life (Collaborator, Author)

build

Signed-off-by: Chong Gao <[email protected]>
@res-life (Collaborator, Author)

build

@res-life (Collaborator, Author)

build

res-life changed the title from "[WIP] Add test case for Parquet statistics" to "[WIP] Add test case for Parquet statistics [databricks]" on Aug 30, 2023
@res-life (Collaborator, Author)

build

res-life changed the title from "[WIP] Add test case for Parquet statistics [databricks]" to "[WIP] Add test case for Parquet statistics" on Aug 30, 2023
@res-life (Collaborator, Author)

build

@github-actions

👎 Promotion blocked, new vulnerability found

Vulnerability report

Component: Jython
Vulnerability: CVE-2013-2027
Description: Jython 2.2.1 uses the current umask to set the privileges of the class cache files, which allows local users to bypass intended access restrictions via unspecified vectors.
Severity: MEDIUM

@res-life (Collaborator, Author)

build

@@ -201,8 +201,6 @@ trait SparkQueryCompareTestSuite extends AnyFunSuite with BeforeAndAfterAll {
def withCpuSparkSession[U](f: SparkSession => U, conf: SparkConf = new SparkConf()): U = {
val c = conf.clone()
.set(RapidsConf.SQL_ENABLED.key, "false") // Just to be sure
// temp work around to unsupported timestamp type
.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
Collaborator Author

Fixes: cuDF writes timestamps as int96, while Spark writes timestamps as int64.

Collaborator

Looks like I added that 3 years ago, and somehow the GPU version got cleaned up but the CPU version didn't. Thanks for fixing this.

res-life changed the title from "[WIP] Add test case for Parquet statistics" to "[WIP] Add test cases for Parquet statistics" on Aug 31, 2023
@res-life (Collaborator, Author)

Failed at:

[2023-08-31T02:28:23.226Z] FileCacheIntegrationSuite:
[2023-08-31T02:28:55.252Z] - filecache metrics v1 Parquet *** FAILED ***
[2023-08-31T02:28:55.253Z]   Expected 0, but got 170 (FileCacheIntegrationSuite.scala:189)

@res-life (Collaborator, Author)

build

revans2 previously approved these changes Aug 31, 2023

@res-life (Collaborator, Author)

build

res-life changed the title from "[WIP] Add test cases for Parquet statistics" to "[WIP] Add test cases for Parquet statistics [databricks]" on Sep 12, 2023
@res-life (Collaborator, Author)

build

@res-life (Collaborator, Author)

All the test cases passed in premerge.
I have already marked them as scale tests and disabled them, and added them to a follow-up issue: #8849

@res-life (Collaborator, Author)

build

res-life marked this pull request as ready for review on September 12, 2023 at 14:56
res-life changed the title from "[WIP] Add test cases for Parquet statistics [databricks]" to "Add test cases for Parquet statistics [databricks]" on Sep 12, 2023
@revans2 (Collaborator) left a comment:

Generally it looks good. The only thing that I really want to see changed is that we are limiting the date range based on ORC limits in a Parquet test. The rest are nits that can be done as a follow-on, or possibly never.

*
*/
// skip check the schema
val (cpuStat, gpuStat) = checkStats(genDf(tab), skipCheckSchema = true)
Collaborator

Instead of skipping the schema check entirely, can we still check it, knowing that the top-level message name will not match and that the names of groups under lists will not match? This can be a follow-on issue if we want to.
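(A minimal sketch, using parquet-mr's schema API, of the looser comparison suggested here; the helper names fieldEquivalent and messagesEquivalent are hypothetical, not part of the existing test code.)

  import scala.collection.JavaConverters._
  import org.apache.parquet.schema.{LogicalTypeAnnotation, MessageType, Type}

  // Compare structure, repetition, and primitive types positionally; skip the
  // top-level message name and stop comparing field names inside LIST subtrees,
  // where the CPU and GPU writers use different synthetic group/element names.
  def fieldEquivalent(a: Type, b: Type, ignoreName: Boolean): Boolean = {
    if (a.getRepetition != b.getRepetition) return false
    if (!ignoreName && a.getName != b.getName) return false
    (a.isPrimitive, b.isPrimitive) match {
      case (true, true) =>
        a.asPrimitiveType().getPrimitiveTypeName == b.asPrimitiveType().getPrimitiveTypeName
      case (false, false) =>
        val (ga, gb) = (a.asGroupType(), b.asGroupType())
        val childIgnoresName = ignoreName ||
          ga.getLogicalTypeAnnotation == LogicalTypeAnnotation.listType()
        ga.getFieldCount == gb.getFieldCount &&
          ga.getFields.asScala.zip(gb.getFields.asScala).forall {
            case (fa, fb) => fieldEquivalent(fa, fb, childIgnoresName)
          }
      case _ => false
    }
  }

  def messagesEquivalent(cpu: MessageType, gpu: MessageType): Boolean =
    cpu.getFieldCount == gpu.getFieldCount &&
      cpu.getFields.asScala.zip(gpu.getFields.asScala).forall {
        case (fa, fb) => fieldEquivalent(fa, fb, ignoreName = false)
      }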

"timestamp")

test("Statistics tests for Parquet files written by GPU, float/double") {
assume(false, "Blocked by https://github.com/rapidsai/cudf/issues/13948")
Collaborator

I'm not sure that this is a bug. According to what is being discussed on the issue, cuDF might be doing the right thing and Spark is not. Could we try to update the tests to ignore NaNs? Or at least add a test that does not have NaN in it?

Collaborator Author

I think it's a bug. Please refer to: rapidsai/cudf#13948 (comment)

Collaborator

Then can we add a test for floating point that does not have NaNs in it, just to verify that we are doing the right thing in those cases too?
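(A minimal sketch of the kind of NaN-free floating point check being asked for; the path, app name, and the idea of simply printing footer statistics are illustrative assumptions, not the repo's checkStats implementation.)

  import scala.collection.JavaConverters._
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.Path
  import org.apache.parquet.hadoop.ParquetFileReader
  import org.apache.parquet.hadoop.util.HadoopInputFile
  import org.apache.spark.sql.SparkSession

  // Write a float column with no NaN values, then dump the per-column-chunk
  // min/max/null-count statistics from the footers so CPU- and GPU-written
  // files can be compared.
  val spark = SparkSession.builder().appName("float-stats-sketch").getOrCreate()
  import spark.implicits._

  val path = "/tmp/float-stats-sketch"   // hypothetical path
  (0 until 10000).map(i => i.toFloat / 3.0f).toDF("c0")
    .write.mode("overwrite").parquet(path)

  val conf = new Configuration()
  val fs = new Path(path).getFileSystem(conf)
  fs.listStatus(new Path(path)).map(_.getPath)
    .filter(_.getName.endsWith(".parquet"))
    .foreach { file =>
      val reader = ParquetFileReader.open(HadoopInputFile.fromPath(file, conf))
      try {
        reader.getFooter.getBlocks.asScala.flatMap(_.getColumns.asScala).foreach { col =>
          val stats = col.getStatistics
          println(s"${col.getPath}: min=${stats.minAsString}, " +
            s"max=${stats.maxAsString}, nulls=${stats.getNumNulls}")
        }
      } finally {
        reader.close()
      }
    }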

Collaborator Author

Please review this follow-up PR:
#9256

nullProbabilities.foreach { nullProbability =>
try {
val gen = DBGen()
gen.setDefaultValueRange(TimestampType, minTimestampForOrc, maxTimestampForOrc)
Collaborator

Why are we setting a timestamp range for ORC when we are in a Parquet test?

@res-life (Collaborator, Author)

build

@res-life (Collaborator, Author) commented Sep 13, 2023

Premerge is blocked by: #9233

revans2 previously approved these changes Sep 13, 2023
@jlowe (Member) commented Sep 13, 2023

build

@res-life (Collaborator, Author)

build

res-life merged commit 92f308c into NVIDIA:branch-23.10 on Sep 18, 2023
28 checks passed
res-life deleted the parquet-stat branch on September 18, 2023 at 05:25
@firestarman (Collaborator) commented Sep 18, 2023

Approved, but I am a little confused that we are going to add some tests that will not run.

@res-life (Collaborator, Author)

Tracked by follow-up issue: #8849

Labels
test (Only impacts tests)
Projects
None yet
Development

Successfully merging this pull request may close these issues:

Statistics tests for Parquet files written by GPU
5 participants