
Add test cases for Parquet statistics [databricks] #9090

Merged: 13 commits into NVIDIA:branch-23.10 on Sep 18, 2023

Conversation

res-life (Collaborator)

closes #8762

Add test case for Parquet statistics

@res-life (Collaborator, Author)

@revans2

3 questions about this PR:

  • The row counts per row group are (1,000,000, 48,576) in the cuDF-written Parquet file when writing 1M rows, while the Spark-generated Parquet file has a single row group of (1,048,576). Is this an issue? Even when the Parquet file contains only one boolean column, cuDF always splits the row group at 1,000,000 rows.
  • cuDF writes timestamps as int96, while Spark writes timestamps as int64:
optional int64 c10 (TIMESTAMP(MICROS,true));  // Spark
optional int96 c10; // cuDF 

This seems not to be an issue, because Spark reads an INT96 column back as a timestamp column.

  • We cannot compare the encodings for the ColumnChunk, e.g.:
    The encodings differ for an int32 column:
    encoding set (BIT_PACKED, PLAIN, RLE) on the CPU
    vs.
    encoding set (RLE, PLAIN) on the GPU

res-life requested a review from revans2 on August 22, 2023 at 13:36
@revans2 (Collaborator) commented Aug 22, 2023

* The row counts per row group are (1,000,000, 48,576) in the cuDF-written Parquet file when writing 1M rows, while the Spark-generated Parquet file has a single row group of (1,048,576). Is this an issue? Even when the Parquet file contains only one boolean column, cuDF always splits the row group at 1,000,000 rows.

The CPU splits up row groups based on the size of the data in the group after compression (128 MiB by default). The GPU splits things up by number of rows, or by the size of the batch passed in, whichever is smaller. The number of rows is configurable, but we are never going to match the CPU exactly.
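(For reference, a minimal sketch of steering the CPU-side row-group size in a test; it assumes only the standard Parquet Hadoop key "parquet.block.size", not any plugin-specific config, and the 64 MiB value and output path are illustrative.)

  import org.apache.spark.sql.SparkSession

  // Illustrative sketch: the CPU Parquet writer splits row groups by compressed
  // byte size (128 MiB by default), driven by the Hadoop key "parquet.block.size".
  val spark = SparkSession.builder().appName("rowgroup-sketch").getOrCreate()
  // Ask for smaller (64 MiB) row groups from the CPU writer.
  spark.sparkContext.hadoopConfiguration.set("parquet.block.size", (64 * 1024 * 1024).toString)
  spark.range(1048576L).toDF("c0")
    .write
    .mode("overwrite")
    .parquet("/tmp/rowgroup-sketch")   // hypothetical output path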

* cuDF writes timestamps as int96, while Spark writes timestamps as int64:
optional int64 c10 (TIMESTAMP(MICROS,true));  // Spark
optional int96 c10; // cuDF 

This seems not to be an issue, because Spark reads an INT96 column back as a timestamp column.

We should be matching Spark for this. Spark has a config that controls whether it writes the data out as int96, for backwards compatibility, or as int64. I agree that it is not a huge problem, but we need to file an issue with the plugin and dig down to understand how and where we are messing this up. We should have tests around all of the various configs with int96 vs not.
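(A minimal sketch of the kind of config coverage being suggested; only the documented Spark config spark.sql.parquet.outputTimestampType is assumed, and the output paths are hypothetical.)

  import java.sql.Timestamp
  import org.apache.spark.sql.SparkSession

  // Sketch: write the same timestamp column under each supported output type so a
  // test can check whether the footer ends up with an int96 or int64 physical type.
  val spark = SparkSession.builder().appName("ts-output-type-sketch").getOrCreate()
  import spark.implicits._

  val df = Seq(Timestamp.valueOf("2023-08-22 13:36:00")).toDF("c10")
  Seq("INT96", "TIMESTAMP_MICROS", "TIMESTAMP_MILLIS").foreach { tsType =>
    spark.conf.set("spark.sql.parquet.outputTimestampType", tsType)
    val path = s"/tmp/ts-output-type-sketch/$tsType"   // hypothetical path
    df.write.mode("overwrite").parquet(path)
    // A real test would also open the footer and assert the physical type here.
    assert(spark.read.parquet(path).count() == 1)
  }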

* We cannot compare the encodings for the `ColumnChunk`, e.g.:
  The encodings differ for an int32 column:
  encoding set (BIT_PACKED, PLAIN, RLE) on the CPU
  vs.
  encoding set (RLE, PLAIN) on the GPU

Yes, that is expected. The CPU and the GPU will likely produce slightly different encodings. The things I think we care about are that the size of the data we encode is not too much larger than the size of the data that the CPU encodes (although this can be a follow-on issue), and that we are not including encodings that are for the wrong version of Parquet.

There are two versions of Parquet (v1 and v2). V2 not only added some things to the footer, it also enabled new encoding formats. In versions of Spark that support Parquet V2 (Spark 3.3.0 and above), the default is to output encodings that are version-1 compatible, but setting "parquet.writer.version" to "v2" in the Parquet configs for the writer should enable the V2 encodings.

We really should be checking that we do not use the V2 encodings unless that config is set, and we are in Spark 3.3.0 or later.
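(A minimal sketch of how a test could collect per-column-chunk encodings from the footer via the parquet-mr API; the set of encodings treated as V2-only below is an assumption for illustration, not something stated in this thread.)

  import scala.collection.JavaConverters._
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.Path
  import org.apache.parquet.column.Encoding
  import org.apache.parquet.hadoop.ParquetFileReader
  import org.apache.parquet.hadoop.util.HadoopInputFile

  // Collect every encoding used by every column chunk of one Parquet file.
  def columnChunkEncodings(file: Path, conf: Configuration): Set[Encoding] = {
    val reader = ParquetFileReader.open(HadoopInputFile.fromPath(file, conf))
    try {
      reader.getFooter.getBlocks.asScala
        .flatMap(_.getColumns.asScala)
        .flatMap(_.getEncodings.asScala)
        .toSet
    } finally {
      reader.close()
    }
  }

  // Assumed V2-only encodings to flag when "parquet.writer.version" was not "v2".
  val v2OnlyEncodings: Set[Encoding] = Set(
    Encoding.DELTA_BINARY_PACKED,
    Encoding.DELTA_LENGTH_BYTE_ARRAY,
    Encoding.DELTA_BYTE_ARRAY,
    Encoding.RLE_DICTIONARY)

  // Example assertion for a file written with v1 settings:
  // assert(columnChunkEncodings(new Path("/tmp/some.parquet"), new Configuration())
  //   .intersect(v2OnlyEncodings).isEmpty)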

@res-life (Collaborator, Author)

cuDF writes timestamps as int96, while Spark writes timestamps as int64:

It's because of this code: https://github.com/NVIDIA/spark-rapids/blob/branch-23.10/tests/src/test/scala/com/nvidia/spark/rapids/SparkQueryCompareTestSuite.scala#L201-L207

  def withCpuSparkSession[U](f: SparkSession => U, conf: SparkConf = new SparkConf()): U = {
    val c = conf.clone()
      .set(RapidsConf.SQL_ENABLED.key, "false") // Just to be sure
      // temp work around to unsupported timestamp type
      .set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
    withSparkSession(c, f)
  }

I removed the spark.sql.parquet.outputTimestampType config.

@res-life (Collaborator, Author)

build

Signed-off-by: Chong Gao <[email protected]>
@res-life (Collaborator, Author)

build

@res-life (Collaborator, Author)

build

res-life changed the title from "[WIP] Add test case for Parquet statistics" to "[WIP] Add test case for Parquet statistics [databricks]" on Aug 30, 2023
@res-life (Collaborator, Author)

build

res-life changed the title from "[WIP] Add test case for Parquet statistics [databricks]" to "[WIP] Add test case for Parquet statistics" on Aug 30, 2023
@res-life (Collaborator, Author)

build

@github-actions

👎 Promotion blocked, new vulnerability found

Vulnerability report

Component: Jython
Vulnerability: CVE-2013-2027
Description: Jython 2.2.1 uses the current umask to set the privileges of the class cache files, which allows local users to bypass intended access restrictions via unspecified vectors.
Severity: MEDIUM

@res-life (Collaborator, Author)

build

@@ -201,8 +201,6 @@ trait SparkQueryCompareTestSuite extends AnyFunSuite with BeforeAndAfterAll {
def withCpuSparkSession[U](f: SparkSession => U, conf: SparkConf = new SparkConf()): U = {
val c = conf.clone()
.set(RapidsConf.SQL_ENABLED.key, "false") // Just to be sure
// temp work around to unsupported timestamp type
.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
Collaborator Author

Fixes: cuDF writes timestamps as int96, while Spark writes timestamps as int64.

Collaborator

Looks like I added that 3 years ago, and somehow the GPU version got cleaned up but the CPU version didn't. Thanks for fixing this.

res-life changed the title from "[WIP] Add test case for Parquet statistics" to "[WIP] Add test cases for Parquet statistics" on Aug 31, 2023
@res-life (Collaborator, Author)

Failed at:

[2023-08-31T02:28:23.226Z] FileCacheIntegrationSuite:
[2023-08-31T02:28:55.252Z] - filecache metrics v1 Parquet *** FAILED ***
[2023-08-31T02:28:55.253Z]   Expected 0, but got 170 (FileCacheIntegrationSuite.scala:189)

@res-life (Collaborator, Author)

build

revans2 previously approved these changes Aug 31, 2023

@res-life (Collaborator, Author)

build

res-life changed the title from "[WIP] Add test cases for Parquet statistics" to "[WIP] Add test cases for Parquet statistics [databricks]" on Sep 12, 2023
@res-life (Collaborator, Author)

build

@res-life (Collaborator, Author)

All the test cases passed in premerge.
I have already marked them as scale tests and disabled them, and added them to a follow-up issue: #8849

@res-life (Collaborator, Author)

build

res-life marked this pull request as ready for review on September 12, 2023 at 14:56
res-life changed the title from "[WIP] Add test cases for Parquet statistics [databricks]" to "Add test cases for Parquet statistics [databricks]" on Sep 12, 2023
@revans2 (Collaborator) left a comment:

Generally it looks good. The only thing that I really want to see changed is that we are limiting the date range based on ORC limits in a Parquet test. The rest are nits that can be done as a follow-on, or possibly never.

*
*/
// skip check the schema
val (cpuStat, gpuStat) = checkStats(genDf(tab), skipCheckSchema = true)
Collaborator

Instead of skipping the schema check entirely, can we still check it, knowing that the top-level message name will not match and that the names of groups under lists will not match? This can be a follow-on issue if we want to.
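(A minimal sketch, using parquet-mr's schema API, of the looser comparison suggested here; the helper names fieldEquivalent and messagesEquivalent are hypothetical, not part of the existing test code.)

  import scala.collection.JavaConverters._
  import org.apache.parquet.schema.{LogicalTypeAnnotation, MessageType, Type}

  // Compare structure, repetition, and primitive types positionally; skip the
  // top-level message name and stop comparing field names inside LIST subtrees,
  // where the CPU and GPU writers use different synthetic group/element names.
  def fieldEquivalent(a: Type, b: Type, ignoreName: Boolean): Boolean = {
    if (a.getRepetition != b.getRepetition) return false
    if (!ignoreName && a.getName != b.getName) return false
    (a.isPrimitive, b.isPrimitive) match {
      case (true, true) =>
        a.asPrimitiveType().getPrimitiveTypeName == b.asPrimitiveType().getPrimitiveTypeName
      case (false, false) =>
        val (ga, gb) = (a.asGroupType(), b.asGroupType())
        val childIgnoresName = ignoreName ||
          ga.getLogicalTypeAnnotation == LogicalTypeAnnotation.listType()
        ga.getFieldCount == gb.getFieldCount &&
          ga.getFields.asScala.zip(gb.getFields.asScala).forall {
            case (fa, fb) => fieldEquivalent(fa, fb, childIgnoresName)
          }
      case _ => false
    }
  }

  def messagesEquivalent(cpu: MessageType, gpu: MessageType): Boolean =
    cpu.getFieldCount == gpu.getFieldCount &&
      cpu.getFields.asScala.zip(gpu.getFields.asScala).forall {
        case (fa, fb) => fieldEquivalent(fa, fb, ignoreName = false)
      }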

"timestamp")

test("Statistics tests for Parquet files written by GPU, float/double") {
assume(false, "Blocked by https://github.com/rapidsai/cudf/issues/13948")
Collaborator

I'm not sure that this is a bug. According to what is being discussed on the issue, cuDF might be doing the right thing and Spark is not. Could we try to update the tests to ignore NaNs? Or at least add a test that does not have NaN in it?

Collaborator Author

I think it's a bug. Please refer to: rapidsai/cudf#13948 (comment)

Collaborator

Then can we add a test for floating point that does not have NaNs in it, just to verify that we are doing the right thing in those cases too?
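(A minimal sketch of the kind of NaN-free floating point check being asked for; the path, app name, and the idea of simply printing footer statistics are illustrative assumptions, not the repo's checkStats implementation.)

  import scala.collection.JavaConverters._
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.Path
  import org.apache.parquet.hadoop.ParquetFileReader
  import org.apache.parquet.hadoop.util.HadoopInputFile
  import org.apache.spark.sql.SparkSession

  // Write a float column with no NaN values, then dump the per-column-chunk
  // min/max/null-count statistics from the footers so CPU- and GPU-written
  // files can be compared.
  val spark = SparkSession.builder().appName("float-stats-sketch").getOrCreate()
  import spark.implicits._

  val path = "/tmp/float-stats-sketch"   // hypothetical path
  (0 until 10000).map(i => i.toFloat / 3.0f).toDF("c0")
    .write.mode("overwrite").parquet(path)

  val conf = new Configuration()
  val fs = new Path(path).getFileSystem(conf)
  fs.listStatus(new Path(path)).map(_.getPath)
    .filter(_.getName.endsWith(".parquet"))
    .foreach { file =>
      val reader = ParquetFileReader.open(HadoopInputFile.fromPath(file, conf))
      try {
        reader.getFooter.getBlocks.asScala.flatMap(_.getColumns.asScala).foreach { col =>
          val stats = col.getStatistics
          println(s"${col.getPath}: min=${stats.minAsString}, " +
            s"max=${stats.maxAsString}, nulls=${stats.getNumNulls}")
        }
      } finally {
        reader.close()
      }
    }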

Collaborator Author

Please review this follow-up PR:
#9256

nullProbabilities.foreach { nullProbability =>
try {
val gen = DBGen()
gen.setDefaultValueRange(TimestampType, minTimestampForOrc, maxTimestampForOrc)
Collaborator

Why are we setting a timestamp range for ORC when we are in a Parquet test?

@res-life (Collaborator, Author)

build

@res-life (Collaborator, Author) commented Sep 13, 2023

Premerge is blocked by: #9233

revans2 previously approved these changes Sep 13, 2023
@jlowe (Member) commented Sep 13, 2023

build

@res-life (Collaborator, Author)

build

res-life merged commit 92f308c into NVIDIA:branch-23.10 on Sep 18, 2023
28 checks passed
res-life deleted the parquet-stat branch on September 18, 2023 at 05:25
@firestarman (Collaborator) commented Sep 18, 2023

Approved, but I am a little confused that we are going to add some tests that will not run.

@res-life (Collaborator, Author)

Tracked by follow-up issue: #8849

Labels
test (Only impacts tests)
Projects
None yet
Development

Successfully merging this pull request may close these issues:

Statistics tests for Parquet files written by GPU
5 participants