Add test cases for Parquet statistics [databricks] #9090
Conversation
Signed-off-by: Chong Gao <[email protected]>
3 questions about this PR:
It seems this is not an issue, because Spark reads an INT96 column as a timestamp column.
The CPU splits up row groups based on the size of the data in the group after compression (128 MiB by default). The GPU splits things up by number of rows, or by the size of the batch passed in, whichever is smaller. The number of rows is configurable, but we are never going to match the CPU exactly.
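As a rough sketch (not from this PR), a test could look at the row group layout each writer actually produced instead of expecting the CPU and GPU to split identically; this only uses the standard parquet-mr footer API, and the file path is a placeholder.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Count the row groups a writer produced, rather than asserting that the
// CPU and GPU layouts match exactly.
def rowGroupCount(path: String): Int = {
  val reader = ParquetFileReader.open(
    HadoopInputFile.fromPath(new Path(path), new Configuration()))
  try {
    reader.getFooter.getBlocks.size()
  } finally {
    reader.close()
  }
}
```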
We should be matching Spark for this. Spark has a config that controls whether it writes timestamps out as int96, for backwards compatibility, or as int64. I agree that it is not a huge problem, but we need to file an issue with the plugin and dig down to understand how and where we are messing this up. We should have tests around all of the various configs with int96 vs not.
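A hedged sketch of that coverage is below; the `writeAndCheckStats` helper is hypothetical, but the config name and its values are standard Spark settings.

```scala
import org.apache.spark.SparkConf

// Exercise each timestamp output type Spark supports; INT96 is the
// backwards-compatibility format, the other two map to int64.
Seq("INT96", "TIMESTAMP_MICROS", "TIMESTAMP_MILLIS").foreach { tsType =>
  val conf = new SparkConf().set("spark.sql.parquet.outputTimestampType", tsType)
  withCpuSparkSession(spark => writeAndCheckStats(spark), conf) // hypothetical helper
}
```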
Yes, that is expected. The CPU and the GPU will likely produce slightly different encodings. The things I think we care about are that the size of the data we encode is not too much larger than the size of the data that the CPU encodes (although this can be a follow-on issue) and that we are not including encodings that are for the wrong version of Parquet. There are two versions of Parquet (v1 and v2). V2 not only added some things to the footer, it also enabled new encoding formats. In versions of Spark that support Parquet V2 (Spark 3.3.0 and above), the default is to output encodings that are version 1 compatible, but if you set "parquet.writer.version" to "v2" in the Parquet configs for the writer, it should enable the V2 encodings. We really should be checking that we do not use the V2 encodings unless that config is set and we are on Spark 3.3.0 or later.
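A sketch of what that encoding check could look like, assuming the V2-only encodings of interest are the DELTA_* ones exposed by parquet-mr (worth double-checking that list) and with the file path left as a placeholder:

```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.column.Encoding
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Assumed set of encodings that should only appear when
// parquet.writer.version=v2 was requested (and Spark is 3.3.0+).
val v2OnlyEncodings: Set[Encoding] = Set(
  Encoding.DELTA_BINARY_PACKED,
  Encoding.DELTA_LENGTH_BYTE_ARRAY,
  Encoding.DELTA_BYTE_ARRAY)

// Collect every encoding used by any column chunk in the file's footer.
def usedEncodings(path: String): Set[Encoding] = {
  val reader = ParquetFileReader.open(
    HadoopInputFile.fromPath(new Path(path), new Configuration()))
  try {
    reader.getFooter.getBlocks.asScala
      .flatMap(_.getColumns.asScala.flatMap(_.getEncodings.asScala))
      .toSet
  } finally {
    reader.close()
  }
}

// e.g. assert(usedEncodings(gpuPath).intersect(v2OnlyEncodings).isEmpty)
```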
I removed the
build
Signed-off-by: Chong Gao <[email protected]>
build
build
build
build
build
@@ -201,8 +201,6 @@ trait SparkQueryCompareTestSuite extends AnyFunSuite with BeforeAndAfterAll {
  def withCpuSparkSession[U](f: SparkSession => U, conf: SparkConf = new SparkConf()): U = {
    val c = conf.clone()
      .set(RapidsConf.SQL_ENABLED.key, "false") // Just to be sure
-     // temp work around to unsupported timestamp type
-     .set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
Fixes: cuDF writes timestamps as int96, while Spark writes timestamps as int64:
Looks like I added that in 3 years ago and somehow the GPU version got cleaned up but the CPU version didn't. Thanks for fixing this.
Failed at:
build
build
build
…le test project/repo
All the test cases passed in the premerge.
build
Generally it looks good. The only thing that I really want to see changed is us limiting the date range based on ORC limits for a parquet test. The rest are nits that can be done as a follow on, or possibly never.
 *
 */
// skip check the schema
val (cpuStat, gpuStat) = checkStats(genDf(tab), skipCheckSchema = true)
Instead of skipping the schema entirely, can we still check, but know that the top level message name does not match, and names of messages under lists will not match? This can be a follow on issue if we want to.
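For reference, one possible shape of that follow-on, as a sketch (the helper name is hypothetical): compare the schemas after normalizing the top-level message name; names under lists would need similar normalization.

```scala
import org.apache.parquet.schema.MessageType

// Rebuild both schemas with the same top-level name so only the fields are
// compared; list element names would still need to be normalized separately.
def withNormalizedName(schema: MessageType): MessageType =
  new MessageType("schema", schema.getFields)

// assert(withNormalizedName(cpuSchema) == withNormalizedName(gpuSchema))
```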
"timestamp") | ||
|
||
test("Statistics tests for Parquet files written by GPU, float/double") { | ||
assume(false, "Blocked by https://github.com/rapidsai/cudf/issues/13948") |
I'm not sure that this is a bug. According to what is being discussed on the issue, cuDF might be doing the right thing when Spark is not. Could we try to update the tests to ignore NaNs? Or at least add in a test that does not have NaN in it?
I think it's a bug. Please refer to: rapidsai/cudf#13948 (comment)
Then can we add a test that does not have NaNs in it for floating point, just to verify that we are doing the right thing in those cases too?
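A generic sketch of such a test is below; it does not use the repo's DBGen helper, and `checkStats` plus the session handling are assumed from the surrounding test code. It just builds a float/double frame with no NaN values and runs the same stats comparison over it.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// NaN-free float/double data, so min/max statistics are well defined.
def floatNoNanDf(spark: SparkSession): DataFrame = {
  import spark.implicits._
  (0 until 1000).map(i => (i.toFloat / 3.0f, i.toDouble / 7.0d)).toDF("f", "d")
}

// val (cpuStat, gpuStat) = checkStats(floatNoNanDf(spark))
```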
Please review this follow-up PR: #9256
nullProbabilities.foreach { nullProbability =>
  try {
    val gen = DBGen()
    gen.setDefaultValueRange(TimestampType, minTimestampForOrc, maxTimestampForOrc)
Why are we setting a timestamp range for ORC when we are in a parquet test?
build
Premerge is blocked by: #9233
build
This reverts commit f50f42d.
build
Approved, but I am a little confused that we are going to add in some tests that will not run.
Tracked by follow-up issue: #8849
closes #8762
Add test case for Parquet statistics