
Rebase Databricks 14.3 feature branch to 24.12 #5

Open

wants to merge 63 commits into base: SP-10661-db-14.3
Conversation

mythrocks
Collaborator

I have rebased SP-10661-db-14.3 to branch-24.12, if only to make the recent RapidsErrorUtils refactor available in this branch.

abellina and others added 30 commits September 11, 2024 20:44
…abricks] (NVIDIA#11466)

* Switch to a regular try

Signed-off-by: Gera Shegalov <[email protected]>

* drop Maven tarball

Signed-off-by: Gera Shegalov <[email protected]>

* unused import

Signed-off-by: Gera Shegalov <[email protected]>

* repro

Signed-off-by: Gera Shegalov <[email protected]>

---------

Signed-off-by: Gera Shegalov <[email protected]>
…s temporarily (NVIDIA#11469)" (NVIDIA#11473)

This reverts commit 5beeba8.

Signed-off-by: Alessandro Bellina <[email protected]>
NVIDIA#11449)

* Support yyyyMMdd in GetTimestamp operator for LEGACY mode

Signed-off-by: Chong Gao <[email protected]>
Co-authored-by: Chong Gao <[email protected]>
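The LEGACY-mode support above amounts to `GetTimestamp` accepting compact `yyyyMMdd` strings. A minimal Python sketch of what that parse accepts (the helper name is hypothetical, not from the plugin):

```python
from datetime import datetime

def parse_legacy_yyyymmdd(s: str) -> datetime:
    """Parse a compact yyyyMMdd date string, as LEGACY mode accepts."""
    return datetime.strptime(s, "%Y%m%d")

print(parse_legacy_yyyymmdd("20240911"))  # 2024-09-11 00:00:00
```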
… [databricks] (NVIDIA#11462)

* Support non-UTC timezone for casting from date type to timestamp type
Signed-off-by: Chong Gao <[email protected]>
Co-authored-by: Chong Gao <[email protected]>
* Install cuDF-py against python 3.10 on Databricks

Fix on Databricks runtime for : NVIDIA#11394

Enable the udf_cudf_test test case for Databricks-13.3

RAPIDS 24.10+ drops conda packages for Python 3.9 and below. Ref: https://docs.rapids.ai/notices/rsn0040/

Install cuDF-py packages against Python 3.10 and above on the Databricks runtime to run the UDF cuDF tests, because Conda is not installed by default on DB-13.3.

Signed-off-by: timl <[email protected]>

* Check if 'conda' exists to make the if/else expression more readable

Signed-off-by: timl <[email protected]>

---------

Signed-off-by: timl <[email protected]>
* add parquet column index UT test

Signed-off-by: fejiang <[email protected]>

* change

Signed-off-by: fejiang <[email protected]>

* added parquet suite

Signed-off-by: fejiang <[email protected]>

* pom changed

Signed-off-by: fejiang <[email protected]>

* DeltaEncoding Suite

Signed-off-by: fejiang <[email protected]>

* enable more suites

Signed-off-by: fejiang <[email protected]>

* remove ignored case

Signed-off-by: fejiang <[email protected]>

* format

Signed-off-by: fejiang <[email protected]>

* added ignored cases

Signed-off-by: fejiang <[email protected]>

* change to parquet hadoop version

Signed-off-by: fejiang <[email protected]>

* remove parquet.version

Signed-off-by: fejiang <[email protected]>

* adding scope and classifier

Signed-off-by: fejiang <[email protected]>

* pom remove unused

Signed-off-by: fejiang <[email protected]>

* pom change 2.13

Signed-off-by: fejiang <[email protected]>

* add schema suite

Signed-off-by: fejiang <[email protected]>

* remove dataframe

Signed-off-by: fejiang <[email protected]>

* RapidsParquetThriftCompatibilitySuite

Signed-off-by: fejiang <[email protected]>

* ThriftCompaSuite added

Signed-off-by: fejiang <[email protected]>

* more suites except the RowIndexSuite one

Signed-off-by: fejiang <[email protected]>

* formatting issues

Signed-off-by: fejiang <[email protected]>

* exclude SPARK-36803:

Signed-off-by: fejiang <[email protected]>

* setting change

Signed-off-by: fejiang <[email protected]>

* setting change

Signed-off-by: fejiang <[email protected]>

* adjust order

Signed-off-by: fejiang <[email protected]>

* adjust settings

Signed-off-by: fejiang <[email protected]>

* adjust settings

Signed-off-by: fejiang <[email protected]>

* RapidsParquetThriftCompatibilitySuite settings

* known issue added

Signed-off-by: fejiang <[email protected]>

* format new line

Signed-off-by: fejiang <[email protected]>

* known issue added

Signed-off-by: fejiang <[email protected]>

* RapidsParquetDeltaByteArrayEncodingSuite

Signed-off-by: fejiang <[email protected]>

* RapidsParquetAvroCompatibilitySuite

Signed-off-by: fejiang <[email protected]>

* ParquetFiledIdSchemaSuite and Avro suite added

* pom Avro suite modified

* ParquetFileFormatSuite added

* RapidsParquetRebaseDatetimeSuite and QuerySuite added

* RapidsParquetSchemaPruningSuite added

* setting adjust

Signed-off-by: fejiang <[email protected]>

* setting adjust

Signed-off-by: fejiang <[email protected]>

* UT adjust exclude added

Signed-off-by: fejiang <[email protected]>

* RapidsParquetThriftCompatibilitySuite adjust setting

Signed-off-by: fejiang <[email protected]>

* comment Create parquet table with compression

Signed-off-by: fejiang <[email protected]>

* SPARK_HOME NOT FOUND issue solved.

Signed-off-by: fejiang <[email protected]>

* enabling more suite

Signed-off-by: fejiang <[email protected]>

* remove exclude from RapidsParquetFieldIdIOSuite

Signed-off-by: fejiang <[email protected]>

* format and remove parquet files

Signed-off-by: fejiang <[email protected]>

* comment setting

Signed-off-by: fejiang <[email protected]>

* pom modified and remove unnecessary case

Signed-off-by: fejiang <[email protected]>

---------

Signed-off-by: fejiang <[email protected]>
Signed-off-by: fejiang <[email protected]>
Co-authored-by: fejiang <[email protected]>
Keep the rapids JNI and private dependency version at 24.10.0-SNAPSHOT until the nightly CI for the branch-24.12 branch is complete. Track the dependency update process at: NVIDIA#11492

Signed-off-by: nvauto <[email protected]>
…0798)

* optimizing Expand+Aggregate in SQL with many count distinct

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

* simplify

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

* add comment

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

* address comments

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

---------

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
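For context on the Expand+Aggregate shape being optimized above: Spark rewrites multiple `COUNT(DISTINCT col)` aggregates into an Expand (one replicated row per distinct column, tagged with a group id) followed by two aggregations. A pure-Python sketch of that rewrite's logic (function name is hypothetical):

```python
def multi_count_distinct(rows, cols):
    """Count distinct values per column the way Spark's Expand+Aggregate
    rewrite does: replicate each row once per column with a group id (gid),
    dedupe on (gid, value), then count per gid."""
    expanded = set()
    for row in rows:
        for gid, col in enumerate(cols):
            expanded.add((gid, row[col]))   # first aggregate: distinct (gid, value)
    counts = [0] * len(cols)
    for gid, _ in expanded:                 # second aggregate: count per gid
        counts[gid] += 1
    return dict(zip(cols, counts))

data = [{"a": 1, "b": "x"}, {"a": 1, "b": "y"}, {"a": 2, "b": "x"}]
print(multi_count_distinct(data, ["a", "b"]))  # {'a': 2, 'b': 2}
```

Each input row expands into `len(cols)` rows, which is why plans with many count-distinct columns get expensive and are worth optimizing.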
[auto-merge] branch-24.10 to branch-24.12 [skip ci] [bot]
[auto-merge] branch-24.10 to branch-24.12 [skip ci] [bot]
To fix: NVIDIA#11502

Download jars using wget instead of 'mvn dependency:get' to fix 'missing intermediate jars' failures, as we stopped deploying these intermediate jars as of version 24.10.

Signed-off-by: timl <[email protected]>
[auto-merge] branch-24.10 to branch-24.12 [skip ci] [bot]
* Support legacy mode for yyyymmdd format
Signed-off-by: Chong Gao <[email protected]>
Co-authored-by: Chong Gao <[email protected]>
* quick workaround to make image build work

Signed-off-by: Peixin Li <[email protected]>

* use mamba directly

---------

Signed-off-by: Peixin Li <[email protected]>
[auto-merge] branch-24.10 to branch-24.12 [skip ci] [bot]
* add max memory watermark metric

Signed-off-by: Zach Puller <[email protected]>

---------

Signed-off-by: Zach Puller <[email protected]>
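The max memory watermark metric added above is a peak tracker: bump a high-water mark on every allocation, never on free. A minimal sketch of the idea (class and method names are hypothetical, not the plugin's API):

```python
class MemoryTracker:
    """Track current memory use plus a high-water mark (watermark)."""
    def __init__(self):
        self.current = 0
        self.watermark = 0

    def allocate(self, n: int):
        self.current += n
        # the watermark only ever ratchets upward
        self.watermark = max(self.watermark, self.current)

    def free(self, n: int):
        self.current -= n

t = MemoryTracker()
t.allocate(100); t.allocate(50); t.free(120); t.allocate(30)
print(t.watermark)  # 150
```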
[auto-merge] branch-24.10 to branch-24.12 [skip ci] [bot]
* Updated parameters to enable file overwriting when dumping.

Signed-off-by: ustcfy <[email protected]>

* Validate LORE dump root path before execution

Signed-off-by: ustcfy <[email protected]>

* Add loreOutputRootPathChecked map for tracking lore output root path checks.

Signed-off-by: ustcfy <[email protected]>

* Delay path and filesystem initialization until actually needed.

Signed-off-by: ustcfy <[email protected]>

* Add test and update dev/lore.md doc.

Signed-off-by: ustcfy <[email protected]>

* Format code to ensure line length does not exceed 100 characters

Signed-off-by: ustcfy <[email protected]>

* Format code to ensure line length does not exceed 100 characters

Signed-off-by: ustcfy <[email protected]>

* Improved resource management by using withResource.

Signed-off-by: ustcfy <[email protected]>

* Update docs/dev/lore.md

Co-authored-by: Renjie Liu <[email protected]>

* Improved resource management by using withResource.

Signed-off-by: ustcfy <[email protected]>

* Removed  for FileSystem instance.

Signed-off-by: ustcfy <[email protected]>

* Update docs/dev/lore.md

Co-authored-by: Gera Shegalov <[email protected]>

---------

Signed-off-by: ustcfy <[email protected]>
Signed-off-by: ustcfy <[email protected]>
Co-authored-by: Renjie Liu <[email protected]>
Co-authored-by: Gera Shegalov <[email protected]>
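Two of the LORE commits above combine into one pattern: validate the dump root path once per root (tracked in a map), and only when a dump is actually attempted (lazy initialization). A hedged Python sketch of that pattern (class and field names are hypothetical):

```python
import os

class LoreDumpConfig:
    """Validate the LORE dump root path lazily and at most once per root."""
    def __init__(self, root: str):
        self._root = root
        self._checked = {}          # root path -> already-validated flag

    def ensure_root(self) -> str:
        # path and filesystem checks are delayed until first use
        if not self._checked.get(self._root):
            if not os.path.isdir(self._root):
                raise ValueError(f"LORE dump root does not exist: {self._root}")
            self._checked[self._root] = True
        return self._root
```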
[auto-merge] branch-24.10 to branch-24.12 [skip ci] [bot]
[auto-merge] branch-24.10 to branch-24.12 [skip ci] [bot]
[auto-merge] branch-24.10 to branch-24.12 [skip ci] [bot]
revans2 and others added 26 commits October 7, 2024 12:19
Fix the latest merge conflict in integration tests
* Spark 4:  Fix parquet_test.py.

Fixes NVIDIA#11015. (Spark 4 failure.)
Also fixes NVIDIA#11531. (Databricks 14.3 failure.)
Contributes to NVIDIA#11004.

This commit addresses the tests that fail in parquet_test.py, when
run on Spark 4.

1. Some of the tests were failing as a result of NVIDIA#5114.  Those tests
have been disabled, at least until we get around to supporting
aggregations with ANSI mode enabled.

2. `test_parquet_check_schema_compatibility` fails on Spark 4 regardless
of ANSI mode, because it tests implicit type promotions where the read
schema includes wider columns than the write schema.  This will require
new code.  The test is disabled until NVIDIA#11512 is addressed.

3. `test_parquet_int32_downcast` had an erroneous setup phase that fails
   in ANSI mode.  This has been corrected. The test was refactored to
run in ANSI and non-ANSI mode.

Signed-off-by: MithunR <[email protected]>
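The `test_parquet_int32_downcast` refactor above runs the same body under both ANSI settings: ANSI mode errors on an overflowing narrow cast, while legacy mode wraps. A plain-Python sketch of that split (the wraparound mimics two's-complement int32 behavior; the helper name is hypothetical):

```python
def downcast_to_int32(value: int, ansi: bool) -> int:
    """Cast a long to int32: raise on overflow in ANSI mode, wrap otherwise."""
    lo, hi = -2**31, 2**31 - 1
    if lo <= value <= hi:
        return value
    if ansi:
        raise OverflowError(f"{value} does not fit in INT")
    return (value - lo) % 2**32 + lo   # two's-complement wraparound

# the refactored test runs the same body in both modes
for ansi in (False, True):
    print(ansi, downcast_to_int32(42, ansi))
```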
* implement watermark

Signed-off-by: Zach Puller <[email protected]>

* consolidate/fix disk spill metric

Signed-off-by: Zach Puller <[email protected]>

---------

Signed-off-by: Zach Puller <[email protected]>
Fix merge conflict with branch-24.10
…A#11559)

* Spark 4:  Addressed cast_test.py failures.

Fixes NVIDIA#11009 and NVIDIA#11530.

This commit addresses the test failures in cast_test.py, on Spark 4.0.
These generally have to do with changes in behaviour of Spark when
ANSI mode is enabled.  In these cases, the tests have been split out into ANSI=on and ANSI=off.  

The bugs uncovered from the tests have been spun into their own issues;  fixing all of them was beyond the scope of this change.

Signed-off-by: MithunR <[email protected]>
* use task id as tie breaker

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

* save threadlocal lookup

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

---------

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
* avoid long tail tasks due to PrioritySemaphore (NVIDIA#11574)

* use task id as tie breaker

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

* save threadlocal lookup

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

---------

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

* addressing jason's comment

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

---------

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
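The tie-breaker change above orders equal-priority waiters by task id, so older (lower-id) tasks wake first instead of starving into long-tail stragglers. A minimal sketch of that queue ordering (a heap of `(priority, task_id)` tuples; the class is illustrative, not the plugin's `PrioritySemaphore`):

```python
import heapq

class WaiterQueue:
    """Order waiters by priority, breaking ties by task id so the
    oldest task at a given priority is released first."""
    def __init__(self):
        self._waiters = []

    def enqueue(self, priority: int, task_id: int):
        # heapq pops the smallest tuple: priority first, then task id
        heapq.heappush(self._waiters, (priority, task_id))

    def next_task(self) -> int:
        return heapq.heappop(self._waiters)[1]

q = WaiterQueue()
q.enqueue(1, 7); q.enqueue(1, 3); q.enqueue(0, 9)
print(q.next_task(), q.next_task(), q.next_task())  # 9 3 7
```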
[auto-merge] branch-24.10 to branch-24.12 [skip ci] [bot]
* Fix collection_ops_tests for Spark 4.0.

Fixes NVIDIA#11011.

This commit fixes the failures in `collection_ops_tests` on Spark 4.0.

On all versions of Spark, when a Sequence is collected with rows that exceed MAX_INT,
an exception is thrown indicating that the collected Sequence/array is
larger than permissible. The different versions of Spark vary in the
contents of the exception message.

On Spark 4, one sees that the error message now contains more
information than all prior versions, including:
1. The name of the op causing the error
2. The errant sequence size

This commit introduces a shim to make this new information available in
the exception.

Note that this shim does not fit cleanly in RapidsErrorUtils, because
there are differences within major Spark versions. For instance, Spark
3.4.0-1 have a different message as compared to 3.4.2 and 3.4.3.
Likewise, the messages differ among 3.5.0, 3.5.1, and 3.5.2.

Signed-off-by: MithunR <[email protected]>

* Fixed formatting error.

* Review comments.

This moves the construction of the long-sequence error strings into
RapidsErrorUtils.  The process involved introducing many new RapidsErrorUtils
classes, and using mix-ins of concrete implementations for the error-string
construction.

* Added missing shim tag for 3.5.2.

* Review comments: Fixed code style.

* Reformatting, per project guideline.

* Fixed missed whitespace problem.

---------

Signed-off-by: MithunR <[email protected]>
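The review rework above moves the long-sequence error strings into per-version RapidsErrorUtils classes built from mix-ins. A hedged Python sketch of the mix-in dispatch idea (class names and message texts are placeholders, not Spark's actual strings):

```python
class BaseErrorUtils:
    """Default long-sequence error string for older Spark versions."""
    def too_long_sequence(self, size: int, op: str) -> str:
        return f"sequence too long: {size}"

class VerboseSequenceErrorMixin:
    """Mix-in for Spark versions whose message also names the failing op
    and the errant sequence size."""
    def too_long_sequence(self, size: int, op: str) -> str:
        return f"sequence too long in {op}: {size}"

class Spark340ErrorUtils(BaseErrorUtils):
    pass

class Spark400ErrorUtils(VerboseSequenceErrorMixin, BaseErrorUtils):
    # MRO picks the mix-in's message builder over the base one
    pass

print(Spark400ErrorUtils().too_long_sequence(3_000_000_000, "concat"))
```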
[auto-merge] branch-24.10 to branch-24.12 [skip ci] [bot]
Wait for the pre-merge CI job to SUCCEED

Signed-off-by: nvauto <[email protected]>
* Update latest changelog [skip ci]

Update change log with CLI: scripts/generate-changelog --token=<GIT_TOKEN> --releases=24.08,24.10

Signed-off-by: nvauto <[email protected]>

* Update changelog

Signed-off-by: timl <[email protected]>

* Update changelog

Signed-off-by: timl <[email protected]>

---------

Signed-off-by: nvauto <[email protected]>
Signed-off-by: timl <[email protected]>
Co-authored-by: timl <[email protected]>
…-11604

Fix auto merge conflict 11604 [skip ci]
* xfail regexp tests to unblock CI

Signed-off-by: Jason Lowe <[email protected]>

* Disable failing regexp unit test to unblock CI

---------

Signed-off-by: Jason Lowe <[email protected]>
* Remove an unused config shuffle.spillThreads

Signed-off-by: Alessandro Bellina <[email protected]>

* update configs.md

---------

Signed-off-by: Alessandro Bellina <[email protected]>
Needed minor modifications.

Signed-off-by: MithunR <[email protected]>
@mythrocks mythrocks self-assigned this Oct 16, 2024
@mythrocks mythrocks requested a review from jlowe as a code owner October 16, 2024 17:12
@mythrocks mythrocks removed the request for review from jlowe October 16, 2024 18:01