
perf: Respect Spark's PARQUET_FILTER_PUSHDOWN_ENABLED config #1619


Open · wants to merge 15 commits into main

Conversation

@andygrove (Member) commented Apr 7, 2025

Which issue does this PR close?

N/A

Rationale for this change

My primary motivation was to be able to run benchmarks with the new scans with and without Parquet pushdown enabled, since we know that there are performance issues.

What changes are included in this PR?

Respect Spark's PARQUET_FILTER_PUSHDOWN_ENABLED config.

How are these changes tested?

I ran TPC-H @ 100 GB locally with native_datafusion scan.

With pushdown enabled: 328 s
With pushdown disabled: 270 s

@andygrove (Member Author) commented:

@mbutrovich @parthchandra I am not very familiar with the code that I updated. Could you review?

let mut table_parquet_options = TableParquetOptions::new();
table_parquet_options.global.pushdown_filters = pushdown_filters;
// TODO: Maybe these are configs?
A Contributor commented:

I think we can remove the TODO, and maybe get rid of the reorder_filters = true on the line below. DataFusion still defaults that to false, so we may not yet understand its performance implications. We could add a config for that in a follow-up PR and measure the difference.

@andygrove (Member Author) replied:

Thanks. I have updated this and also added some notes to the tuning guide.

@mbutrovich (Contributor) left a comment:

Other than the one optional suggestion, this LGTM.

@mbutrovich (Contributor) commented:

Now that I think about it more: should this be a distinct config from Spark's (i.e., a Comet config, like we might do with reorder_filters)? If we fall back to Spark scan, are we hurting performance when this is set to false (as the tuning guide suggests)? Maybe we expect that someone enabling the experimental reader already knows their workload does not fall back to Spark for scans. We can revisit these configs and defaults in the future, I suppose. For now I think this is reasonable for benchmarking the experimental readers.

@andygrove (Member Author) commented:

> Now that I think about it more: should this be a distinct config from Spark's

I was also thinking about this. If we add a Comet configuration, we can set the default to disable the filter pushdown for now. I will make this change.

@andygrove andygrove changed the title perf: Respect Spark's PARQUET_FILTER_PUSHDOWN_ENABLED config perf: Add new Comet PARQUET_FILTER_PUSHDOWN_ENABLED config Apr 7, 2025
@codecov-commenter commented Apr 7, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 59.07%. Comparing base (f09f8af) to head (e528bb9).
Report is 139 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1619      +/-   ##
============================================
+ Coverage     56.12%   59.07%   +2.94%     
- Complexity      976     1072      +96     
============================================
  Files           119      125       +6     
  Lines         11743    12499     +756     
  Branches       2251     2340      +89     
============================================
+ Hits           6591     7384     +793     
+ Misses         4012     3950      -62     
- Partials       1140     1165      +25     


@andygrove (Member Author) commented:

> I was also thinking about this. If we add a Comet configuration, we can set the default to disable the filter pushdown for now. I will make this change.

We do need to respect Spark's PARQUET_FILTER_PUSHDOWN_ENABLED otherwise we see regressions in the Spark SQL tests, so I reverted adding a Comet-specific config.

@parthchandra (Contributor) commented:

I don't see where PARQUET_FILTER_PUSHDOWN_ENABLED is used at all. Wondering how this affects performance.
Also, row index columns are tied to predicate pushdown. With row indexes, a file reader is allowed to filter out row groups based on the predicate but cannot apply the predicate to filter out individual values (because row index generation cannot determine which row numbers were filtered out).

@parthchandra (Contributor) commented Apr 7, 2025

> I don't see where PARQUET_FILTER_PUSHDOWN_ENABLED is used at all

Nvm. It is used to decide whether to pass the predicates to the batch reader, which uses them to filter out row groups. Note that the behavior of the native code must match this exactly, especially for row index generation. So if we want to have a Comet config, it must have the same value as the Spark config, which makes it redundant.

@andygrove andygrove changed the title perf: Add new Comet PARQUET_FILTER_PUSHDOWN_ENABLED config perf: Respect Spark's PARQUET_FILTER_PUSHDOWN_ENABLED config Apr 8, 2025
@@ -257,7 +257,8 @@ public static native long initRecordBatchReader(
byte[] requiredSchema,
byte[] dataSchema,
String sessionTimezone,
int batchSize);
int batchSize,
boolean pushdownFilters);
A Contributor commented:

This is already implied if the filter parameter is null.

@andygrove (Member Author) replied Apr 10, 2025:

Thanks. I will update this soon.

@@ -703,6 +704,7 @@ pub unsafe extern "system" fn Java_org_apache_comet_parquet_Native_initRecordBat
file_groups,
None,
data_filters,
pushdown_filters != 0,
A Contributor commented:

Same here. data_filters is an Option and should be None if filter pushdown is disabled.

@andygrove andygrove marked this pull request as draft April 11, 2025 14:55
@andygrove (Member Author) commented:

I addressed the feedback but I no longer see a performance improvement with native_datafusion when disabling PARQUET_FILTER_PUSHDOWN_ENABLED, so I have moved this to draft while I figure out why.

@andygrove (Member Author) commented:

> I addressed the feedback but I no longer see a performance improvement with native_datafusion when disabling PARQUET_FILTER_PUSHDOWN_ENABLED, so I have moved this to draft while I figure out why.

I was testing with the wrong Comet JAR file 🤦‍♂️

This is ready for re-review @mbutrovich @parthchandra

@andygrove andygrove marked this pull request as ready for review April 11, 2025 15:27