
perf: Respect Spark's PARQUET_FILTER_PUSHDOWN_ENABLED config #1619


Open · wants to merge 15 commits into main

Conversation

@andygrove (Member) commented Apr 7, 2025

Which issue does this PR close?

N/A

Rationale for this change

My primary motivation was to be able to run benchmarks with the new scans with and without Parquet pushdown enabled, since we know that there are performance issues.

What changes are included in this PR?

Respect Spark's PARQUET_FILTER_PUSHDOWN_ENABLED config.

How are these changes tested?

I ran TPC-H @ 100 GB locally with native_datafusion scan.

With pushdown enabled: 328 s
With pushdown disabled: 270 s

@andygrove (Member Author) commented:

@mbutrovich @parthchandra I am not very familiar with the code that I updated. Could you review?

let mut table_parquet_options = TableParquetOptions::new();
table_parquet_options.global.pushdown_filters = pushdown_filters;
// TODO: Maybe these are configs?
A Contributor commented:

I think we can remove the TODO, and maybe get rid of the reorder_filters = true on the line below. DataFusion still defaults that to false, so we may not yet understand its performance implications. We could add a config for that in a follow-up PR and measure the difference.

@andygrove (Member Author) replied:

Thanks. I have updated this and also added some notes to the tuning guide.

@mbutrovich (Contributor) left a comment:

Other than the one optional suggestion, this LGTM.

@mbutrovich (Contributor) commented:

Now that I think about it more: should this be a distinct config from Spark's (i.e., a Comet config, like we might do with reorder_filters)? If we fall back to Spark scan, are we hurting performance when this is set to false (as the tuning guide suggests)? Maybe we expect that someone enabling the experimental reader already knows their workload does not fall back to Spark for scans. We can revisit these configs and defaults in the future, I suppose. For now I think this is reasonable for benchmarking the experimental readers.

@andygrove (Member Author) commented:

> Now that I think about it more: should this be a distinct config from Spark's

I was also thinking about this. If we add a Comet configuration, we can set the default to disable the filter pushdown for now. I will make this change.

@andygrove andygrove changed the title perf: Respect Spark's PARQUET_FILTER_PUSHDOWN_ENABLED config perf: Add new Comet PARQUET_FILTER_PUSHDOWN_ENABLED config Apr 7, 2025
@codecov-commenter commented Apr 7, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 59.07%. Comparing base (f09f8af) to head (e528bb9).
Report is 139 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1619      +/-   ##
============================================
+ Coverage     56.12%   59.07%   +2.94%     
- Complexity      976     1072      +96     
============================================
  Files           119      125       +6     
  Lines         11743    12499     +756     
  Branches       2251     2340      +89     
============================================
+ Hits           6591     7384     +793     
+ Misses         4012     3950      -62     
- Partials       1140     1165      +25     


@andygrove (Member Author) commented:

> I was also thinking about this. If we add a Comet configuration, we can set the default to disable the filter pushdown for now. I will make this change.

We do need to respect Spark's PARQUET_FILTER_PUSHDOWN_ENABLED otherwise we see regressions in the Spark SQL tests, so I reverted adding a Comet-specific config.

@parthchandra (Contributor) commented:

I don't see where PARQUET_FILTER_PUSHDOWN_ENABLED is used at all. Wondering how this affects performance.
Also, row index columns are tied to predicate pushdown. With row indexes, a file reader is allowed to filter out row groups based on the predicate but cannot apply the predicate to filter out individual values (because row index generation cannot determine which row numbers were filtered out).

@parthchandra (Contributor) commented Apr 7, 2025

> I don't see where PARQUET_FILTER_PUSHDOWN_ENABLED is used at all

Nvm. It is used to decide whether to pass the predicates to the batch reader, which uses them to filter out row groups. Note that the behavior of the native code must match this exactly, especially for row index generation. So if we want to have a Comet config, it must have the same value as the Spark config, which makes it redundant.

@andygrove andygrove changed the title perf: Add new Comet PARQUET_FILTER_PUSHDOWN_ENABLED config perf: Respect Spark's PARQUET_FILTER_PUSHDOWN_ENABLED config Apr 8, 2025
@@ -257,7 +257,8 @@ public static native long initRecordBatchReader(
byte[] requiredSchema,
byte[] dataSchema,
String sessionTimezone,
int batchSize);
int batchSize,
boolean pushdownFilters);
A Contributor commented:

This is already implied if the filter parameter is null.

@andygrove (Member Author) replied Apr 10, 2025:

Thanks. I will update this soon.

@@ -703,6 +704,7 @@ pub unsafe extern "system" fn Java_org_apache_comet_parquet_Native_initRecordBat
file_groups,
None,
data_filters,
pushdown_filters != 0,
A Contributor commented:

Same here. data_filters is an Option and should be None if filter pushdown is disabled.

@andygrove andygrove marked this pull request as draft April 11, 2025 14:55
@andygrove (Member Author) commented:

I addressed the feedback but I no longer see a performance improvement with native_datafusion when disabling PARQUET_FILTER_PUSHDOWN_ENABLED, so I have moved this to draft while I figure out why.

@andygrove (Member Author) commented:

> I addressed the feedback but I no longer see a performance improvement with native_datafusion when disabling PARQUET_FILTER_PUSHDOWN_ENABLED, so I have moved this to draft while I figure out why.

I was testing with the wrong Comet JAR file 🤦‍♂️

This is ready for re-review @mbutrovich @parthchandra

@andygrove andygrove marked this pull request as ready for review April 11, 2025 15:27