Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test(tpcds): add queries 28-63 #9736

Merged
merged 18 commits into from
Aug 1, 2024
Merged

Conversation

cpcloud
Copy link
Member

@cpcloud cpcloud commented Jul 31, 2024

Adds TPC-DS queries 28-63. The commits adding the Ibis expressions are in batches of 10. It seemed no less annoying to do it that way than to create a separate PR for each batch of 10.

@cpcloud cpcloud added tests Issues or PRs related to tests clickhouse The ClickHouse backend datafusion The Apache DataFusion backend duckdb The DuckDB backend snowflake The Snowflake backend trino The Trino backend labels Jul 31, 2024
@cpcloud cpcloud requested review from jcrist and gforsyth July 31, 2024 11:42
@cpcloud cpcloud force-pushed the more-tpc-ds branch 2 times, most recently from efc26b1 to 42fe263 Compare July 31, 2024 13:51
@gforsyth
Copy link
Member

general question re: the baseline SQL queries: are there canonical (or non-canonical...?) sources that these are being drawn from? I imagine DuckDB has these somewhere in a benchmarking repo, but is that also the case with Clickhouse? Or did you perform the translation?

@cpcloud
Copy link
Member Author

cpcloud commented Jul 31, 2024

Bleh, why the heck is druid trying to run tpc ds queries It's not.

@cpcloud
Copy link
Member Author

cpcloud commented Jul 31, 2024

general question re: the baseline SQL queries: are there canonical (or non-canonical...?) sources that these are being drawn from? I imagine DuckDB has these somewhere in a benchmarking repo, but is that also the case with Clickhouse? Or did you perform the translation?

The DuckDB ones are copied from their repo and then modified in ways that don't change the results (adding column names and casting to dates), to accommodate the more pedantic engines.

Regarding ClickHouse, it looks like there's this https://github.com/Altinity/tpc-ds/tree/master/queries. It looks like they didn't rewrite anything heavily other than some truly atrocious formatting, and they are using some settings like

SET partial_merge_join = 1, partial_merge_join_optimizations = 1

that might allow our code to work as well without having to special case clickhouse.

I don't think there's a source of truth that works across all engines without alteration.

@gforsyth
Copy link
Member

gforsyth commented Jul 31, 2024

I don't think there's a source of truth that works across all engines without alteration.

oh, I'm sure that there isn't -- I was just wondering if we wanted to include links to upstream sources.

Maybe also we can put some notes in a README in the various sql file collections with a brief overview of changes (like the one-sentence description you just gave about the DuckDB queries)

@cpcloud
Copy link
Member Author

cpcloud commented Jul 31, 2024

Added a readme about the folly of TPC :)

@cpcloud
Copy link
Member Author

cpcloud commented Jul 31, 2024

Re clickhouse settings, partial_merge_join = 1 just ... doesn't work at all and complains that's either not a setting or it's a read only setting, so I'm abandoning that for now.

I should report the performance issue upstream, 200 MB of TPC-DS data shouldn't really be causing such a nutty performance problem.

@gforsyth
Copy link
Member

I should report the performance issue upstream, 200 MB of TPC-DS data shouldn't really be causing such a nutty performance problem.

agreed.

Also, on the subject of 200mb of data... if we bump up to sf=.45 we'll have non-empty results for queries 37, 41, 44, and 58

Only 51 requires a higher threshold (searching now)

@gforsyth
Copy link
Member

Only 51 requires a higher threshold (searching now)

hmm, I'm wrong, sf=.45 also works for query 51.

@cpcloud
Copy link
Member Author

cpcloud commented Jul 31, 2024

Nice, I will update the testing-data repo then!

@cpcloud
Copy link
Member Author

cpcloud commented Jul 31, 2024

Do we want to merge this first and then I can handle the data updates in a follow-up?

@gforsyth
Copy link
Member

Sounds good -- I was waiting on the data updates, but those can definitely go in a follow-up

@cpcloud
Copy link
Member Author

cpcloud commented Jul 31, 2024

oh lemme run the clouds before merging

@cpcloud cpcloud added the ci-run-cloud Add this label to trigger a run of BigQuery, Snowflake, and Databricks backends in CI label Jul 31, 2024
@ibis-docs-bot ibis-docs-bot bot removed the ci-run-cloud Add this label to trigger a run of BigQuery, Snowflake, and Databricks backends in CI label Jul 31, 2024
@cpcloud cpcloud merged commit 3939e61 into ibis-project:main Aug 1, 2024
101 checks passed
@cpcloud cpcloud deleted the more-tpc-ds branch August 1, 2024 09:42
@cpcloud cpcloud added this to the 9.3 milestone Aug 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clickhouse The ClickHouse backend datafusion The Apache DataFusion backend duckdb The DuckDB backend snowflake The Snowflake backend tests Issues or PRs related to tests trino The Trino backend
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants