Conversation


@wudidapaopao wudidapaopao commented Sep 10, 2025

This PR adds the chdb solution.

I've updated time.csv and logs.csv with test results for the 0.5G and 5G datasets, run on my local laptop. These results were not generated on a standard machine; they are only intended to demonstrate the final output of the test script.

Please feel free to contact me if there are any issues or missing information.

cyrusmsk and others added 14 commits July 27, 2025 18:05
The chdb implementation is based on the polars Python code.
Current approach is using connection + cursor.
Draft for joining code is finished
Grouping queries were prepared and also added flush=True for all printing operations with LIMIT 3
Now groupings code is working
Initial working version of chdb for join
Proper logic for on_disk and na_flag identification was added.
fix for local run
Added settings and fixed MergeTree for the session branch
added latest query and max_threads
Added settings for threads
on M1 Pro chip - 8 threads showing better perf
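
As a rough illustration of the max_threads setting mentioned in the last few commits, here is a minimal, hypothetical sketch (the actual settings string and thread count live in the solution script; 8 threads is the value reported as best on the M1 Pro):

```python
import chdb

# The solution appends a SETTINGS clause to each query; on an M1 Pro,
# 8 threads were reported to give the best performance.
settings = "SETTINGS max_threads = 8"
res = chdb.query(
    f"SELECT number % 3 AS id1, sum(number) AS v1 FROM numbers(10) GROUP BY id1 {settings}",
    "CSV",
)
print(res, flush=True)
```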
@szarnyasg szarnyasg requested a review from Tmonster September 10, 2025 06:16
Collaborator

@Tmonster Tmonster left a comment

Great thanks! Just some comments. You can remove your changes to time.csv and logs.csv. I'll run the benchmark again when this gets merged 👍

conn.query("DROP TABLE IF EXISTS ans")
gc.collect()
if compress:
time.sleep(60)
Collaborator

why are we putting a sleep between the two queries?

Author

@wudidapaopao wudidapaopao Sep 10, 2025

The processing logic of chDB is similar to ClickHouse's, so the same handling used for ClickHouse has been applied here for the 50G dataset to avoid OOM.
I added the following comments in the test script:

# It will take some time for memory freed by Memory engine to be returned back to the system.
# Without a sleep we might get a MEMORY_LIMIT exception during the second run of the query.
# It is done only when compress is true because this variable is set to true only for the largest dataset.
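
Taken together with the snippet above, the guard looks roughly like this as a self-contained sketch (the `conn` session and `compress` flag here are stand-ins for the ones in the solution script, where `compress` is true only for the largest dataset):

```python
import gc
import time

from chdb import session

conn = session.Session()  # assumption: stand-in for the solution's chdb session
compress = True           # assumption: set only for the largest (50G) dataset

conn.query("DROP TABLE IF EXISTS ans")  # drop the answer table from the first run
gc.collect()                            # release Python-side references

# Memory freed by the Memory engine takes a while to be returned to the system;
# without the sleep the second run of the query can hit a MEMORY_LIMIT exception.
if compress:
    time.sleep(60)
```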

QUERY=f"CREATE TABLE ans ENGINE = {query_engine} AS SELECT id1, sum(v1) AS v1 FROM db_benchmark.x GROUP BY id1 {settings}"
conn.query(QUERY)
nr = int(str(conn.query("SELECT count(*) AS cnt FROM ans")).strip())
nc = len(str(conn.query("SELECT * FROM ans LIMIT 0", "CSVWITHNAMES")).split(','))
Collaborator

What is the significance of "CSVWITHNAMES" here? Just curious

Collaborator

Oh, is it just a way to get the number of columns?

Author

Yes, it's just to be able to get the number of columns.
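
As a standalone illustration of that trick (a hypothetical example; the solution does the same thing through its session connection):

```python
import chdb

# LIMIT 0 returns no data rows, but the CSVWithNames format still emits the
# header line, so splitting it on ',' gives the number of columns.
header = str(chdb.query("SELECT 1 AS id1, 2.5 AS v1 LIMIT 0", "CSVWithNames"))
nc = len(header.strip().split(','))
print(nc)  # 2
```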

import time
import sys

solution = str(sys.argv[1])
Collaborator

Does this file need to be included? Maybe it could go in utils, with the idea that all solutions can use it?

Author

This file should be unused and identical to monitor_ram.py in the polars directory, so I have deleted it.

Yes, since I ported the solution from the polars folder, this file was taken from there.
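
For context, a purely hypothetical sketch of what a RAM-monitoring helper like this typically looks like (not the actual file, which mirrored monitor_ram.py in the polars directory and has since been removed; psutil is an assumed dependency):

```python
import sys
import time

import psutil  # assumption: psutil is used for memory sampling

solution = str(sys.argv[1])

# Sample system memory usage at a fixed interval while the solution runs.
while True:
    used_gib = psutil.virtual_memory().used / 1024**3
    print(f"{solution}: {used_gib:.2f} GiB used", flush=True)
    time.sleep(5)
```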

@wudidapaopao
Author

Great thanks! Just some comments. You can remove your changes to time.csv and logs.csv. I'll run the benchmark again when this gets merged 👍

Thank you very much. I have reverted the changes made to time.csv and logs.csv.

@cyrusmsk

Nice timing for the new version of the benchmark to compare with the newest DuckDB 1.4.0

@Tmonster
Collaborator

@wudidapaopao seems like the regression test is failing for chdb? Can you take a look? The conflict is just that I removed pydatatable from the regression runner.

The failing Julia solutions are known. Eventually I will fix those as well.

@wudidapaopao
Author

wudidapaopao commented Oct 9, 2025

@Tmonster I've resolved the code conflicts and attempted to fix the failing regression tests, with chk output added. Since I couldn't reproduce the issue locally, would it be convenient to re-trigger the regression testing? I'm hoping this fix will work.

@Tmonster
Collaborator

Awesome thanks!

@Tmonster Tmonster merged commit 3da7318 into duckdblabs:main Oct 12, 2025
13 checks passed