Conversation


@wudidapaopao wudidapaopao commented Sep 10, 2025

This PR adds the chdb solution.

I've updated time.csv and logs.csv with test results for the 0.5G and 5G datasets, run on my local laptop. These results were not generated on a standard machine; they are only intended to demonstrate the final output of the test script.

Please feel free to contact me if there are any issues or missing information.

cyrusmsk and others added 14 commits July 27, 2025 18:05
The chdb implementation is based on the polars Python code.
Current approach is using connection + cursor.
Draft for joining code is finished
Grouping queries were prepared and also added flush=True for all printing operations with LIMIT 3
Now groupings code is working
Initial working version of chdb for join
Proper logic for on_disk and na_flag identification was added.
fix for local run
Added settings and fixed MergeTree for the session branch
added latest query and max_threads
Added settings for threads
on M1 Pro chip - 8 threads showing better perf
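
As a rough illustration of the max_threads setting mentioned in the last few commits, here is a minimal, hypothetical sketch (the actual settings string and thread count live in the solution script; 8 threads is the value reported as best on the M1 Pro):

```python
import chdb

# The solution appends a SETTINGS clause to each query; on an M1 Pro,
# 8 threads were reported to give the best performance.
settings = "SETTINGS max_threads = 8"
res = chdb.query(
    f"SELECT number % 3 AS id1, sum(number) AS v1 FROM numbers(10) GROUP BY id1 {settings}",
    "CSV",
)
print(res, flush=True)
```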
@szarnyasg szarnyasg requested a review from Tmonster September 10, 2025 06:16
Collaborator

@Tmonster Tmonster left a comment

Great thanks! Just some comments. You can remove your changes to time.csv and logs.csv. I'll run the benchmark again when this gets merged 👍

conn.query("DROP TABLE IF EXISTS ans")
gc.collect()
if compress:
time.sleep(60)
Collaborator

why are we putting a sleep between the two queries?

Author

@wudidapaopao wudidapaopao Sep 10, 2025

The processing logic of chDB is similar to ClickHouse's, so the same handling used for ClickHouse has been applied here for the 50G dataset to avoid OOM.
I added the following comments in the test script:

# It will take some time for memory freed by Memory engine to be returned back to the system.
# Without a sleep we might get a MEMORY_LIMIT exception during the second run of the query.
# It is done only when compress is true because this variable is set to true only for the largest dataset.
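
Taken together with the snippet above, the guard looks roughly like this as a self-contained sketch (the `conn` session and `compress` flag here are stand-ins for the ones in the solution script, where `compress` is true only for the largest dataset):

```python
import gc
import time

from chdb import session

conn = session.Session()  # assumption: stand-in for the solution's chdb session
compress = True           # assumption: set only for the largest (50G) dataset

conn.query("DROP TABLE IF EXISTS ans")  # drop the answer table from the first run
gc.collect()                            # release Python-side references

# Memory freed by the Memory engine takes a while to be returned to the system;
# without the sleep the second run of the query can hit a MEMORY_LIMIT exception.
if compress:
    time.sleep(60)
```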

QUERY=f"CREATE TABLE ans ENGINE = {query_engine} AS SELECT id1, sum(v1) AS v1 FROM db_benchmark.x GROUP BY id1 {settings}"
conn.query(QUERY)
nr = int(str(conn.query("SELECT count(*) AS cnt FROM ans")).strip())
nc = len(str(conn.query("SELECT * FROM ans LIMIT 0", "CSVWITHNAMES")).split(','))
Collaborator

What is the significance of "CSVWITHNAMES" here? Just curious

Collaborator

Oh, is it just a way to get the number of columns?

Author

Yes, it's just to be able to get the number of columns.
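
As a standalone illustration of that trick (a hypothetical example; the solution does the same thing through its session connection):

```python
import chdb

# LIMIT 0 returns no data rows, but the CSVWithNames format still emits the
# header line, so splitting it on ',' gives the number of columns.
header = str(chdb.query("SELECT 1 AS id1, 2.5 AS v1 LIMIT 0", "CSVWithNames"))
nc = len(header.strip().split(','))
print(nc)  # 2
```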

import time
import sys

solution = str(sys.argv[1])
Collaborator

Does this file need to be included? Maybe it could go in utils, with the idea that all solutions can use it?

Author

This file should be unused and identical to monitor_ram.py in the polars directory, so I have deleted it.

Yes, since I ported the solution from the polars folder, this file was taken from there.
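
For context, a purely hypothetical sketch of what a RAM-monitoring helper like this typically looks like (not the actual file, which mirrored monitor_ram.py in the polars directory and has since been removed; psutil is an assumed dependency):

```python
import sys
import time

import psutil  # assumption: psutil is used for memory sampling

solution = str(sys.argv[1])

# Sample system memory usage at a fixed interval while the solution runs.
while True:
    used_gib = psutil.virtual_memory().used / 1024**3
    print(f"{solution}: {used_gib:.2f} GiB used", flush=True)
    time.sleep(5)
```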

@wudidapaopao
Author

Great thanks! Just some comments. You can remove your changes to time.csv and logs.csv. I'll run the benchmark again when this gets merged 👍

Thank you very much. I have reverted the changes made to time.csv and logs.csv.

@cyrusmsk

Nice timing for the new version of the benchmark to compare with the newest DuckDB 1.4.0

@Tmonster
Collaborator

@wudidapaopao seems like the regression test is failing for chdb? Can you take a look? The conflict is just that I removed pydatatable from the regression runner.

The failing Julia solutions are known. Eventually I will fix those as well.

@wudidapaopao
Author

wudidapaopao commented Oct 9, 2025

@Tmonster I've resolved the code conflicts and attempted to fix the failing regression tests, with chk output added. Since I couldn't reproduce the issue locally, would it be convenient to re-trigger the regression testing? I'm hoping this fix will work.

@Tmonster
Collaborator

Awesome thanks!

@Tmonster Tmonster merged commit 3da7318 into duckdblabs:main Oct 12, 2025
13 checks passed