Add support for arrow stream #265

Open
djouallah opened this issue Sep 8, 2024 · 5 comments

@djouallah
First, congratulations on the progress you've made: chDB is substantially better than it was just six months ago. I am trying to read a folder of CSV files and export it to Delta. Currently I am using `df = sess.sql(sql, "ArrowTable")` to transfer the data to the deltalake Python package, but I am getting OOM errors. It would be nice if you could add support for Arrow RecordBatch so the transfer is done in smaller batches.

Thanks.
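For concreteness, the request is roughly the following shape. This is a hypothetical sketch, not an existing chDB API; the `"ArrowStream"` format name comes from later in this thread, and the `RecordBatchReader`-like behavior is assumed:

```python
# Hypothetical: consume the result as a stream of pyarrow.RecordBatch
# objects instead of one fully materialized ArrowTable.
reader = sess.sql(sql, "ArrowStream")  # assumed to act like a pyarrow.RecordBatchReader
for batch in reader:                   # one pyarrow.RecordBatch at a time
    write_chunk(batch)                 # hypothetical helper: append the chunk to Delta
```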

@djouallah (Author)

@auxten, how do you get a schema when using this?

```python
df = sess.sql(sql, "ArrowStream")
write_deltalake(f"/lakehouse/default/Tables/T{total_files}/chdb", df,
                mode="append", partition_by=["year"],
                storage_options=storage_options)
```

@auxten (Member) commented Sep 10, 2024

I understand that what you’re trying to do is retrieve the output schema and then stream the data into Delta Lake.

  1. Regarding retrieving the schema: I believe it can be obtained by setting the output format to JSON, ArrowTable, DataFrame, etc. For large data volumes, however, a LIMIT should be applied (see the sketch after this list).
  2. Currently, chDB's implementation requires loading the entire dataset into memory before any further processing, which can lead to an OOM (out-of-memory) error with large data volumes. This is a point that needs improvement, and I will schedule it for future development.
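A minimal sketch of the LIMIT trick from point 1, reusing the `sess.sql(sql, "ArrowTable")` call shown earlier in this thread (the `LIMIT 0` subquery wrapper is my own illustration):

```python
# Fetch zero rows but the full column layout, then read the schema
# from the resulting pyarrow.Table.
schema_only = sess.sql(f"SELECT * FROM ({sql}) LIMIT 0", "ArrowTable")
schema = schema_only.schema  # pyarrow.Schema with column names and types
print(schema)
```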

@djouallah (Author)

I added chDB to my ETL benchmarks; feel free to have a look in case I am doing something terribly wrong:
https://github.com/djouallah/Fabric_Notebooks_Demo/blob/main/ETL/Light_ETL_Python_Notebook.ipynb

@auxten auxten moved this to Done in chDB 2024 Q4 Sep 29, 2024
@auxten auxten closed this as completed by moving to Done in chDB 2024 Q4 Sep 29, 2024
@auxten auxten reopened this Sep 29, 2024
@auxten auxten self-assigned this Sep 29, 2024
@auxten auxten changed the title add support for arrow stream Add support for arrow stream Sep 29, 2024

@ViggoC commented Dec 15, 2024

@auxten If I understand correctly, clickhouse-local can load and process data in a streaming style, but chDB collects data from clickhouse-local in a batch style?

@auxten (Member) commented Dec 15, 2024

> @auxten If I understand correctly, clickhouse-local can load and process data in a streaming style, but chDB collects data from clickhouse-local in a batch style?

Yes, you are partially right. On the input side, chDB does exactly the same as clickhouse-local: it reads data from files, HTTP, or S3 in streaming as well as random-access mode. But on the output side, the data is written in batch style. This is what we need to improve.
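Until streamed output is implemented, one possible workaround (my own sketch, not an official chDB or deltalake recommendation) is to page through the result with LIMIT/OFFSET so only one chunk is materialized at a time, and feed the chunks to `write_deltalake` as a `pyarrow.RecordBatchReader`:

```python
import pyarrow as pa
from deltalake import write_deltalake

CHUNK = 1_000_000  # rows per page; tune to available memory

def paged_batches(sess, sql, chunk=CHUNK):
    # Caveat: OFFSET paging re-runs the query for every page and needs a
    # deterministic ORDER BY inside `sql` to avoid missed or duplicated rows.
    offset = 0
    while True:
        tbl = sess.sql(f"SELECT * FROM ({sql}) LIMIT {chunk} OFFSET {offset}",
                       "ArrowTable")
        if tbl.num_rows == 0:
            return
        yield from tbl.to_batches()
        offset += chunk

# Schema via the LIMIT 0 trick discussed above.
schema = sess.sql(f"SELECT * FROM ({sql}) LIMIT 0", "ArrowTable").schema
reader = pa.RecordBatchReader.from_batches(schema, paged_batches(sess, sql))
write_deltalake("/lakehouse/default/Tables/out", reader, mode="append")
```

Note: `/lakehouse/default/Tables/out` is a placeholder path; substitute your own target table.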
