Add support for arrow stream #265

Open
djouallah opened this issue Sep 8, 2024 · 5 comments

@djouallah
First, congratulations on the progress you've made: chDB is substantially better than it was just six months ago. I am trying to read a folder of CSV files and export it to Delta. Currently I am using `df = sess.sql(sql, "ArrowTable")` to transfer the data to the deltalake Python package, but I am getting OOM errors. It would be nice if you could add support for Arrow RecordBatch so the transfer is done in smaller batches.

Thanks.
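For concreteness, the request is roughly the following shape. This is a hypothetical sketch, not an existing chDB API; the `"ArrowStream"` format name comes from later in this thread, and the `RecordBatchReader`-like behavior is assumed:

```python
# Hypothetical: consume the result as a stream of pyarrow.RecordBatch
# objects instead of one fully materialized ArrowTable.
reader = sess.sql(sql, "ArrowStream")  # assumed to act like a pyarrow.RecordBatchReader
for batch in reader:                   # one pyarrow.RecordBatch at a time
    write_chunk(batch)                 # hypothetical helper: append the chunk to Delta
```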

@djouallah (Author)

@auxten, how do you get a schema when using this?

```python
df = sess.sql(sql, "ArrowStream")
write_deltalake(f"/lakehouse/default/Tables/T{total_files}/chdb", df,
                mode="append", partition_by=["year"],
                storage_options=storage_options)
```

@auxten (Member) commented Sep 10, 2024

I understand that what you’re trying to do is retrieve the output schema and then stream the data into Delta Lake.

  1. Regarding retrieving the schema: I believe it can be obtained by setting the output format to JSON, ArrowTable, DataFrame, etc. For large data volumes, however, a LIMIT should be applied (see the sketch after this list).
  2. Currently, chDB's implementation requires loading the entire dataset into memory before any further processing, which can lead to an OOM (out-of-memory) error with large data volumes. This is a point that needs improvement, and I will schedule it for future development.
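A minimal sketch of the LIMIT trick from point 1, reusing the `sess.sql(sql, "ArrowTable")` call shown earlier in this thread (the `LIMIT 0` subquery wrapper is my own illustration):

```python
# Fetch zero rows but the full column layout, then read the schema
# from the resulting pyarrow.Table.
schema_only = sess.sql(f"SELECT * FROM ({sql}) LIMIT 0", "ArrowTable")
schema = schema_only.schema  # pyarrow.Schema with column names and types
print(schema)
```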

@djouallah (Author)

I added chDB to my ETL benchmarks; feel free to have a look in case I am doing something terribly wrong:
https://github.com/djouallah/Fabric_Notebooks_Demo/blob/main/ETL/Light_ETL_Python_Notebook.ipynb

@auxten auxten moved this to Done in chDB 2024 Q4 Sep 29, 2024
@auxten auxten closed this as completed by moving to Done in chDB 2024 Q4 Sep 29, 2024
@auxten auxten reopened this Sep 29, 2024
@auxten auxten self-assigned this Sep 29, 2024
@auxten auxten changed the title add support for arrow stream Add support for arrow stream Sep 29, 2024

@ViggoC commented Dec 15, 2024

@auxten If I understand correctly, clickhouse-local can load and process data in a streaming style, but chDB collects data from clickhouse-local in a batch style?

@auxten (Member) commented Dec 15, 2024

> @auxten If I understand correctly, clickhouse-local can load and process data in a streaming style, but chDB collects data from clickhouse-local in a batch style?

Yes, you are partially right. On the input side, chDB does exactly the same as clickhouse-local: it reads data from files, HTTP, or S3 in streaming as well as random-access mode. But on the output side, the data is written in batch style. This is what we need to improve.
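Until streamed output is implemented, one possible workaround (my own sketch, not an official chDB or deltalake recommendation) is to page through the result with LIMIT/OFFSET so only one chunk is materialized at a time, and feed the chunks to `write_deltalake` as a `pyarrow.RecordBatchReader`:

```python
import pyarrow as pa
from deltalake import write_deltalake

CHUNK = 1_000_000  # rows per page; tune to available memory

def paged_batches(sess, sql, chunk=CHUNK):
    # Caveat: OFFSET paging re-runs the query for every page and needs a
    # deterministic ORDER BY inside `sql` to avoid missed or duplicated rows.
    offset = 0
    while True:
        tbl = sess.sql(f"SELECT * FROM ({sql}) LIMIT {chunk} OFFSET {offset}",
                       "ArrowTable")
        if tbl.num_rows == 0:
            return
        yield from tbl.to_batches()
        offset += chunk

# Schema via the LIMIT 0 trick discussed above.
schema = sess.sql(f"SELECT * FROM ({sql}) LIMIT 0", "ArrowTable").schema
reader = pa.RecordBatchReader.from_batches(schema, paged_batches(sess, sql))
write_deltalake("/lakehouse/default/Tables/out", reader, mode="append")
```

Note: `/lakehouse/default/Tables/out` is a placeholder path; substitute your own target table.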
