
Memory and parallelism tuning #230

Open
jamessmith123456 opened this issue Apr 14, 2023 · 4 comments

Comments

@jamessmith123456

(1) It seems that memory issues cannot be avoided when there is a large amount of data.
(2) If the parallelism is 20, will the original data be copied 20 times?
(3) How should I balance memory and CPU to choose optimal parameters, please?

@nalepae
Owner

nalepae commented May 10, 2023

(1): Pandarallel basically doubles the amount of memory needed, as stated in the documentation:

pandarallel gets around this limitation by using all cores of your computer. But, in return, pandarallel needs twice the memory that a standard pandas operation would normally use.

(2): No, the original data will be copied only once, whatever the parallelism.

(3): There is no coordination relationship between CPU and memory (cf (2))
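Putting points (1) and (2) together: peak memory is roughly twice the dataframe's size, regardless of the worker count. A minimal sketch of that rule of thumb (the helper name `estimated_peak_gb` is hypothetical, not part of pandarallel's API):

```python
def estimated_peak_gb(df_gb: float, nb_workers: int = 20) -> float:
    """Rough peak-memory estimate when using pandarallel.

    Per the documentation quoted above, pandarallel needs about
    twice the memory of a plain pandas operation. The data is
    copied only once, so nb_workers does not enter the estimate.
    """
    return 2.0 * df_gb

# A 100 GB dataframe needs roughly 200 GB, whether you use 4 or 20 workers.
print(estimated_peak_gb(100.0, nb_workers=4))   # 200.0
print(estimated_peak_gb(100.0, nb_workers=20))  # 200.0
```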

@SysuJayce

Hi @nalepae, if the amount of data is quite large, how can we speed up the preparation before apply()?

If I have 100 GB of data loaded in memory, I have to wait a long time before the apply starts.

@nalepae
Owner

nalepae commented Jan 23, 2024

Pandaral·lel is looking for a maintainer!
If you are interested, please open a GitHub issue.

@shermansiu

@SysuJayce, what do you mean by "boosting the preparation"?

If you are memory-bound, I would suggest breaking up your dataframe into smaller shards and applying your function to each shard.
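The sharding idea above can be sketched with plain pandas and NumPy (the helper `apply_in_shards` is hypothetical, just an illustration of the approach, not part of pandarallel):

```python
import numpy as np
import pandas as pd

def apply_in_shards(df: pd.DataFrame, func, n_shards: int = 10) -> pd.DataFrame:
    """Apply `func` to each shard separately and reassemble the result.

    Only one shard (plus its copy) needs to be in flight at a time,
    which keeps peak memory well below processing the whole frame at once.
    """
    shards = np.array_split(df, n_shards)
    return pd.concat(func(shard) for shard in shards)

df = pd.DataFrame({"x": range(100)})
result = apply_in_shards(df, lambda s: s.assign(y=s["x"] * 2), n_shards=4)
```

Inside `func` you could still call pandarallel's parallel_apply on each shard, trading some scheduling overhead for a much smaller memory footprint per step.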

Do you have any other problems? If not, I would like to close this issue.
