
How to parallelise efficiently? #3725

Open
fruce-ki opened this issue Feb 28, 2025 · 6 comments
Labels
question General question regarding SI

Comments

@fruce-ki

Hello!

I have a large backlog of data (many thousands of Maxwell recordings), so I want to take maximum advantage of the computer's capacity. I have a workstation with 20 physical Intel cores (i.e. 40 logical cores) and about 320 GB of RAM.

Using the built-in parallelization option

global_job_kwargs = dict(n_jobs = 36, total_memory = '300G', mp_context = 'spawn', progress_bar = False)
si.set_global_job_kwargs(**global_job_kwargs)

does not make use of the provided capacity. Preprocessing, for example, uses at most about 5% of the CPU capacity and 20% of the RAM. Clearly there are bottlenecks in parallelizing the processing of a single electrode array.

So instead of preprocessing one recording at a time with many cores, I tried processing many recordings at once with one core each, by setting n_jobs=1 and using Parallel() from the joblib package. This makes very good use of the CPU capacity, but it inevitably causes a crash sooner or later.
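
Roughly, the joblib workaround looks like this (just a sketch: preprocess_one, the filter settings, and the path lists stand in for my actual pipeline):

from joblib import Parallel, delayed
import spikeinterface.full as si

def preprocess_one(rec_path, out_path):
    # each worker handles one whole recording, single-threaded internally
    rec = si.read_maxwell(rec_path)
    rec = si.bandpass_filter(rec, freq_min=300, freq_max=6000)
    rec = si.common_reference(rec, operator="median")
    rec.save(folder=out_path, n_jobs=1)

# one process per recording instead of many processes per recording
Parallel(n_jobs=36)(delayed(preprocess_one)(p, o) for p, o in zip(rec_paths, out_paths))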

For the actual sorting step I just use the provided run_sorter_jobs(engine_kwargs={'n_jobs': 36}) with each job-list item having 'job_kwargs': {'n_jobs': 1}, and this at least does use the available capacity well.
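
For context, the job list I pass looks roughly like this (the sorter name and folder names are placeholders; each dict is forwarded to run_sorter()):

import spikeinterface.full as si

job_list = [
    {
        "sorter_name": "spykingcircus2",   # placeholder sorter
        "recording": rec,                  # one preprocessed recording per job
        "folder": f"sorting_{i}",
        "job_kwargs": {"n_jobs": 1},       # keep each sorter single-threaded
    }
    for i, rec in enumerate(recordings)
]

si.run_sorter_jobs(job_list, engine="joblib", engine_kwargs={"n_jobs": 36})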

I am not a Python expert nor a computer scientist, and I don't know the internal design and limitations of spikeinterface, so I am not getting anywhere trying to troubleshoot this by myself.

My hope is that you might have some insight on how I could get through preprocessing and postprocessing a very large number of recordings more quickly. Thank you in advance!

@b-grimaud
Contributor

Is your data stored on an SSD or an HDD? If preprocessing isn't using as many resources as you allow it to, your bottleneck might be your disk's read speed.

The same goes for manually using joblib to iterate over multiple files: your processes will try to access many files at once, but the maximum access speed will remain the same.

@alejoe91 added the question (General question regarding SI) label on Mar 3, 2025
@fruce-ki
Author

fruce-ki commented Mar 3, 2025

The files are staged to an SSD.

The reason I am asking is that joblib does maximise CPU usage, so there is untapped potential there. My problem with using joblib on top of spikeinterface functions is that the child processes die, due to what look like thread-management issues and memory leaks/overflows.

@samuelgarcia
Member

Hi,

Using run_sorter_jobs() with the joblib engine is not a great idea. It was implemented a long time ago and, in my opinion, should be removed.
On the spikeinterface side we cannot control how the sorter spawns processes, uses threads, or uses the GPU.
So using joblib as the engine to run several workers on the same machine will lead to nested parallelization (and you want to avoid this).

run_sorter_jobs() is more useful with SLURM, for instance.

For pre/post-processing, our internal parallelization mechanism is quite flexible and should be efficient.
In short, with a high CPU count you should be able to saturate your disk's read/write bandwidth, especially if it is a network drive.

Also be aware that:

  • recent versions also have pool_engine="process"/"thread" in job_kwargs
  • you can also play with max_threads_per_worker in case you are using processes (see the second snippet below)
  • the process pool works better on Linux than on Windows; spawn has a high startup cost.
global_job_kwargs = dict(pool_engine="process", n_jobs = 36, total_memory = '300G', mp_context = 'spawn', progress_bar = False)
si.set_global_job_kwargs(**global_job_kwargs)
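
And with max_threads_per_worker, a sketch (these kwarg names are from recent versions, adapt to your install):

global_job_kwargs = dict(
    pool_engine="process",       # or "thread"
    n_jobs=36,
    max_threads_per_worker=1,    # avoid oversubscription when workers are processes
    mp_context="spawn",
    progress_bar=False,
)
si.set_global_job_kwargs(**global_job_kwargs)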

@alejoe91
Member

alejoe91 commented Mar 5, 2025

@samuelgarcia I agree with you on avoiding joblib. But in principle using total_memory="300G" should get the RAM usage close to the requested amount. Internally, this works by setting the chunk size so that chunk_size * n_jobs ~ total_memory.

If that is not the case, we might have a bug.
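
As a rough illustration of that arithmetic with your settings (expected sizes only, not measured values):

# total_memory="300G" with n_jobs=36: the chunk size is chosen so that
# chunk_bytes * n_jobs is about 300 GB
total_bytes = 300 * 1024**3
n_jobs = 36
chunk_bytes = total_bytes // n_jobs   # ~8.9e9 bytes, i.e. ~8.3 GiB of traces per worker per chunk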

Can you try, instead of using total_memory, setting n_jobs and chunk_duration?
For example:

global_job_kwargs = dict(pool_engine="process", n_jobs = 36, chunk_duration = '10s', mp_context = 'spawn', progress_bar = False)
si.set_global_job_kwargs(**global_job_kwargs)

This should use about 10 times the default RAM, since the default chunk duration is 1 s.
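
Back-of-the-envelope, assuming roughly 1024 channels at 20 kHz in float32 (adjust to your actual Maxwell recordings):

chunk_duration_s = 10
sampling_rate = 20_000
num_channels = 1024
bytes_per_sample = 4  # float32 after filtering

chunk_bytes = chunk_duration_s * sampling_rate * num_channels * bytes_per_sample
# ~0.82 GB per worker per chunk, so ~29.5 GB in flight across 36 workers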

@fruce-ki
Author

fruce-ki commented Mar 5, 2025

At the moment I have decoupled preprocessing, sorting, and postprocessing from one another, to run them independently instead of all in one go. Let's focus on just preprocessing. run_sorter_jobs() works well in terms of CPU capacity on its own anyway, so I am not inclined to wrap it in joblib.

And just to be clear, this is a single workstation. All processing and storage are local to the machine, and it is a Linux machine.

pool_engine="process" is not a recognised kwarg in the version I have installed (0.101.2).

Filling memory capacity does not seem to be a problem. In fact, the opposite is true: usage exceeds what I expect, by a lot.

I was monitoring a run preprocessing 20 wells wrapped in joblib, with 1 core and 12 GB each, so I expected 50% CPU usage and 240 GB (20 × 12) of RAM. But the actual RAM usage peaked repeatedly at 320 GB (100%).

I tried a run without the joblib workaround for a single well: n_jobs=30, mp_context='spawn', max_threads_per_process=1, chunk_memory='1G', total_memory=None.
The resource monitor clocked about 50% CPU usage, but it did use the 30 cores (just not at 100%) after an initial single-core phase. So I think it works as you intended, minus some bottleneck steps.
By contrast, I expected about 30 GB of RAM usage (a bit under 10%), but on the resource monitor it slowly climbed all the way to 30%.

Something there is not right with the memory usage. It consumes a lot more memory than I allocate to it, which might be why I experience so many crashes every time I try to scale up my processing.

@fruce-ki
Author

fruce-ki commented Mar 5, 2025

Oh, I just discovered that using less chunk_memory results in more cores being used. If I assign 10G per chunk, maybe 5 of the 30 assigned cores are active. If I assign 500M instead, all 30 cores come into play properly. Interesting. Maybe I really need to rethink how much memory is worth allocating. (The RAM footprint is still much larger than the parameters would indicate.)
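
For reference, this is roughly the configuration that finally kept all 30 workers busy in my test (just the values I tried, not a recommendation):

job_kwargs = dict(
    n_jobs=30,
    mp_context="spawn",
    max_threads_per_process=1,   # kwarg name in 0.101.2
    chunk_memory="500M",         # smaller chunks keep all 30 workers active
    total_memory=None,
    progress_bar=False,
)
si.set_global_job_kwargs(**job_kwargs)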
