Remove need to sync Gpu stream before deallocating memory #4432
I am worried about the complexity. It's always hard to reason about threading. So I might have missed something. Suppose there are two threads. Both call |
It is indeed super complicated when used with multiple threads. In that specific example, since the kernel launch from thread 0 happens before cudaStreamSynchronize is called from thread 1, the single stream sync would sync both kernels. However, I now notice a flaw if the thread 0 kernel launch happens after thread 1 calls cudaStreamSynchronize. This would be strange, since thread 0 updated m_stream_op_id before thread 1 called Gpu::streamSynchronize(), and thread 0 should not really be doing anything that takes time between updating m_stream_op_id and launching the kernel. However, it is technically possible and would result in the kernel from thread 0 not being synced. |
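The ordering problem described in this comment can be replayed deterministically in a small single-threaded model. This is only a sketch of one possible reading of the scheme, not the PR's code: `StreamModel`, `last_sync_op_id`, `announce_launch` and the two helper functions are hypothetical names, and only `m_stream_op_id` (modelled here as `stream_op_id`) comes from the discussion.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model: a stream that skips a sync when its operation counter
// has not changed since the last sync. Not the actual AMReX implementation.
struct StreamModel {
    std::uint64_t stream_op_id = 0;    // bumped when a thread announces a launch
    std::uint64_t last_sync_op_id = 0; // counter value recorded by the last real sync
    int kernels_in_flight = 0;         // kernels submitted but not yet waited on
    int real_syncs = 0;                // number of syncs that were not skipped

    void announce_launch() { ++stream_op_id; }
    void launch_kernel() { ++kernels_in_flight; }

    void synchronize() {
        assert(last_sync_op_id <= stream_op_id);
        if (last_sync_op_id == stream_op_id) { return; } // nothing new, skip
        kernels_in_flight = 0;          // stand-in for cudaStreamSynchronize
        ++real_syncs;
        last_sync_op_id = stream_op_id;
    }
};

// Expected ordering: the launch follows the counter update immediately.
inline int unsynced_after_safe_order() {
    StreamModel s;
    s.announce_launch(); // thread 0 updates the counter
    s.launch_kernel();   // thread 0 launches
    s.synchronize();     // thread 1 syncs and waits for the kernel
    return s.kernels_in_flight;
}

// Problematic ordering: another thread syncs between update and launch.
inline int unsynced_after_racy_order() {
    StreamModel s;
    s.announce_launch(); // thread 0 updates the counter
    s.synchronize();     // thread 1 syncs before the kernel exists
    s.launch_kernel();   // thread 0 launches late
    s.synchronize();     // skipped: the counter matches the last sync
    return s.kernels_in_flight;
}
```

In this model the second ordering leaves one kernel that no later sync waits for, which matches the flaw described above.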
I meant after. |
Hi @AlexanderSinn - do you want to experiment with a different approach, or can we close this? |
Yes, I am still working on this. Next I will try to give each stream an array with one bool per omp thread to record whether the stream is synced. |
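A minimal sketch of how such per-thread flags might look, assuming each thread only reads and writes its own entry. The type `PerThreadSyncFlags` and its member names are hypothetical, and the discussion does not say whether a sync by one thread should also affect the flags of other threads, so this version leaves them untouched.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-stream bookkeeping: one "synced" flag per thread.
// std::vector<char> is used instead of std::vector<bool> so that each flag
// is a separate object rather than a bit packed into a shared word.
struct PerThreadSyncFlags {
    std::vector<char> synced; // 1 = this thread has nothing unsynced on the stream
    int real_syncs = 0;       // demo counter only; not thread safe as written

    explicit PerThreadSyncFlags(std::size_t nthreads) : synced(nthreads, 1) {}

    // Called by thread `tid` right before it launches work on the stream.
    void on_launch(std::size_t tid) {
        assert(tid < synced.size());
        synced[tid] = 0;
    }

    // Called by thread `tid`; returns true if a real sync was performed.
    bool synchronize(std::size_t tid) {
        assert(tid < synced.size());
        if (synced[tid]) { return false; } // this thread has no pending work
        ++real_syncs;                      // stand-in for cudaStreamSynchronize
        synced[tid] = 1;
        return true;
    }
};
```

With two threads, a launch by thread 0 followed by `synchronize(1)` is skipped, `synchronize(0)` performs the sync, and a second `synchronize(0)` is skipped again. In an OpenMP build the index would presumably come from `omp_get_thread_num()`.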
Could you show some performance data comparing the development branch with this PR? |
Summary
Functionality is added to Gpu::Device and CArena to wait until the next stream sync before deallocating memory and to avoid double syncs.
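The idea of deferring deallocation until the next stream sync can be illustrated with a host-only model. This is a sketch under stated assumptions, not the code added to Gpu::Device or CArena: the class `DeferredFreeStream` and all of its members are hypothetical, `std::malloc`/`std::free` stand in for the arena, and a boolean stands in for the state of the GPU stream.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical host-only model of "free on next sync".
class DeferredFreeStream {
public:
    DeferredFreeStream() = default;
    DeferredFreeStream(const DeferredFreeStream&) = delete;
    DeferredFreeStream& operator=(const DeferredFreeStream&) = delete;
    ~DeferredFreeStream() { drain(); } // avoid leaking queued pointers

    void* allocate(std::size_t n) { return std::malloc(n); }

    // A kernel launch makes the stream unsynced.
    void launch() { m_synced = false; }

    // Instead of syncing and freeing now, queue the pointer.
    void free_after_sync(void* p) {
        assert(p != nullptr);
        m_pending.push_back(p);
    }

    // Sync only if needed, then release everything that was queued.
    void synchronize() {
        if (!m_synced) {
            ++m_syncs;       // stand-in for a real stream synchronization
            m_synced = true;
        }
        drain();
    }

    std::size_t pending() const { return m_pending.size(); }
    int freed() const { return m_freed; }
    int syncs() const { return m_syncs; }

private:
    void drain() {
        for (void* p : m_pending) { std::free(p); ++m_freed; }
        m_pending.clear();
    }

    std::vector<void*> m_pending; // memory waiting for the next sync
    bool m_synced = true;
    int m_freed = 0;
    int m_syncs = 0;
};
```

After `launch()` and `free_after_sync(ptr)`, the pointer stays queued until `synchronize()` runs. A second `synchronize()` with no launch in between does not count as another sync, which is the "avoid double syncs" part.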
Additional background
Currently, there is a lot of confusion about what CArena, Device and StreamManager are each meant to do.
If delay_memory_free_until_sync is true, sync_before_memory_free has no effect.
In the future there could be a single-stream no sync mode for (non host-accessible) device memory.
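The precedence between the two options mentioned above can be written as a small decision function. This is a sketch only: the enum `FreePolicy` and the function `choose_free_policy` are hypothetical, and the behavior when `delay_memory_free_until_sync` is false is inferred from the option names rather than stated in the description.

```cpp
#include <cassert>

// Hypothetical summary of how the two runtime options might combine.
enum class FreePolicy {
    DelayUntilSync, // queue the free until the next stream sync
    SyncThenFree,   // sync the stream, then free right away (assumed)
    FreeImmediately // free without syncing (assumed)
};

inline FreePolicy choose_free_policy(bool delay_memory_free_until_sync,
                                     bool sync_before_memory_free) {
    // The delay option wins: sync_before_memory_free has no effect when it is set.
    if (delay_memory_free_until_sync) { return FreePolicy::DelayUntilSync; }
    return sync_before_memory_free ? FreePolicy::SyncThenFree
                                   : FreePolicy::FreeImmediately;
}
```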