Skip to content

Conversation

AlexanderSinn
Copy link
Member

@AlexanderSinn AlexanderSinn commented Apr 27, 2025

Summary

Functionality is added to Gpu::Device and CArena to wait until the next stream sync before deallocating memory and to avoid double syncs.

Additional background

Currently, there is a lot of mixing/confusion of what CArena, Device and StreamManager are each meant to do.
If delay_memory_free_until_sync is true, sync_before_memory_free has no effect.
In the future there could be a single-stream no sync mode for (non host-accessible) device memory.

Checklist

The proposed changes:

  • fix a bug or incorrect behavior in AMReX
  • add new capabilities to AMReX
  • changes answers in the test suite to more than roundoff level
  • are likely to significantly affect the results of downstream AMReX users
  • include documentation in the code and/or rst files, if appropriate

@WeiqunZhang
Copy link
Member

I am worried about the complexity. It's always hard to reason about threading. So I might have missed something. Suppose there are two threads. Both call gpuStream() and obtain the same stream. Then both launch a gpu kernel and call Gpu::streamSynchronize(). To thread 1, let's suppose what it sees are (1) thread 1 modifies with m_stream_op_id; (2) thread 1 launches a gpu kernel; (3) thread 1 calls Gpu::streamSynchronize() that sees m_stream_op_id modified by thread 0 and eventually calls cudaStreamSynchronize and sets m_last_sync to the latest m_stream_op_id. To thread 0, let's suppose what it sees are (1) thread 1 modifies with m_stream_op_id; (2) thread 0 modifies m_stream_op_id; (3) thread 0 launches a gpu kernel that happens before cudaStreamSynchronize from thread 1; (4) thread 0 calls Gpu::streamSynchronize that does nothing because it sees m_last_sync == m_stream_op_id. In the end thread 0's kernel is not synced.

@AlexanderSinn
Copy link
Member Author

It is indeed super complicated when used with multiple threads. In that specific example, since the kernel launch from thread 0 happens before cudaStreamSynchronize is called from thread 1, the single stream sync would sync both kernels. However, I now notice a flaw if the thread 0 kernel launch happens after thread 1 calls cudaStreamSynchronize. This would be strange since thread 0 updated m_stream_op_id before thread 1 called Gpu::streamSynchronize() and thread 0 should not really be doing anything that takes time between updating m_stream_op_id and launching the kernel, however it is technically possible and would result in the kernel from thread 0 to not be synced.

@WeiqunZhang
Copy link
Member

since the kernel launch from thread 0 happens before cudaStreamSynchronize is called from thread 1

I meant after.

@atmyers
Copy link
Member

atmyers commented Jun 11, 2025

Hi @AlexanderSinn - do you want to experiment with a different approach, or can we close this?

@AlexanderSinn
Copy link
Member Author

Yes I am still working on this. Next I will try to give each stream an array with one bool per omp thread to store if the stream is synced.

@WeiqunZhang
Copy link
Member

Could you show some performance data comparing the development branch with this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants