Remove need to sync Gpu stream before deallocating memory #4432

AlexanderSinn · 2025-04-27T17:23:43Z

Summary

This PR adds functionality to Gpu::Device and CArena to defer memory deallocation until the next stream sync and to avoid redundant (double) syncs.
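As a rough illustration of the idea in the summary, the following is a minimal host-only sketch of deferring frees until the next stream sync. `DeferredFreeArena`, `free_deferred`, `on_stream_sync` and `num_pending` are hypothetical names chosen for this example; they are not the CArena or Gpu::Device interface, and `std::malloc`/`std::free` stand in for device allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical stand-in for an arena; NOT AMReX's CArena.
class DeferredFreeArena {
public:
    DeferredFreeArena () = default;
    DeferredFreeArena (const DeferredFreeArena&) = delete;
    DeferredFreeArena& operator= (const DeferredFreeArena&) = delete;

    void* alloc (std::size_t nbytes) { return std::malloc(nbytes); }

    // Instead of syncing the stream and freeing right away,
    // park the pointer until the next sync happens anyway.
    void free_deferred (void* p) {
        if (p != nullptr) { m_pending.push_back(p); }
    }

    // Assumed to be called right after a stream synchronization:
    // any kernel that could have used the pending blocks has finished.
    // Returns the number of blocks released.
    std::size_t on_stream_sync () {
        const std::size_t n = m_pending.size();
        for (void* p : m_pending) { std::free(p); }
        m_pending.clear();
        return n;
    }

    std::size_t num_pending () const { return m_pending.size(); }

    // Release anything still pending when the arena goes away.
    ~DeferredFreeArena () { on_stream_sync(); }

private:
    std::vector<void*> m_pending; // blocks waiting for the next sync
};
```

In this sketch a free never forces a sync of its own; the cost is that memory stays reserved until the next sync occurs.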

Additional background

Currently, there is a lot of overlap and confusion about what CArena, Device and StreamManager are each meant to do.
If delay_memory_free_until_sync is true, sync_before_memory_free has no effect.
In the future, there could be a single-stream, no-sync mode for device memory that is not host-accessible.
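The precedence between the two options can be sketched as a small decision function. The option names come from the description above; the `FreePolicy` enum, the `choose_free_policy` helper and the behavior when both options are false are assumptions made for illustration only.

```cpp
#include <cassert>

// Hypothetical policy names, for illustration only.
enum class FreePolicy {
    DeferUntilNextSync, // park the block, release it at the next stream sync
    SyncThenFree,       // sync the stream, then free immediately
    FreeWithoutSync     // ASSUMED behavior when both options are false
};

// delay_memory_free_until_sync takes precedence: when it is true,
// the value of sync_before_memory_free is never consulted.
inline FreePolicy choose_free_policy (bool delay_memory_free_until_sync,
                                      bool sync_before_memory_free)
{
    if (delay_memory_free_until_sync) {
        return FreePolicy::DeferUntilNextSync;
    }
    return sync_before_memory_free ? FreePolicy::SyncThenFree
                                   : FreePolicy::FreeWithoutSync;
}
```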

Checklist

The proposed changes:

fix a bug or incorrect behavior in AMReX
add new capabilities to AMReX
changes answers in the test suite to more than roundoff level
are likely to significantly affect the results of downstream AMReX users
include documentation in the code and/or rst files, if appropriate

WeiqunZhang · 2025-06-04T16:55:52Z

I am worried about the complexity. It's always hard to reason about threading, so I might have missed something. Suppose there are two threads, and both call gpuStream() and obtain the same stream. Then both launch a GPU kernel and call Gpu::streamSynchronize().

Suppose that what thread 1 sees is: (1) thread 1 modifies m_stream_op_id; (2) thread 1 launches a GPU kernel; (3) thread 1 calls Gpu::streamSynchronize(), which sees m_stream_op_id as modified by thread 0, eventually calls cudaStreamSynchronize, and sets m_last_sync to the latest m_stream_op_id.

Suppose that what thread 0 sees is: (1) thread 1 modifies m_stream_op_id; (2) thread 0 modifies m_stream_op_id; (3) thread 0 launches a GPU kernel that happens before cudaStreamSynchronize from thread 1; (4) thread 0 calls Gpu::streamSynchronize(), which does nothing because it sees m_last_sync == m_stream_op_id. In the end, thread 0's kernel is not synced.

AlexanderSinn · 2025-06-04T17:37:25Z

It is indeed super complicated when used with multiple threads. In that specific example, since the kernel launch from thread 0 happens before cudaStreamSynchronize is called from thread 1, the single stream sync would sync both kernels. However, I now notice a flaw if the thread 0 kernel launch happens after thread 1 calls cudaStreamSynchronize. This would be strange, since thread 0 updated m_stream_op_id before thread 1 called Gpu::streamSynchronize() and thread 0 should not really be doing anything that takes time between updating m_stream_op_id and launching the kernel. However, it is technically possible and would result in the kernel from thread 0 not being synced.

WeiqunZhang · 2025-06-04T17:50:41Z

since the kernel launch from thread 0 happens before cudaStreamSynchronize is called from thread 1

I meant after.
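The corrected interleaving can be replayed deterministically in a single-threaded model. This is only a sketch: `StreamModel` and `replay_race` are hypothetical, the fields `op_id` and `last_sync` merely play the roles of m_stream_op_id and m_last_sync, and clearing a vector stands in for cudaStreamSynchronize. How the PR actually updates these counters is not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of one stream with op-id bookkeeping.
struct StreamModel {
    unsigned long op_id = 0;     // plays the role of m_stream_op_id
    unsigned long last_sync = 0; // plays the role of m_last_sync
    std::vector<int> in_flight;  // kernels launched but not yet waited on
    int real_syncs = 0;          // number of simulated real syncs

    void bump_op_id () { ++op_id; }
    void launch (int kernel_owner) { in_flight.push_back(kernel_owner); }

    // Models a streamSynchronize that skips the real sync when the
    // bookkeeping claims nothing new happened. Returns true if it synced.
    bool stream_synchronize () {
        if (last_sync == op_id) { return false; }
        in_flight.clear(); // stands in for cudaStreamSynchronize
        ++real_syncs;
        last_sync = op_id;
        return true;
    }
};

// Replays the ordering from the discussion; returns how many kernels
// are still unsynced after both threads called stream_synchronize().
inline std::size_t replay_race () {
    StreamModel s;
    s.bump_op_id();         // thread 1 modifies the op id
    s.bump_op_id();         // thread 0 modifies the op id
    s.launch(1);            // thread 1 launches its kernel
    s.stream_synchronize(); // thread 1: real sync, last_sync == op_id
    s.launch(0);            // thread 0 launches AFTER that sync
    s.stream_synchronize(); // thread 0: skipped, counters already match
    return s.in_flight.size();
}
```

In this model `replay_race()` leaves one kernel in flight, which matches the unsynced thread 0 kernel described above.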

atmyers · 2025-06-11T17:26:48Z

Hi @AlexanderSinn - do you want to experiment with a different approach, or can we close this?

AlexanderSinn · 2025-06-11T20:22:23Z

Yes, I am still working on this. Next, I will try giving each stream an array with one bool per OpenMP thread to record whether the stream is synced.
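A minimal sketch of that per-thread flag idea, again as a single-threaded model, might look like the following. `PerThreadSyncState`, `mark_dirty`, `launch`, `synchronize` and `replay_with_flags` are hypothetical names, thread ids are plain indices rather than OpenMP calls, and the choice that a sync only sets the calling thread's own flag is an assumption of this example, not something stated in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-stream state: one "is synced" flag per thread.
struct PerThreadSyncState {
    explicit PerThreadSyncState (std::size_t nthreads)
        : is_synced(nthreads, 1) {}

    // A thread announces that it is about to use the stream.
    void mark_dirty (std::size_t tid) { is_synced[tid] = 0; }

    void launch () { ++pending; }

    // Skips the real sync only if THIS thread has nothing outstanding.
    // ASSUMPTION: a sync sets only the caller's flag, which is
    // conservative and may cause an extra sync on another thread.
    bool synchronize (std::size_t tid) {
        if (is_synced[tid]) { return false; }
        pending = 0; // stands in for cudaStreamSynchronize
        ++real_syncs;
        is_synced[tid] = 1;
        return true;
    }

    std::vector<char> is_synced; // char avoids vector<bool> bit packing
    int pending = 0;             // kernels launched but not yet waited on
    int real_syncs = 0;
};

// Same ordering as the problematic case; returns kernels left unsynced.
inline int replay_with_flags () {
    PerThreadSyncState s(2);
    s.mark_dirty(1);  // thread 1 announces work
    s.mark_dirty(0);  // thread 0 announces work
    s.launch();       // thread 1 launches its kernel
    s.synchronize(1); // thread 1: real sync, only flag 1 is set
    s.launch();       // thread 0 launches AFTER that sync
    s.synchronize(0); // thread 0: flag 0 is still unset, so it syncs
    return s.pending;
}
```

Under these assumptions the late kernel from thread 0 is covered, at the cost of a second real sync.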

WeiqunZhang · 2025-08-08T00:55:57Z

Could you show some performance data comparing the development branch with this PR?

AlexanderSinn and others added 12 commits April 27, 2025 19:22

Remove need to sync Gpu stream before dealloc

da47055

fix

4872a96

Merge branch 'development' into Remove_need_to_sync_Gpu_stream_before…

c46b805

…_dealloc

Merge branch 'development' into Remove_need_to_sync_Gpu_stream_before…

23717df

…_dealloc

more robust threading and better streamSync error

0b2c241

fix

b76a1a3

fix 2

705f6cc

fix 3

b3e3950

Merge branch 'AMReX-Codes:development' into Remove_need_to_sync_Gpu_s…

179403e

…tream_before_dealloc

Merge branch 'AMReX-Codes:development' into Remove_need_to_sync_Gpu_s…

37dbda6

…tream_before_dealloc

add runtime options

860219f

Merge branch 'AMReX-Codes:development' into Remove_need_to_sync_Gpu_s…

108c782

…tream_before_dealloc

AlexanderSinn added 2 commits June 16, 2025 19:05

use vector over omp threads for is_synced

a21f65b

use constructor

4174b08
