
Conversation

@AlexanderSinn (Member)

Summary

This PR makes the 6 GPU memory copy functions

htod_memcpy_async
dtoh_memcpy_async
dtod_memcpy_async
htod_memcpy
dtoh_memcpy
dtod_memcpy

available in CPU code, where they all call std::memcpy.

This helps avoid the typical pattern of using #ifdef in user code:

#ifdef AMREX_USE_GPU
    amrex::Gpu::dtoh_memcpy_async(
        m_datanodes[slice].m_buffer, m_trailing_gpu_buffer.dataPtr(), num_bytes);
#else
    std::memcpy(m_datanodes[slice].m_buffer, m_trailing_gpu_buffer.dataPtr(), num_bytes);
#endif

Additional background

Checklist

The proposed changes:

  • fix a bug or incorrect behavior in AMReX
  • add new capabilities to AMReX
  • changes answers in the test suite to more than roundoff level
  • are likely to significantly affect the results of downstream AMReX users
  • include documentation in the code and/or rst files, if appropriate

@WeiqunZhang (Member)

The reason we have never added CPU versions of these functions is that they might result in unnecessary copies. In many cases, one could simply copy the pointer in the CPU case.

@WeiqunZhang (Member)

And amrex::Gpu::copy can be used to copy in both GPU and CPU cases. But this issue/complaint keeps coming up. My thought is that we should make it hard for users to write non-optimal code. To me, htod_memcpy etc. are GPU-only concepts. But I can be convinced.

@AlexanderSinn (Member, Author)

The code I want to clean up in hipace packs and unpacks buffers for MPI. Even on CPU, a memory copy is necessary there. Since it deals with raw memory, the iterator interface of copyAsync doesn't fit very well; additionally, I find that function more verbose to use because of the double amrex::Gpu:: and the src.begin(), src.begin() + np arguments.

For the cases where data is copied in order to be used in a GPU kernel, which seems to be very common, it is currently quite difficult to do this efficiently on both CPU and GPU. For that, I eventually want to add a container type that holds both a CPU and a GPU vector when a GPU is used, but only a single vector on unified-memory systems or CPUs.

template<class T>
struct HostDeviceVector {

    amrex::Gpu::DeviceVector<T> m_d_data; // device-side storage (unused when memory is unified)
    amrex::Gpu::PinnedVector<T> m_h_data; // host-side storage, pinned for fast transfers
    bool m_unified_memory = true; // would be false on discrete-GPU builds

    // pointer to pass into kernels: device memory, unless host and device share it
    T* dataPtr () {
        if (m_unified_memory) {
            return m_h_data.dataPtr();
        } else {
            return m_d_data.dataPtr();
        }
    }

    // element access always goes through the host copy
    T& operator[] (std::size_t i) {
        return m_h_data[i];
    }

    void resize (std::size_t i) {
        m_h_data.resize(i);
        if (!m_unified_memory) {
            m_d_data.resize(i);
        }
    }

    // the copies are no-ops on unified memory, where only the host vector is used
    void async_copy_d2h () {
        if (!m_unified_memory) {
            amrex::Gpu::dtoh_memcpy_async(m_h_data.dataPtr(), m_d_data.dataPtr(),
                m_d_data.size() * sizeof(T));
        }
    }

    void async_copy_h2d () {
        if (!m_unified_memory) {
            amrex::Gpu::htod_memcpy_async(m_d_data.dataPtr(), m_h_data.dataPtr(),
                m_h_data.size() * sizeof(T));
        }
    }

    void sync () {
        if (!m_unified_memory) {
            amrex::Gpu::streamSynchronize();
        }
    }
};

@WeiqunZhang (Member)

We have something somewhat like that, with one "vector" for CPU builds and two for GPU builds: amrex::Gpu::Buffer.

@AlexanderSinn (Member, Author)

Very interesting, I didn't know this existed. One flaw I see with the current version is that, to use the Buffer (T const* h_p, const std::size_t n) constructor, the user needs to have another vector/array from which a copy is made. I think it would need a constructor that just takes a size, plus a way to write to the host data and initiate an htod transfer.

@WeiqunZhang (Member)

Agreed. We need to add more methods for initialization.
