
Conversation

@AlexanderSinn (Member)

Summary

This PR makes the 6 GPU memory copy functions

htod_memcpy_async
dtoh_memcpy_async
dtod_memcpy_async
htod_memcpy
dtoh_memcpy
dtod_memcpy

available in CPU code, where they all call std::memcpy.

This helps avoid the typical pattern of using #ifdef in user code:

#ifdef AMREX_USE_GPU
    amrex::Gpu::dtoh_memcpy_async(
        m_datanodes[slice].m_buffer, m_trailing_gpu_buffer.dataPtr(), num_bytes);
#else
    std::memcpy(m_datanodes[slice].m_buffer, m_trailing_gpu_buffer.dataPtr(), num_bytes);
#endif

Additional background

Checklist

The proposed changes:

  • fix a bug or incorrect behavior in AMReX
  • add new capabilities to AMReX
  • changes answers in the test suite to more than roundoff level
  • are likely to significantly affect the results of downstream AMReX users
  • include documentation in the code and/or rst files, if appropriate

@WeiqunZhang (Member)

The reason we have never added CPU versions of these functions is that they might result in unnecessary copies. In many cases, one could simply copy the pointer in the CPU case.

@WeiqunZhang (Member)

And amrex::Gpu::copy can be used to copy in both GPU and CPU cases. But this issue/complaint keeps coming up. My thought is that we should make it hard for users to write non-optimal code. To me, htod_memcpy etc. are GPU-only concepts. But I can be convinced.

@AlexanderSinn (Member, Author)

The code I want to clean up in hipace packs and unpacks buffers for MPI. Even on CPU, a memory copy is necessary there. Since it deals with raw memory, the iterator interface of copyAsync doesn't fit very well; additionally, I find that function more verbose to use because of the double amrex::Gpu:: and the src.begin(), src.begin() + np arguments.

For the cases where data is copied in order to be used in a GPU kernel, which seems to be very common, it is currently quite difficult to do this efficiently on both CPU and GPU. For that, I eventually want to add a container type that holds both a CPU and a GPU vector when a GPU is used, but only a single vector on unified-memory systems or CPUs.

template<class T>
struct HostDeviceVector {

    amrex::Gpu::DeviceVector<T> m_d_data; // device-side storage (unused when memory is unified)
    amrex::Gpu::PinnedVector<T> m_h_data; // host-side storage, pinned for fast transfers
    bool m_unified_memory = true; // would be false on discrete-GPU builds

    // pointer to pass into kernels: device memory, unless host and device share it
    T* dataPtr () {
        if (m_unified_memory) {
            return m_h_data.dataPtr();
        } else {
            return m_d_data.dataPtr();
        }
    }

    // element access always goes through the host copy
    T& operator[] (std::size_t i) {
        return m_h_data[i];
    }

    void resize (std::size_t i) {
        m_h_data.resize(i);
        if (!m_unified_memory) {
            m_d_data.resize(i);
        }
    }

    // the copies are no-ops on unified memory, where only the host vector is used
    void async_copy_d2h () {
        if (!m_unified_memory) {
            amrex::Gpu::dtoh_memcpy_async(m_h_data.dataPtr(), m_d_data.dataPtr(),
                m_d_data.size() * sizeof(T));
        }
    }

    void async_copy_h2d () {
        if (!m_unified_memory) {
            amrex::Gpu::htod_memcpy_async(m_d_data.dataPtr(), m_h_data.dataPtr(),
                m_h_data.size() * sizeof(T));
        }
    }

    void sync () {
        if (!m_unified_memory) {
            amrex::Gpu::streamSynchronize();
        }
    }
};

@WeiqunZhang (Member)

We have something somewhat like that, with one "vector" for CPU builds and two for GPU builds: amrex::Gpu::Buffer.

@AlexanderSinn (Member, Author)

Very interesting, I didn't know this existed. One flaw I see with the current version is that, to use the Buffer (T const* h_p, const std::size_t n) constructor, the user needs to have another vector/array from which a copy is made. I think it would need a constructor that just takes a size, plus a way to write to the host data and initiate an htod transfer.

@WeiqunZhang (Member)

Agreed. We need to add more methods for initialization.
