Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] implement memory visibility #2180

Draft
wants to merge 19 commits into
base: develop
Choose a base branch
from

Conversation

SimeonEhrig
Copy link
Member

@SimeonEhrig SimeonEhrig commented Oct 11, 2023

Memory visibility types allows to implement an API, which allows to check if an acc can access the memory of a buffer.

@fwyzard
Copy link
Contributor

fwyzard commented Oct 11, 2023

could you describe what the expected interface is going to look like ?
how does this interact with CUDA/HIP unified memory, SYCL shared memory, Linux heterogeneous memory, etc. ?

@SimeonEhrig
Copy link
Member Author

I'm not fully sure about how the interface should looks like yet. I opened the PR to run the CI (at the moment I'm a train and have a bad connection to a system with GPU).

But I already thought about to support unified memory and similar. That's the reason why the buffer has a std::tupel for the memory visibility types.

I thought about following idea, how it could looks like:

template<typename TAcc, typename TBuf>
void transform(TAcc const & acc, TBuf & buf, ... ){
   // verify that acc type can access memory type
   // later, maybe compare at runtime if the device can access the memory 
   // -> for example if GPU 0 tries to access memory allocated on GPU 0
   static_assert(alpaka::hasSameMemView<TAcc, TBuf>());
   //...
}

int main(){
   using Host = alpaka::AccCpuSerial<Dim, Idx>;
   using Dev = alpaka::AccGpuHipRt<Dim, Idx>;

   Host hostAcc = ...
   Dev devAcc = ...

   using BufShared = alpaka::Buf<std::tupel<Host,Dev>, Data, Dim, Idx>;
   BufShared buffer(alpaka::allocBuf<Data, Idx>({hostAcc, devAcc}, extents));

  transform(hostAcc, buffer, ...);
  transform(devAcc, buffer, ...);
}

@fwyzard
Copy link
Contributor

fwyzard commented Oct 12, 2023

Ah, I see.
So, this would be a way to declare a buffer that is "visible" from multiple devices.

When discussing this in CMS a few months ago, there were at least two issues that came to mind.

1. meaning and intent

"visible" could mean different things, even on the same platform.
For example:

  • memory allocated with cudaMallocHost resides on the host, and is visible by the host and the GPU;
  • unified memory allocated with cudaMallocManaged can reside either on the host or on the GPU, with CUDA migrating it as necessary;
  • on recent Linux kernel memory allocated with malloc could behave in the same way.

2. concurrent access

If a buffer is "visible" by both the host and the GPU(s), some extra care is needed to ensure that concurrent access is avoided (or properly synchronised).
With a pseudocode example, compare

auto hbuf = allocate_host_buffer(host);
fill(hbuf);
auto dbuf = allocate_dev_buffer(dev);
alpaka::memcpy(queue, dbuf, hbuf);
// the kernel is guaranteed to run after the copy is complete
alpaka::exec<Acc>(queue, div, kernel{}, dbuf.data(), size);
// here we can release or reuse the host buffer
fill_again(hbuf);

with

auto buf = allocate_shared_buffer(host, dev);
fill(buf);
// warning: the kernel could run while the buffer has not yet been migrated
alpaka::exec<Acc>(queue, div, kernel{}, dbuf.data(), size);
// warning: we cannot reuse the shared buffer here!
fill_again(buf);

To be clear, I'm not saying this PR goes in the right or wrong direction, I'm just bringing up the issues we thought of so far.

@SimeonEhrig
Copy link
Member Author

@fwyzard Thanks for the hint. I had already a discussion with @psychocoderHPC how we abstract when memory is available on a device. At the moment, I'm not sure, if we should encode it in the visibility or add second property for this. I will spent more time with this question, before I continue.

@SimeonEhrig
Copy link
Member Author

@fwyzard I had a discussion with @psychocoderHPC. We agreed, that I will not implement a check if a visible memory is in a valid state, because it is a complex problem.

So, the function of the visibility properties is to forbid memory access from devices to memory, which is never allowed. For example access memory located on a Nvidia GPU from a Intel GPU (current state - maybe it is possible with HMM in future). In the case of virtual shared memory, the user or driver is still responsible that the data is available on the correct device, when it is accessed. Maybe we can add a second property to categorized the type of memory, that the user needs to know, what needs to be done. Similar to fine and coarse grain memory in HIP and OpenCL.

By the way, I'm also not sure, if we integrate the usage of memory visibility in alpaka itself (maybe in alpaka::copy). At the moment, this function is planned to be used in user code and libraries built on top of alpaka.

{
using type = std::tuple<alpaka::MemVisibleCPU>;
};

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the memory in a BufCpu returned by a call to allocMappedBuf is visible by the devices in the corresponding platform.

But the C++ type is the same, so I don't know how it could be distinguished at compile time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, what you mean.

Each time, if we use a device to allocate memory, the buffer will be accessible for the device. But afterwards you can try to access the memory with another device, e.g. a cuda device. If the memory was allocate with malloc and without HMM, the cuda device can never access the memory. If we would have HMM support, the type needs to be std::tuple<alpaka::MemVisibleCPU, MemVisibleGpuCudaRt>.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even without HMM support, if you allocate a host buffer with allocMappedBuf, it will be visible by the GPU.

However, the C++ type of the host buffer returned by allocBuf and allocMappedBuf is the same.

So, trait::MemVisibility<TBuf> cannot distinguish between the two cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, now I know what you mean. I missed allocMappedBuf two times 😬 First that it exist and second that you wrote allocMappedBuf and not allocBuf.

Yes your are right. Therefore I need to implement the visibility as template type of BufCpu, like I already did it for ViewPlainPtr:

//! The memory view to wrap plain pointers.
template<
typename TDev,
typename TElem,
typename TDim,
typename TIdx,
typename TMemVisibility =
typename alpaka::meta::toTuple<typename alpaka::trait::MemVisibility<alpaka::Platform<TDev>>::type>::type>
struct ViewPlainPtr final : internal::ViewAccessOps<ViewPlainPtr<TDev, TElem, TDim, TIdx, TMemVisibility>>

I think the buffer type cannot have a default visibility type.

@fwyzard
Copy link
Contributor

fwyzard commented May 15, 2024

Mhm, I see... so with these changes a default host memory buffer and a pinned host memory buffer will have different types ?

I'll have to figure out if this is fine for the CMS code.

@SimeonEhrig
Copy link
Member Author

SimeonEhrig commented May 16, 2024

Mhm, I see... so with these changes a default host memory buffer and a pinned host memory buffer will have different types ?

I'll have to figure out if this is fine for the CMS code.

At the moment not. The memory visibility only represent which buffer/view can be accessed by which accelerator. So host and pinnend memory has the same type because they are only available from the CPUPlatform. A the moment, I see no extra usage for a more fine granular compile time visibility.

The major ideas of visibility tags are:

  • Secure APIs, like vikunja::transform. Because without the memory visibility, it is not clear if vikunja will copy data to the device if a host buffer is used or not. In the worst case, the application will be crash with a segmentation fault and you don't know why.
  • Unify double memory buffer and managed memory. Means write your application like a you wold use separate host and device memory and do manually copies. If the managed memory is available, the second buffer is optimized away and memory copies are noops. I already discussed it already a little bit with @psychocoderHPC and the idea needs more tuning . For example it is important to distinguish between alpaka::copy (copies all the time) and alpaka::zeroCopy (copies only if required). @bernhardmgruber already tried to implement this: [abandoned] Allowing zerocopy: makeAvailable(queue, dst, srcView) #1820

@fwyzard
Copy link
Contributor

fwyzard commented May 16, 2024

At the moment not.

Aren't they ?

I understood that the buffer type becomes

template<typename TElem, typename TDim, typename TIdx, typename TMemVisibility>
class BufCpu { ... };

and the visibility (described by TMemVisibility) becomes part of the type.

So a buffer that holds system memory would have a type like

BufCpu<float, Dim1D, size_t, MemVisibleCPU>

while a buffer that pinned memory visible to NVIDIA GPUs would have a type like

BufCpu<float, Dim1D, size_t, std::tuple<MemVisibleCPU,MemVisibleGpuCudaRt>>

Did I misunderstand the intent ?

@fwyzard
Copy link
Contributor

fwyzard commented May 16, 2024

By the way, I do like the idea, but IMHO "visibility" is not enough for this:

  • Unify double memory buffer and managed memory.

Pinned memory is visible from a GPU, but using it directly to run a kernel is likely to be inefficient.

Managed memory should be almost as efficient as global GPU memory (assuming the runtime is good enough with the migration).

Truly shared memory (e.g. MI300A) should of course be the most efficient.


I know that I have not though enough about this topic (it keeps coming up, but we have more urgent things to improve), but I think we need some kind of runtime (not compile-time) visibility information, and also some kind of metric for how efficient an access is expected to be.

@psychocoderHPC
Copy link
Member

By the way, I do like the idea, but IMHO "visibility" is not enough for this:

@fwyzard I agree with you that this is not enough for optimized code but is a step into the direction where you can write a buffer wrapper in user code where memcopies can be avoided. E.g. what @bernhardmgruber started in this PR #1820

@fwyzard
Copy link
Contributor

fwyzard commented May 16, 2024

I agree with you that this is not enough for optimized code but is a step into the direction where you can write a buffer wrapper in user code where memcopies can be avoided.

But how would you distinguish between pinned host memory and managed memory ?

@SimeonEhrig
Copy link
Member Author

SimeonEhrig commented May 16, 2024

@fwyzard Sorry, I was not aware, that Pinned memory is working without explicit mem copies. Therefore the visibility is BufCpu<float, Dim1D, size_t, std::tuple<MemVisibleCPU,MemVisibleGpuCudaRt>> like for managed memory.

@fwyzard
Copy link
Contributor

fwyzard commented May 27, 2024

static inline Error_t hostRegister(void *ptr, size_t size, Flag_t flags)

This function is a wrapper for CUDA's cudaHostRegister and HIP's hipHostRegister.
It takes a chunk of system memory, makes it non-pageable, and maps it into the memory space of the GPU driver.

The result is that this memory is now visible both from the host and from the gpu. To be portable, the access from the gpu should use the pointer returned by calling static inline Error_t hostGetDevicePointer(void** pDevice, void* pHost, Flag_t flags).

On some architectures and devices the host and gpu pointers will be the same, but not on all systems, and wrapping it in a View does make sense.

However, the original buffer and this View should have different visibility: the buffer is visible on the host, the View is visible on the gpu - even if they point to the same underlying memory...

@fwyzard
Copy link
Contributor

fwyzard commented May 27, 2024

By the way, as far as I know SYCL/oneAPI does not support an equivalent functionality:

From DPCT1026:

SYCL currently does not support registering of existing host memory for use by device. Use USM to allocate memory for use by host and device.

@SimeonEhrig
Copy link
Member Author

The result is that this memory is now visible both from the host and from the gpu. To be portable, the access from the gpu should use the pointer returned by calling static inline Error_t hostGetDevicePointer(void** pDevice, void* pHost, Flag_t flags).

I missed that the function takes two pointer. Therefore a second view/buf is useful and than we can add the new visibility.

@SimeonEhrig
Copy link
Member Author

By the way, as far as I know SYCL/oneAPI does not support an equivalent functionality:

From DPCT1026:

SYCL currently does not support registering of existing host memory for use by device. Use USM to allocate memory for use by host and device.

Okay, maybe this will be never a problem for the memory visibility tags if sycl will not support it. Nevertheless, the current design would allow it.

@@ -162,8 +162,7 @@ auto main() -> int
CounterBasedRngKernel::Key key = {rd(), rd()};

// Allocate buffer on the accelerator
using BufAcc = alpaka::Buf<Acc, Data, Dim, Idx>;
BufAcc bufAcc(alpaka::allocBuf<Data, Idx>(devAcc, extent));
auto bufAcc(alpaka::allocBuf<Data, Idx>(devAcc, extent));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I prefer the syntax

Suggested change
auto bufAcc(alpaka::allocBuf<Data, Idx>(devAcc, extent));
auto bufAcc = alpaka::allocBuf<Data, Idx>(devAcc, extent);

It's just a preference, not a request to make any changes.

@@ -118,6 +118,11 @@ namespace alpaka::trait
{
};

struct MemVisibility<alpaka::AccCpuSerial<TDim, TIdx>>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why AccCpuSerial ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a copy paste error.

@fwyzard
Copy link
Contributor

fwyzard commented May 29, 2024

After looking at the changes, I think it would be nicer to make the visibility of a buffer default to the device where the buffer resides.
For example, for a buffer on the DevCpu device, the visibility should be MemVisibilityTypeList<PlatformCpu> by default.

Special cases like pinned memory or unified memory could override the default.

@SimeonEhrig
Copy link
Member Author

After looking at the changes, I think it would be nicer to make the visibility of a buffer default to the device where the buffer resides. For example, for a buffer on the DevCpu device, the visibility should be MemVisibilityTypeList<PlatformCpu> by default.

Special cases like pinned memory or unified memory could override the default.

Only for consistency, I asked the question already here: #2180 (comment)
Thanks for your answer. @psychocoderHPC What do you think?

@SimeonEhrig SimeonEhrig modified the milestones: 1.2.0, 2.0.0 Sep 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants