[WIP] implement memory visibility #2180
base: develop
Conversation
Could you describe what the expected interface is going to look like?
I'm not fully sure yet what the interface should look like. I opened the PR to run the CI (at the moment I'm on a train and have a bad connection to a system with a GPU). But I already thought about supporting unified memory and similar. That's the reason why the buffer takes a tuple of accelerator types. I thought about the following idea of how it could look:

```cpp
template<typename TAcc, typename TBuf>
void transform(TAcc const& acc, TBuf& buf, ...)
{
    // verify that the acc type can access the memory type
    // later, maybe also compare at runtime whether the device can access the memory
    // -> for example if GPU 0 tries to access memory allocated on GPU 1
    static_assert(alpaka::hasSameMemView<TAcc, TBuf>());
    // ...
}

int main()
{
    using Host = alpaka::AccCpuSerial<Dim, Idx>;
    using Dev = alpaka::AccGpuHipRt<Dim, Idx>;

    Host hostAcc = ...;
    Dev devAcc = ...;

    using BufShared = alpaka::Buf<std::tuple<Host, Dev>, Data, Dim, Idx>;
    BufShared buffer(alpaka::allocBuf<Data, Idx>({hostAcc, devAcc}, extents));

    transform(hostAcc, buffer, ...);
    transform(devAcc, buffer, ...);
}
```
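For illustration, here is a minimal, self-contained sketch (not part of alpaka) of how a compile-time check like `hasSameMemView` could be implemented on top of visibility-tag tuples; the stub types `AccCpuSerialStub`/`BufGpuCudaStub` and the `ContainsTag`/`HasCommonTag` helpers are invented for this example, only the tag and trait names follow the thread:

```cpp
#include <tuple>
#include <type_traits>

// invented visibility tags (names follow the thread)
struct MemVisibleCPU {};
struct MemVisibleGpuCudaRt {};

// assumed trait: maps an accelerator or buffer type to a std::tuple of visibility tags
template<typename T>
struct MemVisibility;

// stub types, invented for the example
struct AccCpuSerialStub {};
struct BufGpuCudaStub {};

template<>
struct MemVisibility<AccCpuSerialStub> { using type = std::tuple<MemVisibleCPU>; };
template<>
struct MemVisibility<BufGpuCudaStub> { using type = std::tuple<MemVisibleGpuCudaRt>; };

// true if tag T appears in the tuple Tuple
template<typename T, typename Tuple>
struct ContainsTag;
template<typename T, typename... Ts>
struct ContainsTag<T, std::tuple<Ts...>> : std::disjunction<std::is_same<T, Ts>...> {};

// true if the two tuples share at least one visibility tag
template<typename AccTags, typename BufTags>
struct HasCommonTag;
template<typename... As, typename BufTags>
struct HasCommonTag<std::tuple<As...>, BufTags> : std::disjunction<ContainsTag<As, BufTags>...> {};

template<typename TAcc, typename TBuf>
constexpr bool hasSameMemView()
{
    return HasCommonTag<typename MemVisibility<TAcc>::type, typename MemVisibility<TBuf>::type>::value;
}

static_assert(!hasSameMemView<AccCpuSerialStub, BufGpuCudaStub>(), "a CPU-only accelerator cannot see GPU-only memory");
```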
Ah, I see. When discussing this in CMS a few months ago, there were at least two issues that came to mind.

1. Meaning and intent: "visible" could mean different things, even on the same platform.
2. Concurrent access: if a buffer is "visible" by both the host and the GPU(s), some extra care is needed to ensure that concurrent access is avoided (or properly synchronised). For example, compare

```cpp
auto hbuf = allocate_host_buffer(host);
fill(hbuf);

auto dbuf = allocate_dev_buffer(dev);
alpaka::memcpy(queue, dbuf, hbuf);

// the kernel is guaranteed to run after the copy is complete
alpaka::exec<Acc>(queue, div, kernel{}, dbuf.data(), size);

// here we can release or reuse the host buffer
fill_again(hbuf);
```

with

```cpp
auto buf = allocate_shared_buffer(host, dev);
fill(buf);

// warning: the kernel could run while the buffer has not yet been migrated
alpaka::exec<Acc>(queue, div, kernel{}, buf.data(), size);

// warning: we cannot reuse the shared buffer here!
fill_again(buf);
```

To be clear, I'm not saying this PR goes in the right or wrong direction, I'm just bringing up the issues we thought of so far.
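Regarding the second example above: a minimal sketch of how the shared-buffer variant could be made safe with explicit synchronisation (reusing the hypothetical `allocate_shared_buffer`/`fill`/`fill_again` helpers from that comment; `alpaka::wait(queue)` is alpaka's blocking wait on a queue):

```cpp
auto buf = allocate_shared_buffer(host, dev);  // hypothetical helper from the example above
fill(buf);                                     // the host fills the shared buffer

// whether the data is already usable by the device at this point depends on the kind of
// shared memory (pinned, managed, truly shared) and on the driver
alpaka::exec<Acc>(queue, div, kernel{}, buf.data(), size);

// block until the kernel has finished before the host touches the buffer again
alpaka::wait(queue);
fill_again(buf);
```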
@fwyzard Thanks for the hint. I already had a discussion with @psychocoderHPC about how we abstract whether memory is available on a device. At the moment I'm not sure if we should encode it in the visibility or add a second property for this. I will spend more time on this question before I continue.
@fwyzard I had a discussion with @psychocoderHPC. We agreed that I will not implement a check whether visible memory is in a valid state, because it is a complex problem. So the function of the visibility properties is to forbid memory access from devices to memory which is never allowed, for example accessing memory located on an NVIDIA GPU from an Intel GPU (current state; maybe it becomes possible with HMM in the future). In the case of virtual shared memory, the user or driver is still responsible for making sure that the data is available on the correct device when it is accessed. Maybe we can add a second property to categorise the type of memory, so that the user knows what needs to be done, similar to fine- and coarse-grained memory in HIP and OpenCL. By the way, I'm also not sure whether we should integrate the usage of memory visibility into alpaka itself (maybe in …).
```cpp
    {
        using type = std::tuple<alpaka::MemVisibleCPU>;
    };
```
Actually, the memory in a `BufCpu` returned by a call to `allocMappedBuf` is visible by the devices in the corresponding platform. But the C++ type is the same, so I don't know how it could be distinguished at compile time.
I'm not sure what you mean. Whenever we use a device to allocate memory, the buffer is accessible for that device. But afterwards you can try to access the memory with another device, e.g. a CUDA device. If the memory was allocated with `malloc` and without HMM, the CUDA device can never access the memory. If we had HMM support, the type would need to be `std::tuple<alpaka::MemVisibleCPU, MemVisibleGpuCudaRt>`.
Even without HMM support, if you allocate a host buffer with `allocMappedBuf`, it will be visible by the GPU. However, the C++ type of the host buffer returned by `allocBuf` and `allocMappedBuf` is the same. So `trait::MemVisibility<TBuf>` cannot distinguish between the two cases.
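To spell out the problem in code (an illustration only, not actual alpaka allocation calls; `Dim` and `Idx` are assumed to be defined as in the earlier examples):

```cpp
// a pageable host buffer (from allocBuf) and a pinned, GPU-visible host buffer
// (from allocMappedBuf) currently share the same C++ type:
using PageableHostBuf = alpaka::BufCpu<float, Dim, Idx>;
using PinnedHostBuf   = alpaka::BufCpu<float, Dim, Idx>;
static_assert(std::is_same_v<PageableHostBuf, PinnedHostBuf>);

// hence a purely type-based trait such as trait::MemVisibility<TBuf>
// necessarily reports the same visibility for both buffers
```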
Ah, now I know what you mean. I missed `allocMappedBuf` two times 😬 First that it exists, and second that you wrote `allocMappedBuf` and not `allocBuf`.

Yes, you are right. Therefore I need to implement the visibility as a template parameter of `BufCpu`, like I already did for `ViewPlainPtr`:

alpaka/include/alpaka/mem/view/ViewPlainPtr.hpp, lines 30 to 38 in 4054e15:

```cpp
//! The memory view to wrap plain pointers.
template<
    typename TDev,
    typename TElem,
    typename TDim,
    typename TIdx,
    typename TMemVisibility =
        typename alpaka::meta::toTuple<typename alpaka::trait::MemVisibility<alpaka::Platform<TDev>>::type>::type>
struct ViewPlainPtr final : internal::ViewAccessOps<ViewPlainPtr<TDev, TElem, TDim, TIdx, TMemVisibility>>
```
I think the buffer type cannot have a default visibility type.
Mhm, I see... so with these changes a default host memory buffer and a pinned host memory buffer will have different types? I'll have to figure out if this is fine for the CMS code.
At the moment, no. The memory visibility only represents which buffer/view can be accessed by which accelerator. So host and pinned memory have the same type, because they are only available from the host. The major ideas of the visibility tags are:
Aren't they? I understood that the buffer type becomes

```cpp
template<typename TElem, typename TDim, typename TIdx, typename TMemVisibility>
class BufCpu { ... };
```

and the visibility (described by `TMemVisibility`) depends on how the memory was allocated.

So a buffer that holds system memory would have a type like

```cpp
BufCpu<float, Dim1D, size_t, MemVisibleCPU>
```

while a buffer that holds pinned memory visible to NVIDIA GPUs would have a type like

```cpp
BufCpu<float, Dim1D, size_t, std::tuple<MemVisibleCPU, MemVisibleGpuCudaRt>>
```

Did I misunderstand the intent?
By the way, I do like the idea, but IMHO "visibility" is not enough for this:

- Pinned memory is visible from a GPU, but using it directly to run a kernel is likely to be inefficient.
- Managed memory should be almost as efficient as global GPU memory (assuming the runtime is good enough with the migration).
- Truly shared memory (e.g. MI300A) should of course be the most efficient.

I know that I have not thought enough about this topic (it keeps coming up, but we have more urgent things to improve), but I think we need some kind of runtime (not compile-time) visibility information, and also some kind of metric for how efficient an access is expected to be.
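Purely as a thought experiment of what such runtime information could look like (nothing like this exists in alpaka; every name below is invented for this sketch):

```cpp
#include <cstdint>

// Hypothetical classification of how efficiently a device can reach a given allocation.
enum class AccessQuality : std::uint8_t
{
    None,      // the device cannot access this memory at all
    Remote,    // accessible but slow, e.g. pinned host memory read over PCIe
    Migrating, // accessible via page migration, e.g. CUDA/HIP managed memory
    Native     // local memory of the device, or truly shared memory (e.g. MI300A)
};

// Hypothetical runtime query: how well can `dev` access the memory owned by `buf`?
template<typename TDev, typename TBuf>
auto accessQuality(TDev const& dev, TBuf const& buf) -> AccessQuality;
```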
@fwyzard I agree with you that this is not enough for optimised code, but it is a step in the direction where you can write a buffer wrapper in user code where memcopies can be avoided, e.g. what @bernhardmgruber started in PR #1820.
But how would you distinguish between pinned host memory and managed memory?
@fwyzard Sorry, I was not aware that pinned memory works without explicit memory copies. Therefore the visibility has to include the GPU as well.
This function is a wrapper for CUDA's `cudaHostRegister` and HIP's `hipHostRegister`. The result is that this memory is now visible both from the host and from the GPU. To be portable, the access from the GPU should use the pointer returned by calling `cudaHostGetDevicePointer` / `hipHostGetDevicePointer`. On some architectures and devices the host and GPU pointers will be the same, but not on all systems, and wrapping it in a … However, the original buffer and this …
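For reference, a minimal sketch of the raw CUDA runtime calls involved (error handling omitted; HIP offers the analogous `hipHostRegister` / `hipHostGetDevicePointer` / `hipHostUnregister`):

```cpp
#include <cuda_runtime.h>
#include <vector>

void register_existing_host_memory()
{
    std::vector<float> host(1 << 20);

    // pin the existing host allocation and map it into the device address space
    cudaHostRegister(host.data(), host.size() * sizeof(float), cudaHostRegisterMapped);

    // portable code must query the device-side pointer explicitly:
    // it may or may not be equal to host.data(), depending on the system
    float* devPtr = nullptr;
    cudaHostGetDevicePointer(reinterpret_cast<void**>(&devPtr), host.data(), 0);

    // ... launch kernels that access the pinned memory through devPtr ...

    cudaHostUnregister(host.data());
}
```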
By the way, as far as I know SYCL/oneAPI does not support an equivalent functionality, according to the DPCT1026 diagnostic.
I missed that the function takes two pointers. Therefore a second view/buf is useful, and then we can add the new visibility.
Okay, maybe this will never be a problem for the memory visibility tags if SYCL does not support it. Nevertheless, the current design would allow it.
- tests checking that the correct visibility type is set for allocated memory are still missing, except for `allocBuf`
- state of the commit: it should compile with all backends and does not break existing tests
```diff
@@ -162,8 +162,7 @@ auto main() -> int
     CounterBasedRngKernel::Key key = {rd(), rd()};
 
     // Allocate buffer on the accelerator
-    using BufAcc = alpaka::Buf<Acc, Data, Dim, Idx>;
-    BufAcc bufAcc(alpaka::allocBuf<Data, Idx>(devAcc, extent));
+    auto bufAcc(alpaka::allocBuf<Data, Idx>(devAcc, extent));
```
Personally, I prefer the syntax

```diff
-auto bufAcc(alpaka::allocBuf<Data, Idx>(devAcc, extent));
+auto bufAcc = alpaka::allocBuf<Data, Idx>(devAcc, extent);
```
It's just a preference, not a request to make any changes.
```diff
@@ -118,6 +118,11 @@ namespace alpaka::trait
     {
     };
 
+    struct MemVisibility<alpaka::AccCpuSerial<TDim, TIdx>>
```
Why `AccCpuSerial`?
Looks like a copy paste error.
After looking at the changes, I think it would be nicer to make the visibility of a buffer default to the device where the buffer resides. Special cases like pinned memory or unified memory could override the default.
Just for consistency: I already asked the question here: #2180 (comment)
Memory visibility types allow implementing an API which checks whether an accelerator can access the memory of a buffer.