[WIP] implement memory visibility #2180
base: develop
Conversation
Could you describe what the expected interface is going to look like?
I'm not fully sure yet what the interface should look like. I opened the PR to run the CI (at the moment I'm on a train and have a bad connection to a system with a GPU). But I already thought about supporting unified memory and similar. That's the reason why the buffer takes a tuple of accelerator types. I thought about the following idea of how it could look:

```cpp
template<typename TAcc, typename TBuf>
void transform(TAcc const& acc, TBuf& buf, ...)
{
    // verify that the acc type can access the memory type
    // later, maybe also compare at runtime whether the device can access the memory
    // -> for example if GPU 0 tries to access memory allocated on GPU 1
    static_assert(alpaka::hasSameMemView<TAcc, TBuf>());
    // ...
}

int main()
{
    using Host = alpaka::AccCpuSerial<Dim, Idx>;
    using Dev = alpaka::AccGpuHipRt<Dim, Idx>;

    Host hostAcc = ...;
    Dev devAcc = ...;

    using BufShared = alpaka::Buf<std::tuple<Host, Dev>, Data, Dim, Idx>;
    BufShared buffer(alpaka::allocBuf<Data, Idx>({hostAcc, devAcc}, extents));

    transform(hostAcc, buffer, ...);
    transform(devAcc, buffer, ...);
}
```
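For illustration, here is a minimal, self-contained sketch (not part of alpaka) of how a compile-time check like `hasSameMemView` could be implemented on top of visibility-tag tuples; the stub types `AccCpuSerialStub`/`BufGpuCudaStub` and the `ContainsTag`/`HasCommonTag` helpers are invented for this example, only the tag and trait names follow the thread:

```cpp
#include <tuple>
#include <type_traits>

// invented visibility tags (names follow the thread)
struct MemVisibleCPU {};
struct MemVisibleGpuCudaRt {};

// assumed trait: maps an accelerator or buffer type to a std::tuple of visibility tags
template<typename T>
struct MemVisibility;

// stub types, invented for the example
struct AccCpuSerialStub {};
struct BufGpuCudaStub {};

template<>
struct MemVisibility<AccCpuSerialStub> { using type = std::tuple<MemVisibleCPU>; };
template<>
struct MemVisibility<BufGpuCudaStub> { using type = std::tuple<MemVisibleGpuCudaRt>; };

// true if tag T appears in the tuple Tuple
template<typename T, typename Tuple>
struct ContainsTag;
template<typename T, typename... Ts>
struct ContainsTag<T, std::tuple<Ts...>> : std::disjunction<std::is_same<T, Ts>...> {};

// true if the two tuples share at least one visibility tag
template<typename AccTags, typename BufTags>
struct HasCommonTag;
template<typename... As, typename BufTags>
struct HasCommonTag<std::tuple<As...>, BufTags> : std::disjunction<ContainsTag<As, BufTags>...> {};

template<typename TAcc, typename TBuf>
constexpr bool hasSameMemView()
{
    return HasCommonTag<typename MemVisibility<TAcc>::type, typename MemVisibility<TBuf>::type>::value;
}

static_assert(!hasSameMemView<AccCpuSerialStub, BufGpuCudaStub>(), "a CPU-only accelerator cannot see GPU-only memory");
```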
Ah, I see. When discussing this in CMS a few months ago, there were at least two issues that came to mind.

1. Meaning and intent: "visible" could mean different things, even on the same platform.
2. Concurrent access: if a buffer is "visible" by both the host and the GPU(s), some extra care is needed to ensure that concurrent access is avoided (or properly synchronised). For example, compare

```cpp
auto hbuf = allocate_host_buffer(host);
fill(hbuf);

auto dbuf = allocate_dev_buffer(dev);
alpaka::memcpy(queue, dbuf, hbuf);

// the kernel is guaranteed to run after the copy is complete
alpaka::exec<Acc>(queue, div, kernel{}, dbuf.data(), size);

// here we can release or reuse the host buffer
fill_again(hbuf);
```

with

```cpp
auto buf = allocate_shared_buffer(host, dev);
fill(buf);

// warning: the kernel could run while the buffer has not yet been migrated
alpaka::exec<Acc>(queue, div, kernel{}, buf.data(), size);

// warning: we cannot reuse the shared buffer here!
fill_again(buf);
```

To be clear, I'm not saying this PR goes in the right or wrong direction, I'm just bringing up the issues we thought of so far.
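Regarding the second example above: a minimal sketch of how the shared-buffer variant could be made safe with explicit synchronisation (reusing the hypothetical `allocate_shared_buffer`/`fill`/`fill_again` helpers from that comment; `alpaka::wait(queue)` is alpaka's blocking wait on a queue):

```cpp
auto buf = allocate_shared_buffer(host, dev);  // hypothetical helper from the example above
fill(buf);                                     // the host fills the shared buffer

// whether the data is already usable by the device at this point depends on the kind of
// shared memory (pinned, managed, truly shared) and on the driver
alpaka::exec<Acc>(queue, div, kernel{}, buf.data(), size);

// block until the kernel has finished before the host touches the buffer again
alpaka::wait(queue);
fill_again(buf);
```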
@fwyzard Thanks for the hint. I already had a discussion with @psychocoderHPC about how we abstract whether memory is available on a device. At the moment I'm not sure if we should encode it in the visibility or add a second property for this. I will spend more time on this question before I continue.
@fwyzard I had a discussion with @psychocoderHPC. We agreed that I will not implement a check whether visible memory is in a valid state, because it is a complex problem. So the function of the visibility properties is to forbid memory access from devices to memory which is never allowed, for example accessing memory located on an NVIDIA GPU from an Intel GPU (current state; maybe it becomes possible with HMM in the future). In the case of virtual shared memory, the user or driver is still responsible for making sure that the data is available on the correct device when it is accessed. Maybe we can add a second property to categorise the type of memory, so that the user knows what needs to be done, similar to fine- and coarse-grained memory in HIP and OpenCL. By the way, I'm also not sure whether we should integrate the usage of memory visibility into alpaka itself (maybe in …).
```cpp
    {
        using type = std::tuple<alpaka::MemVisibleCPU>;
    };
```
Actually, the memory in a `BufCpu` returned by a call to `allocMappedBuf` is visible by the devices in the corresponding platform. But the C++ type is the same, so I don't know how it could be distinguished at compile time.
I'm not sure what you mean. Whenever we use a device to allocate memory, the buffer is accessible for that device. But afterwards you can try to access the memory with another device, e.g. a CUDA device. If the memory was allocated with `malloc` and without HMM, the CUDA device can never access the memory. If we had HMM support, the type would need to be `std::tuple<alpaka::MemVisibleCPU, MemVisibleGpuCudaRt>`.
Even without HMM support, if you allocate a host buffer with `allocMappedBuf`, it will be visible by the GPU. However, the C++ type of the host buffer returned by `allocBuf` and `allocMappedBuf` is the same. So `trait::MemVisibility<TBuf>` cannot distinguish between the two cases.
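To spell out the problem in code (an illustration only, not actual alpaka allocation calls; `Dim` and `Idx` are assumed to be defined as in the earlier examples):

```cpp
// a pageable host buffer (from allocBuf) and a pinned, GPU-visible host buffer
// (from allocMappedBuf) currently share the same C++ type:
using PageableHostBuf = alpaka::BufCpu<float, Dim, Idx>;
using PinnedHostBuf   = alpaka::BufCpu<float, Dim, Idx>;
static_assert(std::is_same_v<PageableHostBuf, PinnedHostBuf>);

// hence a purely type-based trait such as trait::MemVisibility<TBuf>
// necessarily reports the same visibility for both buffers
```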
Ah, now I know what you mean. I missed `allocMappedBuf` two times 😬 First that it exists, and second that you wrote `allocMappedBuf` and not `allocBuf`.

Yes, you are right. Therefore I need to implement the visibility as a template parameter of `BufCpu`, like I already did for `ViewPlainPtr`:

alpaka/include/alpaka/mem/view/ViewPlainPtr.hpp, lines 30 to 38 in 4054e15:

```cpp
//! The memory view to wrap plain pointers.
template<
    typename TDev,
    typename TElem,
    typename TDim,
    typename TIdx,
    typename TMemVisibility =
        typename alpaka::meta::toTuple<typename alpaka::trait::MemVisibility<alpaka::Platform<TDev>>::type>::type>
struct ViewPlainPtr final : internal::ViewAccessOps<ViewPlainPtr<TDev, TElem, TDim, TIdx, TMemVisibility>>
```
I think the buffer type cannot have a default visibility type.
Mhm, I see... so with these changes a default host memory buffer and a pinned host memory buffer will have different types? I'll have to figure out if this is fine for the CMS code.
At the moment, no. The memory visibility only represents which buffer/view can be accessed by which accelerator. So host and pinned memory have the same type, because they are only available from the host. The major ideas of the visibility tags are:
Aren't they? I understood that the buffer type becomes

```cpp
template<typename TElem, typename TDim, typename TIdx, typename TMemVisibility>
class BufCpu { ... };
```

and the visibility (described by `TMemVisibility`) depends on how the memory was allocated.

So a buffer that holds system memory would have a type like

```cpp
BufCpu<float, Dim1D, size_t, MemVisibleCPU>
```

while a buffer that holds pinned memory visible to NVIDIA GPUs would have a type like

```cpp
BufCpu<float, Dim1D, size_t, std::tuple<MemVisibleCPU, MemVisibleGpuCudaRt>>
```

Did I misunderstand the intent?
By the way, I do like the idea, but IMHO "visibility" is not enough for this:

- Pinned memory is visible from a GPU, but using it directly to run a kernel is likely to be inefficient.
- Managed memory should be almost as efficient as global GPU memory (assuming the runtime is good enough with the migration).
- Truly shared memory (e.g. MI300A) should of course be the most efficient.

I know that I have not thought enough about this topic (it keeps coming up, but we have more urgent things to improve), but I think we need some kind of runtime (not compile-time) visibility information, and also some kind of metric for how efficient an access is expected to be.
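Purely as a thought experiment of what such runtime information could look like (nothing like this exists in alpaka; every name below is invented for this sketch):

```cpp
#include <cstdint>

// Hypothetical classification of how efficiently a device can reach a given allocation.
enum class AccessQuality : std::uint8_t
{
    None,      // the device cannot access this memory at all
    Remote,    // accessible but slow, e.g. pinned host memory read over PCIe
    Migrating, // accessible via page migration, e.g. CUDA/HIP managed memory
    Native     // local memory of the device, or truly shared memory (e.g. MI300A)
};

// Hypothetical runtime query: how well can `dev` access the memory owned by `buf`?
template<typename TDev, typename TBuf>
auto accessQuality(TDev const& dev, TBuf const& buf) -> AccessQuality;
```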
@fwyzard I agree with you that this is not enough for optimised code, but it is a step in the direction where you can write a buffer wrapper in user code where memcopies can be avoided, e.g. what @bernhardmgruber started in PR #1820.
But how would you distinguish between pinned host memory and managed memory?
@fwyzard Sorry, I was not aware that pinned memory works without explicit memory copies. Therefore the visibility has to include the GPU as well.
This function is a wrapper for CUDA's `cudaHostRegister` and HIP's `hipHostRegister`. The result is that this memory is now visible both from the host and from the GPU. To be portable, the access from the GPU should use the pointer returned by calling `cudaHostGetDevicePointer` / `hipHostGetDevicePointer`. On some architectures and devices the host and GPU pointers will be the same, but not on all systems, and wrapping it in a … However, the original buffer and this …
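For reference, a minimal sketch of the raw CUDA runtime calls involved (error handling omitted; HIP offers the analogous `hipHostRegister` / `hipHostGetDevicePointer` / `hipHostUnregister`):

```cpp
#include <cuda_runtime.h>
#include <vector>

void register_existing_host_memory()
{
    std::vector<float> host(1 << 20);

    // pin the existing host allocation and map it into the device address space
    cudaHostRegister(host.data(), host.size() * sizeof(float), cudaHostRegisterMapped);

    // portable code must query the device-side pointer explicitly:
    // it may or may not be equal to host.data(), depending on the system
    float* devPtr = nullptr;
    cudaHostGetDevicePointer(reinterpret_cast<void**>(&devPtr), host.data(), 0);

    // ... launch kernels that access the pinned memory through devPtr ...

    cudaHostUnregister(host.data());
}
```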
By the way, as far as I know SYCL/oneAPI does not support an equivalent functionality, according to the DPCT1026 diagnostic.
I missed that the function takes two pointers. Therefore a second view/buf is useful, and then we can add the new visibility.
Okay, maybe this will never be a problem for the memory visibility tags if SYCL does not support it. Nevertheless, the current design would allow it.
- tests checking that the correct visibility type is set for allocated memory are still missing, except for `allocBuf`
- state of the commit: it should compile with all backends and does not break existing tests
```diff
@@ -162,8 +162,7 @@ auto main() -> int
     CounterBasedRngKernel::Key key = {rd(), rd()};
 
     // Allocate buffer on the accelerator
-    using BufAcc = alpaka::Buf<Acc, Data, Dim, Idx>;
-    BufAcc bufAcc(alpaka::allocBuf<Data, Idx>(devAcc, extent));
+    auto bufAcc(alpaka::allocBuf<Data, Idx>(devAcc, extent));
```
Personally, I prefer the syntax

```diff
-auto bufAcc(alpaka::allocBuf<Data, Idx>(devAcc, extent));
+auto bufAcc = alpaka::allocBuf<Data, Idx>(devAcc, extent);
```
It's just a preference, not a request to make any changes.
```diff
@@ -118,6 +118,11 @@ namespace alpaka::trait
     {
     };
 
+    struct MemVisibility<alpaka::AccCpuSerial<TDim, TIdx>>
```
Why `AccCpuSerial`?
Looks like a copy paste error.
After looking at the changes, I think it would be nicer to make the visibility of a buffer default to the device where the buffer resides. Special cases like pinned memory or unified memory could override the default.
Just for consistency: I already asked the question here: #2180 (comment)
Memory visibility types allow implementing an API which checks whether an accelerator can access the memory of a buffer.