[BUG]: Ensure cudaMemcpy is called by thrust::copy #210
Comments
Thanks a lot for raising this potential performance issue. We have moved to our new mono repo, so I will duplicate the issue there. |
I'll just transfer it. |
So it seems that what we're missing here is that we should detect when the input/output iterators satisfy |
We are currently in the process of merging the first part of |
We may want to land this change sooner, and Thrust already has an |
@jrhemstad we already identify when it's safe to use memcpy: 1, 2. To my understanding, the issue is that we are using explicit |
In case of contiguous ranges of trivially relocatable types, we can call `cudaMemcpyAsync` directly instead of going through transform. Fixes NVIDIA#210
The issue was accidentally closed; reopening it. |
@gonzalobg could you provide a reproducer for this so we can be sure it is addressed correctly? |
The only way I could imagine testing this is with benchmarks: benchmarking memory BW and making sure it matches |
I believe we were talking about potential bugs, where managed memory is involved and we call cudaMemcpyAsync with |
@gonzalobg I thought the issue is about using

```cpp
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>

int main() {
  constexpr int n = 10;
  thrust::device_vector<int> src(n);
  int dst[n];
  int *src_ptr = thrust::raw_pointer_cast(src.data());
  int *dst_ptr = dst;
  thrust::copy_n(thrust::device, src_ptr, n, dst_ptr);
}
// terminate called after throwing an instance of 'thrust::system::system_error'
//   what(): __copy:: D->D: failed: cudaErrorInvalidValue: invalid argument
```
The issue I am running into is H2D copies being slow for iterators of raw pointer type ( These pointers are allocated with |
We should ensure that this:
calls `cudaMemcpy` or `cudaMemcpyAsync` with `cudaMemcpyDefault`. Right now it does not seem to be happening. @jrhemstad