Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG]: Ensure cudaMemcpy is called by thrust::copy #210

Open
gonzalobg opened this issue Jul 12, 2023 · 12 comments
Open

[BUG]: Ensure cudaMemcpy is called by thrust::copy #210

gonzalobg opened this issue Jul 12, 2023 · 12 comments
Labels
bug Something isn't working right. libcu++ For all items related to libcu++ nvbug Has an associated internal NVIDIA NVBug. thrust For all items related to Thrust.

Comments

@gonzalobg
Copy link
Collaborator

gonzalobg commented Jul 12, 2023

We should ensure that this:

void test(double* in, double* out, size_t n) {
    thrust::copy(thrust::device, in, in + n, out);
}

calls cudaMemcpy or cudaMemcpyAsync with cudaMemcpyDefault.

Right now it does not seem to be happening. @jrhemstad

@miscco
Copy link
Collaborator

miscco commented Jul 12, 2023

Thanks a lot for raising this potential performance issue. We have moved to our new mono repo, I will duplicate the issue there.

@jrhemstad
Copy link
Collaborator

Thanks a lot for raising this potential performance issue. We have moved to our new mono repo, I will duplicate the issue there.

I'll just transfer it.

@jrhemstad jrhemstad transferred this issue from NVIDIA/thrust Jul 12, 2023
@jrhemstad
Copy link
Collaborator

So it seems that what we're missing here is that we should detect when the input/output iterator satisfy is_contiguous_iterator we should just use a memcpy?

@miscco miscco changed the title Ensure cudaMemcpy is called by thrust::copy [FEA]: Ensure cudaMemcpy is called by thrust::copy Jul 12, 2023
@miscco miscco added nvbug Has an associated internal NVIDIA NVBug. thrust For all items related to Thrust. libcu++ For all items related to libcu++ bug Something isn't working right. labels Jul 12, 2023
@miscco miscco changed the title [FEA]: Ensure cudaMemcpy is called by thrust::copy [BUG]: Ensure cudaMemcpy is called by thrust::copy Jul 12, 2023
@miscco
Copy link
Collaborator

miscco commented Jul 12, 2023

We are currently in the process of merging the first part of <ranges> so detection of contiguous_iterator should be easy enough soon^TM

@jrhemstad
Copy link
Collaborator

We may want to land this change sooner, and Thrust already has an is_contiguous_iterator trait, so we can use that in the mean time.

@gevtushenko
Copy link
Collaborator

So it seems that what we're missing here is that we should detect when the input/output iterator satisfy is_contiguous_iterator we should just use a memcpy?

@jrhemstad we already identify when it's safe to use memcpy: 1, 2. To my understanding, the issue is that we are using explicit cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost instead of cudaMemcpyDefault that's requested in this issue. @gonzalobg to verify.

miscco added a commit to miscco/cccl that referenced this issue Jul 12, 2023
In case of contigous ranges of trivially relocatable types we can directly utilize `cudaMemcpyAsync` instead of going through transform.

Fixes NVIDIA#210
miscco added a commit to miscco/cccl that referenced this issue Jul 12, 2023
In case of contigous ranges of trivially relocatable types we can directly utilize `cudaMemcpyAsync` instead of going through transform.

Fixes NVIDIA#210
miscco added a commit to miscco/cccl that referenced this issue Jul 12, 2023
In case of contigous ranges of trivially relocatable types we can directly utilize `cudaMemcpyAsync` instead of going through transform.

Fixes NVIDIA#210
miscco added a commit to miscco/cccl that referenced this issue Jul 14, 2023
In case of contigous ranges of trivially relocatable types we can directly utilize `cudaMemcpyAsync` instead of going through transform.

Fixes NVIDIA#210
@miscco miscco closed this as completed in ce6a462 Jul 18, 2023
@gevtushenko
Copy link
Collaborator

The issue was accidentally close, reopening it.

@gevtushenko gevtushenko reopened this Jul 18, 2023
@jrhemstad
Copy link
Collaborator

@gonzalobg could you provide a reproducer for this so we can be sure it is addressed correctly?

@gonzalobg
Copy link
Collaborator Author

The only way I could imagine testing this is with benchmarks: benchmarking memory BW and making sure it matches cudaMemcpy. Is that what you are looking for?

@miscco
Copy link
Collaborator

miscco commented Jul 19, 2023

I believe we were talking about potential bugs, where managed memory is involved and we cal cudaMemcpyAsync with cudaMemcpyDeviceToDevice and it crashes

@gevtushenko
Copy link
Collaborator

@jrhemstad we already identify when it's safe to use memcpy: 1, 2. To my understanding, the issue is that we are using explicit cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost instead of cudaMemcpyDefault that's requested in this issue. @gonzalobg to verify.

@gonzalobg I thought the issue is about using cudaMemcpyDefault. When you specify device execution policy and provide pointers, we'll currently segfault because cudaMemcpyAsync is used with explicit direction:

#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>

int main() {
  constexpr int n = 10;
  thrust::device_vector<int> src(n);
  int dst[n];

  int *src_ptr = thrust::raw_pointer_cast(src.data());
  int *dst_ptr = dst;

  thrust::copy_n(thrust::device, src_ptr, n, dst_ptr);
}

// terminate called after throwing an instance of 'thrust::system::system_error'
// what():  __copy:: D->D: failed: cudaErrorInvalidValue: invalid argument

@gonzalobg
Copy link
Collaborator Author

The issue I am running into is H2D copies being slow for iterators of raw pointer type (T*) because instead of using cudaMemcpy with cudaMemcpyDefault we are doing something else in those cases.

These pointers are allocated with cudaMallocManaged or malloc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working right. libcu++ For all items related to libcu++ nvbug Has an associated internal NVIDIA NVBug. thrust For all items related to Thrust.
Projects
Status: Todo
Development

No branches or pull requests

4 participants