Is this a duplicate?
Type of Bug
Performance
Component
Thrust
Describe the bug
The thrust::transform_reduce(thrust::device, ... flow is the following:
get_value: cudaMemcpyAsync + cudaStreamSynchronize
The cudaStreamSynchronize in step 3 is unnecessary when get_value is going to perform a stream-ordered operation and synchronize.
This was discovered in TeaLeaf and impacts std::transform_reduce performance.
How to Reproduce
Use the transform reduce algorithm on trivial types on device memory or unified memory.
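
A minimal sketch of a call that should exercise this path, offered as an illustration and not as the reporter's actual reproducer. It assumes the code is built with a compiler that offloads C++ parallel algorithms to the GPU through Thrust (for example nvc++ with -stdpar=gpu, where heap allocations become unified memory); with an ordinary host compiler it simply runs on the CPU. The helper name sum_of_squares and the choice of double are hypothetical.

```cpp
#include <execution>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical helper: transform-reduce over a trivial type (double).
// Under the assumed GPU offload, this call is expected to go through
// Thrust's transform_reduce and hit the synchronization pattern
// described in this issue.
double sum_of_squares(const std::vector<double>& values) {
    return std::transform_reduce(
        std::execution::par_unseq,        // parallel policy, offloaded under the stated assumption
        values.begin(), values.end(),
        0.0,                              // initial value of the reduction
        std::plus<>{},                    // reduction: sum
        [](double x) { return x * x; });  // transformation: square each element
}
```

Calling sum_of_squares(std::vector<double>(1000, 2.0)) returns 4000.0. Inspecting a CUDA API trace of such a call (for example with Nsight Systems) is one way to count the cudaStreamSynchronize calls issued per reduction.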
Expected behavior
No unnecessary cudaStreamSynchronize.
Reproduction link
No response
Operating System
No response
nvidia-smi output
No response
NVCC version
No response