This repository has been archived by the owner on Mar 21, 2024. It is now read-only.
CUB 1.9.8 (CUDA 11.0 Early Access)
Summary
CUB 1.9.8 is the first release of CUB to be officially supported and included in the CUDA Toolkit.
When compiling CUB in C++11 mode, CUB now caches calls to CUDA attribute query APIs, which improves performance of these queries by 20x to 50x when they are called concurrently by multiple host threads.
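The caching idea described above can be sketched in plain C++11. This is only an illustration of the technique, not CUB's actual implementation: `slow_query_sm_version`, `cached_sm_version`, `kMaxDevices`, and the returned values are all hypothetical stand-ins, and the lock-guarded function merely imitates a CUDA attribute query that serializes callers on a shared mutex.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical stand-in for a slow, lock-guarded runtime query such as
// cudaDeviceGetAttribute. It counts how often it is really invoked.
std::mutex g_runtime_mutex;
std::atomic<int> g_slow_query_calls{0};

int slow_query_sm_version(int device) {
    std::lock_guard<std::mutex> lock(g_runtime_mutex);  // contended under load
    g_slow_query_calls.fetch_add(1);
    return 700 + device;  // made-up value, for illustration only
}

constexpr int kMaxDevices = 8;  // assumption: small fixed device limit

// One cache slot per device; 0 means "not queried yet" in this sketch.
struct Slot {
    std::atomic<int> value{0};
};

int cached_sm_version(int device) {
    static Slot cache[kMaxDevices];
    if (device < 0 || device >= kMaxDevices) {
        return slow_query_sm_version(device);  // out of range: no caching
    }
    int v = cache[device].value.load(std::memory_order_acquire);
    if (v == 0) {
        // Cache miss. Several threads may race here and each run the slow
        // query once; they all store the same value, so the race is benign.
        v = slow_query_sm_version(device);
        cache[device].value.store(v, std::memory_order_release);
    }
    return v;  // later calls take only the lock-free load above
}
```

After the first call for a given device, `cached_sm_version` never touches the mutex again, which is why repeated queries from many host threads stop contending with each other.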
Enhancements
- (C++11 or later) Cache calls to `cudaFuncGetAttributes` and `cudaDeviceGetAttribute` within `cub::PtxVersion` and `cub::SmVersion`. These CUDA APIs acquire locks on the CUDA driver/runtime mutex and perform poorly under contention; with the caching, they are 20x to 50x faster when called concurrently. Thanks to Bilge Acun for bringing this issue to our attention.
- `DispatchReduce` now takes an `OutputT` template parameter so that users can specify the intermediate type explicitly.
- Update radix sort tuning policies to fix performance issues for element types smaller than 4 bytes.
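To see why an explicitly chosen intermediate type matters in a reduction, consider the following host-only sketch. It does not use CUB or reproduce the `DispatchReduce` interface; `reduce_sum`, `sample_items`, and the `AccumT` parameter are hypothetical names introduced here purely for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not CUB's DispatchReduce): sums `items` while
// holding the running total in AccumT, the intermediate type.
template <typename AccumT, typename InputT>
AccumT reduce_sum(const std::vector<InputT>& items) {
    AccumT acc = AccumT();
    for (const InputT& x : items) {
        // The addition is narrowed back to AccumT on every step.
        acc = static_cast<AccumT>(acc + x);
    }
    return acc;
}

// Four 8-bit items of value 100; their true sum is 400.
inline std::vector<std::uint8_t> sample_items() {
    return std::vector<std::uint8_t>(4, 100);
}
```

With these inputs, `reduce_sum<std::uint8_t>(sample_items())` wraps around modulo 256 and yields 144, while `reduce_sum<int>(sample_items())` yields the expected 400. Letting the caller name the intermediate type avoids this kind of silent overflow.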
Bug Fixes
- Change initialization style from copy initialization to direct initialization (which is more permissive) in `AgentReduce` to allow a wider range of types to be used with it.
- Fix bad signed/unsigned comparisons in `WarpReduce`.
- Fix computation of valid lanes in warp-level reduction primitive to correctly handle the case where there are 0 input items per warp.
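The difference between copy initialization and direct initialization mentioned in the first fix can be shown with a small standalone example. The `Meters` type and the `make_direct` helper are hypothetical and unrelated to the actual `AgentReduce` code; they only demonstrate why direct initialization accepts more types.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical value type whose converting constructor is explicit.
struct Meters {
    double value;
    explicit Meters(double v) : value(v) {}
};

// Direct initialization, `T x(init);`, may call explicit constructors.
template <typename T, typename InitT>
T make_direct(const InitT& init) {
    T x(init);
    return x;
}

// Copy initialization, `T x = init;`, needs an implicit conversion and
// would fail to compile for T = Meters, InitT = double.
static_assert(std::is_constructible<Meters, double>::value,
              "direct initialization from double is allowed");
static_assert(!std::is_convertible<double, Meters>::value,
              "copy initialization from double is rejected");
```

Here `make_direct<Meters>(2.5)` compiles and produces a `Meters` holding 2.5, whereas the equivalent helper written with `T x = init;` would reject `Meters` at compile time.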