This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Releases: NVIDIA/cub

CUB 1.13.0 (NVIDIA HPC SDK 21.7)

15 Jun 16:43
ae1721b

CUB 1.13.0 is the major release accompanying the NVIDIA HPC SDK 21.7 release.

Notable new features include support for striped data arrangements in block load/store utilities, bfloat16 radix sort support, and fewer restrictions on offset iterators in segmented device algorithms. Several bugs in cub::BlockShuffle, cub::BlockDiscontinuity, and cub::DeviceHistogram have been addressed. The amount of code generated in cub::DeviceScan has been greatly reduced, leading to significant compile-time improvements when targeting multiple PTX architectures.

This release also includes several user-contributed documentation fixes that will be reflected in CUB's online documentation in the coming weeks.

Breaking Changes

  • #320: Deprecated cub::TexRefInputIterator<T, UNIQUE_ID>. Use cub::TexObjInputIterator<T> as a replacement.

New Features

  • #274: Add BLOCK_LOAD_STRIPED and BLOCK_STORE_STRIPED functionality to cub::BlockLoadAlgorithm and cub::BlockStoreAlgorithm. Thanks to Matthew Nicely (@mnicely) for this contribution.
  • #291: cub::DeviceSegmentedRadixSort and cub::DeviceSegmentedReduce now support different types for begin/end offset iterators. Thanks to Sergey Pavlov (@psvvsp) for this contribution.
  • #306: Add bfloat16 support to cub::DeviceRadixSort. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.
  • #320: Introduce a new CUB_IGNORE_DEPRECATED_API macro that disables deprecation warnings on Thrust and CUB APIs.
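The striped arrangement added in #274 differs from the blocked arrangement only in how a thread's items map to offsets within a tile. The host-side sketch below illustrates that mapping. It does not call CUB, and the helper names (blocked_index, striped_index, striped_offsets) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Blocked arrangement: each thread owns a contiguous run of items.
inline std::size_t blocked_index(std::size_t thread, std::size_t item,
                                 std::size_t items_per_thread) {
  assert(item < items_per_thread);
  return thread * items_per_thread + item;
}

// Striped arrangement: consecutive items of one thread are
// block_threads elements apart.
inline std::size_t striped_index(std::size_t thread, std::size_t item,
                                 std::size_t block_threads) {
  assert(thread < block_threads);
  return item * block_threads + thread;
}

// Host-side emulation: the tile offsets a single thread would touch
// when its items are laid out in a striped arrangement.
inline std::vector<std::size_t> striped_offsets(std::size_t thread,
                                                std::size_t block_threads,
                                                std::size_t items_per_thread) {
  std::vector<std::size_t> offsets;
  offsets.reserve(items_per_thread);
  for (std::size_t item = 0; item < items_per_thread; ++item) {
    offsets.push_back(striped_index(thread, item, block_threads));
  }
  return offsets;
}
```

With 4 threads and 3 items per thread, thread 1 touches offsets 1, 5, and 9 in the striped arrangement, and offsets 3, 4, and 5 in the blocked arrangement.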

Bug Fixes

  • #277: Fixed sanitizer warnings in RadixSortScanBinsKernels. Thanks to Andy Adinets (@canonizer) for this contribution.
  • #287: cub::DeviceHistogram now correctly handles cases where OffsetT is not an int. Thanks to Dominique LaSalle (@nv-dlasalle) for this contribution.
  • #311: Fixed several bugs and added tests for the cub::BlockShuffle collective operations.
  • #312: Eliminate unnecessary kernel instantiations when compiling cub::DeviceScan. Thanks to Elias Stehle (@elstehle) for this contribution.
  • #319: Fixed out-of-bounds memory access on debugging builds of cub::BlockDiscontinuity::FlagHeadsAndTails.
  • #320: Fixed harmless missing return statement warning in unreachable cub::TexObjInputIterator code path.

Other Enhancements

  • Several documentation fixes are included in this release.
    • #275: Fixed comments describing the cub::If and cub::Equals utilities. Thanks to Rukshan Jayasekara (@rukshan99) for this contribution.
    • #290: Documented that cub::DeviceSegmentedReduce will produce consistent results run-to-run on the same device for pseudo-associative reduction operators. Thanks to Himanshu (@himanshu007-creator) for this contribution.
    • #298: CONTRIBUTING.md now refers to Thrust's build instructions for developer builds, which is the preferred way to build the CUB test harness. Thanks to Xiang Gao (@zasdfgbnm) for contributing.
    • #301: Expand cub::DeviceScan documentation to include in-place support and add tests. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.
    • #307: Expand cub::DeviceRadixSort and cub::BlockRadixSort documentation to clarify stability, in-place support, and type-specific bitwise transformations. Thanks to Himanshu (@himanshu007-creator) for contributing.
    • #316: Move WARP_TIME_SLICING documentation to the correct location. Thanks to Peter Han (@Peter9606) for this contribution.
    • #321: Update URLs from deprecated github.com to preferred github.io. Thanks to Lilo Huang (@lilohuang) for this contribution.
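The run-to-run guarantee documented in #290 matters because pseudo-associative operators, such as floating-point addition, can give different results when the order of evaluation changes. The sketch below reduces the same values in two orders on the host. It is only an illustration of the effect and says nothing about the order CUB itself uses.

```cpp
#include <cassert>
#include <vector>

// Reduce front to back: ((0 + v[0]) + v[1]) + ...
inline float reduce_left(const std::vector<float>& v) {
  float acc = 0.0f;
  for (float x : v) {
    acc += x;
  }
  return acc;
}

// Reduce back to front: ((0 + v[n-1]) + v[n-2]) + ...
inline float reduce_right(const std::vector<float>& v) {
  float acc = 0.0f;
  for (auto it = v.rbegin(); it != v.rend(); ++it) {
    acc += *it;
  }
  return acc;
}
```

For the input {1e30f, -1e30f, 1.0f}, reduce_left returns 1.0f, while reduce_right returns 0.0f because the 1.0f is absorbed when it is added to -1e30f first.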

CUB 1.12.1 (CUDA Toolkit 11.4)

15 Jun 16:39

CUB 1.12.1 is a trivial patch release that slightly changes the phrasing of a deprecation message.

CUB 1.12.0 (NVIDIA HPC SDK 21.3, CUDA Toolkit 11.4)

23 Feb 19:02
7cdf6df

Summary

CUB 1.12.0 is a bugfix release accompanying the NVIDIA HPC SDK 21.3 release and the CUDA Toolkit 11.4 release.

Radix sort is now stable when both +0.0 and -0.0 are present in the input (they are treated as equivalent).
Many compilation warnings and subtle overflow bugs were fixed in the device algorithms, including a long-standing bug that returned invalid temporary storage requirements when num_items was close to (but not exceeding) INT32_MAX.
Support for Clang < 7.0 and MSVC < 2019 (aka 19.20/16.0/14.20) is now deprecated.

Breaking Changes

  • #256: Deprecate Clang < 7 and MSVC < 2019.

New Features

  • #218: Radix sort now treats -0.0 and +0.0 as equivalent for floating point types, which is required for the sort to be stable. Thanks to Andy Adinets for this contribution.
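To see why signed zeros need special handling, consider the usual order-preserving bit transform for IEEE-754 keys: flip every bit of a negative value, and flip only the sign bit otherwise. Under that transform, -0.0 and +0.0 map to different keys. The sketch below normalizes -0.0 to +0.0 first. It is one possible approach with hypothetical helper names, not necessarily how CUB implements #218.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret the bits of a float as a 32-bit unsigned integer.
inline std::uint32_t float_bits(float f) {
  std::uint32_t u;
  std::memcpy(&u, &f, sizeof u);
  return u;
}

// Order-preserving transform: negative values have all bits flipped,
// non-negative values have only the sign bit flipped.
inline std::uint32_t twiddle_in(std::uint32_t bits) {
  const std::uint32_t high_bit = 0x80000000u;
  const std::uint32_t mask = (bits & high_bit) ? 0xFFFFFFFFu : high_bit;
  return bits ^ mask;
}

// Assumed handling of signed zero: map the bit pattern of -0.0 to +0.0
// before the transform so that both zeros produce the same key.
inline std::uint32_t radix_key(float f) {
  std::uint32_t bits = float_bits(f);
  if (bits == 0x80000000u) {
    bits = 0u;
  }
  return twiddle_in(bits);
}
```

Without the normalization step the two zeros get distinct keys, so a sort on those keys would always place -0.0 before +0.0 instead of preserving their input order.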

Bug Fixes

  • #247: Suppress newly triggered warnings in Clang. Thanks to Andrew Corrigan for this contribution.
  • #249: Enable stricter warning flags. This fixes a number of outstanding issues:
    • #221: Overflow in temp_storage_bytes when num_items close to (but not over) INT32_MAX.
    • #228: CUB uses non-standard C++ extensions that break strict compilers.
    • #257: Warning when compiling GridEvenShare with unsigned offsets.
  • #258: Use correct OffsetT in DispatchRadixSort::InitPassConfig. Thanks to Felix Kallenborn for this contribution.
  • #259: Remove some problematic __forceinline__ annotations.
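Overflows such as #221 typically come from intermediate values rather than from num_items itself. The sketch below shows one common pattern, a round-up division whose intermediate sum no longer fits in 32 bits, alongside an overflow-free alternative. It is an illustration under that assumption, not a reproduction of the exact expression fixed in CUB, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Evaluates the intermediate of the naive round-up division
// (num_items + tile_size - 1) / tile_size in 64 bits, and reports
// whether a 32-bit signed int could not hold it.
inline bool naive_round_up_overflows_int32(std::int32_t num_items,
                                           std::int32_t tile_size) {
  assert(num_items >= 0 && tile_size > 0);
  const std::int64_t intermediate = std::int64_t{num_items} + tile_size - 1;
  return intermediate > std::numeric_limits<std::int32_t>::max();
}

// Overflow-free tile count: divide first, then add one for a partial tile.
inline std::int32_t num_tiles(std::int32_t num_items, std::int32_t tile_size) {
  assert(num_items >= 0 && tile_size > 0);
  return num_items / tile_size + (num_items % tile_size != 0 ? 1 : 0);
}
```

With num_items equal to INT32_MAX - 10 and a tile size of 512, the naive intermediate exceeds INT32_MAX even though num_items does not, while num_tiles returns 4194304 without overflowing.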

Other Enhancements

  • #123: Fix incorrect issue number in changelog. Thanks to Peet Whittaker for this contribution.

CUB 1.11.0 (CUDA Toolkit 11.3)

23 Nov 19:12

Summary

CUB 1.11.0 is a major release providing bugfixes and performance enhancements. It includes a new DeviceRadixSort backend that improves performance by up to 2x on supported keys and hardware. Our CMake package and build system continue to see improvements with add_subdirectory support, installation rules, status messages, and other features that make CUB easier to use from CMake projects. The release includes several other bugfixes and modernizations, and received updates from 11 contributors.

Breaking Changes

  • #201: The intermediate accumulator type used when DeviceScan is invoked with different input/output types is now consistent with P0571. This may produce different results for some edge cases when compared with earlier releases of CUB.
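The choice of accumulator type is visible whenever the input and output types differ. The host-side sketch below runs the same inclusive sum with two accumulator types. The inclusive_sum helper is hypothetical, and the example does not assert which type CUB selects for a given overload; it only shows why a change in that choice can alter results.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Inclusive sum with an explicitly chosen accumulator type.
// AccumT and OutT are given by the caller; InT is deduced.
template <typename AccumT, typename OutT, typename InT>
std::vector<OutT> inclusive_sum(const std::vector<InT>& in) {
  std::vector<OutT> out;
  out.reserve(in.size());
  if (in.empty()) {
    return out;
  }
  AccumT acc = static_cast<AccumT>(in[0]);
  out.push_back(static_cast<OutT>(acc));
  for (std::size_t i = 1; i < in.size(); ++i) {
    // The running total is rounded to AccumT after every step.
    acc = static_cast<AccumT>(acc + in[i]);
    out.push_back(static_cast<OutT>(acc));
  }
  return out;
}
```

For the float input {0.5, 0.5, 0.5, 0.5} written to an int output, accumulating in float yields {0, 1, 1, 2}, whereas accumulating in int yields {0, 0, 0, 0}.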

New Features

  • #204: Faster DeviceRadixSort, up to 2x performance increase for 32/64-bit keys on Pascal and up (SM60+). Thanks to Andy Adinets for this contribution.
  • Unroll loops in BlockRadixRank to improve performance for 32-bit keys by 1.5-2x on Clang CUDA. Thanks to Justin Lebar for this contribution.
  • #200: Allow CUB to be added to CMake projects via add_subdirectory.
  • #214: Optionally add install rules when included with CMake's add_subdirectory. Thanks to Kai Germaschewski for this contribution.

Bug Fixes

  • #215: Fix integer truncation in AgentReduceByKey, AgentScan, and AgentSegmentFixup. Thanks to Rory Mitchell for this contribution.
  • #225: Fix compile-time regression when defining CUB_NS_PREFIX/CUB_NS_POSTFIX macro. Thanks to Elias Stehle for this contribution.
  • #210: Fix some edge cases in DeviceScan:
    • Use values from the input when padding temporary buffers. This prevents custom functors from getting unexpected values.
    • Prevent integer truncation when using large indices via the DispatchScan layer.
    • Use timesliced reads/writes for types > 128 bytes.
  • #217: Fix and add test for cmake package install rules. Thanks to Keith Kraus and Kai Germaschewski for testing and discussion.
  • #170, #233: Update CUDA version checks to behave on Clang CUDA and nvc++. Thanks to Artem Belevich, Andrew Corrigan, and David Olsen for these contributions.
  • #220, #216: Various fixes for Clang CUDA. Thanks to Andrew Corrigan for these contributions.
  • #231: Fix signedness mismatch warnings in unit tests.
  • #231: Suppress GPU deprecation warnings.
  • #214: Use semantic versioning rules for our CMake package's compatibility checks. Thanks to Kai Germaschewski for this contribution.
  • #214: Use FindPackageHandleStandardArgs to print standard status messages when our CMake package is found. Thanks to Kai Germaschewski for this contribution.
  • #207: Fix CubDebug usage in CachingDeviceAllocator::DeviceAllocate. Thanks to Andreas Hehn for this contribution.
  • Fix documentation for DevicePartition. Thanks to ByteHamster for this contribution.
  • Clean up unused code in DispatchScan. Thanks to ByteHamster for this contribution.

Other Enhancements

  • #213: Remove tuning policies for unsupported hardware (<SM35).
  • References to the old GitHub repository and branch names were updated.
    • GitHub's thrust/cub repository is now NVIDIA/cub.
    • Development has moved from the master branch to the main branch.

CUB 1.10.0 (NVIDIA HPC SDK 20.9, CUDA Toolkit 11.2)

23 Sep 02:20

Summary

CUB 1.10.0 is the major release accompanying the NVIDIA HPC SDK 20.9 release and the CUDA Toolkit 11.2 release. It drops support for C++03, GCC < 5, Clang < 6, and MSVC < 2017. It also overhauls CMake support. Finally, we now have a Code of Conduct for contributors: https://github.com/thrust/cub/blob/main/CODE_OF_CONDUCT.md

Breaking Changes

  • C++03 is no longer supported.
  • GCC < 5, Clang < 6, and MSVC < 2017 are no longer supported.
  • C++11 is deprecated. Using this dialect will generate a compile-time warning. These warnings can be suppressed by defining CUB_IGNORE_DEPRECATED_CPP_DIALECT or CUB_IGNORE_DEPRECATED_CPP_11. Suppression is only a short term solution. We will be dropping support for C++11 in the near future.
  • CMake < 3.15 is no longer supported.
  • The default branch on GitHub is now called main.

Other Enhancements

Bug Fixes

  • NVIDIA/thrust#1244: Check for macro collisions with system headers during header testing.
  • thrust/thrust#1153: Switch to placement new instead of assignment to construct items in uninitialized memory. Thanks to Hugh Winkler for this contribution.
  • thrust/cub#38: Fix cub::DeviceHistogram for size_t OffsetTs. Thanks to Leo Fang for this contribution.
  • thrust/cub#35: Fix GCC-5 maybe-uninitialized warning. Thanks to Rong Ou for this contribution.
  • thrust/cub#36: Qualify namespace for va_printf in _CubLog. Thanks to Andrei Tchouprakov for this contribution.
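The placement-new fix (thrust/thrust#1153) reflects a general C++ rule: assigning to an object requires that the object already exists, so elements in uninitialized memory have to be constructed instead. A minimal host-side sketch of the pattern follows; construct_copies and destroy_n are hypothetical helpers, not Thrust or CUB functions.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <string>

// Construct n copies of value in raw, suitably aligned storage.
// Assignment would be wrong here because no T objects exist yet.
// Returns a pointer to the first constructed element.
template <typename T>
T* construct_copies(void* storage, std::size_t n, const T& value) {
  assert(storage != nullptr);
  T* slots = static_cast<T*>(storage);
  T* first = nullptr;
  for (std::size_t i = 0; i < n; ++i) {
    T* constructed = ::new (static_cast<void*>(slots + i)) T(value);
    if (i == 0) {
      first = constructed;
    }
  }
  return first;
}

// Destroy n elements that were created with construct_copies.
template <typename T>
void destroy_n(T* first, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    first[i].~T();
  }
}
```

A caller would typically pair the two, for example by constructing three std::string copies in an aligned byte buffer and calling destroy_n before the buffer goes out of scope.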

CUB 1.9.10-1 (NVIDIA HPC SDK 20.7, CUDA Toolkit 11.1)

23 Sep 02:19

Summary

CUB 1.9.10-1 is the minor release accompanying the NVIDIA HPC SDK 20.7 release and the CUDA Toolkit 11.1 release.

Bug Fixes

  • #1217: Move static local in cub::DeviceCount to a separate host-only function because NVC++ doesn't support static locals in host-device functions.

CUB 1.9.10 (NVIDIA HPC SDK 20.5)

19 May 09:22

Summary

CUB 1.9.10 is the release accompanying the NVIDIA HPC SDK 20.5 release. It adds CMake find_package support. C++03, C++11, GCC < 5, Clang < 6, and MSVC < 2017 are now deprecated. Starting with the upcoming 1.10.0 release, C++03 support will be dropped entirely.

Breaking Changes

  • Thrust now checks that it is compatible with the version of CUB found in your include path, generating an error if it is not. If you are using your own version of CUB, it may be too old. It is recommended to simply delete your own version of CUB and use the version of CUB that comes with Thrust.
  • C++03 and C++11 are deprecated. Using these dialects will generate a compile-time warning. These warnings can be suppressed by defining CUB_IGNORE_DEPRECATED_CPP_DIALECT (to suppress C++03 and C++11 deprecation warnings) or CUB_IGNORE_DEPRECATED_CPP_11 (to suppress C++11 deprecation warnings). Suppression is only a short term solution. We will be dropping support for C++03 in the 1.10.0 release and C++11 in the near future.
  • GCC < 5, Clang < 6, and MSVC < 2017 are deprecated. Using these compilers will generate a compile-time warning. These warnings can be suppressed by defining CUB_IGNORE_DEPRECATED_COMPILER. Suppression is only a short term solution. We will be dropping support for these compilers in the near future.

New Features

  • CMake find_package support. Point CMake at the cmake folder in your CUB include directory (e.g., cmake -DCUB_DIR=/usr/local/cuda/include/cub/cmake/ .), then add CUB to your CMake project with find_package(CUB REQUIRED CONFIG).

CUB 1.9.9 (CUDA 11.0)

19 May 09:13

Summary

CUB 1.9.9 is the release accompanying the CUDA Toolkit 11.0 release. It introduces CMake support, version macros, platform detection machinery, and support for NVC++, which uses Thrust (and thus CUB) to implement GPU-accelerated C++17 Parallel Algorithms. Additionally, the scan dispatch layer was refactored and modernized. C++03, C++11, GCC < 5, Clang < 6, and MSVC < 2017 are now deprecated. Starting with the upcoming 1.10.0 release, C++03 support will be dropped entirely.

Breaking Changes

  • Thrust now checks that it is compatible with the version of CUB found in your include path, generating an error if it is not. If you are using your own version of CUB, it may be too old. It is recommended to simply delete your own version of CUB and use the version of CUB that comes with Thrust.
  • C++03 and C++11 are deprecated. Using these dialects will generate a compile-time warning. These warnings can be suppressed by defining CUB_IGNORE_DEPRECATED_CPP_DIALECT (to suppress C++03 and C++11 deprecation warnings) or CUB_IGNORE_DEPRECATED_CPP_11 (to suppress C++11 deprecation warnings). Suppression is only a short term solution. We will be dropping support for C++03 in the 1.10.0 release and C++11 in the near future.
  • GCC < 5, Clang < 6, and MSVC < 2017 are deprecated. Using these compilers will generate a compile-time warning. These warnings can be suppressed by defining CUB_IGNORE_DEPRECATED_COMPILER. Suppression is only a short term solution. We will be dropping support for these compilers in the near future.

New Features

  • CMake support. Thanks to Francis Lemaire for this contribution.
  • Refactored and modernized scan dispatch layer. Thanks to Francis Lemaire for this contribution.
  • Policy hooks for device-wide reduce, scan, and radix sort facilities to simplify tuning and allow users to provide custom policies. Thanks to Francis Lemaire for this contribution.
  • <cub/version.cuh>: CUB_VERSION, CUB_VERSION_MAJOR, CUB_VERSION_MINOR, CUB_VERSION_SUBMINOR, and CUB_PATCH_NUMBER.
  • Platform detection machinery:
    • <cub/util_cpp_dialect.cuh>: Detects the C++ standard dialect.
    • <cub/util_compiler.cuh>: host and device compiler detection.
    • <cub/util_deprecated.cuh>: CUB_DEPRECATED.
    • <cub/config.cuh>: Includes <cub/util_arch.cuh>, <cub/util_compiler.cuh>, <cub/util_cpp_dialect.cuh>, <cub/util_deprecated.cuh>, <cub/util_macro.cuh>, and <cub/util_namespace.cuh>.
  • cub::DeviceCount and cub::DeviceCountUncached, caching abstractions for cudaGetDeviceCount.
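A single integer version macro is convenient for preprocessor comparisons, but it has to be decoded to recover the individual components. The sketch below assumes the value is packed as major * 100000 + minor * 100 + subminor, which is an assumption about the encoding of CUB_VERSION and should be checked against <cub/version.cuh>. The Version struct and decode_version function are hypothetical.

```cpp
#include <cassert>

// Components of a packed version number (hypothetical helper type).
struct Version {
  int major_version;
  int minor_version;
  int subminor_version;
};

// Decode a packed version, assuming the layout
//   major * 100000 + minor * 100 + subminor.
constexpr Version decode_version(int packed) {
  return Version{packed / 100000, (packed / 100) % 1000, packed % 100};
}
```

Under that assumption, 100909 decodes to 1.9.9 and 101300 decodes to 1.13.0.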

Other Enhancements

  • Lazily initialize the per-device CUDA attribute caches, because CUDA context creation is expensive and adds up with large CUDA binaries on machines with many GPUs. Thanks to the NVIDIA PyTorch team for bringing this to our attention.
  • Make cub::SwitchDevice avoid setting/resetting the device if the current device is the same as the target device.

Bug Fixes

  • Add explicit failure parameter to CAS in the CUB attribute cache to work around a GCC 4.8 bug.
  • Revert a change in reductions that changed the signedness of the lane_id variable to suppress a warning, as this introduces a bug in optimized device code.
  • Fix initialization in cub::ExclusiveSum. Thanks to Conor Hoekstra for this contribution.
  • Fix initialization of the std::array in the CUB attribute cache.
  • Fix -Wsign-compare warnings. Thanks to Elias Stehle for this contribution.
  • Fix test_block_reduce.cu to build without parameters. Thanks to Francis Lemaire for this contribution.
  • Add missing includes to grid_even_share.cuh. Thanks to Francis Lemaire for this contribution.
  • Add missing includes to thread_search.cuh. Thanks to Francis Lemaire for this contribution.
  • Add missing includes to cub.cuh. Thanks to Felix Kallenborn for this contribution.

CUB 1.9.8-1 (NVIDIA HPC SDK 20.3)

19 May 09:05

Summary

CUB 1.9.8-1 is a variant of 1.9.8 accompanying the NVIDIA HPC SDK 20.3 release. It contains modifications necessary to serve as the implementation of NVC++'s GPU-accelerated C++17 Parallel Algorithms.

CUB 1.9.8 (CUDA 11.0 Early Access)

19 May 09:02

Summary

CUB 1.9.8 is the first release of CUB to be officially supported and included in the CUDA Toolkit.
When compiling CUB in C++11 mode, CUB now caches calls to CUDA attribute query APIs, which improves performance of these queries by 20x to 50x when they are called concurrently by multiple host threads.

Enhancements

  • (C++11 or later) Cache calls to cudaFuncGetAttributes and cudaDeviceGetAttribute within cub::PtxVersion and cub::SmVersion. These CUDA APIs acquire locks on the CUDA driver/runtime mutex and perform poorly under contention; with caching, they are 20x to 50x faster when called concurrently. Thanks to Bilge Acun for bringing this issue to our attention.
  • DispatchReduce now takes an OutputT template parameter so that users can specify the intermediate type explicitly.
  • Radix sort tuning policies were updated to fix performance issues for element types smaller than 4 bytes.
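The attribute caching described above can be pictured as a small per-device table that runs the expensive query once and serves later calls from memory. The following is a minimal C++11 sketch of that idea, not CUB's implementation: AttributeCache is a hypothetical class, the query callback stands in for a call such as cudaDeviceGetAttribute, and error handling is omitted.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>

constexpr int kEmpty = 0;  // no value stored yet
constexpr int kBusy = 1;   // one thread is running the query
constexpr int kReady = 2;  // value is stored and safe to read

struct Entry {
  std::atomic<int> state{kEmpty};
  int value{0};
};

template <std::size_t MaxDevices>
class AttributeCache {
 public:
  // Returns the cached value for device, invoking query at most once
  // per device even when called from several threads.
  int get(std::size_t device, const std::function<int(std::size_t)>& query) {
    assert(device < MaxDevices);
    Entry& entry = entries_[device];
    if (entry.state.load(std::memory_order_acquire) != kReady) {
      int expected = kEmpty;
      // Success and failure memory orders are both given explicitly.
      if (entry.state.compare_exchange_strong(expected, kBusy,
                                              std::memory_order_acq_rel,
                                              std::memory_order_acquire)) {
        entry.value = query(device);
        entry.state.store(kReady, std::memory_order_release);
      } else {
        // Another thread claimed the entry; wait until it publishes.
        while (entry.state.load(std::memory_order_acquire) != kReady) {
          std::this_thread::yield();
        }
      }
    }
    return entry.value;
  }

 private:
  std::array<Entry, MaxDevices> entries_{};
};
```

Once an entry is ready, a lookup is a single atomic load, which is the property that removes contention on the underlying lock in the concurrent case.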

Bug Fixes

  • Change initialization style from copy initialization to direct initialization (which is more permissive) in AgentReduce to allow a wider range of types to be used with it.
  • Fix bad signed/unsigned comparisons in WarpReduce.
  • Fix computation of valid lanes in warp-level reduction primitive to correctly handle the case where there are 0 input items per warp.
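The AgentReduce change relies on a difference between the two initialization styles: copy initialization requires an implicit conversion, while direct initialization may also use explicit constructors. The sketch below demonstrates this with a hypothetical Meters type and a hypothetical first_partial helper. It is a standalone illustration and does not reproduce AgentReduce itself.

```cpp
#include <cassert>
#include <type_traits>

// A type that can only be built from int explicitly.
struct Meters {
  explicit Meters(int v) : value(v) {}
  int value;
};

// Copy initialization (Meters m = 5;) needs an implicit conversion,
// which the explicit constructor forbids.
static_assert(!std::is_convertible<int, Meters>::value,
              "copy initialization from int is rejected");

// Direct initialization (Meters m(5);) may use explicit constructors.
static_assert(std::is_constructible<Meters, int>::value,
              "direct initialization from int is accepted");

// Generic code that direct-initializes its accumulator accepts both
// implicitly and explicitly constructible types.
template <typename AccumT, typename InputT>
AccumT first_partial(const InputT& input) {
  AccumT acc(input);  // direct initialization
  return acc;
}
```

Here first_partial<Meters>(5) compiles and returns a Meters holding 5, whereas a version written as AccumT acc = input; would fail to compile for Meters.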