This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Releases: NVIDIA/cub

CUB 1.2.3

19 May 07:39

Summary

CUB 1.2.3 is a minor release.

Bug Fixes

  • Fixed access violation bug in cub::DeviceReduce::ReduceByKey for non-primitive value types.
  • Fixed code-snippet bug in ArgIndexInputIteratorT documentation.

CUB 1.2.2

19 May 07:33

Summary

CUB 1.2.2 adds a new variant of cub::BlockReduce and MSVC project solutions for examples.

New Features

  • MSVC project solutions for device-wide and block-wide examples
  • New algorithmic variant of cub::BlockReduce for improved performance when using commutative operators (e.g., numeric addition).

Bug Fixes

  • Inclusion of Thrust headers in a certain order prevented CUB device-wide primitives from working properly.

CUB 1.2.0

19 May 07:32

Summary

CUB 1.2.0 adds cub::DeviceReduce::ReduceByKey and cub::DeviceReduce::RunLengthEncode and support for CUDA 6.0.

New Features

  • cub::DeviceReduce::ReduceByKey.
  • cub::DeviceReduce::RunLengthEncode.
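To make the semantics of the two new primitives concrete, here is a minimal host-side reference sketch in plain C++. It is not CUB's GPU implementation, and the function names `reduce_by_key_reference` and `run_length_encode_reference` are hypothetical names chosen for illustration. It assumes the usual reduce-by-key convention: only runs of consecutive equal keys are merged, and the input is not sorted first. Run-length encoding is then expressed as a reduce-by-key over a vector of ones.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Host-side reference (hypothetical helper, not a CUB API): each run of
// consecutive equal keys collapses to one (key, reduced value) pair.
template <typename Key, typename Value, typename ReduceOp>
std::pair<std::vector<Key>, std::vector<Value>>
reduce_by_key_reference(const std::vector<Key>& keys,
                        const std::vector<Value>& values,
                        ReduceOp op)
{
    assert(keys.size() == values.size());
    std::vector<Key> out_keys;
    std::vector<Value> out_values;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        if (i > 0 && keys[i] == keys[i - 1]) {
            // Same run as the previous item: fold the value into the run total.
            out_values.back() = op(out_values.back(), values[i]);
        } else {
            // A new run starts here.
            out_keys.push_back(keys[i]);
            out_values.push_back(values[i]);
        }
    }
    return {std::move(out_keys), std::move(out_values)};
}

// Run-length encoding as a special case: every item contributes a count of 1
// and the reduction operator is addition.
template <typename Key>
std::pair<std::vector<Key>, std::vector<int>>
run_length_encode_reference(const std::vector<Key>& keys)
{
    std::vector<int> ones(keys.size(), 1);
    return reduce_by_key_reference(keys, ones,
                                   [](int a, int b) { return a + b; });
}
```

For example, with keys {0, 2, 2, 9, 5, 5, 5, 8} the sketch yields the unique keys {0, 2, 9, 5, 8} and the run lengths {1, 2, 1, 3, 1}.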

Other Enhancements

  • Improved cub::DeviceScan, cub::DeviceSelect, cub::DevicePartition performance.
  • Documentation and testing:
    • Added performance-portability plots for many device-wide primitives.
    • Explained iterator (in)compatibilities with CUDA 5.0 (and older) and Thrust 1.6 (and older).
  • Revised the operation of temporary tile status bookkeeping for cub::DeviceScan (and similar) to be safe for current code run on future platforms (now uses proper fences).

Bug Fixes

  • Fix cub::DeviceScan bug where Windows alignment disagreements between host and device regarding user-defined data types would corrupt tile status.
  • Fix cub::BlockScan bug where certain exclusive scans on custom data types for the BLOCK_SCAN_WARP_SCANS variant would return incorrect results for the first thread in the block.
  • Added workaround to make cub::TexRefInputIteratorT work with CUDA 6.0.

CUB 1.1.1

19 May 07:32

Summary

CUB 1.1.1 introduces texture and cache modifier iterators, descending sorting, cub::DeviceSelect, cub::DevicePartition, cub::Shuffle*, and cub::MaxSmOccupancy. Additionally, scan and sort performance for older GPUs has been improved and many bugs have been fixed.

Breaking Changes

  • Refactored block-wide I/O (cub::BlockLoad and cub::BlockStore), removing cache-modifiers from their interfaces. cub::CacheModifiedInputIterator and cub::CacheModifiedOutputIterator should now be used with cub::BlockLoad and cub::BlockStore to effect that behavior.

New Features

  • cub::TexObjInputIterator, cub::TexRefInputIterator, cub::CacheModifiedInputIterator, and cub::CacheModifiedOutputIterator types for loading & storing arbitrary types through the cache hierarchy. They are compatible with Thrust.
  • Descending sorting for cub::DeviceRadixSort and cub::BlockRadixSort.
  • Min, max, arg-min, and arg-max operators for cub::DeviceReduce.
  • cub::DeviceSelect (select-unique, select-if, and select-flagged).
  • cub::DevicePartition (partition-if, partition-flagged).
  • Generic cub::ShuffleUp, cub::ShuffleDown, and cub::ShuffleIndex for warp-wide communication of arbitrary data types (SM3x and up).
  • cub::MaxSmOccupancy for accurately determining SM occupancy for any given kernel function pointer.
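The selection and partitioning variants listed above can be summarized with a host-side reference sketch in plain C++. This is an illustration of the expected output ordering only, not CUB's implementation, and the function names are hypothetical. It assumes the behavior described in the CUB documentation for partitioning: selected items are compacted to the front in their original order, and rejected items fill the rear of the output in reverse order.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Select-if (hypothetical reference helper): keep the items that satisfy
// pred, preserving their input order.
template <typename T, typename Pred>
std::vector<T> select_if_reference(const std::vector<T>& in, Pred pred)
{
    std::vector<T> out;
    std::copy_if(in.begin(), in.end(), std::back_inserter(out), pred);
    return out;
}

// Select-unique: keep the first item of each run of consecutive equal items.
template <typename T>
std::vector<T> select_unique_reference(const std::vector<T>& in)
{
    std::vector<T> out;
    std::unique_copy(in.begin(), in.end(), std::back_inserter(out));
    return out;
}

// Partition-if: the output has the same length as the input. Selected items
// go to the front in input order; rejected items are written from the back,
// so they end up in reverse order (assumed to match CUB's documented layout).
// Returns the number of selected items.
template <typename T, typename Pred>
std::size_t partition_if_reference(const std::vector<T>& in, Pred pred,
                                   std::vector<T>& out)
{
    out = in;  // size the output; every slot is overwritten below
    std::size_t front = 0;
    std::size_t back = in.size();
    for (const T& item : in) {
        if (pred(item)) {
            out[front++] = item;
        } else {
            out[--back] = item;
        }
    }
    return front;
}
```

With the input {0, 2, 3, 9, 5, 2, 81, 8} and the predicate "less than 7", the partition sketch produces {0, 2, 3, 5, 2, 8, 81, 9} and reports 5 selected items.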

Other Enhancements

  • Improved cub::DeviceScan and cub::DeviceRadixSort performance for older GPUs (SM1x to SM3x).
  • Renamed device-wide stream_synchronous param to debug_synchronous to avoid confusion about usage.
  • Documentation improvements:
    • Added simple examples of device-wide methods.
    • Improved doxygen documentation and example snippets.
  • Improved test coverage to include up to 21,000 kernel variants and 851,000 unit tests (per architecture, per platform).

Bug Fixes

  • Fixed miscellaneous cub::DeviceScan, cub::BlockScan, cub::DeviceReduce, and cub::BlockReduce bugs when operating on non-primitive types on older architectures (SM1x).
  • SHFL-based scans and reductions produced incorrect results for multi-word types (size > 4B) on Linux.
  • For cub::WarpScan-based scans, not all threads in the first warp were entering the prefix callback functor.
  • cub::DeviceRadixSort had a race condition with key-value pairs for pre-SM35 architectures.
  • cub::DeviceRadixSort bitfield-extract behavior with long keys on 64-bit Linux was incorrect.
  • cub::BlockDiscontinuity failed to compile for types other than int32_t/uint32_t.
  • CUDA Dynamic Parallelism (CDP, i.e. device-callable) versions of device-wide methods now report the same temporary storage allocation size requirement as their host-callable counterparts.

CUB 1.0.2

19 May 07:31

Summary

CUB 1.0.2 is a minor release.

Bug Fixes

  • Corrections to code snippet examples for cub::BlockLoad, cub::BlockStore, and cub::BlockDiscontinuity.
  • Cleaned up unnecessary/missing header includes. You can now safely include a specific .cuh (instead of cub.cuh).
  • Bug/compilation fixes for cub::BlockHistogram.

CUB 1.0.1

19 May 07:31

Summary

CUB 1.0.1 adds cub::DeviceRadixSort and cub::DeviceScan. Numerous other performance and correctness fixes are included.

Breaking Changes

  • New collective interface idiom (specialize/construct/invoke).

New Features

  • cub::DeviceRadixSort. Implements short-circuiting for homogeneous digit passes.
  • cub::DeviceScan. Implements single-pass "adaptive-lookback" strategy.
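The idea behind short-circuiting homogeneous digit passes can be illustrated on the host. The sketch below is a sequential least-significant-digit radix sort in plain C++, not CUB's GPU implementation; the function name `radix_sort_short_circuit` and the choice of 8 bits per pass are assumptions made for illustration. When every key has the same digit in a given pass, a stable scatter would leave the keys where they are, so the pass is skipped.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sorts 32-bit keys in ascending order, 8 bits per pass (hypothetical helper).
// Returns the number of passes that actually moved data.
inline int radix_sort_short_circuit(std::vector<std::uint32_t>& keys)
{
    constexpr int kRadixBits = 8;
    constexpr std::size_t kBins = std::size_t{1} << kRadixBits;
    std::vector<std::uint32_t> scratch(keys.size());
    int passes_executed = 0;

    for (int shift = 0; shift < 32; shift += kRadixBits) {
        // Histogram of the current digit.
        std::array<std::size_t, kBins> histogram{};
        for (std::uint32_t k : keys) {
            ++histogram[(k >> shift) & (kBins - 1)];
        }

        // Short-circuit: if a single bin holds every key, the digit is
        // homogeneous and a stable scatter would be the identity.
        bool homogeneous = false;
        for (std::size_t count : histogram) {
            if (count == keys.size()) {
                homogeneous = true;
                break;
            }
        }
        if (homogeneous) {
            continue;
        }

        // Exclusive prefix sum turns bin counts into starting offsets.
        std::size_t running = 0;
        for (std::size_t& count : histogram) {
            const std::size_t c = count;
            count = running;
            running += c;
        }

        // Stable scatter into the scratch buffer, then swap buffers.
        for (std::uint32_t k : keys) {
            scratch[histogram[(k >> shift) & (kBins - 1)]++] = k;
        }
        keys.swap(scratch);
        ++passes_executed;
    }
    return passes_executed;
}
```

Keys that all fit in 16 bits, for instance, share a zero digit in the two upper passes, so only the two lower passes do any work.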

Other Enhancements

  • Significantly improved documentation (with example code snippets).
  • More extensive regression test suite for aggressively testing collective variants.
  • Allow non-trivially-constructed types (previously unions had prevented aliasing temporary storage of those types).
  • Improved support for SM3x SHFL (collective ops now use SHFL for types larger than 32 bits).
  • Better code generation for 64-bit addressing within cub::BlockLoad/cub::BlockStore.
  • cub::DeviceHistogram now supports histograms of arbitrary bins.
  • Updates to accommodate CUDA 5.5 dynamic parallelism.

Bug Fixes

  • Workarounds for SM10 codegen issues in uncommonly-used cub::WarpScan/cub::WarpReduce specializations.

CUB 0.9.4

19 May 07:30

Summary

CUB 0.9.4 is a minor release.

Enhancements

  • Various documentation updates and corrections.

Bug Fixes

  • Fixed compilation errors for SM1x.
  • Fixed compilation errors for some WarpScan entrypoints on SM3x and up.

CUB 0.9.3

19 May 08:23

Summary

CUB 0.9.3 adds histogram algorithms and work management utility descriptors.

New Features

  • cub::DeviceHistogram256.
  • cub::BlockHistogram256.
  • cub::BlockScan algorithm variant BLOCK_SCAN_RAKING_MEMOIZE, which trades more register consumption for less shared memory I/O.
  • cub::GridQueue, cub::GridEvenShare, work management utility descriptors.

Other Enhancements

  • Updates to cub::BlockRadixRank to use cub::BlockScan, which improves performance on SM3x by using SHFL.
  • Allow types other than builtin types to be used in cub::WarpScan::*Sum methods if they only have operator+ overloaded. Previously they were also required to support assignment from int(0).
  • Update cub::BlockReduce's BLOCK_REDUCE_WARP_REDUCTIONS algorithm to work even when block size is not an even multiple of warp size.
  • Refactoring of cub::DeviceAllocator interface and cub::CachingDeviceAllocator implementation.

CUB 0.9.2

19 May 08:22

Summary

CUB 0.9.2 adds cub::WarpReduce.

New Features

  • cub::WarpReduce, which uses the SHFL instruction when applicable. cub::BlockReduce now uses this cub::WarpReduce instead of implementing its own.

Enhancements

  • Documentation updates and corrections.

Bug Fixes

  • Fixes for 64-bit Linux compilation warnings and errors.

CUB 0.9.1

19 May 08:21

Summary

CUB 0.9.1 is a minor release.

Bug Fixes

  • Fix for ambiguity in cub::BlockReduce::Reduce between generic reduction and summation. Summation entrypoints are now called ::Sum(), similar to the convention in cub::BlockScan.
  • Small edits to documentation and download tracking.