Releases: ROCm/rocPRIM

rocPRIM 3.2.1 for ROCm 6.2.2

27 Sep 16:01
93501cf

rocPRIM code for ROCm 6.2.2 did not change. The library was rebuilt for the updated ROCm 6.2.2 stack.

rocPRIM 3.2.1 for ROCm 6.2.1

20 Sep 19:58
93501cf

Optimizations

  • Improved performance of block_reduce_warp_reduce when warp size == block size.

rocPRIM 3.2.0 for ROCm 6.2.0

02 Aug 16:15
eab1eed

Additions

  • New overloads for warp_scan::exclusive_scan that take no initial value. These new overloads will write an unspecified result to the first value of each warp.
  • The internal accumulator type of inclusive_scan(_by_key) and exclusive_scan(_by_key) is now exposed as an optional type parameter.
    • The default accumulator type is still the value type of the input iterator (inclusive scan) or the initial value's type (exclusive scan).
      This is the same behaviour as before this change.
  • New overload for device_adjacent_difference_inplace that allows separate input and output iterators, but allows them to point to the same element.
  • New public API for deriving resulting type on device-only functions:
    • rocprim::invoke_result
    • rocprim::invoke_result_t
    • rocprim::invoke_result_binary_op
    • rocprim::invoke_result_binary_op_t
  • New rocprim::batch_copy function added. Similar to rocprim::batch_memcpy, but copies by element, not with memcpy.
  • Added more test cases, to better cover supported data types.
  • Updated some tests to work with supported data types.
  • Added an optional decomposer argument for all member functions of rocprim::block_radix_sort and all functions of device_radix_sort.
    To sort keys of a user-defined type, a decomposer functor should be passed. The decomposer should produce a rocprim::tuple
    of references to arithmetic types from the key.
  • New rocprim::predicate_iterator, which acts as a proxy for an underlying iterator based on a predicate.
    It iterates over proxies that hold references to the underlying values, but only allow reading and writing if the predicate is true.
    It can be instantiated with:
    • rocprim::make_predicate_iterator
    • rocprim::make_mask_iterator
  • Added custom radix sizes as the last parameter for block_radix_sort. The default value is 4; it can be a number between 0 and 32.
  • New rocprim::radix_key_codec, which allows the encoding/decoding of keys for radix-based sorts. For user-defined key types, a decomposer functor should be passed.
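
The decomposer idea described above can be illustrated with a minimal host-only sketch. It is not the rocPRIM API: std::tuple stands in for rocprim::tuple, std::sort stands in for the radix sort, and the names custom_key, custom_key_decomposer, and sort_with_decomposer are hypothetical. The sketch also assumes that the first tuple element is the most significant one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical user-defined key type (not part of rocPRIM).
struct custom_key
{
    std::int32_t major;
    std::uint8_t minor;
};

// Decomposer functor: returns a tuple of references to the arithmetic
// members of the key. std::tuple stands in for rocprim::tuple here.
struct custom_key_decomposer
{
    std::tuple<std::int32_t&, std::uint8_t&> operator()(custom_key& key) const
    {
        return std::tie(key.major, key.minor);
    }
};

// Host-side stand-in for a sort that orders keys through a decomposer.
// It compares the decomposed tuples lexicographically with std::sort;
// it is NOT a radix sort and does not call into rocPRIM.
template <class Key, class Decomposer>
void sort_with_decomposer(std::vector<Key>& keys, Decomposer decomposer)
{
    auto less = [&decomposer](const Key& lhs, const Key& rhs) {
        // Copy the keys so the decomposer can take mutable references.
        Key a = lhs;
        Key b = rhs;
        return decomposer(a) < decomposer(b);
    };
    std::sort(keys.begin(), keys.end(), less);
    // Postcondition: the keys are ordered by their decomposed members.
    assert(std::is_sorted(keys.begin(), keys.end(), less));
}
```

Calling sort_with_decomposer(keys, custom_key_decomposer{}) on a std::vector<custom_key> orders the keys by major first and by minor second. The actual device and block radix sort signatures should be taken from the rocPRIM documentation.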

Optimizations

  • Improved the performance of warp_sort_shuffle and block_sort_bitonic.
  • Created an optimized version of the warp_exchange functions blocked_to_striped_shuffle and striped_to_blocked_shuffle for when the warp size is equal to the items per thread.

Fixes

  • Fixed incorrect results of warp_exchange::blocked_to_striped_shuffle and warp_exchange::striped_to_blocked_shuffle when the block size is
    larger than the logical warp size. The test suite has been updated with such cases.
  • Fixed incorrect results returned when calling device unique_by_key with overlapping values_input and values_output.
  • Fixed incorrect output type used in device_adjacent_difference.
  • Hotfix for incorrect results on the GFX10 (Navi 10/RDNA1, Navi 20/RDNA2) ISA and GFX11 ISA (Navi 30 GPUs) on device scan algorithms rocprim::inclusive_scan(_by_key) and rocprim::exclusive_scan(_by_key) with large input types.
  • device_adjacent_difference now considers both the input and the output type for selecting the appropriate kernel launch config. Previously only the input type was considered, which could result in compilation errors due to excessive shared memory usage.
  • Fixed incorrect data being loaded with rocprim::thread_load when compiling with -O0.
  • Fixed a compilation failure in the host compiler when instantiating various block and device algorithms with block sizes not divisible by 64.

Deprecations

  • The internal header detail/match_result_type.hpp has been deprecated.
  • TwiddleIn and TwiddleOut have been deprecated in favor of radix_key_codec.
  • The internal ::rocprim::detail::radix_key_codec has been deprecated in favor of the new public utility with the same name.

rocPRIM 3.1.0 for ROCm 6.1.2

04 Jun 16:53
85253f8

rocPRIM code for ROCm 6.1.2 did not change. The library was rebuilt for the updated ROCm 6.1.2 stack.

rocPRIM 3.1.0 for ROCm 6.1.1

08 May 18:00
85253f8

rocPRIM code for ROCm 6.1.1 did not change. The library was rebuilt for the updated ROCm 6.1.1 stack.

rocPRIM 3.1.0 for ROCm 6.1.0

16 Apr 19:10
435f7f4

Additions

  • New primitive: block_run_length_decode
  • New primitive: batch_memcpy

Changes

  • Renamed:
    • scan_config_v2 to scan_config
    • scan_by_key_config_v2 to scan_by_key_config
    • radix_sort_config_v2 to radix_sort_config
    • reduce_by_key_config_v2 to reduce_by_key_config
  • Removed support for custom config types for device algorithms
  • host_warp_size() was moved into rocprim/device/config_types.hpp; it now uses either device_id or
    a stream parameter to query the proper device and a device_id out parameter
    • The return type is hipError_t
  • Added support for __int128_t in device_radix_sort and block_radix_sort
  • Improved the performance of match_any, and block_histogram which uses it

Deprecations

  • Removed reduce_by_key_config, MatchAny, scan_config, scan_by_key_config, and
    radix_sort_config

Fixes

  • Build issues with rmake.py on Windows when using VS 2017 15.8 or later (due to a breaking fix with
    extended aligned storage)

rocPRIM 3.0.0 for ROCm 6.0.2

31 Jan 20:12
c8297d6

rocPRIM code for ROCm 6.0.2 did not change. The library was rebuilt for the updated ROCm 6.0.2 stack.

rocPRIM 3.0.0 for ROCm 6.0.0

15 Dec 18:31
c8297d6

Added

  • block_sort::sort() overload for keys and values with a dynamic size, for all block sort algorithms. Additionally, all block_sort::sort() overloads with a dynamic size are now supported for block_sort_algorithm::merge_sort and block_sort_algorithm::bitonic_sort.
  • New two-way partition primitive partition_two_way which can write to two separate iterators.
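
As a rough host-side analogue of a two-way partition that writes to two separate outputs, the sketch below uses std::partition_copy from the C++ standard library. It does not call rocPRIM, and the wrapper name partition_two_way_host is hypothetical; the device function's real signature should be taken from the rocPRIM documentation.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <utility>
#include <vector>

// Splits the input into two separate outputs based on a predicate.
// Host-only illustration built on std::partition_copy; not rocPRIM code.
template <class T, class Predicate>
std::pair<std::vector<T>, std::vector<T>>
partition_two_way_host(const std::vector<T>& input, Predicate pred)
{
    std::vector<T> selected; // elements for which pred returns true
    std::vector<T> rejected; // elements for which pred returns false

    std::partition_copy(input.begin(), input.end(),
                        std::back_inserter(selected),
                        std::back_inserter(rejected),
                        pred);

    // Every input element lands in exactly one of the two outputs.
    assert(selected.size() + rejected.size() == input.size());
    return {std::move(selected), std::move(rejected)};
}
```

std::partition_copy copies elements in input order, so both outputs of this sketch keep the relative order of the input.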

Optimizations

  • Improved the performance of partition.

Fixed

  • Fixed rocprim::MatchAny for devices with 64-bit warp size. The function rocprim::MatchAny is deprecated and rocprim::match_any is preferred instead.

rocPRIM 2.13.1 for ROCm 5.7.1

13 Oct 18:57
b54aaa7

rocPRIM code for ROCm 5.7.1 did not change. The library was rebuilt for the updated ROCm 5.7.1 stack.

rocPRIM 2.13.1 for ROCm 5.7.0

15 Sep 17:29
b54aaa7

Changed

  • Deprecated configuration radix_sort_config for device-level radix sort as it no longer matches the algorithm's parameters. New configuration radix_sort_config_v2 is preferred instead.
  • Removed erroneous implementation of device-level inclusive_scan and exclusive_scan. The prior default implementation using lookback-scan is now the only available implementation.
  • The benchmark metric indicating the bytes processed for exclusive_scan_by_key and inclusive_scan_by_key has been changed to incorporate the key type. Furthermore, the benchmark log has been changed such that these algorithms are reported as scan and scan_by_key instead of scan_exclusive and scan_inclusive.
  • Deprecated configurations scan_config and scan_by_key_config for device-level scans, as they no longer match the algorithm's parameters. New configurations scan_config_v2 and scan_by_key_config_v2 are preferred instead.

Fixed

  • Fixed build issue caused by missing header in thread/thread_search.hpp.