- Blocked arrangement. The aggregate tile of items is partitioned evenly across threads in "blocked" fashion, with thread-i owning the i-th segment of consecutive elements. Blocked arrangements are often desirable for algorithmic benefits (where long sequences of items can be processed sequentially within each thread).
  (Figure blocked.png: blocked arrangement across four threads, emphasis on items owned by thread0.)
- Striped arrangement. The aggregate tile of items is partitioned across threads in "striped" fashion, i.e., the ITEMS_PER_THREAD items owned by each thread have logical stride BLOCK_THREADS between them. Striped arrangements are often desirable for data movement through global memory (where [read/write coalescing](http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#coalesced-access-global-memory) is an important performance consideration).
  (Figure striped.png: striped arrangement across four threads, emphasis on items owned by thread0.)
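As a concrete illustration of the two arrangements, the following sketch (not taken from the CUB sources; the helper function and its name are hypothetical) computes the global offsets of the items owned by the calling thread under each scheme:

// Hypothetical helper: global offsets of the calling thread's items for a tile
// starting at tile_base. BLOCK_THREADS and ITEMS_PER_THREAD follow the text above.
template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__device__ void TileOffsets(
    int tile_base,
    int (&blocked)[ITEMS_PER_THREAD],
    int (&striped)[ITEMS_PER_THREAD])
{
    for (int i = 0; i < ITEMS_PER_THREAD; ++i)
    {
        // Blocked: thread t owns the t-th segment of consecutive elements
        blocked[i] = tile_base + (threadIdx.x * ITEMS_PER_THREAD) + i;

        // Striped: thread t owns items t, t + BLOCK_THREADS, t + 2*BLOCK_THREADS, ...
        striped[i] = tile_base + (i * BLOCK_THREADS) + threadIdx.x;
    }
}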
Class and namespace index:
- CUB namespace
- Array traits
- Basic type traits
- BlockDiscontinuity provides operations for flagging discontinuities within a list of data items partitioned across a CUDA threadblock
- BlockExchange provides operations for reorganizing the partitioning of ordered data across a CUDA threadblock
- BlockLoad provides data movement operations for reading block-arranged data from global memory
- BlockRadixSort provides variants of parallel radix sorting across a CUDA threadblock
- BlockReduce provides variants of parallel reduction across a CUDA threadblock
- BlockScan provides variants of parallel prefix scan (and prefix sum) across a CUDA threadblock
- BlockStore provides data movement operations for writing blocked-arranged data to global memory
- Simple enable-if (similar to Boost)
- Default equality functor
- Type equality test
- Type selection (IF ? ThenType : ElseType)
- Volatile modifier test
- Statically determine log2(N), rounded up
- Default max functor
- A simple "NULL" marker type
- Numeric type traits
- Removes const and volatile qualifiers from type Tp
- Default sum functor
- Type traits
- WarpScan provides variants of parallel prefix scan across a CUDA warp
#include <cuda_runtime.h>
#include "../device_props.cuh"
#include "../type_utils.cuh"
#include "../operators.cuh"
#include "../ns_wrapper.cuh"
Classes:
- class cub::BlockDiscontinuity< T, BLOCK_THREADS >: BlockDiscontinuity provides operations for flagging discontinuities within a list of data items partitioned across a CUDA threadblock.
Namespaces:
- namespace cub: CUB namespace.
Detailed description: cub::BlockDiscontinuity provides operations for flagging discontinuities within a list of data items partitioned across a CUDA threadblock.
#include "../ns_wrapper.cuh"
#include "../device_props.cuh"
#include "../ptx_intrinsics.cuh"
#include "../type_utils.cuh"
Classes:
- class cub::BlockExchange< T, BLOCK_THREADS, ITEMS_PER_THREAD >: BlockExchange provides operations for reorganizing the partitioning of ordered data across a CUDA threadblock.
Namespaces:
- namespace cub: CUB namespace.
Detailed description: cub::BlockExchange provides operations for reorganizing the partitioning of ordered data across a CUDA threadblock.
#include <iterator>
#include "../ns_wrapper.cuh"
#include "../macro_utils.cuh"
#include "../thread/thread_load.cuh"
#include "../type_utils.cuh"
#include "../vector_type.cuh"
#include "block_exchange.cuh"
Classes:
- class cub::BlockLoad< InputIterator, BLOCK_THREADS, ITEMS_PER_THREAD, POLICY, MODIFIER >: BlockLoad provides data movement operations for reading block-arranged data from global memory.
Namespaces:
- namespace cub: CUB namespace.
Enumerations:
- enum cub::BlockLoadPolicy { BLOCK_LOAD_DIRECT, BLOCK_LOAD_VECTORIZE, BLOCK_LOAD_TRANSPOSE }: Tuning policy for cub::BlockLoad.
+Functions | |
Direct threadblock loads (blocked arrangement) | |
template<PtxLoadModifier MODIFIER, typename T , int ITEMS_PER_THREAD, typename InputIterator > | |
__device__ __forceinline__ void | cub::BlockLoadDirect (InputIterator block_itr, T(&items)[ITEMS_PER_THREAD]) |
Load a tile of items across a threadblock directly using the specified cache modifier. More... | |
template<typename T , int ITEMS_PER_THREAD, typename InputIterator > | |
__device__ __forceinline__ void | cub::BlockLoadDirect (InputIterator block_itr, T(&items)[ITEMS_PER_THREAD]) |
Load a tile of items across a threadblock directly. More... | |
template<PtxLoadModifier MODIFIER, typename T , int ITEMS_PER_THREAD, typename InputIterator , typename SizeT > | |
__device__ __forceinline__ void | cub::BlockLoadDirect (InputIterator block_itr, const SizeT &guarded_items, T(&items)[ITEMS_PER_THREAD]) |
Load a tile of items across a threadblock directly using the specified cache modifier, guarded by range. More... | |
template<typename T , int ITEMS_PER_THREAD, typename InputIterator , typename SizeT > | |
__device__ __forceinline__ void | cub::BlockLoadDirect (InputIterator block_itr, const SizeT &guarded_items, T(&items)[ITEMS_PER_THREAD]) |
Load a tile of items across a threadblock directly, guarded by range. More... | |
template<PtxLoadModifier MODIFIER, typename T , int ITEMS_PER_THREAD, typename InputIterator , typename SizeT > | |
__device__ __forceinline__ void | cub::BlockLoadDirect (InputIterator block_itr, const SizeT &guarded_items, T oob_default, T(&items)[ITEMS_PER_THREAD]) |
Load a tile of items across a threadblock directly using the specified cache modifier, guarded by range, with assignment for out-of-bound elements. More... | |
template<typename T , int ITEMS_PER_THREAD, typename InputIterator , typename SizeT > | |
__device__ __forceinline__ void | cub::BlockLoadDirect (InputIterator block_itr, const SizeT &guarded_items, T oob_default, T(&items)[ITEMS_PER_THREAD]) |
Load a tile of items across a threadblock directly, guarded by range, with assignment for out-of-bound elements. More... | |
Direct threadblock loads (striped arrangement) | |
template<PtxLoadModifier MODIFIER, typename T , int ITEMS_PER_THREAD, typename InputIterator > | |
__device__ __forceinline__ void | cub::BlockLoadDirectStriped (InputIterator block_itr, T(&items)[ITEMS_PER_THREAD], int stride=blockDim.x) |
Load striped tile directly using the specified cache modifier. More... | |
template<typename T , int ITEMS_PER_THREAD, typename InputIterator > | |
__device__ __forceinline__ void | cub::BlockLoadDirectStriped (InputIterator block_itr, T(&items)[ITEMS_PER_THREAD], int stride=blockDim.x) |
Load striped tile directly. More... | |
template<PtxLoadModifier MODIFIER, typename T , int ITEMS_PER_THREAD, typename InputIterator , typename SizeT > | |
__device__ __forceinline__ void | cub::BlockLoadDirectStriped (InputIterator block_itr, const SizeT &guarded_items, T(&items)[ITEMS_PER_THREAD], int stride=blockDim.x) |
Load striped tile directly using the specified cache modifier, guarded by range. More... | |
template<typename T , int ITEMS_PER_THREAD, typename InputIterator , typename SizeT > | |
__device__ __forceinline__ void | cub::BlockLoadDirectStriped (InputIterator block_itr, const SizeT &guarded_items, T(&items)[ITEMS_PER_THREAD], int stride=blockDim.x) |
Load striped tile directly, guarded by range. More... | |
template<PtxLoadModifier MODIFIER, typename T , int ITEMS_PER_THREAD, typename InputIterator , typename SizeT > | |
__device__ __forceinline__ void | cub::BlockLoadDirectStriped (InputIterator block_itr, const SizeT &guarded_items, T oob_default, T(&items)[ITEMS_PER_THREAD], int stride=blockDim.x) |
Load striped tile directly using the specified cache modifier, guarded by range, with assignment for out-of-bound elements. More... | |
template<typename T , int ITEMS_PER_THREAD, typename InputIterator , typename SizeT > | |
__device__ __forceinline__ void | cub::BlockLoadDirectStriped (InputIterator block_itr, const SizeT &guarded_items, T oob_default, T(&items)[ITEMS_PER_THREAD], int stride=blockDim.x) |
Load striped tile directly, guarded by range, with assignment for out-of-bound elements. More... | |
Threadblock vectorized loads (blocked arrangement) | |
template<PtxLoadModifier MODIFIER, typename T , int ITEMS_PER_THREAD> | |
__device__ __forceinline__ void | cub::BlockLoadVectorized (T *block_ptr, T(&items)[ITEMS_PER_THREAD]) |
Load a tile of items across a threadblock directly using the specified cache modifier. More... | |
template<typename T , int ITEMS_PER_THREAD> | |
__device__ __forceinline__ void | cub::BlockLoadVectorized (T *block_ptr, T(&items)[ITEMS_PER_THREAD]) |
Load a tile of items across a threadblock directly. More... | |
Operations for reading global tiles of data into the threadblock (in blocked arrangement across threads).
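To make the calling convention of these direct-load functions concrete, here is a minimal sketch (the kernel, d_in, num_items, and the include path are illustrative, not from the CUB sources) using the unguarded and guarded blocked-load overloads listed above:

#include "block_load.cuh"   // illustrative include path

template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void DirectLoadKernel(int *d_in, int num_items)
{
    int items[ITEMS_PER_THREAD];
    int block_offset = blockIdx.x * (BLOCK_THREADS * ITEMS_PER_THREAD);

    if (block_offset + (BLOCK_THREADS * ITEMS_PER_THREAD) <= num_items)
    {
        // Full tile: unguarded blocked load
        cub::BlockLoadDirect(d_in + block_offset, items);
    }
    else
    {
        // Partial tile: range-guarded load, zero-filling out-of-bounds items
        cub::BlockLoadDirect(d_in + block_offset, num_items - block_offset, 0, items);
    }
}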
#include "../ns_wrapper.cuh"
#include "../device_props.cuh"
#include "../type_utils.cuh"
#include "block_exchange.cuh"
#include "block_radix_rank.cuh"
Classes:
- class cub::BlockRadixSort< KeyType, BLOCK_THREADS, ITEMS_PER_THREAD, ValueType, RADIX_BITS, SMEM_CONFIG >: BlockRadixSort provides variants of parallel radix sorting across a CUDA threadblock.
Namespaces:
- namespace cub: CUB namespace.
Detailed description: cub::BlockRadixSort provides variants of parallel radix sorting across a CUDA threadblock.
#include "../block/block_raking_grid.cuh"
#include "../device_props.cuh"
#include "../operators.cuh"
#include "../thread/thread_reduce.cuh"
#include "../thread/thread_load.cuh"
#include "../thread/thread_store.cuh"
#include "../ns_wrapper.cuh"
Classes:
- class cub::BlockReduce< T, BLOCK_THREADS >: BlockReduce provides variants of parallel reduction across a CUDA threadblock.
Namespaces:
- namespace cub: CUB namespace.
Detailed description: cub::BlockReduce provides variants of parallel reduction across a CUDA threadblock.
#include "../device_props.cuh"
#include "../type_utils.cuh"
#include "../operators.cuh"
#include "../warp/warp_scan.cuh"
#include "../thread/thread_reduce.cuh"
#include "../thread/thread_scan.cuh"
#include "../ns_wrapper.cuh"
Classes:
- class cub::BlockScan< T, BLOCK_THREADS, POLICY >: BlockScan provides variants of parallel prefix scan (and prefix sum) across a CUDA threadblock.
Namespaces:
- namespace cub: CUB namespace.
Enumerations:
- enum cub::BlockScanPolicy { BLOCK_SCAN_RAKING, BLOCK_SCAN_WARPSCANS }: Tuning policy for cub::BlockScan.
Detailed description: cub::BlockScan provides variants of parallel prefix scan across a CUDA threadblock.
#include <iterator>
#include "../ns_wrapper.cuh"
#include "../macro_utils.cuh"
#include "../thread/thread_store.cuh"
#include "../type_utils.cuh"
#include "../vector_type.cuh"
#include "block_exchange.cuh"
Classes:
- class cub::BlockStore< OutputIterator, BLOCK_THREADS, ITEMS_PER_THREAD, POLICY, MODIFIER >: BlockStore provides data movement operations for writing blocked-arranged data to global memory.
Namespaces:
- namespace cub: CUB namespace.
Enumerations:
- enum cub::BlockStorePolicy { BLOCK_STORE_DIRECT, BLOCK_STORE_VECTORIZE, BLOCK_STORE_TRANSPOSE }: Tuning policy for cub::BlockStore.
+Functions | |
Direct threadblock stores (blocked arrangement) | |
template<PtxStoreModifier MODIFIER, typename T , int ITEMS_PER_THREAD, typename OutputIterator > | |
__device__ __forceinline__ void | cub::BlockStoreDirect (OutputIterator block_itr, T(&items)[ITEMS_PER_THREAD]) |
Store a tile of items across a threadblock directly using the specified cache modifier. More... | |
template<typename T , int ITEMS_PER_THREAD, typename OutputIterator > | |
__device__ __forceinline__ void | cub::BlockStoreDirect (OutputIterator block_itr, T(&items)[ITEMS_PER_THREAD]) |
Store a tile of items across a threadblock directly. More... | |
template<PtxStoreModifier MODIFIER, typename T , int ITEMS_PER_THREAD, typename OutputIterator , typename SizeT > | |
__device__ __forceinline__ void | cub::BlockStoreDirect (OutputIterator block_itr, const SizeT &guarded_items, T(&items)[ITEMS_PER_THREAD]) |
Store a tile of items across a threadblock directly using the specified cache modifier, guarded by range. More... | |
template<typename T , int ITEMS_PER_THREAD, typename OutputIterator , typename SizeT > | |
__device__ __forceinline__ void | cub::BlockStoreDirect (OutputIterator block_itr, const SizeT &guarded_items, T(&items)[ITEMS_PER_THREAD]) |
Store a tile of items across a threadblock directly, guarded by range. More... | |
Direct threadblock stores (striped arrangement) | |
template<PtxStoreModifier MODIFIER, typename T , int ITEMS_PER_THREAD, typename OutputIterator > | |
__device__ __forceinline__ void | cub::BlockStoreDirectStriped (OutputIterator block_itr, T(&items)[ITEMS_PER_THREAD], int stride=blockDim.x) |
Store striped tile directly using the specified cache modifier. More... | |
template<typename T , int ITEMS_PER_THREAD, typename OutputIterator > | |
__device__ __forceinline__ void | cub::BlockStoreDirectStriped (OutputIterator block_itr, T(&items)[ITEMS_PER_THREAD], int stride=blockDim.x) |
Store striped tile directly. More... | |
template<PtxStoreModifier MODIFIER, typename T , int ITEMS_PER_THREAD, typename OutputIterator , typename SizeT > | |
__device__ __forceinline__ void | cub::BlockStoreDirectStriped (OutputIterator block_itr, const SizeT &guarded_items, T(&items)[ITEMS_PER_THREAD], int stride=blockDim.x) |
Store striped tile directly using the specified cache modifier, guarded by range. More... | |
template<typename T , int ITEMS_PER_THREAD, typename OutputIterator , typename SizeT > | |
__device__ __forceinline__ void | cub::BlockStoreDirectStriped (OutputIterator block_itr, const SizeT &guarded_items, T(&items)[ITEMS_PER_THREAD], int stride=blockDim.x) |
Store striped tile directly, guarded by range. More... | |
Threadblock vectorized stores (blocked arrangement) | |
template<PtxStoreModifier MODIFIER, typename T , int ITEMS_PER_THREAD> | |
__device__ __forceinline__ void | cub::BlockStoreVectorized (T *block_ptr, T(&items)[ITEMS_PER_THREAD]) |
Store a tile of items across a threadblock directly using the specified cache modifier. More... | |
template<typename T , int ITEMS_PER_THREAD> | |
__device__ __forceinline__ void | cub::BlockStoreVectorized (T *block_ptr, T(&items)[ITEMS_PER_THREAD]) |
Store a tile of items across a threadblock directly. More... | |
Operations for writing global tiles of data from the threadblock (in blocked arrangement across threads).
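A complementary sketch for the striped-store functions listed above (the kernel, d_out, and the include path are illustrative, not from the CUB sources); the default stride of blockDim.x matches the declarations:

#include "block_store.cuh"   // illustrative include path

template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void DirectStripedStoreKernel(int *d_out)
{
    int items[ITEMS_PER_THREAD];

    // Produce per-thread values in striped order: item i maps to
    // global offset (i * BLOCK_THREADS) + threadIdx.x within the tile
    for (int i = 0; i < ITEMS_PER_THREAD; ++i)
        items[i] = (i * BLOCK_THREADS) + threadIdx.x;

    int block_offset = blockIdx.x * (BLOCK_THREADS * ITEMS_PER_THREAD);

    // Unguarded striped store (stride defaults to blockDim.x)
    cub::BlockStoreDirectStriped(d_out + block_offset, items);
}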
This is the complete list of members for cub::BlockDiscontinuity< T, BLOCK_THREADS >, including all inherited members.
+Flag(SmemStorage &smem_storage, T(&input)[ITEMS_PER_THREAD], FlagOp flag_op, FlagT(&flags)[ITEMS_PER_THREAD], T &last_tile_item) | cub::BlockDiscontinuity< T, BLOCK_THREADS > | inlinestatic |
Flag(SmemStorage &smem_storage, T(&input)[ITEMS_PER_THREAD], FlagOp flag_op, FlagT(&flags)[ITEMS_PER_THREAD]) | cub::BlockDiscontinuity< T, BLOCK_THREADS > | inlinestatic |
Flag(SmemStorage &smem_storage, T(&input)[ITEMS_PER_THREAD], T tile_predecessor, FlagOp flag_op, FlagT(&flags)[ITEMS_PER_THREAD], T &last_tile_item) | cub::BlockDiscontinuity< T, BLOCK_THREADS > | inlinestatic |
Flag(SmemStorage &smem_storage, T(&input)[ITEMS_PER_THREAD], T tile_predecessor, FlagOp flag_op, FlagT(&flags)[ITEMS_PER_THREAD]) | cub::BlockDiscontinuity< T, BLOCK_THREADS > | inlinestatic |
SmemStorage typedef | cub::BlockDiscontinuity< T, BLOCK_THREADS > |
BlockDiscontinuity provides operations for flagging discontinuities within a list of data items partitioned across a CUDA threadblock.
+T | The data type to be exchanged. |
BLOCK_THREADS | The threadblock size in threads. |
Usage considerations:
- Operands tile_predecessor and last_tile_item are only considered valid in thread0.
- A subsequent __syncthreads() barrier is required if the supplied BlockDiscontinuity::SmemStorage is to be reused or repurposed by the threadblock.
Public Types:
- typedef _SmemStorage SmemStorage: The operations exposed by BlockDiscontinuity require shared memory of this type. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated shared memory or union'd with other types to facilitate shared memory reuse.
+Static Public Methods | |
template<int ITEMS_PER_THREAD, typename FlagT , typename FlagOp > | |
static __device__ +__forceinline__ void | Flag (SmemStorage &smem_storage, T(&input)[ITEMS_PER_THREAD], FlagOp flag_op, FlagT(&flags)[ITEMS_PER_THREAD], T &last_tile_item) |
Sets discontinuity flags for a tile of threadblock items, for which the first item has no reference (and is always flagged). The last tile item of the last thread is also returned to thread0. More... | |
template<int ITEMS_PER_THREAD, typename FlagT , typename FlagOp > | |
static __device__ +__forceinline__ void | Flag (SmemStorage &smem_storage, T(&input)[ITEMS_PER_THREAD], FlagOp flag_op, FlagT(&flags)[ITEMS_PER_THREAD]) |
Sets discontinuity flags for a tile of threadblock items, for which the first item has no reference (and is always flagged). More... | |
template<int ITEMS_PER_THREAD, typename FlagT , typename FlagOp > | |
static __device__ +__forceinline__ void | Flag (SmemStorage &smem_storage, T(&input)[ITEMS_PER_THREAD], T tile_predecessor, FlagOp flag_op, FlagT(&flags)[ITEMS_PER_THREAD], T &last_tile_item) |
Sets discontinuity flags for a tile of threadblock items. The last tile item of the last thread is also returned to thread0. More... | |
template<int ITEMS_PER_THREAD, typename FlagT , typename FlagOp > | |
static __device__ +__forceinline__ void | Flag (SmemStorage &smem_storage, T(&input)[ITEMS_PER_THREAD], T tile_predecessor, FlagOp flag_op, FlagT(&flags)[ITEMS_PER_THREAD]) |
Sets discontinuity flags for a tile of threadblock items. More... | |
template<int ITEMS_PER_THREAD, typename FlagT, typename FlagOp>
static __device__ __forceinline__ void cub::BlockDiscontinuity< T, BLOCK_THREADS >::Flag (SmemStorage &smem_storage, T (&input)[ITEMS_PER_THREAD], FlagOp flag_op, FlagT (&flags)[ITEMS_PER_THREAD], T &last_tile_item) [inline, static]
Sets discontinuity flags for a tile of threadblock items, for which the first item has no reference (and is always flagged). The last tile item of the last thread is also returned to thread0.
Assuming a blocked arrangement of elements across threads, flags_i is set non-zero for item input_i when flag_op(previous-item, input_i) returns true (where previous-item is either input_(i-1), or input_(ITEMS_PER_THREAD-1) in the previous thread). Furthermore, the flag for input_0 in thread0 is always set non-zero.
The last_tile_item is undefined in threads other than thread0.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
FlagT | [inferred] The flag type (must be an integer type) |
FlagOp | [inferred] Binary boolean functor type, having input parameters (const T &a, const T &b) and returning true if a discontinuity exists between a and b , otherwise false . |
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in] | input | Input items |
[in] | flag_op | Binary boolean flag predicate |
[out] | flags | Discontinuity flags |
[out] | last_tile_item | [thread0 only] The last tile item (input_(ITEMS_PER_THREAD-1) from thread_(BLOCK_THREADS-1)) |
template<int ITEMS_PER_THREAD, typename FlagT, typename FlagOp>
static __device__ __forceinline__ void cub::BlockDiscontinuity< T, BLOCK_THREADS >::Flag (SmemStorage &smem_storage, T (&input)[ITEMS_PER_THREAD], FlagOp flag_op, FlagT (&flags)[ITEMS_PER_THREAD]) [inline, static]
Sets discontinuity flags for a tile of threadblock items, for which the first item has no reference (and is always flagged).
Assuming a blocked arrangement of elements across threads, flags_i is set non-zero for item input_i when flag_op(previous-item, input_i) returns true (where previous-item is either input_(i-1), or input_(ITEMS_PER_THREAD-1) in the previous thread). Furthermore, the flag for input_0 in thread0 is always set non-zero.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
FlagT | [inferred] The flag type (must be an integer type) |
FlagOp | [inferred] Binary boolean functor type, having input parameters (const T &a, const T &b) and returning true if a discontinuity exists between a and b , otherwise false . |
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in] | input | Input items |
[in] | flag_op | Binary boolean flag predicate |
[out] | flags | Discontinuity flags |
template<int ITEMS_PER_THREAD, typename FlagT, typename FlagOp>
static __device__ __forceinline__ void cub::BlockDiscontinuity< T, BLOCK_THREADS >::Flag (SmemStorage &smem_storage, T (&input)[ITEMS_PER_THREAD], T tile_predecessor, FlagOp flag_op, FlagT (&flags)[ITEMS_PER_THREAD], T &last_tile_item) [inline, static]
Sets discontinuity flags for a tile of threadblock items. The last tile item of the last thread is also returned to thread0.
Assuming a blocked arrangement of elements across threads, flags_i is set non-zero for item input_i when flag_op(previous-item, input_i) returns true (where previous-item is either input_(i-1), or input_(ITEMS_PER_THREAD-1) in the previous thread). For thread0, item input_0 is compared against tile_predecessor.
The tile_predecessor and last_tile_item are undefined in threads other than thread0.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
FlagT | [inferred] The flag type (must be an integer type) |
FlagOp | [inferred] Binary boolean functor type, having input parameters (const T &a, const T &b) and returning true if a discontinuity exists between a and b , otherwise false . |
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in] | input | Input items |
[in] | tile_predecessor | [thread0 only] Item with which to compare the first tile item (input0 from thread0). |
[in] | flag_op | Binary boolean flag predicate |
[out] | flags | Discontinuity flags |
[out] | last_tile_item | [thread0 only] The last tile item (input_(ITEMS_PER_THREAD-1) from thread_(BLOCK_THREADS-1)) |
template<int ITEMS_PER_THREAD, typename FlagT, typename FlagOp>
static __device__ __forceinline__ void cub::BlockDiscontinuity< T, BLOCK_THREADS >::Flag (SmemStorage &smem_storage, T (&input)[ITEMS_PER_THREAD], T tile_predecessor, FlagOp flag_op, FlagT (&flags)[ITEMS_PER_THREAD]) [inline, static]
Sets discontinuity flags for a tile of threadblock items.
Assuming a blocked arrangement of elements across threads, flags_i is set non-zero for item input_i when flag_op(previous-item, input_i) returns true (where previous-item is either input_(i-1), or input_(ITEMS_PER_THREAD-1) in the previous thread). For thread0, item input_0 is compared against tile_predecessor.
The tile_predecessor is undefined in threads other than thread0.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
FlagT | [inferred] The flag type (must be an integer type) |
FlagOp | [inferred] Binary boolean functor type, having input parameters (const T &a, const T &b) and returning true if a discontinuity exists between a and b , otherwise false . |
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in] | input | Input items |
[in] | tile_predecessor | [thread0 only] Item with which to compare the first tile item (input0 from thread0). |
[in] | flag_op | Binary boolean flag predicate |
[out] | flags | Discontinuity flags |
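A minimal usage sketch of the Flag overload that takes only flag_op and flags. The kernel, the NotEqual functor, d_in, d_flags, and the include path are illustrative and not from the CUB sources; the functor simply satisfies the FlagOp contract described above:

#include "block_discontinuity.cuh"   // illustrative include path

// Binary predicate: flag a discontinuity whenever adjacent items differ
struct NotEqual
{
    __device__ __forceinline__ bool operator()(const int &a, const int &b)
    {
        return (a != b);
    }
};

template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void FlagHeadsKernel(int *d_in, int *d_flags)
{
    typedef cub::BlockDiscontinuity<int, BLOCK_THREADS> BlockDiscontinuity;

    // Opaque shared memory required by the flagging operations
    __shared__ typename BlockDiscontinuity::SmemStorage smem_storage;

    // Blocked arrangement of input items
    int items[ITEMS_PER_THREAD];
    int thread_offset = (blockIdx.x * BLOCK_THREADS + threadIdx.x) * ITEMS_PER_THREAD;
    for (int i = 0; i < ITEMS_PER_THREAD; ++i)
        items[i] = d_in[thread_offset + i];

    // Head flags: the flag for the first item of thread0 is always set
    int flags[ITEMS_PER_THREAD];
    BlockDiscontinuity::Flag(smem_storage, items, NotEqual(), flags);

    // Write flags back out in blocked arrangement
    for (int i = 0; i < ITEMS_PER_THREAD; ++i)
        d_flags[thread_offset + i] = flags[i];
}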
This is the complete list of members for cub::BlockExchange< T, BLOCK_THREADS, ITEMS_PER_THREAD >, including all inherited members.
+BlockedToStriped(SmemStorage &smem_storage, T items[ITEMS_PER_THREAD]) | cub::BlockExchange< T, BLOCK_THREADS, ITEMS_PER_THREAD > | inlinestatic |
ScatterToBlocked(SmemStorage &smem_storage, T items[ITEMS_PER_THREAD], unsigned int ranks[ITEMS_PER_THREAD]) | cub::BlockExchange< T, BLOCK_THREADS, ITEMS_PER_THREAD > | inlinestatic |
ScatterToStriped(SmemStorage &smem_storage, T items[ITEMS_PER_THREAD], unsigned int ranks[ITEMS_PER_THREAD]) | cub::BlockExchange< T, BLOCK_THREADS, ITEMS_PER_THREAD > | inlinestatic |
SmemStorage typedef | cub::BlockExchange< T, BLOCK_THREADS, ITEMS_PER_THREAD > | |
StripedToBlocked(SmemStorage &smem_storage, T items[ITEMS_PER_THREAD]) | cub::BlockExchange< T, BLOCK_THREADS, ITEMS_PER_THREAD > | inlinestatic |
BlockExchange provides operations for reorganizing the partitioning of ordered data across a CUDA threadblock.
+T | The data type to be exchanged. |
BLOCK_THREADS | The threadblock size in threads. |
ITEMS_PER_THREAD | The number of items partitioned onto each thread. |
Usage considerations:
- A subsequent __syncthreads() barrier is required if the supplied BlockExchange::SmemStorage is to be reused or repurposed by the threadblock.
Public Types:
- typedef SmemStorage SmemStorage: The operations exposed by BlockExchange require shared memory of this type. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated shared memory or union'd with other types to facilitate shared memory reuse.
+Static Public Methods | |
Transpose exchanges | |
static __device__ +__forceinline__ void | BlockedToStriped (SmemStorage &smem_storage, T items[ITEMS_PER_THREAD]) |
Transposes data items from blocked arrangement to striped arrangement. More... | |
static __device__ +__forceinline__ void | StripedToBlocked (SmemStorage &smem_storage, T items[ITEMS_PER_THREAD]) |
Transposes data items from striped arrangement to blocked arrangement. More... | |
Scatter exchanges | |
static __device__ +__forceinline__ void | ScatterToBlocked (SmemStorage &smem_storage, T items[ITEMS_PER_THREAD], unsigned int ranks[ITEMS_PER_THREAD]) |
Exchanges data items annotated by rank into blocked arrangement. More... | |
static __device__ +__forceinline__ void | ScatterToStriped (SmemStorage &smem_storage, T items[ITEMS_PER_THREAD], unsigned int ranks[ITEMS_PER_THREAD]) |
Exchanges data items annotated by rank into striped arrangement. More... | |
static __device__ __forceinline__ void cub::BlockExchange< T, BLOCK_THREADS, ITEMS_PER_THREAD >::BlockedToStriped (SmemStorage &smem_storage, T items[ITEMS_PER_THREAD]) [inline, static]
Transposes data items from blocked arrangement to striped arrangement.
+A subsequent __syncthreads()
threadblock barrier should be invoked after calling this method if the supplied smem_storage
is to be reused or repurposed by the threadblock.
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in,out] | items | Items to exchange, converting between blocked and striped arrangements. |
static __device__ __forceinline__ void cub::BlockExchange< T, BLOCK_THREADS, ITEMS_PER_THREAD >::StripedToBlocked (SmemStorage &smem_storage, T items[ITEMS_PER_THREAD]) [inline, static]
Transposes data items from striped arrangement to blocked arrangement.
+A subsequent __syncthreads()
threadblock barrier should be invoked after calling this method if the supplied smem_storage
is to be reused or repurposed by the threadblock.
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in,out] | items | Items to exchange, converting between striped and blocked arrangements. |
static __device__ __forceinline__ void cub::BlockExchange< T, BLOCK_THREADS, ITEMS_PER_THREAD >::ScatterToBlocked (SmemStorage &smem_storage, T items[ITEMS_PER_THREAD], unsigned int ranks[ITEMS_PER_THREAD]) [inline, static]
Exchanges data items annotated by rank into blocked arrangement.
+A subsequent __syncthreads()
threadblock barrier should be invoked after calling this method if the supplied smem_storage
is to be reused or repurposed by the threadblock.
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in,out] | items | Items to exchange |
[in] | ranks | Corresponding scatter ranks |
static __device__ __forceinline__ void cub::BlockExchange< T, BLOCK_THREADS, ITEMS_PER_THREAD >::ScatterToStriped (SmemStorage &smem_storage, T items[ITEMS_PER_THREAD], unsigned int ranks[ITEMS_PER_THREAD]) [inline, static]
Exchanges data items annotated by rank into striped arrangement.
+A subsequent __syncthreads()
threadblock barrier should be invoked after calling this method if the supplied smem_storage
is to be reused or repurposed by the threadblock.
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in,out] | items | Items to exchange |
[in] | ranks | Corresponding scatter ranks |
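A minimal sketch (the kernel, d_in, and the include path are illustrative, not from the CUB sources) that converts a striped arrangement, as typically produced by coalesced reads, into a blocked arrangement using StripedToBlocked:

#include "block_exchange.cuh"   // illustrative include path

template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void StripedToBlockedKernel(int *d_in)
{
    typedef cub::BlockExchange<int, BLOCK_THREADS, ITEMS_PER_THREAD> BlockExchange;

    // Opaque shared memory required by the exchange operations
    __shared__ typename BlockExchange::SmemStorage smem_storage;

    int items[ITEMS_PER_THREAD];
    int block_offset = blockIdx.x * (BLOCK_THREADS * ITEMS_PER_THREAD);

    // Coalesced (striped) read: item i comes from offset i*BLOCK_THREADS + threadIdx.x
    for (int i = 0; i < ITEMS_PER_THREAD; ++i)
        items[i] = d_in[block_offset + (i * BLOCK_THREADS) + threadIdx.x];

    // Reorganize so each thread holds ITEMS_PER_THREAD consecutive items
    BlockExchange::StripedToBlocked(smem_storage, items);

    // items[] is now in blocked arrangement; per-thread sequential
    // processing can proceed from here
}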
This is the complete list of members for cub::BlockLoad< InputIterator, BLOCK_THREADS, ITEMS_PER_THREAD, POLICY, MODIFIER >, including all inherited members.
+Load(SmemStorage &smem_storage, InputIterator block_itr, T(&items)[ITEMS_PER_THREAD]) | cub::BlockLoad< InputIterator, BLOCK_THREADS, ITEMS_PER_THREAD, POLICY, MODIFIER > | inlinestatic |
Load(SmemStorage &smem_storage, InputIterator block_itr, const SizeT &guarded_items, T(&items)[ITEMS_PER_THREAD]) | cub::BlockLoad< InputIterator, BLOCK_THREADS, ITEMS_PER_THREAD, POLICY, MODIFIER > | inlinestatic |
SmemStorage typedef | cub::BlockLoad< InputIterator, BLOCK_THREADS, ITEMS_PER_THREAD, POLICY, MODIFIER > |
BlockLoad provides data movement operations for reading block-arranged data from global memory.
+BlockLoad provides a single tile-loading abstraction whose performance behavior can be statically tuned. In particular, BlockLoad implements alternative cub::BlockLoadPolicy strategies catering to different granularity sizes (i.e., number of items per thread).
+InputIterator | The input iterator type (may be a simple pointer type). |
BLOCK_THREADS | The threadblock size in threads. |
ITEMS_PER_THREAD | The number of consecutive items partitioned onto each thread. |
POLICY | [optional] cub::BlockLoadPolicy tuning policy. Default = cub::BLOCK_LOAD_DIRECT. |
MODIFIER | [optional] cub::PtxLoadModifier cache modifier. Default = cub::PTX_LOAD_NONE. |
Usage considerations:
- A subsequent __syncthreads() barrier is required if the supplied BlockLoad::SmemStorage is to be reused or repurposed by the threadblock.
Example: have a threadblock load a blocked arrangement of ITEMS_PER_THREAD consecutive integers per thread using vectorized loads and global-only caching (see the sketch at the end of this page).
Public Types:
- typedef _SmemStorage SmemStorage: The operations exposed by BlockLoad require shared memory of this type. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated shared memory or union'd with other types to facilitate shared memory reuse.
+Static Public Methods | |
static __device__ +__forceinline__ void | Load (SmemStorage &smem_storage, InputIterator block_itr, T(&items)[ITEMS_PER_THREAD]) |
Load a tile of items across a threadblock. More... | |
template<typename SizeT > | |
static __device__ +__forceinline__ void | Load (SmemStorage &smem_storage, InputIterator block_itr, const SizeT &guarded_items, T(&items)[ITEMS_PER_THREAD]) |
Load a tile of items across a threadblock, guarded by range. More... | |
static __device__ __forceinline__ void cub::BlockLoad< InputIterator, BLOCK_THREADS, ITEMS_PER_THREAD, POLICY, MODIFIER >::Load (SmemStorage &smem_storage, InputIterator block_itr, T (&items)[ITEMS_PER_THREAD]) [inline, static]
Load a tile of items across a threadblock.
+[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in] | block_itr | The threadblock's base input iterator for loading from |
[out] | items | Data to load |
template<typename SizeT>
static __device__ __forceinline__ void cub::BlockLoad< InputIterator, BLOCK_THREADS, ITEMS_PER_THREAD, POLICY, MODIFIER >::Load (SmemStorage &smem_storage, InputIterator block_itr, const SizeT &guarded_items, T (&items)[ITEMS_PER_THREAD]) [inline, static]
Load a tile of items across a threadblock, guarded by range.
+SizeT | [inferred] Integer type for offsets |
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in] | block_itr | The threadblock's base input iterator for loading from |
[in] | guarded_items | Number of valid items in the tile |
[out] | items | Data to load |
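The example referenced above, as a hedged sketch (the kernel, d_in, and the include path are illustrative, not from the CUB sources): a threadblock loads a blocked arrangement of ITEMS_PER_THREAD consecutive integers per thread with the BLOCK_LOAD_VECTORIZE policy; the "global-only caching" part of the referenced example would be selected by supplying a non-default cub::PtxLoadModifier as the fifth (MODIFIER) template argument.

#include "block_load.cuh"   // illustrative include path

template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void VectorizedLoadKernel(int *d_in)
{
    // Parameterize BlockLoad for vectorized loads of int
    typedef cub::BlockLoad<int*, BLOCK_THREADS, ITEMS_PER_THREAD,
                           cub::BLOCK_LOAD_VECTORIZE> BlockLoad;

    // Opaque shared memory for the load
    __shared__ typename BlockLoad::SmemStorage smem_storage;

    // Per-thread segment of the tile
    int items[ITEMS_PER_THREAD];

    // Load the full tile that begins at this threadblock's offset
    BlockLoad::Load(smem_storage, d_in + blockIdx.x * (BLOCK_THREADS * ITEMS_PER_THREAD), items);
}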
This is the complete list of members for cub::BlockRadixSort< KeyType, BLOCK_THREADS, ITEMS_PER_THREAD, ValueType, RADIX_BITS, SMEM_CONFIG >, including all inherited members.
+SmemStorage typedef | cub::BlockRadixSort< KeyType, BLOCK_THREADS, ITEMS_PER_THREAD, ValueType, RADIX_BITS, SMEM_CONFIG > | |
SortBlocked(SmemStorage &smem_storage, KeyType(&keys)[ITEMS_PER_THREAD], unsigned int begin_bit=0, const unsigned int &end_bit=sizeof(KeyType)*8) | cub::BlockRadixSort< KeyType, BLOCK_THREADS, ITEMS_PER_THREAD, ValueType, RADIX_BITS, SMEM_CONFIG > | inlinestatic |
SortBlocked(SmemStorage &smem_storage, KeyType(&keys)[ITEMS_PER_THREAD], ValueType(&values)[ITEMS_PER_THREAD], unsigned int begin_bit=0, const unsigned int &end_bit=sizeof(KeyType)*8) | cub::BlockRadixSort< KeyType, BLOCK_THREADS, ITEMS_PER_THREAD, ValueType, RADIX_BITS, SMEM_CONFIG > | inlinestatic |
SortBlockedToStriped(SmemStorage &smem_storage, KeyType(&keys)[ITEMS_PER_THREAD], unsigned int begin_bit=0, const unsigned int &end_bit=sizeof(KeyType)*8) | cub::BlockRadixSort< KeyType, BLOCK_THREADS, ITEMS_PER_THREAD, ValueType, RADIX_BITS, SMEM_CONFIG > | inlinestatic |
SortBlockedToStriped(SmemStorage &smem_storage, KeyType(&keys)[ITEMS_PER_THREAD], ValueType(&values)[ITEMS_PER_THREAD], unsigned int begin_bit=0, const unsigned int &end_bit=sizeof(KeyType)*8) | cub::BlockRadixSort< KeyType, BLOCK_THREADS, ITEMS_PER_THREAD, ValueType, RADIX_BITS, SMEM_CONFIG > | inlinestatic |
SortStriped(SmemStorage &smem_storage, KeyType(&keys)[ITEMS_PER_THREAD], unsigned int begin_bit=0, const unsigned int &end_bit=sizeof(KeyType)*8) | cub::BlockRadixSort< KeyType, BLOCK_THREADS, ITEMS_PER_THREAD, ValueType, RADIX_BITS, SMEM_CONFIG > | inlinestatic |
SortStriped(SmemStorage &smem_storage, KeyType(&keys)[ITEMS_PER_THREAD], ValueType(&values)[ITEMS_PER_THREAD], unsigned int begin_bit=0, const unsigned int &end_bit=sizeof(KeyType)*8) | cub::BlockRadixSort< KeyType, BLOCK_THREADS, ITEMS_PER_THREAD, ValueType, RADIX_BITS, SMEM_CONFIG > | inlinestatic |
BlockRadixSort provides variants of parallel radix sorting across a CUDA threadblock.
Supported key types include unsigned char, int, double, etc. Within each key, the implementation treats fixed-length bit-sequences of RADIX_BITS as radix digit places. Although the direct radix sorting method can only be applied to unsigned integral types, BlockRadixSort is able to sort signed and floating-point types via simple bit-wise transformations that ensure lexicographic key ordering.
KeyType | Key type |
BLOCK_THREADS | The threadblock size in threads |
ITEMS_PER_THREAD | The number of items per thread |
ValueType | [optional] Value type (default: cub::NullType) |
RADIX_BITS | [optional] The number of radix bits per digit place (default: 5 bits) |
SMEM_CONFIG | [optional] Shared memory bank mode (default: cudaSharedMemBankSizeFourByte) |
Usage and performance considerations:
- A subsequent __syncthreads() barrier is required if the supplied BlockRadixSort::SmemStorage is to be reused or repurposed by the threadblock.
- Performance is best when BLOCK_THREADS is a multiple of the architecture's warp size and when KeyType is an unsigned integral type.
Public Types:
- typedef _SmemStorage SmemStorage: The operations exposed by BlockRadixSort require shared memory of this type. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated shared memory or union'd with other types to facilitate shared memory reuse.
+Static Public Methods | |
Keys-only sorting | |
static __device__ +__forceinline__ void | SortBlocked (SmemStorage &smem_storage, KeyType(&keys)[ITEMS_PER_THREAD], unsigned int begin_bit=0, const unsigned int &end_bit=sizeof(KeyType)*8) |
Performs a threadblock-wide radix sort over a blocked arrangement of keys. More... | |
static __device__ +__forceinline__ void | SortBlockedToStriped (SmemStorage &smem_storage, KeyType(&keys)[ITEMS_PER_THREAD], unsigned int begin_bit=0, const unsigned int &end_bit=sizeof(KeyType)*8) |
Performs a radix sort across a blocked arrangement of keys, leaving them in a striped arrangement. More... | |
static __device__ +__forceinline__ void | SortStriped (SmemStorage &smem_storage, KeyType(&keys)[ITEMS_PER_THREAD], unsigned int begin_bit=0, const unsigned int &end_bit=sizeof(KeyType)*8) |
Performs a radix sort across a striped arrangement of keys. More... | |
Key-value pair sorting | |
static __device__ +__forceinline__ void | SortBlocked (SmemStorage &smem_storage, KeyType(&keys)[ITEMS_PER_THREAD], ValueType(&values)[ITEMS_PER_THREAD], unsigned int begin_bit=0, const unsigned int &end_bit=sizeof(KeyType)*8) |
Performs a radix sort across a blocked arrangement of keys and values. More... | |
static __device__ +__forceinline__ void | SortBlockedToStriped (SmemStorage &smem_storage, KeyType(&keys)[ITEMS_PER_THREAD], ValueType(&values)[ITEMS_PER_THREAD], unsigned int begin_bit=0, const unsigned int &end_bit=sizeof(KeyType)*8) |
Performs a radix sort across a blocked arrangement of keys and values, leaving them in a striped arrangement. More... | |
static __device__ +__forceinline__ void | SortStriped (SmemStorage &smem_storage, KeyType(&keys)[ITEMS_PER_THREAD], ValueType(&values)[ITEMS_PER_THREAD], unsigned int begin_bit=0, const unsigned int &end_bit=sizeof(KeyType)*8) |
Performs a radix sort across a striped arrangement of keys and values. More... | |
static __device__ __forceinline__ void cub::BlockRadixSort< KeyType, BLOCK_THREADS, ITEMS_PER_THREAD, ValueType, RADIX_BITS, SMEM_CONFIG >::SortBlocked (SmemStorage &smem_storage, KeyType (&keys)[ITEMS_PER_THREAD], unsigned int begin_bit=0, const unsigned int &end_bit=sizeof(KeyType)*8) [inline, static]
Performs a threadblock-wide radix sort over a blocked arrangement of keys.
+A subsequent __syncthreads()
threadblock barrier should be invoked after calling this method if the supplied smem_storage
is to be reused or repurposed by the threadblock.
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in,out] | keys | Keys to sort |
[in] | begin_bit | [optional] The beginning (least-significant) bit index needed for key comparison |
[in] | end_bit | [optional] The past-the-end (most-significant) bit index needed for key comparison |
static __device__ __forceinline__ void cub::BlockRadixSort< KeyType, BLOCK_THREADS, ITEMS_PER_THREAD, ValueType, RADIX_BITS, SMEM_CONFIG >::SortBlockedToStriped (SmemStorage &smem_storage, KeyType (&keys)[ITEMS_PER_THREAD], unsigned int begin_bit=0, const unsigned int &end_bit=sizeof(KeyType)*8) [inline, static]
Performs a radix sort across a blocked arrangement of keys, leaving them in a striped arrangement.
+A subsequent __syncthreads()
threadblock barrier should be invoked after calling this method if the supplied smem_storage
is to be reused or repurposed by the threadblock.
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in,out] | keys | Keys to sort |
[in] | begin_bit | [optional] The beginning (least-significant) bit index needed for key comparison |
[in] | end_bit | [optional] The past-the-end (most-significant) bit index needed for key comparison |
static __device__ __forceinline__ void cub::BlockRadixSort< KeyType, BLOCK_THREADS, ITEMS_PER_THREAD, ValueType, RADIX_BITS, SMEM_CONFIG >::SortStriped (SmemStorage &smem_storage, KeyType (&keys)[ITEMS_PER_THREAD], unsigned int begin_bit=0, const unsigned int &end_bit=sizeof(KeyType)*8) [inline, static]
Performs a radix sort across a striped arrangement of keys.
+A subsequent __syncthreads()
threadblock barrier should be invoked after calling this method if the supplied smem_storage
is to be reused or repurposed by the threadblock.
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in,out] | keys | Keys to sort |
[in] | begin_bit | [optional] The beginning (least-significant) bit index needed for key comparison |
[in] | end_bit | [optional] The past-the-end (most-significant) bit index needed for key comparison |
static __device__ __forceinline__ void cub::BlockRadixSort< KeyType, BLOCK_THREADS, ITEMS_PER_THREAD, ValueType, RADIX_BITS, SMEM_CONFIG >::SortBlocked (SmemStorage &smem_storage, KeyType (&keys)[ITEMS_PER_THREAD], ValueType (&values)[ITEMS_PER_THREAD], unsigned int begin_bit=0, const unsigned int &end_bit=sizeof(KeyType)*8) [inline, static]
Performs a radix sort across a blocked arrangement of keys and values.
+A subsequent __syncthreads()
threadblock barrier should be invoked after calling this method if the supplied smem_storage
is to be reused or repurposed by the threadblock.
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in,out] | keys | Keys to sort |
[in,out] | values | Values to sort |
[in] | begin_bit | [optional] The beginning (least-significant) bit index needed for key comparison |
[in] | end_bit | [optional] The past-the-end (most-significant) bit index needed for key comparison |
static __device__ __forceinline__ void cub::BlockRadixSort< KeyType, BLOCK_THREADS, ITEMS_PER_THREAD, ValueType, RADIX_BITS, SMEM_CONFIG >::SortBlockedToStriped (SmemStorage &smem_storage, KeyType (&keys)[ITEMS_PER_THREAD], ValueType (&values)[ITEMS_PER_THREAD], unsigned int begin_bit=0, const unsigned int &end_bit=sizeof(KeyType)*8) [inline, static]
Performs a radix sort across a blocked arrangement of keys and values, leaving them in a striped arrangement.
+A subsequent __syncthreads()
threadblock barrier should be invoked after calling this method if the supplied smem_storage
is to be reused or repurposed by the threadblock.
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in,out] | keys | Keys to sort |
[in,out] | values | Values to sort |
[in] | begin_bit | [optional] The beginning (least-significant) bit index needed for key comparison |
[in] | end_bit | [optional] The past-the-end (most-significant) bit index needed for key comparison |
static __device__ __forceinline__ void cub::BlockRadixSort< KeyType, BLOCK_THREADS, ITEMS_PER_THREAD, ValueType, RADIX_BITS, SMEM_CONFIG >::SortStriped (SmemStorage &smem_storage, KeyType (&keys)[ITEMS_PER_THREAD], ValueType (&values)[ITEMS_PER_THREAD], unsigned int begin_bit=0, const unsigned int &end_bit=sizeof(KeyType)*8) [inline, static]
Performs a radix sort across a striped arrangement of keys and values.
+A subsequent __syncthreads()
threadblock barrier should be invoked after calling this method if the supplied smem_storage
is to be reused or repurposed by the threadblock.
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in,out] | keys | Keys to sort |
[in,out] | values | Values to sort |
[in] | begin_bit | [optional] The beginning (least-significant) bit index needed for key comparison |
[in] | end_bit | [optional] The past-the-end (most-significant) bit index needed for key comparison |
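A minimal keys-only usage sketch (the kernel, d_keys, and the include path are illustrative, not from the CUB sources): sort a blocked arrangement of unsigned int keys over all key bits using the default template parameters.

#include "block_radix_sort.cuh"   // illustrative include path

template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void SortKeysKernel(unsigned int *d_keys)
{
    typedef cub::BlockRadixSort<unsigned int, BLOCK_THREADS, ITEMS_PER_THREAD> BlockRadixSort;

    // Opaque shared memory for the sort
    __shared__ typename BlockRadixSort::SmemStorage smem_storage;

    // Blocked arrangement of keys for this thread
    unsigned int keys[ITEMS_PER_THREAD];
    int thread_offset = (blockIdx.x * BLOCK_THREADS + threadIdx.x) * ITEMS_PER_THREAD;
    for (int i = 0; i < ITEMS_PER_THREAD; ++i)
        keys[i] = d_keys[thread_offset + i];

    // Sort over the full key range [0, sizeof(KeyType)*8)
    BlockRadixSort::SortBlocked(smem_storage, keys);

    // Write the sorted keys back in blocked arrangement
    for (int i = 0; i < ITEMS_PER_THREAD; ++i)
        d_keys[thread_offset + i] = keys[i];
}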
This is the complete list of members for cub::BlockReduce< T, BLOCK_THREADS >, including all inherited members.
+Reduce(SmemStorage &smem_storage, T input, ReductionOp reduction_op) | cub::BlockReduce< T, BLOCK_THREADS > | inlinestatic |
Reduce(SmemStorage &smem_storage, T(&inputs)[ITEMS_PER_THREAD], ReductionOp reduction_op) | cub::BlockReduce< T, BLOCK_THREADS > | inlinestatic |
Reduce(SmemStorage &smem_storage, T input, ReductionOp reduction_op, const unsigned int &valid_threads) | cub::BlockReduce< T, BLOCK_THREADS > | inlinestatic |
SmemStorage typedef | cub::BlockReduce< T, BLOCK_THREADS > | |
Sum(SmemStorage &smem_storage, T input) | cub::BlockReduce< T, BLOCK_THREADS > | inlinestatic |
Sum(SmemStorage &smem_storage, T(&inputs)[ITEMS_PER_THREAD]) | cub::BlockReduce< T, BLOCK_THREADS > | inlinestatic |
Sum(SmemStorage &smem_storage, T input, const unsigned int &valid_threads) | cub::BlockReduce< T, BLOCK_THREADS > | inlinestatic |
BlockReduce provides variants of parallel reduction across a CUDA threadblock.
+T | The reduction input/output element type |
BLOCK_THREADS | The threadblock size in threads |
Usage considerations:
- A subsequent __syncthreads() barrier is required if the supplied BlockReduce::SmemStorage is to be reused or repurposed by the threadblock.
Performance considerations:
- Computation is more efficient when T is a built-in C++ primitive or CUDA vector type (e.g., short, int2, double, float2, etc.) and when BLOCK_THREADS is a multiple of the architecture's warp size.
Example: reduce num_elements keys that are partitioned in a partially-full blocked arrangement across BLOCK_THREADS threads (see the sketch at the end of this page).
Public Types:
- typedef _SmemStorage SmemStorage: The operations exposed by BlockReduce require shared memory of this type. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated shared memory or union'd with other types to facilitate shared memory reuse.
+Static Public Methods | |
Generic reductions | |
template<typename ReductionOp > | |
static __device__ __forceinline__ T | Reduce (SmemStorage &smem_storage, T input, ReductionOp reduction_op) |
Computes a threadblock-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes one input element. More... | |
template<int ITEMS_PER_THREAD, typename ReductionOp > | |
static __device__ __forceinline__ T | Reduce (SmemStorage &smem_storage, T(&inputs)[ITEMS_PER_THREAD], ReductionOp reduction_op) |
Computes a threadblock-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes an array of consecutive input elements. More... | |
template<typename ReductionOp > | |
static __device__ __forceinline__ T | Reduce (SmemStorage &smem_storage, T input, ReductionOp reduction_op, const unsigned int &valid_threads) |
Computes a threadblock-wide reduction for thread0 using the specified binary reduction functor. The first valid_threads threads each contribute one input element. More... | |
Summation reductions | |
static __device__ __forceinline__ T | Sum (SmemStorage &smem_storage, T input) |
Computes a threadblock-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes one input element. More... | |
template<int ITEMS_PER_THREAD> | |
static __device__ __forceinline__ T | Sum (SmemStorage &smem_storage, T(&inputs)[ITEMS_PER_THREAD]) |
Computes a threadblock-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes an array of consecutive input elements. More... | |
static __device__ __forceinline__ T | Sum (SmemStorage &smem_storage, T input, const unsigned int &valid_threads) |
Computes a threadblock-wide reduction for thread0 using addition (+) as the reduction operator. The first valid_threads threads each contribute one input element. More... | |
template<typename ReductionOp>
static __device__ __forceinline__ T cub::BlockReduce< T, BLOCK_THREADS >::Reduce (SmemStorage &smem_storage, T input, ReductionOp reduction_op) [inline, static]
Computes a threadblock-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes one input element.
+The return value is undefined in threads other than thread0.
+A subsequent __syncthreads()
threadblock barrier should be invoked after calling this method if the supplied smem_storage
is to be reused or repurposed by the threadblock.
ReductionOp | [inferred] Binary reduction functor type (a model of Binary Function). |
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in] | input | Calling thread's input |
[in] | reduction_op | Binary associative reduction functor |
template<int ITEMS_PER_THREAD, typename ReductionOp>
static __device__ __forceinline__ T cub::BlockReduce< T, BLOCK_THREADS >::Reduce (SmemStorage &smem_storage, T (&inputs)[ITEMS_PER_THREAD], ReductionOp reduction_op) [inline, static]
Computes a threadblock-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes an array of consecutive input elements.
+The return value is undefined in threads other than thread0.
+A subsequent __syncthreads()
threadblock barrier should be invoked after calling this method if the supplied smem_storage
is to be reused or repurposed by the threadblock.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
ReductionOp | [inferred] Binary reduction functor type (a model of Binary Function). |
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in] | inputs | Calling thread's input segment |
[in] | reduction_op | Binary associative reduction functor |
template<typename ReductionOp>
static __device__ __forceinline__ T cub::BlockReduce< T, BLOCK_THREADS >::Reduce (SmemStorage &smem_storage, T input, ReductionOp reduction_op, const unsigned int &valid_threads) [inline, static]
Computes a threadblock-wide reduction for thread0 using the specified binary reduction functor. The first valid_threads
threads each contribute one input element.
The return value is undefined in threads other than thread0.
+A subsequent __syncthreads()
threadblock barrier should be invoked after calling this method if the supplied smem_storage
is to be reused or repurposed by the threadblock.
ReductionOp | [inferred] Binary reduction functor type (a model of Binary Function). |
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in] | input | Calling thread's input |
[in] | reduction_op | Binary associative reduction functor |
[in] | valid_threads | Number of threads containing valid elements (may be less than BLOCK_THREADS) |
static __device__ __forceinline__ T cub::BlockReduce< T, BLOCK_THREADS >::Sum (SmemStorage &smem_storage, T input) [inline, static]
Computes a threadblock-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes one input element.
+The return value is undefined in threads other than thread0.
+A subsequent __syncthreads()
threadblock barrier should be invoked after calling this method if the supplied smem_storage
is to be reused or repurposed by the threadblock.
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in] | input | Calling thread's input |
template<int ITEMS_PER_THREAD>
static __device__ __forceinline__ T cub::BlockReduce< T, BLOCK_THREADS >::Sum (SmemStorage &smem_storage, T (&inputs)[ITEMS_PER_THREAD]) [inline, static]
Computes a threadblock-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes an array of consecutive input elements.
+The return value is undefined in threads other than thread0.
+A subsequent __syncthreads()
threadblock barrier should be invoked after calling this method if the supplied smem_storage
is to be reused or repurposed by the threadblock.
ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in] | inputs | Calling thread's input segment |
static __device__ __forceinline__ T cub::BlockReduce< T, BLOCK_THREADS >::Sum (SmemStorage &smem_storage, T input, const unsigned int &valid_threads) [inline, static]
Computes a threadblock-wide reduction for thread0 using addition (+) as the reduction operator. The first valid_threads
threads each contribute one input element.
A subsequent __syncthreads()
threadblock barrier should be invoked after calling this method if the supplied smem_storage
is to be reused or repurposed by the threadblock.
The return value is undefined in threads other than thread0.
+[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in] | input | Calling thread's input |
[in] | valid_threads | Number of threads containing valid elements (may be less than BLOCK_THREADS) |
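The partially-full reduction referenced above, as a hedged sketch (the kernel, d_in, d_sum, and the include path are illustrative, not from the CUB sources; num_valid plays the role of num_elements for this threadblock's tile, one element per thread):

#include "block_reduce.cuh"   // illustrative include path

template <int BLOCK_THREADS>
__global__ void SumKernel(int *d_in, int *d_sum, int num_valid)
{
    typedef cub::BlockReduce<int, BLOCK_THREADS> BlockReduce;

    // Opaque shared memory for the reduction
    __shared__ typename BlockReduce::SmemStorage smem_storage;

    // Each of the first num_valid threads contributes one element
    int item = (threadIdx.x < num_valid) ? d_in[threadIdx.x] : 0;

    // Threadblock-wide sum; the return value is only valid in thread0
    int aggregate = BlockReduce::Sum(smem_storage, item, num_valid);

    if (threadIdx.x == 0) *d_sum = aggregate;
}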
This is the complete list of members for cub::BlockScan< T, BLOCK_THREADS, POLICY >, including all inherited members.
+ExclusiveScan(SmemStorage &smem_storage, T input, T &output, const T &identity, ScanOp scan_op, T &block_aggregate) | cub::BlockScan< T, BLOCK_THREADS, POLICY > | inlinestatic |
ExclusiveScan(SmemStorage &smem_storage, T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], const T &identity, ScanOp scan_op, T &block_aggregate) | cub::BlockScan< T, BLOCK_THREADS, POLICY > | inlinestatic |
ExclusiveScan(SmemStorage &smem_storage, T input, T &output, T identity, ScanOp scan_op, T &block_aggregate, BlockPrefixOp &block_prefix_op) | cub::BlockScan< T, BLOCK_THREADS, POLICY > | inlinestatic |
ExclusiveScan(SmemStorage &smem_storage, T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T identity, ScanOp scan_op, T &block_aggregate, BlockPrefixOp &block_prefix_op) | cub::BlockScan< T, BLOCK_THREADS, POLICY > | inlinestatic |
ExclusiveScan(SmemStorage &smem_storage, T input, T &output, T identity, ScanOp scan_op) | cub::BlockScan< T, BLOCK_THREADS, POLICY > | inlinestatic |
ExclusiveScan(SmemStorage &smem_storage, T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], const T &identity, ScanOp scan_op) | cub::BlockScan< T, BLOCK_THREADS, POLICY > | inlinestatic |
ExclusiveScan(SmemStorage &smem_storage, T input, T &output, ScanOp scan_op, T &block_aggregate) | cub::BlockScan< T, BLOCK_THREADS, POLICY > | inlinestatic |
ExclusiveScan(SmemStorage &smem_storage, T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], ScanOp scan_op, T &block_aggregate) | cub::BlockScan< T, BLOCK_THREADS, POLICY > | inlinestatic |
ExclusiveScan(SmemStorage &smem_storage, T input, T &output, ScanOp scan_op, T &block_aggregate, BlockPrefixOp &block_prefix_op) | cub::BlockScan< T, BLOCK_THREADS, POLICY > | inlinestatic |
ExclusiveScan(SmemStorage &smem_storage, T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], ScanOp scan_op, T &block_aggregate, BlockPrefixOp &block_prefix_op) | cub::BlockScan< T, BLOCK_THREADS, POLICY > | inlinestatic |
ExclusiveScan(SmemStorage &smem_storage, T input, T &output, ScanOp scan_op) | cub::BlockScan< T, BLOCK_THREADS, POLICY > | inlinestatic |
ExclusiveScan(SmemStorage &smem_storage, T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], ScanOp scan_op) | cub::BlockScan< T, BLOCK_THREADS, POLICY > | inlinestatic |
ExclusiveSum(SmemStorage &smem_storage, T input, T &output, T &block_aggregate) | cub::BlockScan< T, BLOCK_THREADS, POLICY > | inlinestatic |
ExclusiveSum(SmemStorage &smem_storage, T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T &block_aggregate) | cub::BlockScan< T, BLOCK_THREADS, POLICY > | inlinestatic |
ExclusiveSum(SmemStorage &smem_storage, T input, T &output, T &block_aggregate, BlockPrefixOp &block_prefix_op) | cub::BlockScan< T, BLOCK_THREADS, POLICY > | inlinestatic |
ExclusiveSum(SmemStorage &smem_storage, T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T &block_aggregate, BlockPrefixOp &block_prefix_op) | cub::BlockScan< T, BLOCK_THREADS, POLICY > | inlinestatic |
ExclusiveSum(SmemStorage &smem_storage, T input, T &output) | cub::BlockScan< T, BLOCK_THREADS, POLICY > | inlinestatic |
ExclusiveSum(SmemStorage &smem_storage, T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD]) | cub::BlockScan< T, BLOCK_THREADS, POLICY > | inlinestatic |
InclusiveScan(SmemStorage &smem_storage, T input, T &output, ScanOp scan_op, T &block_aggregate) | cub::BlockScan< T, BLOCK_THREADS, POLICY > | inlinestatic |
InclusiveScan(SmemStorage &smem_storage, T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], ScanOp scan_op, T &block_aggregate) | cub::BlockScan< T, BLOCK_THREADS, POLICY > | inlinestatic |
InclusiveScan(SmemStorage &smem_storage, T input, T &output, ScanOp scan_op, T &block_aggregate, BlockPrefixOp &block_prefix_op) | cub::BlockScan< T, BLOCK_THREADS, POLICY > | inlinestatic |
InclusiveScan(SmemStorage &smem_storage, T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], ScanOp scan_op, T &block_aggregate, BlockPrefixOp &block_prefix_op) | cub::BlockScan< T, BLOCK_THREADS, POLICY > | inlinestatic |
InclusiveScan(SmemStorage &smem_storage, T input, T &output, ScanOp scan_op) | cub::BlockScan< T, BLOCK_THREADS, POLICY > | inlinestatic |
InclusiveScan(SmemStorage &smem_storage, T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], ScanOp scan_op) | cub::BlockScan< T, BLOCK_THREADS, POLICY > | inlinestatic |
InclusiveSum(SmemStorage &smem_storage, T input, T &output, T &block_aggregate) | cub::BlockScan< T, BLOCK_THREADS, POLICY > | inlinestatic |
InclusiveSum(SmemStorage &smem_storage, T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T &block_aggregate) | cub::BlockScan< T, BLOCK_THREADS, POLICY > | inlinestatic |
InclusiveSum(SmemStorage &smem_storage, T input, T &output, T &block_aggregate, BlockPrefixOp &block_prefix_op) | cub::BlockScan< T, BLOCK_THREADS, POLICY > | inlinestatic |
InclusiveSum(SmemStorage &smem_storage, T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD], T &block_aggregate, BlockPrefixOp &block_prefix_op) | cub::BlockScan< T, BLOCK_THREADS, POLICY > | inlinestatic |
InclusiveSum(SmemStorage &smem_storage, T input, T &output) | cub::BlockScan< T, BLOCK_THREADS, POLICY > | inlinestatic |
InclusiveSum(SmemStorage &smem_storage, T(&input)[ITEMS_PER_THREAD], T(&output)[ITEMS_PER_THREAD]) | cub::BlockScan< T, BLOCK_THREADS, POLICY > | inlinestatic |
SmemStorage typedef | cub::BlockScan< T, BLOCK_THREADS, POLICY > |
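A minimal sketch of the single-item ExclusiveSum overload listed above, which also returns the threadblock-wide aggregate (the kernel, d_in, d_out, d_aggregate, and the include path are illustrative, not from the CUB sources; BLOCK_SCAN_RAKING is chosen explicitly from the documented BlockScanPolicy enum since the default policy is not shown here):

#include "block_scan.cuh"   // illustrative include path

template <int BLOCK_THREADS>
__global__ void ExclusiveSumKernel(int *d_in, int *d_out, int *d_aggregate)
{
    typedef cub::BlockScan<int, BLOCK_THREADS, cub::BLOCK_SCAN_RAKING> BlockScan;

    // Opaque shared memory for the scan
    __shared__ typename BlockScan::SmemStorage smem_storage;

    // One input element per thread
    int input = d_in[blockIdx.x * BLOCK_THREADS + threadIdx.x];
    int output;
    int block_aggregate;

    // Exclusive prefix sum across the threadblock, also returning the total
    BlockScan::ExclusiveSum(smem_storage, input, output, block_aggregate);

    d_out[blockIdx.x * BLOCK_THREADS + threadIdx.x] = output;
    if (threadIdx.x == 0) d_aggregate[blockIdx.x] = block_aggregate;
}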
This is the complete list of members for cub::BlockStore< OutputIterator, BLOCK_THREADS, ITEMS_PER_THREAD, POLICY, MODIFIER >, including all inherited members.
+SmemStorage typedef | cub::BlockStore< OutputIterator, BLOCK_THREADS, ITEMS_PER_THREAD, POLICY, MODIFIER > | |
Store(SmemStorage &smem_storage, OutputIterator block_itr, T(&items)[ITEMS_PER_THREAD]) | cub::BlockStore< OutputIterator, BLOCK_THREADS, ITEMS_PER_THREAD, POLICY, MODIFIER > | inlinestatic |
Store(SmemStorage &smem_storage, OutputIterator block_itr, const SizeT &guarded_items, T(&items)[ITEMS_PER_THREAD]) | cub::BlockStore< OutputIterator, BLOCK_THREADS, ITEMS_PER_THREAD, POLICY, MODIFIER > | inlinestatic |
BlockStore provides data movement operations for writing blocked-arranged data to global memory.
+BlockStore provides a single tile-storing abstraction whose performance behavior can be statically tuned. In particular, BlockStore implements several alternative cub::BlockStorePolicy strategies catering to different granularity sizes (i.e., number of items per thread).
+OutputIterator | The input iterator type (may be a simple pointer type). |
BLOCK_THREADS | The threadblock size in threads. |
ITEMS_PER_THREAD | The number of consecutive items partitioned onto each thread. |
POLICY | [optional] cub::BlockStorePolicy tuning policy enumeration. Default = cub::BLOCK_STORE_DIRECT. |
MODIFIER | [optional] cub::PtxStoreModifier cache modifier. Default = cub::PTX_STORE_NONE. |
Usage considerations:
- A subsequent __syncthreads() barrier is required if the supplied BlockStore::SmemStorage is to be reused or repurposed by the threadblock.
Example 2: have a threadblock store a blocked arrangement of ITEMS_PER_THREAD consecutive integers per thread using vectorized stores and global-only caching (see the sketch at the end of this page).
Public Types:
- typedef _SmemStorage SmemStorage: The operations exposed by BlockStore require shared memory of this type. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated shared memory or union'd with other types to facilitate shared memory reuse.
+Static Public Methods | |
static __device__ +__forceinline__ void | Store (SmemStorage &smem_storage, OutputIterator block_itr, T(&items)[ITEMS_PER_THREAD]) |
Store a tile of items across a threadblock. More... | |
template<typename SizeT > | |
static __device__ +__forceinline__ void | Store (SmemStorage &smem_storage, OutputIterator block_itr, const SizeT &guarded_items, T(&items)[ITEMS_PER_THREAD]) |
Store a tile of items across a threadblock, guarded by range. More... | |
static __device__ __forceinline__ void cub::BlockStore< OutputIterator, BLOCK_THREADS, ITEMS_PER_THREAD, POLICY, MODIFIER >::Store (SmemStorage &smem_storage, OutputIterator block_itr, T (&items)[ITEMS_PER_THREAD]) [inline, static]
Store a tile of items across a threadblock.
+[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in] | block_itr | The threadblock's base output iterator for storing to |
[in] | items | Data to store |
template<typename SizeT>
static __device__ __forceinline__ void cub::BlockStore< OutputIterator, BLOCK_THREADS, ITEMS_PER_THREAD, POLICY, MODIFIER >::Store (SmemStorage &smem_storage, OutputIterator block_itr, const SizeT &guarded_items, T (&items)[ITEMS_PER_THREAD]) [inline, static]
Store a tile of items across a threadblock, guarded by range.
+SizeT | [inferred] Integer type for offsets |
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in] | block_itr | The threadblock's base output iterator for storing to |
[in] | guarded_items | Number of valid items in the tile |
[in] | items | Data to store |
This is the complete list of members for cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS >, including all inherited members.
+ExclusiveScan(SmemStorage &smem_storage, T input, T &output, const T &identity, ScanOp scan_op) | cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS > | inlinestatic |
ExclusiveScan(SmemStorage &smem_storage, T input, T &output, const T &identity, ScanOp scan_op, T &warp_aggregate) | cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS > | inlinestatic |
ExclusiveScan(SmemStorage &smem_storage, T input, T &output, const T &identity, ScanOp scan_op, T &warp_aggregate, WarpPrefixOp &warp_prefix_op) | cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS > | inlinestatic |
ExclusiveScan(SmemStorage &smem_storage, T input, T &output, ScanOp scan_op) | cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS > | inlinestatic |
ExclusiveScan(SmemStorage &smem_storage, T input, T &output, ScanOp scan_op, T &warp_aggregate) | cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS > | inlinestatic |
ExclusiveScan(SmemStorage &smem_storage, T input, T &output, ScanOp scan_op, T &warp_aggregate, WarpPrefixOp &warp_prefix_op) | cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS > | inlinestatic |
ExclusiveSum(SmemStorage &smem_storage, T input, T &output) | cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS > | inlinestatic |
ExclusiveSum(SmemStorage &smem_storage, T input, T &output, T &warp_aggregate) | cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS > | inlinestatic |
ExclusiveSum(SmemStorage &smem_storage, T input, T &output, T &warp_aggregate, WarpPrefixOp &warp_prefix_op) | cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS > | inlinestatic |
InclusiveScan(SmemStorage &smem_storage, T input, T &output, ScanOp scan_op) | cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS > | inlinestatic |
InclusiveScan(SmemStorage &smem_storage, T input, T &output, ScanOp scan_op, T &warp_aggregate) | cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS > | inlinestatic |
InclusiveScan(SmemStorage &smem_storage, T input, T &output, ScanOp scan_op, T &warp_aggregate, WarpPrefixOp &warp_prefix_op) | cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS > | inlinestatic |
InclusiveSum(SmemStorage &smem_storage, T input, T &output) | cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS > | inlinestatic |
InclusiveSum(SmemStorage &smem_storage, T input, T &output, T &warp_aggregate) | cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS > | inlinestatic |
InclusiveSum(SmemStorage &smem_storage, T input, T &output, T &warp_aggregate, WarpPrefixOp &warp_prefix_op) | cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS > | inlinestatic |
SmemStorage typedef | cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS > |
WarpScan provides variants of parallel prefix scan across a CUDA warp.
+T | The scan input/output element type |
WARPS | The number of "logical" warps performing concurrent warp scans |
LOGICAL_WARP_THREADS | [optional] The number of threads per "logical" warp (may be less than the number of hardware warp threads). Default is the warp size associated with the CUDA Compute Capability targeted by the compiler (e.g., 32 threads for SM20). |
Usage and performance considerations:
- … (warp_prefix_op and warp_aggregate) are only considered valid in lane0
- … __syncthreads() barrier is required if the supplied WarpScan::SmemStorage is to be reused or repurposed by the threadblock
- … (SHFL)
- … (warp_prefix_op and warp_aggregate)
- … T is a built-in C++ primitive or CUDA vector type (e.g., short, int2, double, float2, etc.)
- … LOGICAL_WARP_THREADS is a multiple of the architecture's warp size
- … LOGICAL_WARP_THREADS (which defaults to the warp size associated with the CUDA Compute Capability targeted by the compiler).
+Public Types | |
+typedef _SmemStorage | SmemStorage |
The operations exposed by WarpScan require shared memory of this type. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated shared memory or union 'd with other types to facilitate shared memory reuse. | |
+Static Public Methods | |
Inclusive prefix sums | |
static __device__ +__forceinline__ void | InclusiveSum (SmemStorage &smem_storage, T input, T &output) |
Computes an inclusive prefix sum in each logical warp. More... | |
static __device__ +__forceinline__ void | InclusiveSum (SmemStorage &smem_storage, T input, T &output, T &warp_aggregate) |
Computes an inclusive prefix sum in each logical warp. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. More... | |
template<typename WarpPrefixOp > | |
static __device__ +__forceinline__ void | InclusiveSum (SmemStorage &smem_storage, T input, T &output, T &warp_aggregate, WarpPrefixOp &warp_prefix_op) |
Computes an inclusive prefix sum in each logical warp. Instead of using 0 as the warp-wide prefix, the call-back functor warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. The warp_prefix_op is further updated by the value of warp_aggregate . More... | |
Exclusive prefix sums | |
static __device__ +__forceinline__ void | ExclusiveSum (SmemStorage &smem_storage, T input, T &output) |
Computes an exclusive prefix sum in each logical warp. More... | |
static __device__ +__forceinline__ void | ExclusiveSum (SmemStorage &smem_storage, T input, T &output, T &warp_aggregate) |
Computes an exclusive prefix sum in each logical warp. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. More... | |
template<typename WarpPrefixOp > | |
static __device__ +__forceinline__ void | ExclusiveSum (SmemStorage &smem_storage, T input, T &output, T &warp_aggregate, WarpPrefixOp &warp_prefix_op) |
Computes an exclusive prefix sum in each logical warp. Instead of using 0 as the warp-wide prefix, the call-back functor warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. The warp_prefix_op is further updated by the value of warp_aggregate . More... | |
Inclusive prefix scans | |
template<typename ScanOp > | |
static __device__ +__forceinline__ void | InclusiveScan (SmemStorage &smem_storage, T input, T &output, ScanOp scan_op) |
Computes an inclusive prefix scan using the specified binary scan functor in each logical warp. More... | |
template<typename ScanOp > | |
static __device__ +__forceinline__ void | InclusiveScan (SmemStorage &smem_storage, T input, T &output, ScanOp scan_op, T &warp_aggregate) |
Computes an inclusive prefix scan using the specified binary scan functor in each logical warp. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. More... | |
template<typename ScanOp , typename WarpPrefixOp > | |
static __device__ +__forceinline__ void | InclusiveScan (SmemStorage &smem_storage, T input, T &output, ScanOp scan_op, T &warp_aggregate, WarpPrefixOp &warp_prefix_op) |
Computes an inclusive prefix scan using the specified binary scan functor in each logical warp. The call-back functor warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. The warp_prefix_op is further updated by the value of warp_aggregate . More... | |
Exclusive prefix scans | |
template<typename ScanOp > | |
static __device__ +__forceinline__ void | ExclusiveScan (SmemStorage &smem_storage, T input, T &output, const T &identity, ScanOp scan_op) |
Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. More... | |
template<typename ScanOp > | |
static __device__ +__forceinline__ void | ExclusiveScan (SmemStorage &smem_storage, T input, T &output, const T &identity, ScanOp scan_op, T &warp_aggregate) |
Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. More... | |
template<typename ScanOp , typename WarpPrefixOp > | |
static __device__ +__forceinline__ void | ExclusiveScan (SmemStorage &smem_storage, T input, T &output, const T &identity, ScanOp scan_op, T &warp_aggregate, WarpPrefixOp &warp_prefix_op) |
Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. The call-back functor warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. The warp_prefix_op is further updated by the value of warp_aggregate . More... | |
Exclusive prefix scans (without supplied identity) | |
template<typename ScanOp > | |
static __device__ +__forceinline__ void | ExclusiveScan (SmemStorage &smem_storage, T input, T &output, ScanOp scan_op) |
Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. Because no identity value is supplied, the output computed for thread-lane0 is invalid. More... | |
template<typename ScanOp > | |
static __device__ +__forceinline__ void | ExclusiveScan (SmemStorage &smem_storage, T input, T &output, ScanOp scan_op, T &warp_aggregate) |
Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. Because no identity value is supplied, the output computed for thread-lane0 is invalid. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. More... | |
template<typename ScanOp , typename WarpPrefixOp > | |
static __device__ +__forceinline__ void | ExclusiveScan (SmemStorage &smem_storage, T input, T &output, ScanOp scan_op, T &warp_aggregate, WarpPrefixOp &warp_prefix_op) |
Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. The warp_prefix_op value from thread-lane0 is applied to all scan outputs. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. The warp_prefix_op is further updated by the value of warp_aggregate . More... | |
+
|
+ +inlinestatic | +
Computes an inclusive prefix sum in each logical warp.
+A subsequent __syncthreads()
threadblock barrier should be invoked after calling this method if the supplied smem_storage
is to be reused or repurposed by the threadblock.
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in] | input | Calling thread's input item. |
[out] | output | Calling thread's output item. May be aliased with input . |
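For instance, a kernel computing per-warp inclusive prefix sums might look like the following sketch (the kernel name, data layout, and single-logical-warp configuration are illustrative assumptions, not part of the library):

__global__ void WarpPrefixSumKernel(int *d_data)
{
    // One logical warp of the default LOGICAL_WARP_THREADS size
    typedef cub::WarpScan<int, 1> WarpScan;

    // Opaque shared memory required by WarpScan
    __shared__ typename WarpScan::SmemStorage smem_storage;

    int input = d_data[threadIdx.x];
    int output;

    // Inclusive prefix sum across the warp's inputs
    WarpScan::InclusiveSum(smem_storage, input, output);

    d_data[threadIdx.x] = output;
}

Launched as WarpPrefixSumKernel<<<1, 32>>>(d_data), each element would be replaced by the sum of itself and all preceding elements in its warp.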
+
|
+ +inlinestatic | +
Computes an inclusive prefix sum in each logical warp. Also computes the warp-wide warp_aggregate
of all inputs for thread-lane0.
The warp_aggregate
is undefined in threads other than thread-lane0.
A subsequent __syncthreads()
threadblock barrier should be invoked after calling this method if the supplied smem_storage
is to be reused or repurposed by the threadblock.
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in] | input | Calling thread's input item. |
[out] | output | Calling thread's output item. May be aliased with input . |
[out] | warp_aggregate | [warp-lane0 only] Warp-wide aggregate reduction of input items. |
+
|
+ +inlinestatic | +
Computes an inclusive prefix sum in each logical warp. Instead of using 0 as the warp-wide prefix, the call-back functor warp_prefix_op
is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also computes the warp-wide warp_aggregate
of all inputs for thread-lane0. The warp_prefix_op
is further updated by the value of warp_aggregate
.
The warp_aggregate
is undefined in threads other than thread-lane0.
The warp_prefix_op
functor must implement a member function T operator()(T warp_aggregate)
. The functor's input parameter warp_aggregate
is the same value also returned by the scan operation. This functor is expected to return a warp-wide prefix to be applied to all inputs. The functor will be invoked by the entire warp of threads, however the input and output are undefined in threads other than warp-lane0. Can be stateful.
A subsequent __syncthreads()
threadblock barrier should be invoked after calling this method if the supplied smem_storage
is to be reused or repurposed by the threadblock.
WarpPrefixOp | [inferred] Call-back functor type having member T operator()(T warp_aggregate) |
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in] | input | Calling thread's input item. |
[out] | output | Calling thread's output item. May be aliased with input . |
[out] | warp_aggregate | [warp-lane0 only] Warp-wide aggregate reduction of input items, exclusive of the warp_prefix_op value |
[in,out] | warp_prefix_op | [warp-lane0 only] Call-back functor for specifying a warp-wide prefix to be applied to all inputs. |
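A sketch of one way to satisfy the T operator()(T warp_aggregate) contract described above, e.g., for carrying a running prefix across multiple tiles scanned by the same warp (the struct and field names are illustrative, not part of CUB):

struct WarpRunningPrefix
{
    int running_total;   // prefix seeding the next scan (meaningful in warp-lane0)

    __device__ WarpRunningPrefix(int initial) : running_total(initial) {}

    // Invoked by the warp after each scan; only warp-lane0's value matters
    __device__ int operator()(int warp_aggregate)
    {
        int old_prefix = running_total;
        running_total += warp_aggregate;   // fold this tile's aggregate into the carried prefix
        return old_prefix;                 // seed value logically prefixed onto the warp's inputs
    }
};

An instance would then be passed by reference as the warp_prefix_op argument, e.g., WarpScan::InclusiveSum(smem_storage, input, output, warp_aggregate, prefix_op).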
+
|
+ +inlinestatic | +
Computes an exclusive prefix sum in each logical warp.
+A subsequent __syncthreads()
threadblock barrier should be invoked after calling this method if the supplied smem_storage
is to be reused or repurposed by the threadblock.
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in] | input | Calling thread's input item. |
[out] | output | Calling thread's output item. May be aliased with input . |
+
|
+ +inlinestatic | +
Computes an exclusive prefix sum in each logical warp. Also computes the warp-wide warp_aggregate
of all inputs for thread-lane0.
The warp_aggregate
is undefined in threads other than thread-lane0.
A subsequent __syncthreads()
threadblock barrier should be invoked after calling this method if the supplied smem_storage
is to be reused or repurposed by the threadblock.
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in] | input | Calling thread's input item. |
[out] | output | Calling thread's output item. May be aliased with input . |
[out] | warp_aggregate | [warp-lane0 only] Warp-wide aggregate reduction of input items. |
+
|
+ +inlinestatic | +
Computes an exclusive prefix sum in each logical warp. Instead of using 0 as the warp-wide prefix, the call-back functor warp_prefix_op
is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also computes the warp-wide warp_aggregate
of all inputs for thread-lane0. The warp_prefix_op
is further updated by the value of warp_aggregate
.
The warp_aggregate
is undefined in threads other than thread-lane0.
The warp_prefix_op
functor must implement a member function T operator()(T warp_aggregate)
. The functor's input parameter warp_aggregate
is the same value also returned by the scan operation. This functor is expected to return a warp-wide prefix to be applied to all inputs. The functor will be invoked by the entire warp of threads, however the input and output are undefined in threads other than warp-lane0. Can be stateful.
A subsequent __syncthreads()
threadblock barrier should be invoked after calling this method if the supplied smem_storage
is to be reused or repurposed by the threadblock.
WarpPrefixOp | [inferred] Call-back functor type having member T operator()(T warp_aggregate) |
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in] | input | Calling thread's input item. |
[out] | output | Calling thread's output item. May be aliased with input . |
[out] | warp_aggregate | [warp-lane0 only] Warp-wide aggregate reduction of input items (exclusive of the warp_prefix_op value). |
[in,out] | warp_prefix_op | [warp-lane0 only] Call-back functor for specifying a warp-wide prefix to be applied to all inputs. |
+
|
+ +inlinestatic | +
Computes an inclusive prefix scan using the specified binary scan functor in each logical warp.
+A subsequent __syncthreads()
threadblock barrier should be invoked after calling this method if the supplied smem_storage
is to be reused or repurposed by the threadblock.
ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in] | input | Calling thread's input item. |
[out] | output | Calling thread's output item. May be aliased with input . |
[in] | scan_op | Binary scan operator having member T operator()(const T &a, const T &b) |
+
|
+ +inlinestatic | +
Computes an inclusive prefix scan using the specified binary scan functor in each logical warp. Also computes the warp-wide warp_aggregate
of all inputs for thread-lane0.
The warp_aggregate
is undefined in threads other than thread-lane0.
A subsequent __syncthreads()
threadblock barrier should be invoked after calling this method if the supplied smem_storage
is to be reused or repurposed by the threadblock.
ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in] | input | Calling thread's input item. |
[out] | output | Calling thread's output item. May be aliased with input . |
[in] | scan_op | Binary scan operator having member T operator()(const T &a, const T &b) |
[out] | warp_aggregate | [warp-lane0 only] Warp-wide aggregate reduction of input items. |
+
|
+ +inlinestatic | +
Computes an inclusive prefix scan using the specified binary scan functor in each logical warp. The call-back functor warp_prefix_op
is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also computes the warp-wide warp_aggregate
of all inputs for thread-lane0. The warp_prefix_op
is further updated by the value of warp_aggregate
.
The warp_aggregate
is undefined in threads other than thread-lane0.
The warp_prefix_op
functor must implement a member function T operator()(T warp_aggregate)
. The functor's input parameter warp_aggregate
is the same value also returned by the scan operation. This functor is expected to return a warp-wide prefix to be applied to all inputs. The functor will be invoked by the entire warp of threads, however the input and output are undefined in threads other than warp-lane0. Can be stateful.
A subsequent __syncthreads()
threadblock barrier should be invoked after calling this method if the supplied smem_storage
is to be reused or repurposed by the threadblock.
ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
WarpPrefixOp | [inferred] Call-back functor type having member T operator()(T warp_aggregate) |
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in] | input | Calling thread's input item. |
[out] | output | Calling thread's output item. May be aliased with input . |
[in] | scan_op | Binary scan operator having member T operator()(const T &a, const T &b) |
[out] | warp_aggregate | [warp-lane0 only] Warp-wide aggregate reduction of input items (exclusive of the warp_prefix_op value). |
[in,out] | warp_prefix_op | [warp-lane0 only] Call-back functor for specifying a warp-wide prefix to be applied to all inputs. |
+
|
+ +inlinestatic | +
Computes an exclusive prefix scan using the specified binary scan functor in each logical warp.
+A subsequent __syncthreads()
threadblock barrier should be invoked after calling this method if the supplied smem_storage
is to be reused or repurposed by the threadblock.
ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in] | input | Calling thread's input item. |
[out] | output | Calling thread's output item. May be aliased with input . |
[in] | identity | Identity value |
[in] | scan_op | Binary scan operator having member T operator()(const T &a, const T &b) |
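For example, an exclusive max-scan within a warp might look like this sketch, using the cub::Max functor documented elsewhere in these pages and INT_MIN as the identity (the kernel name and data layout are assumptions):

#include <limits.h>   // INT_MIN

__global__ void WarpExclusiveMaxKernel(int *d_data)
{
    typedef cub::WarpScan<int, 1> WarpScan;   // one logical warp
    __shared__ typename WarpScan::SmemStorage smem_storage;

    int input = d_data[threadIdx.x];
    int output;

    // Exclusive max-scan: INT_MIN acts as the identity for the max operator
    WarpScan::ExclusiveScan(smem_storage, input, output, INT_MIN, cub::Max<int>());

    d_data[threadIdx.x] = output;
}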
+
|
+ +inlinestatic | +
Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. Also computes the warp-wide warp_aggregate
of all inputs for thread-lane0.
The warp_aggregate
is undefined in threads other than thread-lane0.
A subsequent __syncthreads()
threadblock barrier should be invoked after calling this method if the supplied smem_storage
is to be reused or repurposed by the threadblock.
ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in] | input | Calling thread's input item. |
[out] | output | Calling thread's output item. May be aliased with input . |
[in] | identity | Identity value |
[in] | scan_op | Binary scan operator having member T operator()(const T &a, const T &b) |
[out] | warp_aggregate | [warp-lane0 only] Warp-wide aggregate reduction of input items. |
+
|
+ +inlinestatic | +
Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. The call-back functor warp_prefix_op
is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also computes the warp-wide warp_aggregate
of all inputs for thread-lane0. The warp_prefix_op
is further updated by the value of warp_aggregate
.
The warp_aggregate
is undefined in threads other than thread-lane0.
The warp_prefix_op
functor must implement a member function T operator()(T warp_aggregate)
. The functor's input parameter warp_aggregate
is the same value also returned by the scan operation. This functor is expected to return a warp-wide prefix to be applied to all inputs. The functor will be invoked by the entire warp of threads, however the input and output are undefined in threads other than warp-lane0. Can be stateful.
A subsequent __syncthreads()
threadblock barrier should be invoked after calling this method if the supplied smem_storage
is to be reused or repurposed by the threadblock.
ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
WarpPrefixOp | [inferred] Call-back functor type having member T operator()(T warp_aggregate) |
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in] | input | Calling thread's input item. |
[out] | output | Calling thread's output item. May be aliased with input . |
[in] | identity | Identity value |
[in] | scan_op | Binary scan operator having member T operator()(const T &a, const T &b) |
[out] | warp_aggregate | [warp-lane0 only] Warp-wide aggregate reduction of input items (exclusive of the warp_prefix_op value). |
[in,out] | warp_prefix_op | [warp-lane0 only] Call-back functor for specifying a warp-wide prefix to be applied to all inputs. |
+
|
+ +inlinestatic | +
Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. Because no identity value is supplied, the output
computed for thread-lane0 is invalid.
A subsequent __syncthreads()
threadblock barrier should be invoked after calling this method if the supplied smem_storage
is to be reused or repurposed by the threadblock.
ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in] | input | Calling thread's input item. |
[out] | output | Calling thread's output item. May be aliased with input . |
[in] | scan_op | Binary scan operator having member T operator()(const T &a, const T &b) |
+
|
+ +inlinestatic | +
Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. Because no identity value is supplied, the output
computed for thread-lane0 is invalid. Also computes the warp-wide warp_aggregate
of all inputs for thread-lane0.
The warp_aggregate
is undefined in threads other than thread-lane0.
A subsequent __syncthreads()
threadblock barrier should be invoked after calling this method if the supplied smem_storage
is to be reused or repurposed by the threadblock.
ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in] | input | Calling thread's input item. |
[out] | output | Calling thread's output item. May be aliased with input . |
[in] | scan_op | Binary scan operator having member T operator()(const T &a, const T &b) |
[out] | warp_aggregate | [warp-lane0 only] Warp-wide aggregate reduction of input items. |
+
|
+ +inlinestatic | +
Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. The warp_prefix_op
value from thread-lane0 is applied to all scan outputs. Also computes the warp-wide warp_aggregate
of all inputs for thread-lane0. The warp_prefix_op
is further updated by the value of warp_aggregate
.
The warp_aggregate
is undefined in threads other than thread-lane0.
The warp_prefix_op
functor must implement a member function T operator()(T warp_aggregate)
. The functor's input parameter warp_aggregate
is the same value also returned by the scan operation. This functor is expected to return a warp-wide prefix to be applied to all inputs. The functor will be invoked by the entire warp of threads, however the input and output are undefined in threads other than warp-lane0. Can be stateful.
A subsequent __syncthreads()
threadblock barrier should be invoked after calling this method if the supplied smem_storage
is to be reused or repurposed by the threadblock.
ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
WarpPrefixOp | [inferred] Call-back functor type having member T operator()(T warp_aggregate) |
[in] | smem_storage | Shared reference to opaque SmemStorage layout |
[in] | input | Calling thread's input item. |
[out] | output | Calling thread's output item. May be aliased with input . |
[in] | scan_op | Binary scan operator having member T operator()(const T &a, const T &b) |
[out] | warp_aggregate | [warp-lane0 only] Warp-wide aggregate reduction of input items (exclusive of the warp_prefix_op value). |
[in,out] | warp_prefix_op | [warp-lane0 only] Call-back functor for specifying a warp-wide prefix to be applied to all inputs. |
Class index (cub namespace): ArrayTraits, BaseTraits, BlockDiscontinuity, BlockExchange, BlockLoad, BlockRadixSort, BlockReduce, BlockScan, BlockStore, EnableIf, Equality, Equals, If, IsVolatile, Log2, Max, NullType, NumericTraits, RemoveQualifiers, Sum, Traits, WarpScan
#include <stdio.h>
#include "ns_wrapper.cuh"
#include "device_props.cuh"
+Namespaces | |
namespace | cub |
CUB namespace. | |
+Macros | |
#define | CubDebug(f) cub::Debug(f, __FILE__, __LINE__) |
#define | CubDebugExit(f) if (cub::Debug(f, __FILE__, __LINE__)) exit(1) |
+Functions | |
__host__ __device__ +__forceinline__ cudaError_t | cub::Debug (cudaError_t error, const char *message, const char *filename, int line) |
If CUB_STDERR is defined and error is not cudaSuccess , message is printed to stderr along with the supplied source context. More... | |
__host__ __device__ +__forceinline__ cudaError_t | cub::Debug (cudaError_t error, const char *filename, int line) |
If CUB_STDERR is defined and error is not cudaSuccess , the corresponding error message is printed to stderr along with the supplied source context. More... | |
Debug error display routines
+#define CubDebug | +( | ++ | f | ) | +cub::Debug(f, __FILE__, __LINE__) | +
Debug macro
+ +#define CubDebugExit | +( | ++ | f | ) | +if (cub::Debug(f, __FILE__, __LINE__)) exit(1) | +
Debug macro with exit
+ +
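As a sketch of typical host-side usage (assuming CUB_STDERR is defined at compile time and debug.cuh is reachable on the include path; the allocation sizes are illustrative):

#include <cuda_runtime.h>
#include "debug.cuh"

int main()
{
    int *d_buffer = NULL;

    // Report a failure to stderr (when CUB_STDERR is defined) but keep running
    CubDebug(cudaMalloc((void**) &d_buffer, 1024 * sizeof(int)));

    // Report a failure to stderr and exit(1)
    CubDebugExit(cudaMemset(d_buffer, 0, 1024 * sizeof(int)));

    CubDebugExit(cudaFree(d_buffer));
    return 0;
}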
+Files | |
file | block_discontinuity.cuh |
file | block_exchange.cuh |
file | block_load.cuh |
file | block_radix_sort.cuh |
file | block_reduce.cuh |
file | block_scan.cuh |
file | block_store.cuh |
+Files | |
file | thread_load.cuh |
file | thread_store.cuh |
+Files | |
file | warp_scan.cuh |
+Directories | |
directory | block |
directory | thread |
directory | warp |
+Files | |
file | debug.cuh |
file | operators.cuh |
file | type_utils.cuh |
+Functions | |
__host__ __device__ +__forceinline__ cudaError_t | cub::Debug (cudaError_t error, const char *message, const char *filename, int line) |
If CUB_STDERR is defined and error is not cudaSuccess , message is printed to stderr along with the supplied source context. More... | |
__host__ __device__ +__forceinline__ cudaError_t | cub::Debug (cudaError_t error, const char *filename, int line) |
If CUB_STDERR is defined and error is not cudaSuccess , the corresponding error message is printed to stderr along with the supplied source context. More... | |
__host__ __device__ __forceinline__ cudaError_t cub::Debug | +( | +cudaError_t | +error, | +
+ | + | const char * | +message, | +
+ | + | const char * | +filename, | +
+ | + | int | +line | +
+ | ) | ++ |
If CUB_STDERR
is defined and error
is not cudaSuccess
, message
is printed to stderr
along with the supplied source context.
__host__ __device__ __forceinline__ cudaError_t cub::Debug | +( | +cudaError_t | +error, | +
+ | + | const char * | +filename, | +
+ | + | int | +line | +
+ | ) | ++ |
If CUB_STDERR
is defined and error
is not cudaSuccess
, the corresponding error message is printed to stderr
along with the supplied source context.
+Modules | |
Cooperative SIMT Operations | |
SIMT Utilities | |
+Classes | |
class | cub::BlockDiscontinuity< T, BLOCK_THREADS > |
BlockDiscontinuity provides operations for flagging discontinuities within a list of data items partitioned across a CUDA threadblock.
+
+
+ More...![]()
+. | |
class | cub::BlockExchange< T, BLOCK_THREADS, ITEMS_PER_THREAD > |
BlockExchange provides operations for reorganizing the partitioning of ordered data across a CUDA threadblock.
+
+
+ More...![]()
+. | |
class | cub::BlockLoad< InputIterator, BLOCK_THREADS, ITEMS_PER_THREAD, POLICY, MODIFIER > |
BlockLoad provides data movement operations for reading block-arranged data from global memory.
+
+
+ More...![]()
+. | |
class | cub::BlockRadixSort< KeyType, BLOCK_THREADS, ITEMS_PER_THREAD, ValueType, RADIX_BITS, SMEM_CONFIG > |
BlockRadixSort provides variants of parallel radix sorting across a CUDA threadblock.
+
+
+ More...![]()
+. | |
class | cub::BlockReduce< T, BLOCK_THREADS > |
BlockReduce provides variants of parallel reduction across a CUDA threadblock.
+
+
+ More...![]()
+. | |
class | cub::BlockScan< T, BLOCK_THREADS, POLICY > |
BlockScan provides variants of parallel prefix scan (and prefix sum) across a CUDA threadblock.
+
+
+ More...![]()
+. | |
class | cub::BlockStore< OutputIterator, BLOCK_THREADS, ITEMS_PER_THREAD, POLICY, MODIFIER > |
BlockStore provides data movement operations for writing blocked-arranged data to global memory.
+
+
+ More...![]()
+. | |
class | cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS > |
WarpScan provides variants of parallel prefix scan across a CUDA warp.
+
+
+ More...![]()
+. | |
- … T and is specialized for the underlying architecture.
- … union'd with other types so that the shared memory can be reused for other purposes.
- … ld.global.v4.s32 PTX instructions will be generated when T = int and ITEMS_PER_THREAD is a multiple of 4.
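As an illustration of the vectorization note above, a tile-loading kernel along the following lines could be expected to emit ld.global.v4.s32 instructions. This is a sketch only: it assumes BlockLoad exposes a static Load method mirroring the BlockStore::Store interface documented earlier, and the kernel name and parameters are illustrative.

template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void LoadTileKernel(int *d_in)
{
    // ITEMS_PER_THREAD should be a multiple of 4 for full int4 vectorization
    typedef cub::BlockLoad<int*, BLOCK_THREADS, ITEMS_PER_THREAD,
        cub::BLOCK_LOAD_VECTORIZE> BlockLoad;

    __shared__ typename BlockLoad::SmemStorage smem_storage;

    int items[ITEMS_PER_THREAD];
    int block_offset = blockIdx.x * BLOCK_THREADS * ITEMS_PER_THREAD;

    // Read this threadblock's tile using vectorized loads where possible
    BlockLoad::Load(smem_storage, d_in + block_offset, items);

    // ... operate on items ...
}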
#include the cub.cuh header file into your .cu CUDA C++ sources and compile with NVIDIA's nvcc compiler.
CUB namespace. +More...
++Classes | |
struct | ArrayTraits |
Array traits. More... | |
struct | BaseTraits |
Basic type traits. More... | |
class | BlockDiscontinuity |
BlockDiscontinuity provides operations for flagging discontinuities within a list of data items partitioned across a CUDA threadblock.
+
+
+ More...![]()
+. | |
class | BlockExchange |
BlockExchange provides operations for reorganizing the partitioning of ordered data across a CUDA threadblock.
+
+
+ More...![]()
+. | |
class | BlockLoad |
BlockLoad provides data movement operations for reading block-arranged data from global memory.
+
+
+ More...![]()
+. | |
class | BlockRadixSort |
BlockRadixSort provides variants of parallel radix sorting across a CUDA threadblock.
+
+
+ More...![]()
+. | |
class | BlockReduce |
BlockReduce provides variants of parallel reduction across a CUDA threadblock.
+
+
+ More...![]()
+. | |
class | BlockScan |
BlockScan provides variants of parallel prefix scan (and prefix sum) across a CUDA threadblock.
+
+
+ More...![]()
+. | |
class | BlockStore |
BlockStore provides data movement operations for writing blocked-arranged data to global memory.
+
+
+ More...![]()
+. | |
struct | EnableIf |
Simple enable-if (similar to Boost) More... | |
struct | Equality |
Default equality functor. More... | |
struct | Equals |
Type equality test. More... | |
struct | If |
Type selection (IF ? ThenType : ElseType ) More... | |
struct | IsVolatile |
Volatile modifier test. More... | |
struct | Log2 |
Statically determine log2(N), rounded up. More... | |
struct | Max |
Default max functor. More... | |
struct | NullType |
A simple "NULL" marker type. More... | |
struct | NumericTraits |
Numeric type traits. More... | |
struct | RemoveQualifiers |
Removes const and volatile qualifiers from type Tp . More... | |
struct | Sum |
Default sum functor. More... | |
struct | Traits |
Type traits. More... | |
class | WarpScan |
WarpScan provides variants of parallel prefix scan across a CUDA warp.
+
+
+ More...![]()
+. | |
+Enumerations | |
enum | BlockLoadPolicy { BLOCK_LOAD_DIRECT, +BLOCK_LOAD_VECTORIZE, +BLOCK_LOAD_TRANSPOSE + } |
Tuning policy for cub::BlockLoad. More... | |
enum | BlockScanPolicy { BLOCK_SCAN_RAKING, +BLOCK_SCAN_WARPSCANS + } |
Tuning policy for cub::BlockScan. More... | |
enum | BlockStorePolicy { BLOCK_STORE_DIRECT, +BLOCK_STORE_VECTORIZE, +BLOCK_STORE_TRANSPOSE + } |
Tuning policy for cub::BlockStore. More... | |
enum | Category { NOT_A_NUMBER, +SIGNED_INTEGER, +UNSIGNED_INTEGER, +FLOATING_POINT + } |
Basic type traits categories. | |
enum | PtxLoadModifier { + PTX_LOAD_NONE, +PTX_LOAD_CA, +PTX_LOAD_CG, +PTX_LOAD_CS, + + PTX_LOAD_CV, +PTX_LOAD_LDG, +PTX_LOAD_VS + + } |
Enumeration of PTX cache-modifiers for memory load operations. More... | |
enum | PtxStoreModifier { + PTX_STORE_NONE, +PTX_STORE_WB, +PTX_STORE_CG, +PTX_STORE_CS, + + PTX_STORE_WT, +PTX_STORE_VS + + } |
Enumeration of PTX cache-modifiers for memory store operations. More... | |
+Functions | |
__host__ __device__ +__forceinline__ cudaError_t | Debug (cudaError_t error, const char *message, const char *filename, int line) |
If CUB_STDERR is defined and error is not cudaSuccess , message is printed to stderr along with the supplied source context. More... | |
__host__ __device__ +__forceinline__ cudaError_t | Debug (cudaError_t error, const char *filename, int line) |
If CUB_STDERR is defined and error is not cudaSuccess , the corresponding error message is printed to stderr along with the supplied source context. More... | |
Direct threadblock loads (blocked arrangement) | |
template<PtxLoadModifier MODIFIER, typename T , int ITEMS_PER_THREAD, typename InputIterator > | |
__device__ __forceinline__ void | BlockLoadDirect (InputIterator block_itr, T(&items)[ITEMS_PER_THREAD]) |
Load a tile of items across a threadblock directly using the specified cache modifier. More... | |
template<typename T , int ITEMS_PER_THREAD, typename InputIterator > | |
__device__ __forceinline__ void | BlockLoadDirect (InputIterator block_itr, T(&items)[ITEMS_PER_THREAD]) |
Load a tile of items across a threadblock directly. More... | |
template<PtxLoadModifier MODIFIER, typename T , int ITEMS_PER_THREAD, typename InputIterator , typename SizeT > | |
__device__ __forceinline__ void | BlockLoadDirect (InputIterator block_itr, const SizeT &guarded_items, T(&items)[ITEMS_PER_THREAD]) |
Load a tile of items across a threadblock directly using the specified cache modifier, guarded by range. More... | |
template<typename T , int ITEMS_PER_THREAD, typename InputIterator , typename SizeT > | |
__device__ __forceinline__ void | BlockLoadDirect (InputIterator block_itr, const SizeT &guarded_items, T(&items)[ITEMS_PER_THREAD]) |
Load a tile of items across a threadblock directly, guarded by range. More... | |
template<PtxLoadModifier MODIFIER, typename T , int ITEMS_PER_THREAD, typename InputIterator , typename SizeT > | |
__device__ __forceinline__ void | BlockLoadDirect (InputIterator block_itr, const SizeT &guarded_items, T oob_default, T(&items)[ITEMS_PER_THREAD]) |
Load a tile of items across a threadblock directly using the specified cache modifier, guarded by range, with assignment for out-of-bound elements. More... | |
template<typename T , int ITEMS_PER_THREAD, typename InputIterator , typename SizeT > | |
__device__ __forceinline__ void | BlockLoadDirect (InputIterator block_itr, const SizeT &guarded_items, T oob_default, T(&items)[ITEMS_PER_THREAD]) |
Load a tile of items across a threadblock directly, guarded by range, with assignment for out-of-bound elements. More... | |
Direct threadblock loads (striped arrangement) | |
template<PtxLoadModifier MODIFIER, typename T , int ITEMS_PER_THREAD, typename InputIterator > | |
__device__ __forceinline__ void | BlockLoadDirectStriped (InputIterator block_itr, T(&items)[ITEMS_PER_THREAD], int stride=blockDim.x) |
Load striped tile directly using the specified cache modifier. More... | |
template<typename T , int ITEMS_PER_THREAD, typename InputIterator > | |
__device__ __forceinline__ void | BlockLoadDirectStriped (InputIterator block_itr, T(&items)[ITEMS_PER_THREAD], int stride=blockDim.x) |
Load striped tile directly. More... | |
template<PtxLoadModifier MODIFIER, typename T , int ITEMS_PER_THREAD, typename InputIterator , typename SizeT > | |
__device__ __forceinline__ void | BlockLoadDirectStriped (InputIterator block_itr, const SizeT &guarded_items, T(&items)[ITEMS_PER_THREAD], int stride=blockDim.x) |
Load striped tile directly using the specified cache modifier, guarded by range. More... | |
template<typename T , int ITEMS_PER_THREAD, typename InputIterator , typename SizeT > | |
__device__ __forceinline__ void | BlockLoadDirectStriped (InputIterator block_itr, const SizeT &guarded_items, T(&items)[ITEMS_PER_THREAD], int stride=blockDim.x) |
Load striped tile directly, guarded by range. More... | |
template<PtxLoadModifier MODIFIER, typename T , int ITEMS_PER_THREAD, typename InputIterator , typename SizeT > | |
__device__ __forceinline__ void | BlockLoadDirectStriped (InputIterator block_itr, const SizeT &guarded_items, T oob_default, T(&items)[ITEMS_PER_THREAD], int stride=blockDim.x) |
Load striped tile directly using the specified cache modifier, guarded by range, with assignment for out-of-bound elements. More... | |
template<typename T , int ITEMS_PER_THREAD, typename InputIterator , typename SizeT > | |
__device__ __forceinline__ void | BlockLoadDirectStriped (InputIterator block_itr, const SizeT &guarded_items, T oob_default, T(&items)[ITEMS_PER_THREAD], int stride=blockDim.x) |
Load striped tile directly, guarded by range, with assignment for out-of-bound elements. More... | |
Threadblock vectorized loads (blocked arrangement) | |
template<PtxLoadModifier MODIFIER, typename T , int ITEMS_PER_THREAD> | |
__device__ __forceinline__ void | BlockLoadVectorized (T *block_ptr, T(&items)[ITEMS_PER_THREAD]) |
Load a tile of items across a threadblock directly using the specified cache modifier. More... | |
template<typename T , int ITEMS_PER_THREAD> | |
__device__ __forceinline__ void | BlockLoadVectorized (T *block_ptr, T(&items)[ITEMS_PER_THREAD]) |
Load a tile of items across a threadblock directly. More... | |
Direct threadblock stores (blocked arrangement) | |
template<PtxStoreModifier MODIFIER, typename T , int ITEMS_PER_THREAD, typename OutputIterator > | |
__device__ __forceinline__ void | BlockStoreDirect (OutputIterator block_itr, T(&items)[ITEMS_PER_THREAD]) |
Store a tile of items across a threadblock directly using the specified cache modifier. More... | |
template<typename T , int ITEMS_PER_THREAD, typename OutputIterator > | |
__device__ __forceinline__ void | BlockStoreDirect (OutputIterator block_itr, T(&items)[ITEMS_PER_THREAD]) |
Store a tile of items across a threadblock directly. More... | |
template<PtxStoreModifier MODIFIER, typename T , int ITEMS_PER_THREAD, typename OutputIterator , typename SizeT > | |
__device__ __forceinline__ void | BlockStoreDirect (OutputIterator block_itr, const SizeT &guarded_items, T(&items)[ITEMS_PER_THREAD]) |
Store a tile of items across a threadblock directly using the specified cache modifier, guarded by range. More... | |
template<typename T , int ITEMS_PER_THREAD, typename OutputIterator , typename SizeT > | |
__device__ __forceinline__ void | BlockStoreDirect (OutputIterator block_itr, const SizeT &guarded_items, T(&items)[ITEMS_PER_THREAD]) |
Store a tile of items across a threadblock directly, guarded by range. More... | |
Direct threadblock stores (striped arrangement) | |
template<PtxStoreModifier MODIFIER, typename T , int ITEMS_PER_THREAD, typename OutputIterator > | |
__device__ __forceinline__ void | BlockStoreDirectStriped (OutputIterator block_itr, T(&items)[ITEMS_PER_THREAD], int stride=blockDim.x) |
Store striped tile directly using the specified cache modifier. More... | |
template<typename T , int ITEMS_PER_THREAD, typename OutputIterator > | |
__device__ __forceinline__ void | BlockStoreDirectStriped (OutputIterator block_itr, T(&items)[ITEMS_PER_THREAD], int stride=blockDim.x) |
Store striped tile directly. More... | |
template<PtxStoreModifier MODIFIER, typename T , int ITEMS_PER_THREAD, typename OutputIterator , typename SizeT > | |
__device__ __forceinline__ void | BlockStoreDirectStriped (OutputIterator block_itr, const SizeT &guarded_items, T(&items)[ITEMS_PER_THREAD], int stride=blockDim.x) |
Store striped tile directly using the specified cache modifier, guarded by range. | |
template<typename T , int ITEMS_PER_THREAD, typename OutputIterator , typename SizeT > | |
__device__ __forceinline__ void | BlockStoreDirectStriped (OutputIterator block_itr, const SizeT &guarded_items, T(&items)[ITEMS_PER_THREAD], int stride=blockDim.x) |
Store striped tile directly, guarded by range. More... | |
Threadblock vectorized stores (blocked arrangement) | |
template<PtxStoreModifier MODIFIER, typename T , int ITEMS_PER_THREAD> | |
__device__ __forceinline__ void | BlockStoreVectorized (T *block_ptr, T(&items)[ITEMS_PER_THREAD]) |
Store a tile of items across a threadblock directly using the specified cache modifier. More... | |
template<typename T , int ITEMS_PER_THREAD> | |
__device__ __forceinline__ void | BlockStoreVectorized (T *block_ptr, T(&items)[ITEMS_PER_THREAD]) |
Store a tile of items across a threadblock directly. More... | |
Thread utilities for memory I/O using PTX cache modifiers | |
template<PtxLoadModifier MODIFIER, typename InputIterator > | |
__device__ __forceinline__ +std::iterator_traits +< InputIterator >::value_type | ThreadLoad (InputIterator itr) |
Thread utility for reading memory using cub::PtxLoadModifier cache modifiers. More... | |
template<PtxStoreModifier MODIFIER, typename OutputIterator , typename T > | |
__device__ __forceinline__ void | ThreadStore (OutputIterator itr, const T &val) |
Thread utility for writing memory using cub::PtxStoreModifier cache modifiers. More... | |
CUB namespace.
+enum cub::BlockScanPolicy | +
Tuning policy for cub::BlockScan.
+enum cub::BlockLoadPolicy | +
Tuning policy for cub::BlockLoad.
+Enumerator | |
---|---|
BLOCK_LOAD_DIRECT |
+
A blocked arrangement of data is read directly from memory. The threadblock reads items in a parallel "raking" fashion: threadi reads the ith segment of consecutive elements. +
|
BLOCK_LOAD_VECTORIZE |
+
A blocked arrangement of data is read directly from memory using CUDA's built-in vectorized loads as a coalescing optimization. The threadblock reads items in a parallel "raking" fashion: threadi uses vector loads to read the ith segment of consecutive elements. For example, ld.global.v4.s32 PTX instructions will be generated when T = int and ITEMS_PER_THREAD is a multiple of 4. +
|
BLOCK_LOAD_TRANSPOSE |
+
A striped arrangement of data is read directly from memory and then is locally transposed into a blocked arrangement. The threadblock reads items in a parallel "strip-mining" fashion: threadi reads items having stride BLOCK_THREADS between them. +
|
enum cub::BlockStorePolicy | +
Tuning policy for cub::BlockStore.
+Enumerator | |
---|---|
BLOCK_STORE_DIRECT |
+
A blocked arrangement of data is written directly to memory. The threadblock writes items in a parallel "raking" fashion: threadi writes the ith segment of consecutive elements. +
|
BLOCK_STORE_VECTORIZE |
+
A blocked arrangement of data is written directly to memory using CUDA's built-in vectorized stores as a coalescing optimization. The threadblock writes items in a parallel "raking" fashion: threadi uses vector stores to write the ith segment of consecutive elements. For example, st.global.v4.s32 PTX instructions will be generated when T = int and ITEMS_PER_THREAD is a multiple of 4. +
|
BLOCK_STORE_TRANSPOSE |
+
A blocked arrangement of data is locally transposed into a striped arrangement which is then written directly to memory. The threadblock writes items in a parallel "strip-mining" fashion: threadi writes items having stride BLOCK_THREADS between them. +
|
+Classes | |
struct | cub::Equality< T > |
Default equality functor. More... | |
struct | cub::Max< T > |
Default max functor. More... | |
struct | cub::Sum< T > |
Default sum functor. More... | |
+Namespaces | |
namespace | cub |
CUB namespace. | |
Simple binary operator functor types
+
Array traits.
+
This is the complete list of members for cub::BaseTraits< _CATEGORY, _PRIMITIVE, _NULL_TYPE, _UnsignedBits >, including all inherited members.
+
Basic type traits.
++Public Types | |
enum | { PRIMITIVE = _PRIMITIVE, +NULL_TYPE = _NULL_TYPE + } |
+Static Public Members | |
+static const Category | CATEGORY = _CATEGORY |
Category. | |
This is the complete list of members for cub::EnableIf< Condition, T >, including all inherited members.
+Type typedef | cub::EnableIf< Condition, T > |
Simple enable-if (similar to Boost)
++Public Types | |
+typedef T | Type |
Enable-if type for SFINAE dummy variables. | |
This is the complete list of members for cub::Equality< T >, including all inherited members.
+operator()(const T &a, const T &b) | cub::Equality< T > | inline |
Default equality functor.
++Public Methods | |
+__host__ __device__ +__forceinline__ bool | operator() (const T &a, const T &b) |
Boolean equality operator, returns (a == b) | |
This is the complete list of members for cub::Equals< A, B >, including all inherited members.
+NEGATE enum value (defined in cub::Equals< A, B >) | cub::Equals< A, B > | |
VALUE enum value (defined in cub::Equals< A, B >) | cub::Equals< A, B > |
Type equality test.
++Public Types | |
enum | { VALUE = 0, +NEGATE = 1 + } |
This is the complete list of members for cub::If< IF, ThenType, ElseType >, including all inherited members.
+Type typedef | cub::If< IF, ThenType, ElseType > |
Type selection (IF ? ThenType : ElseType
)
+Public Types | |
+typedef ThenType | Type |
Conditional type result. | |
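As a small compile-time sketch of how these metafunctions compose (the typedef and enumerator names are illustrative):

// Select a 64-bit signed type: If<COND, ThenType, ElseType>::Type
typedef cub::If<(sizeof(long) == 8), long, long long>::Type MyInt64;

// Statically compute ceil(log2(N)) with cub::Log2 (documented below)
enum { LOG_BLOCK_THREADS = cub::Log2<256>::VALUE };   // 8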
This is the complete list of members for cub::IsVolatile< Tp >, including all inherited members.
+VALUE enum value (defined in cub::IsVolatile< Tp >) | cub::IsVolatile< Tp > |
Volatile modifier test.
++Public Types | |
enum | { VALUE = 0 + } |
This is the complete list of members for cub::Log2< N, CURRENT_VAL, COUNT >, including all inherited members.
+VALUE | cub::Log2< N, CURRENT_VAL, COUNT > | static |
Statically determine log2(N), rounded up.
+For example: Log2<8>::VALUE // 3 Log2<3>::VALUE // 2
++Static Public Members | |
+static const int | VALUE = Log2<N, (CURRENT_VAL >> 1), COUNT + 1>::VALUE |
Static logarithm value. | |
This is the complete list of members for cub::Max< T >, including all inherited members.
+operator()(const T &a, const T &b) | cub::Max< T > | inline |
Default max functor.
++Public Methods | |
+__host__ __device__ +__forceinline__ T | operator() (const T &a, const T &b) |
Max operator, returns (a > b) ? a : b | |
A simple "NULL" marker type.
+
This is the complete list of members for cub::NumericTraits< T >, including all inherited members.
+CATEGORY | cub::BaseTraits< NOT_A_NUMBER, false, false, T > | static |
NULL_TYPE enum value (defined in cub::BaseTraits< NOT_A_NUMBER, false, false, T >) | cub::BaseTraits< NOT_A_NUMBER, false, false, T > | |
PRIMITIVE enum value (defined in cub::BaseTraits< NOT_A_NUMBER, false, false, T >) | cub::BaseTraits< NOT_A_NUMBER, false, false, T > |
Numeric type traits.
++Additional Inherited Members | |
enum | |
+static const Category | CATEGORY |
Category. | |
This is the complete list of members for cub::RemoveQualifiers< Tp, Up >, including all inherited members.
+Type typedef | cub::RemoveQualifiers< Tp, Up > |
Removes const
and volatile
qualifiers from type Tp
.
For example: typename RemoveQualifiers<volatile int>::Type // int;
+Public Types | |
+typedef Up | Type |
Type without const and volatile qualifiers. | |
This is the complete list of members for cub::Sum< T >, including all inherited members.
+operator()(const T &a, const T &b) | cub::Sum< T > | inline |
Default sum functor.
++Public Methods | |
+__host__ __device__ +__forceinline__ T | operator() (const T &a, const T &b) |
Sum operator, returns a + b | |
This is the complete list of members for cub::Traits< T >, including all inherited members.
+
Type traits.
++Additional Inherited Members | |
enum | |
+static const Category | CATEGORY |
Category. | |
#include <cuda.h>
#include <iterator>
#include "../ptx_intrinsics.cuh"
#include "../type_utils.cuh"
#include "../ns_wrapper.cuh"
+Namespaces | |
namespace | cub |
CUB namespace. | |
+Enumerations | |
enum | cub::PtxLoadModifier { + cub::PTX_LOAD_NONE, +cub::PTX_LOAD_CA, +cub::PTX_LOAD_CG, +cub::PTX_LOAD_CS, + + cub::PTX_LOAD_CV, +cub::PTX_LOAD_LDG, +cub::PTX_LOAD_VS + + } |
Enumeration of PTX cache-modifiers for memory load operations. More... | |
+Functions | |
Thread utilities for memory I/O using PTX cache modifiers | |
template<PtxLoadModifier MODIFIER, typename InputIterator > | |
__device__ __forceinline__ +std::iterator_traits +< InputIterator >::value_type | cub::ThreadLoad (InputIterator itr) |
Thread utility for reading memory using cub::PtxLoadModifier cache modifiers. More... | |
Thread utilities for reading memory using PTX cache modifiers.
+
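A sketch of per-thread usage, combining ThreadLoad with the companion ThreadStore utility documented next (the kernel name and choice of cache modifiers are illustrative):

__global__ void ModifiedCopyKernel(int *d_in, int *d_out)
{
    int offset = (blockIdx.x * blockDim.x) + threadIdx.x;

    // Load with the "cache streaming" modifier (data likely touched only once)
    int val = cub::ThreadLoad<cub::PTX_LOAD_CS>(d_in + offset);

    // Store with the "write-through" modifier
    cub::ThreadStore<cub::PTX_STORE_WT>(d_out + offset, val);
}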
#include <cuda.h>
#include "../ptx_intrinsics.cuh"
#include "../type_utils.cuh"
#include "../ns_wrapper.cuh"
+Namespaces | |
namespace | cub |
CUB namespace. | |
+Enumerations | |
enum | cub::PtxStoreModifier { + cub::PTX_STORE_NONE, +cub::PTX_STORE_WB, +cub::PTX_STORE_CG, +cub::PTX_STORE_CS, + + cub::PTX_STORE_WT, +cub::PTX_STORE_VS + + } |
Enumeration of PTX cache-modifiers for memory store operations. More... | |
+Functions | |
Thread utilities for memory I/O using PTX cache modifiers | |
template<PtxStoreModifier MODIFIER, typename OutputIterator , typename T > | |
__device__ __forceinline__ void | cub::ThreadStore (OutputIterator itr, const T &val) |
Thread utility for writing memory using cub::PtxStoreModifier cache modifiers. More... | |
Thread utilities for writing memory using PTX cache modifiers.
+
#include <iostream>
#include "ns_wrapper.cuh"
+Classes | |
struct | cub::ArrayTraits< ArrayType, LENGTH > |
Array traits. More... | |
struct | cub::BaseTraits< _CATEGORY, _PRIMITIVE, _NULL_TYPE, _UnsignedBits > |
Basic type traits. More... | |
struct | cub::EnableIf< Condition, T > |
Simple enable-if (similar to Boost) More... | |
struct | cub::Equals< A, B > |
Type equality test. More... | |
struct | cub::If< IF, ThenType, ElseType > |
Type selection (IF ? ThenType : ElseType ) More... | |
struct | cub::IsVolatile< Tp > |
Volatile modifier test. More... | |
struct | cub::Log2< N, CURRENT_VAL, COUNT > |
Statically determine log2(N), rounded up. More... | |
struct | cub::NullType |
A simple "NULL" marker type. More... | |
struct | cub::NumericTraits< T > |
Numeric type traits. More... | |
struct | cub::RemoveQualifiers< Tp, Up > |
Removes const and volatile qualifiers from type Tp . More... | |
struct | cub::Traits< T > |
Type traits. More... | |
+Namespaces | |
namespace | cub |
CUB namespace. | |
+Macros | |
#define | CUB_HAS_NESTED_TYPE(detect_struct, nested_type_name) |
+Enumerations | |
enum | cub::Category { NOT_A_NUMBER, +SIGNED_INTEGER, +UNSIGNED_INTEGER, +FLOATING_POINT + } |
Basic type traits categories. | |
Common type manipulation (metaprogramming) utilities
+#define CUB_HAS_NESTED_TYPE | +( | ++ | detect_struct, | +
+ | + | + | nested_type_name | +
+ | ) | ++ |
Allows the definition of structures that will detect the presence of the specified type name within other classes
+ +
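A hedged sketch of how such a detector might be defined and queried; the VALUE member on the generated detector is an assumption (made for consistency with the other metafunctions in this header), not something this page documents:

struct WithKey { typedef int Key; };   // has a nested type named Key
struct WithoutKey { };                 // does not

// Generates a detector template (named HasNestedKey here) for the nested type name "Key"
CUB_HAS_NESTED_TYPE(HasNestedKey, Key)

// Assumed query form: HasNestedKey<WithKey>::VALUE evaluates true-ish,
// HasNestedKey<WithoutKey>::VALUE evaluates false-ish.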
#include "../thread/thread_load.cuh"
#include "../thread/thread_store.cuh"
#include "../device_props.cuh"
#include "../type_utils.cuh"
#include "../operators.cuh"
#include "../ns_wrapper.cuh"
+Classes | |
class | cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS > |
WarpScan provides variants of parallel prefix scan across a CUDA warp.
+
+
+ More...![]()
+. | |
+Namespaces | |
namespace | cub |
CUB namespace. | |
cub::WarpScan provides variants of parallel prefix scan across a CUDA warp.
++ * \par + * - Blocked arrangement. The aggregate tile of items is partitioned + * evenly across threads in "blocked" fashion with threadi + * owning the ith segment of consecutive elements. + * Blocked arrangements are often desirable for algorithmic benefits (where + * long sequences of items can be processed sequentially within each thread). + * | + *
+ * \par
+ * \image html blocked.png
+ * Blocked arrangement across four threads
+ * (emphasis on items owned by thread0) |
+ *
+ * \par + * - Striped arrangement. The aggregate tile of items is partitioned across + * threads in "striped" fashion, i.e., the \p ITEMS_PER_THREAD items owned by + * each thread have logical stride \p BLOCK_THREADS between them. Striped arrangements + * are often desirable for data movement through global memory (where + * [read/write coalescing](http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#coalesced-access-global-memory) + * is an important performance consideration). + * | + *
+ * \par
+ * \image html striped.png
+ * Striped arrangement across four threads
+ * (emphasis on items owned by thread0) |
+ *