diff --git a/CHANGE_LOG.TXT b/CHANGE_LOG.TXT
index 6cbb0b044f..5713441b04 100644
--- a/CHANGE_LOG.TXT
+++ b/CHANGE_LOG.TXT
@@ -1,5 +1,10 @@
 //-----------------------------------------------------------------------------
+1.2.0 02/25/2014
+ - New features:
+
+//-----------------------------------------------------------------------------
+
 1.1.1 12/11/2013
  - New features:
  - Added TexObjInputIterator, TexRefInputIterator, CacheModifiedInputIterator, and CacheModifiedOutputIterator types for loading & storing arbitrary types through the cache hierarchy. Compatible with Thrust API.
diff --git a/LICENSE.TXT b/LICENSE.TXT
index 99415665a9..ab0505796d 100644
--- a/LICENSE.TXT
+++ b/LICENSE.TXT
@@ -1,5 +1,5 @@
 Copyright (c) 2010-2011, Duane Merrill. All rights reserved.
-Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
+Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved.
 Redistribution and use in source and binary forms, with or without
 modification, are permitted provided that the following conditions are met:
diff --git a/README.md b/README.md
index 7bb5ca7fd4..e89220a2ae 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,7 @@

About CUB

-Current release: v1.1.1 (December 11, 2013)
+Current release: v1.2.0 (February 25, 2014)
 We recommend the [CUB Project Website](http://nvlabs.github.com/cub) and the [cub-users discussion forum](http://groups.google.com/group/cub-users) for further information and examples.
@@ -84,6 +84,7 @@ See [CUB Project Website](http://nvlabs.github.com/cub) for more information.
 | Date | Version |
 | ---- | ------- |
+| 02/25/2014 | [CUB v1.2.0 Primary Release](https://github.com/NVlabs/cub/archive/1.2.0.zip) |
 | 12/10/2013 | [CUB v1.1.1 Primary Release](https://github.com/NVlabs/cub/archive/1.1.1.zip) |
 | 08/08/2013 | [CUB v1.0.1 Primary Release](https://github.com/NVlabs/cub/archive/1.0.1.zip) |
 | 05/07/2013 | [CUB v0.9.4 Update Release](https://github.com/NVlabs/cub/archive/0.9.4.zip) |
@@ -104,7 +105,7 @@ CUB is available under the "New BSD" open-source license:
 ```
 Copyright (c) 2010-2011, Duane Merrill. All rights reserved.
-Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
+Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved.
 Redistribution and use in source and binary forms, with or without
 modification, are permitted provided that the following conditions are met:
diff --git a/cub/block/block_discontinuity.cuh b/cub/block/block_discontinuity.cuh
index 3cb48a375d..8348ee7f7e 100644
--- a/cub/block/block_discontinuity.cuh
+++ b/cub/block/block_discontinuity.cuh
@@ -1,6 +1,6 @@
 /******************************************************************************
  * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met:
@@ -61,7 +61,7 @@ namespace cub {
  * \blockcollective{BlockDiscontinuity}
  * \par
  * The code snippet below illustrates the head flagging of 512 integer items that
- * are partitioned in a [blocked arrangement](index.html#sec4sec3) across 128 threads
+ * are partitioned in a [blocked arrangement](index.html#sec5sec3) across 128 threads
  * where each thread owns 4 consecutive items.
  * \par
  * \code
@@ -274,7 +274,7 @@ public:
  *
  * \par
  * The code snippet below illustrates the head-flagging of 512 integer items that
- * are partitioned in a [blocked arrangement](index.html#sec4sec3) across 128 threads
+ * are partitioned in a [blocked arrangement](index.html#sec5sec3) across 128 threads
  * where each thread owns 4 consecutive items.
  * \par
  * \code
@@ -352,7 +352,7 @@ public:
  *
  * \par
  * The code snippet below illustrates the head-flagging of 512 integer items that
- * are partitioned in a [blocked arrangement](index.html#sec4sec3) across 128 threads
+ * are partitioned in a [blocked arrangement](index.html#sec5sec3) across 128 threads
  * where each thread owns 4 consecutive items.
  * \par
  * \code
@@ -445,7 +445,7 @@ public:
  *
  * \par
  * The code snippet below illustrates the tail-flagging of 512 integer items that
- * are partitioned in a [blocked arrangement](index.html#sec4sec3) across 128 threads
+ * are partitioned in a [blocked arrangement](index.html#sec5sec3) across 128 threads
  * where each thread owns 4 consecutive items.
  * \par
  * \code
@@ -524,7 +524,7 @@ public:
  *
  * \par
  * The code snippet below illustrates the tail-flagging of 512 integer items that
- * are partitioned in a [blocked arrangement](index.html#sec4sec3) across 128 threads
+ * are partitioned in a [blocked arrangement](index.html#sec5sec3) across 128 threads
  * where each thread owns 4 consecutive items.
  * \par
  * \code
diff --git a/cub/block/block_exchange.cuh b/cub/block/block_exchange.cuh
index 87c9e74b6f..f94947cb17 100644
--- a/cub/block/block_exchange.cuh
+++ b/cub/block/block_exchange.cuh
@@ -1,6 +1,6 @@
 /******************************************************************************
  * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met:
@@ -60,10 +60,10 @@ namespace cub {
  * yet most block-wide operations prefer a "blocked" partitioning of items across threads
  * (where consecutive items belong to a single thread).
  * - BlockExchange supports the following types of data exchanges:
- * - Transposing between [blocked](index.html#sec4sec3) and [striped](index.html#sec4sec3) arrangements
- * - Transposing between [blocked](index.html#sec4sec3) and [warp-striped](index.html#sec4sec3) arrangements
- * - Scattering ranked items to a [blocked arrangement](index.html#sec4sec3)
- * - Scattering ranked items to a [striped arrangement](index.html#sec4sec3)
+ * - Transposing between [blocked](index.html#sec5sec3) and [striped](index.html#sec5sec3) arrangements
+ * - Transposing between [blocked](index.html#sec5sec3) and [warp-striped](index.html#sec5sec3) arrangements
+ * - Scattering ranked items to a [blocked arrangement](index.html#sec5sec3)
+ * - Scattering ranked items to a [striped arrangement](index.html#sec5sec3)
  *
  * \par A Simple Example
  * \blockcollective{BlockExchange}
diff --git a/cub/block/block_histogram.cuh b/cub/block/block_histogram.cuh
index f1b2f44ff7..ec2eede6fb 100644
--- a/cub/block/block_histogram.cuh
+++ b/cub/block/block_histogram.cuh
@@ -1,6 +1,6 @@
 /******************************************************************************
  * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met:
diff --git a/cub/block/block_load.cuh b/cub/block/block_load.cuh
index 1a938015fd..739290a4ff 100644
--- a/cub/block/block_load.cuh
+++ b/cub/block/block_load.cuh
@@ -1,6 +1,6 @@
 /******************************************************************************
  * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met:
@@ -441,7 +441,7 @@ enum BlockLoadAlgorithm
 /**
  * \par Overview
  *
- * A [blocked arrangement](index.html#sec4sec3) of data is read
+ * A [blocked arrangement](index.html#sec5sec3) of data is read
  * directly from memory. The thread block reads items in a parallel "raking" fashion: thread<sub><em>i</em></sub>
  * reads the <em>i</em><sup>th</sup> segment of consecutive elements.
  *
@@ -454,7 +454,7 @@ enum BlockLoadAlgorithm
 /**
  * \par Overview
  *
- * A [blocked arrangement](index.html#sec4sec3) of data is read directly
+ * A [blocked arrangement](index.html#sec5sec3) of data is read directly
  * from memory using CUDA's built-in vectorized loads as a coalescing optimization.
  * The thread block reads items in a parallel "raking" fashion: thread<sub><em>i</em></sub> uses vector loads to
  * read the <em>i</em><sup>th</sup> segment of consecutive elements.
@@ -476,13 +476,13 @@ enum BlockLoadAlgorithm
 /**
  * \par Overview
  *
- * A [striped arrangement](index.html#sec4sec3) of data is read
+ * A [striped arrangement](index.html#sec5sec3) of data is read
  * directly from memory and then is locally transposed into a
- * [blocked arrangement](index.html#sec4sec3). The thread block
+ * [blocked arrangement](index.html#sec5sec3). The thread block
  * reads items in a parallel "strip-mining" fashion:
  * thread<sub><em>i</em></sub> reads items having stride \p BLOCK_THREADS
  * between them. cub::BlockExchange is then used to locally reorder the items
- * into a [blocked arrangement](index.html#sec4sec3).
+ * into a [blocked arrangement](index.html#sec5sec3).
  *
  * \par Performance Considerations
  * - The utilization of memory transactions (coalescing) remains high regardless
@@ -496,13 +496,13 @@ enum BlockLoadAlgorithm
 /**
  * \par Overview
  *
- * A [warp-striped arrangement](index.html#sec4sec3) of data is read
+ * A [warp-striped arrangement](index.html#sec5sec3) of data is read
  * directly from memory and then is locally transposed into a
- * [blocked arrangement](index.html#sec4sec3). Each warp reads its own
+ * [blocked arrangement](index.html#sec5sec3). Each warp reads its own
  * contiguous segment in a parallel "strip-mining" fashion: lane<sub><em>i</em></sub>
  * reads items having stride \p WARP_THREADS between them. cub::BlockExchange
  * is then used to locally reorder the items into a
- * [blocked arrangement](index.html#sec4sec3).
+ * [blocked arrangement](index.html#sec5sec3).
  *
  * \par Usage Considerations
  * - BLOCK_THREADS must be a multiple of WARP_THREADS
@@ -518,7 +518,7 @@ enum BlockLoadAlgorithm
 /**
- * \brief The BlockLoad class provides [collective](index.html#sec0) data movement methods for loading a linear segment of items from memory into a [blocked arrangement](index.html#sec4sec3) across a CUDA thread block. ![](block_load_logo.png)
+ * \brief The BlockLoad class provides [collective](index.html#sec0) data movement methods for loading a linear segment of items from memory into a [blocked arrangement](index.html#sec5sec3) across a CUDA thread block. ![](block_load_logo.png)
  * \ingroup BlockModule
  * \ingroup UtilIo
  *
@@ -533,17 +533,17 @@ enum BlockLoadAlgorithm
  * to implement different cub::BlockLoadAlgorithm strategies. This facilitates different
  * performance policies for different architectures, data types, granularity sizes, etc.
  * - BlockLoad can be optionally specialized by different data movement strategies:
- * -# cub::BLOCK_LOAD_DIRECT. A [blocked arrangement](index.html#sec4sec3)
+ * -# cub::BLOCK_LOAD_DIRECT. A [blocked arrangement](index.html#sec5sec3)
  * of data is read directly from memory. [More...](\ref cub::BlockLoadAlgorithm)
- * -# cub::BLOCK_LOAD_VECTORIZE. A [blocked arrangement](index.html#sec4sec3)
+ * -# cub::BLOCK_LOAD_VECTORIZE. A [blocked arrangement](index.html#sec5sec3)
  * of data is read directly from memory using CUDA's built-in vectorized loads as a
  * coalescing optimization. [More...](\ref cub::BlockLoadAlgorithm)
- * -# cub::BLOCK_LOAD_TRANSPOSE. A [striped arrangement](index.html#sec4sec3)
+ * -# cub::BLOCK_LOAD_TRANSPOSE. A [striped arrangement](index.html#sec5sec3)
  * of data is read directly from memory and is then locally transposed into a
- * [blocked arrangement](index.html#sec4sec3). [More...](\ref cub::BlockLoadAlgorithm)
- * -# cub::BLOCK_LOAD_WARP_TRANSPOSE. A [warp-striped arrangement](index.html#sec4sec3)
+ * [blocked arrangement](index.html#sec5sec3). [More...](\ref cub::BlockLoadAlgorithm)
+ * -# cub::BLOCK_LOAD_WARP_TRANSPOSE. A [warp-striped arrangement](index.html#sec5sec3)
  * of data is read directly from memory and is then locally transposed into a
- * [blocked arrangement](index.html#sec4sec3). [More...](\ref cub::BlockLoadAlgorithm)
+ * [blocked arrangement](index.html#sec5sec3). [More...](\ref cub::BlockLoadAlgorithm)
  *
  * \par A Simple Example
  * \blockcollective{BlockLoad}
diff --git a/cub/block/block_radix_rank.cuh b/cub/block/block_radix_rank.cuh
index 479cf78a50..126c2d2739 100644
--- a/cub/block/block_radix_rank.cuh
+++ b/cub/block/block_radix_rank.cuh
@@ -1,6 +1,6 @@
 /******************************************************************************
  * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met:
@@ -63,7 +63,7 @@ namespace cub {
  *
  * \par Usage Considerations
  * - Keys must be in a form suitable for radix ranking (i.e., unsigned bits).
- * - Assumes a [blocked arrangement](index.html#sec4sec3) of elements across threads
+ * - Assumes a [blocked arrangement](index.html#sec5sec3) of elements across threads
  * - \smemreuse{BlockRadixRank::TempStorage}
  *
  * \par Performance Considerations
diff --git a/cub/block/block_radix_sort.cuh b/cub/block/block_radix_sort.cuh
index 9c71ea387e..9e43a7422a 100644
--- a/cub/block/block_radix_sort.cuh
+++ b/cub/block/block_radix_sort.cuh
@@ -1,6 +1,6 @@
 /******************************************************************************
  * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met:
@@ -81,7 +81,7 @@ namespace cub {
  * \blockcollective{BlockRadixSort}
  * \par
  * The code snippet below illustrates a sort of 512 integer keys that
- * are partitioned in a [blocked arrangement](index.html#sec4sec3) across 128 threads
+ * are partitioned in a [blocked arrangement](index.html#sec5sec3) across 128 threads
  * where each thread owns 4 consecutive items.
  * \par
  * \code
@@ -433,7 +433,7 @@ public:
 //@{
 /**
- * \brief Performs an ascending block-wide radix sort over a [blocked arrangement](index.html#sec4sec3) of keys.
+ * \brief Performs an ascending block-wide radix sort over a [blocked arrangement](index.html#sec5sec3) of keys.
  *
  * \par
  * - \granularity
@@ -441,7 +441,7 @@ public:
  *
  * \par
  * The code snippet below illustrates a sort of 512 integer keys that
- * are partitioned in a [blocked arrangement](index.html#sec4sec3) across 128 threads
+ * are partitioned in a [blocked arrangement](index.html#sec5sec3) across 128 threads
  * where each thread owns 4 consecutive keys.
  * \par
  * \code
@@ -481,7 +481,7 @@ public:
 /**
- * \brief Performs an ascending block-wide radix sort across a [blocked arrangement](index.html#sec4sec3) of keys and values.
+ * \brief Performs an ascending block-wide radix sort across a [blocked arrangement](index.html#sec5sec3) of keys and values.
  *
  * \par
  * - BlockRadixSort can only accommodate one associated tile of values. To "truck along"
@@ -494,7 +494,7 @@ public:
  *
  * \par
  * The code snippet below illustrates a sort of 512 integer keys and values that
- * are partitioned in a [blocked arrangement](index.html#sec4sec3) across 128 threads
+ * are partitioned in a [blocked arrangement](index.html#sec5sec3) across 128 threads
  * where each thread owns 4 consecutive pairs.
  * \par
  * \code
@@ -534,7 +534,7 @@ public:
 }
 /**
- * \brief Performs a descending block-wide radix sort over a [blocked arrangement](index.html#sec4sec3) of keys.
+ * \brief Performs a descending block-wide radix sort over a [blocked arrangement](index.html#sec5sec3) of keys.
  *
  * \par
  * - \granularity
@@ -542,7 +542,7 @@ public:
  *
  * \par
  * The code snippet below illustrates a sort of 512 integer keys that
- * are partitioned in a [blocked arrangement](index.html#sec4sec3) across 128 threads
+ * are partitioned in a [blocked arrangement](index.html#sec5sec3) across 128 threads
  * where each thread owns 4 consecutive keys.
  * \par
  * \code
@@ -582,7 +582,7 @@ public:
 /**
- * \brief Performs a descending block-wide radix sort across a [blocked arrangement](index.html#sec4sec3) of keys and values.
+ * \brief Performs a descending block-wide radix sort across a [blocked arrangement](index.html#sec5sec3) of keys and values.
  *
  * \par
  * - BlockRadixSort can only accommodate one associated tile of values. To "truck along"
@@ -595,7 +595,7 @@ public:
  *
  * \par
  * The code snippet below illustrates a sort of 512 integer keys and values that
- * are partitioned in a [blocked arrangement](index.html#sec4sec3) across 128 threads
+ * are partitioned in a [blocked arrangement](index.html#sec5sec3) across 128 threads
  * where each thread owns 4 consecutive pairs.
  * \par
  * \code
@@ -643,7 +643,7 @@ public:
 /**
- * \brief Performs an ascending radix sort across a [blocked arrangement](index.html#sec4sec3) of keys, leaving them in a [striped arrangement](index.html#sec4sec3).
+ * \brief Performs an ascending radix sort across a [blocked arrangement](index.html#sec5sec3) of keys, leaving them in a [striped arrangement](index.html#sec5sec3).
  *
  * \par
  * - \granularity
@@ -651,7 +651,7 @@ public:
  *
  * \par
  * The code snippet below illustrates a sort of 512 integer keys that
- * are initially partitioned in a [blocked arrangement](index.html#sec4sec3) across 128 threads
+ * are initially partitioned in a [blocked arrangement](index.html#sec5sec3) across 128 threads
  * where each thread owns 4 consecutive keys. The final partitioning is striped.
  * \par
  * \code
@@ -692,7 +692,7 @@ public:
 /**
- * \brief Performs an ascending radix sort across a [blocked arrangement](index.html#sec4sec3) of keys and values, leaving them in a [striped arrangement](index.html#sec4sec3).
+ * \brief Performs an ascending radix sort across a [blocked arrangement](index.html#sec5sec3) of keys and values, leaving them in a [striped arrangement](index.html#sec5sec3).
  *
  * \par
  * - BlockRadixSort can only accommodate one associated tile of values. To "truck along"
@@ -705,7 +705,7 @@ public:
  *
  * \par
  * The code snippet below illustrates a sort of 512 integer keys and values that
- * are initially partitioned in a [blocked arrangement](index.html#sec4sec3) across 128 threads
+ * are initially partitioned in a [blocked arrangement](index.html#sec5sec3) across 128 threads
  * where each thread owns 4 consecutive pairs. The final partitioning is striped.
  * \par
  * \code
@@ -746,7 +746,7 @@ public:
 /**
- * \brief Performs a descending radix sort across a [blocked arrangement](index.html#sec4sec3) of keys, leaving them in a [striped arrangement](index.html#sec4sec3).
+ * \brief Performs a descending radix sort across a [blocked arrangement](index.html#sec5sec3) of keys, leaving them in a [striped arrangement](index.html#sec5sec3).
  *
  * \par
  * - \granularity
@@ -754,7 +754,7 @@ public:
  *
  * \par
  * The code snippet below illustrates a sort of 512 integer keys that
- * are initially partitioned in a [blocked arrangement](index.html#sec4sec3) across 128 threads
+ * are initially partitioned in a [blocked arrangement](index.html#sec5sec3) across 128 threads
  * where each thread owns 4 consecutive keys. The final partitioning is striped.
  * \par
  * \code
@@ -795,7 +795,7 @@ public:
 /**
- * \brief Performs a descending radix sort across a [blocked arrangement](index.html#sec4sec3) of keys and values, leaving them in a [striped arrangement](index.html#sec4sec3).
+ * \brief Performs a descending radix sort across a [blocked arrangement](index.html#sec5sec3) of keys and values, leaving them in a [striped arrangement](index.html#sec5sec3).
  *
  * \par
  * - BlockRadixSort can only accommodate one associated tile of values. To "truck along"
@@ -808,7 +808,7 @@ public:
  *
  * \par
  * The code snippet below illustrates a sort of 512 integer keys and values that
- * are initially partitioned in a [blocked arrangement](index.html#sec4sec3) across 128 threads
+ * are initially partitioned in a [blocked arrangement](index.html#sec5sec3) across 128 threads
  * where each thread owns 4 consecutive pairs. The final partitioning is striped.
  * \par
  * \code
diff --git a/cub/block/block_raking_layout.cuh b/cub/block/block_raking_layout.cuh
index 364e28a46c..015355c05f 100644
--- a/cub/block/block_raking_layout.cuh
+++ b/cub/block/block_raking_layout.cuh
@@ -1,6 +1,6 @@
 /******************************************************************************
  * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met:
diff --git a/cub/block/block_reduce.cuh b/cub/block/block_reduce.cuh
index bbba3122e6..2e42c8bcdb 100644
--- a/cub/block/block_reduce.cuh
+++ b/cub/block/block_reduce.cuh
@@ -1,6 +1,6 @@
 /******************************************************************************
  * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met:
@@ -140,7 +140,7 @@ enum BlockReduceAlgorithm
  * \blockcollective{BlockReduce}
  * \par
  * The code snippet below illustrates a sum reduction of 512 integer items that
- * are partitioned in a [blocked arrangement](index.html#sec4sec3) across 128 threads
+ * are partitioned in a [blocked arrangement](index.html#sec5sec3) across 128 threads
  * where each thread owns 4 consecutive items.
  * \par
  * \code
@@ -326,7 +326,7 @@ public:
  *
  * \par
  * The code snippet below illustrates a max reduction of 512 integer items that
- * are partitioned in a [blocked arrangement](index.html#sec4sec3) across 128 threads
+ * are partitioned in a [blocked arrangement](index.html#sec5sec3) across 128 threads
  * where each thread owns 4 consecutive items.
  * \par
  * \code
@@ -473,7 +473,7 @@ public:
  *
  * \par
  * The code snippet below illustrates a sum reduction of 512 integer items that
- * are partitioned in a [blocked arrangement](index.html#sec4sec3) across 128 threads
+ * are partitioned in a [blocked arrangement](index.html#sec5sec3) across 128 threads
  * where each thread owns 4 consecutive items.
  * \par
  * \code
diff --git a/cub/block/block_scan.cuh.REMOVED.git-id b/cub/block/block_scan.cuh.REMOVED.git-id
index cf8c107f6c..57ae74226d 100644
--- a/cub/block/block_scan.cuh.REMOVED.git-id
+++ b/cub/block/block_scan.cuh.REMOVED.git-id
@@ -1 +1 @@
-5a5ba466472c50cedf0263532e0cd2ae58b4a550
\ No newline at end of file
+d65d116bc27cdf4993f3af3d3ec2e89b94605a86
\ No newline at end of file
diff --git a/cub/block/block_shift.cuh b/cub/block/block_shift.cuh
index 981c6ed6f8..69089e98bd 100644
--- a/cub/block/block_shift.cuh
+++ b/cub/block/block_shift.cuh
@@ -1,6 +1,6 @@
 /******************************************************************************
  * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met:
diff --git a/cub/block/block_store.cuh b/cub/block/block_store.cuh
index 70d6a4a437..7c3f9736b8 100644
--- a/cub/block/block_store.cuh
+++ b/cub/block/block_store.cuh
@@ -1,6 +1,6 @@
 /******************************************************************************
  * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met:
@@ -343,7 +343,7 @@ enum BlockStoreAlgorithm
 /**
  * \par Overview
  *
- * A [blocked arrangement](index.html#sec4sec3) of data is written
+ * A [blocked arrangement](index.html#sec5sec3) of data is written
  * directly to memory. The thread block writes items in a parallel "raking" fashion:
  * thread<sub><em>i</em></sub> writes the <em>i</em><sup>th</sup> segment of consecutive elements.
  *
@@ -356,7 +356,7 @@ enum BlockStoreAlgorithm
 /**
  * \par Overview
  *
- * A [blocked arrangement](index.html#sec4sec3) of data is written directly
+ * A [blocked arrangement](index.html#sec5sec3) of data is written directly
  * to memory using CUDA's built-in vectorized stores as a coalescing optimization.
  * The thread block writes items in a parallel "raking" fashion: thread<sub><em>i</em></sub> uses vector stores to
  * write the <em>i</em><sup>th</sup> segment of consecutive elements.
@@ -377,11 +377,11 @@ enum BlockStoreAlgorithm
 /**
  * \par Overview
- * A [blocked arrangement](index.html#sec4sec3) is locally
- * transposed into a [striped arrangement](index.html#sec4sec3)
+ * A [blocked arrangement](index.html#sec5sec3) is locally
+ * transposed into a [striped arrangement](index.html#sec5sec3)
  * which is then written to memory. More specifically, cub::BlockExchange
  * used to locally reorder the items into a
- * [striped arrangement](index.html#sec4sec3), after which the
+ * [striped arrangement](index.html#sec5sec3), after which the
  * thread block writes items in a parallel "strip-mining" fashion: consecutive
  * items owned by thread<sub><em>i</em></sub> are written to memory with
  * stride \p BLOCK_THREADS between them.
@@ -396,11 +396,11 @@ enum BlockStoreAlgorithm
 /**
  * \par Overview
- * A [blocked arrangement](index.html#sec4sec3) is locally
- * transposed into a [warp-striped arrangement](index.html#sec4sec3)
+ * A [blocked arrangement](index.html#sec5sec3) is locally
+ * transposed into a [warp-striped arrangement](index.html#sec5sec3)
  * which is then written to memory. More specifically, cub::BlockExchange used
  * to locally reorder the items into a
- * [warp-striped arrangement](index.html#sec4sec3), after which
+ * [warp-striped arrangement](index.html#sec5sec3), after which
  * each warp writes its own contiguous segment in a parallel "strip-mining" fashion:
  * consecutive items owned by lane<sub><em>i</em></sub> are written to memory
  * with stride \p WARP_THREADS between them.
@@ -416,7 +416,7 @@ enum BlockStoreAlgorithm
 /**
- * \brief The BlockStore class provides [collective](index.html#sec0) data movement methods for writing a [blocked arrangement](index.html#sec4sec3) of items partitioned across a CUDA thread block to a linear segment of memory. ![](block_store_logo.png)
+ * \brief The BlockStore class provides [collective](index.html#sec0) data movement methods for writing a [blocked arrangement](index.html#sec5sec3) of items partitioned across a CUDA thread block to a linear segment of memory. ![](block_store_logo.png)
  * \ingroup BlockModule
  * \ingroup UtilIo
  *
@@ -431,16 +431,16 @@ enum BlockStoreAlgorithm
  * to implement different cub::BlockStoreAlgorithm strategies. This facilitates different
  * performance policies for different architectures, data types, granularity sizes, etc.
  * - BlockStore can be optionally specialized by different data movement strategies:
- * -# cub::BLOCK_STORE_DIRECT. A [blocked arrangement](index.html#sec4sec3) of data is written
+ * -# cub::BLOCK_STORE_DIRECT. A [blocked arrangement](index.html#sec5sec3) of data is written
  * directly to memory. [More...](\ref cub::BlockStoreAlgorithm)
- * -# cub::BLOCK_STORE_VECTORIZE. A [blocked arrangement](index.html#sec4sec3)
+ * -# cub::BLOCK_STORE_VECTORIZE. A [blocked arrangement](index.html#sec5sec3)
  * of data is written directly to memory using CUDA's built-in vectorized stores as a
  * coalescing optimization. [More...](\ref cub::BlockStoreAlgorithm)
- * -# cub::BLOCK_STORE_TRANSPOSE. A [blocked arrangement](index.html#sec4sec3)
- * is locally transposed into a [striped arrangement](index.html#sec4sec3) which is
+ * -# cub::BLOCK_STORE_TRANSPOSE. A [blocked arrangement](index.html#sec5sec3)
+ * is locally transposed into a [striped arrangement](index.html#sec5sec3) which is
  * then written to memory. [More...](\ref cub::BlockStoreAlgorithm)
- * -# cub::BLOCK_STORE_WARP_TRANSPOSE. A [blocked arrangement](index.html#sec4sec3)
- * is locally transposed into a [warp-striped arrangement](index.html#sec4sec3) which is
+ * -# cub::BLOCK_STORE_WARP_TRANSPOSE. A [blocked arrangement](index.html#sec5sec3)
+ * is locally transposed into a [warp-striped arrangement](index.html#sec5sec3) which is
  * then written to memory. [More...](\ref cub::BlockStoreAlgorithm)
  *
  * \par A Simple Example
diff --git a/cub/block/specializations/block_histogram_atomic.cuh b/cub/block/specializations/block_histogram_atomic.cuh
index ecc980098c..3e18ddd6de 100644
--- a/cub/block/specializations/block_histogram_atomic.cuh
+++ b/cub/block/specializations/block_histogram_atomic.cuh
@@ -1,6 +1,6 @@
 /******************************************************************************
  * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met:
diff --git a/cub/block/specializations/block_histogram_sort.cuh b/cub/block/specializations/block_histogram_sort.cuh
index e81edec6c3..d380cdb9ba 100644
--- a/cub/block/specializations/block_histogram_sort.cuh
+++ b/cub/block/specializations/block_histogram_sort.cuh
@@ -1,6 +1,6 @@
 /******************************************************************************
  * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met:
diff --git a/cub/block/specializations/block_reduce_raking.cuh b/cub/block/specializations/block_reduce_raking.cuh
index d70736964c..50cca72d9e 100644
--- a/cub/block/specializations/block_reduce_raking.cuh
+++ b/cub/block/specializations/block_reduce_raking.cuh
@@ -1,6 +1,6 @@
 /******************************************************************************
  * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met:
diff --git a/cub/block/specializations/block_reduce_warp_reductions.cuh b/cub/block/specializations/block_reduce_warp_reductions.cuh
index 68180307d0..876e33b447 100644
--- a/cub/block/specializations/block_reduce_warp_reductions.cuh
+++ b/cub/block/specializations/block_reduce_warp_reductions.cuh
@@ -1,6 +1,6 @@
 /******************************************************************************
  * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met:
diff --git a/cub/block/specializations/block_scan_raking.cuh b/cub/block/specializations/block_scan_raking.cuh
index 5a46fbce5a..8870015a9d 100644
--- a/cub/block/specializations/block_scan_raking.cuh
+++ b/cub/block/specializations/block_scan_raking.cuh
@@ -1,6 +1,6 @@
 /******************************************************************************
  * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met:
diff --git a/cub/block/specializations/block_scan_warp_scans.cuh b/cub/block/specializations/block_scan_warp_scans.cuh
index b671ec3243..aed114c690 100644
--- a/cub/block/specializations/block_scan_warp_scans.cuh
+++ b/cub/block/specializations/block_scan_warp_scans.cuh
@@ -1,6 +1,6 @@
 /******************************************************************************
  * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met:
diff --git a/cub/cub.cuh b/cub/cub.cuh
index c92b80cbed..a0902ba854 100644
--- a/cub/cub.cuh
+++ b/cub/cub.cuh
@@ -1,6 +1,6 @@
 /******************************************************************************
  * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met:
diff --git a/cub/device/device_histogram.cuh b/cub/device/device_histogram.cuh
index 49b4b6c234..0044a958c1 100644
--- a/cub/device/device_histogram.cuh
+++ b/cub/device/device_histogram.cuh
@@ -1,7 +1,7 @@
 /******************************************************************************
  * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: @@ -585,7 +585,7 @@ struct DeviceHistogram * - \devicestorage * - \cdp * - * \par + * \par Snippet * The code snippet below illustrates the computation of a 8-bin histogram of * single-channel unsigned char samples. * \par @@ -673,7 +673,7 @@ struct DeviceHistogram * - \devicestorage * - \cdp * - * \par + * \par Snippet * The code snippet below illustrates the computation of a 8-bin histogram of * single-channel unsigned char samples. * \par @@ -761,7 +761,7 @@ struct DeviceHistogram * - \devicestorage * - \cdp * - * \par + * \par Snippet * The code snippet below illustrates the computation of a 8-bin histogram of * single-channel unsigned char samples. * \par @@ -857,7 +857,7 @@ struct DeviceHistogram * - \devicestorage * - \cdp * - * \par + * \par Snippet * The code snippet below illustrates the computation of three 256-bin histograms from * an input sequence of quad-channel (interleaved) unsigned char samples. * (E.g., RGB histograms from RGBA pixel samples.) @@ -956,7 +956,7 @@ struct DeviceHistogram * - \devicestorage * - \cdp * - * \par + * \par Snippet * The code snippet below illustrates the computation of three 256-bin histograms from * an input sequence of quad-channel (interleaved) unsigned char samples. * (E.g., RGB histograms from RGBA pixel samples.) @@ -1054,7 +1054,7 @@ struct DeviceHistogram * - \devicestorage * - \cdp * - * \par + * \par Snippet * The code snippet below illustrates the computation of three 256-bin histograms from * an input sequence of quad-channel (interleaved) unsigned char samples. * (E.g., RGB histograms from RGBA pixel samples.) 
diff --git a/cub/device/device_partition.cuh b/cub/device/device_partition.cuh index e43b570199..bc043acf2d 100644 --- a/cub/device/device_partition.cuh +++ b/cub/device/device_partition.cuh @@ -1,7 +1,7 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: @@ -63,6 +63,15 @@ namespace cub { * \cdp_class{DevicePartition} * * \par Performance + * \linear_performance{partition} + * + * \par + * The following chart illustrates DevicePartition::If + * performance across different CUDA architectures for \p int32 items, + * where 50% of the items are randomly selected for the first partition. + * \plots_below + * + * \image html partition_if_int32_50_percent.png * */ struct DevicePartition @@ -78,7 +87,7 @@ struct DevicePartition * - \devicestorage * - \cdp * - * \par + * \par Snippet * The code snippet below illustrates the compaction of items selected from an \p int device vector. * \par * \code @@ -159,19 +168,37 @@ struct DevicePartition * - \devicestorage * - \cdp * + * \par Performance + * The following charts illustrate saturated partition-if performance across different + * CUDA architectures for \p int32 and \p int64 items, respectively. Items are + * selected for the first partition with 50% probability. 
+ * + * \image html partition_if_int32_50_percent.png + * \image html partition_if_int64_50_percent.png + * * \par + * The following charts are similar, but 5% selection probability for the first partition: + * + * \image html partition_if_int32_5_percent.png + * \image html partition_if_int64_5_percent.png + * + * \par Snippet * The code snippet below illustrates the compaction of items selected from an \p int device vector. * \par * \code * #include // or equivalently * - * // Functor for selecting values that are multiples of three - * struct IsTriple + * // Functor type for selecting values less than some criteria + * struct LessThan * { - * template + * int compare; + * * __host__ __device__ __forceinline__ - * bool operator()(const T &a) const { - * return (a % 3 == 0); + * LessThan(int compare) : compare(compare) {} + * + * __host__ __device__ __forceinline__ + * bool operator()(const int &a) const { + * return (a < compare); * } * }; * @@ -180,7 +207,7 @@ struct DevicePartition * int *d_in; // e.g., [0, 2, 3, 9, 5, 2, 81, 8] * int *d_out; // e.g., [ , , , , , , , ] * int *d_num_selected; // e.g., [ ] - * IsTriple select_op; + * LessThan select_op(7); * ... * * // Determine temporary device storage requirements @@ -194,8 +221,8 @@ struct DevicePartition * // Run selection * cub::DeviceSelect::If(d_temp_storage, temp_storage_bytes, d_in, d_out, d_num_selected, num_items, select_op); * - * // d_out <-- [0, 3, 9, 81, 8, 2, 5, 2] - * // d_num_selected <-- [4] + * // d_out <-- [0, 2, 3, 5, 2, 8, 81, 9] + * // d_num_selected <-- [5] * * \endcode * diff --git a/cub/device/device_radix_sort.cuh b/cub/device/device_radix_sort.cuh index 56ba1df6db..bdfcab91c0 100644 --- a/cub/device/device_radix_sort.cuh +++ b/cub/device/device_radix_sort.cuh @@ -1,7 +1,7 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. 
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: @@ -956,8 +956,11 @@ struct DeviceRadixSortDispatch * \cdp_class{DeviceRadixSort} * * \par Performance + * \linear_performance{radix sort} The following chart illustrates DeviceRadixSort::SortKeys + * performance across different CUDA architectures for uniform-random \p uint32 keys. + * \plots_below * - * \image html lsd_sort_perf.png + * \image html lsb_radix_sort_int32_keys.png * */ struct DeviceRadixSort @@ -973,7 +976,15 @@ struct DeviceRadixSort * - \devicestorage * - \cdp * - * \par + * \par Performance + * The following charts illustrate saturated sorting performance across different + * CUDA architectures for uniform-random uint32,uint32 and + * uint64,uint64 pairs, respectively. + * + * \image html lsb_radix_sort_int32_pairs.png + * \image html lsb_radix_sort_int64_pairs.png + * + * \par Snippet * The code snippet below illustrates the sorting of a device vector of \p int keys * with associated vector of \p int values. * \par @@ -1053,7 +1064,10 @@ struct DeviceRadixSort * - \devicestorage * - \cdp * - * \par + * \par Performance + * Performance is similar to DeviceRadixSort::SortPairs. + * + * \par Snippet * The code snippet below illustrates the sorting of a device vector of \p int keys * with associated vector of \p int values. * \par @@ -1133,7 +1147,14 @@ struct DeviceRadixSort * - \devicestorage * - \cdp * - * \par + * \par Performance + * The following charts illustrate saturated sorting performance across different + * CUDA architectures for uniform-random \p uint32 and \p uint64 keys, respectively. + * + * \image html lsb_radix_sort_int32_keys.png + * \image html lsb_radix_sort_int64_keys.png + * + * \par Snippet * The code snippet below illustrates the sorting of a device vector of \p int keys. 
* \par * \code @@ -1207,7 +1228,10 @@ struct DeviceRadixSort * - \devicestorage * - \cdp * - * \par + * \par Performance + * Performance is similar to DeviceRadixSort::SortKeys. + * + * \par Snippet * The code snippet below illustrates the sorting of a device vector of \p int keys. * \par * \code diff --git a/cub/device/device_reduce.cuh b/cub/device/device_reduce.cuh index 116da91f75..029d5d13c5 100644 --- a/cub/device/device_reduce.cuh +++ b/cub/device/device_reduce.cuh @@ -1,7 +1,7 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: @@ -752,8 +752,31 @@ struct DeviceReduceDispatch * \cdp_class{DeviceReduce} * * \par Performance + * \linear_performance{reduction, reduce-by-key, and run-length encode} + * + * \par + * The following chart illustrates DeviceReduce::Sum + * performance across different CUDA architectures for \p int32 keys. + * + * \image html reduce_int32.png + * + * \par + * The following chart illustrates DeviceReduce::ReduceByKey (summation) + * performance across different CUDA architectures for \p fp32 + * values. Segments are identified by \p int32 keys, and have lengths uniformly sampled from [1,1000]. + * + * \image html reduce_by_key_fp32_len_500.png + * + * \par + * The following chart illustrates DeviceReduce::RunLengthEncode performance across + * different CUDA architectures for \p int32 items. + * Segments have lengths uniformly sampled from [1,1000]. 
+ * + * \image html rle_int32_len_500.png + * + * \par + * \plots_below * - * \image html reduction_perf.png * */ struct DeviceReduce @@ -766,7 +789,10 @@ struct DeviceReduce * - \devicestorage * - \cdp * - * \par + * \par Performance + * Performance is typically similar to DeviceReduce::Sum. + * + * \par Snippet * The code snippet below illustrates a custom min reduction of a device vector of \p int items. * \par * \code @@ -849,7 +875,14 @@ struct DeviceReduce * - \devicestorage * - \cdp * - * \par + * \par Performance + * The following charts illustrate saturated reduction (sum) performance across different + * CUDA architectures for \p int32 and \p int64 items, respectively. + * + * \image html reduce_int32.png + * \image html reduce_int64.png + * + * \par Snippet * The code snippet below illustrates the sum reduction of a device vector of \p int items. * \par * \code @@ -918,7 +951,10 @@ struct DeviceReduce * - \devicestorage * - \cdp * - * \par + * \par Performance + * Performance is typically similar to DeviceReduce::Sum. + * + * \par Snippet * The code snippet below illustrates the min-reduction of a device vector of \p int items. * \par * \code @@ -992,7 +1028,10 @@ struct DeviceReduce * - \devicestorage * - \cdp * - * \par + * \par Performance + * Performance is typically similar to DeviceReduce::Sum. + * + * \par Snippet * The code snippet below illustrates the argmin-reduction of a device vector of \p int items. * \par * \code @@ -1065,7 +1104,10 @@ struct DeviceReduce * - \devicestorage * - \cdp * - * \par + * \par Performance + * Performance is typically similar to DeviceReduce::Sum. + * + * \par Snippet * The code snippet below illustrates the max-reduction of a device vector of \p int items. * \par * \code @@ -1139,7 +1181,10 @@ struct DeviceReduce * - \devicestorage * - \cdp * - * \par + * \par Performance + * Performance is typically similar to DeviceReduce::Sum. 
+ * + * \par Snippet * The code snippet below illustrates the argmax-reduction of a device vector of \p int items. * \par * \code @@ -1220,7 +1265,21 @@ struct DeviceReduce * - \devicestorage * - \cdp * + * \par Performance + * The following chart illustrates reduction-by-key (sum) performance across + * different CUDA architectures for \p fp32 and \p fp64 values, respectively. Segments + * are identified by \p int32 keys, and have lengths uniformly sampled from [1,1000]. + * + * \image html reduce_by_key_fp32_len_500.png + * \image html reduce_by_key_fp64_len_500.png + * * \par + * The following charts are similar, but with segment lengths uniformly sampled from [1,10]: + * + * \image html reduce_by_key_fp32_len_5.png + * \image html reduce_by_key_fp64_len_5.png + * + * \par Snippet * The code snippet below illustrates the segmented reduction of \p int values grouped * by runs of associated \p int keys. * \par @@ -1329,7 +1388,21 @@ struct DeviceReduce * - \devicestorage * - \cdp * + * \par Performance + * The following charts illustrate saturated encode performance across different + * CUDA architectures for \p int32 and \p int64 items, respectively. Segments have + * lengths uniformly sampled from [1,1000]. + * + * \image html rle_int32_len_500.png + * \image html rle_int64_len_500.png + * * \par + * The following charts are similar, but with segment lengths uniformly sampled from [1,10]: + * + * \image html rle_int32_len_5.png + * \image html rle_int64_len_5.png + * + * \par Snippet * The code snippet below illustrates the run-length encoding of a sequence of \p int values. * \par * \code diff --git a/cub/device/device_reduce_by_key.cuh b/cub/device/device_reduce_by_key.cuh index 8face1b961..ee859211e1 100644 --- a/cub/device/device_reduce_by_key.cuh +++ b/cub/device/device_reduce_by_key.cuh @@ -1,7 +1,7 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. 
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/device/device_scan.cuh b/cub/device/device_scan.cuh index 83b7d005fd..208e86777a 100644 --- a/cub/device/device_scan.cuh +++ b/cub/device/device_scan.cuh @@ -1,7 +1,7 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: @@ -581,8 +581,14 @@ struct DeviceScanDispatch * \cdp_class{DeviceScan} * * \par Performance + * \linear_performance{prefix scan} * - * \image html scan_perf.png + * \par + * The following chart illustrates DeviceScan::ExclusiveSum + * performance across different CUDA architectures for \p int32 keys. + * \plots_below + * + * \image html scan_int32.png * */ struct DeviceScan @@ -600,7 +606,14 @@ struct DeviceScan * - \devicestorage * - \cdp * - * \par + * \par Performance + * The following charts illustrate saturated exclusive sum performance across different + * CUDA architectures for \p int32 and \p int64 items, respectively. + * + * \image html scan_int32.png + * \image html scan_int64.png + * + * \par Snippet * The code snippet below illustrates the exclusive prefix sum of an \p int device vector. * \par * \code @@ -670,7 +683,10 @@ struct DeviceScan * - \devicestorage * - \cdp * - * \par + * \par Performance + * Performance is typically similar to DeviceScan::ExclusiveSum. 
+ * + * \par Snippet * The code snippet below illustrates the exclusive prefix min-scan of an \p int device vector * \par * \code @@ -761,7 +777,10 @@ struct DeviceScan * - \devicestorage * - \cdp * - * \par + * \par Performance + * Performance is typically similar to DeviceScan::ExclusiveSum. + * + * \par Snippet * The code snippet below illustrates the inclusive prefix sum of an \p int device vector. * \par * \code @@ -828,7 +847,10 @@ struct DeviceScan * - \devicestorage * - \cdp * - * \par + * \par Performance + * Performance is typically similar to DeviceScan::ExclusiveSum. + * + * \par Snippet * The code snippet below illustrates the inclusive prefix min-scan of an \p int device vector. * \par * \code diff --git a/cub/device/device_select.cuh b/cub/device/device_select.cuh index edf5b647e2..168a386588 100644 --- a/cub/device/device_select.cuh +++ b/cub/device/device_select.cuh @@ -1,7 +1,7 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: @@ -566,17 +566,35 @@ struct DeviceSelectDispatch *****************************************************************************/ /** - * \brief DeviceSelect provides device-wide, parallel operations for selecting items from sequences of data items residing within global memory. ![](select_logo.png) + * \brief DeviceSelect provides device-wide, parallel operations for compacting selected items from sequences of data items residing within global memory. 
![](select_logo.png) * \ingroup DeviceModule * * \par Overview * These operations apply a selection criterion to selectively copy - * items from a specified input sequence to a corresponding output sequence. + * items from a specified input sequence to a compact output sequence. * * \par Usage Considerations * \cdp_class{DeviceSelect} * * \par Performance + * \linear_performance{select-flagged, select-if, and select-unique} + * + * \par + * The following chart illustrates DeviceSelect::If + * performance across different CUDA architectures for \p int32 items, + * where 50% of the items are randomly selected. + * + * \image html select_if_int32_50_percent.png + * + * \par + * The following chart illustrates DeviceSelect::Unique + * performance across different CUDA architectures for \p int32 items + * where segments have lengths uniformly sampled from [1,1000]. + * + * \image html select_unique_int32_len_500.png + * + * \par + * \plots_below * */ struct DeviceSelect @@ -590,7 +608,7 @@ struct DeviceSelect * - \devicestorage * - \cdp * - * \par + * \par Snippet * The code snippet below illustrates the compaction of items selected from an \p int device vector. * \par * \code @@ -669,19 +687,37 @@ struct DeviceSelect * - \devicestorage * - \cdp * + * \par Performance + * The following charts illustrate saturated select-if performance across different + * CUDA architectures for \p int32 and \p int64 items, respectively. Items are + * selected with 50% probability. + * + * \image html select_if_int32_50_percent.png + * \image html select_if_int64_50_percent.png + * * \par + * The following charts are similar, but 5% selection probability: + * + * \image html select_if_int32_5_percent.png + * \image html select_if_int64_5_percent.png + * + * \par Snippet * The code snippet below illustrates the compaction of items selected from an \p int device vector. 
* \par * \code * #include // or equivalently * - * // Functor for selecting values that are multiples of three - * struct IsTriple + * // Functor type for selecting values less than some criteria + * struct LessThan * { - * template + * int compare; + * + * __host__ __device__ __forceinline__ + * LessThan(int compare) : compare(compare) {} + * * __host__ __device__ __forceinline__ - * bool operator()(const T &a) const { - * return (a % 3 == 0); + * bool operator()(const int &a) const { + * return (a < compare); * } * }; * @@ -690,7 +726,7 @@ struct DeviceSelect * int *d_in; // e.g., [0, 2, 3, 9, 5, 2, 81, 8] * int *d_out; // e.g., [ , , , , , , , ] * int *d_num_selected; // e.g., [ ] - * IsTriple select_op; + * LessThan select_op(7); * ... * * // Determine temporary device storage requirements @@ -704,8 +740,8 @@ struct DeviceSelect * // Run selection * cub::DeviceSelect::If(d_temp_storage, temp_storage_bytes, d_in, d_out, d_num_selected, num_items, select_op); * - * // d_out <-- [0, 3, 9, 81] - * // d_num_selected <-- [4] + * // d_out <-- [0, 2, 3, 5, 2] + * // d_num_selected <-- [5] * * \endcode * @@ -759,7 +795,21 @@ struct DeviceSelect * - \devicestorage * - \cdp * + * \par Performance + * The following charts illustrate saturated select-unique performance across different + * CUDA architectures for \p int32 and \p int64 items, respectively. Segments have + * lengths uniformly sampled from [1,1000]. + * + * \image html select_unique_int32_len_500.png + * \image html select_unique_int64_len_500.png + * * \par + * The following charts are similar, but with segment lengths uniformly sampled from [1,10]: + * + * \image html select_unique_int32_len_5.png + * \image html select_unique_int64_len_5.png + * + * \par Snippet * The code snippet below illustrates the compaction of items selected from an \p int device vector. 
* \par * \code diff --git a/cub/device/region/block_histo_region.cuh b/cub/device/region/block_histo_region.cuh index 08bd907ddf..64e3b20e4f 100644 --- a/cub/device/region/block_histo_region.cuh +++ b/cub/device/region/block_histo_region.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/device/region/block_radix_sort_downsweep_region.cuh b/cub/device/region/block_radix_sort_downsweep_region.cuh index 0585581090..89c4c69c0c 100644 --- a/cub/device/region/block_radix_sort_downsweep_region.cuh +++ b/cub/device/region/block_radix_sort_downsweep_region.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/device/region/block_radix_sort_upsweep_region.cuh b/cub/device/region/block_radix_sort_upsweep_region.cuh index 20ccae2e70..7f1e92ed41 100644 --- a/cub/device/region/block_radix_sort_upsweep_region.cuh +++ b/cub/device/region/block_radix_sort_upsweep_region.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. 
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/device/region/block_reduce_by_key_region.cuh b/cub/device/region/block_reduce_by_key_region.cuh index e059afa841..247db193e0 100644 --- a/cub/device/region/block_reduce_by_key_region.cuh +++ b/cub/device/region/block_reduce_by_key_region.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/device/region/block_reduce_region.cuh b/cub/device/region/block_reduce_region.cuh index e2a976190c..a5cf89d38a 100644 --- a/cub/device/region/block_reduce_region.cuh +++ b/cub/device/region/block_reduce_region.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/device/region/block_scan_region.cuh b/cub/device/region/block_scan_region.cuh index a44741e25a..308f7c2920 100644 --- a/cub/device/region/block_scan_region.cuh +++ b/cub/device/region/block_scan_region.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. 
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/device/region/block_select_region.cuh b/cub/device/region/block_select_region.cuh index f4aa0e25c9..6bbf430a96 100644 --- a/cub/device/region/block_select_region.cuh +++ b/cub/device/region/block_select_region.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/device/region/device_scan_types.cuh b/cub/device/region/device_scan_types.cuh index db058dd894..95889f6001 100644 --- a/cub/device/region/device_scan_types.cuh +++ b/cub/device/region/device_scan_types.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. 
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/device/region/specializations/block_histo_region_gatomic.cuh b/cub/device/region/specializations/block_histo_region_gatomic.cuh index 84eeaa0274..fdffb872bf 100644 --- a/cub/device/region/specializations/block_histo_region_gatomic.cuh +++ b/cub/device/region/specializations/block_histo_region_gatomic.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/device/region/specializations/block_histo_region_satomic.cuh b/cub/device/region/specializations/block_histo_region_satomic.cuh index 1f5fdfe1a7..d496e7dcba 100644 --- a/cub/device/region/specializations/block_histo_region_satomic.cuh +++ b/cub/device/region/specializations/block_histo_region_satomic.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. 
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/device/region/specializations/block_histo_region_sort.cuh b/cub/device/region/specializations/block_histo_region_sort.cuh index 11a9530833..ab8bd03424 100644 --- a/cub/device/region/specializations/block_histo_region_sort.cuh +++ b/cub/device/region/specializations/block_histo_region_sort.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/grid/grid_barrier.cuh b/cub/grid/grid_barrier.cuh index ebdc4b552a..eab5b518ec 100644 --- a/cub/grid/grid_barrier.cuh +++ b/cub/grid/grid_barrier.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/grid/grid_even_share.cuh b/cub/grid/grid_even_share.cuh index 48b761dcf1..e331d0d86a 100644 --- a/cub/grid/grid_even_share.cuh +++ b/cub/grid/grid_even_share.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. 
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/grid/grid_mapping.cuh b/cub/grid/grid_mapping.cuh index 419f9ac0e0..ff6679b9b7 100644 --- a/cub/grid/grid_mapping.cuh +++ b/cub/grid/grid_mapping.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/grid/grid_queue.cuh b/cub/grid/grid_queue.cuh index 5189cc4010..eab13fc4ae 100644 --- a/cub/grid/grid_queue.cuh +++ b/cub/grid/grid_queue.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/host/spinlock.cuh b/cub/host/spinlock.cuh index 5621b6f1a3..6e4b47c7dd 100644 --- a/cub/host/spinlock.cuh +++ b/cub/host/spinlock.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. 
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/iterator/arg_index_input_iterator.cuh b/cub/iterator/arg_index_input_iterator.cuh index 0371894717..e2d095a4c9 100644 --- a/cub/iterator/arg_index_input_iterator.cuh +++ b/cub/iterator/arg_index_input_iterator.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/iterator/cache_modified_input_iterator.cuh b/cub/iterator/cache_modified_input_iterator.cuh index 510658c42c..b9ce48a0ff 100644 --- a/cub/iterator/cache_modified_input_iterator.cuh +++ b/cub/iterator/cache_modified_input_iterator.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/iterator/cache_modified_output_iterator.cuh b/cub/iterator/cache_modified_output_iterator.cuh index b24247ddc5..e91b2aef98 100644 --- a/cub/iterator/cache_modified_output_iterator.cuh +++ b/cub/iterator/cache_modified_output_iterator.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. 
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/iterator/constant_input_iterator.cuh b/cub/iterator/constant_input_iterator.cuh index 5a923bf548..f2c247fd5c 100644 --- a/cub/iterator/constant_input_iterator.cuh +++ b/cub/iterator/constant_input_iterator.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/iterator/counting_input_iterator.cuh b/cub/iterator/counting_input_iterator.cuh index 24b7573b38..02791f4113 100644 --- a/cub/iterator/counting_input_iterator.cuh +++ b/cub/iterator/counting_input_iterator.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/iterator/tex_obj_input_iterator.cuh b/cub/iterator/tex_obj_input_iterator.cuh index 466ec0e47f..e448a46732 100644 --- a/cub/iterator/tex_obj_input_iterator.cuh +++ b/cub/iterator/tex_obj_input_iterator.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. 
All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/iterator/tex_ref_input_iterator.cuh b/cub/iterator/tex_ref_input_iterator.cuh index 5e1055ff6c..f3f6dca317 100644 --- a/cub/iterator/tex_ref_input_iterator.cuh +++ b/cub/iterator/tex_ref_input_iterator.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/iterator/transform_input_iterator.cuh b/cub/iterator/transform_input_iterator.cuh index feb020c301..1e86d4507b 100644 --- a/cub/iterator/transform_input_iterator.cuh +++ b/cub/iterator/transform_input_iterator.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/thread/thread_load.cuh b/cub/thread/thread_load.cuh index 5d838fc2e1..3142524681 100644 --- a/cub/thread/thread_load.cuh +++ b/cub/thread/thread_load.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. 
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/thread/thread_operators.cuh b/cub/thread/thread_operators.cuh index 3e97611505..75c962731a 100644 --- a/cub/thread/thread_operators.cuh +++ b/cub/thread/thread_operators.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/thread/thread_reduce.cuh b/cub/thread/thread_reduce.cuh index e1188523ff..29bc8ce0c0 100644 --- a/cub/thread/thread_reduce.cuh +++ b/cub/thread/thread_reduce.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/thread/thread_scan.cuh b/cub/thread/thread_scan.cuh index 0fa4d746c7..0eefed95a5 100644 --- a/cub/thread/thread_scan.cuh +++ b/cub/thread/thread_scan.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. 
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/thread/thread_store.cuh b/cub/thread/thread_store.cuh index 5463a912c5..e9e2b86edb 100644 --- a/cub/thread/thread_store.cuh +++ b/cub/thread/thread_store.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/util_allocator.cuh b/cub/util_allocator.cuh index c154acf97e..919d048315 100644 --- a/cub/util_allocator.cuh +++ b/cub/util_allocator.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/util_arch.cuh b/cub/util_arch.cuh index 53bbc38f96..574a6a5b5f 100644 --- a/cub/util_arch.cuh +++ b/cub/util_arch.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. 
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/util_debug.cuh b/cub/util_debug.cuh index 0f936e1cb8..b4eca8a582 100644 --- a/cub/util_debug.cuh +++ b/cub/util_debug.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/util_device.cuh b/cub/util_device.cuh index a021dc620b..361297ae57 100644 --- a/cub/util_device.cuh +++ b/cub/util_device.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/util_macro.cuh b/cub/util_macro.cuh index acce3c8404..a94031a4cb 100644 --- a/cub/util_macro.cuh +++ b/cub/util_macro.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. 
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/util_namespace.cuh b/cub/util_namespace.cuh index 869ecc613a..39603644ec 100644 --- a/cub/util_namespace.cuh +++ b/cub/util_namespace.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/util_ptx.cuh b/cub/util_ptx.cuh index ad25615f99..59ef23b4fd 100644 --- a/cub/util_ptx.cuh +++ b/cub/util_ptx.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/util_type.cuh b/cub/util_type.cuh index a57b41e246..225f0deb9e 100644 --- a/cub/util_type.cuh +++ b/cub/util_type.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. 
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/warp/specializations/warp_reduce_shfl.cuh b/cub/warp/specializations/warp_reduce_shfl.cuh index 2c967d7de3..33d74aecf4 100644 --- a/cub/warp/specializations/warp_reduce_shfl.cuh +++ b/cub/warp/specializations/warp_reduce_shfl.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/warp/specializations/warp_reduce_smem.cuh b/cub/warp/specializations/warp_reduce_smem.cuh index b8cffbf74f..4b210fd162 100644 --- a/cub/warp/specializations/warp_reduce_smem.cuh +++ b/cub/warp/specializations/warp_reduce_smem.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/warp/specializations/warp_scan_shfl.cuh b/cub/warp/specializations/warp_scan_shfl.cuh index d15ceeb67a..c25d195a0b 100644 --- a/cub/warp/specializations/warp_scan_shfl.cuh +++ b/cub/warp/specializations/warp_scan_shfl.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. 
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/warp/specializations/warp_scan_smem.cuh b/cub/warp/specializations/warp_scan_smem.cuh index 9617b10b7f..4e292d8a52 100644 --- a/cub/warp/specializations/warp_scan_smem.cuh +++ b/cub/warp/specializations/warp_scan_smem.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/warp/warp_reduce.cuh b/cub/warp/warp_reduce.cuh index aff53a0855..bae40da28c 100644 --- a/cub/warp/warp_reduce.cuh +++ b/cub/warp/warp_reduce.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/cub/warp/warp_scan.cuh b/cub/warp/warp_scan.cuh index 9dca865154..4c79682b85 100644 --- a/cub/warp/warp_scan.cuh +++ b/cub/warp/warp_scan.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. 
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/docs/Doxyfile b/docs/Doxyfile index e17ded3a7c..55b41d73fd 100644 --- a/docs/Doxyfile +++ b/docs/Doxyfile @@ -212,6 +212,9 @@ ALIASES += blocked="Assumes a [blocked arrangement](inde ALIASES += striped="Assumes a [striped arrangement](index.html#sec4sec3) of elements across threads, where the items owned by each thread have a ranked distance of \p BLOCK_THREADS between them." ALIASES += warpstriped="Assumes a warp-striped arrangement of elements across threads, where each warp owns a segment of (\p WARP_THREADS * \p ITEMS_PER_THREAD) consecutively ranked items, and the items owned by each thread have a ranked distance of \p WARP_THREADS between them." +ALIASES += linear_performance{1}="The work-complexity of \1 as a function of input size is linear, resulting in performance throughput that plateaus with problem sizes large enough to saturate the GPU." +ALIASES += plots_below="Performance plots for other scenarios can be found in the detailed method descriptions below." + # This tag can be used to specify a number of word-keyword mappings (TCL only). # A mapping has the form "name=value". For example adding # "class=itcl::class" will allow you to use the command class in the @@ -784,6 +787,7 @@ EXAMPLE_RECURSIVE = NO # the \image command). IMAGE_PATH = images +IMAGE_PATH += images\performance # The INPUT_FILTER tag can be used to specify a program that doxygen should # invoke to filter for each input file. 
Doxygen will invoke the filter program diff --git a/docs/device_perf.xls.REMOVED.git-id b/docs/device_perf.xls.REMOVED.git-id index 26d37d283c..1bbe64b5b8 100644 --- a/docs/device_perf.xls.REMOVED.git-id +++ b/docs/device_perf.xls.REMOVED.git-id @@ -1 +1 @@ -2a322a3b9d7726282a4d8bf675a3ecb0e191b7c1 \ No newline at end of file +e800a0bf952280fd94e932b0bfa1a40cd0420446 \ No newline at end of file diff --git a/docs/download_cub.html b/docs/download_cub.html index 5748bb5008..fd5eef826d 100644 --- a/docs/download_cub.html +++ b/docs/download_cub.html @@ -37,14 +37,14 @@
If your download doesn't start in 3s:

- -Download CUB! + +Download CUB!
diff --git a/docs/images/cub_overview.png b/docs/images/cub_overview.png index a8d8c34ec7..d75e1f7b4f 100644 Binary files a/docs/images/cub_overview.png and b/docs/images/cub_overview.png differ diff --git a/docs/images/performance/lsb_radix_sort_int32_keys.png b/docs/images/performance/lsb_radix_sort_int32_keys.png index 631d04276f..a8ce4aec70 100644 Binary files a/docs/images/performance/lsb_radix_sort_int32_keys.png and b/docs/images/performance/lsb_radix_sort_int32_keys.png differ diff --git a/docs/images/performance/lsb_radix_sort_int32_pairs.png b/docs/images/performance/lsb_radix_sort_int32_pairs.png index 226a2f5494..f06f1bfd7e 100644 Binary files a/docs/images/performance/lsb_radix_sort_int32_pairs.png and b/docs/images/performance/lsb_radix_sort_int32_pairs.png differ diff --git a/docs/images/performance/lsb_radix_sort_int64_keys.png b/docs/images/performance/lsb_radix_sort_int64_keys.png index 27c23f3316..30dd9864a4 100644 Binary files a/docs/images/performance/lsb_radix_sort_int64_keys.png and b/docs/images/performance/lsb_radix_sort_int64_keys.png differ diff --git a/docs/images/performance/lsb_radix_sort_int64_pairs.png b/docs/images/performance/lsb_radix_sort_int64_pairs.png index bafcd85d42..0e57653b1d 100644 Binary files a/docs/images/performance/lsb_radix_sort_int64_pairs.png and b/docs/images/performance/lsb_radix_sort_int64_pairs.png differ diff --git a/docs/images/performance/partition_if_int32_50_percent.png b/docs/images/performance/partition_if_int32_50_percent.png index 08400d4265..319d9f28e4 100644 Binary files a/docs/images/performance/partition_if_int32_50_percent.png and b/docs/images/performance/partition_if_int32_50_percent.png differ diff --git a/docs/images/performance/partition_if_int32_5_percent.png b/docs/images/performance/partition_if_int32_5_percent.png index 5d02739006..d9d4ffbea7 100644 Binary files a/docs/images/performance/partition_if_int32_5_percent.png and b/docs/images/performance/partition_if_int32_5_percent.png 
differ diff --git a/docs/images/performance/partition_if_int64_50_percent.png b/docs/images/performance/partition_if_int64_50_percent.png index 24f7a1cd79..34332a1a52 100644 Binary files a/docs/images/performance/partition_if_int64_50_percent.png and b/docs/images/performance/partition_if_int64_50_percent.png differ diff --git a/docs/images/performance/partition_if_int64_5_percent.png b/docs/images/performance/partition_if_int64_5_percent.png index b7a5e33302..3cc68bc2cc 100644 Binary files a/docs/images/performance/partition_if_int64_5_percent.png and b/docs/images/performance/partition_if_int64_5_percent.png differ diff --git a/docs/images/performance/reduce_by_key_fp32_len_5.png b/docs/images/performance/reduce_by_key_fp32_len_5.png index 908598c696..2a3407120a 100644 Binary files a/docs/images/performance/reduce_by_key_fp32_len_5.png and b/docs/images/performance/reduce_by_key_fp32_len_5.png differ diff --git a/docs/images/performance/reduce_by_key_fp32_len_500.png b/docs/images/performance/reduce_by_key_fp32_len_500.png index 56f7860573..5043a840ee 100644 Binary files a/docs/images/performance/reduce_by_key_fp32_len_500.png and b/docs/images/performance/reduce_by_key_fp32_len_500.png differ diff --git a/docs/images/performance/reduce_by_key_fp64_len_5.png b/docs/images/performance/reduce_by_key_fp64_len_5.png index 0b0aa38d66..fdfb0eb734 100644 Binary files a/docs/images/performance/reduce_by_key_fp64_len_5.png and b/docs/images/performance/reduce_by_key_fp64_len_5.png differ diff --git a/docs/images/performance/reduce_by_key_fp64_len_500.png b/docs/images/performance/reduce_by_key_fp64_len_500.png index 7c600f3338..69f64229d6 100644 Binary files a/docs/images/performance/reduce_by_key_fp64_len_500.png and b/docs/images/performance/reduce_by_key_fp64_len_500.png differ diff --git a/docs/images/performance/reduce_int32.png b/docs/images/performance/reduce_int32.png index 6ead630142..db15ae64bb 100644 Binary files a/docs/images/performance/reduce_int32.png and 
b/docs/images/performance/reduce_int32.png differ diff --git a/docs/images/performance/reduce_int64.png b/docs/images/performance/reduce_int64.png index d43cc93fe5..bd4c9fd60a 100644 Binary files a/docs/images/performance/reduce_int64.png and b/docs/images/performance/reduce_int64.png differ diff --git a/docs/images/performance/rle_int32_len_5.png b/docs/images/performance/rle_int32_len_5.png index 59e5a8c6b9..dbadb5e1dd 100644 Binary files a/docs/images/performance/rle_int32_len_5.png and b/docs/images/performance/rle_int32_len_5.png differ diff --git a/docs/images/performance/rle_int32_len_500.png b/docs/images/performance/rle_int32_len_500.png index 2334b919f4..ec886fa6d0 100644 Binary files a/docs/images/performance/rle_int32_len_500.png and b/docs/images/performance/rle_int32_len_500.png differ diff --git a/docs/images/performance/rle_int64_len_5.png b/docs/images/performance/rle_int64_len_5.png index 3731bd9b60..033cbe1517 100644 Binary files a/docs/images/performance/rle_int64_len_5.png and b/docs/images/performance/rle_int64_len_5.png differ diff --git a/docs/images/performance/rle_int64_len_500.png b/docs/images/performance/rle_int64_len_500.png index d02aca7385..ef2b007030 100644 Binary files a/docs/images/performance/rle_int64_len_500.png and b/docs/images/performance/rle_int64_len_500.png differ diff --git a/docs/images/performance/scan_int32.png b/docs/images/performance/scan_int32.png index c6193ec6e2..d9a0f7a158 100644 Binary files a/docs/images/performance/scan_int32.png and b/docs/images/performance/scan_int32.png differ diff --git a/docs/images/performance/scan_int64.png b/docs/images/performance/scan_int64.png index 1b33e3a79b..5512241028 100644 Binary files a/docs/images/performance/scan_int64.png and b/docs/images/performance/scan_int64.png differ diff --git a/docs/images/performance/select_if_int32_50_percent.png b/docs/images/performance/select_if_int32_50_percent.png index 246bcc31e6..c4120e7eb9 100644 Binary files 
a/docs/images/performance/select_if_int32_50_percent.png and b/docs/images/performance/select_if_int32_50_percent.png differ diff --git a/docs/images/performance/select_if_int32_5_percent.png b/docs/images/performance/select_if_int32_5_percent.png index ff89c6f44c..16ece60b04 100644 Binary files a/docs/images/performance/select_if_int32_5_percent.png and b/docs/images/performance/select_if_int32_5_percent.png differ diff --git a/docs/images/performance/select_if_int64_50_percent.png b/docs/images/performance/select_if_int64_50_percent.png index 374a1b76e0..8dd2a2faf4 100644 Binary files a/docs/images/performance/select_if_int64_50_percent.png and b/docs/images/performance/select_if_int64_50_percent.png differ diff --git a/docs/images/performance/select_if_int64_5_percent.png b/docs/images/performance/select_if_int64_5_percent.png index cb1afbdf05..13c3860df1 100644 Binary files a/docs/images/performance/select_if_int64_5_percent.png and b/docs/images/performance/select_if_int64_5_percent.png differ diff --git a/docs/images/performance/select_unique_int32_len_5.png b/docs/images/performance/select_unique_int32_len_5.png index 5043ebdd3e..11a9a416e0 100644 Binary files a/docs/images/performance/select_unique_int32_len_5.png and b/docs/images/performance/select_unique_int32_len_5.png differ diff --git a/docs/images/performance/select_unique_int32_len_500.png b/docs/images/performance/select_unique_int32_len_500.png index d711d296ef..90d25e9188 100644 Binary files a/docs/images/performance/select_unique_int32_len_500.png and b/docs/images/performance/select_unique_int32_len_500.png differ diff --git a/docs/images/performance/select_unique_int64_len_5.png b/docs/images/performance/select_unique_int64_len_5.png index c3acce1803..a806baae60 100644 Binary files a/docs/images/performance/select_unique_int64_len_5.png and b/docs/images/performance/select_unique_int64_len_5.png differ diff --git a/docs/images/performance/select_unique_int64_len_500.png 
b/docs/images/performance/select_unique_int64_len_500.png index 709943555b..2a5fb17c5d 100644 Binary files a/docs/images/performance/select_unique_int64_len_500.png and b/docs/images/performance/select_unique_int64_len_500.png differ diff --git a/docs/mainpage.dox b/docs/mainpage.dox index 3faa5a489e..5f45cd3d57 100644 --- a/docs/mainpage.dox +++ b/docs/mainpage.dox @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: @@ -53,21 +53,35 @@ \tableofcontents \htmlonly + +
+ + +   +Download CUB v1.2.0 + + + +          NVIDIA Research -
+ +
+    Browse or fork CUB at GitHub -
+ +
+ +          The cub-users discussion forum -
- -   -Download CUB v1.1.1 (Dec 11, 2013) + +
+ \endhtmlonly \section sec1 (1) What is CUB? @@ -75,40 +89,43 @@ \par CUB provides state-of-the-art, reusable software components for every layer of the CUDA programming model: -- Primitives - - [Device-wide] (group___device_module.html) +- Parallel primitives + - [Device-wide primitives] (group___device_module.html) - Parallel sort, prefix scan, reduction, histogram, etc. - Compatible with CUDA dynamic parallelism - - [Block-wide "collective"] (group___block_module.html) - - Cooperative I/O, sort, prefix scan, reduction, histogram, etc. + - [Block-wide "collective" primitives] (group___block_module.html) + - Cooperative I/O, sort, scan, reduction, histogram, etc. - Compatible with arbitrary thread block sizes and types - - [Warp-wide "collective"] (group___warp_module.html) + - [Warp-wide "collective" primitives] (group___warp_module.html) - Cooperative warp-wide prefix scan, reduction, etc. - - Safe and architecture-specific + - Safely specialized for each underlying CUDA architecture - Utilities - [Fancy iterators] (group___util_iterator.html) - [Thread and thread block I/O] (group___util_io.html) - [PTX intrinsics] (group___util_ptx.html) - [Device, kernel, and storage management] (group___util_mgmt.html) -\subsection sec1sec1 1.1 Collective Primitives +\section sec2 (2) Collective Primitives +\par +CUB provides collective primitives, which are essential for constructing high-performance, +maintainable kernel code. These collective primitives allow complex parallel code to be +re-used rather than re-implemented, and be re-compiled rather than hand-ported.

+ +\par +\image html cub_overview.png +
Orientation of collective primitives within the CUDA software stack
+ \par As a SIMT programming model, CUDA engenders both scalar and collective software interfaces. Traditional software interfaces are scalar : a single thread invokes a library routine to perform some operation (which may include spawning parallel subtasks). Alternatively, a collective interface is entered simultaneously by a group of parallel threads to perform -some cooperative operation. Collective SIMT primitives are essential for constructing -performance-portable kernels for use in higher level software abstractions, libraries, -domain-specific languages, etc. - -\par -\image html cub_overview.png -
Orientation of collective primitives within the CUDA software stack
+some cooperative operation. \par CUB's collective primitives are not bound to any particular width of parallelism -or to any particular data type. This allows them to be: +or data type. This flexibility makes them: - Adaptable to fit the needs of the enclosing kernel computation - Trivially tunable to different grain sizes (threads per block, items per thread, etc.) @@ -116,20 +133,7 @@ or to any particular data type. This allows them to be: \par Thus CUB is [CUDA Unbound](index.html). -\subsection sec1sec2 1.2 Design Motivation -\par -CUB is inspired by the following goals: -- Absolute performance. CUB primitives are specialized and tuned to - best match the features and capabilities of each CUDA architecture. -- Enhanced programmer productivity. CUB primitives allow developers to quickly - compose sequences of complex parallel operations in both CUDA kernel code and CUDA host code. -- Enhanced tunability. CUB primitives allow developers to quickly - change grain sizes (threads per block, items per thread, etc.). -- Reduced maintenance burden. CUB provides a SIMT software abstraction layer - over the diversity of CUDA hardware. With CUB, applications can enjoy - performance-portability without intensive and costly rewriting or porting efforts. - -\section sec2 (2) An Example (block-wide sorting) +\section sec3 (3) An Example (block-wide sorting) \par The following code snippet presents a CUDA kernel in which each block of 128 threads @@ -173,8 +177,8 @@ __global__ void BlockSortKernel(int *d_in, int *d_out) \endcode \par -Threads use cub::BlockLoad, cub::BlockRadixSort, and cub::BlockStore to collectively -load, sort and store a "tile" of input items. Because these operations are +In this example, threads use cub::BlockLoad, cub::BlockRadixSort, and cub::BlockStore to collectively +load, sort and store the block's segment of input items. 
Because these operations are cooperative, each primitive requires an allocation of shared memory for threads to communicate through. The typical usage pattern for a CUB collective is: -# Statically specialize the primitive for the specific problem setting at hand, e.g., @@ -188,8 +192,8 @@ through. The typical usage pattern for a CUB collective is: -# Invoke methods on the primitive instance. \par -In this example, each thread block uses cub::BlockRadixSort to collectively sort -the data items partitioned across the thread block. To provide coalesced accesses +In particular, cub::BlockRadixSort is used to collectively sort the segment of data items +that have been partitioned across the thread block. To provide coalesced accesses to device memory, we configure the cub::BlockLoad and cub::BlockStore primitives to access memory using a striped access pattern (where consecutive threads simultaneously access consecutive items) and then transpose the keys into @@ -197,49 +201,66 @@ a [blocked arrangement](index.html#sec4sec3) of elements across threads To reuse shared memory across all three primitives, the thread block statically allocates a union of their \p TempStorage types. -\section sec3 (3) Why do you need CUB? +\section sec4 (4) Why do you need CUB? \par -Constructing, tuning, and maintaining kernel code is perhaps the most challenging, -time-consuming aspect of CUDA programming. CUDA kernel software is where +Writing, tuning, and maintaining kernel code is perhaps the most challenging, +time-consuming aspect of CUDA programming. Kernel software is where the complexity of parallelism is expressed. Programmers must reason about deadlock, livelock, synchronization, race conditions, shared memory layout, plurality of state, granularity, throughput, latency, memory bottlenecks, etc. \par -However, with the exception of CUB, there are few (if any) software libraries of -reusable kernel primitives. In the CUDA ecosystem, CUB is unique in this regard. 
+With the exception of CUB, however, there are few (if any) software libraries of +reusable kernel primitives. In the CUDA ecosystem, CUB is unique in this regard. As a [SIMT](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#hardware-implementation) library and software abstraction layer, CUB provides: --# Simplicity of composition. CUB primitives can be simply sequenced - and nested in kernel code. For example, cub::BlockRadixSort is constructed from - cub::BlockExchange and cub::BlockRadixRank. The latter is composed of cub::BlockScan + +-# Simplicity of composition. CUB enhances programmer productivity by + allowing complex parallel operations to be easily sequenced and nested. + For example, cub::BlockRadixSort is constructed from cub::BlockExchange and + cub::BlockRadixRank. The latter is composed of cub::BlockScan which incorporates cub::WarpScan. \image html nested_composition.png --# High performance. CUB simplifies high-performance kernel - development by taking care to implement the state-of-the-art in parallel algorithms. - Expert code should be reused rather than reimplemented. --# Performance portability. CUB primitives are specialized to - match the diversity of NVIDIA hardware, continuously evolving to accommodate new - features and instructions. For example, CUB reductions and prefix scans employ - warp-shuffle on Kepler GPUs. Code should be recompiled rather than hand-ported. + +-# High performance. CUB simplifies high-performance program and kernel + development by taking care to implement the state-of-the-art in parallel algorithms.

+ +-# Performance portability. + CUB primitives are specialized to match the diversity of NVIDIA hardware, continuously + evolving to accommodate new architecture-specific features and instructions. And + because CUB's device-wide primitives are implemented using flexible block-wide and + warp-wide collectives, we are able to performance-tune them to match the processor + resources provided by each CUDA processor architecture. As a result, our CUB + implementations demonstrate much better performance-portability when compared to more + traditional, rigidly-coded parallel libraries such + as [Thrust](http://thrust.github.com/):

+ -# Simplicity of performance tuning: - - Variant tuning. Most CUB primitives support alternative algorithmic + - Resource utilization. CUB primitives allow developers to quickly + change grain sizes (threads per block, items per thread, etc.) to best match + the processor resources of their target architecture + - Variant tuning. Most CUB primitives support alternative algorithmic strategies. For example, cub::BlockHistogram is parameterized to implement either an atomic-based approach or a sorting-based approach. (The latter provides uniform performance regardless of input distribution.) - - Kernel+library co-optimization. Most CUB primitives support arbitrary - granularity (threads per block, items per thread, etc.). When the enclosing kernel - is similarly parameterizable, a configuration can be found that optimally - accommodates their combined register and shared memory pressure. + - Co-optimization. When the enclosing kernel + is similarly parameterizable, a tuning configuration can be found that optimally + accommodates their combined register and shared memory pressure.

+ -# Robustness and durability. CUB just works. CUB primitives are designed to function properly for arbitrary data types and widths of parallelism (not just for the built-in C++ types or for powers-of-two threads - per block). + per block).

+ +-# Reduced maintenance burden. CUB provides a SIMT software abstraction layer + over the diversity of CUDA hardware. With CUB, applications can enjoy + performance-portability without intensive and costly rewriting or porting efforts.

+ -# A path for language evolution. CUB primitives are designed to easily accommodate new features in the CUDA programming model, e.g., thread - subgroups and named barriers, dynamic shared memory allocators, etc. + subgroups and named barriers, dynamic shared memory allocators, etc.

-\section sec4 (4) How do CUB collectives work? +\section sec5 (5) How do CUB collectives work? \par Four programming idioms are central to the design of CUB: @@ -259,7 +280,7 @@ Four programming idioms are central to the design of CUB: configuration can be determined that best accommodates the combined behavior and resource consumption of all primitives within the kernel. -\subsection sec4sec1 4.1 Generic programming +\subsection sec5sec1 5.1 Generic programming \par We use template parameters to specialize CUB primitives for the particular problem setting at hand. Until compile time, CUB primitives are not bound @@ -270,7 +291,7 @@ to any particular: - Underlying processor (special instructions, warp size, rules for bank conflicts, etc.) - Tuning configuration (e.g., latency vs. throughput, algorithm selection, etc.) -\subsection sec4sec2 4.2 Reflective class interfaces +\subsection sec5sec2 5.2 Reflective class interfaces \par Unlike traditional function-oriented interfaces, CUB exposes its collective primitives as templated C++ classes. The resource requirements for a specific @@ -316,7 +337,7 @@ This allows CUB types to easily accommodate new programming model features (e.g., named barriers, memory allocators, etc.) without incurring a combinatorial growth of interface methods. -\subsection sec4sec3 4.3 Flexible data arrangement across threads +\subsection sec5sec3 5.3 Flexible data arrangement across threads \par CUDA kernels are often designed such that each thread block is assigned a segment of data items for processing. @@ -380,7 +401,7 @@ The benefits of processing multiple items per thread (a.k.a., register block Finally, cub::BlockExchange provides operations for converting between blocked and striped arrangements. -\subsection sec4sec4 4.4 Static tuning and co-tuning +\subsection sec5sec4 5.4 Static tuning and co-tuning \par This style of flexible interface simplifies performance tuning. 
Most CUB primitives support alternative algorithmic strategies that can be @@ -401,7 +422,7 @@ parameterized, the coupled CUB primitives adjust accordingly. This enables autotuners to search for a single configuration that maximizes the performance of the entire kernel for a given set of hardware resources. -\section sec5 (5) How do I get started using CUB? +\section sec6 (6) How do I get started using CUB? \par CUB is implemented as a C++ header library. There is no need to build CUB @@ -414,11 +435,14 @@ separately. To use CUB primitives in your code, simply: specifying a \p -I include-path flag to reference the location of the CUB header library. -\section sec6 (6) How is CUB different than Thrust? - \par -CUB and [Thrust](http://thrust.github.com/) have some -similarities in that they both provide device-wide primitives for CUDA. +We also have a collection of simple [CUB example programs](examples.html). + +\section sec7 (7) How is CUB different than Thrust and Modern GPU? + +\par CUB and Thrust +CUB and [Thrust](http://thrust.github.com/) share some +similarities in that they both provide similar device-wide primitives for CUDA. However, they target different abstraction layers for parallel computing. Thrust abstractions are agnostic of any particular parallel framework (e.g., CUDA, TBB, OpenMP, sequential CPU, etc.). While Thrust has a "backend" @@ -437,11 +461,38 @@ project arose out of a maintenance need to achieve better performance-portabilit within Thrust by using reusable block-wide primitives to reduce maintenance and tuning effort. -\section sec7 (7) Recent News +\par CUB and Modern GPU +CUB and [Modern GPU](http://nvlabs.github.io/moderngpu/) also +share some similarities in that they both implement similar device-wide primitives for CUDA. +However, they serve different purposes for the CUDA programming community. MGPU +is a pedagogical tool for high-performance GPU computing, providing clear and concise +exemplary code and accompanying commentary. 
It serves as an excellent source of +educational, tutorial, CUDA-by-example material. The MGPU source code is intended +to be read and studied, and often favors simplicity at the expense of portability and +flexibility. + +\par +CUB, on the other hand, is a production-quality library whose sources are complicated +by support for every version of CUDA architecture, and is validated by an extensive +suite of regression tests. Although well-documented, the CUB source text is verbose +and relies heavily on C++ template metaprogramming for situational specialization. + +\par +CUB and MGPU are complementary in that MGPU serves as an excellent descriptive source +for many of the algorithmic techniques used by CUB. + +\section sec8 (8) Recent News \par + +
+02/25/2014
+[CUB v1.2.0](https://github.com/NVlabs/cub/archive/1.2.0.zip) +
+- See the [change-log](CHANGE_LOG.TXT) for further details +
12/11/2013
[CUB v1.1.1](https://github.com/NVlabs/cub/archive/1.1.1.zip) @@ -513,13 +564,13 @@ tuning effort.
-\section sec8 (8) Contributors +\section sec9 (9) Contributors \par CUB is developed as an open-source project by [NVIDIA Research](http://research.nvidia.com). The primary contributor is [Duane Merrill](http://github.com/dumerrill). -\section sec9 (9) Open Source License +\section sec10 (10) Open Source License \par CUB is available under the "New BSD" open-source license: @@ -527,7 +578,7 @@ CUB is available under the "New BSD" open-source license: \par \code Copyright (c) 2011, Duane Merrill. All rights reserved. -Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. +Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: diff --git a/examples/block/Makefile b/examples/block/Makefile index fe6612e26e..02f4257d71 100644 --- a/examples/block/Makefile +++ b/examples/block/Makefile @@ -1,6 +1,6 @@ #/****************************************************************************** # * Copyright (c) 2011, Duane Merrill. All rights reserved. -# * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. +# * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. # * # * Redistribution and use in source and binary forms, with or without # * modification, are permitted provided that the following conditions are met: diff --git a/examples/block/example_block_radix_sort.cu b/examples/block/example_block_radix_sort.cu index 180040b7ed..9d8f211621 100644 --- a/examples/block/example_block_radix_sort.cu +++ b/examples/block/example_block_radix_sort.cu @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. 
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/examples/block/example_block_reduce.cu b/examples/block/example_block_reduce.cu index 251956bc1f..0e73876726 100644 --- a/examples/block/example_block_reduce.cu +++ b/examples/block/example_block_reduce.cu @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/examples/block/example_block_scan.cu b/examples/block/example_block_scan.cu index a91821b562..45e7789fdc 100644 --- a/examples/block/example_block_scan.cu +++ b/examples/block/example_block_scan.cu @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/examples/block/experimental/example_block_tridiagonal_solve.cu b/examples/block/experimental/example_block_tridiagonal_solve.cu index a7ad6bf864..57b2bc694c 100644 --- a/examples/block/experimental/example_block_tridiagonal_solve.cu +++ b/examples/block/experimental/example_block_tridiagonal_solve.cu @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. 
All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/examples/device/Makefile b/examples/device/Makefile index af758ceedf..47df1b7a1b 100644 --- a/examples/device/Makefile +++ b/examples/device/Makefile @@ -1,6 +1,6 @@ #/****************************************************************************** # * Copyright (c) 2011, Duane Merrill. All rights reserved. -# * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. +# * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. # * # * Redistribution and use in source and binary forms, with or without # * modification, are permitted provided that the following conditions are met: diff --git a/examples/device/example_device_histogram.cu b/examples/device/example_device_histogram.cu index b0e4e4c70e..e7cda08899 100644 --- a/examples/device/example_device_histogram.cu +++ b/examples/device/example_device_histogram.cu @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/examples/device/example_device_partition_flagged.cu b/examples/device/example_device_partition_flagged.cu index 2746f6a084..125749d02a 100644 --- a/examples/device/example_device_partition_flagged.cu +++ b/examples/device/example_device_partition_flagged.cu @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. 
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/examples/device/example_device_partition_if.cu b/examples/device/example_device_partition_if.cu index 1583b467c9..c53e90213a 100644 --- a/examples/device/example_device_partition_if.cu +++ b/examples/device/example_device_partition_if.cu @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/examples/device/example_device_radix_sort.cu b/examples/device/example_device_radix_sort.cu index a4ba6f2bab..04b604449c 100644 --- a/examples/device/example_device_radix_sort.cu +++ b/examples/device/example_device_radix_sort.cu @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. 
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/examples/device/example_device_reduce.cu b/examples/device/example_device_reduce.cu index eac2b53438..9b3d4863fb 100644 --- a/examples/device/example_device_reduce.cu +++ b/examples/device/example_device_reduce.cu @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/examples/device/example_device_scan.cu b/examples/device/example_device_scan.cu index 43edd1193a..f7a89c7ed0 100644 --- a/examples/device/example_device_scan.cu +++ b/examples/device/example_device_scan.cu @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/examples/device/example_device_select_flagged.cu b/examples/device/example_device_select_flagged.cu index dfe9d09c49..89f9c083ec 100644 --- a/examples/device/example_device_select_flagged.cu +++ b/examples/device/example_device_select_flagged.cu @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. 
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/examples/device/example_device_select_if.cu b/examples/device/example_device_select_if.cu index 9333960191..ee8437a2d8 100644 --- a/examples/device/example_device_select_if.cu +++ b/examples/device/example_device_select_if.cu @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/examples/device/example_device_select_unique.cu b/examples/device/example_device_select_unique.cu index 4b88f6c188..a7dd09a342 100644 --- a/examples/device/example_device_select_unique.cu +++ b/examples/device/example_device_select_unique.cu @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/examples/device/experimental/coo_graph.cuh b/examples/device/experimental/coo_graph.cuh index 6b1cd0cd52..5a92352309 100644 --- a/examples/device/experimental/coo_graph.cuh +++ b/examples/device/experimental/coo_graph.cuh @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. 
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/examples/device/experimental/example_coo_spmv.cu b/examples/device/experimental/example_coo_spmv.cu index 287abff2fc..0e8d73f818 100644 --- a/examples/device/experimental/example_coo_spmv.cu +++ b/examples/device/experimental/example_coo_spmv.cu @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/examples/device/experimental/test_device_seg_reduce.cu b/examples/device/experimental/test_device_seg_reduce.cu index c1072eed56..5b1044db79 100644 --- a/examples/device/experimental/test_device_seg_reduce.cu +++ b/examples/device/experimental/test_device_seg_reduce.cu @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/test/Makefile b/test/Makefile index fbcbd52a7b..cdcfd6d9e0 100644 --- a/test/Makefile +++ b/test/Makefile @@ -1,6 +1,6 @@ #/****************************************************************************** # * Copyright (c) 2011, Duane Merrill. All rights reserved. 
-# * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. +# * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. # * # * Redistribution and use in source and binary forms, with or without # * modification, are permitted provided that the following conditions are met: diff --git a/test/test_allocator.cu b/test/test_allocator.cu index 9ebbe6eaac..ec67f96778 100644 --- a/test/test_allocator.cu +++ b/test/test_allocator.cu @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/test/test_block_histogram.cu b/test/test_block_histogram.cu index 024842fcfd..b16ae5284e 100644 --- a/test/test_block_histogram.cu +++ b/test/test_block_histogram.cu @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/test/test_block_load_store.cu b/test/test_block_load_store.cu index 45f0e3c42f..23c65e18ae 100644 --- a/test/test_block_load_store.cu +++ b/test/test_block_load_store.cu @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. 
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/test/test_block_radix_sort.cu b/test/test_block_radix_sort.cu index 696d3c961f..7ebdc718cf 100644 --- a/test/test_block_radix_sort.cu +++ b/test/test_block_radix_sort.cu @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/test/test_block_reduce.cu b/test/test_block_reduce.cu index babc36d94c..430840b7f2 100644 --- a/test/test_block_reduce.cu +++ b/test/test_block_reduce.cu @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/test/test_block_scan.cu b/test/test_block_scan.cu index 2652a8831c..f9c44f082c 100644 --- a/test/test_block_scan.cu +++ b/test/test_block_scan.cu @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. 
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/test/test_device_histogram.cu b/test/test_device_histogram.cu index dfd123b60c..fc2bd03295 100644 --- a/test/test_device_histogram.cu +++ b/test/test_device_histogram.cu @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/test/test_device_radix_sort.cu b/test/test_device_radix_sort.cu index be962b2639..2fae096e21 100644 --- a/test/test_device_radix_sort.cu +++ b/test/test_device_radix_sort.cu @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: diff --git a/test/test_device_reduce.cu b/test/test_device_reduce.cu index a2b2f1fed2..6e0b58c8b6 100644 --- a/test/test_device_reduce.cu +++ b/test/test_device_reduce.cu @@ -1,6 +1,6 @@ /****************************************************************************** * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved. 
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met:
diff --git a/test/test_device_reduce_by_key.cu b/test/test_device_reduce_by_key.cu
index 15481d309e..4d089260e5 100644
--- a/test/test_device_reduce_by_key.cu
+++ b/test/test_device_reduce_by_key.cu
@@ -1,6 +1,6 @@
 /******************************************************************************
  * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met:
@@ -309,8 +309,16 @@ cudaError_t Dispatch(
 /**
  * Simple wrapper kernel to invoke DeviceSelect
- * /
-template
+ */
+template <
+    typename KeyInputIterator,
+    typename KeyOutputIterator,
+    typename ValueInputIterator,
+    typename ValueOutputIterator,
+    typename NumSegmentsIterator,
+    typename EqualityOp,
+    typename ReductionOp,
+    typename Offset>
 __global__ void CnpDispatchKernel(
     int timing_timing_iterations,
     size_t *d_temp_storage_bytes,
@@ -318,10 +326,15 @@ __global__ void CnpDispatchKernel(
     void *d_temp_storage,
     size_t temp_storage_bytes,
-    InputIterator d_in,
-    OutputIterator d_out,
-    NumSelectedIterator d_num_segments,
+    KeyInputIterator d_keys_in,
+    KeyOutputIterator d_keys_out,
+    ValueInputIterator d_values_in,
+    ValueOutputIterator d_values_out,
+    NumSegmentsIterator d_num_segments,
+    EqualityOp equality_op,
+    ReductionOp reduction_op,
     Offset num_items,
+    cudaStream_t stream,
     bool debug_synchronous)
 {
@@ -329,16 +342,26 @@ __global__ void CnpDispatchKernel(
     *d_cdp_error = cudaErrorNotSupported;
 #else
     *d_cdp_error = Dispatch(Int2Type(), timing_timing_iterations, d_temp_storage_bytes, d_cdp_error,
-        d_temp_storage, temp_storage_bytes, d_in, d_out, d_num_segments, num_items, 0, debug_synchronous);
+        d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out, d_values_in, d_values_out, d_num_segments, equality_op, reduction_op, num_items, 0, debug_synchronous);
+    *d_temp_storage_bytes = temp_storage_bytes;
 #endif
 }
-/ **
+/**
  * Dispatch to CDP kernel
- * /
-template
+ */
+template <
+    typename KeyInputIterator,
+    typename KeyOutputIterator,
+    typename ValueInputIterator,
+    typename ValueOutputIterator,
+    typename NumSegmentsIterator,
+    typename EqualityOp,
+    typename ReductionOp,
+    typename Offset>
+__host__ __device__ __forceinline__
 cudaError_t Dispatch(
     Int2Type dispatch_to,
     int timing_timing_iterations,
@@ -347,16 +370,20 @@ cudaError_t Dispatch(
     void *d_temp_storage,
     size_t &temp_storage_bytes,
-    InputIterator d_in,
-    OutputIterator d_out,
-    NumSelectedIterator d_num_segments,
+    KeyInputIterator d_keys_in,
+    KeyOutputIterator d_keys_out,
+    ValueInputIterator d_values_in,
+    ValueOutputIterator d_values_out,
+    NumSegmentsIterator d_num_segments,
+    EqualityOp equality_op,
+    ReductionOp reduction_op,
     Offset num_items,
     cudaStream_t stream,
     bool debug_synchronous)
 {
     // Invoke kernel to invoke device-side dispatch
     CnpDispatchKernel<<<1,1>>>(timing_timing_iterations, d_temp_storage_bytes, d_cdp_error,
-        d_temp_storage, temp_storage_bytes, d_in, d_out, d_num_segments, num_items, debug_synchronous);
+        d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out, d_values_in, d_values_out, d_num_segments, equality_op, reduction_op, num_items, 0, debug_synchronous);
     // Copy out temp_storage_bytes
     CubDebugExit(cudaMemcpy(&temp_storage_bytes, d_temp_storage_bytes, sizeof(size_t) * 1, cudaMemcpyDeviceToHost));
@@ -367,7 +394,7 @@ cudaError_t Dispatch(
     return retval;
 }
-*/
+
 //---------------------------------------------------------------------
 // Test generation
diff --git a/test/test_device_scan.cu b/test/test_device_scan.cu
index 33e0fc12cf..221746a7f5 100644
--- a/test/test_device_scan.cu
+++ b/test/test_device_scan.cu
@@ -1,6 +1,6 @@
 /******************************************************************************
  * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met:
diff --git a/test/test_device_select_flagged.cu b/test/test_device_select_flagged.cu
index 7d60fc6f99..6da5db1765 100644
--- a/test/test_device_select_flagged.cu
+++ b/test/test_device_select_flagged.cu
@@ -1,6 +1,6 @@
 /******************************************************************************
  * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met:
diff --git a/test/test_device_select_if.cu b/test/test_device_select_if.cu
index b17385e0cc..91113894fb 100644
--- a/test/test_device_select_if.cu
+++ b/test/test_device_select_if.cu
@@ -1,6 +1,6 @@
 /******************************************************************************
  * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met:
diff --git a/test/test_device_select_unique.cu b/test/test_device_select_unique.cu
index 0c040988d1..9a2c63789d 100644
--- a/test/test_device_select_unique.cu
+++ b/test/test_device_select_unique.cu
@@ -1,6 +1,6 @@
 /******************************************************************************
  * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met:
diff --git a/test/test_grid_barrier.cu b/test/test_grid_barrier.cu
index fcbce2a161..10a1fe991e 100644
--- a/test/test_grid_barrier.cu
+++ b/test/test_grid_barrier.cu
@@ -1,6 +1,6 @@
 /******************************************************************************
  * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met:
diff --git a/test/test_iterator.cu b/test/test_iterator.cu
index c8ba88bd00..58140a84ac 100644
--- a/test/test_iterator.cu
+++ b/test/test_iterator.cu
@@ -1,6 +1,6 @@
 /******************************************************************************
  * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met:
diff --git a/test/test_util.h b/test/test_util.h
index 08b389bcd7..987c4f1c5e 100644
--- a/test/test_util.h
+++ b/test/test_util.h
@@ -1,6 +1,6 @@
 /******************************************************************************
  * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met:
diff --git a/test/test_warp_reduce.cu b/test/test_warp_reduce.cu
index 2828422cb1..173b2938c6 100644
--- a/test/test_warp_reduce.cu
+++ b/test/test_warp_reduce.cu
@@ -1,6 +1,6 @@
 /******************************************************************************
  * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met:
diff --git a/test/test_warp_scan.cu b/test/test_warp_scan.cu
index f19a730f98..24cfd61d1d 100644
--- a/test/test_warp_scan.cu
+++ b/test/test_warp_scan.cu
@@ -1,6 +1,6 @@
 /******************************************************************************
  * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met:
diff --git a/tune/Makefile b/tune/Makefile
index 630c6df67d..51ccde31e6 100644
--- a/tune/Makefile
+++ b/tune/Makefile
@@ -1,6 +1,6 @@
 #/******************************************************************************
 # * Copyright (c) 2011, Duane Merrill. All rights reserved.
-# * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
+# * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved.
 # *
 # * Redistribution and use in source and binary forms, with or without
 # * modification, are permitted provided that the following conditions are met:
diff --git a/tune/tune_device_reduce.cu b/tune/tune_device_reduce.cu
index 2180bd1326..f189e88ca7 100644
--- a/tune/tune_device_reduce.cu
+++ b/tune/tune_device_reduce.cu
@@ -1,6 +1,6 @@
 /******************************************************************************
  * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2011-2014, NVIDIA CORPORATION. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met: