* The cub::BlockReduce class provides [<em>collective</em>](index.html#sec0) methods for computing a parallel reduction of items partitioned across a CUDA thread block.
* \brief The BlockReduce class provides [<em>collective</em>](index.html#sec0) methods for computing a parallel reduction of items partitioned across a CUDA thread block. ![](reduce_logo.png)
* \ingroup BlockModule
*
* \par Overview
* A <a href="http://en.wikipedia.org/wiki/Reduce_(higher-order_function)"><em>reduction</em></a> (or <em>fold</em>)
* uses a binary combining operator to compute a single aggregate from a list of input elements.
*
* \par
* Optionally, BlockReduce can be specialized by algorithm to accommodate different latency/throughput workload profiles:
* -# <b>cub::BLOCK_REDUCE_RAKING</b>. An efficient "raking" reduction algorithm. [More...](\ref cub::BlockReduceAlgorithm)
* \tparam BLOCK_THREADS The thread block size in threads
* \tparam ALGORITHM <b>[optional]</b> cub::BlockReduceAlgorithm enumerator specifying the underlying algorithm to use (default: cub::BLOCK_REDUCE_RAKING)
*
* \par Performance Considerations
* - Very efficient (only one synchronization barrier).
* - Zero bank conflicts for most types.
* - Computation is slightly more efficient (i.e., having lower instruction overhead) for:
* \brief Collective constructor for 1D thread blocks using a private static allocation of shared memory as temporary storage. Threads are identified using <tt>threadIdx.x</tt>.
*/
__device__ __forceinline__ BlockReduce()
:
temp_storage(PrivateStorage()),
linear_tid(threadIdx.x)
{}
/**
* \brief Collective constructor for 1D thread blocks using the specified memory allocation as temporary storage. Threads are identified using <tt>threadIdx.x</tt>.
*/
__device__ __forceinline__ BlockReduce(
TempStorage &temp_storage) ///< [in] Reference to memory allocation having layout type TempStorage
:
temp_storage(temp_storage.Alias()),
linear_tid(threadIdx.x)
{}
/**
* \brief Collective constructor using a private static allocation of shared memory as temporary storage. Each thread is identified using the supplied linear thread identifier
*/
__device__ __forceinline__ BlockReduce(
int linear_tid) ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + linear_tid</tt> for 2D thread blocks)
:
temp_storage(PrivateStorage()),
linear_tid(linear_tid)
{}
/**
* \brief Collective constructor using the specified memory allocation as temporary storage. Each thread is identified using the supplied linear thread identifier.
*/
__device__ __forceinline__ BlockReduce(
TempStorage &temp_storage, ///< [in] Reference to memory allocation having layout type TempStorage
int linear_tid) ///< [in] <b>[optional]</b> A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + linear_tid</tt> for 2D thread blocks)
* \brief Computes a block-wide reduction for thread<sub>0</sub> using the specified binary reduction functor. Each thread contributes one input element.
*
* The return value is undefined in threads other than thread<sub>0</sub>.
*
* Supports non-commutative reduction operators.
*
* \smemreuse
*
* The code snippet below illustrates a max reduction of 128 integer items that
* are partitioned across 128 threads.
* \par
* \code
* #include <cub/cub.cuh>
*
* __global__ void ExampleKernel(...)
* {
* // Specialize BlockReduce for 128 threads on type int
* \brief Computes a block-wide reduction for thread<sub>0</sub> using the specified binary reduction functor. Each thread contributes an array of consecutive input elements.
*
* The return value is undefined in threads other than thread<sub>0</sub>.
*
* Supports non-commutative reduction operators.
*
* \blocked
*
* \smemreuse
*
* The code snippet below illustrates a max reduction of 512 integer items that
* are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec4) across 128 threads
* where each thread owns 4 consecutive items.
* \par
* \code
* #include <cub/cub.cuh>
*
* __global__ void ExampleKernel(...)
* {
* // Specialize BlockReduce for 128 threads on type int
* \brief Computes a block-wide reduction for thread<sub>0</sub> using the specified binary reduction functor. The first \p num_valid threads each contribute one input element.
*
* The return value is undefined in threads other than thread<sub>0</sub>.
*
* Supports non-commutative reduction operators.
*
* \blocked
*
* \smemreuse
*
* The code snippet below illustrates a max reduction of a partially-full tile of integer items that
* \brief Computes a block-wide reduction for thread<sub>0</sub> using addition (+) as the reduction operator. Each thread contributes one input element.
*
* The return value is undefined in threads other than thread<sub>0</sub>.
*
* \smemreuse
*
* The code snippet below illustrates a sum reduction of 128 integer items that
* are partitioned across 128 threads.
* \par
* \code
* #include <cub/cub.cuh>
*
* __global__ void ExampleKernel(...)
* {
* // Specialize BlockReduce for 128 threads on type int
* \brief Computes a block-wide reduction for thread<sub>0</sub> using addition (+) as the reduction operator. Each thread contributes an array of consecutive input elements.
*
* The return value is undefined in threads other than thread<sub>0</sub>.
*
* \smemreuse
*
* The code snippet below illustrates a sum reduction of 512 integer items that
* are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec4) across 128 threads
* where each thread owns 4 consecutive items.
* \par
* \code
* #include <cub/cub.cuh>
*
* __global__ void ExampleKernel(...)
* {
* // Specialize BlockReduce for 128 threads on type int
* // Obtain a segment of consecutive items that are blocked across threads
* int thread_data[4];
* ...
*
* // Compute the block-wide sum for thread0
* int aggregate = BlockReduce(temp_storage).Sum(thread_data);
*
* \endcode
*
* \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
*/
template <int ITEMS_PER_THREAD>
__device__ __forceinline__ T Sum(
T (&inputs)[ITEMS_PER_THREAD]) ///< [in] Calling thread's input segment
{
// Reduce partials
T partial = ThreadReduce(inputs, cub::Sum());
return Sum(partial);
}
/**
* \brief Computes a block-wide reduction for thread<sub>0</sub> using addition (+) as the reduction operator. The first \p num_valid threads each contribute one input element.
*
* The return value is undefined in threads other than thread<sub>0</sub>.
*
* \smemreuse
*
* The code snippet below illustrates a sum reduction of a partially-full tile of integer items that