* -# The \link cudpp_kernel Kernel-Level API\endlink comprises functions
* that run entirely on the GPU across an entire grid of thread blocks.
* These functions may call into the \link cudpp_cta CTA-Level API\endlink
* below them.
* -# The \link cudpp_cta CTA-Level API\endlink comprises functions that run
* entirely on the GPU within a single Cooperative Thread Array (CTA,
* also known as a thread block). These are low-level functions that implement core
* data-parallel algorithms, typically by processing data within shared
* (CUDA \c __shared__) memory.
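*
* As an illustration of the kind of algorithm that lives at this level, the
* following is a minimal CPU sketch of a work-efficient exclusive sum scan
* over one block's worth of data, following the up-sweep/down-sweep pattern
* described in the GPU Gems 3 chapter listed under References. It is not
* CUDPP's implementation: the name \c blockExclusiveScan is hypothetical, the
* \c std::vector stands in for a shared-memory tile, and each inner loop
* stands in for one step that a CTA's threads would execute in parallel.
*
```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of CUDPP): sequential sketch of a
// work-efficient exclusive sum scan over a single block of data.
// `data` plays the role of a shared-memory tile; its size must be a
// nonzero power of two.
std::vector<int> blockExclusiveScan(std::vector<int> data)
{
    const std::size_t n = data.size();
    assert(n != 0 && (n & (n - 1)) == 0);

    // Up-sweep (reduce): accumulate partial sums in place. On a GPU each
    // iteration of the inner loop would be handled by a different thread.
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            data[i] += data[i - stride];
        }
    }

    // Clear the last element (the total), then down-sweep to distribute
    // the partial sums so that each slot holds the sum of all earlier inputs.
    data[n - 1] = 0;
    for (std::size_t stride = n / 2; stride >= 1; stride /= 2) {
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            const int left = data[i - stride];
            data[i - stride] = data[i];
            data[i] += left;
        }
    }
    return data;
}
```
*
* For the input {3, 1, 7, 0, 4, 1, 6, 3} this sketch returns
* {0, 3, 4, 11, 11, 15, 16, 22}. A GPU version of the same pattern would
* additionally need a barrier between steps so that every thread sees the
* results of the previous step.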
*
* Programmers may use any of the lower three CUDPP layers in their own
* programs by building the source directly into their application. However,
* the typical usage of CUDPP is to link to the library and invoke functions in
* the CUDPP \link publicInterface Public Interface\endlink, as in the
* \ref example_simpleCUDPP "simpleCUDPP", satGL, and cudpp_testrig application
* examples included in the CUDPP distribution.
*
* In the future, if and when CUDA supports building device-level libraries, we
* hope to make CUDPP's internal algorithms easier to use at all
* levels.
*
* \subsection uses Use Cases
* We expect CUDPP to be used in one of two ways:
* -# Linking the CUDPP library into another application.
* -# Running our "test" application, cudpp_testrig, that exercises
* CUDPP functionality.
*
* \section references References
* The following publications describe work incorporated in CUDPP.
*
* - Mark Harris, Shubhabrata Sengupta, and John D. Owens. "Parallel Prefix Sum (Scan) with CUDA". In Hubert Nguyen, editor, <i>GPU Gems 3</i>, chapter 39, pages 851–876. Addison Wesley, August 2007. http://graphics.idav.ucdavis.edu/publications/print_pub?pub_id=916
* - Shubhabrata Sengupta, Mark Harris, Yao Zhang, and John D. Owens. "Scan Primitives for GPU Computing". In <i>Graphics Hardware 2007</i>, pages 97–106, August 2007. http://graphics.idav.ucdavis.edu/publications/print_pub?pub_id=915
* - Shubhabrata Sengupta, Mark Harris, and Michael Garland. "Efficient parallel scan algorithms for GPUs". NVIDIA Technical Report NVR-2008-003, December 2008. http://mgarland.org/papers.html#segscan-tr
* - Nadathur Satish, Mark Harris, and Michael Garland. "Designing Efficient Sorting Algorithms for Manycore GPUs". In <i>Proceedings of the 23rd IEEE International Parallel & Distributed Processing Symposium</i>, May 2009. http://mgarland.org/papers.html#gpusort
* - Stanley Tzeng and Li-Yi Wei. "Parallel White Noise Generation on a GPU via Cryptographic Hash". In <i>Proceedings of the 2008 Symposium on Interactive 3D Graphics and Games</i>, pages 79–87, February 2008. http://research.microsoft.com/apps/pubs/default.aspx?id=70502
*
* Many researchers use CUDPP in their work, and many publications
* have used it \ref cudpp_refs "(references)". If your work uses CUDPP, please
* let us know by sending us a reference (preferably in BibTeX format) to your work.
*
* \section citing Citing CUDPP
*
* If you make use of CUDPP primitives in your work and want to cite
* CUDPP (thanks!), we would prefer that you cite the appropriate
* papers above, since they form the core of CUDPP. Specifically,
* the GPU Gems paper describes (unsegmented) scan, multi-scan for
* summed-area tables, and stream compaction. The NVIDIA technical report
* describes the current scan and segmented scan algorithms used in the
* library, and the Graphics Hardware paper describes an earlier
* implementation of segmented scan, quicksort, and sparse matrix-vector
* multiply. The IPDPS paper describes the radix sort used in CUDPP, and
* the I3D paper describes the random number generation algorithm.