diff --git a/doc/Manual.html b/doc/Manual.html
index 6a97ddf30..d11b9de22 100644
--- a/doc/Manual.html
+++ b/doc/Manual.html
@@ -1,443 +1,443 @@
<HTML>
<HEAD>
<TITLE>LAMMPS-ICMS Users Manual</TITLE>
-<META NAME="docnumber" CONTENT="8 Jul 2015 version">
+<META NAME="docnumber" CONTENT="15 Jul 2015 version">
<META NAME="author" CONTENT="http://lammps.sandia.gov - Sandia National Laboratories">
<META NAME="copyright" CONTENT="Copyright (2003) Sandia Corporation. This software and manual is distributed under the GNU General Public License.">
</HEAD>
<BODY>
<CENTER><A HREF = "http://lammps.sandia.gov">LAMMPS WWW Site</A> - <A HREF = "Manual.html">LAMMPS Documentation</A> - <A HREF = "Section_commands.html#comm">LAMMPS Commands</A>
</CENTER>
<HR>
<H1></H1>
<CENTER><H3>LAMMPS-ICMS Documentation
</H3></CENTER>
-<CENTER><H4>8 Jul 2015 version
+<CENTER><H4>15 Jul 2015 version
</H4></CENTER>
<H4>Version info:
</H4>
<P>The LAMMPS "version" is the date when it was released, such as 1 May
2010. LAMMPS is updated continuously. Whenever we fix a bug or add a
feature, we release it immediately, and post a notice on <A HREF = "http://lammps.sandia.gov/bug.html">this page of
the WWW site</A>. Each dated copy of LAMMPS contains all the
features and bug-fixes up to and including that version date. The
version date is printed to the screen and logfile every time you run
LAMMPS. It is also in the file src/version.h and in the LAMMPS
directory name created when you unpack a tarball, and at the top of
the first page of the manual (this page).
</P>
<P>LAMMPS-ICMS is an experimental variant of LAMMPS with additional
features made available for testing before they will be submitted
for inclusion into the official LAMMPS tree. The source code is
based on the official LAMMPS svn repository mirror at the Institute
for Computational Molecular Science at Temple University and generally
kept up-to-date as much as possible. Sometimes, e.g. when additional
development work is needed to adapt the upstream changes into
LAMMPS-ICMS it can take longer until synchronization; and occasionally,
e.g. in case of the rewrite of the multi-threading support, the
development will be halted except for important bugfixes until
all features of LAMMPS-ICMS fully compatible with the upstream
version or replaced by alternate implementations.
</P>
<UL><LI>If you browse the HTML doc pages on the LAMMPS WWW site, they always
describe the most current version of upstream LAMMPS, but may be
missing some new features in LAMMPS-ICMS.
<LI>If you browse the HTML doc pages included in your tarball, they
describe the version you have, however, not all new features in
LAMMPS-ICMS are documented immediately.
<LI>The <A HREF = "Manual.pdf">PDF file</A> on the WWW site or in the tarball is updated
about once per month. This is because it is large, and we don't want
it to be part of every patch.
<LI>There is also a <A HREF = "Developer.pdf">Developer.pdf</A> file in the doc
directory, which describes the internal structure and algorithms of
LAMMPS.
</UL>
<P>LAMMPS stands for Large-scale Atomic/Molecular Massively Parallel
Simulator.
</P>
<P>LAMMPS is a classical molecular dynamics simulation code designed to
run efficiently on parallel computers. It was developed at Sandia
National Laboratories, a US Department of Energy facility, with
funding from the DOE. It is an open-source code, distributed freely
under the terms of the GNU Public License (GPL).
</P>
<P>The primary developers of LAMMPS are <A HREF = "http://www.sandia.gov/~sjplimp">Steve Plimpton</A>, Aidan
Thompson, and Paul Crozier who can be contacted at
sjplimp,athomps,pscrozi at sandia.gov. The <A HREF = "http://lammps.sandia.gov">LAMMPS WWW Site</A> at
http://lammps.sandia.gov has more information about the code and its
uses.
</P>
<HR>
<P>The LAMMPS documentation is organized into the following sections. If
you find errors or omissions in this manual or have suggestions for
useful information to add, please send an email to the developers so
we can improve the LAMMPS documentation.
</P>
<P>Once you are familiar with LAMMPS, you may want to bookmark <A HREF = "Section_commands.html#comm">this
page</A> at Section_commands.html#comm since
it gives quick access to documentation for all LAMMPS commands.
</P>
<P><A HREF = "Manual.pdf">PDF file</A> of the entire manual, generated by
<A HREF = "http://freecode.com/projects/htmldoc">htmldoc</A>
</P>
<OL><LI><A HREF = "Section_intro.html">Introduction</A>
<UL> 1.1 <A HREF = "Section_intro.html#intro_1">What is LAMMPS</A>
<BR>
1.2 <A HREF = "Section_intro.html#intro_2">LAMMPS features</A>
<BR>
1.3 <A HREF = "Section_intro.html#intro_3">LAMMPS non-features</A>
<BR>
1.4 <A HREF = "Section_intro.html#intro_4">Open source distribution</A>
<BR>
1.5 <A HREF = "Section_intro.html#intro_5">Acknowledgments and citations</A>
<BR></UL>
<LI><A HREF = "Section_start.html">Getting started</A>
<UL> 2.1 <A HREF = "Section_start.html#start_1">What's in the LAMMPS distribution</A>
<BR>
2.2 <A HREF = "Section_start.html#start_2">Making LAMMPS</A>
<BR>
2.3 <A HREF = "Section_start.html#start_3">Making LAMMPS with optional packages</A>
<BR>
2.4 <A HREF = "Section_start.html#start_4">Building LAMMPS via the Make.py script</A>
<BR>
2.5 <A HREF = "Section_start.html#start_5">Building LAMMPS as a library</A>
<BR>
2.6 <A HREF = "Section_start.html#start_6">Running LAMMPS</A>
<BR>
2.7 <A HREF = "Section_start.html#start_7">Command-line options</A>
<BR>
2.8 <A HREF = "Section_start.html#start_8">Screen output</A>
<BR>
2.9 <A HREF = "Section_start.html#start_9">Tips for users of previous versions</A>
<BR></UL>
<LI><A HREF = "Section_commands.html">Commands</A>
<UL> 3.1 <A HREF = "Section_commands.html#cmd_1">LAMMPS input script</A>
<BR>
3.2 <A HREF = "Section_commands.html#cmd_2">Parsing rules</A>
<BR>
3.3 <A HREF = "Section_commands.html#cmd_3">Input script structure</A>
<BR>
3.4 <A HREF = "Section_commands.html#cmd_4">Commands listed by category</A>
<BR>
3.5 <A HREF = "Section_commands.html#cmd_5">Commands listed alphabetically</A>
<BR></UL>
<LI><A HREF = "Section_packages.html">Packages</A>
<UL> 4.1 <A HREF = "Section_packages.html#pkg_1">Standard packages</A>
<BR>
4.2 <A HREF = "Section_packages.html#pkg_2">User packages</A>
<BR></UL>
<LI><A HREF = "Section_accelerate.html">Accelerating LAMMPS performance</A>
<UL> 5.1 <A HREF = "Section_accelerate.html#acc_1">Measuring performance</A>
<BR>
5.2 <A HREF = "Section_accelerate.html#acc_2">Algorithms and code options to boost performace</A>
<BR>
5.3 <A HREF = "Section_accelerate.html#acc_3">Accelerator packages with optimized styles</A>
<BR>
<UL> 5.3.1 <A HREF = "accelerate_cuda.html">USER-CUDA package</A>
<BR>
5.3.2 <A HREF = "accelerate_gpu.html">GPU package</A>
<BR>
5.3.3 <A HREF = "accelerate_intel.html">USER-INTEL package</A>
<BR>
5.3.4 <A HREF = "accelerate_kokkos.html">KOKKOS package</A>
<BR>
5.3.5 <A HREF = "accelerate_omp.html">USER-OMP package</A>
<BR>
5.3.6 <A HREF = "accelerate_opt.html">OPT package</A>
<BR></UL>
5.4 <A HREF = "Section_accelerate.html#acc_4">Comparison of various accelerator packages</A>
<BR></UL>
<LI><A HREF = "Section_howto.html">How-to discussions</A>
<UL> 6.1 <A HREF = "Section_howto.html#howto_1">Restarting a simulation</A>
<BR>
6.2 <A HREF = "Section_howto.html#howto_2">2d simulations</A>
<BR>
6.3 <A HREF = "Section_howto.html#howto_3">CHARMM and AMBER force fields</A>
<BR>
6.4 <A HREF = "Section_howto.html#howto_4">Running multiple simulations from one input script</A>
<BR>
6.5 <A HREF = "Section_howto.html#howto_5">Multi-replica simulations</A>
<BR>
6.6 <A HREF = "Section_howto.html#howto_6">Granular models</A>
<BR>
6.7 <A HREF = "Section_howto.html#howto_7">TIP3P water model</A>
<BR>
6.8 <A HREF = "Section_howto.html#howto_8">TIP4P water model</A>
<BR>
6.9 <A HREF = "Section_howto.html#howto_9">SPC water model</A>
<BR>
6.10 <A HREF = "Section_howto.html#howto_10">Coupling LAMMPS to other codes</A>
<BR>
6.11 <A HREF = "Section_howto.html#howto_11">Visualizing LAMMPS snapshots</A>
<BR>
6.12 <A HREF = "Section_howto.html#howto_12">Triclinic (non-orthogonal) simulation boxes</A>
<BR>
6.13 <A HREF = "Section_howto.html#howto_13">NEMD simulations</A>
<BR>
6.14 <A HREF = "Section_howto.html#howto_14">Finite-size spherical and aspherical particles</A>
<BR>
6.15 <A HREF = "Section_howto.html#howto_15">Output from LAMMPS (thermo, dumps, computes, fixes, variables)</A>
<BR>
6.16 <A HREF = "Section_howto.html#howto_16">Thermostatting, barostatting, and compute temperature</A>
<BR>
6.17 <A HREF = "Section_howto.html#howto_17">Walls</A>
<BR>
6.18 <A HREF = "Section_howto.html#howto_18">Elastic constants</A>
<BR>
6.19 <A HREF = "Section_howto.html#howto_19">Library interface to LAMMPS</A>
<BR>
6.20 <A HREF = "Section_howto.html#howto_20">Calculating thermal conductivity</A>
<BR>
6.21 <A HREF = "Section_howto.html#howto_21">Calculating viscosity</A>
<BR>
6.22 <A HREF = "howto_22">Calculating a diffusion coefficient</A>
<BR>
6.23 <A HREF = "howto_23">Using chunks to calculate system properties</A>
<BR>
6.24 <A HREF = "howto_24">Setting parameters for pppm/disp</A>
<BR>
6.25 <A HREF = "howto_25">Adiabatic core/shell model</A>
<BR></UL>
<LI><A HREF = "Section_example.html">Example problems</A>
<LI><A HREF = "Section_perf.html">Performance & scalability</A>
<LI><A HREF = "Section_tools.html">Additional tools</A>
<LI><A HREF = "Section_modify.html">Modifying & extending LAMMPS</A>
<UL> 10.1 <A HREF = "Section_modify.html#mod_1">Atom styles</A>
<BR>
10.2 <A HREF = "Section_modify.html#mod_2">Bond, angle, dihedral, improper potentials</A>
<BR>
10.3 <A HREF = "Section_modify.html#mod_3">Compute styles</A>
<BR>
10.4 <A HREF = "Section_modify.html#mod_4">Dump styles</A>
<BR>
10.5 <A HREF = "Section_modify.html#mod_5">Dump custom output options</A>
<BR>
10.6 <A HREF = "Section_modify.html#mod_6">Fix styles</A>
<BR>
10.7 <A HREF = "Section_modify.html#mod_7">Input script commands</A>
<BR>
10.8 <A HREF = "Section_modify.html#mod_8">Kspace computations</A>
<BR>
10.9 <A HREF = "Section_modify.html#mod_9">Minimization styles</A>
<BR>
10.10 <A HREF = "Section_modify.html#mod_10">Pairwise potentials</A>
<BR>
10.11 <A HREF = "Section_modify.html#mod_11">Region styles</A>
<BR>
10.12 <A HREF = "Section_modify.html#mod_12">Body styles</A>
<BR>
10.13 <A HREF = "Section_modify.html#mod_13">Thermodynamic output options</A>
<BR>
10.14 <A HREF = "Section_modify.html#mod_14">Variable options</A>
<BR>
10.15 <A HREF = "Section_modify.html#mod_15">Submitting new features for inclusion in LAMMPS</A>
<BR></UL>
<LI><A HREF = "Section_python.html">Python interface</A>
<UL> 11.1 <A HREF = "Section_python.html#py_1">Overview of running LAMMPS from Python</A>
<BR>
11.2 <A HREF = "Section_python.html#py_2">Overview of using Python from a LAMMPS script</A>
<BR>
11.3 <A HREF = "Section_python.html#py_3">Building LAMMPS as a shared library</A>
<BR>
11.4 <A HREF = "Section_python.html#py_4">Installing the Python wrapper into Python</A>
<BR>
11.5 <A HREF = "Section_python.html#py_5">Extending Python with MPI to run in parallel</A>
<BR>
11.6 <A HREF = "Section_python.html#py_6">Testing the Python-LAMMPS interface</A>
<BR>
11.7 <A HREF = "py_7">Using LAMMPS from Python</A>
<BR>
11.8 <A HREF = "py_8">Example Python scripts that use LAMMPS</A>
<BR></UL>
<LI><A HREF = "Section_errors.html">Errors</A>
<UL> 12.1 <A HREF = "Section_errors.html#err_1">Common problems</A>
<BR>
12.2 <A HREF = "Section_errors.html#err_2">Reporting bugs</A>
<BR>
12.3 <A HREF = "Section_errors.html#err_3">Error & warning messages</A>
<BR></UL>
<LI><A HREF = "Section_history.html">Future and history</A>
<UL> 13.1 <A HREF = "Section_history.html#hist_1">Coming attractions</A>
<BR>
13.2 <A HREF = "Section_history.html#hist_2">Past versions</A>
<BR></UL>
</OL>
</BODY>
</HTML>
diff --git a/doc/Manual.txt b/doc/Manual.txt
index 209e9d7ee..07a347266 100644
--- a/doc/Manual.txt
+++ b/doc/Manual.txt
@@ -1,277 +1,277 @@
<HEAD>
<TITLE>LAMMPS-ICMS Users Manual</TITLE>
-<META NAME="docnumber" CONTENT="8 Jul 2015 version">
+<META NAME="docnumber" CONTENT="15 Jul 2015 version">
<META NAME="author" CONTENT="http://lammps.sandia.gov - Sandia National Laboratories">
<META NAME="copyright" CONTENT="Copyright (2003) Sandia Corporation. This software and manual is distributed under the GNU General Public License.">
</HEAD>
<BODY>
"LAMMPS WWW Site"_lws - "LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c
:link(lws,http://lammps.sandia.gov)
:link(ld,Manual.html)
:link(lc,Section_commands.html#comm)
:line
<H1></H1>
LAMMPS-ICMS Documentation :c,h3
-8 Jul 2015 version :c,h4
+15 Jul 2015 version :c,h4
Version info: :h4
The LAMMPS "version" is the date when it was released, such as 1 May
2010. LAMMPS is updated continuously. Whenever we fix a bug or add a
feature, we release it immediately, and post a notice on "this page of
the WWW site"_bug. Each dated copy of LAMMPS contains all the
features and bug-fixes up to and including that version date. The
version date is printed to the screen and logfile every time you run
LAMMPS. It is also in the file src/version.h and in the LAMMPS
directory name created when you unpack a tarball, and at the top of
the first page of the manual (this page).
LAMMPS-ICMS is an experimental variant of LAMMPS with additional
features made available for testing before they will be submitted
for inclusion into the official LAMMPS tree. The source code is
based on the official LAMMPS svn repository mirror at the Institute
for Computational Molecular Science at Temple University and generally
kept up-to-date as much as possible. Sometimes, e.g. when additional
development work is needed to adapt the upstream changes into
LAMMPS-ICMS it can take longer until synchronization; and occasionally,
e.g. in case of the rewrite of the multi-threading support, the
development will be halted except for important bugfixes until
all features of LAMMPS-ICMS fully compatible with the upstream
version or replaced by alternate implementations.
If you browse the HTML doc pages on the LAMMPS WWW site, they always
describe the most current version of upstream LAMMPS, but may be
missing some new features in LAMMPS-ICMS. :ulb,l
If you browse the HTML doc pages included in your tarball, they
describe the version you have, however, not all new features in
LAMMPS-ICMS are documented immediately. :l
The "PDF file"_Manual.pdf on the WWW site or in the tarball is updated
about once per month. This is because it is large, and we don't want
it to be part of every patch. :l
There is also a "Developer.pdf"_Developer.pdf file in the doc
directory, which describes the internal structure and algorithms of
LAMMPS. :ule,l
LAMMPS stands for Large-scale Atomic/Molecular Massively Parallel
Simulator.
LAMMPS is a classical molecular dynamics simulation code designed to
run efficiently on parallel computers. It was developed at Sandia
National Laboratories, a US Department of Energy facility, with
funding from the DOE. It is an open-source code, distributed freely
under the terms of the GNU Public License (GPL).
The primary developers of LAMMPS are "Steve Plimpton"_sjp, Aidan
Thompson, and Paul Crozier who can be contacted at
sjplimp,athomps,pscrozi at sandia.gov. The "LAMMPS WWW Site"_lws at
http://lammps.sandia.gov has more information about the code and its
uses.
:link(bug,http://lammps.sandia.gov/bug.html)
:link(sjp,http://www.sandia.gov/~sjplimp)
:line
The LAMMPS documentation is organized into the following sections. If
you find errors or omissions in this manual or have suggestions for
useful information to add, please send an email to the developers so
we can improve the LAMMPS documentation.
Once you are familiar with LAMMPS, you may want to bookmark "this
page"_Section_commands.html#comm at Section_commands.html#comm since
it gives quick access to documentation for all LAMMPS commands.
"PDF file"_Manual.pdf of the entire manual, generated by
"htmldoc"_http://freecode.com/projects/htmldoc
"Introduction"_Section_intro.html :olb,l
1.1 "What is LAMMPS"_intro_1 :ulb,b
1.2 "LAMMPS features"_intro_2 :b
1.3 "LAMMPS non-features"_intro_3 :b
1.4 "Open source distribution"_intro_4 :b
1.5 "Acknowledgments and citations"_intro_5 :ule,b
"Getting started"_Section_start.html :l
2.1 "What's in the LAMMPS distribution"_start_1 :ulb,b
2.2 "Making LAMMPS"_start_2 :b
2.3 "Making LAMMPS with optional packages"_start_3 :b
2.4 "Building LAMMPS via the Make.py script"_start_4 :b
2.5 "Building LAMMPS as a library"_start_5 :b
2.6 "Running LAMMPS"_start_6 :b
2.7 "Command-line options"_start_7 :b
2.8 "Screen output"_start_8 :b
2.9 "Tips for users of previous versions"_start_9 :ule,b
"Commands"_Section_commands.html :l
3.1 "LAMMPS input script"_cmd_1 :ulb,b
3.2 "Parsing rules"_cmd_2 :b
3.3 "Input script structure"_cmd_3 :b
3.4 "Commands listed by category"_cmd_4 :b
3.5 "Commands listed alphabetically"_cmd_5 :ule,b
"Packages"_Section_packages.html :l
4.1 "Standard packages"_pkg_1 :ulb,b
4.2 "User packages"_pkg_2 :ule,b
"Accelerating LAMMPS performance"_Section_accelerate.html :l
5.1 "Measuring performance"_acc_1 :ulb,b
5.2 "Algorithms and code options to boost performace"_acc_2 :b
5.3 "Accelerator packages with optimized styles"_acc_3 :b
5.3.1 "USER-CUDA package"_accelerate_cuda.html :ulb,b
5.3.2 "GPU package"_accelerate_gpu.html :b
5.3.3 "USER-INTEL package"_accelerate_intel.html :b
5.3.4 "KOKKOS package"_accelerate_kokkos.html :b
5.3.5 "USER-OMP package"_accelerate_omp.html :b
5.3.6 "OPT package"_accelerate_opt.html :ule,b
5.4 "Comparison of various accelerator packages"_acc_4 :ule,b
"How-to discussions"_Section_howto.html :l
6.1 "Restarting a simulation"_howto_1 :ulb,b
6.2 "2d simulations"_howto_2 :b
6.3 "CHARMM and AMBER force fields"_howto_3 :b
6.4 "Running multiple simulations from one input script"_howto_4 :b
6.5 "Multi-replica simulations"_howto_5 :b
6.6 "Granular models"_howto_6 :b
6.7 "TIP3P water model"_howto_7 :b
6.8 "TIP4P water model"_howto_8 :b
6.9 "SPC water model"_howto_9 :b
6.10 "Coupling LAMMPS to other codes"_howto_10 :b
6.11 "Visualizing LAMMPS snapshots"_howto_11 :b
6.12 "Triclinic (non-orthogonal) simulation boxes"_howto_12 :b
6.13 "NEMD simulations"_howto_13 :b
6.14 "Finite-size spherical and aspherical particles"_howto_14 :b
6.15 "Output from LAMMPS (thermo, dumps, computes, fixes, variables)"_howto_15 :b
6.16 "Thermostatting, barostatting, and compute temperature"_howto_16 :b
6.17 "Walls"_howto_17 :b
6.18 "Elastic constants"_howto_18 :b
6.19 "Library interface to LAMMPS"_howto_19 :b
6.20 "Calculating thermal conductivity"_howto_20 :b
6.21 "Calculating viscosity"_howto_21 :b
6.22 "Calculating a diffusion coefficient"_howto_22 :b
6.23 "Using chunks to calculate system properties"_howto_23 :b
6.24 "Setting parameters for pppm/disp"_howto_24 :b
6.25 "Adiabatic core/shell model"_howto_25 :ule,b
"Example problems"_Section_example.html :l
"Performance & scalability"_Section_perf.html :l
"Additional tools"_Section_tools.html :l
"Modifying & extending LAMMPS"_Section_modify.html :l
10.1 "Atom styles"_mod_1 :ulb,b
10.2 "Bond, angle, dihedral, improper potentials"_mod_2 :b
10.3 "Compute styles"_mod_3 :b
10.4 "Dump styles"_mod_4 :b
10.5 "Dump custom output options"_mod_5 :b
10.6 "Fix styles"_mod_6 :b
10.7 "Input script commands"_mod_7 :b
10.8 "Kspace computations"_mod_8 :b
10.9 "Minimization styles"_mod_9 :b
10.10 "Pairwise potentials"_mod_10 :b
10.11 "Region styles"_mod_11 :b
10.12 "Body styles"_mod_12 :b
10.13 "Thermodynamic output options"_mod_13 :b
10.14 "Variable options"_mod_14 :b
10.15 "Submitting new features for inclusion in LAMMPS"_mod_15 :ule,b
"Python interface"_Section_python.html :l
11.1 "Overview of running LAMMPS from Python"_py_1 :ulb,b
11.2 "Overview of using Python from a LAMMPS script"_py_2 :b
11.3 "Building LAMMPS as a shared library"_py_3 :b
11.4 "Installing the Python wrapper into Python"_py_4 :b
11.5 "Extending Python with MPI to run in parallel"_py_5 :b
11.6 "Testing the Python-LAMMPS interface"_py_6 :b
11.7 "Using LAMMPS from Python"_py_7 :b
11.8 "Example Python scripts that use LAMMPS"_py_8 :ule,b
"Errors"_Section_errors.html :l
12.1 "Common problems"_err_1 :ulb,b
12.2 "Reporting bugs"_err_2 :b
12.3 "Error & warning messages"_err_3 :ule,b
"Future and history"_Section_history.html :l
13.1 "Coming attractions"_hist_1 :ulb,b
13.2 "Past versions"_hist_2 :ule,b
:ole
:link(intro_1,Section_intro.html#intro_1)
:link(intro_2,Section_intro.html#intro_2)
:link(intro_3,Section_intro.html#intro_3)
:link(intro_4,Section_intro.html#intro_4)
:link(intro_5,Section_intro.html#intro_5)
:link(start_1,Section_start.html#start_1)
:link(start_2,Section_start.html#start_2)
:link(start_3,Section_start.html#start_3)
:link(start_4,Section_start.html#start_4)
:link(start_5,Section_start.html#start_5)
:link(start_6,Section_start.html#start_6)
:link(start_7,Section_start.html#start_7)
:link(start_8,Section_start.html#start_8)
:link(start_9,Section_start.html#start_9)
:link(cmd_1,Section_commands.html#cmd_1)
:link(cmd_2,Section_commands.html#cmd_2)
:link(cmd_3,Section_commands.html#cmd_3)
:link(cmd_4,Section_commands.html#cmd_4)
:link(cmd_5,Section_commands.html#cmd_5)
:link(pkg_1,Section_packages.html#pkg_1)
:link(pkg_2,Section_packages.html#pkg_2)
:link(acc_1,Section_accelerate.html#acc_1)
:link(acc_2,Section_accelerate.html#acc_2)
:link(acc_3,Section_accelerate.html#acc_3)
:link(acc_4,Section_accelerate.html#acc_4)
:link(howto_1,Section_howto.html#howto_1)
:link(howto_2,Section_howto.html#howto_2)
:link(howto_3,Section_howto.html#howto_3)
:link(howto_4,Section_howto.html#howto_4)
:link(howto_5,Section_howto.html#howto_5)
:link(howto_6,Section_howto.html#howto_6)
:link(howto_7,Section_howto.html#howto_7)
:link(howto_8,Section_howto.html#howto_8)
:link(howto_9,Section_howto.html#howto_9)
:link(howto_10,Section_howto.html#howto_10)
:link(howto_11,Section_howto.html#howto_11)
:link(howto_12,Section_howto.html#howto_12)
:link(howto_13,Section_howto.html#howto_13)
:link(howto_14,Section_howto.html#howto_14)
:link(howto_15,Section_howto.html#howto_15)
:link(howto_16,Section_howto.html#howto_16)
:link(howto_17,Section_howto.html#howto_17)
:link(howto_18,Section_howto.html#howto_18)
:link(howto_19,Section_howto.html#howto_19)
:link(howto_20,Section_howto.html#howto_20)
:link(howto_21,Section_howto.html#howto_21)
:link(mod_1,Section_modify.html#mod_1)
:link(mod_2,Section_modify.html#mod_2)
:link(mod_3,Section_modify.html#mod_3)
:link(mod_4,Section_modify.html#mod_4)
:link(mod_5,Section_modify.html#mod_5)
:link(mod_6,Section_modify.html#mod_6)
:link(mod_7,Section_modify.html#mod_7)
:link(mod_8,Section_modify.html#mod_8)
:link(mod_9,Section_modify.html#mod_9)
:link(mod_10,Section_modify.html#mod_10)
:link(mod_11,Section_modify.html#mod_11)
:link(mod_12,Section_modify.html#mod_12)
:link(mod_13,Section_modify.html#mod_13)
:link(mod_14,Section_modify.html#mod_14)
:link(mod_15,Section_modify.html#mod_15)
:link(py_1,Section_python.html#py_1)
:link(py_2,Section_python.html#py_2)
:link(py_3,Section_python.html#py_3)
:link(py_4,Section_python.html#py_4)
:link(py_5,Section_python.html#py_5)
:link(py_6,Section_python.html#py_6)
:link(err_1,Section_errors.html#err_1)
:link(err_2,Section_errors.html#err_2)
:link(err_3,Section_errors.html#err_3)
:link(hist_1,Section_history.html#hist_1)
:link(hist_2,Section_history.html#hist_2)
</BODY>
diff --git a/doc/accelerate_kokkos.html b/doc/accelerate_kokkos.html
index fa3c98cef..45cfa5824 100644
--- a/doc/accelerate_kokkos.html
+++ b/doc/accelerate_kokkos.html
@@ -1,514 +1,515 @@
<HTML>
<CENTER><A HREF = "Section_packages.html">Previous Section</A> - <A HREF = "http://lammps.sandia.gov">LAMMPS WWW Site</A> -
<A HREF = "Manual.html">LAMMPS Documentation</A> - <A HREF = "Section_commands.html#comm">LAMMPS Commands</A>
</CENTER>
<HR>
<P><A HREF = "Section_accelerate.html">Return to Section accelerate overview</A>
</P>
<H4>5.3.4 KOKKOS package
</H4>
<P>The KOKKOS package was developed primaritly by Christian Trott
(Sandia) with contributions of various styles by others, including
-Sikandar Mashayak (UIUC). The underlying Kokkos library was written
+Sikandar Mashayak (UIUC), Stan Moore (Sandia), and Ray Shan (Sandia).
+The underlying Kokkos library was written
primarily by Carter Edwards, Christian Trott, and Dan Sunderland (all
Sandia).
</P>
<P>The KOKKOS package contains versions of pair, fix, and atom styles
that use data structures and macros provided by the Kokkos library,
which is included with LAMMPS in lib/kokkos.
</P>
<P>The Kokkos library is part of
-<A HREF = "http://trilinos.sandia.gov/packages/kokkos">Trilinos</A> and is a
+<A HREF = "http://trilinos.sandia.gov/packages/kokkos">Trilinos</A> and can also
+be downloaded from <A HREF = "https://github.com/kokkos/kokkos">Github</A>. Kokkos is a
templated C++ library that provides two key abstractions for an
application like LAMMPS. First, it allows a single implementation of
an application kernel (e.g. a pair style) to run efficiently on
different kinds of hardware, such as a GPU, Intel Phi, or many-core
chip.
</P>
<P>The Kokkos library also provides data abstractions to adjust (at
compile time) the memory layout of basic data structures like 2d and
3d arrays and allow the transparent utilization of special hardware
load and store operations. Such data structures are used in LAMMPS to
store atom coordinates or forces or neighbor lists. The layout is
chosen to optimize performance on different platforms. Again this
functionality is hidden from the developer, and does not affect how
the kernel is coded.
</P>
<P>These abstractions are set at build time, when LAMMPS is compiled with
the KOKKOS package installed. This is done by selecting a "host" and
"device" to build for, compatible with the compute nodes in your
machine (one on a desktop machine or 1000s on a supercomputer).
</P>
<P>All Kokkos operations occur within the context of an individual MPI
task running on a single node of the machine. The total number of MPI
tasks used by LAMMPS (one or multiple per compute node) is set in the
usual manner via the mpirun or mpiexec commands, and is independent of
Kokkos.
</P>
<P>Kokkos provides support for two different modes of execution per MPI
task. This means that computational tasks (pairwise interactions,
neighbor list builds, time integration, etc) can be parallelized for
one or the other of the two modes. The first mode is called the
"host" and is one or more threads running on one or more physical CPUs
(within the node). Currently, both multi-core CPUs and an Intel Phi
processor (running in native mode, not offload mode like the
USER-INTEL package) are supported. The second mode is called the
"device" and is an accelerator chip of some kind. Currently only an
NVIDIA GPU is supported via Cuda. If your compute node does not have
a GPU, then there is only one mode of execution, i.e. the host and
device are the same.
</P>
<P>When using the KOKKOS package, you must choose at build time whether
you are building for OpenMP, GPU, or for using the Xeon Phi in native
mode.
</P>
<P>Here is a quick overview of how to use the KOKKOS package:
</P>
-<UL><LI>specify variables and settings in your Makefile.machine that enable OpenMP, GPU, or Phi support
-<LI>include the KOKKOS package and build LAMMPS
-<LI>enable the KOKKOS package and its hardware options via the "-k on" command-line switch
-<LI>use KOKKOS styles in your input script
+<UL><LI>specify variables and settings in your Makefile.machine that enable
+OpenMP, GPU, or Phi support
+<LI>include the KOKKOS package and build LAMMPS
+<LI>enable the KOKKOS package and its hardware options via the "-k on"
+command-line switch
+<LI>use KOKKOS styles in your input script
</UL>
<P>The latter two steps can be done using the "-k on", "-pk kokkos" and
"-sf kk" <A HREF = "Section_start.html#start_7">command-line switches</A>
respectively. Or the effect of the "-pk" or "-sf" switches can be
duplicated by adding the <A HREF = "package.html">package kokkos</A> or <A HREF = "suffix.html">suffix
kk</A> commands respectively to your input script.
</P>
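<P>For example, a command line like the following (a minimal sketch
using the lmp_g++ executable and in.lj input script from the examples
later in this section)
</P>
<PRE>mpirun -np 2 lmp_g++ -k on t 6 -sf kk -pk kokkos newton on -in in.lj
</PRE>
<P>has the same effect as running with only the "-k on t 6" switch and
placing these two commands at the top of in.lj:
</P>
<PRE>package kokkos newton on
suffix kk
</PRE>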
<P><B>Required hardware/software:</B>
</P>
<P>The KOKKOS package can be used to build and run LAMMPS on the
following kinds of hardware:
</P>
-<UL><LI>CPU-only: one MPI task per CPU core (MPI-only, but using KOKKOS styles)
-<LI>CPU-only: one or a few MPI tasks per node with additional threading via OpenMP
-<LI>Phi: on one or more Intel Phi coprocessors (per node)
-<LI>GPU: on the GPUs of a node with additional OpenMP threading on the CPUs
+<UL><LI>CPU-only: one MPI task per CPU core (MPI-only, but using KOKKOS
+styles)
+<LI>CPU-only: one or a few MPI tasks per node with additional threading
+via OpenMP
+<LI>Phi: on one or more Intel Phi coprocessors (per node)
+<LI>GPU: on the GPUs of a node with additional OpenMP threading on the CPUs
</UL>
<P>Note that Intel Xeon Phi coprocessors are supported in "native" mode,
not "offload" mode like the USER-INTEL package supports.
</P>
<P>Only NVIDIA GPUs are currently supported.
</P>
<P>IMPORTANT NOTE: For good performance of the KOKKOS package on GPUs,
you must have Kepler generation GPUs (or later). The Kokkos library
exploits texture cache options not supported by Telsa generation GPUs
(or older).
</P>
<P>To build the KOKKOS package for GPUs, NVIDIA Cuda software must be
installed on your system. See the discussion above for the USER-CUDA
and GPU packages for details of how to check and do this.
</P>
<P><B>Building LAMMPS with the KOKKOS package:</B>
</P>
<P>You must choose at build time whether to build for OpenMP, Cuda, or
Phi.
</P>
<P>You can do any of these in one line, using the src/Make.py script,
described in <A HREF = "Section_start.html#start_4">Section 2.4</A> of the manual.
Type "Make.py -h" for help. If run from the src directory, these
commands will create src/lmp_kokkos_omp, lmp_kokkos_cuda, and
lmp_kokkos_phi. The OMP and PHI options use src/MAKE/Makefile.mpi as
the starting Makefile.machine. The CUDA option uses
src/MAKE/OPTIONS/Makefile.cuda since the NVIDIA nvcc compiler is
required.
</P>
<P>Make.py -p kokkos -kokkos omp -o kokkos_omp file mpi
Make.py -p kokkos -kokkos cuda arch=31 -o kokkos_cuda file kokkos_cuda
Make.py -p kokkos -kokkos phi -o kokkos_phi file mpi
</P>
<P>Or you can follow these steps:
</P>
<P>CPU-only (run all-MPI or with OpenMP threading):
</P>
<PRE>cd lammps/src
make yes-kokkos
-make g++ OMP=yes
+make g++ KOKKOS_DEVICES=OpenMP
</PRE>
<P>Intel Xeon Phi:
</P>
<PRE>cd lammps/src
make yes-kokkos
-make g++ OMP=yes MIC=yes
+make g++ KOKKOS_DEVICES=OpenMP KOKKOS_ARCH=KNC
</PRE>
<P>CPUs and GPUs:
</P>
<PRE>cd lammps/src
make yes-kokkos
-make cuda CUDA=yes
+make cuda KOKKOS_DEVICES=Cuda
</PRE>
<P>These examples set the KOKKOS-specific OMP, MIC, CUDA variables on the
make command line which requires a GNU-compatible make command. Try
"gmake" if your system's standard make complains.
</P>
<P>IMPORTANT NOTE: If you build using make line variables and re-build
LAMMPS twice with different KOKKOS options and the *same* target,
e.g. g++ in the first two examples above, then you *must* perform a
"make clean-all" or "make clean-machine" before each build. This is
to force all the KOKKOS-dependent files to be re-compiled with the new
options.
</P>
<P>You can also hardwire these make variables in the specified machine
makefile, e.g. src/MAKE/Makefile.g++ in the first two examples above,
with a line like:
</P>
-<PRE>MIC = yes
+<PRE>KOKKOS_ARCH = KNC
</PRE>
<P>Note that if you build LAMMPS multiple times in this manner, using
different KOKKOS options (defined in different machine makefiles), you
do not have to worry about doing a "clean" in between. This is
because the targets will be different.
</P>
<P>IMPORTANT NOTE: The 3rd example above for a GPU, uses a different
machine makefile, in this case src/MAKE/Makefile.cuda, which is
included in the LAMMPS distribution. To build the KOKKOS package for
a GPU, this makefile must use the NVIDA "nvcc" compiler. And it must
-have a CCFLAGS -arch setting that is appropriate for your NVIDIA
-hardware and installed software. Typical values for -arch are given
-in <A HREF = "Section_start.html#start_3_4">Section 2.3.4</A> of the manual, as well
+have a KOKKOS_ARCH setting that is appropriate for your NVIDIA
+hardware and installed software. Typical values for KOKKOS_ARCH are given
+below, as well
as other settings that must be included in the machine makefile, if
you create your own.
</P>
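<P>As a sketch of what such a makefile setting can look like (Kepler35 is
only one of the KOKKOS_ARCH values listed below; pick the one that
matches your GPU), the relevant lines in a machine makefile such as
src/MAKE/Makefile.cuda would be along these lines:
</P>
<PRE>KOKKOS_DEVICES = Cuda
KOKKOS_ARCH = Kepler35
</PRE>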
<P>IMPORTANT NOTE: Currently, there are no precision options with the
KOKKOS package. All compilation and computation is performed in
double precision.
</P>
<P>There are other allowed options when building with the KOKKOS package.
As above, they can be set either as variables on the make command line
or in Makefile.machine. This is the full list of options, including
-those discussed above, Each takes a value of <I>yes</I> or <I>no</I>. The
+those discussed above. Each takes a value shown below. The
default value is listed, which is set in the
-lib/kokkos/Makefile.lammps file.
-</P>
-<UL><LI>OMP, default = <I>yes</I>
-<LI>CUDA, default = <I>no</I>
-<LI>HWLOC, default = <I>no</I>
-<LI>AVX, default = <I>no</I>
-<LI>MIC, default = <I>no</I>
-<LI>LIBRT, default = <I>no</I>
-<LI>DEBUG, default = <I>no</I>
+lib/kokkos/Makefile.kokkos file.
+</P>
+<P>#Default settings specific options
+#Options: force_uvm,use_ldg,rdc
+</P>
+<UL><LI>KOKKOS_DEVICES, values = <I>OpenMP</I>, <I>Serial</I>, <I>Pthreads</I>, <I>Cuda</I>, default = <I>OpenMP</I>
+<LI>KOKKOS_ARCH, values = <I>KNC</I>, <I>SNB</I>, <I>HSW</I>, <I>Kepler</I>, <I>Kepler30</I>, <I>Kepler32</I>, <I>Kepler35</I>,
+<I>Kepler37</I>, <I>Maxwell</I>, <I>Maxwell50</I>, <I>Maxwell52</I>, <I>Maxwell53</I>, <I>ARMv8</I>, <I>BGQ</I>, <I>Power7</I>, <I>Power8</I>,
+default = <I>none</I>
+<LI>KOKKOS_DEBUG, values = <I>yes</I>, <I>no</I>, default = <I>no</I>
+<LI>KOKKOS_USE_TPLS, values = <I>hwloc</I>, <I>librt</I>, default = <I>none</I>
+<LI>KOKKOS_CUDA_OPTIONS, values = <I>force_uvm</I>, <I>use_ldg</I>, <I>rdc</I>
</UL>
-<P>OMP sets the parallelization method used for Kokkos code (within
-LAMMPS) that runs on the host. OMP=yes means that OpenMP will be
-used. OMP=no means that pthreads will be used.
-</P>
-<P>CUDA sets the parallelization method used for Kokkos code (within
-LAMMPS) that runs on the device. CUDA=yes means an NVIDIA GPU running
-CUDA will be used. CUDA=no means that the OMP=yes or OMP=no setting
-will be used for the device as well as the host.
-</P>
-<P>If CUDA=yes, then the lo-level Makefile in the src/MAKE directory must
-use "nvcc" as its compiler, via its CC setting. For best performance
-its CCFLAGS setting should use -O3 and have an -arch setting that
-matches the compute capability of your NVIDIA hardware and software
-installation, e.g. -arch=sm_20. Generally Fermi Generation GPUs are
-sm_20, while Kepler generation GPUs are sm_30 or sm_35 and Maxwell
-cards are sm_50. A complete list can be found on
-<A HREF = "http://en.wikipedia.org/wiki/CUDA#Supported_GPUs">wikipedia</A>. You can
-also use the deviceQuery tool that comes with the CUDA samples. Note
+<P>KOKKOS_DEVICES sets the parallelization method used for Kokkos code (within
+LAMMPS). KOKKOS_DEVICES=OpenMP means that OpenMP will be
+used. KOKKOS_DEVICES=Pthreads means that pthreads will be used.
+KOKKOS_DEVICES=Cuda means an NVIDIA GPU running
+CUDA will be used.
+</P>
+<P>If KOKKOS_DEVICES=Cuda, then the lo-level Makefile in the src/MAKE
+directory must use "nvcc" as its compiler, via its CC setting. For
+best performance its CCFLAGS setting should use -O3 and have a
+KOKKOS_ARCH setting that matches the compute capability of your NVIDIA
+hardware and software installation, e.g. KOKKOS_ARCH=Kepler30. Note
the minimal required compute capability is 2.0, but this will give
signicantly reduced performance compared to Kepler generation GPUs
with compute capability 3.x. For the LINK setting, "nvcc" should not
be used; instead use g++ or another compiler suitable for linking C++
applications. Often you will want to use your MPI compiler wrapper
for this setting (i.e. mpicxx). Finally, the lo-level Makefile must
also have a "Compilation rule" for creating *.o files from *.cu files.
See src/Makefile.cuda for an example of a lo-level Makefile with all
of these settings.
</P>
-<P>HWLOC binds threads to hardware cores, so they do not migrate during a
-simulation. HWLOC=yes should always be used if running with OMP=no
-for pthreads. It is not necessary for OMP=yes for OpenMP, because
-OpenMP provides alternative methods via environment variables for
-binding threads to hardware cores. More info on binding threads to
-cores is given in <A HREF = "Section_accelerate.html#acc_8">this section</A>.
+<P>KOKKOS_USE_TPLS=hwloc binds threads to hardware cores, so they do not
+migrate during a simulation. KOKKOS_USE_TPLS=hwloc should always be
+used if running with KOKKOS_DEVICES=Pthreads for pthreads. It is not
+necessary for KOKKOS_DEVICES=OpenMP for OpenMP, because OpenMP
+provides alternative methods via environment variables for binding
+threads to hardware cores. More info on binding threads to cores is
+given in <A HREF = "Section_accelerate.html#acc_8">this section</A>.
</P>
-<P>AVX enables Intel advanced vector extensions when compiling for an
-Intel-compatible chip. AVX=yes should only be set if your host
-hardware supports AVX. If it does not support it, this will cause a
-run-time crash.
+<P>KOKKOS_ARCH=KNC enables compiler switches needed when compiling for an
+Intel Phi processor.
</P>
-<P>MIC enables compiler switches needed when compling for an Intel Phi
-processor.
+<P>KOKKOS_USE_TPLS=librt enables use of a more accurate timer mechanism
+on most Unix platforms. This library is not available on all
+platforms.
</P>
-<P>LIBRT enables use of a more accurate timer mechanism on most Unix
-platforms. This library is not available on all platforms.
+<P>KOKKOS_DEBUG is only useful when developing a Kokkos-enabled style
+within LAMMPS. KOKKOS_DEBUG=yes enables printing of run-time
+debugging information that can be useful. It also enables runtime
+bounds checking on Kokkos data structures.
</P>
-<P>DEBUG is only useful when developing a Kokkos-enabled style within
-LAMMPS. DEBUG=yes enables printing of run-time debugging information
-that can be useful. It also enables runtime bounds checking on Kokkos
-data structures.
+<P>KOKKOS_CUDA_OPTIONS are additional options for CUDA.
+</P>
+<P>For more information on Kokkos see the Kokkos programmers' guide here:
+/lib/kokkos/doc/Kokkos_PG.pdf.
</P>
<P><B>Run with the KOKKOS package from the command line:</B>
</P>
<P>The mpirun or mpiexec command sets the total number of MPI tasks used
by LAMMPS (one or multiple per compute node) and the number of MPI
tasks used per node. E.g. the mpirun command in MPICH does this via
its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
</P>
<P>When using KOKKOS built with host=OMP, you need to choose how many
OpenMP threads per MPI task will be used (via the "-k" command-line
switch discussed below). Note that the product of MPI tasks * OpenMP
threads/task should not exceed the physical number of cores (on a
node), otherwise performance will suffer.
</P>
<P>When using the KOKKOS package built with device=CUDA, you must use
exactly one MPI task per physical GPU.
</P>
<P>When using the KOKKOS package built with host=MIC for Intel Xeon Phi
coprocessor support you need to insure there are one or more MPI tasks
per coprocessor, and choose the number of coprocessor threads to use
per MPI task (via the "-k" command-line switch discussed below). The
product of MPI tasks * coprocessor threads/task should not exceed the
maximum number of threads the coproprocessor is designed to run,
otherwise performance will suffer. This value is 240 for current
generation Xeon Phi(TM) chips, which is 60 physical cores * 4
threads/core. Note that with the KOKKOS package you do not need to
specify how many Phi coprocessors there are per node; each
coprocessors is simply treated as running some number of MPI tasks.
</P>
<P>You must use the "-k on" <A HREF = "Section_start.html#start_7">command-line
switch</A> to enable the KOKKOS package. It
takes additional arguments for hardware settings appropriate to your
system. Those arguments are <A HREF = "Section_start.html#start_7">documented
here</A>. The two most commonly used
options are:
</P>
<PRE>-k on t Nt g Ng
</PRE>
<P>The "t Nt" option applies to host=OMP (even if device=CUDA) and
host=MIC. For host=OMP, it specifies how many OpenMP threads per MPI
task to use with a node. For host=MIC, it specifies how many Xeon Phi
threads per MPI task to use within a node. The default is Nt = 1.
Note that for host=OMP this is effectively MPI-only mode which may be
fine. But for host=MIC you will typically end up using far less than
all the 240 available threads, which could give very poor performance.
</P>
<P>The "g Ng" option applies to device=CUDA. It specifies how many GPUs
per compute node to use. The default is 1, so this only needs to be
specified is you have 2 or more GPUs per compute node.
</P>
<P>The "-k on" switch also issues a "package kokkos" command (with no
additional arguments) which sets various KOKKOS options to default
values, as discussed on the <A HREF = "package.html">package</A> command doc page.
</P>
<P>Use the "-sf kk" <A HREF = "Section_start.html#start_7">command-line switch</A>,
which will automatically append "kk" to styles that support it. Use
the "-pk kokkos" <A HREF = "Section_start.html#start_7">command-line switch</A> if
you wish to change any of the default <A HREF = "package.html">package kokkos</A>
optionns set by the "-k on" <A HREF = "Section_start.html#start_7">command-line
switch</A>.
</P>
<PRE>host=OMP, dual hex-core nodes (12 threads/node):
mpirun -np 12 lmp_g++ -in in.lj # MPI-only mode with no Kokkos
mpirun -np 12 lmp_g++ -k on -sf kk -in in.lj # MPI-only mode with Kokkos
mpirun -np 1 lmp_g++ -k on t 12 -sf kk -in in.lj # one MPI task, 12 threads
mpirun -np 2 lmp_g++ -k on t 6 -sf kk -in in.lj # two MPI tasks, 6 threads/task
mpirun -np 32 -ppn 2 lmp_g++ -k on t 6 -sf kk -in in.lj # ditto on 16 nodes
</PRE>
<P>host=MIC, Intel Phi with 61 cores (240 threads/phi via 4x hardware threading):
mpirun -np 1 lmp_g++ -k on t 240 -sf kk -in in.lj # 1 MPI task on 1 Phi, 1*240 = 240
mpirun -np 30 lmp_g++ -k on t 8 -sf kk -in in.lj # 30 MPI tasks on 1 Phi, 30*8 = 240
mpirun -np 12 lmp_g++ -k on t 20 -sf kk -in in.lj # 12 MPI tasks on 1 Phi, 12*20 = 240
mpirun -np 96 -ppn 12 lmp_g++ -k on t 20 -sf kk -in in.lj # ditto on 8 Phis
</P>
<PRE>host=OMP, device=CUDA, node = dual hex-core CPUs and a single GPU:
mpirun -np 1 lmp_cuda -k on t 6 -sf kk -in in.lj # one MPI task, 6 threads on CPU
mpirun -np 4 -ppn 1 lmp_cuda -k on t 6 -sf kk -in in.lj # ditto on 4 nodes
</PRE>
<PRE>host=OMP, device=CUDA, node = dual 8-core CPUs and 2 GPUs:
mpirun -np 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj # two MPI tasks, 8 threads per CPU
mpirun -np 32 -ppn 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj # ditto on 16 nodes
</PRE>
<P>Note that the default for the <A HREF = "package.html">package kokkos</A> command is
to use "full" neighbor lists and set the Newton flag to "off" for both
pairwise and bonded interactions. This typically gives fastest
performance. If the <A HREF = "newton.html">newton</A> command is used in the input
script, it can override the Newton flag defaults.
</P>
<P>However, when running in MPI-only mode with 1 thread per MPI task, it
will typically be faster to use "half" neighbor lists and set the
Newton flag to "on", just as is the case for non-accelerated pair
styles. You can do this with the "-pk" <A HREF = "Section_start.html#start_7">command-line
switch</A>.
</P>
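<P>For instance, such an MPI-only run might look like this (a sketch
based on the examples above, assuming the neigh and newton keywords of
the <A HREF = "package.html">package kokkos</A> command):
</P>
<PRE>mpirun -np 12 lmp_g++ -k on -sf kk -pk kokkos neigh half newton on -in in.lj
</PRE>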
<P><B>Or run with the KOKKOS package by editing an input script:</B>
</P>
<P>The discussion above for the mpirun/mpiexec command and setting
appropriate thread and GPU values for host=OMP or host=MIC or
device=CUDA are the same.
</P>
<P>You must still use the "-k on" <A HREF = "Section_start.html#start_7">command-line
switch</A> to enable the KOKKOS package, and
specify its additional arguments for hardware options appopriate to
your system, as documented above.
</P>
<P>Use the <A HREF = "suffix.html">suffix kk</A> command, or you can explicitly add a
"kk" suffix to individual styles in your input script, e.g.
</P>
<PRE>pair_style lj/cut/kk 2.5
</PRE>
<P>You only need to use the <A HREF = "package.html">package kokkos</A> command if you
wish to change any of its option defaults, as set by the "-k on"
<A HREF = "Section_start.html#start_7">command-line switch</A>.
</P>
<P><B>Speed-ups to expect:</B>
</P>
<P>The performance of KOKKOS running in different modes is a function of
your hardware, which KOKKOS-enable styles are used, and the problem
size.
</P>
<P>Generally speaking, the following rules of thumb apply:
</P>
<UL><LI>When running on CPUs only, with a single thread per MPI task,
performance of a KOKKOS style is somewhere between the standard
(un-accelerated) styles (MPI-only mode), and those provided by the
USER-OMP package. However the difference between all 3 is small (less
than 20%).
<LI>When running on CPUs only, with multiple threads per MPI task,
performance of a KOKKOS style is a bit slower than the USER-OMP
package.
<LI>When running on GPUs, KOKKOS is typically faster than the USER-CUDA
and GPU packages.
<LI>When running on Intel Xeon Phi, KOKKOS is not as fast as
the USER-INTEL package, which is optimized for that hardware.
</UL>
<P>See the <A HREF = "http://lammps.sandia.gov/bench.html">Benchmark page</A> of the
LAMMPS web site for performance of the KOKKOS package on different
hardware.
</P>
<P><B>Guidelines for best performance:</B>
</P>
<P>Here are guidline for using the KOKKOS package on the different
hardware configurations listed above.
</P>
<P>Many of the guidelines use the <A HREF = "package.html">package kokkos</A> command
See its doc page for details and default settings. Experimenting with
its options can provide a speed-up for specific calculations.
</P>
<P><B>Running on a multi-core CPU:</B>
</P>
<P>If N is the number of physical cores/node, then the number of MPI
tasks/node * number of threads/task should not exceed N, and should
typically equal N. Note that the default threads/task is 1, as set by
the "t" keyword of the "-k" <A HREF = "Section_start.html#start_7">command-line
switch</A>. If you do not change this, no
additional parallelism (beyond MPI) will be invoked on the host
CPU(s).
</P>
<P>You can compare the performance running in different modes:
</P>
<UL><LI>run with 1 MPI task/node and N threads/task
<LI>run with N MPI tasks/node and 1 thread/task
<LI>run with settings in between these extremes
</UL>
<P>Examples of mpirun commands in these modes are shown above.
</P>
<P>When using KOKKOS to perform multi-threading, it is important for
performance to bind both MPI tasks to physical cores, and threads to
physical cores, so they do not migrate during a simulation.
</P>
<P>If you are not certain MPI tasks are being bound (check the defaults
for your MPI installation), binding can be forced with these flags:
</P>
<PRE>OpenMPI 1.8: mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi ...
Mvapich2 2.0: mpiexec -np 2 -bind-to socket -map-by socket ./lmp_mvapich ...
</PRE>
<P>For binding threads with the KOKKOS OMP option, use thread affinity
environment variables to force binding. With OpenMP 3.1 (gcc 4.7 or
later, intel 12 or later) setting the environment variable
OMP_PROC_BIND=true should be sufficient. For binding threads with the
KOKKOS pthreads option, compile LAMMPS the KOKKOS HWLOC=yes option, as
discussed in <A HREF = "Sections_start.html#start_3_4">Section 2.3.4</A> of the
manual.
</P>
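<P>For instance (assuming a bash-like shell and the OpenMPI binding flags
shown above), thread binding for the OpenMP case could be requested with:
</P>
<PRE>export OMP_PROC_BIND=true
mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi -k on t 6 -sf kk -in in.lj
</PRE>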
<P><B>Running on GPUs:</B>
</P>
<P>Insure the -arch setting in the machine makefile you are using,
e.g. src/MAKE/Makefile.cuda, is correct for your GPU hardware/software
(see <A HREF = "Section_start.html#start_3_4">this section</A> of the manual for
details).
</P>
<P>The -np setting of the mpirun command should set the number of MPI
tasks/node to be equal to the # of physical GPUs on the node.
</P>
<P>Use the "-k" <A HREF = "Section_commands.html#start_7">command-line switch</A> to
specify the number of GPUs per node, and the number of threads per MPI
task. As above for multi-core CPUs (and no GPU), if N is the number
of physical cores/node, then the number of MPI tasks/node * number of
threads/task should not exceed N. With one GPU (and one MPI task) it
may be faster to use less than all the available cores, by setting
threads/task to a smaller value. This is because using all the cores
on a dual-socket node will incur extra cost to copy memory from the
2nd socket to the GPU.
</P>
<P>Examples of mpirun commands that follow these rules are shown above.
</P>
<P>IMPORTANT NOTE: When using a GPU, you will achieve the best
performance if your input script does not use any fix or compute
styles which are not yet Kokkos-enabled. This allows data to stay on
the GPU for multiple timesteps, without being copied back to the host
CPU. Invoking a non-Kokkos fix or compute, or performing I/O for
<A HREF = "thermo_style.html">thermo</A> or <A HREF = "dump.html">dump</A> output will cause data
to be copied back to the CPU.
</P>
<P>You cannot yet assign multiple MPI tasks to the same GPU with the
KOKKOS package. We plan to support this in the future, similar to the
GPU package in LAMMPS.
</P>
<P>You cannot yet use both the host (multi-threaded) and device (GPU)
together to compute pairwise interactions with the KOKKOS package. We
hope to support this in the future, similar to the GPU package in
LAMMPS.
</P>
<P><B>Running on an Intel Phi:</B>
</P>
<P>Kokkos only uses Intel Phi processors in their "native" mode, i.e.
not hosted by a CPU.
</P>
<P>As illustrated above, build LAMMPS with OMP=yes (the default) and
MIC=yes. The latter insures code is correctly compiled for the Intel
Phi. The OMP setting means OpenMP will be used for parallelization on
the Phi, which is currently the best option within Kokkos. In the
future, other options may be added.
</P>
<P>Current-generation Intel Phi chips have either 61 or 57 cores. One
core should be excluded for running the OS, leaving 60 or 56 cores.
Each core is hyperthreaded, so there are effectively N = 240 (4*60) or
N = 224 (4*56) cores to run on.
</P>
<P>The -np setting of the mpirun command sets the number of MPI
tasks/node. The "-k on t Nt" command-line switch sets the number of
threads/task as Nt. The product of these 2 values should be N, i.e.
240 or 224. Also, the number of threads/task should be a multiple of
4 so that logical threads from more than one MPI task do not run on
the same physical core.
</P>
<P>Examples of mpirun commands that follow these rules are shown above.
</P>
<P><B>Restrictions:</B>
</P>
<P>As noted above, if using GPUs, the number of MPI tasks per compute
node should equal to the number of GPUs per compute node. In the
future Kokkos will support assigning multiple MPI tasks to a single
GPU.
</P>
<P>Currently Kokkos does not support AMD GPUs due to limits in the
available backend programming models. Specifically, Kokkos requires
extensive C++ support from the Kernel language. This is expected to
change in the future.
</P>
<P>Kokkos must be built with a C++11 compatible compiler. For example,
gcc 4.7.2 or later.
</P>
</HTML>
diff --git a/doc/accelerate_kokkos.txt b/doc/accelerate_kokkos.txt
index 5433ff586..78c2220e8 100644
--- a/doc/accelerate_kokkos.txt
+++ b/doc/accelerate_kokkos.txt
@@ -1,509 +1,510 @@
"Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -
"LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c
:link(lws,http://lammps.sandia.gov)
:link(ld,Manual.html)
:link(lc,Section_commands.html#comm)
:line
"Return to Section accelerate overview"_Section_accelerate.html
5.3.4 KOKKOS package :h4
The KOKKOS package was developed primaritly by Christian Trott
(Sandia) with contributions of various styles by others, including
-Sikandar Mashayak (UIUC). The underlying Kokkos library was written
+Sikandar Mashayak (UIUC), Stan Moore (Sandia), and Ray Shan (Sandia).
+The underlying Kokkos library was written
primarily by Carter Edwards, Christian Trott, and Dan Sunderland (all
Sandia).
The KOKKOS package contains versions of pair, fix, and atom styles
that use data structures and macros provided by the Kokkos library,
which is included with LAMMPS in lib/kokkos.
The Kokkos library is part of
-"Trilinos"_http://trilinos.sandia.gov/packages/kokkos and is a
+"Trilinos"_http://trilinos.sandia.gov/packages/kokkos and can also
+be downloaded from "Github"_https://github.com/kokkos/kokkos. Kokkos is a
templated C++ library that provides two key abstractions for an
application like LAMMPS. First, it allows a single implementation of
an application kernel (e.g. a pair style) to run efficiently on
different kinds of hardware, such as a GPU, Intel Phi, or many-core
chip.
The Kokkos library also provides data abstractions to adjust (at
compile time) the memory layout of basic data structures like 2d and
3d arrays and allow the transparent utilization of special hardware
load and store operations. Such data structures are used in LAMMPS to
store atom coordinates or forces or neighbor lists. The layout is
chosen to optimize performance on different platforms. Again this
functionality is hidden from the developer, and does not affect how
the kernel is coded.
These abstractions are set at build time, when LAMMPS is compiled with
the KOKKOS package installed. This is done by selecting a "host" and
"device" to build for, compatible with the compute nodes in your
machine (one on a desktop machine or 1000s on a supercomputer).
All Kokkos operations occur within the context of an individual MPI
task running on a single node of the machine. The total number of MPI
tasks used by LAMMPS (one or multiple per compute node) is set in the
usual manner via the mpirun or mpiexec commands, and is independent of
Kokkos.
Kokkos provides support for two different modes of execution per MPI
task. This means that computational tasks (pairwise interactions,
neighbor list builds, time integration, etc) can be parallelized for
one or the other of the two modes. The first mode is called the
"host" and is one or more threads running on one or more physical CPUs
(within the node). Currently, both multi-core CPUs and an Intel Phi
processor (running in native mode, not offload mode like the
USER-INTEL package) are supported. The second mode is called the
"device" and is an accelerator chip of some kind. Currently only an
NVIDIA GPU is supported via Cuda. If your compute node does not have
a GPU, then there is only one mode of execution, i.e. the host and
device are the same.
When using the KOKKOS package, you must choose at build time whether
you are building for OpenMP, GPU, or for using the Xeon Phi in native
mode.
Here is a quick overview of how to use the KOKKOS package:
-specify variables and settings in your Makefile.machine that enable OpenMP, GPU, or Phi support
-include the KOKKOS package and build LAMMPS
-enable the KOKKOS package and its hardware options via the "-k on" command-line switch
-use KOKKOS styles in your input script :ul
+specify variables and settings in your Makefile.machine that enable OpenMP, GPU, or Phi support
+include the KOKKOS package and build LAMMPS
+enable the KOKKOS package and its hardware options via the "-k on" command-line switch
+use KOKKOS styles in your input script :ul
The latter two steps can be done using the "-k on", "-pk kokkos" and
"-sf kk" "command-line switches"_Section_start.html#start_7
respectively. Or the effect of the "-pk" or "-sf" switches can be
duplicated by adding the "package kokkos"_package.html or "suffix
kk"_suffix.html commands respectively to your input script.
[Required hardware/software:]
The KOKKOS package can be used to build and run LAMMPS on the
following kinds of hardware:
-CPU-only: one MPI task per CPU core (MPI-only, but using KOKKOS styles)
-CPU-only: one or a few MPI tasks per node with additional threading via OpenMP
-Phi: on one or more Intel Phi coprocessors (per node)
-GPU: on the GPUs of a node with additional OpenMP threading on the CPUs :ul
+CPU-only: one MPI task per CPU core (MPI-only, but using KOKKOS styles)
+CPU-only: one or a few MPI tasks per node with additional threading via OpenMP
+Phi: on one or more Intel Phi coprocessors (per node)
+GPU: on the GPUs of a node with additional OpenMP threading on the CPUs :ul
Note that Intel Xeon Phi coprocessors are supported in "native" mode,
not "offload" mode like the USER-INTEL package supports.
Only NVIDIA GPUs are currently supported.
IMPORTANT NOTE: For good performance of the KOKKOS package on GPUs,
you must have Kepler generation GPUs (or later). The Kokkos library
exploits texture cache options not supported by Tesla generation GPUs
(or older).
To build the KOKKOS package for GPUs, NVIDIA Cuda software must be
installed on your system. See the discussion above for the USER-CUDA
and GPU packages for details of how to check and do this.
[Building LAMMPS with the KOKKOS package:]
You must choose at build time whether to build for OpenMP, Cuda, or
Phi.
You can do any of these in one line, using the src/Make.py script,
described in "Section 2.4"_Section_start.html#start_4 of the manual.
Type "Make.py -h" for help. If run from the src directory, these
commands will create src/lmp_kokkos_omp, lmp_kokkos_cuda, and
lmp_kokkos_phi. The OMP and PHI options use src/MAKE/Makefile.mpi as
the starting Makefile.machine. The CUDA option uses
src/MAKE/OPTIONS/Makefile.cuda since the NVIDIA nvcc compiler is
required.
Make.py -p kokkos -kokkos omp -o kokkos_omp file mpi
Make.py -p kokkos -kokkos cuda arch=31 -o kokkos_cuda file kokkos_cuda
Make.py -p kokkos -kokkos phi -o kokkos_phi file mpi :pre
Or you can follow these steps:
CPU-only (run all-MPI or with OpenMP threading):
cd lammps/src
make yes-kokkos
-make g++ OMP=yes :pre
+make g++ KOKKOS_DEVICES=OpenMP :pre
Intel Xeon Phi:
cd lammps/src
make yes-kokkos
-make g++ OMP=yes MIC=yes :pre
+make g++ KOKKOS_DEVICES=OpenMP KOKKOS_ARCH=KNC :pre
CPUs and GPUs:
cd lammps/src
make yes-kokkos
-make cuda CUDA=yes :pre
+make cuda KOKKOS_DEVICES=Cuda :pre
These examples set the KOKKOS-specific KOKKOS_DEVICES and KOKKOS_ARCH
variables on the make command line, which requires a GNU-compatible make command. Try
"gmake" if your system's standard make complains.
IMPORTANT NOTE: If you build using make line variables and re-build
LAMMPS twice with different KOKKOS options and the *same* target,
e.g. g++ in the first two examples above, then you *must* perform a
"make clean-all" or "make clean-machine" before each build. This is
to force all the KOKKOS-dependent files to be re-compiled with the new
options.
You can also hardwire these make variables in the specified machine
makefile, e.g. src/MAKE/Makefile.g++ in the first two examples above,
with a line like:
-MIC = yes :pre
+KOKKOS_ARCH = KNC :pre
Note that if you build LAMMPS multiple times in this manner, using
different KOKKOS options (defined in different machine makefiles), you
do not have to worry about doing a "clean" in between. This is
because the targets will be different.
IMPORTANT NOTE: The 3rd example above, for a GPU, uses a different
machine makefile, in this case src/MAKE/Makefile.cuda, which is
included in the LAMMPS distribution. To build the KOKKOS package for
a GPU, this makefile must use the NVIDIA "nvcc" compiler. And it must
-have a CCFLAGS -arch setting that is appropriate for your NVIDIA
-hardware and installed software. Typical values for -arch are given
-in "Section 2.3.4"_Section_start.html#start_3_4 of the manual, as well
+have a KOKKOS_ARCH setting that is appropriate for your NVIDIA
+hardware and installed software. Typical values for KOKKOS_ARCH are given
+below, as well
as other settings that must be included in the machine makefile, if
you create your own.
IMPORTANT NOTE: Currently, there are no precision options with the
KOKKOS package. All compilation and computation is performed in
double precision.
There are other allowed options when building with the KOKKOS package.
As above, they can be set either as variables on the make command line
or in Makefile.machine. This is the full list of options, including
-those discussed above, Each takes a value of {yes} or {no}. The
+those discussed above. Each takes a value shown below. The
default value is listed, which is set in the
-lib/kokkos/Makefile.lammps file.
-
-OMP, default = {yes}
-CUDA, default = {no}
-HWLOC, default = {no}
-AVX, default = {no}
-MIC, default = {no}
-LIBRT, default = {no}
-DEBUG, default = {no} :ul
-
-OMP sets the parallelization method used for Kokkos code (within
-LAMMPS) that runs on the host. OMP=yes means that OpenMP will be
-used. OMP=no means that pthreads will be used.
-
-CUDA sets the parallelization method used for Kokkos code (within
-LAMMPS) that runs on the device. CUDA=yes means an NVIDIA GPU running
-CUDA will be used. CUDA=no means that the OMP=yes or OMP=no setting
-will be used for the device as well as the host.
-
-If CUDA=yes, then the lo-level Makefile in the src/MAKE directory must
-use "nvcc" as its compiler, via its CC setting. For best performance
-its CCFLAGS setting should use -O3 and have an -arch setting that
-matches the compute capability of your NVIDIA hardware and software
-installation, e.g. -arch=sm_20. Generally Fermi Generation GPUs are
-sm_20, while Kepler generation GPUs are sm_30 or sm_35 and Maxwell
-cards are sm_50. A complete list can be found on
-"wikipedia"_http://en.wikipedia.org/wiki/CUDA#Supported_GPUs. You can
-also use the deviceQuery tool that comes with the CUDA samples. Note
+lib/kokkos/Makefile.kokkos file.
+
+KOKKOS_DEVICES, values = {OpenMP}, {Serial}, {Pthreads}, {Cuda}, default = {OpenMP}
+KOKKOS_ARCH, values = {KNC}, {SNB}, {HSW}, {Kepler}, {Kepler30}, {Kepler32}, {Kepler35},
+{Kepler37}, {Maxwell}, {Maxwell50}, {Maxwell52}, {Maxwell53}, {ARMv8}, {BGQ}, {Power7}, {Power8},
+default = {none}
+KOKKOS_DEBUG, values = {yes}, {no}, default = {no}
+KOKKOS_USE_TPLS, values = {hwloc}, {librt}, default = {none}
+KOKKOS_CUDA_OPTIONS, values = {force_uvm}, {use_ldg}, {rdc} :ul
+
+KOKKOS_DEVICES sets the parallelization method used for Kokkos code (within
+LAMMPS). KOKKOS_DEVICES=OpenMP means that OpenMP will be
+used. KOKKOS_DEVICES=Pthreads means that pthreads will be used.
+KOKKOS_DEVICES=Cuda means an NVIDIA GPU running
+CUDA will be used.
+
+If KOKKOS_DEVICES=Cuda, then the lo-level Makefile in the src/MAKE
+directory must use "nvcc" as its compiler, via its CC setting. For
+best performance its CCFLAGS setting should use -O3 and have a
+KOKKOS_ARCH setting that matches the compute capability of your NVIDIA
+hardware and software installation, e.g. KOKKOS_ARCH=Kepler30. Note
the minimum required compute capability is 2.0, but this will give
significantly reduced performance compared to Kepler generation GPUs
with compute capability 3.x. For the LINK setting, "nvcc" should not
be used; instead use g++ or another compiler suitable for linking C++
applications. Often you will want to use your MPI compiler wrapper
for this setting (i.e. mpicxx). Finally, the lo-level Makefile must
also have a "Compilation rule" for creating *.o files from *.cu files.
See src/MAKE/Makefile.cuda for an example of a lo-level Makefile with all
of these settings.
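As an illustrative sketch only (not a verbatim copy of the distributed
machine makefile), the relevant settings in such a lo-level Makefile
might look like this:

CC =       nvcc
CCFLAGS =  -O3
LINK =     mpicxx
KOKKOS_DEVICES = Cuda
KOKKOS_ARCH = Kepler35
# compilation rule for creating *.o files from *.cu files
%.o : %.cu
	$(CC) $(CCFLAGS) -c $< -o $@ :pre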
-HWLOC binds threads to hardware cores, so they do not migrate during a
-simulation. HWLOC=yes should always be used if running with OMP=no
-for pthreads. It is not necessary for OMP=yes for OpenMP, because
-OpenMP provides alternative methods via environment variables for
-binding threads to hardware cores. More info on binding threads to
-cores is given in "this section"_Section_accelerate.html#acc_8.
+KOKKOS_USE_TPLS=hwloc binds threads to hardware cores, so they do not
+migrate during a simulation. KOKKOS_USE_TPLS=hwloc should always be
+used if running with KOKKOS_DEVICES=Pthreads for pthreads. It is not
+necessary for KOKKOS_DEVICES=OpenMP for OpenMP, because OpenMP
+provides alternative methods via environment variables for binding
+threads to hardware cores. More info on binding threads to cores is
+given in "this section"_Section_accelerate.html#acc_8.
-AVX enables Intel advanced vector extensions when compiling for an
-Intel-compatible chip. AVX=yes should only be set if your host
-hardware supports AVX. If it does not support it, this will cause a
-run-time crash.
+KOKKOS_ARCH=KNC enables compiler switches needed when compiling for an
+Intel Phi processor.
-MIC enables compiler switches needed when compling for an Intel Phi
-processor.
+KOKKOS_USE_TPLS=librt enables use of a more accurate timer mechanism
+on most Unix platforms. This library is not available on all
+platforms.
-LIBRT enables use of a more accurate timer mechanism on most Unix
-platforms. This library is not available on all platforms.
+KOKKOS_DEBUG is only useful when developing a Kokkos-enabled style
+within LAMMPS. KOKKOS_DEBUG=yes enables printing of run-time
+debugging information that can be useful. It also enables runtime
+bounds checking on Kokkos data structures.
-DEBUG is only useful when developing a Kokkos-enabled style within
-LAMMPS. DEBUG=yes enables printing of run-time debugging information
-that can be useful. It also enables runtime bounds checking on Kokkos
-data structures.
+KOKKOS_CUDA_OPTIONS sets additional options for the CUDA back end; the
+allowed values are {force_uvm}, {use_ldg}, and {rdc}.
+
+For more information on Kokkos see the Kokkos programmers' guide here:
+/lib/kokkos/doc/Kokkos_PG.pdf.
[Run with the KOKKOS package from the command line:]
The mpirun or mpiexec command sets the total number of MPI tasks used
by LAMMPS (one or multiple per compute node) and the number of MPI
tasks used per node. E.g. the mpirun command in MPICH does this via
its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
When using KOKKOS built with host=OMP, you need to choose how many
OpenMP threads per MPI task will be used (via the "-k" command-line
switch discussed below). Note that the product of MPI tasks * OpenMP
threads/task should not exceed the physical number of cores (on a
node), otherwise performance will suffer.
When using the KOKKOS package built with device=CUDA, you must use
exactly one MPI task per physical GPU.
When using the KOKKOS package built with host=MIC for Intel Xeon Phi
coprocessor support you need to insure there are one or more MPI tasks
per coprocessor, and choose the number of coprocessor threads to use
per MPI task (via the "-k" command-line switch discussed below). The
product of MPI tasks * coprocessor threads/task should not exceed the
maximum number of threads the coprocessor is designed to run,
otherwise performance will suffer. This value is 240 for current
generation Xeon Phi(TM) chips, which is 60 physical cores * 4
threads/core. Note that with the KOKKOS package you do not need to
specify how many Phi coprocessors there are per node; each
coprocessor is simply treated as running some number of MPI tasks.
You must use the "-k on" "command-line
switch"_Section_start.html#start_7 to enable the KOKKOS package. It
takes additional arguments for hardware settings appropriate to your
system. Those arguments are "documented
here"_Section_start.html#start_7. The two most commonly used
options are:
-k on t Nt g Ng :pre
The "t Nt" option applies to host=OMP (even if device=CUDA) and
host=MIC. For host=OMP, it specifies how many OpenMP threads per MPI
task to use within a node. For host=MIC, it specifies how many Xeon Phi
threads per MPI task to use within a node. The default is Nt = 1.
Note that for host=OMP this is effectively MPI-only mode which may be
fine. But for host=MIC you will typically end up using far less than
all the 240 available threads, which could give very poor performance.
The "g Ng" option applies to device=CUDA. It specifies how many GPUs
per compute node to use. The default is 1, so this only needs to be
specified if you have 2 or more GPUs per compute node.
The "-k on" switch also issues a "package kokkos" command (with no
additional arguments) which sets various KOKKOS options to default
values, as discussed on the "package"_package.html command doc page.
Use the "-sf kk" "command-line switch"_Section_start.html#start_7,
which will automatically append "kk" to styles that support it. Use
the "-pk kokkos" "command-line switch"_Section_start.html#start_7 if
you wish to change any of the default "package kokkos"_package.html
options set by the "-k on" "command-line
switch"_Section_start.html#start_7.
host=OMP, dual hex-core nodes (12 threads/node):
mpirun -np 12 lmp_g++ -in in.lj # MPI-only mode with no Kokkos
mpirun -np 12 lmp_g++ -k on -sf kk -in in.lj # MPI-only mode with Kokkos
mpirun -np 1 lmp_g++ -k on t 12 -sf kk -in in.lj # one MPI task, 12 threads
mpirun -np 2 lmp_g++ -k on t 6 -sf kk -in in.lj # two MPI tasks, 6 threads/task
mpirun -np 32 -ppn 2 lmp_g++ -k on t 6 -sf kk -in in.lj # ditto on 16 nodes :pre
host=MIC, Intel Phi with 61 cores (240 threads/phi via 4x hardware threading):
mpirun -np 1 lmp_g++ -k on t 240 -sf kk -in in.lj # 1 MPI task on 1 Phi, 1*240 = 240
mpirun -np 30 lmp_g++ -k on t 8 -sf kk -in in.lj # 30 MPI tasks on 1 Phi, 30*8 = 240
mpirun -np 12 lmp_g++ -k on t 20 -sf kk -in in.lj # 12 MPI tasks on 1 Phi, 12*20 = 240
mpirun -np 96 -ppn 12 lmp_g++ -k on t 20 -sf kk -in in.lj # ditto on 8 Phis
host=OMP, device=CUDA, node = dual hex-core CPUs and a single GPU:
mpirun -np 1 lmp_cuda -k on t 6 -sf kk -in in.lj # one MPI task, 6 threads on CPU
mpirun -np 4 -ppn 1 lmp_cuda -k on t 6 -sf kk -in in.lj # ditto on 4 nodes :pre
host=OMP, device=CUDA, node = dual 8-core CPUs and 2 GPUs:
mpirun -np 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj # two MPI tasks, 8 threads per CPU
mpirun -np 32 -ppn 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj # ditto on 16 nodes :pre
Note that the default for the "package kokkos"_package.html command is
to use "full" neighbor lists and set the Newton flag to "off" for both
pairwise and bonded interactions. This typically gives fastest
performance. If the "newton"_newton.html command is used in the input
script, it can override the Newton flag defaults.
However, when running in MPI-only mode with 1 thread per MPI task, it
will typically be faster to use "half" neighbor lists and set the
Newton flag to "on", just as is the case for non-accelerated pair
styles. You can do this with the "-pk" "command-line
switch"_Section_start.html#start_7.
[Or run with the KOKKOS package by editing an input script:]
The discussion above for the mpirun/mpiexec command and setting
appropriate thread and GPU values for host=OMP or host=MIC or
device=CUDA is the same.
You must still use the "-k on" "command-line
switch"_Section_start.html#start_7 to enable the KOKKOS package, and
specify its additional arguments for hardware options appropriate to
your system, as documented above.
Use the "suffix kk"_suffix.html command, or you can explicitly add a
"kk" suffix to individual styles in your input script, e.g.
pair_style lj/cut/kk 2.5 :pre
You only need to use the "package kokkos"_package.html command if you
wish to change any of its option defaults, as set by the "-k on"
"command-line switch"_Section_start.html#start_7.
[Speed-ups to expect:]
The performance of KOKKOS running in different modes is a function of
your hardware, which KOKKOS-enabled styles are used, and the problem
size.
Generally speaking, the following rules of thumb apply:
When running on CPUs only, with a single thread per MPI task,
performance of a KOKKOS style is somewhere between the standard
(un-accelerated) styles (MPI-only mode), and those provided by the
USER-OMP package. However the difference between all 3 is small (less
than 20%). :ulb,l
When running on CPUs only, with multiple threads per MPI task,
performance of a KOKKOS style is a bit slower than the USER-OMP
package. :l
When running on GPUs, KOKKOS is typically faster than the USER-CUDA
and GPU packages. :l
When running on Intel Xeon Phi, KOKKOS is not as fast as
the USER-INTEL package, which is optimized for that hardware. :l,ule
See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the
LAMMPS web site for performance of the KOKKOS package on different
hardware.
[Guidelines for best performance:]
Here are guidelines for using the KOKKOS package on the different
hardware configurations listed above.
Many of the guidelines use the "package kokkos"_package.html command
See its doc page for details and default settings. Experimenting with
its options can provide a speed-up for specific calculations.
[Running on a multi-core CPU:]
If N is the number of physical cores/node, then the number of MPI
tasks/node * number of threads/task should not exceed N, and should
typically equal N. Note that the default threads/task is 1, as set by
the "t" keyword of the "-k" "command-line
switch"_Section_start.html#start_7. If you do not change this, no
additional parallelism (beyond MPI) will be invoked on the host
CPU(s).
You can compare the performance running in different modes:
run with 1 MPI task/node and N threads/task
run with N MPI tasks/node and 1 thread/task
run with settings in between these extremes :ul
Examples of mpirun commands in these modes are shown above.
When using KOKKOS to perform multi-threading, it is important for
performance to bind both MPI tasks to physical cores, and threads to
physical cores, so they do not migrate during a simulation.
If you are not certain MPI tasks are being bound (check the defaults
for your MPI installation), binding can be forced with these flags:
OpenMPI 1.8: mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi ...
Mvapich2 2.0: mpiexec -np 2 -bind-to socket -map-by socket ./lmp_mvapich ... :pre
For binding threads with the KOKKOS OMP option, use thread affinity
environment variables to force binding. With OpenMP 3.1 (gcc 4.7 or
later, intel 12 or later) setting the environment variable
OMP_PROC_BIND=true should be sufficient. For binding threads with the
KOKKOS pthreads option, compile LAMMPS with the KOKKOS_USE_TPLS=hwloc
option, as discussed above.
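For example, a minimal sketch of an OpenMP run with thread binding on a
single dual hex-core node (executable name and input script are
placeholders):

export OMP_PROC_BIND=true
mpirun -np 2 lmp_kokkos_omp -k on t 6 -sf kk -in in.lj :pre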
[Running on GPUs:]
Insure the KOKKOS_ARCH setting in the machine makefile you are using,
e.g. src/MAKE/Makefile.cuda, is correct for your GPU hardware/software
(see the list of allowed KOKKOS_ARCH values above).
The -np setting of the mpirun command should set the number of MPI
tasks/node to be equal to the # of physical GPUs on the node.
Use the "-k" "command-line switch"_Section_commands.html#start_7 to
specify the number of GPUs per node, and the number of threads per MPI
task. As above for multi-core CPUs (and no GPU), if N is the number
of physical cores/node, then the number of MPI tasks/node * number of
threads/task should not exceed N. With one GPU (and one MPI task) it
may be faster to use less than all the available cores, by setting
threads/task to a smaller value. This is because using all the cores
on a dual-socket node will incur extra cost to copy memory from the
2nd socket to the GPU.
Examples of mpirun commands that follow these rules are shown above.
IMPORTANT NOTE: When using a GPU, you will achieve the best
performance if your input script does not use any fix or compute
styles which are not yet Kokkos-enabled. This allows data to stay on
the GPU for multiple timesteps, without being copied back to the host
CPU. Invoking a non-Kokkos fix or compute, or performing I/O for
"thermo"_thermo_style.html or "dump"_dump.html output will cause data
to be copied back to the CPU.
You cannot yet assign multiple MPI tasks to the same GPU with the
KOKKOS package. We plan to support this in the future, similar to the
GPU package in LAMMPS.
You cannot yet use both the host (multi-threaded) and device (GPU)
together to compute pairwise interactions with the KOKKOS package. We
hope to support this in the future, similar to the GPU package in
LAMMPS.
[Running on an Intel Phi:]
Kokkos only uses Intel Phi processors in their "native" mode, i.e.
not hosted by a CPU.
As illustrated above, build LAMMPS with KOKKOS_DEVICES=OpenMP (the
default) and KOKKOS_ARCH=KNC. The latter insures code is correctly
compiled for the Intel Phi. The OpenMP setting means OpenMP will be used
for parallelization on the Phi, which is currently the best option within Kokkos. In the
future, other options may be added.
Current-generation Intel Phi chips have either 61 or 57 cores. One
core should be excluded for running the OS, leaving 60 or 56 cores.
Each core is hyperthreaded, so there are effectively N = 240 (4*60) or
N = 224 (4*56) cores to run on.
The -np setting of the mpirun command sets the number of MPI
tasks/node. The "-k on t Nt" command-line switch sets the number of
threads/task as Nt. The product of these 2 values should be N, i.e.
240 or 224. Also, the number of threads/task should be a multiple of
4 so that logical threads from more than one MPI task do not run on
the same physical core.
Examples of mpirun commands that follow these rules are shown above.
[Restrictions:]
As noted above, if using GPUs, the number of MPI tasks per compute
node should equal to the number of GPUs per compute node. In the
future Kokkos will support assigning multiple MPI tasks to a single
GPU.
Currently Kokkos does not support AMD GPUs due to limits in the
available backend programming models. Specifically, Kokkos requires
extensive C++ support from the Kernel language. This is expected to
change in the future.
Kokkos must be built with a C++11 compatible compiler. For example,
gcc 4.7.2 or later.
diff --git a/doc/fix_rigid.html b/doc/fix_rigid.html
index 5d7086dfe..3cf8e7fae 100644
--- a/doc/fix_rigid.html
+++ b/doc/fix_rigid.html
@@ -1,802 +1,814 @@
<HTML>
<CENTER><A HREF = "http://lammps.sandia.gov">LAMMPS WWW Site</A> - <A HREF = "Manual.html">LAMMPS Documentation</A> - <A HREF = "Section_commands.html#comm">LAMMPS Commands</A>
</CENTER>
<HR>
<H3>fix rigid command
</H3>
<H3>fix rigid/nve command
</H3>
<H3>fix rigid/nvt command
</H3>
<H3>fix rigid/npt command
</H3>
<H3>fix rigid/nph command
</H3>
<H3>fix rigid/small command
</H3>
<H3>fix rigid/nve/small command
</H3>
<H3>fix rigid/nvt/small command
</H3>
<H3>fix rigid/npt/small command
</H3>
<H3>fix rigid/nph/small command
</H3>
<P><B>Syntax:</B>
</P>
<PRE>fix ID group-ID style bodystyle args keyword values ...
</PRE>
<UL><LI>ID, group-ID are documented in <A HREF = "fix.html">fix</A> command
<LI>style = <I>rigid</I> or <I>rigid/nve</I> or <I>rigid/nvt</I> or <I>rigid/npt</I> or <I>rigid/nph</I> or <I>rigid/small</I> or <I>rigid/nve/small</I> or <I>rigid/nvt/small</I> or <I>rigid/npt/small</I> or <I>rigid/nph/small</I>
<LI>bodystyle = <I>single</I> or <I>molecule</I> or <I>group</I>
<PRE> <I>single</I> args = none
<I>molecule</I> args = none
<I>group</I> args = N groupID1 groupID2 ...
N = # of groups
groupID1, groupID2, ... = list of N group IDs
</PRE>
<LI>zero or more keyword/value pairs may be appended
<LI>keyword = <I>langevin</I> or <I>temp</I> or <I>iso</I> or <I>aniso</I> or <I>x</I> or <I>y</I> or <I>z</I> or <I>couple</I> or <I>tparam</I> or <I>pchain</I> or <I>dilate</I> or <I>force</I> or <I>torque</I> or <I>infile</I> or <I>mol</I>
<PRE> <I>langevin</I> values = Tstart Tstop Tperiod seed
Tstart,Tstop = desired temperature at start/stop of run (temperature units)
    Tperiod = temperature damping parameter (time units)
seed = random number seed to use for white noise (positive integer)
<I>temp</I> values = Tstart Tstop Tdamp
Tstart,Tstop = desired temperature at start/stop of run (temperature units)
Tdamp = temperature damping parameter (time units)
<I>iso</I> or <I>aniso</I> values = Pstart Pstop Pdamp
Pstart,Pstop = scalar external pressure at start/end of run (pressure units)
Pdamp = pressure damping parameter (time units)
<I>x</I> or <I>y</I> or <I>z</I> values = Pstart Pstop Pdamp
Pstart,Pstop = external stress tensor component at start/end of run (pressure units)
Pdamp = stress damping parameter (time units)
<I>couple</I> = <I>none</I> or <I>xyz</I> or <I>xy</I> or <I>yz</I> or <I>xz</I>
<I>tparam</I> values = Tchain Titer Torder
Tchain = length of Nose/Hoover thermostat chain
Titer = number of thermostat iterations performed
Torder = 3 or 5 = Yoshida-Suzuki integration parameters
<I>pchain</I> values = Pchain
Pchain = length of the Nose/Hoover thermostat chain coupled with the barostat
<I>dilate</I> value = dilate-group-ID
dilate-group-ID = only dilate atoms in this group due to barostat volume changes
<I>force</I> values = M xflag yflag zflag
M = which rigid body from 1-Nbody (see asterisk form below)
xflag,yflag,zflag = off/on if component of center-of-mass force is active
<I>torque</I> values = M xflag yflag zflag
M = which rigid body from 1-Nbody (see asterisk form below)
xflag,yflag,zflag = off/on if component of center-of-mass torque is active
<I>infile</I> filename
filename = file with per-body values of mass, center-of-mass, moments of inertia
<I>mol</I> value = template-ID
template-ID = ID of molecule template specified in a separate <A HREF = "molecule.html">molecule</A> command
</PRE>
</UL>
<P><B>Examples:</B>
</P>
<PRE>fix 1 clump rigid single
fix 1 clump rigid/small molecule
fix 1 clump rigid single force 1 off off on langevin 1.0 1.0 1.0 428984
fix 1 polychains rigid/nvt molecule temp 1.0 1.0 5.0
fix 1 polychains rigid molecule force 1*5 off off off force 6*10 off off on
fix 1 polychains rigid/small molecule langevin 1.0 1.0 1.0 428984
fix 2 fluid rigid group 3 clump1 clump2 clump3 torque * off off off
fix 1 rods rigid/npt molecule temp 300.0 300.0 100.0 iso 0.5 0.5 10.0
fix 1 particles rigid/npt molecule temp 1.0 1.0 5.0 x 0.5 0.5 1.0 z 0.5 0.5 1.0 couple xz
fix 1 water rigid/nph molecule iso 0.5 0.5 1.0
fix 1 particles rigid/npt/small molecule temp 1.0 1.0 1.0 iso 0.5 0.5 1.0
</PRE>
<P><B>Description:</B>
</P>
<P>Treat one or more sets of atoms as independent rigid bodies. This
means that each timestep the total force and torque on each rigid body
are computed as the sum of the forces and torques on its constituent
particles. The coordinates, velocities, and orientations of the atoms
in each body are then updated so that the body moves and rotates as a
single entity.
</P>
<P>Examples of large rigid bodies are a colloidal particle, or portions
of a biomolecule such as a protein.
</P>
<P>Examples of small rigid bodies are patchy nanoparticles, such as those
modeled in <A HREF = "#Zhang">this paper</A> by Sharon Glotzer's group, clumps of
granular particles, lipid molecules consisting of one or more point
dipoles connected to other spheroids or ellipsoids, irregular
particles built from line segments (2d) or triangles (3d), and
coarse-grain models of nano or colloidal particles consisting of a
small number of constituent particles. Note that the <A HREF = "fix_shake.html">fix
shake</A> command can also be used to rigidify small
molecules of 2, 3, or 4 atoms, e.g. water molecules. That fix treats
the constituent atoms as point masses.
</P>
<P>These fixes also update the positions and velocities of the atoms in
each rigid body via time integration, in the NVE, NVT, NPT, or NPH
ensemble, as described below.
</P>
<P>There are two main variants of this fix, fix rigid and fix
rigid/small. The NVE/NVT/NPT/NPH versions belong to one of the two
variants, as their style names indicate.
</P>
<P>IMPORTANT NOTE: Not all of the <I>bodystyle</I> options and keyword/value
options are available for both the <I>rigid</I> and <I>rigid/small</I> variants.
See details below.
</P>
<P>The <I>rigid</I> variant is typically the best choice for a system with a
small number of large rigid bodies, each of which can extend across
the domain of many processors. It operates by creating a single
global list of rigid bodies, which all processors contribute to.
MPI_Allreduce operations are performed each timestep to sum the
contributions from each processor to the force and torque on all the
bodies. This operation will not scale well in parallel if large
numbers of rigid bodies are simulated.
</P>
<P>The <I>rigid/small</I> variant is typically best for a system with a large
number of small rigid bodies. Each body is assigned to the atom
closest to the geometrical center of the body. The fix operates using
local lists of rigid bodies owned by each processor and information is
exchanged and summed via local communication between neighboring
processors when ghost atom info is accumulated.
</P>
<P>IMPORTANT NOTE: To use <I>rigid/small</I> the ghost atom cutoff must be
large enough to span the distance between the atom that owns the body
and every other atom in the body. This distance value is printed out
when the rigid bodies are defined. If the
<A HREF = "pair_style.html">pair_style</A> cutoff plus neighbor skin does not span
this distance, then you should use the <A HREF = "communicate.html">communicate
cutoff</A> command with a setting epsilon larger than
the distance.
</P>
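<P>For example, a sketch of such a setting (the cutoff value is
illustrative and should be slightly larger than the body-spanning
distance printed by LAMMPS):
</P>
<PRE>communicate single cutoff 12.0
</PRE>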
<P>Which of the two variants is faster for a particular problem is hard
to predict. The best way to decide is to perform a short test run.
Both variants should give identical numerical answers for short runs.
Long runs should give statistically similar results, but round-off
differences may accumulate to produce divergent trajectories.
</P>
<P>IMPORTANT NOTE: You should not update the atoms in rigid bodies via
other time-integration fixes (e.g. <A HREF = "fix_nve.html">fix nve</A>, <A HREF = "fix_nvt.html">fix
nvt</A>, <A HREF = "fix_npt.html">fix npt</A>), or you will be integrating
their motion more than once each timestep. When performing a hybrid
simulation with some atoms in rigid bodies, and some not, a separate
time integration fix like <A HREF = "fix_nve.html">fix nve</A> or <A HREF = "fix_nh.html">fix
nvt</A> should be used for the non-rigid particles.
</P>
<P>IMPORTANT NOTE: These fixes are overkill if you simply want to hold a
collection of atoms stationary or have them move with a constant
velocity. A simpler way to hold atoms stationary is to not include
those atoms in your time integration fix. E.g. use "fix 1 mobile nve"
instead of "fix 1 all nve", where "mobile" is the group of atoms that
you want to move. You can move atoms with a constant velocity by
assigning them an initial velocity (via the <A HREF = "velocity.html">velocity</A>
command), setting the force on them to 0.0 (via the <A HREF = "fix_setforce.html">fix
setforce</A> command), and integrating them as usual
(e.g. via the <A HREF = "fix_nve.html">fix nve</A> command).
</P>
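<P>For example, a minimal sketch of the constant-velocity approach (the
group name "mobile" is illustrative):
</P>
<PRE>velocity mobile set 0.0 0.0 1.0
fix 2 mobile setforce 0.0 0.0 0.0
fix 3 mobile nve
</PRE>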
<P>IMPORTANT NOTE: The aggregate properties of each rigid body are
calculated one time at the start of the first simulation run after
this fix is specified. The properties include the position and
velocity of the center-of-mass of the body, its moments of inertia,
and its angular momentum. This is done using the properties of the
constituent atoms of the body at that point in time (or see the
<I>infile</I> keyword option). Thereafter, changing properties of
individual atoms in the body will have no effect on a rigid body's
dynamics, unless they affect the <A HREF = "pair_style.html">pair_style</A>
interactions that individual particles are part of. For example, you
might think you could displace the atoms in a body or add a large
velocity to each atom in a body to make it move in a desired direction
before a 2nd run is performed, using the <A HREF = "set.html">set</A> or
<A HREF = "displace_atoms.html">displace_atoms</A> or <A HREF = "velocity.html">velocity</A>
command. But these commands will not affect the internal attributes
of the body, and the position and velocity of individual atoms in the
body will be reset when time integration starts.
</P>
<HR>
<P>Each rigid body must have two or more atoms. An atom can belong to at
most one rigid body. Which atoms are in which bodies can be defined
via several options.
</P>
<P>IMPORTANT NOTE: With fix rigid/small, which requires bodystyle
<I>molecule</I>, you can define a system that has no rigid bodies
initially. This is useful when you are using the <I>mol</I> keyword in
conjunction with another fix that is adding rigid bodies on-the-fly,
such as <A HREF = "fix_deposit.html">fix deposit</A> or <A HREF = "fix_pour.html">fix pour</A>.
</P>
<P>For bodystyle <I>single</I> the entire fix group of atoms is treated as one
rigid body. This option is only allowed for fix rigid and its
sub-styles.
</P>
<P>For bodystyle <I>molecule</I>, each set of atoms in the fix group with a
different molecule ID is treated as a rigid body. This option is
allowed for fix rigid and fix rigid/small, and their sub-styles. Note
that atoms with a molecule ID = 0 will be treated as a single rigid
body. For a system with atomic solvent (typically this is atoms with
molecule ID = 0) surrounding rigid bodies, this may not be what you
want. Thus you should be careful to use a fix group that only
includes atoms you want to be part of rigid bodies.
</P>
<P>For bodystyle <I>group</I>, each of the listed groups is treated as a
separate rigid body. Only atoms that are also in the fix group are
included in each rigid body. This option is only allowed for fix
rigid and its sub-styles.
</P>
<P>IMPORTANT NOTE: To compute the initial center-of-mass position and
other properties of each rigid body, the image flags for each atom in
the body are used to "unwrap" the atom coordinates. Thus you must
insure that these image flags are consistent so that the unwrapping
creates a valid rigid body (one where the atoms are close together),
particularly if the atoms in a single rigid body straddle a periodic
boundary. This means the input data file or restart file must define
the image flags for each atom consistently or that you have used the
<A HREF = "set.html">set</A> command to specify them correctly. If a dimension is
non-periodic then the image flag of each atom must be 0 in that
dimension, else an error is generated.
</P>
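<P>For example, if the stored coordinates of a body are already compact
(not split across a periodic boundary), a sketch of resetting its image
flags with the <A HREF = "set.html">set</A> command (the group name "clump1" is
illustrative) is:
</P>
<PRE>set group clump1 image 0 0 0
</PRE>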
<P>The <I>force</I> and <I>torque</I> keywords discussed next are only allowed for
fix rigid and its sub-styles.
</P>
<P>By default, each rigid body is acted on by other atoms which induce an
external force and torque on its center of mass, causing it to
translate and rotate. Components of the external center-of-mass force
and torque can be turned off by the <I>force</I> and <I>torque</I> keywords.
This may be useful if you wish a body to rotate but not translate, or
vice versa, or if you wish it to rotate or translate continuously
unaffected by interactions with other particles. Note that if you
expect a rigid body not to move or rotate by using these keywords, you
must insure its initial center-of-mass translational or angular
velocity is 0.0. Otherwise the initial translational or angular
momentum the body has will persist.
</P>
<P>An xflag, yflag, or zflag set to <I>off</I> means turn off the component of
force or torque in that dimension. A setting of <I>on</I> means turn on
the component, which is the default. Which rigid body(s) the settings
apply to is determined by the first argument of the <I>force</I> and
<I>torque</I> keywords. It can be an integer M from 1 to Nbody, where
Nbody is the number of rigid bodies defined. A wild-card asterisk can
be used in place of, or in conjunction with, the M argument to set the
flags for multiple rigid bodies. This takes the form "*" or "*n" or
"n*" or "m*n". If N = the number of rigid bodies, then an asterisk
with no numeric values means all bodies from 1 to N. A leading
asterisk means all bodies from 1 to n (inclusive). A trailing
asterisk means all bodies from n to N (inclusive). A middle asterisk
means all bodies from m to n (inclusive). Note that you can use the
<I>force</I> or <I>torque</I> keywords as many times as you like. If a
particular rigid body has its component flags set multiple times, the
settings from the final keyword are used.
</P>
<P>IMPORTANT NOTE: For computational efficiency, you may wish to turn off
pairwise and bond interactions within each rigid body, as they no
longer contribute to the motion. The <A HREF = "neigh_modify.html">neigh_modify
exclude</A> and <A HREF = "delete_bonds.html">delete_bonds</A>
commands are used to do this. If the rigid bodies have strongly
overlapping atoms, you may need to turn off these interactions to
avoid numerical problems due to large equal/opposite intra-body forces
swamping the contribution of small inter-body forces.
</P>
<P>For computational efficiency, you should typically define one fix
rigid or fix rigid/small command which includes all the desired rigid
bodies. LAMMPS will allow multiple rigid fixes to be defined, but it
is more expensive.
</P>
<HR>
<P>The constituent particles within a rigid body can be point particles
(the default in LAMMPS) or finite-size particles, such as spheres or
ellipsoids or line segments or triangles. See the <A HREF = "atom_style.html">atom_style sphere
and ellipsoid and line and tri</A> commands for more
details on these kinds of particles. Finite-size particles contribute
differently to the moment of inertia of a rigid body than do point
particles. Finite-size particles can also experience torque (e.g. due
to <A HREF = "pair_gran.html">frictional granular interactions</A>) and have an
orientation. These contributions are accounted for by these fixes.
</P>
<P>Forces between particles within a body do not contribute to the
external force or torque on the body. Thus for computational
efficiency, you may wish to turn off pairwise and bond interactions
between particles within each rigid body. The <A HREF = "neigh_modify.html">neigh_modify
exclude</A> and <A HREF = "delete_bonds.html">delete_bonds</A>
commands are used to do this. For finite-size particles this also
means the particles can be highly overlapped when creating the rigid
body.
</P>
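<P>For example, a sketch of excluding intra-body pair interactions when
each body corresponds to a molecule ID (the group name "bodies" is
illustrative):
</P>
<PRE>neigh_modify exclude molecule bodies
</PRE>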
<HR>
<P>The <I>rigid</I> and <I>rigid/small</I> and <I>rigid/nve</I> styles perform constant
NVE time integration. The only difference is that the <I>rigid</I> and
<I>rigid/small</I> styles use an integration technique based on Richardson
iterations. The <I>rigid/nve</I> style uses the methods described in the
paper by <A HREF = "#Miller">Miller</A>, which are thought to provide better energy
conservation than an iterative approach.
</P>
<P>The <I>rigid/nvt</I> and <I>rigid/nvt/small</I> styles perform constant NVT
integration using a Nose/Hoover thermostat with chains as described
originally in <A HREF = "#Hoover">(Hoover)</A> and <A HREF = "#Martyna">(Martyna)</A>, which
thermostats both the translational and rotational degrees of freedom
of the rigid bodies. The rigid-body algorithm used by <I>rigid/nvt</I>
is described in the paper by <A HREF = "#Kamberaj">Kamberaj</A>.
</P>
<P>The <I>rigid/npt</I> and <I>rigid/nph</I> (and their /small counterparts) styles
perform constant NPT or NPH integration using a Nose/Hoover barostat
with chains. For the NPT case, the same Nose/Hoover thermostat is also
used as with <I>rigid/nvt</I>.
</P>
<P>The barostat parameters are specified using one or more of the <I>iso</I>,
<I>aniso</I>, <I>x</I>, <I>y</I>, <I>z</I> and <I>couple</I> keywords. These keywords give you
the ability to specify 3 diagonal components of the external stress
tensor, and to couple these components together so that the dimensions
they represent are varied together during a constant-pressure
simulation. The effects of these keywords are similar to those
defined in <A HREF = "fix_nh.html">fix npt/nph</A>
</P>
<P>NOTE: Currently the <I>rigid/npt</I> and <I>rigid/nph</I> (and their /small
counterparts) styles do not support triclinic (non-orthogonal) boxes.
</P>
<P>The target pressures for each of the 6 components of the stress tensor
can be specified independently via the <I>x</I>, <I>y</I>, <I>z</I> keywords, which
correspond to the 3 simulation box dimensions. For each component,
the external pressure or tensor component at each timestep is a ramped
value during the run from <I>Pstart</I> to <I>Pstop</I>. If a target pressure is
specified for a component, then the corresponding box dimension will
change during a simulation. For example, if the <I>y</I> keyword is used,
the y-box length will change. A box dimension will not change if that
component is not specified, although you have the option to change
that dimension via the <A HREF = "fix_deform.html">fix deform</A> command.
</P>
<P>For all barostat keywords, the <I>Pdamp</I> parameter operates like the
<I>Tdamp</I> parameter, determining the time scale on which pressure is
relaxed. For example, a value of 10.0 means to relax the pressure in
a timespan of (roughly) 10 time units (e.g. tau or fmsec or psec - see
the <A HREF = "units.html">units</A> command).
</P>
<P>Regardless of what atoms are in the fix group (the only atoms which
are time integrated), a global pressure or stress tensor is computed
for all atoms. Similarly, when the size of the simulation box is
changed, all atoms are re-scaled to new positions, unless the keyword
<I>dilate</I> is specified with a <I>dilate-group-ID</I> for a group that
represents a subset of the atoms. This can be useful, for example, to
leave the coordinates of atoms in a solid substrate unchanged and
control the pressure of a surrounding fluid. Another example is a
system consisting of rigid bodies and point particles where the
barostat is only coupled with the rigid bodies. This option should be
used with care, since it can be unphysical to dilate some atoms and
not others, because it can introduce large, instantaneous
displacements between a pair of atoms (one dilated, one not) that are
far from the dilation origin.
</P>
<P>The <I>couple</I> keyword allows two or three of the diagonal components of
the pressure tensor to be "coupled" together. The value specified
with the keyword determines which are coupled. For example, <I>xz</I>
means the <I>Pxx</I> and <I>Pzz</I> components of the stress tensor are coupled.
<I>Xyz</I> means all 3 diagonal components are coupled. Coupling means two
things: the instantaneous stress will be computed as an average of the
corresponding diagonal components, and the coupled box dimensions will
be changed together in lockstep, meaning coupled dimensions will be
dilated or contracted by the same percentage every timestep. The
<I>Pstart</I>, <I>Pstop</I>, <I>Pdamp</I> parameters for any coupled dimensions must
be identical. <I>Couple xyz</I> can be used for a 2d simulation; the <I>z</I>
dimension is simply ignored.
</P>
<P>The <I>iso</I> and <I>aniso</I> keywords are simply shortcuts that are
equivalent to specifying several other keywords together.
</P>
<P>The keyword <I>iso</I> means couple all 3 diagonal components together when
pressure is computed (hydrostatic pressure), and dilate/contract the
dimensions together. Using "iso Pstart Pstop Pdamp" is the same as
specifying these 4 keywords:
</P>
<PRE>x Pstart Pstop Pdamp
y Pstart Pstop Pdamp
z Pstart Pstop Pdamp
couple xyz
</PRE>
<P>The keyword <I>aniso</I> means <I>x</I>, <I>y</I>, and <I>z</I> dimensions are controlled
independently using the <I>Pxx</I>, <I>Pyy</I>, and <I>Pzz</I> components of the
stress tensor as the driving forces, and the specified scalar external
pressure. Using "aniso Pstart Pstop Pdamp" is the same as specifying
these 4 keywords:
</P>
<PRE>x Pstart Pstop Pdamp
y Pstart Pstop Pdamp
z Pstart Pstop Pdamp
couple none
</PRE>
<HR>
<P>The keyword/value option pairs are used in the following ways.
</P>
<P>The <I>langevin</I> and <I>temp</I> and <I>tparam</I> keywords perform thermostatting
of the rigid bodies, altering both their translational and rotational
degrees of freedom. What is meant by "temperature" of a collection of
rigid bodies and how it can be monitored via the fix output is
discussed below.
</P>
<P>The <I>langevin</I> keyword applies a Langevin thermostat to the constant
NVE time integration performed by either the <I>rigid</I> or <I>rigid/small</I>
or <I>rigid/nve</I> styles. It cannot be used with the <I>rigid/nvt</I> style.
The desired temperature at each timestep is a ramped value during the
run from <I>Tstart</I> to <I>Tstop</I>. The <I>Tdamp</I> parameter is specified in
time units and determines how rapidly the temperature is relaxed. For
example, a value of 100.0 means to relax the temperature in a timespan
of (roughly) 100 time units (tau or fmsec or psec - see the
<A HREF = "units.html">units</A> command). The random # <I>seed</I> must be a positive
integer.
</P>
<P>The way that Langevin thermostatting operates is explained on the <A HREF = "fix_langevin.html">fix
langevin</A> doc page. If you wish to simply viscously
damp the rotational motion without thermostatting, you can set
<I>Tstart</I> and <I>Tstop</I> to 0.0, which means only the viscous drag term in
the Langevin thermostat will be applied. See the discussion on the
<A HREF = "doc/fix_viscous.html">fix viscous</A> doc page for details.
</P>
<P>IMPORTANT NOTE: When the <I>langevin</I> keyword is used with fix rigid
versus fix rigid/small, different dynamics will result for parallel
runs. This is because of the way random numbers are used in the two
cases. The dynamics for the two cases should be statistically
similar, but will not be identical, even for a single timestep.
</P>
<P>The <I>temp</I> and <I>tparam</I> keywords apply a Nose/Hoover thermostat to the
NVT time integration performed by the <I>rigid/nvt</I> style. They cannot
be used with the <I>rigid</I> or <I>rigid/small</I> or <I>rigid/nve</I> styles. The
desired temperature at each timestep is a ramped value during the run
from <I>Tstart</I> to <I>Tstop</I>. The <I>Tdamp</I> parameter is specified in time
units and determines how rapidly the temperature is relaxed. For
example, a value of 100.0 means to relax the temperature in a timespan
of (roughly) 100 time units (tau or fmsec or psec - see the
<A HREF = "units.html">units</A> command).
</P>
<P>Nose/Hoover chains are used in conjunction with this thermostat. The
<I>tparam</I> keyword can optionally be used to change the chain settings
used. <I>Tchain</I> is the number of thermostats in the Nose Hoover chain.
This value, along with <I>Tdamp</I> can be varied to dampen undesirable
oscillations in temperature that can occur in a simulation. As a rule
of thumb, increasing the chain length should lead to smaller
oscillations. The keyword <I>pchain</I> specifies the number of
thermostats in the chain thermostatting the barostat degrees of
freedom.
</P>
<P>IMPORTANT NOTE: There are alternate ways to thermostat a system of
rigid bodies. You can use <A HREF = "fix_langevin.html">fix langevin</A> to treat
the individual particles in the rigid bodies as effectively immersed
in an implicit solvent, e.g. a Brownian dynamics model. For hybrid
systems with both rigid bodies and solvent particles, you can
thermostat only the solvent particles that surround one or more rigid
bodies by appropriate choice of groups in the compute and fix commands
for temperature and thermostatting. The solvent interactions with the
rigid bodies should then effectively thermostat the rigid body
temperature as well without use of the Langevin or Nose/Hoover options
associated with the fix rigid commands.
</P>
<HR>
<P>The <I>mol</I> keyword can only be used with fix rigid/small. It must be
used when other commands, such as <A HREF = "fix_deposit.html">fix deposit</A> or
<A HREF = "fix_pour.html">fix pour</A>, add rigid bodies on-the-fly during a
simulation. You specify a <I>template-ID</I> previously defined using the
<A HREF = "molecule.html">molecule</A> command, which reads a file that defines the
molecule. You must use the same <I>template-ID</I> that the other fix
which is adding rigid bodies uses. The coordinates, atom types, atom
diameters, center-of-mass, and moments of inertia can be specified in
the molecule file. See the <A HREF = "molecule.html">molecule</A> command for
details. The only settings required to be in this file are the
coordinates and types of atoms in the molecule, in which case the
molecule command calculates the other quantities itself.
</P>
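<P>For example, a sketch of pairing the <I>mol</I> keyword with a molecule
template (the template ID "rigidmol" and the file name are
illustrative):
</P>
<PRE>molecule rigidmol rigid.body.txt
fix 1 all rigid/small molecule mol rigidmol
</PRE>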
<P>Note that these other fixes create new rigid bodies, in addition to
those defined initially by this fix via the <I>bodystyle</I> setting.
</P>
<P>Also note that when using the <I>mol</I> keyword, extra restart information
about all rigid bodies is written out whenever a restart file is
written out. See the IMPORTANT NOTE in the next section for details.
</P>
<HR>
<P>The <I>infile</I> keyword allows a file of rigid body attributes to be read
-in from a file, rather then having LAMMPS compute them. There are 3
+in from a file, rather than having LAMMPS compute them. There are 5
such attributes: the total mass of the rigid body, its center-of-mass
-position, and its 6 moments of inertia. For rigid bodies consisting
-of point particles or non-overlapping finite-size particles, LAMMPS
-can compute these values accurately. However, for rigid bodies
-consisting of finite-size particles which overlap each other, LAMMPS
-will ignore the overlaps when computing these 3 attributes. The
-amount of error this induces depends on the amount of overlap. To
-avoid this issue, the values can be pre-computed (e.g. using Monte
-Carlo integration).
+position, its 6 moments of inertia, its center-of-mass velocity, and
+the 3 image flags of the center-of-mass position. For rigid bodies
+consisting of point particles or non-overlapping finite-size
+particles, LAMMPS can compute these values accurately. However, for
+rigid bodies consisting of finite-size particles which overlap each
+other, LAMMPS will ignore the overlaps when computing these 4
+attributes. The amount of error this induces depends on the amount of
+overlap. To avoid this issue, the values can be pre-computed
+(e.g. using Monte Carlo integration).
</P>
<P>The format of the file is as follows. Note that the file does not
have to list attributes for every rigid body integrated by fix rigid.
Only bodies which the file specifies will have their computed
attributes overridden. The file can contain initial blank lines or
comment lines starting with "#" which are ignored. The first
non-blank, non-comment line should list N = the number of lines to
follow. The N successive lines contain the following information:
</P>
-<PRE>ID1 masstotal xcm ycm zcm ixx iyy izz ixy ixz iyz vxcm vycm vzcm lx ly lz
-ID2 masstotal xcm ycm zcm ixx iyy izz ixy ixz iyz vxcm vycm vzcm lx ly lz
+<PRE>ID1 masstotal xcm ycm zcm ixx iyy izz ixy ixz iyz vxcm vycm vzcm lx ly lz ixcm iycm izcm
+ID2 masstotal xcm ycm zcm ixx iyy izz ixy ixz iyz vxcm vycm vzcm lx ly lz ixcm iycm izcm
...
-IDN masstotal xcm ycm zcm ixx iyy izz ixy ixz iyz vxcm vycm vzcm lx ly lz
+IDN masstotal xcm ycm zcm ixx iyy izz ixy ixz iyz vxcm vycm vzcm lx ly lz ixcm iycm izcm
</PRE>
<P>The rigid body IDs are all positive integers. For the <I>single</I>
bodystyle, only an ID of 1 can be used. For the <I>group</I> bodystyle,
IDs from 1 to Ng can be used where Ng is the number of specified
groups. For the <I>molecule</I> bodystyle, use the molecule ID for the
atoms in a specific rigid body as the rigid body ID.
</P>
<P>The masstotal and center-of-mass coordinates (xcm,ycm,zcm) are
self-explanatory. The center-of-mass should be consistent with what
is calculated for the position of the rigid body with all its atoms
unwrapped by their respective image flags. If this produces a
center-of-mass that is outside the simulation box, LAMMPS wraps it
-back into the box. The 6 moments of inertia (ixx,iyy,izz,ixy,ixz,iyz)
-should be the values consistent with the current orientation of the
-rigid body around its center of mass. The values are with respect to
-the simulation box XYZ axes, not with respect to the prinicpal axes of
-the rigid body itself. LAMMPS performs the latter calculation
-internally. The (vxcm,vycm,vzcm) values are the velocity of the
-center of mass. The (lx,ly,lz) values are the angular momentum of the
-body. These last 6 values can simply be set to 0 if you wish the
-body to have no initial motion.
+back into the box.
+</P>
+<P>The 6 moments of inertia (ixx,iyy,izz,ixy,ixz,iyz) should be the
+values consistent with the current orientation of the rigid body
+around its center of mass. The values are with respect to the
+simulation box XYZ axes, not with respect to the principal axes of the
+rigid body itself. LAMMPS performs the latter calculation internally.
+</P>
+<P>The (vxcm,vycm,vzcm) values are the velocity of the center of mass.
+The (lx,ly,lz) values are the angular momentum of the body. The
+(vxcm,vycm,vzcm) and (lx,ly,lz) values can simply be set to 0 if you
+wish the body to have no initial motion.
+</P>
+<P>The (ixcm,iycm,izcm) values are the image flags of the center of mass
+of the body. For periodic dimensions, they specify which image of the
+simulation box the body is considered to be in. An image of 0 means
+it is inside the box as defined. A value of 2 means add 2 box lengths
+to get the true value. A value of -1 means subtract 1 box length to
+get the true value. LAMMPS updates these flags as the rigid bodies
+cross periodic boundaries during the simulation.
</P>
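<P>For example, a sketch of such a file for a single rigid body with no
initial motion (all numeric values are illustrative):
</P>
<PRE># attributes of one rigid body, read via the infile keyword
1
1 250.0 10.0 10.0 10.0 500.0 500.0 500.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0
</PRE>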
<P>IMPORTANT NOTE: If you use the <I>infile</I> or <I>mol</I> keywords and write
restart files during a simulation, then each time a restart file is
written, the fix also writes an auxiliary restart file with the name
rfile.rigid, where "rfile" is the name of the restart file,
e.g. tmp.restart.10000 and tmp.restart.10000.rigid. This auxiliary
file is in the same format described above. Thus it can be used in a
new input script that restarts the run and re-specifies a rigid fix
using an <I>infile</I> keyword and the appropriate filename. Note that the
auxiliary file will contain one line for every rigid body, even if the
original file only listed a subset of the rigid bodies.
</P>
<HR>
<P>If you use a <A HREF = "compute.html">temperature compute</A> with a group that
includes particles in rigid bodies, the degrees-of-freedom removed by
each rigid body are accounted for in the temperature (and pressure)
computation, but only if the temperature group includes all the
particles in a particular rigid body.
</P>
<P>A 3d rigid body has 6 degrees of freedom (3 translational, 3
rotational), except for a collection of point particles lying on a
straight line, which has only 5, e.g. a dimer. A 2d rigid body has 3
degrees of freedom (2 translational, 1 rotational).
</P>
<P>IMPORTANT NOTE: You may wish to explicitly subtract additional
degrees-of-freedom if you use the <I>force</I> and <I>torque</I> keywords to
eliminate certain motions of one or more rigid bodies. LAMMPS does
not do this automatically.
</P>
<P>The rigid body contribution to the pressure of the system (virial) is
also accounted for by this fix.
</P>
<HR>
<P>If your simulation is a hybrid model with a mixture of rigid bodies
and non-rigid particles (e.g. solvent) there are several ways these
rigid fixes can be used in tandem with <A HREF = "fix_nve.html">fix nve</A>, <A HREF = "fix_nh.html">fix
nvt</A>, <A HREF = "fix_nh.html">fix npt</A>, and <A HREF = "fix_nh.html">fix nph</A>.
</P>
<P>If you wish to perform NVE dynamics (no thermostatting or
barostatting), use fix rigid or fix rigid/nve to integrate the rigid
bodies, and <A HREF = "fix_nve.html">fix nve</A> to integrate the non-rigid
particles.
</P>
<P>If you wish to perform NVT dynamics (thermostatting, but no
barostatting), you can use fix rigid/nvt for the rigid bodies, and any
thermostatting fix for the non-rigid particles (<A HREF = "fix_nh.html">fix nvt</A>,
<A HREF = "fix_langevin.html">fix langevin</A>, <A HREF = "fix_temp_berendsen.html">fix
temp/berendsen</A>). You can also use fix rigid
or fix rigid/nve for the rigid bodies and thermostat them using <A HREF = "fix_langevin.html">fix
langevin</A> on the group that contains all the
particles in the rigid bodies. The net force added by <A HREF = "fix_langevin.html">fix
langevin</A> to each rigid body effectively thermostats
its translational center-of-mass motion. It is not clear how well this
approach thermostats the rotational motion of the bodies.
</P>
<P>If you wish to perform NPT or NPH dynamics (barostatting), you cannot
use both <A HREF = "fix_nh.html">fix npt</A> and fix rigid/npt (or the nph
variants). This is because there can only be one fix which monitors
the global pressure and changes the simulation box dimensions. So you
have 3 choices:
</P>
<UL><LI>Use fix rigid/npt for the rigid bodies. Use the <I>dilate</I> all option
so that it will dilate the positions of the non-rigid particles as
well. Use <A HREF = "fix_nh.html">fix nvt</A> (or any other thermostat) for the
non-rigid particles.
<LI>Use <A HREF = "fix_nh.html">fix npt</A> for the group of non-rigid particles. Use
the <I>dilate</I> all option so that it will dilate the center-of-mass
positions of the rigid bodies as well. Use fix rigid/nvt for the
rigid bodies.
<LI>Use <A HREF = "fix_press_berendsen.html">fix press/berendsen</A> to compute the
pressure and change the box dimensions. Use fix rigid/nvt for the
rigid bodies. Use <A HREF = "fix_nh.thml">fix nvt</A> (or any other thermostat) for
the non-rigid particles.
</UL>
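<P>For example, a sketch of the first choice above (group names and
thermostat/barostat values are illustrative):
</P>
<PRE>fix 1 bodies rigid/npt molecule temp 300.0 300.0 100.0 iso 1.0 1.0 1000.0 dilate all
fix 2 solvent nvt temp 300.0 300.0 100.0
</PRE>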
<P>In all cases, the rigid bodies and non-rigid particles both contribute
to the global pressure and the box is scaled the same by any of the
barostatting fixes.
</P>
<P>You could even use the 2nd and 3rd options for a non-hybrid simulation
consisting of only rigid bodies, assuming you give <A HREF = "fix_nh.html">fix
npt</A> an empty group, though it's an odd thing to do. The
barostatting fixes (<A HREF = "fix_nh.html">fix npt</A> and <A HREF = "fix_press_berendsen.html">fix
press/berendsen</A>) will monitor the pressure
and change the box dimensions, but not time integrate any particles.
The integration of the rigid bodies will be performed by fix
rigid/nvt.
</P>
<HR>
<P>Styles with a <I>cuda</I>, <I>gpu</I>, <I>intel</I>, <I>kk</I>, <I>omp</I>, or <I>opt</I> suffix are
functionally the same as the corresponding style without the suffix.
They have been optimized to run faster, depending on your available
hardware, as discussed in <A HREF = "Section_accelerate.html">Section_accelerate</A>
of the manual. The accelerated styles take the same arguments and
should produce the same results, except for round-off and precision
issues.
</P>
<P>These accelerated styles are part of the USER-CUDA, GPU, USER-INTEL,
KOKKOS, USER-OMP and OPT packages, respectively. They are only
enabled if LAMMPS was built with those packages. See the <A HREF = "Section_start.html#start_3">Making
LAMMPS</A> section for more info.
</P>
<P>You can specify the accelerated styles explicitly in your input script
by including their suffix, or you can use the <A HREF = "Section_start.html#start_7">-suffix command-line
switch</A> when you invoke LAMMPS, or you can
use the <A HREF = "suffix.html">suffix</A> command in your input script.
</P>
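<P>For example, assuming LAMMPS was built with the USER-OMP package,
either of these forms (a sketch; the fix ID and group are placeholders)
selects the accelerated variant:
</P>
<PRE>fix 1 clump rigid/omp single      # explicit suffix on the style name
suffix omp                        # or: append /omp to subsequent styles automatically
</PRE>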
<P>See <A HREF = "Section_accelerate.html">Section_accelerate</A> of the manual for
more instructions on how to use the accelerated styles effectively.
</P>
<HR>
<P><B>Restart, fix_modify, output, run start/stop, minimize info:</B>
</P>
<P>No information about the <I>rigid</I> and <I>rigid/small</I> and <I>rigid/nve</I>
fixes is written to <A HREF = "restart.html">binary restart files</A>. The
exception is if the <I>infile</I> or <I>mol</I> keyword is used, in which case
an auxiliary file is written out with rigid body information each time
a restart file is written, as explained above for the <I>infile</I>
keyword. For style <I>rigid/nvt</I> the state of the Nose/Hoover
thermostat is written to <A HREF = "restart.html">binary restart files</A>. See the
<A HREF = "read_restart.html">read_restart</A> command for info on how to re-specify
a fix in an input script that reads a restart file, so that the
operation of the fix continues in an uninterrupted fashion.
</P>
<P>The <A HREF = "fix_modify.html">fix_modify</A> <I>energy</I> option is supported by the
rigid/nvt fix to add the energy change induced by the thermostatting
to the system's potential energy as part of <A HREF = "thermo_style.html">thermodynamic
output</A>.
</P>
<P>The <A HREF = "fix_modify.html">fix_modify</A> <I>temp</I> and <I>press</I> options are
supported by the rigid/npt and rigid/nph fixes to change the computes used
to calculate the instantaneous pressure tensor. Note that the rigid/nvt fix
does not use any external compute to compute instantaneous temperature.
</P>
<P>The <I>rigid</I> and <I>rigid/small</I> and <I>rigid/nve</I> fixes compute a global
scalar which can be accessed by various <A HREF = "Section_howto.html#howto_15">output
commands</A>. The scalar value calculated by
these fixes is "intensive". The scalar is the current temperature of
the collection of rigid bodies. This is averaged over all rigid
bodies and their translational and rotational degrees of freedom. The
translational energy of a rigid body is 1/2 m v^2, where m = total
mass of the body and v = the velocity of its center of mass. The
rotational energy of a rigid body is 1/2 I w^2, where I = the moment
of inertia tensor of the body and w = its angular velocity. Degrees
of freedom constrained by the <I>force</I> and <I>torque</I> keywords are
removed from this calculation, but only for the <I>rigid</I> and
<I>rigid/nve</I> fixes.
</P>
<P>The <I>rigid/nvt</I>, <I>rigid/npt</I>, and <I>rigid/nph</I> fixes compute a global
scalar which can be accessed by various <A HREF = "Section_howto.html#howto_15">output
commands</A>. The scalar value calculated by
these fixes is "extensive". The scalar is the cumulative energy
change due to the thermostatting and barostatting the fix performs.
</P>
<P>All of the <I>rigid</I> fixes except <I>rigid/small</I> compute a global array
of values which can be accessed by various <A HREF = "Section_howto.html#howto_15">output
commands</A>. The number of rows in the
array is equal to the number of rigid bodies. The number of columns
is 15. Thus for each rigid body, 15 values are stored: the xyz coords
of the center of mass (COM), the xyz components of the COM velocity,
the xyz components of the force acting on the COM, the xyz components
of the torque acting on the COM, and the xyz image flags of the COM.
</P>
<P>The center of mass (COM) for each body is similar to unwrapped
coordinates written to a dump file. It will always be inside (or
slightly outside) the simulation box. The image flags have the same
meaning as image flags for atom positions (see the "dump" command).
This means you can calculate the unwrapped COM by applying the image
flags to the COM, the same as when unwrapped coordinates are written
to a dump file.
</P>
<P>The force and torque values in the array are not affected by the
<I>force</I> and <I>torque</I> keywords in the fix rigid command; they reflect
values before any changes are made by those keywords.
</P>
<P>The ordering of the rigid bodies (by row in the array) is as follows.
For the <I>single</I> keyword there is just one rigid body. For the
<I>molecule</I> keyword, the bodies are ordered by ascending molecule ID.
For the <I>group</I> keyword, the list of group IDs determines the ordering
of bodies.
</P>
<P>The array values calculated by these fixes are "intensive", meaning
they are independent of the number of atoms in the simulation.
</P>
<P>No parameter of these fixes can be used with the <I>start/stop</I> keywords
of the <A HREF = "run.html">run</A> command. These fixes are not invoked during
<A HREF = "minimize.html">energy minimization</A>.
</P>
<HR>
<P><B>Restrictions:</B>
</P>
<P>These fixes are all part of the RIGID package. It is only enabled if
LAMMPS was built with that package. See the <A HREF = "Section_start.html#start_3">Making
LAMMPS</A> section for more info.
</P>
<P>Assigning a temperature via the <A HREF = "velocity.html">velocity create</A>
command to a system with <A HREF = "fix_rigid.html">rigid bodies</A> may not have
the desired outcome for two reasons. First, the velocity command may
be invoked before the rigid-body fix has been initialized, i.e. before
the number of adjusted degrees of freedom (DOFs) is known, so the
target temperature cannot be computed correctly. Second, the
assigned velocities may be partially canceled when constraints are
first enforced, leading to a different temperature than desired. A
workaround for this is to perform a <A HREF = "run.html">run 0</A> command, which
ensures all DOFs are accounted for properly, and then rescale the
temperature to the desired value before performing a simulation. For
example:
</P>
<PRE>velocity all create 300.0 12345
run 0 # temperature may not be 300K
velocity all scale 300.0 # now it should be
</PRE>
<P><B>Related commands:</B>
</P>
<P><A HREF = "delete_bonds.html">delete_bonds</A>, <A HREF = "neigh_modify.html">neigh_modify</A>
exclude, <A HREF = "fix_shake.html">fix shake</A>
</P>
<P><B>Default:</B>
</P>
<P>The option defaults are force * on on on and torque * on on on,
meaning all rigid bodies are acted on by center-of-mass force and
torque. Also Tchain = Pchain = 10, Titer = 1, Torder = 3.
</P>
<HR>
<A NAME = "Hoover"></A>
<P><B>(Hoover)</B> Hoover, Phys Rev A, 31, 1695 (1985).
</P>
<A NAME = "Kamberaj"></A>
<P><B>(Kamberaj)</B> Kamberaj, Low, Neal, J Chem Phys, 122, 224114 (2005).
</P>
<A NAME = "Martyna"></A>
<P><B>(Martyna)</B> Martyna, Klein, Tuckerman, J Chem Phys, 97, 2635 (1992);
Martyna, Tuckerman, Tobias, Klein, Mol Phys, 87, 1117.
</P>
<A NAME = "Miller"></A>
<P><B>(Miller)</B> Miller, Eleftheriou, Pattnaik, Ndirango, and Newns,
J Chem Phys, 116, 8649 (2002).
</P>
<A NAME = "Zhang"></A>
<P><B>(Zhang)</B> Zhang, Glotzer, Nanoletters, 4, 1407-1413 (2004).
</P>
</HTML>
diff --git a/doc/fix_rigid.txt b/doc/fix_rigid.txt
index 135ea2653..225b43966 100644
--- a/doc/fix_rigid.txt
+++ b/doc/fix_rigid.txt
@@ -1,777 +1,789 @@
"LAMMPS WWW Site"_lws - "LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c
:link(lws,http://lammps.sandia.gov)
:link(ld,Manual.html)
:link(lc,Section_commands.html#comm)
:line
fix rigid command :h3
fix rigid/nve command :h3
fix rigid/nvt command :h3
fix rigid/npt command :h3
fix rigid/nph command :h3
fix rigid/small command :h3
fix rigid/nve/small command :h3
fix rigid/nvt/small command :h3
fix rigid/npt/small command :h3
fix rigid/nph/small command :h3
[Syntax:]
fix ID group-ID style bodystyle args keyword values ... :pre
ID, group-ID are documented in "fix"_fix.html command :ulb,l
style = {rigid} or {rigid/nve} or {rigid/nvt} or {rigid/npt} or {rigid/nph} or {rigid/small} or {rigid/nve/small} or {rigid/nvt/small} or {rigid/npt/small} or {rigid/nph/small} :l
bodystyle = {single} or {molecule} or {group} :l
{single} args = none
{molecule} args = none
{group} args = N groupID1 groupID2 ...
N = # of groups
groupID1, groupID2, ... = list of N group IDs :pre
zero or more keyword/value pairs may be appended :l
keyword = {langevin} or {temp} or {iso} or {aniso} or {x} or {y} or {z} or {couple} or {tparam} or {pchain} or {dilate} or {force} or {torque} or {infile} or {mol} :l
{langevin} values = Tstart Tstop Tdamp seed
Tstart,Tstop = desired temperature at start/stop of run (temperature units)
Tdamp = temperature damping parameter (time units)
seed = random number seed to use for white noise (positive integer)
{temp} values = Tstart Tstop Tdamp
Tstart,Tstop = desired temperature at start/stop of run (temperature units)
Tdamp = temperature damping parameter (time units)
{iso} or {aniso} values = Pstart Pstop Pdamp
Pstart,Pstop = scalar external pressure at start/end of run (pressure units)
Pdamp = pressure damping parameter (time units)
{x} or {y} or {z} values = Pstart Pstop Pdamp
Pstart,Pstop = external stress tensor component at start/end of run (pressure units)
Pdamp = stress damping parameter (time units)
{couple} = {none} or {xyz} or {xy} or {yz} or {xz}
{tparam} values = Tchain Titer Torder
Tchain = length of Nose/Hoover thermostat chain
Titer = number of thermostat iterations performed
Torder = 3 or 5 = Yoshida-Suzuki integration parameters
{pchain} values = Pchain
Pchain = length of the Nose/Hoover thermostat chain coupled with the barostat
{dilate} value = dilate-group-ID
dilate-group-ID = only dilate atoms in this group due to barostat volume changes
{force} values = M xflag yflag zflag
M = which rigid body from 1-Nbody (see asterisk form below)
xflag,yflag,zflag = off/on if component of center-of-mass force is active
{torque} values = M xflag yflag zflag
M = which rigid body from 1-Nbody (see asterisk form below)
xflag,yflag,zflag = off/on if component of center-of-mass torque is active
{infile} values = filename
filename = file with per-body values of mass, center-of-mass, moments of inertia
{mol} value = template-ID
template-ID = ID of molecule template specified in a separate "molecule"_molecule.html command :pre
:ule
[Examples:]
fix 1 clump rigid single
fix 1 clump rigid/small molecule
fix 1 clump rigid single force 1 off off on langevin 1.0 1.0 1.0 428984
fix 1 polychains rigid/nvt molecule temp 1.0 1.0 5.0
fix 1 polychains rigid molecule force 1*5 off off off force 6*10 off off on
fix 1 polychains rigid/small molecule langevin 1.0 1.0 1.0 428984
fix 2 fluid rigid group 3 clump1 clump2 clump3 torque * off off off
fix 1 rods rigid/npt molecule temp 300.0 300.0 100.0 iso 0.5 0.5 10.0
fix 1 particles rigid/npt molecule temp 1.0 1.0 5.0 x 0.5 0.5 1.0 z 0.5 0.5 1.0 couple xz
fix 1 water rigid/nph molecule iso 0.5 0.5 1.0
fix 1 particles rigid/npt/small molecule temp 1.0 1.0 1.0 iso 0.5 0.5 1.0 :pre
[Description:]
Treat one or more sets of atoms as independent rigid bodies. This
means that each timestep the total force and torque on each rigid body
is computed as the sum of the forces and torques on its constituent
particles. The coordinates, velocities, and orientations of the atoms
in each body are then updated so that the body moves and rotates as a
single entity.
Examples of large rigid bodies are a colloidal particle, or portions
of a biomolecule such as a protein.
Example of small rigid bodies are patchy nanoparticles, such as those
modeled in "this paper"_#Zhang by Sharon Glotzer's group, clumps of
granular particles, lipid molecules consisting of one or more point
dipoles connected to other spheroids or ellipsoids, irregular
particles built from line segments (2d) or triangles (3d), and
coarse-grain models of nano or colloidal particles consisting of a
small number of constituent particles. Note that the "fix
shake"_fix_shake.html command can also be used to rigidify small
molecules of 2, 3, or 4 atoms, e.g. water molecules. That fix treats
the constituent atoms as point masses.
These fixes also update the positions and velocities of the atoms in
each rigid body via time integration, in the NVE, NVT, NPT, or NPH
ensemble, as described below.
There are two main variants of this fix, fix rigid and fix
rigid/small. The NVE/NVT/NPT/NPH versions belong to one of the two
variants, as their style names indicate.
IMPORTANT NOTE: Not all of the {bodystyle} options and keyword/value
options are available for both the {rigid} and {rigid/small} variants.
See details below.
The {rigid} variant is typically the best choice for a system with a
small number of large rigid bodies, each of which can extend across
the domain of many processors. It operates by creating a single
global list of rigid bodies, which all processors contribute to.
MPI_Allreduce operations are performed each timestep to sum the
contributions from each processor to the force and torque on all the
bodies. This operation will not scale well in parallel if large
numbers of rigid bodies are simulated.
The {rigid/small} variant is typically best for a system with a large
number of small rigid bodies. Each body is assigned to the atom
closest to the geometrical center of the body. The fix operates using
local lists of rigid bodies owned by each processor and information is
exchanged and summed via local communication between neighboring
processors when ghost atom info is accumulated.
IMPORTANT NOTE: To use {rigid/small} the ghost atom cutoff must be
large enough to span the distance between the atom that owns the body
and every other atom in the body. This distance value is printed out
when the rigid bodies are defined. If the
"pair_style"_pair_style.html cutoff plus neighbor skin does not span
this distance, then you should use the "communicate
cutoff"_communicate.html command with a setting epsilon larger than
the distance.
Which of the two variants is faster for a particular problem is hard
to predict. The best way to decide is to perform a short test run.
Both variants should give identical numerical answers for short runs.
Long runs should give statistically similar results, but round-off
differences may accumulate to produce divergent trajectories.
IMPORTANT NOTE: You should not update the atoms in rigid bodies via
other time-integration fixes (e.g. "fix nve"_fix_nve.html, "fix
nvt"_fix_nvt.html, "fix npt"_fix_npt.html), or you will be integrating
their motion more than once each timestep. When performing a hybrid
simulation with some atoms in rigid bodies, and some not, a separate
time integration fix like "fix nve"_fix_nve.html or "fix
nvt"_fix_nh.html should be used for the non-rigid particles.
IMPORTANT NOTE: These fixes are overkill if you simply want to hold a
collection of atoms stationary or have them move with a constant
velocity. A simpler way to hold atoms stationary is to not include
those atoms in your time integration fix. E.g. use "fix 1 mobile nve"
instead of "fix 1 all nve", where "mobile" is the group of atoms that
you want to move. You can move atoms with a constant velocity by
assigning them an initial velocity (via the "velocity"_velocity.html
command), setting the force on them to 0.0 (via the "fix
setforce"_fix_setforce.html command), and integrating them as usual
(e.g. via the "fix nve"_fix_nve.html command).
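For example, a sketch of moving a group of atoms (here named "piston", a
placeholder) with a constant velocity instead of using a rigid fix:
velocity piston set 0.0 0.0 -1.0
fix hold piston setforce 0.0 0.0 0.0
fix move piston nve :pre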
IMPORTANT NOTE: The aggregate properties of each rigid body are
calculated one time at the start of the first simulation run after
this fix is specified. The properties include the position and
velocity of the center-of-mass of the body, its moments of inertia,
and its angular momentum. This is done using the properties of the
constituent atoms of the body at that point in time (or see the
{infile} keyword option). Thereafter, changing properties of
individual atoms in the body will have no effect on a rigid body's
dynamics, unless they affect the "pair_style"_pair_style.html
interactions that individual particles are part of. For example, you
might think you could displace the atoms in a body or add a large
velocity to each atom in a body to make it move in a desired direction
before a 2nd run is performed, using the "set"_set.html or
"displace_atoms"_displace_atoms.html or "velocity"_velocity.html
command. But these commands will not affect the internal attributes
of the body, and the position and velocity of individual atoms in the
body will be reset when time integration starts.
:line
Each rigid body must have two or more atoms. An atom can belong to at
most one rigid body. Which atoms are in which bodies can be defined
via several options.
IMPORTANT NOTE: With fix rigid/small, which requires bodystyle
{molecule}, you can define a system that has no rigid bodies
initially. This is useful when you are using the {mol} keyword in
conjunction with another fix that is adding rigid bodies on-the-fly,
such as "fix deposit"_fix_deposit.html or "fix pour"_fix_pour.html.
For bodystyle {single} the entire fix group of atoms is treated as one
rigid body. This option is only allowed for fix rigid and its
sub-styles.
For bodystyle {molecule}, each set of atoms in the fix group with a
different molecule ID is treated as a rigid body. This option is
allowed for fix rigid and fix rigid/small, and their sub-styles. Note
that atoms with a molecule ID = 0 will be treated as a single rigid
body. For a system with atomic solvent (typically this is atoms with
molecule ID = 0) surrounding rigid bodies, this may not be what you
want. Thus you should be careful to use a fix group that only
includes atoms you want to be part of rigid bodies.
For bodystyle {group}, each of the listed groups is treated as a
separate rigid body. Only atoms that are also in the fix group are
included in each rigid body. This option is only allowed for fix
rigid and its sub-styles.
IMPORTANT NOTE: To compute the initial center-of-mass position and
other properties of each rigid body, the image flags for each atom in
the body are used to "unwrap" the atom coordinates. Thus you must
ensure that these image flags are consistent so that the unwrapping
creates a valid rigid body (one where the atoms are close together),
particularly if the atoms in a single rigid body straddle a periodic
boundary. This means the input data file or restart file must define
the image flags for each atom consistently or that you have used the
"set"_set.html command to specify them correctly. If a dimension is
non-periodic then the image flag of each atom must be 0 in that
dimension, else an error is generated.
The {force} and {torque} keywords discussed next are only allowed for
fix rigid and its sub-styles.
By default, each rigid body is acted on by other atoms which induce an
external force and torque on its center of mass, causing it to
translate and rotate. Components of the external center-of-mass force
and torque can be turned off by the {force} and {torque} keywords.
This may be useful if you wish a body to rotate but not translate, or
vice versa, or if you wish it to rotate or translate continuously
unaffected by interactions with other particles. Note that if you
expect a rigid body not to move or rotate by using these keywords, you
must ensure its initial center-of-mass translational or angular
velocity is 0.0. Otherwise the initial translational or angular
momentum the body has will persist.
An xflag, yflag, or zflag set to {off} means turn off the component of
force or torque in that dimension. A setting of {on} means turn on
the component, which is the default. Which rigid body(s) the settings
apply to is determined by the first argument of the {force} and
{torque} keywords. It can be an integer M from 1 to Nbody, where
Nbody is the number of rigid bodies defined. A wild-card asterisk can
be used in place of, or in conjunction with, the M argument to set the
flags for multiple rigid bodies. This takes the form "*" or "*n" or
"n*" or "m*n". If N = the number of rigid bodies, then an asterisk
with no numeric values means all bodies from 1 to N. A leading
asterisk means all bodies from 1 to n (inclusive). A trailing
asterisk means all bodies from n to N (inclusive). A middle asterisk
means all bodies from m to n (inclusive). Note that you can use the
{force} or {torque} keywords as many times as you like. If a
particular rigid body has its component flags set multiple times, the
settings from the final keyword are used.
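For example, a sketch using the asterisk forms (the fix ID, group name,
and body ranges are placeholders): a leading asterisk turns off the
center-of-mass force on bodies 1-5, while a trailing asterisk keeps only
the z-component of the force on bodies 6 through N:
fix rb clumps rigid molecule force *5 off off off force 6* off off on :pre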
IMPORTANT NOTE: For computational efficiency, you may wish to turn off
pairwise and bond interactions within each rigid body, as they no
longer contribute to the motion. The "neigh_modify
exclude"_neigh_modify.html and "delete_bonds"_delete_bonds.html
commands are used to do this. If the rigid bodies have strongly
overlapping atoms, you may need to turn off these interactions to
avoid numerical problems due to large equal/opposite intra-body forces
swamping the contribution of small inter-body forces.
For computational efficiency, you should typically define one fix
rigid or fix rigid/small command which includes all the desired rigid
bodies. LAMMPS will allow multiple rigid fixes to be defined, but it
is more expensive.
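As a sketch of turning off intra-body interactions (assuming the bodies
are defined by molecule ID and the fix group is named "clumps"; check
the "neigh_modify"_neigh_modify.html and "delete_bonds"_delete_bonds.html
doc pages for the exclude style that matches your LAMMPS version):
neigh_modify exclude molecule clumps
delete_bonds clumps multi remove :pre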
:line
The constituent particles within a rigid body can be point particles
(the default in LAMMPS) or finite-size particles, such as spheres or
ellipsoids or line segments or triangles. See the "atom_style sphere
and ellipsoid and line and tri"_atom_style.html commands for more
details on these kinds of particles. Finite-size particles contribute
differently to the moment of inertia of a rigid body than do point
particles. Finite-size particles can also experience torque (e.g. due
to "frictional granular interactions"_pair_gran.html) and have an
orientation. These contributions are accounted for by these fixes.
Forces between particles within a body do not contribute to the
external force or torque on the body. Thus for computational
efficiency, you may wish to turn off pairwise and bond interactions
between particles within each rigid body. The "neigh_modify
exclude"_neigh_modify.html and "delete_bonds"_delete_bonds.html
commands are used to do this. For finite-size particles this also
means the particles can be highly overlapped when creating the rigid
body.
:line
The {rigid} and {rigid/small} and {rigid/nve} styles perform constant
NVE time integration. The only difference is that the {rigid} and
{rigid/small} styles use an integration technique based on Richardson
iterations. The {rigid/nve} style uses the methods described in the
paper by "Miller"_#Miller, which are thought to provide better energy
conservation than an iterative approach.
The {rigid/nvt} and {rigid/nvt/small} styles perform constant NVT
integration using a Nose/Hoover thermostat with chains as described
originally in "(Hoover)"_#Hoover and "(Martyna)"_#Martyna, which
thermostats both the translational and rotational degrees of freedom
of the rigid bodies. The rigid-body algorithm used by {rigid/nvt}
is described in the paper by "Kamberaj"_#Kamberaj.
The {rigid/npt} and {rigid/nph} (and their /small counterparts) styles
perform constant NPT or NPH integration using a Nose/Hoover barostat
with chains. For the NPT case, the same Nose/Hoover thermostat is also
used as with {rigid/nvt}.
The barostat parameters are specified using one or more of the {iso},
{aniso}, {x}, {y}, {z} and {couple} keywords. These keywords give you
the ability to specify 3 diagonal components of the external stress
tensor, and to couple these components together so that the dimensions
they represent are varied together during a constant-pressure
simulation. The effects of these keywords are similar to those
defined in "fix npt/nph"_fix_nh.html
NOTE: Currently the {rigid/npt} and {rigid/nph} (and their /small
counterparts) styles do not support triclinic (non-orthogonal) boxes.
The target pressures for the 3 diagonal components of the stress tensor
can be specified independently via the {x}, {y}, {z} keywords, which
correspond to the 3 simulation box dimensions. For each component,
the external pressure or tensor component at each timestep is a ramped
value during the run from {Pstart} to {Pstop}. If a target pressure is
specified for a component, then the corresponding box dimension will
change during a simulation. For example, if the {y} keyword is used,
the y-box length will change. A box dimension will not change if that
component is not specified, although you have the option to change
that dimension via the "fix deform"_fix_deform.html command.
For all barostat keywords, the {Pdamp} parameter operates like the
{Tdamp} parameter, determining the time scale on which pressure is
relaxed. For example, a value of 10.0 means to relax the pressure in
a timespan of (roughly) 10 time units (e.g. tau or fmsec or psec - see
the "units"_units.html command).
Regardless of what atoms are in the fix group (the only atoms which
are time integrated), a global pressure or stress tensor is computed
for all atoms. Similarly, when the size of the simulation box is
changed, all atoms are re-scaled to new positions, unless the keyword
{dilate} is specified with a {dilate-group-ID} for a group that
represents a subset of the atoms. This can be useful, for example, to
leave the coordinates of atoms in a solid substrate unchanged and
controlling the pressure of a surrounding fluid. Another example is a
system consisting of rigid bodies and point particles where the
barostat is only coupled with the rigid bodies. This option should be
used with care, since it can be unphysical to dilate some atoms and
not others, because it can introduce large, instantaneous
displacements between a pair of atoms (one dilated, one not) that are
far from the dilation origin.
The {couple} keyword allows two or three of the diagonal components of
the pressure tensor to be "coupled" together. The value specified
with the keyword determines which are coupled. For example, {xz}
means the {Pxx} and {Pzz} components of the stress tensor are coupled.
{xyz} means all 3 diagonal components are coupled. Coupling means two
things: the instantaneous stress will be computed as an average of the
corresponding diagonal components, and the coupled box dimensions will
be changed together in lockstep, meaning coupled dimensions will be
dilated or contracted by the same percentage every timestep. The
{Pstart}, {Pstop}, {Pdamp} parameters for any coupled dimensions must
be identical. {couple xyz} can be used for a 2d simulation; the {z}
dimension is simply ignored.
The {iso} and {aniso} keywords are simply shortcuts that are
equivalent to specifying several other keywords together.
The keyword {iso} means couple all 3 diagonal components together when
pressure is computed (hydrostatic pressure), and dilate/contract the
dimensions together. Using "iso Pstart Pstop Pdamp" is the same as
specifying these 4 keywords:
x Pstart Pstop Pdamp
y Pstart Pstop Pdamp
z Pstart Pstop Pdamp
couple xyz :pre
The keyword {aniso} means {x}, {y}, and {z} dimensions are controlled
independently using the {Pxx}, {Pyy}, and {Pzz} components of the
stress tensor as the driving forces, and the specified scalar external
pressure. Using "aniso Pstart Pstop Pdamp" is the same as specifying
these 4 keywords:
x Pstart Pstop Pdamp
y Pstart Pstop Pdamp
z Pstart Pstop Pdamp
couple none :pre
:line
The keyword/value option pairs are used in the following ways.
The {langevin} and {temp} and {tparam} keywords perform thermostatting
of the rigid bodies, altering both their translational and rotational
degrees of freedom. What is meant by "temperature" of a collection of
rigid bodies and how it can be monitored via the fix output is
discussed below.
The {langevin} keyword applies a Langevin thermostat to the constant
NVE time integration performed by either the {rigid} or {rigid/small}
or {rigid/nve} styles. It cannot be used with the {rigid/nvt} style.
The desired temperature at each timestep is a ramped value during the
run from {Tstart} to {Tstop}. The {Tdamp} parameter is specified in
time units and determines how rapidly the temperature is relaxed. For
example, a value of 100.0 means to relax the temperature in a timespan
of (roughly) 100 time units (tau or fmsec or psec - see the
"units"_units.html command). The random # {seed} must be a positive
integer.
The way that Langevin thermostatting operates is explained on the "fix
langevin"_fix_langevin.html doc page. If you wish to simply viscously
damp the rotational motion without thermostatting, you can set
{Tstart} and {Tstop} to 0.0, which means only the viscous drag term in
the Langevin thermostat will be applied. See the discussion on the
"fix viscous"_doc/fix_viscous.html doc page for details.
IMPORTANT NOTE: When the {langevin} keyword is used with fix rigid
versus fix rigid/small, different dynamics will result for parallel
runs. This is because of the way random numbers are used in the two
cases. The dynamics for the two cases should be statistically
similar, but will not be identical, even for a single timestep.
The {temp} and {tparam} keywords apply a Nose/Hoover thermostat to the
NVT time integration performed by the {rigid/nvt} style. They cannot
be used with the {rigid} or {rigid/small} or {rigid/nve} styles. The
desired temperature at each timestep is a ramped value during the run
from {Tstart} to {Tstop}. The {Tdamp} parameter is specified in time
units and determines how rapidly the temperature is relaxed. For
example, a value of 100.0 means to relax the temperature in a timespan
of (roughly) 100 time units (tau or fmsec or psec - see the
"units"_units.html command).
Nose/Hoover chains are used in conjunction with this thermostat. The
{tparam} keyword can optionally be used to change the chain settings
used. {Tchain} is the number of thermostats in the Nose/Hoover chain.
This value, along with {Tdamp} can be varied to dampen undesirable
oscillations in temperature that can occur in a simulation. As a rule
of thumb, increasing the chain length should lead to smaller
oscillations. The keyword {pchain} specifies the number of
thermostats in the chain thermostatting the barostat degrees of
freedom.
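For example, a sketch using a longer thermostat chain to damp
temperature oscillations (fix ID, group name, and numeric values are
placeholders):
fix rb clumps rigid/nvt molecule temp 300.0 300.0 100.0 tparam 20 1 3 :pre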
IMPORTANT NOTE: There are alternate ways to thermostat a system of
rigid bodies. You can use "fix langevin"_fix_langevin.html to treat
the individual particles in the rigid bodies as effectively immersed
in an implicit solvent, e.g. a Brownian dynamics model. For hybrid
systems with both rigid bodies and solvent particles, you can
thermostat only the solvent particles that surround one or more rigid
bodies by appropriate choice of groups in the compute and fix commands
for temperature and thermostatting. The solvent interactions with the
rigid bodies should then effectively thermostat the rigid body
temperature as well without use of the Langevin or Nose/Hoover options
associated with the fix rigid commands.
:line
The {mol} keyword can only be used with fix rigid/small. It must be
used when other commands, such as "fix deposit"_fix_deposit.html or
"fix pour"_fix_pour.html, add rigid bodies on-the-fly during a
simulation. You specify a {template-ID} previously defined using the
"molecule"_molecule.html command, which reads a file that defines the
molecule. You must use the same {template-ID} that the other fix
which is adding rigid bodies uses. The coordinates, atom types, atom
diameters, center-of-mass, and moments of inertia can be specified in
the molecule file. See the "molecule"_molecule.html command for
details. The only settings required to be in this file are the
coordinates and types of atoms in the molecule, in which case the
molecule command calculates the other quantities itself.
Note that these other fixes create new rigid bodies, in addition to
those defined initially by this fix via the {bodystyle} setting.
Also note that when using the {mol} keyword, extra restart information
about all rigid bodies is written out whenever a restart file is
written out. See the IMPORTANT NOTE in the next section for details.
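A sketch of adding rigid bodies on-the-fly (the molecule file, fix IDs,
region ID, and numeric arguments are placeholders; see the "fix
pour"_fix_pour.html doc page for the exact insertion syntax and the
keyword that links it to the rigid fix):
molecule obj object.mol
fix rb all rigid/small molecule mol obj
fix ins all pour 500 0 4767548 region dropzone mol obj rigid rb :pre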
:line
The {infile} keyword allows a file of rigid body attributes to be read
-in from a file, rather then having LAMMPS compute them. There are 3
+in from a file, rather than having LAMMPS compute them. There are 5
such attributes: the total mass of the rigid body, its center-of-mass
-position, and its 6 moments of inertia. For rigid bodies consisting
-of point particles or non-overlapping finite-size particles, LAMMPS
-can compute these values accurately. However, for rigid bodies
-consisting of finite-size particles which overlap each other, LAMMPS
-will ignore the overlaps when computing these 3 attributes. The
-amount of error this induces depends on the amount of overlap. To
-avoid this issue, the values can be pre-computed (e.g. using Monte
-Carlo integration).
+position, its 6 moments of inertia, its center-of-mass velocity, and
+the 3 image flags of the center-of-mass position. For rigid bodies
+consisting of point particles or non-overlapping finite-size
+particles, LAMMPS can compute these values accurately. However, for
+rigid bodies consisting of finite-size particles which overlap each
+other, LAMMPS will ignore the overlaps when computing these 4
+attributes. The amount of error this induces depends on the amount of
+overlap. To avoid this issue, the values can be pre-computed
+(e.g. using Monte Carlo integration).
The format of the file is as follows. Note that the file does not
have to list attributes for every rigid body integrated by fix rigid.
Only bodies which the file specifies will have their computed
attributes overridden. The file can contain initial blank lines or
comment lines starting with "#" which are ignored. The first
non-blank, non-comment line should list N = the number of lines to
follow. The N successive lines contain the following information:
-ID1 masstotal xcm ycm zcm ixx iyy izz ixy ixz iyz vxcm vycm vzcm lx ly lz
-ID2 masstotal xcm ycm zcm ixx iyy izz ixy ixz iyz vxcm vycm vzcm lx ly lz
+ID1 masstotal xcm ycm zcm ixx iyy izz ixy ixz iyz vxcm vycm vzcm lx ly lz ixcm iycm izcm
+ID2 masstotal xcm ycm zcm ixx iyy izz ixy ixz iyz vxcm vycm vzcm lx ly lz ixcm iycm izcm
...
-IDN masstotal xcm ycm zcm ixx iyy izz ixy ixz iyz vxcm vycm vzcm lx ly lz :pre
+IDN masstotal xcm ycm zcm ixx iyy izz ixy ixz iyz vxcm vycm vzcm lx ly lz ixcm iycm izcm :pre
The rigid body IDs are all positive integers. For the {single}
bodystyle, only an ID of 1 can be used. For the {group} bodystyle,
IDs from 1 to Ng can be used where Ng is the number of specified
groups. For the {molecule} bodystyle, use the molecule ID for the
atoms in a specific rigid body as the rigid body ID.
The masstotal and center-of-mass coordinates (xcm,ycm,zcm) are
self-explanatory. The center-of-mass should be consistent with what
is calculated for the position of the rigid body with all its atoms
unwrapped by their respective image flags. If this produces a
center-of-mass that is outside the simulation box, LAMMPS wraps it
-back into the box. The 6 moments of inertia (ixx,iyy,izz,ixy,ixz,iyz)
-should be the values consistent with the current orientation of the
-rigid body around its center of mass. The values are with respect to
-the simulation box XYZ axes, not with respect to the prinicpal axes of
-the rigid body itself. LAMMPS performs the latter calculation
-internally. The (vxcm,vycm,vzcm) values are the velocity of the
-center of mass. The (lx,ly,lz) values are the angular momentum of the
-body. These last 6 values can simply be set to 0 if you wish the
-body to have no initial motion.
+back into the box.
+
+The 6 moments of inertia (ixx,iyy,izz,ixy,ixz,iyz) should be the
+values consistent with the current orientation of the rigid body
+around its center of mass. The values are with respect to the
+simulation box XYZ axes, not with respect to the principal axes of the
+rigid body itself. LAMMPS performs the latter calculation internally.
+
+The (vxcm,vycm,vzcm) values are the velocity of the center of mass.
+The (lx,ly,lz) values are the angular momentum of the body. The
+(vxcm,vycm,vzcm) and (lx,ly,lz) values can simply be set to 0 if you
+wish the body to have no initial motion.
+
+The (ixcm,iycm,izcm) values are the image flags of the center of mass
+of the body. For periodic dimensions, they specify which image of the
+simulation box the body is considered to be in. An image of 0 means
+it is inside the box as defined. A value of 2 means add 2 box lengths
+to get the true value. A value of -1 means subtract 1 box length to
+get the true value. LAMMPS updates these flags as the rigid bodies
+cross periodic boundaries during the simulation.
IMPORTANT NOTE: If you use the {infile} or {mol} keywords and write
restart files during a simulation, then each time a restart file is
written, the fix also writes an auxiliary restart file with the name
rfile.rigid, where "rfile" is the name of the restart file,
e.g. tmp.restart.10000 and tmp.restart.10000.rigid. This auxiliary
file is in the same format described above. Thus it can be used in a
new input script that restarts the run and re-specifies a rigid fix
using an {infile} keyword and the appropriate filename. Note that the
auxiliary file will contain one line for every rigid body, even if the
original file only listed a subset of the rigid bodies.
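For example, a sketch of an {infile} describing a single body (all
numeric values are placeholders), which would be read with a command
like "fix rb clump rigid single infile bodies.rigid":
# ID mass xcm ycm zcm ixx iyy izz ixy ixz iyz vxcm vycm vzcm lx ly lz ixcm iycm izcm
1
1 250.0 10.0 5.0 5.0 400.0 400.0 400.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 :pre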
:line
If you use a "temperature compute"_compute.html with a group that
includes particles in rigid bodies, the degrees-of-freedom removed by
each rigid body are accounted for in the temperature (and pressure)
computation, but only if the temperature group includes all the
particles in a particular rigid body.
A 3d rigid body has 6 degrees of freedom (3 translational, 3
rotational), except for a collection of point particles lying on a
straight line, which has only 5, e.g. a dimer. A 2d rigid body has 3
degrees of freedom (2 translational, 1 rotational).
IMPORTANT NOTE: You may wish to explicitly subtract additional
degrees-of-freedom if you use the {force} and {torque} keywords to
eliminate certain motions of one or more rigid bodies. LAMMPS does
not do this automatically.
The rigid body contribution to the pressure of the system (virial) is
also accounted for by this fix.
:line
If your simulation is a hybrid model with a mixture of rigid bodies
and non-rigid particles (e.g. solvent) there are several ways these
rigid fixes can be used in tandem with "fix nve"_fix_nve.html, "fix
nvt"_fix_nh.html, "fix npt"_fix_nh.html, and "fix nph"_fix_nh.html.
If you wish to perform NVE dynamics (no thermostatting or
barostatting), use fix rigid or fix rigid/nve to integrate the rigid
bodies, and "fix nve"_fix_nve.html to integrate the non-rigid
particles.
If you wish to perform NVT dynamics (thermostatting, but no
barostatting), you can use fix rigid/nvt for the rigid bodies, and any
thermostatting fix for the non-rigid particles ("fix nvt"_fix_nh.html,
"fix langevin"_fix_langevin.html, "fix
temp/berendsen"_fix_temp_berendsen.html). You can also use fix rigid
or fix rigid/nve for the rigid bodies and thermostat them using "fix
langevin"_fix_langevin.html on the group that contains all the
particles in the rigid bodies. The net force added by "fix
langevin"_fix_langevin.html to each rigid body effectively thermostats
its translational center-of-mass motion. It is less clear how well this
approach thermostats the rotational motion of the bodies.
If you wish to perform NPT or NPH dynamics (barostatting), you cannot
use both "fix npt"_fix_nh.html and fix rigid/npt (or the nph
variants). This is because there can only be one fix which monitors
the global pressure and changes the simulation box dimensions. So you
have 3 choices:
Use fix rigid/npt for the rigid bodies. Use the {dilate} all option
so that it will dilate the positions of the non-rigid particles as
well. Use "fix nvt"_fix_nh.html (or any other thermostat) for the
non-rigid particles. :ulb,l
Use "fix npt"_fix_nh.html for the group of non-rigid particles. Use
the {dilate} all option so that it will dilate the center-of-mass
positions of the rigid bodies as well. Use fix rigid/nvt for the
rigid bodies. :l
Use "fix press/berendsen"_fix_press_berendsen.html to compute the
pressure and change the box dimensions. Use fix rigid/nvt for the
rigid bodies. Use "fix nvt"_fix_nh.thml (or any other thermostat) for
the non-rigid particles. :l,ule
In all cases, the rigid bodies and non-rigid particles both contribute
to the global pressure and the box is scaled the same by any of the
barostatting fixes.
You could even use the 2nd and 3rd options for a non-hybrid simulation
consisting of only rigid bodies, assuming you give "fix
npt"_fix_nh.html an empty group, though it's an odd thing to do. The
barostatting fixes ("fix npt"_fix_nh.html and "fix
press/berensen"_fix_press_berendsen.html) will monitor the pressure
and change the box dimensions, but not time integrate any particles.
The integration of the rigid bodies will be performed by fix
rigid/nvt.
:line
Styles with a {cuda}, {gpu}, {intel}, {kk}, {omp}, or {opt} suffix are
functionally the same as the corresponding style without the suffix.
They have been optimized to run faster, depending on your available
hardware, as discussed in "Section_accelerate"_Section_accelerate.html
of the manual. The accelerated styles take the same arguments and
should produce the same results, except for round-off and precision
issues.
These accelerated styles are part of the USER-CUDA, GPU, USER-INTEL,
KOKKOS, USER-OMP and OPT packages, respectively. They are only
enabled if LAMMPS was built with those packages. See the "Making
LAMMPS"_Section_start.html#start_3 section for more info.
You can specify the accelerated styles explicitly in your input script
by including their suffix, or you can use the "-suffix command-line
switch"_Section_start.html#start_7 when you invoke LAMMPS, or you can
use the "suffix"_suffix.html command in your input script.
See "Section_accelerate"_Section_accelerate.html of the manual for
more instructions on how to use the accelerated styles effectively.
:line
[Restart, fix_modify, output, run start/stop, minimize info:]
No information about the {rigid} and {rigid/small} and {rigid/nve}
fixes is written to "binary restart files"_restart.html. The
exception is if the {infile} or {mol} keyword is used, in which case
an auxiliary file is written out with rigid body information each time
a restart file is written, as explained above for the {infile}
keyword. For style {rigid/nvt} the state of the Nose/Hoover
thermostat is written to "binary restart files"_restart.html. See the
"read_restart"_read_restart.html command for info on how to re-specify
a fix in an input script that reads a restart file, so that the
operation of the fix continues in an uninterrupted fashion.
The "fix_modify"_fix_modify.html {energy} option is supported by the
rigid/nvt fix to add the energy change induced by the thermostatting
to the system's potential energy as part of "thermodynamic
output"_thermo_style.html.
The "fix_modify"_fix_modify.html {temp} and {press} options are
supported by the rigid/npt and rigid/nph fixes to change the computes used
to calculate the instantaneous pressure tensor. Note that the rigid/nvt fix
does not use any external compute to compute instantaneous temperature.
The {rigid} and {rigid/small} and {rigid/nve} fixes compute a global
scalar which can be accessed by various "output
commands"_Section_howto.html#howto_15. The scalar value calculated by
these fixes is "intensive". The scalar is the current temperature of
the collection of rigid bodies. This is averaged over all rigid
bodies and their translational and rotational degrees of freedom. The
translational energy of a rigid body is 1/2 m v^2, where m = total
mass of the body and v = the velocity of its center of mass. The
rotational energy of a rigid body is 1/2 I w^2, where I = the moment
of inertia tensor of the body and w = its angular velocity. Degrees
of freedom constrained by the {force} and {torque} keywords are
removed from this calculation, but only for the {rigid} and
{rigid/nve} fixes.
The {rigid/nvt}, {rigid/npt}, and {rigid/nph} fixes compute a global
scalar which can be accessed by various "output
commands"_Section_howto.html#howto_15. The scalar value calculated by
these fixes is "extensive". The scalar is the cumulative energy
change due to the thermostatting and barostatting the fix performs.
All of the {rigid} fixes except {rigid/small} compute a global array
of values which can be accessed by various "output
commands"_Section_howto.html#howto_15. The number of rows in the
array is equal to the number of rigid bodies. The number of columns
is 15. Thus for each rigid body, 15 values are stored: the xyz coords
of the center of mass (COM), the xyz components of the COM velocity,
the xyz components of the force acting on the COM, the xyz components
of the torque acting on the COM, and the xyz image flags of the COM.
The center of mass (COM) for each body is similar to unwrapped
coordinates written to a dump file. It will always be inside (or
slightly outside) the simulation box. The image flags have the same
meaning as image flags for atom positions (see the "dump" command).
This means you can calculate the unwrapped COM by applying the image
flags to the COM, the same as when unwrapped coordinates are written
to a dump file.
The force and torque values in the array are not affected by the
{force} and {torque} keywords in the fix rigid command; they reflect
values before any changes are made by those keywords.
The ordering of the rigid bodies (by row in the array) is as follows.
For the {single} keyword there is just one rigid body. For the
{molecule} keyword, the bodies are ordered by ascending molecule ID.
For the {group} keyword, the list of group IDs determines the ordering
of bodies.
The array values calculated by these fixes are "intensive", meaning
they are independent of the number of atoms in the simulation.
No parameter of these fixes can be used with the {start/stop} keywords
of the "run"_run.html command. These fixes are not invoked during
"energy minimization"_minimize.html.
:line
[Restrictions:]
These fixes are all part of the RIGID package. It is only enabled if
LAMMPS was built with that package. See the "Making
LAMMPS"_Section_start.html#start_3 section for more info.
Assigning a temperature via the "velocity create"_velocity.html
command to a system with "rigid bodies"_fix_rigid.html may not have
the desired outcome for two reasons. First, the velocity command may
be invoked before the rigid-body fix has been initialized, i.e. before
the number of adjusted degrees of freedom (DOFs) is known, so the
target temperature cannot be computed correctly. Second, the
assigned velocities may be partially canceled when constraints are
first enforced, leading to a different temperature than desired. A
workaround for this is to perform a "run 0"_run.html command, which
ensures all DOFs are accounted for properly, and then rescale the
temperature to the desired value before performing a simulation. For
example:
velocity all create 300.0 12345
run 0 # temperature may not be 300K
velocity all scale 300.0 # now it should be :pre
[Related commands:]
"delete_bonds"_delete_bonds.html, "neigh_modify"_neigh_modify.html
exclude, "fix shake"_fix_shake.html
[Default:]
The option defaults are force * on on on and torque * on on on,
meaning all rigid bodies are acted on by center-of-mass force and
torque. Also Tchain = Pchain = 10, Titer = 1, Torder = 3.
:line
:link(Hoover)
[(Hoover)] Hoover, Phys Rev A, 31, 1695 (1985).
:link(Kamberaj)
[(Kamberaj)] Kamberaj, Low, Neal, J Chem Phys, 122, 224114 (2005).
:link(Martyna)
[(Martyna)] Martyna, Klein, Tuckerman, J Chem Phys, 97, 2635 (1992);
Martyna, Tuckerman, Tobias, Klein, Mol Phys, 87, 1117.
:link(Miller)
[(Miller)] Miller, Eleftheriou, Pattnaik, Ndirango, and Newns,
J Chem Phys, 116, 8649 (2002).
:link(Zhang)
[(Zhang)] Zhang, Glotzer, Nanoletters, 4, 1407-1413 (2004).
diff --git a/lib/kokkos/core/src/impl/Kokkos_spinwait.hpp b/lib/kokkos/Copyright.txt
similarity index 74%
copy from lib/kokkos/core/src/impl/Kokkos_spinwait.hpp
copy to lib/kokkos/Copyright.txt
index 966291abd..05980758f 100755
--- a/lib/kokkos/core/src/impl/Kokkos_spinwait.hpp
+++ b/lib/kokkos/Copyright.txt
@@ -1,64 +1,40 @@
-/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
-*/
-
-
-#ifndef KOKKOS_SPINWAIT_HPP
-#define KOKKOS_SPINWAIT_HPP
-
-#include <Kokkos_Macros.hpp>
-
-namespace Kokkos {
-namespace Impl {
-
-#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
-void spinwait( volatile int & flag , const int value );
-#else
-KOKKOS_INLINE_FUNCTION
-void spinwait( volatile int & , const int ) {}
-#endif
-
-} /* namespace Impl */
-} /* namespace Kokkos */
-
-#endif /* #ifndef KOKKOS_SPINWAIT_HPP */
-
diff --git a/lib/kokkos/core/src/impl/Kokkos_spinwait.hpp b/lib/kokkos/LICENSE
similarity index 74%
copy from lib/kokkos/core/src/impl/Kokkos_spinwait.hpp
copy to lib/kokkos/LICENSE
index 966291abd..05980758f 100755
--- a/lib/kokkos/core/src/impl/Kokkos_spinwait.hpp
+++ b/lib/kokkos/LICENSE
@@ -1,64 +1,40 @@
-/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
-*/
-
-
-#ifndef KOKKOS_SPINWAIT_HPP
-#define KOKKOS_SPINWAIT_HPP
-
-#include <Kokkos_Macros.hpp>
-
-namespace Kokkos {
-namespace Impl {
-
-#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
-void spinwait( volatile int & flag , const int value );
-#else
-KOKKOS_INLINE_FUNCTION
-void spinwait( volatile int & , const int ) {}
-#endif
-
-} /* namespace Impl */
-} /* namespace Kokkos */
-
-#endif /* #ifndef KOKKOS_SPINWAIT_HPP */
-
diff --git a/lib/kokkos/Makefile.kokkos b/lib/kokkos/Makefile.kokkos
new file mode 100755
index 000000000..473039af5
--- /dev/null
+++ b/lib/kokkos/Makefile.kokkos
@@ -0,0 +1,318 @@
+# Default settings common options
+
+KOKKOS_PATH=../../lib/kokkos
+
+#Options: OpenMP,Serial,Pthreads,Cuda
+KOKKOS_DEVICES ?= "OpenMP"
+#KOKKOS_DEVICES ?= "Pthreads"
+#Options: KNC,SNB,HSW,Kepler,Kepler30,Kepler32,Kepler35,Kepler37,Maxwell,Maxwell50,Maxwell52,Maxwell53,ARMv8,BGQ,Power7,Power8
+KOKKOS_ARCH ?= ""
+#Options: yes,no
+KOKKOS_DEBUG ?= "no"
+#Options: hwloc,librt
+KOKKOS_USE_TPLS ?= ""
+
+#Default settings specific options
+#Options: force_uvm,use_ldg,rdc
+KOKKOS_CUDA_OPTIONS ?= ""
+
+# Check for general settings
+
+KOKKOS_CXX_STANDARD ?= "c++11"
+
+KOKKOS_INTERNAL_ENABLE_DEBUG := $(strip $(shell echo $(KOKKOS_DEBUG) | grep "yes" | wc -l))
+KOKKOS_INTERNAL_ENABLE_PROFILING_COLLECT_KERNEL_DATA := $(strip $(shell echo $(KOKKOS_PROFILING) | grep "kernel_times" | wc -l))
+KOKKOS_INTERNAL_ENABLE_PROFILING_AGGREGATE_MPI := $(strip $(shell echo $(KOKKOS_PROFILING) | grep "aggregate_mpi" | wc -l))
+KOKKOS_INTERNAL_ENABLE_CXX11 := $(strip $(shell echo $(KOKKOS_CXX_STANDARD) | grep "c++11" | wc -l))
+
+# Check for external libraries
+KOKKOS_INTERNAL_USE_HWLOC := $(strip $(shell echo $(KOKKOS_USE_TPLS) | grep "hwloc" | wc -l))
+KOKKOS_INTERNAL_USE_LIBRT := $(strip $(shell echo $(KOKKOS_USE_TPLS) | grep "librt" | wc -l))
+
+# Check for advanced settings
+KOKKOS_INTERNAL_CUDA_USE_LDG := $(strip $(shell echo $(KOKKOS_CUDA_OPTIONS) | grep "use_ldg" | wc -l))
+KOKKOS_INTERNAL_CUDA_USE_UVM := $(strip $(shell echo $(KOKKOS_CUDA_OPTIONS) | grep "force_uvm" | wc -l))
+KOKKOS_INTERNAL_CUDA_USE_RELOC := $(strip $(shell echo $(KOKKOS_CUDA_OPTIONS) | grep "rdc" | wc -l))
+
+# Check for Kokkos Host Execution Spaces one of which must be on
+
+KOKKOS_INTERNAL_USE_OPENMP := $(strip $(shell echo $(KOKKOS_DEVICES) | grep OpenMP | wc -l))
+KOKKOS_INTERNAL_USE_PTHREADS := $(strip $(shell echo $(KOKKOS_DEVICES) | grep Pthread | wc -l))
+KOKKOS_INTERNAL_USE_SERIAL := $(strip $(shell echo $(KOKKOS_DEVICES) | grep Serial | wc -l))
+
+ifeq ($(KOKKOS_INTERNAL_USE_OPENMP), 0)
+ifeq ($(KOKKOS_INTERNAL_USE_PTHREADS), 0)
+ KOKKOS_INTERNAL_USE_SERIAL := 1
+endif
+endif
+
+KOKKOS_INTERNAL_COMPILER_PGI := $(shell $(CXX) --version | grep PGI | wc -l)
+
+ifeq ($(KOKKOS_INTERNAL_COMPILER_PGI), 1)
+ KOKKOS_INTERNAL_OPENMP_FLAG := -mp
+else
+ KOKKOS_INTERNAL_OPENMP_FLAG := -fopenmp
+endif
+
+ifeq ($(KOKKOS_INTERNAL_COMPILER_PGI), 1)
+ KOKKOS_INTERNAL_CXX11_FLAG := --c++11
+else
+ KOKKOS_INTERNAL_CXX11_FLAG := --std=c++11
+endif
+# Check for other Execution Spaces
+
+KOKKOS_INTERNAL_USE_CUDA := $(strip $(shell echo $(KOKKOS_DEVICES) | grep Cuda | wc -l))
+
+# Check for Kokkos Architecture settings
+
+#Intel based
+KOKKOS_INTERNAL_USE_ARCH_KNC := $(strip $(shell echo $(KOKKOS_ARCH) | grep KNC | wc -l))
+KOKKOS_INTERNAL_USE_ARCH_SNB := $(strip $(shell echo $(KOKKOS_ARCH) | grep SNB | wc -l))
+KOKKOS_INTERNAL_USE_ARCH_HSW := $(strip $(shell echo $(KOKKOS_ARCH) | grep HSW | wc -l))
+
+#NVIDIA based
+KOKKOS_INTERNAL_USE_ARCH_KEPLER30 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Kepler30 | wc -l))
+KOKKOS_INTERNAL_USE_ARCH_KEPLER32 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Kepler32 | wc -l))
+KOKKOS_INTERNAL_USE_ARCH_KEPLER35 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Kepler35 | wc -l))
+KOKKOS_INTERNAL_USE_ARCH_KEPLER37 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Kepler37 | wc -l))
+KOKKOS_INTERNAL_USE_ARCH_MAXWELL50 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Maxwell50 | wc -l))
+KOKKOS_INTERNAL_USE_ARCH_MAXWELL52 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Maxwell52 | wc -l))
+KOKKOS_INTERNAL_USE_ARCH_MAXWELL53 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Maxwell53 | wc -l))
+KOKKOS_INTERNAL_USE_ARCH_NVIDIA := $(strip $(shell echo $(KOKKOS_INTERNAL_USE_ARCH_KEPLER30) \
+ + $(KOKKOS_INTERNAL_USE_ARCH_KEPLER32) \
+ + $(KOKKOS_INTERNAL_USE_ARCH_KEPLER35) \
+ + $(KOKKOS_INTERNAL_USE_ARCH_KEPLER37) \
+ + $(KOKKOS_INTERNAL_USE_ARCH_MAXWELL50) \
+ + $(KOKKOS_INTERNAL_USE_ARCH_MAXWELL52) \
+ + $(KOKKOS_INTERNAL_USE_ARCH_MAXWELL53) | bc))
+
+ifeq ($(KOKKOS_INTERNAL_USE_ARCH_NVIDIA), 0)
+KOKKOS_INTERNAL_USE_ARCH_MAXWELL50 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Maxwell | wc -l))
+KOKKOS_INTERNAL_USE_ARCH_KEPLER35 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Kepler | wc -l))
+KOKKOS_INTERNAL_USE_ARCH_NVIDIA := $(strip $(shell echo $(KOKKOS_INTERNAL_USE_ARCH_KEPLER30) \
+ + $(KOKKOS_INTERNAL_USE_ARCH_KEPLER32) \
+ + $(KOKKOS_INTERNAL_USE_ARCH_KEPLER35) \
+ + $(KOKKOS_INTERNAL_USE_ARCH_KEPLER37) \
+ + $(KOKKOS_INTERNAL_USE_ARCH_MAXWELL50) \
+ + $(KOKKOS_INTERNAL_USE_ARCH_MAXWELL52) \
+ + $(KOKKOS_INTERNAL_USE_ARCH_MAXWELL53) | bc))
+endif
+
+#ARM based
+KOKKOS_INTERNAL_USE_ARCH_ARMV80 := $(strip $(shell echo $(KOKKOS_ARCH) | grep ARMv8 | wc -l))
+
+#IBM based
+KOKKOS_INTERNAL_USE_ARCH_BGQ := $(strip $(shell echo $(KOKKOS_ARCH) | grep BGQ | wc -l))
+KOKKOS_INTERNAL_USE_ARCH_POWER7 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Power7 | wc -l))
+KOKKOS_INTERNAL_USE_ARCH_POWER8 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Power8 | wc -l))
+KOKKOS_INTERNAL_USE_ARCH_IBM := $(strip $(shell echo $(KOKKOS_INTERNAL_USE_ARCH_BGQ)+$(KOKKOS_INTERNAL_USE_ARCH_POWER7)+$(KOKKOS_INTERNAL_USE_ARCH_POWER8) | bc))
+
+#AMD based
+KOKKOS_INTERNAL_USE_ARCH_AMDAVX := $(strip $(shell echo $(KOKKOS_ARCH) | grep AMDAVX | wc -l))
+
+#Any AVX?
+KOKKOS_INTERNAL_USE_ARCH_AVX := $(strip $(shell echo $(KOKKOS_INTERNAL_USE_ARCH_SNB)+$(KOKKOS_INTERNAL_USE_ARCH_AMDAVX) | bc ))
+KOKKOS_INTERNAL_USE_ARCH_AVX2 := $(strip $(shell echo $(KOKKOS_INTERNAL_USE_ARCH_HSW) | bc ))
+
+#Incompatible flags?
+KOKKOS_INTERNAL_USE_ARCH_MULTIHOST := $(strip $(shell echo "$(KOKKOS_INTERNAL_USE_ARCH_AVX)+$(KOKKOS_INTERNAL_USE_ARCH_AVX2)+$(KOKKOS_INTERNAL_USE_ARCH_KNC)+$(KOKKOS_INTERNAL_USE_ARCH_IBM)+$(KOKKOS_INTERNAL_USE_ARCH_AMDAVX)+$(KOKKOS_INTERNAL_USE_ARCH_ARMV80)>1" | bc ))
+KOKKOS_INTERNAL_USE_ARCH_MULTIGPU := $(strip $(shell echo "$(KOKKOS_INTERNAL_USE_ARCH_NVIDIA)>1" | bc))
+
+ifeq ($(KOKKOS_INTERNAL_USE_ARCH_MULTIHOST), 1)
+ $(error Defined Multiple Host architectures: KOKKOS_ARCH=$(KOKKOS_ARCH) )
+endif
+ifeq ($(KOKKOS_INTERNAL_USE_ARCH_MULTIGPU), 1)
+ $(error Defined Multiple GPU architectures: KOKKOS_ARCH=$(KOKKOS_ARCH) )
+endif
+
+#Generating the list of Flags
+
+KOKKOS_CPPFLAGS = -I./ -I$(KOKKOS_PATH)/core/src -I$(KOKKOS_PATH)/containers/src -I$(KOKKOS_PATH)/algorithms/src
+# No warnings:
+KOKKOS_CXXFLAGS =
+# INTEL and CLANG warnings:
+#KOKKOS_CXXFLAGS = -Wall -Wshadow -pedantic -Wsign-compare -Wtype-limits -Wuninitialized
+# GCC warnings:
+#KOKKOS_CXXFLAGS = -Wall -Wshadow -pedantic -Wsign-compare -Wtype-limits -Wuninitialized -Wignored-qualifiers -Wempty-body -Wclobbered
+
+KOKKOS_LIBS = -lkokkos
+KOKKOS_LDFLAGS = -L$(shell pwd)
+KOKKOS_SRC =
+KOKKOS_HEADERS =
+
+#Generating the KokkosCore_config.h file
+
+tmp := $(shell echo "/* ---------------------------------------------" > KokkosCore_config.tmp)
+tmp := $(shell echo "Makefile constructed configuration:" >> KokkosCore_config.tmp)
+tmp := $(shell date >> KokkosCore_config.tmp)
+tmp := $(shell echo "----------------------------------------------*/" >> KokkosCore_config.tmp)
+
+
+tmp := $(shell echo "/* Execution Spaces */" >> KokkosCore_config.tmp)
+ifeq ($(KOKKOS_INTERNAL_USE_OPENMP), 1)
+ tmp := $(shell echo '\#define KOKKOS_HAVE_OPENMP 1' >> KokkosCore_config.tmp)
+endif
+
+ifeq ($(KOKKOS_INTERNAL_USE_PTHREADS), 1)
+ tmp := $(shell echo "\#define KOKKOS_HAVE_PTHREAD 1" >> KokkosCore_config.tmp )
+endif
+
+ifeq ($(KOKKOS_INTERNAL_USE_SERIAL), 1)
+ tmp := $(shell echo "\#define KOKKOS_HAVE_SERIAL 1" >> KokkosCore_config.tmp )
+endif
+
+ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
+ tmp := $(shell echo "\#define KOKKOS_HAVE_CUDA 1" >> KokkosCore_config.tmp )
+endif
+
+tmp := $(shell echo "/* General Settings */" >> KokkosCore_config.tmp)
+ifeq ($(KOKKOS_INTERNAL_ENABLE_CXX11), 1)
+ KOKKOS_CXXFLAGS += $(KOKKOS_INTERNAL_CXX11_FLAG)
+ tmp := $(shell echo "\#define KOKKOS_HAVE_CXX11 1" >> KokkosCore_config.tmp )
+endif
+
+ifeq ($(KOKKOS_INTERNAL_ENABLE_DEBUG), 1)
+ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
+ KOKKOS_CXXFLAGS += -G
+endif
+ KOKKOS_CXXFLAGS += -g
+ KOKKOS_LDFLAGS += -g -ldl
+ tmp := $(shell echo "\#define KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK 1" >> KokkosCore_config.tmp )
+ tmp := $(shell echo "\#define KOKKOS_HAVE_DEBUG 1" >> KokkosCore_config.tmp )
+endif
+
+ifeq ($(KOKKOS_INTERNAL_USE_HWLOC), 1)
+ KOKKOS_CPPFLAGS += -I$(HWLOC_PATH)/include
+ KOKKOS_LDFLAGS += -L$(HWLOC_PATH)/lib
+ KOKKOS_LIBS += -lhwloc
+ tmp := $(shell echo "\#define KOKKOS_HAVE_HWLOC 1" >> KokkosCore_config.tmp )
+endif
+
+ifeq ($(KOKKOS_INTERNAL_USE_LIBRT), 1)
+ tmp := $(shell echo "\#define KOKKOS_USE_LIBRT 1" >> KokkosCore_config.tmp )
+ tmp := $(shell echo "\#define PREC_TIMER 1" >> KokkosCore_config.tmp )
+ tmp := $(shell echo "\#define KOKKOSP_ENABLE_RTLIB 1" >> KokkosCore_config.tmp )
+ KOKKOS_LIBS += -lrt
+endif
+
+tmp := $(shell echo "/* Cuda Settings */" >> KokkosCore_config.tmp)
+
+ifeq ($(KOKKOS_INTERNAL_CUDA_USE_LDG), 1)
+ tmp := $(shell echo "\#define KOKKOS_CUDA_USE_LDG_INTRINSIC 1" >> KokkosCore_config.tmp )
+endif
+
+ifeq ($(KOKKOS_INTERNAL_CUDA_USE_UVM), 1)
+ tmp := $(shell echo "\#define KOKKOS_CUDA_USE_UVM 1" >> KokkosCore_config.tmp )
+ tmp := $(shell echo "\#define KOKKOS_USE_CUDA_UVM 1" >> KokkosCore_config.tmp )
+endif
+
+ifeq ($(KOKKOS_INTERNAL_CUDA_USE_RELOC), 1)
+ tmp := $(shell echo "\#define KOKKOS_CUDA_USE_RELOCATABLE_DEVICE_CODE 1" >> KokkosCore_config.tmp )
+ KOKKOS_CXXFLAGS += --relocatable-device-code=true
+ KOKKOS_LDFLAGS += --relocatable-device-code=true
+endif
+
+#Add Architecture flags
+
+ifeq ($(KOKKOS_INTERNAL_USE_ARCH_AVX), 1)
+ KOKKOS_CXXFLAGS += -mavx
+ KOKKOS_LDFLAGS += -mavx
+endif
+
+ifeq ($(KOKKOS_INTERNAL_USE_ARCH_AVX2), 1)
+ KOKKOS_CXXFLAGS += -xcore-avx2
+ KOKKOS_LDFLAGS += -xcore-avx2
+endif
+
+ifeq ($(KOKKOS_INTERNAL_USE_ARCH_KNC), 1)
+ KOKKOS_CXXFLAGS += -mmic
+ KOKKOS_LDFLAGS += -mmic
+endif
+
+ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
+ifeq ($(KOKKOS_INTERNAL_USE_ARCH_KEPLER30), 1)
+ KOKKOS_CXXFLAGS += -arch=sm_30
+endif
+ifeq ($(KOKKOS_INTERNAL_USE_ARCH_KEPLER32), 1)
+ KOKKOS_CXXFLAGS += -arch=sm_32
+endif
+ifeq ($(KOKKOS_INTERNAL_USE_ARCH_KEPLER35), 1)
+ KOKKOS_CXXFLAGS += -arch=sm_35
+endif
+ifeq ($(KOKKOS_INTERNAL_USE_ARCH_KEPLER37), 1)
+ KOKKOS_CXXFLAGS += -arch=sm_37
+endif
+ifeq ($(KOKKOS_INTERNAL_USE_ARCH_MAXWELL50), 1)
+ KOKKOS_CXXFLAGS += -arch=sm_50
+endif
+ifeq ($(KOKKOS_INTERNAL_USE_ARCH_MAXWELL52), 1)
+ KOKKOS_CXXFLAGS += -arch=sm_52
+endif
+ifeq ($(KOKKOS_INTERNAL_USE_ARCH_MAXWELL53), 1)
+ KOKKOS_CXXFLAGS += -arch=sm_53
+endif
+endif
+
+KOKKOS_INTERNAL_LS_CONFIG := $(shell ls KokkosCore_config.h)
+ifeq ($(KOKKOS_INTERNAL_LS_CONFIG), KokkosCore_config.h)
+KOKKOS_INTERNAL_NEW_CONFIG := $(strip $(shell diff KokkosCore_config.h KokkosCore_config.tmp | grep define | wc -l))
+else
+KOKKOS_INTERNAL_NEW_CONFIG := 1
+endif
+
+ifneq ($(KOKKOS_INTERNAL_NEW_CONFIG), 0)
+ tmp := $(shell cp KokkosCore_config.tmp KokkosCore_config.h)
+endif
+
+KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/core/src/*.hpp)
+KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/core/src/impl/*.hpp)
+KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/containers/src/*.hpp)
+KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/containers/src/impl/*.hpp)
+KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/algorithms/src/*.hpp)
+
+KOKKOS_SRC += $(wildcard $(KOKKOS_PATH)/core/src/impl/*.cpp)
+KOKKOS_SRC += $(wildcard $(KOKKOS_PATH)/containers/src/impl/*.cpp)
+
+ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
+ KOKKOS_SRC += $(wildcard $(KOKKOS_PATH)/core/src/Cuda/*.cpp)
+ KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/core/src/Cuda/*.hpp)
+ KOKKOS_LDFLAGS += -L$(CUDA_PATH)/lib64
+ KOKKOS_LIBS += -lcudart -lcuda
+endif
+
+ifeq ($(KOKKOS_INTERNAL_USE_PTHREADS), 1)
+ KOKKOS_LIBS += -lpthread
+ KOKKOS_SRC += $(wildcard $(KOKKOS_PATH)/core/src/Threads/*.cpp)
+ KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/core/src/Threads/*.hpp)
+endif
+
+ifeq ($(KOKKOS_INTERNAL_USE_OPENMP), 1)
+ KOKKOS_SRC += $(wildcard $(KOKKOS_PATH)/core/src/OpenMP/*.cpp)
+ KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/core/src/OpenMP/*.hpp)
+ ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
+ KOKKOS_CXXFLAGS += -Xcompiler $(KOKKOS_INTERNAL_OPENMP_FLAG)
+ else
+ KOKKOS_CXXFLAGS += $(KOKKOS_INTERNAL_OPENMP_FLAG)
+ endif
+ KOKKOS_LDFLAGS += $(KOKKOS_INTERNAL_OPENMP_FLAG)
+endif
+
+
+# Setting up dependencies
+
+KokkosCore_config.h:
+
+KOKKOS_CPP_DEPENDS := KokkosCore_config.h $(KOKKOS_HEADERS)
+
+KOKKOS_OBJ = $(KOKKOS_SRC:.cpp=.o)
+KOKKOS_OBJ_LINK = $(notdir $(KOKKOS_OBJ))
+
+include $(KOKKOS_PATH)/Makefile.targets
+
+kokkos-clean:
+ rm -f $(KOKKOS_OBJ_LINK) KokkosCore_config.h KokkosCore_config.tmp libkokkos.a
+
+libkokkos.a: $(KOKKOS_OBJ_LINK) $(KOKKOS_SRC) $(KOKKOS_HEADERS)
+ ar cr libkokkos.a $(KOKKOS_OBJ_LINK)
+
+KOKKOS_LINK_DEPENDS=libkokkos.a
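Note: the new Makefile.kokkos is meant to be included from an application Makefile. The application sets the configuration variables documented at the top of the file (KOKKOS_DEVICES, KOKKOS_ARCH, KOKKOS_DEBUG, KOKKOS_USE_TPLS, KOKKOS_CUDA_OPTIONS) before the include and then consumes the exported KOKKOS_CPPFLAGS, KOKKOS_CXXFLAGS, KOKKOS_LDFLAGS, KOKKOS_LIBS, KOKKOS_CPP_DEPENDS and KOKKOS_LINK_DEPENDS variables in its own compile and link rules. The fragment below is only a minimal sketch of such a consumer Makefile; the target and file names (app, app.cpp) are hypothetical and not part of this patch.

# Sketch of an application Makefile consuming Makefile.kokkos (illustrative only;
# recipe lines must be indented with a tab).
KOKKOS_DEVICES = OpenMP
KOKKOS_ARCH = SNB
include ../../lib/kokkos/Makefile.kokkos

CXX ?= g++

# Link step: libkokkos.a is produced by the rules pulled in via Makefile.targets.
app: app.o $(KOKKOS_LINK_DEPENDS)
	$(CXX) $(KOKKOS_LDFLAGS) app.o $(KOKKOS_LIBS) -o app

# Compile step: Kokkos include paths, flags and header dependencies.
app.o: app.cpp $(KOKKOS_CPP_DEPENDS)
	$(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c app.cpp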
diff --git a/lib/kokkos/Makefile.lammps b/lib/kokkos/Makefile.lammps
deleted file mode 100755
index 00b55f4f6..000000000
--- a/lib/kokkos/Makefile.lammps
+++ /dev/null
@@ -1,171 +0,0 @@
-# This Makefile is intended to be included in an application Makefile.
-# It will append the OBJ variable with objects which need to be built for Kokkos.
-# It also will produce a KOKKOS_INC and a KOKKOS_LINK variable which must be
-# appended to the compile and link flags of the application Makefile.
-# Note that you cannot compile and link at the same time!
-# If you want to include dependencies (i.e. trigger a rebuild of the application
-# object files when Kokkos files change), you can include KOKKOS_HEADERS in your
-# dependency list.
-# The Makefile uses a number of variables which can be set on the command line, or
-# in the application Makefile prior to including this Makefile. These variables set
-# certain build options and are explained in the following.
-
-# Directory path to the Kokkos source directory (this could be the kokkos directory
-# in the Trilinos git repository)
-KOKKOS_PATH ?= ../../lib/kokkos
-# Directory paths to libraries potentially used by Kokkos (if the respective options
-# are chosen)
-CUDA_PATH ?= /usr/local/cuda
-HWLOC_PATH ?= /usr/local/hwloc/default
-
-# Device options: enable Pthreads, OpenMP and/or CUDA device (if none is enabled
-# the Serial device will be used)
-PTHREADS ?= yes
-OMP ?= yes
-CUDA ?= no
-
-# Build for Debug mode: add debug flags and enable boundschecks within Kokkos
-DEBUG ?= no
-
-# Code generation options: use AVX instruction set; build for Xeon Phi (MIC); use
-# reduced precision math (sets compiler flags such as --fast_math)
-AVX ?= no
-MIC ?= no
-RED_PREC ?=no
-
-# Optional Libraries: use hwloc for thread affinity; use librt for timers
-HWLOC ?= no
-LIBRT ?= no
-
-# CUDA specific options: use UVM (requires CUDA 6+); use LDG loads instead of
-# texture fetches; compile for relocatable device code (function pointers)
-CUDA_UVM ?= no
-CUDA_LDG ?= no
-CUDA_RELOC ?= no
-
-# Settings for replacing generic linear algebra kernels of Kokkos with vendor
-# libraries.
-CUSPARSE ?= no
-CUBLAS ?= no
-
-#Typically nothing should be changed after this point
-
-KOKKOS_INC = -I$(KOKKOS_PATH)/core/src -I$(KOKKOS_PATH)/containers/src -I$(KOKKOS_PATH)/algorithms/src -I$(KOKKOS_PATH)/linalg/src -I../ -DKOKKOS_DONT_INCLUDE_CORE_CONFIG_H
-
-KOKKOS_HEADERS = $(wildcard $(KOKKOS_PATH)/core/src/*.hpp)
-KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/core/src/impl/*.hpp)
-KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/containers/src/*.hpp)
-KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/containers/src/impl/*.hpp)
-KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/linalg/src/*.hpp)
-
-SRC_KOKKOS = $(wildcard $(KOKKOS_PATH)/core/src/impl/*.cpp)
-SRC_KOKKOS += $(wildcard $(KOKKOS_PATH)/containers/src/impl/*.cpp)
-KOKKOS_LIB = libkokkoscore.a
-
-ifeq ($(CUDA), yes)
-KOKKOS_INC += -x cu -DKOKKOS_HAVE_CUDA
-SRC_KOKKOS += $(wildcard $(KOKKOS_PATH)/core/src/Cuda/*.cpp)
-SRC_KOKKOS += $(wildcard $(KOKKOS_PATH)/core/src/Cuda/*.cu)
-KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/core/src/Cuda/*.hpp)
-KOKKOS_LINK += -L$(CUDA_PATH)/lib64 -lcudart -lcuda
-ifeq ($(CUDA_UVM), yes)
-KOKKOS_INC += -DKOKKOS_USE_CUDA_UVM
-endif
-endif
-
-ifeq ($(CUSPARSE), yes)
-KOKKOS_INC += -DKOKKOS_USE_CUSPARSE
-KOKKOS_LIB += -lcusparse
-endif
-
-ifeq ($(CUBLAS), yes)
-KOKKOS_INC += -DKOKKOS_USE_CUBLAS
-KOKKOS_LIB += -lcublas
-endif
-
-ifeq ($(MIC), yes)
-KOKKOS_INC += -mmic
-KOKKOS_LINK += -mmic
-AVX = no
-endif
-
-ifeq ($(AVX), yes)
-ifeq ($(CUDA), yes)
-KOKKOS_INC += -Xcompiler -mavx
-else
-KOKKOS_INC += -mavx
-endif
-KOKKOS_LINK += -mavx
-endif
-
-ifeq ($(PTHREADS),yes)
-KOKKOS_INC += -DKOKKOS_HAVE_PTHREAD
-KOKKOS_LIB += -lpthread
-SRC_KOKKOS += $(wildcard $(KOKKOS_PATH)/core/src/Threads/*.cpp)
-KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/core/src/Threads/*.hpp)
-endif
-
-ifeq ($(OMP),yes)
-KOKKOS_INC += -DKOKKOS_HAVE_OPENMP
-SRC_KOKKOS += $(wildcard $(KOKKOS_PATH)/core/src/OpenMP/*.cpp)
-KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/core/src/OpenMP/*.hpp)
-ifeq ($(CUDA), yes)
-KOKKOS_INC += -Xcompiler -fopenmp
-KOKKOS_LINK += -Xcompiler -fopenmp
-else
-KOKKOS_INC += -fopenmp
-KOKKOS_LINK += -fopenmp
-endif
-endif
-
-ifeq ($(HWLOC),yes)
-KOKKOS_INC += -DKOKKOS_HAVE_HWLOC -I$(HWLOC_PATH)/include
-KOKKOS_LINK += -L$(HWLOC_PATH)/lib -lhwloc
-endif
-
-ifeq ($(RED_PREC), yes)
-KOKKOS_INC += --use_fast_math
-endif
-
-ifeq ($(DEBUG), yes)
-ifeq ($(CUDA), yes)
-KOKKOS_INC += -G
-endif
-KOKKOS_INC += -g -DKOKKOS_EXPRESSION_CHECK -DENABLE_TRACEBACK
-KOKKOS_LINK += -g -ldl
-endif
-
-ifeq ($(LIBRT),yes)
-KOKKOS_INC += -DKOKKOS_USE_LIBRT -DPREC_TIMER
-KOKKOS_LIB += -lrt
-endif
-
-ifeq ($(CUDA_LDG), yes)
-KOKKOS_INC += -DKOKKOS_USE_LDG_INTRINSIC
-endif
-
-ifeq ($(CUDA), yes)
-ifeq ($(CUDA_RELOC), yes)
-KOKKOS_INC += -DKOKKOS_CUDA_USE_RELOCATABLE_DEVICE_CODE --relocatable-device-code=true
-KOKKOS_LINK += --relocatable-device-code=true
-endif
-endif
-
-# Must build with C++11
-KOKKOS_INC += --std=c++11 -DKOKKOS_HAVE_CXX11
-
-OBJ_KOKKOS_TMP = $(SRC_KOKKOS:.cpp=.o)
-OBJ_KOKKOS = $(OBJ_KOKKOS_TMP:.cu=.o)
-OBJ_KOKKOS_LINK = $(notdir $(OBJ_KOKKOS))
-
-override OBJ += kokkos_depend.o
-
-libkokkoscore.a: $(OBJ_KOKKOS)
- ar cr libkokkoscore.a $(OBJ_KOKKOS_LINK)
-
-kokkos_depend.o: libkokkoscore.a
- touch kokkos_depend.cpp
- $(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c kokkos_depend.cpp
-
-
-KOKKOS_LINK += -L./ $(KOKKOS_LIB)
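For comparison, the removed Makefile.lammps above followed a different convention: it appended a kokkos_depend.o object to the application's OBJ variable and exposed KOKKOS_INC and KOKKOS_LINK, which had to be added to the compile flags and the link flags respectively (its header notes that compiling and linking cannot share one set of flags). The fragment below is a hedged sketch of that older usage; OBJ, CCFLAGS and LINKFLAGS stand in for whatever variable names the application Makefile actually used.

# Sketch of the old Makefile.lammps usage (illustrative assumptions only).
OMP = yes
CUDA = no
include ../../lib/kokkos/Makefile.lammps    # appends kokkos_depend.o to OBJ

CCFLAGS += $(KOKKOS_INC)      # compile-only flags: include paths and -D defines
LINKFLAGS += $(KOKKOS_LINK)   # link-only flags: -L./ plus libkokkoscore.a and friends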
diff --git a/lib/kokkos/Makefile.targets b/lib/kokkos/Makefile.targets
new file mode 100755
index 000000000..86708ac80
--- /dev/null
+++ b/lib/kokkos/Makefile.targets
@@ -0,0 +1,50 @@
+Kokkos_UnorderedMap_impl.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/containers/src/impl/Kokkos_UnorderedMap_impl.cpp
+ $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/containers/src/impl/Kokkos_UnorderedMap_impl.cpp
+Kokkos_AllocationTracker.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/impl/Kokkos_AllocationTracker.cpp
+ $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/impl/Kokkos_AllocationTracker.cpp
+Kokkos_BasicAllocators.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/impl/Kokkos_BasicAllocators.cpp
+ $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/impl/Kokkos_BasicAllocators.cpp
+Kokkos_Core.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/impl/Kokkos_Core.cpp
+ $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/impl/Kokkos_Core.cpp
+Kokkos_Error.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/impl/Kokkos_Error.cpp
+ $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/impl/Kokkos_Error.cpp
+Kokkos_HostSpace.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/impl/Kokkos_HostSpace.cpp
+ $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/impl/Kokkos_HostSpace.cpp
+Kokkos_hwloc.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/impl/Kokkos_hwloc.cpp
+ $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/impl/Kokkos_hwloc.cpp
+Kokkos_Serial.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/impl/Kokkos_Serial.cpp
+ $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/impl/Kokkos_Serial.cpp
+Kokkos_Serial_TaskPolicy.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/impl/Kokkos_Serial_TaskPolicy.cpp
+ $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/impl/Kokkos_Serial_TaskPolicy.cpp
+Kokkos_Shape.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/impl/Kokkos_Shape.cpp
+ $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/impl/Kokkos_Shape.cpp
+Kokkos_spinwait.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/impl/Kokkos_spinwait.cpp
+ $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/impl/Kokkos_spinwait.cpp
+Kokkos_Profiling_Interface.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/impl/Kokkos_Profiling_Interface.cpp
+ $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/impl/Kokkos_Profiling_Interface.cpp
+KokkosExp_SharedAlloc.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/impl/KokkosExp_SharedAlloc.cpp
+ $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/impl/KokkosExp_SharedAlloc.cpp
+
+ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
+Kokkos_Cuda_BasicAllocators.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/Cuda/Kokkos_Cuda_BasicAllocators.cpp
+ $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/Cuda/Kokkos_Cuda_BasicAllocators.cpp
+Kokkos_Cuda_Impl.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/Cuda/Kokkos_Cuda_Impl.cpp
+ $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/Cuda/Kokkos_Cuda_Impl.cpp
+Kokkos_CudaSpace.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/Cuda/Kokkos_CudaSpace.cpp
+ $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/Cuda/Kokkos_CudaSpace.cpp
+endif
+
+ifeq ($(KOKKOS_INTERNAL_USE_PTHREADS), 1)
+Kokkos_ThreadsExec_base.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/Threads/Kokkos_ThreadsExec_base.cpp
+ $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/Threads/Kokkos_ThreadsExec_base.cpp
+Kokkos_ThreadsExec.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/Threads/Kokkos_ThreadsExec.cpp
+ $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/Threads/Kokkos_ThreadsExec.cpp
+Kokkos_Threads_TaskPolicy.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/Threads/Kokkos_Threads_TaskPolicy.cpp
+ $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/Threads/Kokkos_Threads_TaskPolicy.cpp
+endif
+
+ifeq ($(KOKKOS_INTERNAL_USE_OPENMP), 1)
+Kokkos_OpenMPexec.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/OpenMP/Kokkos_OpenMPexec.cpp
+ $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/OpenMP/Kokkos_OpenMPexec.cpp
+endif
+
diff --git a/lib/kokkos/README b/lib/kokkos/README
index 59f5685ba..f979495bf 100755
--- a/lib/kokkos/README
+++ b/lib/kokkos/README
@@ -1,44 +1,97 @@
-Kokkos library
-
-Carter Edwards, Christian Trott, Daniel Sunderland
-Sandia National Labs
-
-29 May 2014
-http://trilinos.sandia.gov/packages/kokkos/
-
--------------------------
-
-This directory has source files from the Kokkos library that LAMMPS
-uses when building with its KOKKOS package. The package contains
-versions of pair, fix, and atom styles written with Kokkos data
-structures and calls to the Kokkos library that should run efficiently
-on various kinds of accelerated nodes, including GPU and many-core
-chips.
-
-Kokkos is a C++ library that provides two key abstractions for an
-application like LAMMPS. First, it allows a single implementation of
-an application kernel (e.g. a pair style) to run efficiently on
-different kinds of hardware (GPU, Intel Phi, many-core chip).
-
-Second, it provides data abstractions to adjust (at compile time) the
-memory layout of basic data structures like 2d and 3d arrays and allow
-the transparent utilization of special hardware load and store units.
-Such data structures are used in LAMMPS to store atom coordinates or
-forces or neighbor lists. The layout is chosen to optimize
-performance on different platforms. Again this operation is hidden
-from the developer, and does not affect how the single implementation
-of the kernel is coded.
-
-To build LAMMPS with Kokkos, you should not need to make any changes
-to files in this directory. You can override defaults that are set
-in Makefile.lammps when building LAMMPS, by defining variables as part
-of the make command. Details of the build process with Kokkos are
-explained in Section 2.3 of doc/Section_start.html and in Section 5.9
-of doc/Section_accelerate.html.
-
-The one exception is that when using Kokkos with NVIDIA GPUs, the
-CUDA_PATH setting in Makefile.lammps needs to point to the
-installation of the Cuda software on your machine. The normal default
-location is /usr/local/cuda. If this is not correct, you need to edit
-Makefile.lammps.
+Kokkos implements a programming model in C++ for writing performance portable
+applications targeting all major HPC platforms. For that purpose it provides
+abstractions for both parallel execution of code and data management.
+Kokkos is designed to target complex node architectures with N-level memory
+hierarchies and multiple types of execution resources. It currently can use
+OpenMP, Pthreads and CUDA as backend programming models.
+
+The core developers of Kokkos are Carter Edwards and Christian Trott
+at the Computer Science Research Institute of the Sandia National
+Laboratories.
+
+The KokkosP interface and associated tools are developed by the Application
+Performance Team and Kokkos core developers at Sandia National Laboratories.
+
+To learn more about Kokkos consider watching one of our presentations:
+GTC 2015:
+ http://on-demand.gputechconf.com/gtc/2015/video/S5166.html
+ http://on-demand.gputechconf.com/gtc/2015/presentation/S5166-H-Carter-Edwards.pdf
+
+A programming guide can be found under doc/Kokkos_PG.pdf. This is an initial version
+and feedback is greatly appreciated.
+
+For questions please send an email to
+kokkos-users@software.sandia.gov
+
+For non-public questions send an email to
+hcedwar(at)sandia.gov and crtrott(at)sandia.gov
+
+============================================================================
+====Requirements============================================================
+============================================================================
+
+Primary tested compilers are:
+ GCC 4.7.2
+ GCC 5.1.0
+ Intel 14.0.1
+ Intel 15.0.1
+ Clang 3.7.0
+
+Secondary tested compilers are:
+ CUDA 6.5
+ CUDA 7.0
+
+Primary tested compilers are passing in release mode
+with warnings as errors. We are using the following set
+of flags:
+GCC: -Wall -Wshadow -pedantic -Werror -Wsign-compare -Wtype-limits
+ -Wignored-qualifiers -Wempty-body -Wclobbered -Wuninitialized
+Intel: -Wall -Wshadow -pedantic -Werror -Wsign-compare -Wtype-limits -Wuninitialized
+Clang: -Wall -Wshadow -pedantic -Werror -Wsign-compare -Wtype-limits -Wuninitialized
+
+
+============================================================================
+====Getting started=========================================================
+============================================================================
+
+In the 'example/tutorial' directory you will find step by step tutorial
+examples which explain many of the features of Kokkos. They work with
+simple Makefiles. To build with g++ and OpenMP simply type 'make openmp'
+in the 'example/tutorial' directory. This will build all examples in the
+subfolders.
+
+============================================================================
+====Running Unit Tests======================================================
+============================================================================
+
+To run the unit tests create a build directory and run the following commands
+
+KOKKOS_PATH/generate_makefile.bash
+make build-test
+make test
+
+Run KOKKOS_PATH/generate_makefile.bash --help for more detailed options such as
+changing the device type for which to build.
+
+============================================================================
+====Install the library=====================================================
+============================================================================
+
+To install Kokkos as a library create a build directory and run the following
+
+KOKKOS_PATH/generate_makefile.bash --prefix=INSTALL_PATH
+make lib
+make install
+
+Run KOKKOS_PATH/generate_makefile.bash --help for more detailed options such as
+changing the device type for which to build.
+
+============================================================================
+====CMakeFiles==============================================================
+============================================================================
+
+The CMake files contained in this repository require Tribits and are used
+for integration with Trilinos. They do not currently support a standalone
+CMake build.
+
diff --git a/lib/kokkos/TPL/cmake/Dependencies.cmake b/lib/kokkos/TPL/cmake/Dependencies.cmake
deleted file mode 100755
index 7ea652bf3..000000000
--- a/lib/kokkos/TPL/cmake/Dependencies.cmake
+++ /dev/null
@@ -1,9 +0,0 @@
-SET(LIB_REQUIRED_DEP_PACKAGES)
-SET(LIB_OPTIONAL_DEP_PACKAGES)
-SET(TEST_REQUIRED_DEP_PACKAGES)
-SET(TEST_OPTIONAL_DEP_PACKAGES)
-SET(LIB_REQUIRED_DEP_TPLS)
-# Only dependency:
-SET(LIB_OPTIONAL_DEP_TPLS CUDA)
-SET(TEST_REQUIRED_DEP_TPLS )
-SET(TEST_OPTIONAL_DEP_TPLS )
diff --git a/lib/kokkos/TPL/cub/block/block_discontinuity.cuh b/lib/kokkos/TPL/cub/block/block_discontinuity.cuh
deleted file mode 100755
index 76af003e5..000000000
--- a/lib/kokkos/TPL/cub/block/block_discontinuity.cuh
+++ /dev/null
@@ -1,587 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * The cub::BlockDiscontinuity class provides [<em>collective</em>](index.html#sec0) methods for flagging discontinuities within an ordered set of items partitioned across a CUDA thread block.
- */
-
-#pragma once
-
-#include "../util_type.cuh"
-#include "../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-/**
- * \brief The BlockDiscontinuity class provides [<em>collective</em>](index.html#sec0) methods for flagging discontinuities within an ordered set of items partitioned across a CUDA thread block. ![](discont_logo.png)
- * \ingroup BlockModule
- *
- * \par Overview
- * A set of "head flags" (or "tail flags") is often used to indicate corresponding items
- * that differ from their predecessors (or successors). For example, head flags are convenient
- * for demarcating disjoint data segments as part of a segmented scan or reduction.
- *
- * \tparam T The data type to be flagged.
- * \tparam BLOCK_THREADS The thread block size in threads.
- *
- * \par A Simple Example
- * \blockcollective{BlockDiscontinuity}
- * \par
- * The code snippet below illustrates the head flagging of 512 integer items that
- * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec4) across 128 threads
- * where each thread owns 4 consecutive items.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize BlockDiscontinuity for 128 threads on type int
- * typedef cub::BlockDiscontinuity<int, 128> BlockDiscontinuity;
- *
- * // Allocate shared memory for BlockDiscontinuity
- * __shared__ typename BlockDiscontinuity::TempStorage temp_storage;
- *
- * // Obtain a segment of consecutive items that are blocked across threads
- * int thread_data[4];
- * ...
- *
- * // Collectively compute head flags for discontinuities in the segment
- * int head_flags[4];
- * BlockDiscontinuity(temp_storage).FlagHeads(head_flags, thread_data, cub::Inequality());
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is
- * <tt>{ [0,0,1,1], [1,1,1,1], [2,3,3,3], [3,4,4,4], ... }</tt>.
- * The corresponding output \p head_flags in those threads will be
- * <tt>{ [1,0,1,0], [0,0,0,0], [1,1,0,0], [0,1,0,0], ... }</tt>.
- *
- * \par Performance Considerations
- * - Zero bank conflicts for most types.
- *
- */
-template <
- typename T,
- int BLOCK_THREADS>
-class BlockDiscontinuity
-{
-private:
-
- /******************************************************************************
- * Type definitions
- ******************************************************************************/
-
- /// Shared memory storage layout type (last element from each thread's input)
- typedef T _TempStorage[BLOCK_THREADS];
-
-
- /******************************************************************************
- * Utility methods
- ******************************************************************************/
-
- /// Internal storage allocator
- __device__ __forceinline__ _TempStorage& PrivateStorage()
- {
- __shared__ _TempStorage private_storage;
- return private_storage;
- }
-
-
- /// Specialization for when FlagOp has third index param
- template <typename FlagOp, bool HAS_PARAM = BinaryOpHasIdxParam<T, FlagOp>::HAS_PARAM>
- struct ApplyOp
- {
- // Apply flag operator
- static __device__ __forceinline__ bool Flag(FlagOp flag_op, const T &a, const T &b, int idx)
- {
- return flag_op(a, b, idx);
- }
- };
-
- /// Specialization for when FlagOp does not have a third index param
- template <typename FlagOp>
- struct ApplyOp<FlagOp, false>
- {
- // Apply flag operator
- static __device__ __forceinline__ bool Flag(FlagOp flag_op, const T &a, const T &b, int idx)
- {
- return flag_op(a, b);
- }
- };
-
-
- /******************************************************************************
- * Thread fields
- ******************************************************************************/
-
- /// Shared storage reference
- _TempStorage &temp_storage;
-
- /// Linear thread-id
- int linear_tid;
-
-
-public:
-
- /// \smemstorage{BlockDiscontinuity}
- struct TempStorage : Uninitialized<_TempStorage> {};
-
-
- /******************************************************************//**
- * \name Collective constructors
- *********************************************************************/
- //@{
-
- /**
- * \brief Collective constructor for 1D thread blocks using a private static allocation of shared memory as temporary storage. Threads are identified using <tt>threadIdx.x</tt>.
- */
- __device__ __forceinline__ BlockDiscontinuity()
- :
- temp_storage(PrivateStorage()),
- linear_tid(threadIdx.x)
- {}
-
-
- /**
- * \brief Collective constructor for 1D thread blocks using the specified memory allocation as temporary storage. Threads are identified using <tt>threadIdx.x</tt>.
- */
- __device__ __forceinline__ BlockDiscontinuity(
- TempStorage &temp_storage) ///< [in] Reference to memory allocation having layout type TempStorage
- :
- temp_storage(temp_storage.Alias()),
- linear_tid(threadIdx.x)
- {}
-
-
- /**
- * \brief Collective constructor using a private static allocation of shared memory as temporary storage. Each thread is identified using the supplied linear thread identifier
- */
- __device__ __forceinline__ BlockDiscontinuity(
- int linear_tid) ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + linear_tid</tt> for 2D thread blocks)
- :
- temp_storage(PrivateStorage()),
- linear_tid(linear_tid)
- {}
-
-
- /**
- * \brief Collective constructor using the specified memory allocation as temporary storage. Each thread is identified using the supplied linear thread identifier.
- */
- __device__ __forceinline__ BlockDiscontinuity(
- TempStorage &temp_storage, ///< [in] Reference to memory allocation having layout type TempStorage
- int linear_tid) ///< [in] <b>[optional]</b> A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + linear_tid</tt> for 2D thread blocks)
- :
- temp_storage(temp_storage.Alias()),
- linear_tid(linear_tid)
- {}
-
-
-
- //@} end member group
- /******************************************************************//**
- * \name Head flag operations
- *********************************************************************/
- //@{
-
-
- /**
- * \brief Sets head flags indicating discontinuities between items partitioned across the thread block, for which the first item has no reference and is always flagged.
- *
- * The flag <tt>head_flags<sub><em>i</em></sub></tt> is set for item
- * <tt>input<sub><em>i</em></sub></tt> when
- * <tt>flag_op(</tt><em>previous-item</em><tt>, input<sub><em>i</em></sub>)</tt>
- * returns \p true (where <em>previous-item</em> is either the preceding item
- * in the same thread or the last item in the previous thread).
- * Furthermore, <tt>head_flags<sub><em>i</em></sub></tt> is always set for
- * <tt>input><sub>0</sub></tt> in <em>thread</em><sub>0</sub>.
- *
- * \blocked
- *
- * \smemreuse
- *
- * The code snippet below illustrates the head-flagging of 512 integer items that
- * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec4) across 128 threads
- * where each thread owns 4 consecutive items.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize BlockDiscontinuity for 128 threads on type int
- * typedef cub::BlockDiscontinuity<int, 128> BlockDiscontinuity;
- *
- * // Allocate shared memory for BlockDiscontinuity
- * __shared__ typename BlockDiscontinuity::TempStorage temp_storage;
- *
- * // Obtain a segment of consecutive items that are blocked across threads
- * int thread_data[4];
- * ...
- *
- * // Collectively compute head flags for discontinuities in the segment
- * int head_flags[4];
- * BlockDiscontinuity(temp_storage).FlagHeads(head_flags, thread_data, cub::Inequality());
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is
- * <tt>{ [0,0,1,1], [1,1,1,1], [2,3,3,3], [3,4,4,4], ... }</tt>.
- * The corresponding output \p head_flags in those threads will be
- * <tt>{ [1,0,1,0], [0,0,0,0], [1,1,0,0], [0,1,0,0], ... }</tt>.
- *
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- * \tparam FlagT <b>[inferred]</b> The flag type (must be an integer type)
- * \tparam FlagOp <b>[inferred]</b> Binary predicate functor type having member <tt>T operator()(const T &a, const T &b)</tt> or member <tt>T operator()(const T &a, const T &b, unsigned int b_index)</tt>, and returning \p true if a discontinuity exists between \p a and \p b, otherwise \p false. \p b_index is the rank of b in the aggregate tile of data.
- */
- template <
- int ITEMS_PER_THREAD,
- typename FlagT,
- typename FlagOp>
- __device__ __forceinline__ void FlagHeads(
- FlagT (&head_flags)[ITEMS_PER_THREAD], ///< [out] Calling thread's discontinuity head_flags
- T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items
- FlagOp flag_op) ///< [in] Binary boolean flag predicate
- {
- // Share last item
- temp_storage[linear_tid] = input[ITEMS_PER_THREAD - 1];
-
- __syncthreads();
-
- // Set flag for first item
- head_flags[0] = (linear_tid == 0) ?
- 1 : // First thread
- ApplyOp<FlagOp>::Flag(
- flag_op,
- temp_storage[linear_tid - 1],
- input[0],
- linear_tid * ITEMS_PER_THREAD);
-
- // Set head_flags for remaining items
- #pragma unroll
- for (int ITEM = 1; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- head_flags[ITEM] = ApplyOp<FlagOp>::Flag(
- flag_op,
- input[ITEM - 1],
- input[ITEM],
- (linear_tid * ITEMS_PER_THREAD) + ITEM);
- }
- }
-
-
- /**
- * \brief Sets head flags indicating discontinuities between items partitioned across the thread block.
- *
- * The flag <tt>head_flags<sub><em>i</em></sub></tt> is set for item
- * <tt>input<sub><em>i</em></sub></tt> when
- * <tt>flag_op(</tt><em>previous-item</em><tt>, input<sub><em>i</em></sub>)</tt>
- * returns \p true (where <em>previous-item</em> is either the preceding item
- * in the same thread or the last item in the previous thread).
- * For <em>thread</em><sub>0</sub>, item <tt>input<sub>0</sub></tt> is compared
- * against \p tile_predecessor_item.
- *
- * \blocked
- *
- * \smemreuse
- *
- * The code snippet below illustrates the head-flagging of 512 integer items that
- * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec4) across 128 threads
- * where each thread owns 4 consecutive items.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize BlockDiscontinuity for 128 threads on type int
- * typedef cub::BlockDiscontinuity<int, 128> BlockDiscontinuity;
- *
- * // Allocate shared memory for BlockDiscontinuity
- * __shared__ typename BlockDiscontinuity::TempStorage temp_storage;
- *
- * // Obtain a segment of consecutive items that are blocked across threads
- * int thread_data[4];
- * ...
- *
- * // Have thread0 obtain the predecessor item for the entire tile
- * int tile_predecessor_item;
- * if (threadIdx.x == 0) tile_predecessor_item = ...
- *
- * // Collectively compute head flags for discontinuities in the segment
- * int head_flags[4];
- * BlockDiscontinuity(temp_storage).FlagHeads(
- * head_flags, thread_data, cub::Inequality(), tile_predecessor_item);
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is
- * <tt>{ [0,0,1,1], [1,1,1,1], [2,3,3,3], [3,4,4,4], ... }</tt>,
- * and that \p tile_predecessor_item is \p 0. The corresponding output \p head_flags in those threads will be
- * <tt>{ [0,0,1,0], [0,0,0,0], [1,1,0,0], [0,1,0,0], ... }</tt>.
- *
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- * \tparam FlagT <b>[inferred]</b> The flag type (must be an integer type)
- * \tparam FlagOp <b>[inferred]</b> Binary predicate functor type having member <tt>T operator()(const T &a, const T &b)</tt> or member <tt>T operator()(const T &a, const T &b, unsigned int b_index)</tt>, and returning \p true if a discontinuity exists between \p a and \p b, otherwise \p false. \p b_index is the rank of b in the aggregate tile of data.
- */
- template <
- int ITEMS_PER_THREAD,
- typename FlagT,
- typename FlagOp>
- __device__ __forceinline__ void FlagHeads(
- FlagT (&head_flags)[ITEMS_PER_THREAD], ///< [out] Calling thread's discontinuity head_flags
- T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items
- FlagOp flag_op, ///< [in] Binary boolean flag predicate
- T tile_predecessor_item) ///< [in] <b>[<em>thread</em><sub>0</sub> only]</b> Item with which to compare the first tile item (<tt>input<sub>0</sub></tt> from <em>thread</em><sub>0</sub>).
- {
- // Share last item
- temp_storage[linear_tid] = input[ITEMS_PER_THREAD - 1];
-
- __syncthreads();
-
- // Set flag for first item
- int predecessor = (linear_tid == 0) ?
- tile_predecessor_item : // First thread
- temp_storage[linear_tid - 1];
-
- head_flags[0] = ApplyOp<FlagOp>::Flag(
- flag_op,
- predecessor,
- input[0],
- linear_tid * ITEMS_PER_THREAD);
-
- // Set flag for remaining items
- #pragma unroll
- for (int ITEM = 1; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- head_flags[ITEM] = ApplyOp<FlagOp>::Flag(
- flag_op,
- input[ITEM - 1],
- input[ITEM],
- (linear_tid * ITEMS_PER_THREAD) + ITEM);
- }
- }
-
-
- //@} end member group
- /******************************************************************//**
- * \name Tail flag operations
- *********************************************************************/
- //@{
-
-
- /**
- * \brief Sets tail flags indicating discontinuities between items partitioned across the thread block, for which the last item has no reference and is always flagged.
- *
- * The flag <tt>tail_flags<sub><em>i</em></sub></tt> is set for item
- * <tt>input<sub><em>i</em></sub></tt> when
- * <tt>flag_op(input<sub><em>i</em></sub>, </tt><em>next-item</em><tt>)</tt>
- * returns \p true (where <em>next-item</em> is either the next item
- * in the same thread or the first item in the next thread).
- * Furthermore, <tt>tail_flags<sub>ITEMS_PER_THREAD-1</sub></tt> is always
- * set for <em>thread</em><sub><tt>BLOCK_THREADS</tt>-1</sub>.
- *
- * \blocked
- *
- * \smemreuse
- *
- * The code snippet below illustrates the tail-flagging of 512 integer items that
- * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec4) across 128 threads
- * where each thread owns 4 consecutive items.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize BlockDiscontinuity for 128 threads on type int
- * typedef cub::BlockDiscontinuity<int, 128> BlockDiscontinuity;
- *
- * // Allocate shared memory for BlockDiscontinuity
- * __shared__ typename BlockDiscontinuity::TempStorage temp_storage;
- *
- * // Obtain a segment of consecutive items that are blocked across threads
- * int thread_data[4];
- * ...
- *
- * // Collectively compute tail flags for discontinuities in the segment
- * int tail_flags[4];
- * BlockDiscontinuity(temp_storage).FlagTails(tail_flags, thread_data, cub::Inequality());
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is
- * <tt>{ [0,0,1,1], [1,1,1,1], [2,3,3,3], ..., [124,125,125,125] }</tt>.
- * The corresponding output \p tail_flags in those threads will be
- * <tt>{ [0,1,0,0], [0,0,0,1], [1,0,0,...], ..., [1,0,0,1] }</tt>.
- *
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- * \tparam FlagT <b>[inferred]</b> The flag type (must be an integer type)
- * \tparam FlagOp <b>[inferred]</b> Binary predicate functor type having member <tt>T operator()(const T &a, const T &b)</tt> or member <tt>T operator()(const T &a, const T &b, unsigned int b_index)</tt>, and returning \p true if a discontinuity exists between \p a and \p b, otherwise \p false. \p b_index is the rank of b in the aggregate tile of data.
- */
- template <
- int ITEMS_PER_THREAD,
- typename FlagT,
- typename FlagOp>
- __device__ __forceinline__ void FlagTails(
- FlagT (&tail_flags)[ITEMS_PER_THREAD], ///< [out] Calling thread's discontinuity tail_flags
- T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items
- FlagOp flag_op) ///< [in] Binary boolean flag predicate
- {
- // Share first item
- temp_storage[linear_tid] = input[0];
-
- __syncthreads();
-
- // Set flag for last item
- tail_flags[ITEMS_PER_THREAD - 1] = (linear_tid == BLOCK_THREADS - 1) ?
- 1 : // Last thread
- ApplyOp<FlagOp>::Flag(
- flag_op,
- input[ITEMS_PER_THREAD - 1],
- temp_storage[linear_tid + 1],
- (linear_tid * ITEMS_PER_THREAD) + (ITEMS_PER_THREAD - 1));
-
- // Set flags for remaining items
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD - 1; ITEM++)
- {
- tail_flags[ITEM] = ApplyOp<FlagOp>::Flag(
- flag_op,
- input[ITEM],
- input[ITEM + 1],
- (linear_tid * ITEMS_PER_THREAD) + ITEM);
- }
- }
-
-
- /**
- * \brief Sets tail flags indicating discontinuities between items partitioned across the thread block.
- *
- * The flag <tt>tail_flags<sub><em>i</em></sub></tt> is set for item
- * <tt>input<sub><em>i</em></sub></tt> when
- * <tt>flag_op(input<sub><em>i</em></sub>, </tt><em>next-item</em><tt>)</tt>
- * returns \p true (where <em>next-item</em> is either the next item
- * in the same thread or the first item in the next thread).
- * For <em>thread</em><sub><em>BLOCK_THREADS</em>-1</sub>, item
- * <tt>input</tt><sub><em>ITEMS_PER_THREAD</em>-1</sub> is compared
- * against \p tile_successor_item.
- *
- * \blocked
- *
- * \smemreuse
- *
- * The code snippet below illustrates the tail-flagging of 512 integer items that
- * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec4) across 128 threads
- * where each thread owns 4 consecutive items.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize BlockDiscontinuity for 128 threads on type int
- * typedef cub::BlockDiscontinuity<int, 128> BlockDiscontinuity;
- *
- * // Allocate shared memory for BlockDiscontinuity
- * __shared__ typename BlockDiscontinuity::TempStorage temp_storage;
- *
- * // Obtain a segment of consecutive items that are blocked across threads
- * int thread_data[4];
- * ...
- *
- * // Have thread127 obtain the successor item for the entire tile
- * int tile_successor_item;
- * if (threadIdx.x == 127) tile_successor_item = ...
- *
- * // Collectively compute tail flags for discontinuities in the segment
- * int tail_flags[4];
- * BlockDiscontinuity(temp_storage).FlagTails(
- * tail_flags, thread_data, cub::Inequality(), tile_successor_item);
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is
- * <tt>{ [0,0,1,1], [1,1,1,1], [2,3,3,3], ..., [124,125,125,125] }</tt>
- * and that \p tile_successor_item is \p 125. The corresponding output \p tail_flags in those threads will be
- * <tt>{ [0,1,0,0], [0,0,0,1], [1,0,0,...], ..., [1,0,0,0] }</tt>.
- *
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- * \tparam FlagT <b>[inferred]</b> The flag type (must be an integer type)
- * \tparam FlagOp <b>[inferred]</b> Binary predicate functor type having member <tt>T operator()(const T &a, const T &b)</tt> or member <tt>T operator()(const T &a, const T &b, unsigned int b_index)</tt>, and returning \p true if a discontinuity exists between \p a and \p b, otherwise \p false. \p b_index is the rank of b in the aggregate tile of data.
- */
- template <
- int ITEMS_PER_THREAD,
- typename FlagT,
- typename FlagOp>
- __device__ __forceinline__ void FlagTails(
- FlagT (&tail_flags)[ITEMS_PER_THREAD], ///< [out] Calling thread's discontinuity tail_flags
- T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items
- FlagOp flag_op, ///< [in] Binary boolean flag predicate
- T tile_successor_item) ///< [in] <b>[<em>thread</em><sub><tt>BLOCK_THREADS</tt>-1</sub> only]</b> Item with which to compare the last tile item (<tt>input</tt><sub><em>ITEMS_PER_THREAD</em>-1</sub> from <em>thread</em><sub><em>BLOCK_THREADS</em>-1</sub>).
- {
- // Share first item
- temp_storage[linear_tid] = input[0];
-
- __syncthreads();
-
- // Set flag for last item
- int successor_item = (linear_tid == BLOCK_THREADS - 1) ?
- tile_successor_item : // Last thread
- temp_storage[linear_tid + 1];
-
- tail_flags[ITEMS_PER_THREAD - 1] = ApplyOp<FlagOp>::Flag(
- flag_op,
- input[ITEMS_PER_THREAD - 1],
- successor_item,
- (linear_tid * ITEMS_PER_THREAD) + (ITEMS_PER_THREAD - 1));
-
- // Set flags for remaining items
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD - 1; ITEM++)
- {
- tail_flags[ITEM] = ApplyOp<FlagOp>::Flag(
- flag_op,
- input[ITEM],
- input[ITEM + 1],
- (linear_tid * ITEMS_PER_THREAD) + ITEM);
- }
- }
-
- //@} end member group
-
-};
-
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
diff --git a/lib/kokkos/TPL/cub/block/block_exchange.cuh b/lib/kokkos/TPL/cub/block/block_exchange.cuh
deleted file mode 100755
index b7b95343b..000000000
--- a/lib/kokkos/TPL/cub/block/block_exchange.cuh
+++ /dev/null
@@ -1,918 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * The cub::BlockExchange class provides [<em>collective</em>](index.html#sec0) methods for rearranging data partitioned across a CUDA thread block.
- */
-
-#pragma once
-
-#include "../util_arch.cuh"
-#include "../util_macro.cuh"
-#include "../util_type.cuh"
-#include "../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-/**
- * \brief The BlockExchange class provides [<em>collective</em>](index.html#sec0) methods for rearranging data partitioned across a CUDA thread block. ![](transpose_logo.png)
- * \ingroup BlockModule
- *
- * \par Overview
- * It is commonplace for blocks of threads to rearrange data items between
- * threads. For example, the global memory subsystem prefers access patterns
- * where data items are "striped" across threads (where consecutive threads access consecutive items),
- * yet most block-wide operations prefer a "blocked" partitioning of items across threads
- * (where consecutive items belong to a single thread).
- *
- * \par
- * BlockExchange supports the following types of data exchanges:
- * - Transposing between [<em>blocked</em>](index.html#sec5sec4) and [<em>striped</em>](index.html#sec5sec4) arrangements
- * - Transposing between [<em>blocked</em>](index.html#sec5sec4) and [<em>warp-striped</em>](index.html#sec5sec4) arrangements
- * - Scattering ranked items to a [<em>blocked arrangement</em>](index.html#sec5sec4)
- * - Scattering ranked items to a [<em>striped arrangement</em>](index.html#sec5sec4)
- *
- * \tparam T The data type to be exchanged.
- * \tparam BLOCK_THREADS The thread block size in threads.
- * \tparam ITEMS_PER_THREAD The number of items partitioned onto each thread.
- * \tparam WARP_TIME_SLICING <b>[optional]</b> When \p true, only use enough shared memory for a single warp's worth of tile data, time-slicing the block-wide exchange over multiple synchronized rounds. Yields a smaller memory footprint at the expense of decreased parallelism. (Default: false)
- *
- * \par A Simple Example
- * \blockcollective{BlockExchange}
- * \par
- * The code snippet below illustrates the conversion from a "blocked" to a "striped" arrangement
- * of 512 integer items partitioned across 128 threads where each thread owns 4 items.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(int *d_data, ...)
- * {
- * // Specialize BlockExchange for 128 threads owning 4 integer items each
- * typedef cub::BlockExchange<int, 128, 4> BlockExchange;
- *
- * // Allocate shared memory for BlockExchange
- * __shared__ typename BlockExchange::TempStorage temp_storage;
- *
- * // Load a tile of data striped across threads
- * int thread_data[4];
- * cub::LoadStriped<LOAD_DEFAULT, 128>(threadIdx.x, d_data, thread_data);
- *
- * // Collectively exchange data into a blocked arrangement across threads
- * BlockExchange(temp_storage).StripedToBlocked(thread_data);
- *
- * \endcode
- * \par
- * Suppose the set of striped input \p thread_data across the block of threads is
- * <tt>{ [0,128,256,384], [1,129,257,385], ..., [127,255,383,511] }</tt>.
- * The corresponding output \p thread_data in those threads will be
- * <tt>{ [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }</tt>.
- *
- * \par Performance Considerations
- * - Proper device-specific padding ensures zero bank conflicts for most types.
- *
- */
-template <
- typename T,
- int BLOCK_THREADS,
- int ITEMS_PER_THREAD,
- bool WARP_TIME_SLICING = false>
-class BlockExchange
-{
-private:
-
- /******************************************************************************
- * Constants
- ******************************************************************************/
-
- enum
- {
- LOG_WARP_THREADS = PtxArchProps::LOG_WARP_THREADS,
- WARP_THREADS = 1 << LOG_WARP_THREADS,
- WARPS = (BLOCK_THREADS + PtxArchProps::WARP_THREADS - 1) / PtxArchProps::WARP_THREADS,
-
- LOG_SMEM_BANKS = PtxArchProps::LOG_SMEM_BANKS,
- SMEM_BANKS = 1 << LOG_SMEM_BANKS,
-
- TILE_ITEMS = BLOCK_THREADS * ITEMS_PER_THREAD,
-
- TIME_SLICES = (WARP_TIME_SLICING) ? WARPS : 1,
-
- TIME_SLICED_THREADS = (WARP_TIME_SLICING) ? CUB_MIN(BLOCK_THREADS, WARP_THREADS) : BLOCK_THREADS,
- TIME_SLICED_ITEMS = TIME_SLICED_THREADS * ITEMS_PER_THREAD,
-
- WARP_TIME_SLICED_THREADS = CUB_MIN(BLOCK_THREADS, WARP_THREADS),
- WARP_TIME_SLICED_ITEMS = WARP_TIME_SLICED_THREADS * ITEMS_PER_THREAD,
-
- // Insert padding if the number of items per thread is a power of two
- INSERT_PADDING = ((ITEMS_PER_THREAD & (ITEMS_PER_THREAD - 1)) == 0),
- PADDING_ITEMS = (INSERT_PADDING) ? (TIME_SLICED_ITEMS >> LOG_SMEM_BANKS) : 0,
- };
-
- /******************************************************************************
- * Type definitions
- ******************************************************************************/
-
- /// Shared memory storage layout type
- typedef T _TempStorage[TIME_SLICED_ITEMS + PADDING_ITEMS];
-
-public:
-
- /// \smemstorage{BlockExchange}
- struct TempStorage : Uninitialized<_TempStorage> {};
-
-private:
-
-
- /******************************************************************************
- * Thread fields
- ******************************************************************************/
-
- /// Shared storage reference
- _TempStorage &temp_storage;
-
- /// Linear thread-id
- int linear_tid;
- int warp_lane;
- int warp_id;
- int warp_offset;
-
-
- /******************************************************************************
- * Utility methods
- ******************************************************************************/
-
- /// Internal storage allocator
- __device__ __forceinline__ _TempStorage& PrivateStorage()
- {
- __shared__ _TempStorage private_storage;
- return private_storage;
- }
-
-
- /**
- * Transposes data items from <em>blocked</em> arrangement to <em>striped</em> arrangement. Specialized for no timeslicing.
- */
- __device__ __forceinline__ void BlockedToStriped(
- T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange, converting between <em>blocked</em> and <em>striped</em> arrangements.
- Int2Type<false> time_slicing)
- {
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- int item_offset = (linear_tid * ITEMS_PER_THREAD) + ITEM;
- if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS;
- temp_storage[item_offset] = items[ITEM];
- }
-
- __syncthreads();
-
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- int item_offset = int(ITEM * BLOCK_THREADS) + linear_tid;
- if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS;
- items[ITEM] = temp_storage[item_offset];
- }
- }
-
-
- /**
- * Transposes data items from <em>blocked</em> arrangement to <em>striped</em> arrangement. Specialized for warp-timeslicing.
- */
- __device__ __forceinline__ void BlockedToStriped(
- T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange, converting between <em>blocked</em> and <em>striped</em> arrangements.
- Int2Type<true> time_slicing)
- {
- T temp_items[ITEMS_PER_THREAD];
-
- #pragma unroll
- for (int SLICE = 0; SLICE < TIME_SLICES; SLICE++)
- {
- const int SLICE_OFFSET = SLICE * TIME_SLICED_ITEMS;
- const int SLICE_OOB = SLICE_OFFSET + TIME_SLICED_ITEMS;
-
- __syncthreads();
-
- if (warp_id == SLICE)
- {
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- int item_offset = (warp_lane * ITEMS_PER_THREAD) + ITEM;
- if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS;
- temp_storage[item_offset] = items[ITEM];
- }
- }
-
- __syncthreads();
-
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- // Read a strip of items
- const int STRIP_OFFSET = ITEM * BLOCK_THREADS;
- const int STRIP_OOB = STRIP_OFFSET + BLOCK_THREADS;
-
- if ((SLICE_OFFSET < STRIP_OOB) && (SLICE_OOB > STRIP_OFFSET))
- {
- int item_offset = STRIP_OFFSET + linear_tid - SLICE_OFFSET;
- if ((item_offset >= 0) && (item_offset < TIME_SLICED_ITEMS))
- {
- if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS;
- temp_items[ITEM] = temp_storage[item_offset];
- }
- }
- }
- }
-
- // Copy
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- items[ITEM] = temp_items[ITEM];
- }
- }
-
-
- /**
- * Transposes data items from <em>blocked</em> arrangement to <em>warp-striped</em> arrangement. Specialized for no timeslicing
- */
- __device__ __forceinline__ void BlockedToWarpStriped(
- T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange, converting between <em>blocked</em> and <em>warp-striped</em> arrangements.
- Int2Type<false> time_slicing)
- {
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- int item_offset = warp_offset + ITEM + (warp_lane * ITEMS_PER_THREAD);
- if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS;
- temp_storage[item_offset] = items[ITEM];
- }
-
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- int item_offset = warp_offset + (ITEM * WARP_TIME_SLICED_THREADS) + warp_lane;
- if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS;
- items[ITEM] = temp_storage[item_offset];
- }
- }
-
- /**
- * Transposes data items from <em>blocked</em> arrangement to <em>warp-striped</em> arrangement. Specialized for warp-timeslicing
- */
- __device__ __forceinline__ void BlockedToWarpStriped(
- T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange, converting between <em>blocked</em> and <em>warp-striped</em> arrangements.
- Int2Type<true> time_slicing)
- {
- #pragma unroll
- for (int SLICE = 0; SLICE < TIME_SLICES; ++SLICE)
- {
- __syncthreads();
-
- if (warp_id == SLICE)
- {
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- int item_offset = ITEM + (warp_lane * ITEMS_PER_THREAD);
- if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS;
- temp_storage[item_offset] = items[ITEM];
- }
-
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- int item_offset = (ITEM * WARP_TIME_SLICED_THREADS) + warp_lane;
- if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS;
- items[ITEM] = temp_storage[item_offset];
- }
- }
- }
- }
-
-
- /**
- * Transposes data items from <em>striped</em> arrangement to <em>blocked</em> arrangement. Specialized for no timeslicing.
- */
- __device__ __forceinline__ void StripedToBlocked(
- T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange, converting between <em>striped</em> and <em>blocked</em> arrangements.
- Int2Type<false> time_slicing)
- {
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- int item_offset = int(ITEM * BLOCK_THREADS) + linear_tid;
- if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS;
- temp_storage[item_offset] = items[ITEM];
- }
-
- __syncthreads();
-
- // No timeslicing
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- int item_offset = (linear_tid * ITEMS_PER_THREAD) + ITEM;
- if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS;
- items[ITEM] = temp_storage[item_offset];
- }
- }
-
-
- /**
- * Transposes data items from <em>striped</em> arrangement to <em>blocked</em> arrangement. Specialized for warp-timeslicing.
- */
- __device__ __forceinline__ void StripedToBlocked(
- T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange, converting between <em>striped</em> and <em>blocked</em> arrangements.
- Int2Type<true> time_slicing)
- {
- // Warp time-slicing
- T temp_items[ITEMS_PER_THREAD];
-
- #pragma unroll
- for (int SLICE = 0; SLICE < TIME_SLICES; SLICE++)
- {
- const int SLICE_OFFSET = SLICE * TIME_SLICED_ITEMS;
- const int SLICE_OOB = SLICE_OFFSET + TIME_SLICED_ITEMS;
-
- __syncthreads();
-
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- // Write a strip of items
- const int STRIP_OFFSET = ITEM * BLOCK_THREADS;
- const int STRIP_OOB = STRIP_OFFSET + BLOCK_THREADS;
-
- if ((SLICE_OFFSET < STRIP_OOB) && (SLICE_OOB > STRIP_OFFSET))
- {
- int item_offset = STRIP_OFFSET + linear_tid - SLICE_OFFSET;
- if ((item_offset >= 0) && (item_offset < TIME_SLICED_ITEMS))
- {
- if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS;
- temp_storage[item_offset] = items[ITEM];
- }
- }
- }
-
- __syncthreads();
-
- if (warp_id == SLICE)
- {
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- int item_offset = (warp_lane * ITEMS_PER_THREAD) + ITEM;
- if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS;
- temp_items[ITEM] = temp_storage[item_offset];
- }
- }
- }
-
- // Copy
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- items[ITEM] = temp_items[ITEM];
- }
- }
-
-
- /**
- * Transposes data items from <em>warp-striped</em> arrangement to <em>blocked</em> arrangement. Specialized for no timeslicing
- */
- __device__ __forceinline__ void WarpStripedToBlocked(
- T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange, converting between <em>warp-striped</em> and <em>blocked</em> arrangements.
- Int2Type<false> time_slicing)
- {
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- int item_offset = warp_offset + (ITEM * WARP_TIME_SLICED_THREADS) + warp_lane;
- if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS;
- temp_storage[item_offset] = items[ITEM];
- }
-
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- int item_offset = warp_offset + ITEM + (warp_lane * ITEMS_PER_THREAD);
- if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS;
- items[ITEM] = temp_storage[item_offset];
- }
- }
-
-
- /**
- * Transposes data items from <em>warp-striped</em> arrangement to <em>blocked</em> arrangement. Specialized for warp-timeslicing
- */
- __device__ __forceinline__ void WarpStripedToBlocked(
- T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange, converting between <em>warp-striped</em> and <em>blocked</em> arrangements.
- Int2Type<true> time_slicing)
- {
- #pragma unroll
- for (int SLICE = 0; SLICE < TIME_SLICES; ++SLICE)
- {
- __syncthreads();
-
- if (warp_id == SLICE)
- {
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- int item_offset = (ITEM * WARP_TIME_SLICED_THREADS) + warp_lane;
- if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS;
- temp_storage[item_offset] = items[ITEM];
- }
-
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- int item_offset = ITEM + (warp_lane * ITEMS_PER_THREAD);
- if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS;
- items[ITEM] = temp_storage[item_offset];
- }
- }
- }
- }
-
-
- /**
- * Exchanges data items annotated by rank into <em>blocked</em> arrangement. Specialized for no timeslicing.
- */
- __device__ __forceinline__ void ScatterToBlocked(
- T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange
- int ranks[ITEMS_PER_THREAD], ///< [in] Corresponding scatter ranks
- Int2Type<false> time_slicing)
- {
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- int item_offset = ranks[ITEM];
- if (INSERT_PADDING) item_offset = SHR_ADD(item_offset, LOG_SMEM_BANKS, item_offset);
- temp_storage[item_offset] = items[ITEM];
- }
-
- __syncthreads();
-
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- int item_offset = (linear_tid * ITEMS_PER_THREAD) + ITEM;
- if (INSERT_PADDING) item_offset = SHR_ADD(item_offset, LOG_SMEM_BANKS, item_offset);
- items[ITEM] = temp_storage[item_offset];
- }
- }
-
- /**
- * Exchanges data items annotated by rank into <em>blocked</em> arrangement. Specialized for warp-timeslicing.
- */
- __device__ __forceinline__ void ScatterToBlocked(
- T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange
- int ranks[ITEMS_PER_THREAD], ///< [in] Corresponding scatter ranks
- Int2Type<true> time_slicing)
- {
- T temp_items[ITEMS_PER_THREAD];
-
- #pragma unroll
- for (int SLICE = 0; SLICE < TIME_SLICES; SLICE++)
- {
- __syncthreads();
-
- const int SLICE_OFFSET = TIME_SLICED_ITEMS * SLICE;
-
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- int item_offset = ranks[ITEM] - SLICE_OFFSET;
- if ((item_offset >= 0) && (item_offset < WARP_TIME_SLICED_ITEMS))
- {
- if (INSERT_PADDING) item_offset = SHR_ADD(item_offset, LOG_SMEM_BANKS, item_offset);
- temp_storage[item_offset] = items[ITEM];
- }
- }
-
- __syncthreads();
-
- if (warp_id == SLICE)
- {
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- int item_offset = (warp_lane * ITEMS_PER_THREAD) + ITEM;
- if (INSERT_PADDING) item_offset = SHR_ADD(item_offset, LOG_SMEM_BANKS, item_offset);
- temp_items[ITEM] = temp_storage[item_offset];
- }
- }
- }
-
- // Copy
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- items[ITEM] = temp_items[ITEM];
- }
- }
-
-
- /**
- * Exchanges data items annotated by rank into <em>striped</em> arrangement. Specialized for no timeslicing.
- */
- __device__ __forceinline__ void ScatterToStriped(
- T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange
- int ranks[ITEMS_PER_THREAD], ///< [in] Corresponding scatter ranks
- Int2Type<false> time_slicing)
- {
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- int item_offset = ranks[ITEM];
- if (INSERT_PADDING) item_offset = SHR_ADD(item_offset, LOG_SMEM_BANKS, item_offset);
- temp_storage[item_offset] = items[ITEM];
- }
-
- __syncthreads();
-
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- int item_offset = int(ITEM * BLOCK_THREADS) + linear_tid;
- if (INSERT_PADDING) item_offset = SHR_ADD(item_offset, LOG_SMEM_BANKS, item_offset);
- items[ITEM] = temp_storage[item_offset];
- }
- }
-
-
- /**
- * Exchanges data items annotated by rank into <em>striped</em> arrangement. Specialized for warp-timeslicing.
- */
- __device__ __forceinline__ void ScatterToStriped(
- T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange
- int ranks[ITEMS_PER_THREAD], ///< [in] Corresponding scatter ranks
- Int2Type<true> time_slicing)
- {
- T temp_items[ITEMS_PER_THREAD];
-
- #pragma unroll
- for (int SLICE = 0; SLICE < TIME_SLICES; SLICE++)
- {
- const int SLICE_OFFSET = SLICE * TIME_SLICED_ITEMS;
- const int SLICE_OOB = SLICE_OFFSET + TIME_SLICED_ITEMS;
-
- __syncthreads();
-
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- int item_offset = ranks[ITEM] - SLICE_OFFSET;
- if ((item_offset >= 0) && (item_offset < WARP_TIME_SLICED_ITEMS))
- {
- if (INSERT_PADDING) item_offset = SHR_ADD(item_offset, LOG_SMEM_BANKS, item_offset);
- temp_storage[item_offset] = items[ITEM];
- }
- }
-
- __syncthreads();
-
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- // Read a strip of items
- const int STRIP_OFFSET = ITEM * BLOCK_THREADS;
- const int STRIP_OOB = STRIP_OFFSET + BLOCK_THREADS;
-
- if ((SLICE_OFFSET < STRIP_OOB) && (SLICE_OOB > STRIP_OFFSET))
- {
- int item_offset = STRIP_OFFSET + linear_tid - SLICE_OFFSET;
- if ((item_offset >= 0) && (item_offset < TIME_SLICED_ITEMS))
- {
- if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS;
- temp_items[ITEM] = temp_storage[item_offset];
- }
- }
- }
- }
-
- // Copy
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- items[ITEM] = temp_items[ITEM];
- }
- }
-
-
-public:
-
- /******************************************************************//**
- * \name Collective constructors
- *********************************************************************/
- //@{
-
- /**
- * \brief Collective constructor for 1D thread blocks using a private static allocation of shared memory as temporary storage. Threads are identified using <tt>threadIdx.x</tt>.
- */
- __device__ __forceinline__ BlockExchange()
- :
- temp_storage(PrivateStorage()),
- linear_tid(threadIdx.x),
- warp_lane(linear_tid & (WARP_THREADS - 1)),
- warp_id(linear_tid >> LOG_WARP_THREADS),
- warp_offset(warp_id * WARP_TIME_SLICED_ITEMS)
- {}
-
-
- /**
- * \brief Collective constructor for 1D thread blocks using the specified memory allocation as temporary storage. Threads are identified using <tt>threadIdx.x</tt>.
- */
- __device__ __forceinline__ BlockExchange(
- TempStorage &temp_storage) ///< [in] Reference to memory allocation having layout type TempStorage
- :
- temp_storage(temp_storage.Alias()),
- linear_tid(threadIdx.x),
- warp_lane(linear_tid & (WARP_THREADS - 1)),
- warp_id(linear_tid >> LOG_WARP_THREADS),
- warp_offset(warp_id * WARP_TIME_SLICED_ITEMS)
- {}
-
-
- /**
- * \brief Collective constructor using a private static allocation of shared memory as temporary storage. Each thread is identified using the supplied linear thread identifier
- */
- __device__ __forceinline__ BlockExchange(
- int linear_tid) ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + threadIdx.x</tt> for 2D thread blocks)
- :
- temp_storage(PrivateStorage()),
- linear_tid(linear_tid),
- warp_lane(linear_tid & (WARP_THREADS - 1)),
- warp_id(linear_tid >> LOG_WARP_THREADS),
- warp_offset(warp_id * WARP_TIME_SLICED_ITEMS)
- {}
-
-
- /**
- * \brief Collective constructor using the specified memory allocation as temporary storage. Each thread is identified using the supplied linear thread identifier.
- */
- __device__ __forceinline__ BlockExchange(
- TempStorage &temp_storage, ///< [in] Reference to memory allocation having layout type TempStorage
- int linear_tid) ///< [in] <b>[optional]</b> A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + threadIdx.x</tt> for 2D thread blocks)
- :
- temp_storage(temp_storage.Alias()),
- linear_tid(linear_tid),
- warp_lane(linear_tid & (WARP_THREADS - 1)),
- warp_id(linear_tid >> LOG_WARP_THREADS),
- warp_offset(warp_id * WARP_TIME_SLICED_ITEMS)
- {}
-
-
- //@} end member group
- /******************************************************************//**
- * \name Structured exchanges
- *********************************************************************/
- //@{
-
- /**
- * \brief Transposes data items from <em>striped</em> arrangement to <em>blocked</em> arrangement.
- *
- * \smemreuse
- *
- * The code snippet below illustrates the conversion from a "striped" to a "blocked" arrangement
- * of 512 integer items partitioned across 128 threads where each thread owns 4 items.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(int *d_data, ...)
- * {
- * // Specialize BlockExchange for 128 threads owning 4 integer items each
- * typedef cub::BlockExchange<int, 128, 4> BlockExchange;
- *
- * // Allocate shared memory for BlockExchange
- * __shared__ typename BlockExchange::TempStorage temp_storage;
- *
- * // Load a tile of ordered data into a striped arrangement across block threads
- * int thread_data[4];
- * cub::LoadStriped<LOAD_DEFAULT, 128>(threadIdx.x, d_data, thread_data);
- *
- * // Collectively exchange data into a blocked arrangement across threads
- * BlockExchange(temp_storage).StripedToBlocked(thread_data);
- *
- * \endcode
- * \par
- * Suppose the set of striped input \p thread_data across the block of threads is
- * <tt>{ [0,128,256,384], [1,129,257,385], ..., [127,255,383,511] }</tt> after loading from global memory.
- * The corresponding output \p thread_data in those threads will be
- * <tt>{ [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }</tt>.
- *
- */
- __device__ __forceinline__ void StripedToBlocked(
- T items[ITEMS_PER_THREAD]) ///< [in-out] Items to exchange, converting between <em>striped</em> and <em>blocked</em> arrangements.
- {
- StripedToBlocked(items, Int2Type<WARP_TIME_SLICING>());
- }
-
- /**
- * \brief Transposes data items from <em>blocked</em> arrangement to <em>striped</em> arrangement.
- *
- * \smemreuse
- *
- * The code snippet below illustrates the conversion from a "blocked" to a "striped" arrangement
- * of 512 integer items partitioned across 128 threads where each thread owns 4 items.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(int *d_data, ...)
- * {
- * // Specialize BlockExchange for 128 threads owning 4 integer items each
- * typedef cub::BlockExchange<int, 128, 4> BlockExchange;
- *
- * // Allocate shared memory for BlockExchange
- * __shared__ typename BlockExchange::TempStorage temp_storage;
- *
- * // Obtain a segment of consecutive items that are blocked across threads
- * int thread_data[4];
- * ...
- *
- * // Collectively exchange data into a striped arrangement across threads
- * BlockExchange(temp_storage).BlockedToStriped(thread_data);
- *
- * // Store data striped across block threads into an ordered tile
- * cub::StoreStriped<STORE_DEFAULT, 128>(threadIdx.x, d_data, thread_data);
- *
- * \endcode
- * \par
- * Suppose the set of blocked input \p thread_data across the block of threads is
- * <tt>{ [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }</tt>.
- * The corresponding output \p thread_data in those threads will be
- * <tt>{ [0,128,256,384], [1,129,257,385], ..., [127,255,383,511] }</tt> in
- * preparation for storing to global memory.
- *
- */
- __device__ __forceinline__ void BlockedToStriped(
- T items[ITEMS_PER_THREAD]) ///< [in-out] Items to exchange, converting between <em>blocked</em> and <em>striped</em> arrangements.
- {
- BlockedToStriped(items, Int2Type<WARP_TIME_SLICING>());
- }
-
-
- /**
- * \brief Transposes data items from <em>warp-striped</em> arrangement to <em>blocked</em> arrangement.
- *
- * \smemreuse
- *
- * The code snippet below illustrates the conversion from a "warp-striped" to a "blocked" arrangement
- * of 512 integer items partitioned across 128 threads where each thread owns 4 items.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(int *d_data, ...)
- * {
- * // Specialize BlockExchange for 128 threads owning 4 integer items each
- * typedef cub::BlockExchange<int, 128, 4> BlockExchange;
- *
- * // Allocate shared memory for BlockExchange
- * __shared__ typename BlockExchange::TempStorage temp_storage;
- *
- * // Load a tile of ordered data into a warp-striped arrangement across warp threads
- * int thread_data[4];
- *     cub::LoadWarpStriped<LOAD_DEFAULT>(threadIdx.x, d_data, thread_data);
- *
- * // Collectively exchange data into a blocked arrangement across threads
- * BlockExchange(temp_storage).WarpStripedToBlocked(thread_data);
- *
- * \endcode
- * \par
- * Suppose the set of warp-striped input \p thread_data across the block of threads is
- * <tt>{ [0,32,64,96], [1,33,65,97], [2,34,66,98], ..., [415,447,479,511] }</tt>
- * after loading from global memory. (The first 128 items are striped across
- * the first warp of 32 threads, the second 128 items are striped across the second warp, etc.)
- * The corresponding output \p thread_data in those threads will be
- * <tt>{ [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }</tt>.
- *
- */
- __device__ __forceinline__ void WarpStripedToBlocked(
- T items[ITEMS_PER_THREAD]) ///< [in-out] Items to exchange, converting between <em>warp-striped</em> and <em>blocked</em> arrangements.
- {
- WarpStripedToBlocked(items, Int2Type<WARP_TIME_SLICING>());
- }
-
- /**
- * \brief Transposes data items from <em>blocked</em> arrangement to <em>warp-striped</em> arrangement.
- *
- * \smemreuse
- *
- * The code snippet below illustrates the conversion from a "blocked" to a "warp-striped" arrangement
- * of 512 integer items partitioned across 128 threads where each thread owns 4 items.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(int *d_data, ...)
- * {
- * // Specialize BlockExchange for 128 threads owning 4 integer items each
- * typedef cub::BlockExchange<int, 128, 4> BlockExchange;
- *
- * // Allocate shared memory for BlockExchange
- * __shared__ typename BlockExchange::TempStorage temp_storage;
- *
- * // Obtain a segment of consecutive items that are blocked across threads
- * int thread_data[4];
- * ...
- *
- * // Collectively exchange data into a warp-striped arrangement across threads
- * BlockExchange(temp_storage).BlockedToWarpStriped(thread_data);
- *
- * // Store data striped across warp threads into an ordered tile
- *     cub::StoreWarpStriped<STORE_DEFAULT>(threadIdx.x, d_data, thread_data);
- *
- * \endcode
- * \par
- * Suppose the set of blocked input \p thread_data across the block of threads is
- * <tt>{ [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }</tt>.
- * The corresponding output \p thread_data in those threads will be
- * <tt>{ [0,32,64,96], [1,33,65,97], [2,34,66,98], ..., [415,447,479,511] }</tt>
- * in preparation for storing to global memory. (The first 128 items are striped across
- * the first warp of 32 threads, the second 128 items are striped across the second warp, etc.)
- *
- */
- __device__ __forceinline__ void BlockedToWarpStriped(
- T items[ITEMS_PER_THREAD]) ///< [in-out] Items to exchange, converting between <em>blocked</em> and <em>warp-striped</em> arrangements.
- {
- BlockedToWarpStriped(items, Int2Type<WARP_TIME_SLICING>());
- }
-
-
- //@} end member group
- /******************************************************************//**
- * \name Scatter exchanges
- *********************************************************************/
- //@{
-
-
- /**
- * \brief Exchanges data items annotated by rank into <em>blocked</em> arrangement.
- *
- * \smemreuse
- */
- __device__ __forceinline__ void ScatterToBlocked(
- T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange
- int ranks[ITEMS_PER_THREAD]) ///< [in] Corresponding scatter ranks
- {
- ScatterToBlocked(items, ranks, Int2Type<WARP_TIME_SLICING>());
- }
-
-
- /**
- * \brief Exchanges data items annotated by rank into <em>striped</em> arrangement.
- *
- * \smemreuse
- */
- __device__ __forceinline__ void ScatterToStriped(
- T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange
- int ranks[ITEMS_PER_THREAD]) ///< [in] Corresponding scatter ranks
- {
- ScatterToStriped(items, ranks, Int2Type<WARP_TIME_SLICING>());
- }
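Unlike the structured exchanges, the two scatter methods carry no snippet in this header, so the following is our own hedged sketch of the calling pattern (it assumes <tt>#include &lt;cub/cub.cuh&gt;</tt> as in the snippets above; the kernel name and the choice of ranks are illustrative). Each thread supplies one destination rank per item; with ranks equal to each item's offset within the 512-item tile, the scatter reproduces StripedToBlocked.

// Sketch only (not from this header): ScatterToBlocked driven by explicit per-item ranks.
__global__ void ScatterExampleKernel(int *d_data)
{
    // Specialize BlockExchange for 128 threads owning 4 integer items each
    typedef cub::BlockExchange<int, 128, 4> BlockExchange;
    __shared__ typename BlockExchange::TempStorage temp_storage;

    // Load a tile of data striped across threads
    int thread_data[4];
    int thread_ranks[4];
    cub::LoadStriped<cub::LOAD_DEFAULT, 128>(threadIdx.x, d_data, thread_data);

    // Rank of each striped item = its original offset within the tile
    #pragma unroll
    for (int ITEM = 0; ITEM < 4; ITEM++)
        thread_ranks[ITEM] = (ITEM * 128) + threadIdx.x;

    // Deposit every item at its rank, then read the tile back in blocked order
    BlockExchange(temp_storage).ScatterToBlocked(thread_data, thread_ranks);
}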
-
- //@} end member group
-
-
-};
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
diff --git a/lib/kokkos/TPL/cub/block/block_histogram.cuh b/lib/kokkos/TPL/cub/block/block_histogram.cuh
deleted file mode 100755
index dd346e395..000000000
--- a/lib/kokkos/TPL/cub/block/block_histogram.cuh
+++ /dev/null
@@ -1,414 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * The cub::BlockHistogram class provides [<em>collective</em>](index.html#sec0) methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block.
- */
-
-#pragma once
-
-#include "specializations/block_histogram_sort.cuh"
-#include "specializations/block_histogram_atomic.cuh"
-#include "../util_arch.cuh"
-#include "../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/******************************************************************************
- * Algorithmic variants
- ******************************************************************************/
-
-/**
- * \brief BlockHistogramAlgorithm enumerates alternative algorithms for the parallel construction of block-wide histograms.
- */
-enum BlockHistogramAlgorithm
-{
-
- /**
- * \par Overview
- * Sorting followed by differentiation. Execution consists of two phases:
- * -# Sort the data using efficient radix sort
- * -# Look for "runs" of same-valued keys by detecting discontinuities; the run-lengths are histogram bin counts.
- *
- * \par Performance Considerations
- * Delivers consistent throughput regardless of sample bin distribution.
- */
- BLOCK_HISTO_SORT,
-
-
- /**
- * \par Overview
- * Use atomic addition to update bin counts directly
- *
- * \par Performance Considerations
- * Performance is strongly tied to the hardware implementation of atomic
- * addition, and may be significantly degraded for non uniformly-random
- * input distributions where many concurrent updates are likely to be
- * made to the same bin counter.
- */
- BLOCK_HISTO_ATOMIC,
-};
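To make the two variants above concrete, here is a small host-side analogue (a sketch for illustration only, not part of CUB; the function names are ours, and every sample value is assumed to be a valid bin index). The sort-based strategy sorts the samples and converts run lengths at value discontinuities into bin counts; the atomic strategy simply increments each sample's bin counter, which on the GPU would be an atomicAdd on a shared-memory counter.

#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch of BLOCK_HISTO_SORT: sort, then turn run lengths at discontinuities into bin counts.
std::vector<unsigned int> histo_by_sort(std::vector<unsigned char> samples, int bins)
{
    std::vector<unsigned int> counts(bins, 0);
    std::sort(samples.begin(), samples.end());
    for (std::size_t i = 0; i < samples.size(); )
    {
        std::size_t run_start = i;
        while (i < samples.size() && samples[i] == samples[run_start]) ++i;
        counts[samples[run_start]] = static_cast<unsigned int>(i - run_start);  // run length == bin count
    }
    return counts;
}

// Sketch of BLOCK_HISTO_ATOMIC: every sample bumps its own bin counter directly.
std::vector<unsigned int> histo_by_atomic(const std::vector<unsigned char> &samples, int bins)
{
    std::vector<unsigned int> counts(bins, 0);
    for (unsigned char s : samples) ++counts[s];
    return counts;
}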
-
-
-
-/******************************************************************************
- * Block histogram
- ******************************************************************************/
-
-
-/**
- * \brief The BlockHistogram class provides [<em>collective</em>](index.html#sec0) methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block. ![](histogram_logo.png)
- * \ingroup BlockModule
- *
- * \par Overview
- * A <a href="http://en.wikipedia.org/wiki/Histogram"><em>histogram</em></a>
- * counts the number of observations that fall into each of the disjoint categories (known as <em>bins</em>).
- *
- * \par
- * Optionally, BlockHistogram can be specialized to use different algorithms:
- * -# <b>cub::BLOCK_HISTO_SORT</b>. Sorting followed by differentiation. [More...](\ref cub::BlockHistogramAlgorithm)
- * -# <b>cub::BLOCK_HISTO_ATOMIC</b>. Use atomic addition to update bin counts directly. [More...](\ref cub::BlockHistogramAlgorithm)
- *
- * \tparam T The sample type being histogrammed (must be castable to an integer bin identifier)
- * \tparam BLOCK_THREADS The thread block size in threads
- * \tparam ITEMS_PER_THREAD The number of items per thread
- * \tparam BINS The number of bins within the histogram
- * \tparam ALGORITHM <b>[optional]</b> cub::BlockHistogramAlgorithm enumerator specifying the underlying algorithm to use (default: cub::BLOCK_HISTO_SORT)
- *
- * \par A Simple Example
- * \blockcollective{BlockHistogram}
- * \par
- * The code snippet below illustrates a 256-bin histogram of 512 integer samples that
- * are partitioned across 128 threads where each thread owns 4 samples.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize a 256-bin BlockHistogram type for 128 threads having 4 character samples each
- * typedef cub::BlockHistogram<unsigned char, 128, 4, 256> BlockHistogram;
- *
- * // Allocate shared memory for BlockHistogram
- * __shared__ typename BlockHistogram::TempStorage temp_storage;
- *
- * // Allocate shared memory for block-wide histogram bin counts
- * __shared__ unsigned int smem_histogram[256];
- *
- * // Obtain input samples per thread
- * unsigned char data[4];
- * ...
- *
- * // Compute the block-wide histogram
- * BlockHistogram(temp_storage).Histogram(data, smem_histogram);
- *
- * \endcode
- *
- * \par Performance and Usage Considerations
- * - The histogram output can be constructed in shared or global memory
- * - See cub::BlockHistogramAlgorithm for performance details regarding algorithmic alternatives
- *
- */
-template <
- typename T,
- int BLOCK_THREADS,
- int ITEMS_PER_THREAD,
- int BINS,
- BlockHistogramAlgorithm ALGORITHM = BLOCK_HISTO_SORT>
-class BlockHistogram
-{
-private:
-
- /******************************************************************************
- * Constants and type definitions
- ******************************************************************************/
-
- /**
- * Ensure the template parameterization meets the requirements of the
- * targeted device architecture. BLOCK_HISTO_ATOMIC can only be used
- * on architecture SM120 (compute capability 1.2) or later. Otherwise BLOCK_HISTO_SORT is used
- * regardless.
- */
- static const BlockHistogramAlgorithm SAFE_ALGORITHM =
- ((ALGORITHM == BLOCK_HISTO_ATOMIC) && (CUB_PTX_ARCH < 120)) ?
- BLOCK_HISTO_SORT :
- ALGORITHM;
-
- /// Internal specialization.
- typedef typename If<(SAFE_ALGORITHM == BLOCK_HISTO_SORT),
- BlockHistogramSort<T, BLOCK_THREADS, ITEMS_PER_THREAD, BINS>,
- BlockHistogramAtomic<T, BLOCK_THREADS, ITEMS_PER_THREAD, BINS> >::Type InternalBlockHistogram;
-
- /// Shared memory storage layout type for BlockHistogram
- typedef typename InternalBlockHistogram::TempStorage _TempStorage;
-
-
- /******************************************************************************
- * Thread fields
- ******************************************************************************/
-
- /// Shared storage reference
- _TempStorage &temp_storage;
-
- /// Linear thread-id
- int linear_tid;
-
-
- /******************************************************************************
- * Utility methods
- ******************************************************************************/
-
- /// Internal storage allocator
- __device__ __forceinline__ _TempStorage& PrivateStorage()
- {
- __shared__ _TempStorage private_storage;
- return private_storage;
- }
-
-
-public:
-
- /// \smemstorage{BlockHistogram}
- struct TempStorage : Uninitialized<_TempStorage> {};
-
-
- /******************************************************************//**
- * \name Collective constructors
- *********************************************************************/
- //@{
-
- /**
- * \brief Collective constructor for 1D thread blocks using a private static allocation of shared memory as temporary storage. Threads are identified using <tt>threadIdx.x</tt>.
- */
- __device__ __forceinline__ BlockHistogram()
- :
- temp_storage(PrivateStorage()),
- linear_tid(threadIdx.x)
- {}
-
-
- /**
- * \brief Collective constructor for 1D thread blocks using the specified memory allocation as temporary storage. Threads are identified using <tt>threadIdx.x</tt>.
- */
- __device__ __forceinline__ BlockHistogram(
- TempStorage &temp_storage) ///< [in] Reference to memory allocation having layout type TempStorage
- :
- temp_storage(temp_storage.Alias()),
- linear_tid(threadIdx.x)
- {}
-
-
- /**
- * \brief Collective constructor using a private static allocation of shared memory as temporary storage. Each thread is identified using the supplied linear thread identifier
- */
- __device__ __forceinline__ BlockHistogram(
- int linear_tid) ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + threadIdx.x</tt> for 2D thread blocks)
- :
- temp_storage(PrivateStorage()),
- linear_tid(linear_tid)
- {}
-
-
- /**
- * \brief Collective constructor using the specified memory allocation as temporary storage. Each thread is identified using the supplied linear thread identifier.
- */
- __device__ __forceinline__ BlockHistogram(
- TempStorage &temp_storage, ///< [in] Reference to memory allocation having layout type TempStorage
- int linear_tid) ///< [in] <b>[optional]</b> A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + threadIdx.x</tt> for 2D thread blocks)
- :
- temp_storage(temp_storage.Alias()),
- linear_tid(linear_tid)
- {}
-
-
- //@} end member group
- /******************************************************************//**
- * \name Histogram operations
- *********************************************************************/
- //@{
-
-
- /**
- * \brief Initialize the shared histogram counters to zero.
- *
- * The code snippet below illustrates the initialization and update of a
- * histogram of 512 integer samples that are partitioned across 128 threads
- * where each thread owns 4 samples.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize a 256-bin BlockHistogram type for 128 threads having 4 character samples each
- * typedef cub::BlockHistogram<unsigned char, 128, 4, 256> BlockHistogram;
- *
- * // Allocate shared memory for BlockHistogram
- * __shared__ typename BlockHistogram::TempStorage temp_storage;
- *
- * // Allocate shared memory for block-wide histogram bin counts
- * __shared__ unsigned int smem_histogram[256];
- *
- * // Obtain input samples per thread
- * unsigned char thread_samples[4];
- * ...
- *
- * // Initialize the block-wide histogram
- * BlockHistogram(temp_storage).InitHistogram(smem_histogram);
- *
- * // Update the block-wide histogram
- * BlockHistogram(temp_storage).Composite(thread_samples, smem_histogram);
- *
- * \endcode
- *
- * \tparam HistoCounter <b>[inferred]</b> Histogram counter type
- */
- template <typename HistoCounter>
- __device__ __forceinline__ void InitHistogram(HistoCounter histogram[BINS])
- {
- // Initialize histogram bin counts to zeros
- int histo_offset = 0;
-
- #pragma unroll
- for(; histo_offset + BLOCK_THREADS <= BINS; histo_offset += BLOCK_THREADS)
- {
- histogram[histo_offset + linear_tid] = 0;
- }
- // Finish up with guarded initialization if necessary
- if ((BINS % BLOCK_THREADS != 0) && (histo_offset + linear_tid < BINS))
- {
- histogram[histo_offset + linear_tid] = 0;
- }
- }
-
-
- /**
- * \brief Constructs a block-wide histogram in shared/global memory. Each thread contributes an array of input elements.
- *
- * \smemreuse
- *
- * The code snippet below illustrates a 256-bin histogram of 512 integer samples that
- * are partitioned across 128 threads where each thread owns 4 samples.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize a 256-bin BlockHistogram type for 128 threads having 4 character samples each
- * typedef cub::BlockHistogram<unsigned char, 128, 4, 256> BlockHistogram;
- *
- * // Allocate shared memory for BlockHistogram
- * __shared__ typename BlockHistogram::TempStorage temp_storage;
- *
- * // Allocate shared memory for block-wide histogram bin counts
- * __shared__ unsigned int smem_histogram[256];
- *
- * // Obtain input samples per thread
- * unsigned char thread_samples[4];
- * ...
- *
- * // Compute the block-wide histogram
- * BlockHistogram(temp_storage).Histogram(thread_samples, smem_histogram);
- *
- * \endcode
- *
- * \tparam HistoCounter <b>[inferred]</b> Histogram counter type
- */
- template <
- typename HistoCounter>
- __device__ __forceinline__ void Histogram(
- T (&items)[ITEMS_PER_THREAD], ///< [in] Calling thread's input values to histogram
- HistoCounter histogram[BINS]) ///< [out] Reference to shared/global memory histogram
- {
- // Initialize histogram bin counts to zeros
- InitHistogram(histogram);
-
- // Composite the histogram
- InternalBlockHistogram(temp_storage, linear_tid).Composite(items, histogram);
- }
-
-
-
- /**
- * \brief Updates an existing block-wide histogram in shared/global memory. Each thread composites an array of input elements.
- *
- * \smemreuse
- *
- * The code snippet below illustrates the initialization and update of a
- * histogram of 512 integer samples that are partitioned across 128 threads
- * where each thread owns 4 samples.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize a 256-bin BlockHistogram type for 128 threads having 4 character samples each
- * typedef cub::BlockHistogram<unsigned char, 128, 4, 256> BlockHistogram;
- *
- * // Allocate shared memory for BlockHistogram
- * __shared__ typename BlockHistogram::TempStorage temp_storage;
- *
- * // Allocate shared memory for block-wide histogram bin counts
- * __shared__ unsigned int smem_histogram[256];
- *
- * // Obtain input samples per thread
- * unsigned char thread_samples[4];
- * ...
- *
- * // Initialize the block-wide histogram
- * BlockHistogram(temp_storage).InitHistogram(smem_histogram);
- *
- * // Update the block-wide histogram
- * BlockHistogram(temp_storage).Composite(thread_samples, smem_histogram);
- *
- * \endcode
- *
- * \tparam HistoCounter <b>[inferred]</b> Histogram counter type
- */
- template <
- typename HistoCounter>
- __device__ __forceinline__ void Composite(
- T (&items)[ITEMS_PER_THREAD], ///< [in] Calling thread's input values to histogram
- HistoCounter histogram[BINS]) ///< [out] Reference to shared/global memory histogram
- {
- InternalBlockHistogram(temp_storage, linear_tid).Composite(items, histogram);
- }
-
-};
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
diff --git a/lib/kokkos/TPL/cub/block/block_load.cuh b/lib/kokkos/TPL/cub/block/block_load.cuh
deleted file mode 100755
index e645bcdce..000000000
--- a/lib/kokkos/TPL/cub/block/block_load.cuh
+++ /dev/null
@@ -1,1122 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * Operations for reading linear tiles of data into the CUDA thread block.
- */
-
-#pragma once
-
-#include <iterator>
-
-#include "../util_namespace.cuh"
-#include "../util_macro.cuh"
-#include "../util_type.cuh"
-#include "../util_vector.cuh"
-#include "../thread/thread_load.cuh"
-#include "block_exchange.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-/**
- * \addtogroup IoModule
- * @{
- */
-
-
-/******************************************************************//**
- * \name Blocked I/O
- *********************************************************************/
-//@{
-
-
-/**
- * \brief Load a linear segment of items into a blocked arrangement across the thread block using the specified cache modifier.
- *
- * \blocked
- *
- * \tparam MODIFIER cub::PtxLoadModifier cache modifier.
- * \tparam T <b>[inferred]</b> The data type to load.
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- * \tparam InputIteratorRA <b>[inferred]</b> The random-access iterator type for input (may be a simple pointer type).
- */
-template <
- PtxLoadModifier MODIFIER,
- typename T,
- int ITEMS_PER_THREAD,
- typename InputIteratorRA>
-__device__ __forceinline__ void LoadBlocked(
- int linear_tid, ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + threadIdx.x</tt> for 2D thread blocks)
- InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from
- T (&items)[ITEMS_PER_THREAD]) ///< [out] Data to load
-{
- // Load directly in thread-blocked order
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- items[ITEM] = ThreadLoad<MODIFIER>(block_itr + (linear_tid * ITEMS_PER_THREAD) + ITEM);
- }
-}
-
-
-/**
- * \brief Load a linear segment of items into a blocked arrangement across the thread block using the specified cache modifier, guarded by range.
- *
- * \blocked
- *
- * \tparam MODIFIER cub::PtxLoadModifier cache modifier.
- * \tparam T <b>[inferred]</b> The data type to load.
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- * \tparam InputIteratorRA <b>[inferred]</b> The random-access iterator type for input (may be a simple pointer type).
- */
-template <
- PtxLoadModifier MODIFIER,
- typename T,
- int ITEMS_PER_THREAD,
- typename InputIteratorRA>
-__device__ __forceinline__ void LoadBlocked(
- int linear_tid, ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + threadIdx.x</tt> for 2D thread blocks)
- InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from
- T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load
- int valid_items) ///< [in] Number of valid items to load
-{
- int bounds = valid_items - (linear_tid * ITEMS_PER_THREAD);
-
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- if (ITEM < bounds)
- {
- items[ITEM] = ThreadLoad<MODIFIER>(block_itr + (linear_tid * ITEMS_PER_THREAD) + ITEM);
- }
- }
-}
-
-
-/**
- * \brief Load a linear segment of items into a blocked arrangement across the thread block using the specified cache modifier, guarded by range, with a fall-back assignment of out-of-bound elements.
- *
- * \blocked
- *
- * \tparam MODIFIER cub::PtxLoadModifier cache modifier.
- * \tparam T <b>[inferred]</b> The data type to load.
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- * \tparam InputIteratorRA <b>[inferred]</b> The random-access iterator type for input (may be a simple pointer type).
- */
-template <
- PtxLoadModifier MODIFIER,
- typename T,
- int ITEMS_PER_THREAD,
- typename InputIteratorRA>
-__device__ __forceinline__ void LoadBlocked(
- int linear_tid, ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + threadIdx.x</tt> for 2D thread blocks)
- InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from
- T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load
- int valid_items, ///< [in] Number of valid items to load
- T oob_default) ///< [in] Default value to assign out-of-bound items
-{
- int bounds = valid_items - (linear_tid * ITEMS_PER_THREAD);
-
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- items[ITEM] = (ITEM < bounds) ?
- ThreadLoad<MODIFIER>(block_itr + (linear_tid * ITEMS_PER_THREAD) + ITEM) :
- oob_default;
- }
-}
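The two guarded overloads above exist for partial tiles at the end of an input. The sketch below is ours, not from this header (it assumes <tt>#include &lt;cub/cub.cuh&gt;</tt>; the kernel name and the tile geometry of 128 threads x 4 items are illustrative): full tiles take the unguarded path, and the final tile passes valid_items plus a default so out-of-range elements are well defined.

// Sketch: guarded blocked load of the final, possibly partial tile.
__global__ void GuardedLoadKernel(int *d_in, int num_items)
{
    const int TILE_ITEMS  = 128 * 4;                      // 128 threads, 4 items each
    int       tile_offset = blockIdx.x * TILE_ITEMS;
    int       valid_items = num_items - tile_offset;      // < TILE_ITEMS only for the last tile

    int thread_data[4];
    if (valid_items >= TILE_ITEMS)
        cub::LoadBlocked<cub::LOAD_DEFAULT>(threadIdx.x, d_in + tile_offset, thread_data);
    else
        cub::LoadBlocked<cub::LOAD_DEFAULT>(threadIdx.x, d_in + tile_offset, thread_data,
                                            valid_items, 0);   // out-of-range items default to 0
}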
-
-
-
-//@} end member group
-/******************************************************************//**
- * \name Striped I/O
- *********************************************************************/
-//@{
-
-
-/**
- * \brief Load a linear segment of items into a striped arrangement across the thread block using the specified cache modifier.
- *
- * \striped
- *
- * \tparam MODIFIER cub::PtxLoadModifier cache modifier.
- * \tparam BLOCK_THREADS The thread block size in threads
- * \tparam T <b>[inferred]</b> The data type to load.
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- * \tparam InputIteratorRA <b>[inferred]</b> The random-access iterator type for input (may be a simple pointer type).
- */
-template <
- PtxLoadModifier MODIFIER,
- int BLOCK_THREADS,
- typename T,
- int ITEMS_PER_THREAD,
- typename InputIteratorRA>
-__device__ __forceinline__ void LoadStriped(
- int linear_tid, ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + threadIdx.x</tt> for 2D thread blocks)
- InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from
- T (&items)[ITEMS_PER_THREAD]) ///< [out] Data to load
-{
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- items[ITEM] = ThreadLoad<MODIFIER>(block_itr + (ITEM * BLOCK_THREADS) + linear_tid);
- }
-}
-
-
-/**
- * \brief Load a linear segment of items into a striped arrangement across the thread block using the specified cache modifier, guarded by range
- *
- * \striped
- *
- * \tparam MODIFIER cub::PtxLoadModifier cache modifier.
- * \tparam BLOCK_THREADS The thread block size in threads
- * \tparam T <b>[inferred]</b> The data type to load.
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- * \tparam InputIteratorRA <b>[inferred]</b> The random-access iterator type for input (may be a simple pointer type).
- */
-template <
- PtxLoadModifier MODIFIER,
- int BLOCK_THREADS,
- typename T,
- int ITEMS_PER_THREAD,
- typename InputIteratorRA>
-__device__ __forceinline__ void LoadStriped(
- int linear_tid, ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + threadIdx.x</tt> for 2D thread blocks)
- InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from
- T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load
- int valid_items) ///< [in] Number of valid items to load
-{
- int bounds = valid_items - linear_tid;
-
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- if (ITEM * BLOCK_THREADS < bounds)
- {
- items[ITEM] = ThreadLoad<MODIFIER>(block_itr + linear_tid + (ITEM * BLOCK_THREADS));
- }
- }
-}
-
-
-/**
- * \brief Load a linear segment of items into a striped arrangement across the thread block using the specified cache modifier, guarded by range, with a fall-back assignment of out-of-bound elements.
- *
- * \striped
- *
- * \tparam MODIFIER cub::PtxLoadModifier cache modifier.
- * \tparam BLOCK_THREADS The thread block size in threads
- * \tparam T <b>[inferred]</b> The data type to load.
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- * \tparam InputIteratorRA <b>[inferred]</b> The random-access iterator type for input (may be a simple pointer type).
- */
-template <
- PtxLoadModifier MODIFIER,
- int BLOCK_THREADS,
- typename T,
- int ITEMS_PER_THREAD,
- typename InputIteratorRA>
-__device__ __forceinline__ void LoadStriped(
- int linear_tid, ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + threadIdx.x</tt> for 2D thread blocks)
- InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from
- T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load
- int valid_items, ///< [in] Number of valid items to load
- T oob_default) ///< [in] Default value to assign out-of-bound items
-{
- int bounds = valid_items - linear_tid;
-
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- items[ITEM] = (ITEM * BLOCK_THREADS < bounds) ?
- ThreadLoad<MODIFIER>(block_itr + linear_tid + (ITEM * BLOCK_THREADS)) :
- oob_default;
- }
-}
-
-
-
-//@} end member group
-/******************************************************************//**
- * \name Warp-striped I/O
- *********************************************************************/
-//@{
-
-
-/**
- * \brief Load a linear segment of items into a warp-striped arrangement across the thread block using the specified cache modifier.
- *
- * \warpstriped
- *
- * \par Usage Considerations
- * The number of threads in the thread block must be a multiple of the architecture's warp size.
- *
- * \tparam MODIFIER cub::PtxLoadModifier cache modifier.
- * \tparam T <b>[inferred]</b> The data type to load.
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- * \tparam InputIteratorRA <b>[inferred]</b> The random-access iterator type for input (may be a simple pointer type).
- */
-template <
- PtxLoadModifier MODIFIER,
- typename T,
- int ITEMS_PER_THREAD,
- typename InputIteratorRA>
-__device__ __forceinline__ void LoadWarpStriped(
- int linear_tid, ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + threadIdx.x</tt> for 2D thread blocks)
- InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from
- T (&items)[ITEMS_PER_THREAD]) ///< [out] Data to load
-{
- int tid = linear_tid & (PtxArchProps::WARP_THREADS - 1);
- int wid = linear_tid >> PtxArchProps::LOG_WARP_THREADS;
- int warp_offset = wid * PtxArchProps::WARP_THREADS * ITEMS_PER_THREAD;
-
- // Load directly in warp-striped order
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- items[ITEM] = ThreadLoad<MODIFIER>(block_itr + warp_offset + tid + (ITEM * PtxArchProps::WARP_THREADS));
- }
-}
-
-
-/**
- * \brief Load a linear segment of items into a warp-striped arrangement across the thread block using the specified cache modifier, guarded by range
- *
- * \warpstriped
- *
- * \par Usage Considerations
- * The number of threads in the thread block must be a multiple of the architecture's warp size.
- *
- * \tparam MODIFIER cub::PtxLoadModifier cache modifier.
- * \tparam T <b>[inferred]</b> The data type to load.
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- * \tparam InputIteratorRA <b>[inferred]</b> The random-access iterator type for input (may be a simple pointer type).
- */
-template <
- PtxLoadModifier MODIFIER,
- typename T,
- int ITEMS_PER_THREAD,
- typename InputIteratorRA>
-__device__ __forceinline__ void LoadWarpStriped(
- int linear_tid, ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + threadIdx.x</tt> for 2D thread blocks)
- InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from
- T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load
- int valid_items) ///< [in] Number of valid items to load
-{
- int tid = linear_tid & (PtxArchProps::WARP_THREADS - 1);
- int wid = linear_tid >> PtxArchProps::LOG_WARP_THREADS;
- int warp_offset = wid * PtxArchProps::WARP_THREADS * ITEMS_PER_THREAD;
- int bounds = valid_items - warp_offset - tid;
-
- // Load directly in warp-striped order
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- if ((ITEM * PtxArchProps::WARP_THREADS) < bounds)
- {
- items[ITEM] = ThreadLoad<MODIFIER>(block_itr + warp_offset + tid + (ITEM * PtxArchProps::WARP_THREADS));
- }
- }
-}
-
-
-/**
- * \brief Load a linear segment of items into a warp-striped arrangement across the thread block using the specified cache modifier, guarded by range, with a fall-back assignment of out-of-bound elements.
- *
- * \warpstriped
- *
- * \par Usage Considerations
- * The number of threads in the thread block must be a multiple of the architecture's warp size.
- *
- * \tparam MODIFIER cub::PtxLoadModifier cache modifier.
- * \tparam T <b>[inferred]</b> The data type to load.
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- * \tparam InputIteratorRA <b>[inferred]</b> The random-access iterator type for input (may be a simple pointer type).
- */
-template <
- PtxLoadModifier MODIFIER,
- typename T,
- int ITEMS_PER_THREAD,
- typename InputIteratorRA>
-__device__ __forceinline__ void LoadWarpStriped(
- int linear_tid, ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + threadIdx.x</tt> for 2D thread blocks)
- InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from
- T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load
- int valid_items, ///< [in] Number of valid items to load
- T oob_default) ///< [in] Default value to assign out-of-bound items
-{
- int tid = linear_tid & (PtxArchProps::WARP_THREADS - 1);
- int wid = linear_tid >> PtxArchProps::LOG_WARP_THREADS;
- int warp_offset = wid * PtxArchProps::WARP_THREADS * ITEMS_PER_THREAD;
- int bounds = valid_items - warp_offset - tid;
-
- // Load directly in warp-striped order
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- items[ITEM] = ((ITEM * PtxArchProps::WARP_THREADS) < bounds) ?
- ThreadLoad<MODIFIER>(block_itr + warp_offset + tid + (ITEM * PtxArchProps::WARP_THREADS)) :
- oob_default;
- }
-}
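A rough usage sketch of the guarded warp-striped loaders above (the kernel name is illustrative, not part of this file); it assumes a 1D thread block whose size is a multiple of the warp size, as required by these overloads:

// Sketch only: ExampleGuardedLoadKernel is a made-up name; <cub/cub.cuh> is assumed included
template <int ITEMS_PER_THREAD>
__global__ void ExampleGuardedLoadKernel(int *d_in, int num_valid)
{
    // Assumes the launch uses a 1D block whose size is a multiple of the warp size
    int items[ITEMS_PER_THREAD];

    // Guarded warp-striped load; out-of-bounds slots receive -1
    cub::LoadWarpStriped<cub::LOAD_DEFAULT>(
        threadIdx.x,        // linear thread id for a 1D block
        d_in,               // the block's base input pointer
        items,              // per-thread fragment (warp-striped order)
        num_valid,          // number of valid items in this tile
        -1);                // fall-back value assigned to out-of-bounds items
}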
-
-
-
-//@} end member group
-/******************************************************************//**
- * \name Blocked, vectorized I/O
- *********************************************************************/
-//@{
-
-/**
- * \brief Load a linear segment of items into a blocked arrangement across the thread block using the specified cache modifier.
- *
- * \blocked
- *
- * The input pointer \p block_ptr must be quad-item aligned
- *
- * The following conditions will prevent vectorization and loading will fall back to cub::BLOCK_LOAD_DIRECT:
- * - \p ITEMS_PER_THREAD is odd
- * - The data type \p T is not a built-in primitive or CUDA vector type (e.g., \p short, \p int2, \p double, \p float2, etc.)
- *
- * \tparam MODIFIER cub::PtxLoadModifier cache modifier.
- * \tparam T <b>[inferred]</b> The data type to load.
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- */
-template <
- PtxLoadModifier MODIFIER,
- typename T,
- int ITEMS_PER_THREAD>
-__device__ __forceinline__ void LoadBlockedVectorized(
-    int             linear_tid,                 ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + threadIdx.x</tt> for 2D thread blocks)
- T *block_ptr, ///< [in] Input pointer for loading from
- T (&items)[ITEMS_PER_THREAD]) ///< [out] Data to load
-{
- enum
- {
- // Maximum CUDA vector size is 4 elements
- MAX_VEC_SIZE = CUB_MIN(4, ITEMS_PER_THREAD),
-
- // Vector size must be a power of two and an even divisor of the items per thread
- VEC_SIZE = ((((MAX_VEC_SIZE - 1) & MAX_VEC_SIZE) == 0) && ((ITEMS_PER_THREAD % MAX_VEC_SIZE) == 0)) ?
- MAX_VEC_SIZE :
- 1,
-
- VECTORS_PER_THREAD = ITEMS_PER_THREAD / VEC_SIZE,
- };
-
- // Vector type
- typedef typename VectorHelper<T, VEC_SIZE>::Type Vector;
-
- // Alias local data (use raw_items array here which should get optimized away to prevent conservative PTXAS lmem spilling)
- T raw_items[ITEMS_PER_THREAD];
-
- // Direct-load using vector types
- LoadBlocked<MODIFIER>(
- linear_tid,
- reinterpret_cast<Vector *>(block_ptr),
- reinterpret_cast<Vector (&)[VECTORS_PER_THREAD]>(raw_items));
-
- // Copy
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- items[ITEM] = raw_items[ITEM];
- }
-}
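A minimal sketch of calling LoadBlockedVectorized (illustrative kernel name); it assumes the input pointer is quad-item aligned and ITEMS_PER_THREAD is a multiple of the vector width, so vectorization is not skipped:

// Sketch only: 4 ints per thread, so v4 vector loads can be emitted when d_in is 16-byte aligned
__global__ void ExampleVectorizedLoadKernel(int *d_in)
{
    int items[4];
    cub::LoadBlockedVectorized<cub::LOAD_DEFAULT>(threadIdx.x, d_in, items);
    // Thread i now holds d_in[4*i + 0 .. 4*i + 3] in items[]
}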
-
-
-//@} end member group
-
-/** @} */ // end group IoModule
-
-
-
-//-----------------------------------------------------------------------------
-// Generic BlockLoad abstraction
-//-----------------------------------------------------------------------------
-
-/**
- * \brief cub::BlockLoadAlgorithm enumerates alternative algorithms for cub::BlockLoad to read a linear segment of data from memory into a blocked arrangement across a CUDA thread block.
- */
-enum BlockLoadAlgorithm
-{
- /**
- * \par Overview
- *
- * A [<em>blocked arrangement</em>](index.html#sec5sec4) of data is read
- * directly from memory. The thread block reads items in a parallel "raking" fashion: thread<sub><em>i</em></sub>
- * reads the <em>i</em><sup>th</sup> segment of consecutive elements.
- *
- * \par Performance Considerations
- * - The utilization of memory transactions (coalescing) decreases as the
- *   access stride between threads increases (i.e., the number of items per thread).
- */
- BLOCK_LOAD_DIRECT,
-
- /**
- * \par Overview
- *
- * A [<em>blocked arrangement</em>](index.html#sec5sec4) of data is read directly
- * from memory using CUDA's built-in vectorized loads as a coalescing optimization.
- * The thread block reads items in a parallel "raking" fashion: thread<sub><em>i</em></sub> uses vector loads to
- * read the <em>i</em><sup>th</sup> segment of consecutive elements.
- *
- * For example, <tt>ld.global.v4.s32</tt> instructions will be generated when \p T = \p int and \p ITEMS_PER_THREAD is a multiple of 4.
- *
- * \par Performance Considerations
- * - The utilization of memory transactions (coalescing) remains high until the
- *   access stride between threads (i.e., the number of items per thread) exceeds the
- * maximum vector load width (typically 4 items or 64B, whichever is lower).
- * - The following conditions will prevent vectorization and loading will fall back to cub::BLOCK_LOAD_DIRECT:
- * - \p ITEMS_PER_THREAD is odd
- * - The \p InputIteratorRA is not a simple pointer type
- * - The block input offset is not quadword-aligned
- * - The data type \p T is not a built-in primitive or CUDA vector type (e.g., \p short, \p int2, \p double, \p float2, etc.)
- */
- BLOCK_LOAD_VECTORIZE,
-
- /**
- * \par Overview
- *
- * A [<em>striped arrangement</em>](index.html#sec5sec4) of data is read
- * directly from memory and then is locally transposed into a
- * [<em>blocked arrangement</em>](index.html#sec5sec4). The thread block
- * reads items in a parallel "strip-mining" fashion:
- * thread<sub><em>i</em></sub> reads items having stride \p BLOCK_THREADS
- * between them. cub::BlockExchange is then used to locally reorder the items
- * into a [<em>blocked arrangement</em>](index.html#sec5sec4).
- *
- * \par Performance Considerations
- * - The utilization of memory transactions (coalescing) remains high regardless
- * of items loaded per thread.
- * - The local reordering incurs slightly higher latency and lower throughput than the
- *   direct cub::BLOCK_LOAD_DIRECT and cub::BLOCK_LOAD_VECTORIZE alternatives.
- */
- BLOCK_LOAD_TRANSPOSE,
-
-
- /**
- * \par Overview
- *
- * A [<em>warp-striped arrangement</em>](index.html#sec5sec4) of data is read
- * directly from memory and then is locally transposed into a
- * [<em>blocked arrangement</em>](index.html#sec5sec4). Each warp reads its own
- * contiguous segment in a parallel "strip-mining" fashion: lane<sub><em>i</em></sub>
- * reads items having stride \p WARP_THREADS between them. cub::BlockExchange
- * is then used to locally reorder the items into a
- * [<em>blocked arrangement</em>](index.html#sec5sec4).
- *
- * \par Usage Considerations
- * - BLOCK_THREADS must be a multiple of WARP_THREADS
- *
- * \par Performance Considerations
- * - The utilization of memory transactions (coalescing) remains high regardless
- * of items loaded per thread.
- * - The local reordering incurs slightly higher latency and lower throughput than the
- *   direct cub::BLOCK_LOAD_DIRECT and cub::BLOCK_LOAD_VECTORIZE alternatives.
- */
- BLOCK_LOAD_WARP_TRANSPOSE,
-};
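These strategies are selected through the ALGORITHM template parameter of the BlockLoad class defined below; a brief sketch of two alternative specializations (the typedef names are illustrative):

// Sketch: same block shape (128 threads x 4 ints), different load strategies
typedef cub::BlockLoad<int*, 128, 4, cub::BLOCK_LOAD_DIRECT>         DirectLoad;      // strided per-thread reads
typedef cub::BlockLoad<int*, 128, 4, cub::BLOCK_LOAD_WARP_TRANSPOSE> TransposedLoad;  // coalesced warp-striped reads, then shared-memory reorder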
-
-
-/**
- * \brief The BlockLoad class provides [<em>collective</em>](index.html#sec0) data movement methods for loading a linear segment of items from memory into a [<em>blocked arrangement</em>](index.html#sec5sec4) across a CUDA thread block. ![](block_load_logo.png)
- * \ingroup BlockModule
- *
- * \par Overview
- * The BlockLoad class provides a single data movement abstraction that can be specialized
- * to implement different cub::BlockLoadAlgorithm strategies. This facilitates different
- * performance policies for different architectures, data types, granularity sizes, etc.
- *
- * \par
- * Optionally, BlockLoad can be specialized by different data movement strategies:
- * -# <b>cub::BLOCK_LOAD_DIRECT</b>. A [<em>blocked arrangement</em>](index.html#sec5sec4)
- * of data is read directly from memory. [More...](\ref cub::BlockLoadAlgorithm)
- * -# <b>cub::BLOCK_LOAD_VECTORIZE</b>. A [<em>blocked arrangement</em>](index.html#sec5sec4)
- * of data is read directly from memory using CUDA's built-in vectorized loads as a
- * coalescing optimization. [More...](\ref cub::BlockLoadAlgorithm)
- * -# <b>cub::BLOCK_LOAD_TRANSPOSE</b>. A [<em>striped arrangement</em>](index.html#sec5sec4)
- * of data is read directly from memory and is then locally transposed into a
- * [<em>blocked arrangement</em>](index.html#sec5sec4). [More...](\ref cub::BlockLoadAlgorithm)
- * -# <b>cub::BLOCK_LOAD_WARP_TRANSPOSE</b>. A [<em>warp-striped arrangement</em>](index.html#sec5sec4)
- * of data is read directly from memory and is then locally transposed into a
- * [<em>blocked arrangement</em>](index.html#sec5sec4). [More...](\ref cub::BlockLoadAlgorithm)
- *
- * \tparam InputIteratorRA The input iterator type (may be a simple pointer type).
- * \tparam BLOCK_THREADS The thread block size in threads.
- * \tparam ITEMS_PER_THREAD The number of consecutive items partitioned onto each thread.
- * \tparam ALGORITHM <b>[optional]</b> cub::BlockLoadAlgorithm tuning policy. default: cub::BLOCK_LOAD_DIRECT.
- * \tparam MODIFIER <b>[optional]</b> cub::PtxLoadModifier cache modifier. default: cub::LOAD_DEFAULT.
- * \tparam WARP_TIME_SLICING <b>[optional]</b> For transposition-based cub::BlockLoadAlgorithm parameterizations that utilize shared memory: When \p true, only use enough shared memory for a single warp's worth of data, time-slicing the block-wide exchange over multiple synchronized rounds (default: false)
- *
- * \par A Simple Example
- * \blockcollective{BlockLoad}
- * \par
- * The code snippet below illustrates the loading of a linear
- * segment of 512 integers into a "blocked" arrangement across 128 threads where each
- * thread owns 4 consecutive items. The load is specialized for \p BLOCK_LOAD_WARP_TRANSPOSE,
- * meaning memory references are efficiently coalesced using a warp-striped access
- * pattern (after which items are locally reordered among threads).
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(int *d_data, ...)
- * {
- * // Specialize BlockLoad for 128 threads owning 4 integer items each
- * typedef cub::BlockLoad<int*, 128, 4, BLOCK_LOAD_WARP_TRANSPOSE> BlockLoad;
- *
- * // Allocate shared memory for BlockLoad
- * __shared__ typename BlockLoad::TempStorage temp_storage;
- *
- * // Load a segment of consecutive items that are blocked across threads
- * int thread_data[4];
- * BlockLoad(temp_storage).Load(d_data, thread_data);
- *
- * \endcode
- * \par
- * Suppose the input \p d_data is <tt>0, 1, 2, 3, 4, 5, ...</tt>.
- * The set of \p thread_data across the block of threads will be
- * <tt>{ [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }</tt>.
- *
- */
-template <
- typename InputIteratorRA,
- int BLOCK_THREADS,
- int ITEMS_PER_THREAD,
- BlockLoadAlgorithm ALGORITHM = BLOCK_LOAD_DIRECT,
- PtxLoadModifier MODIFIER = LOAD_DEFAULT,
- bool WARP_TIME_SLICING = false>
-class BlockLoad
-{
-private:
-
- /******************************************************************************
- * Constants and typed definitions
- ******************************************************************************/
-
- // Data type of input iterator
- typedef typename std::iterator_traits<InputIteratorRA>::value_type T;
-
-
- /******************************************************************************
- * Algorithmic variants
- ******************************************************************************/
-
- /// Load helper
- template <BlockLoadAlgorithm _POLICY, int DUMMY = 0>
- struct LoadInternal;
-
-
- /**
- * BLOCK_LOAD_DIRECT specialization of load helper
- */
- template <int DUMMY>
- struct LoadInternal<BLOCK_LOAD_DIRECT, DUMMY>
- {
- /// Shared memory storage layout type
- typedef NullType TempStorage;
-
- /// Linear thread-id
- int linear_tid;
-
- /// Constructor
- __device__ __forceinline__ LoadInternal(
- TempStorage &temp_storage,
- int linear_tid)
- :
- linear_tid(linear_tid)
- {}
-
- /// Load a linear segment of items from memory
- __device__ __forceinline__ void Load(
- InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from
- T (&items)[ITEMS_PER_THREAD]) ///< [out] Data to load
- {
- LoadBlocked<MODIFIER>(linear_tid, block_itr, items);
- }
-
- /// Load a linear segment of items from memory, guarded by range
- __device__ __forceinline__ void Load(
- InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from
- T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load
- int valid_items) ///< [in] Number of valid items to load
- {
- LoadBlocked<MODIFIER>(linear_tid, block_itr, items, valid_items);
- }
-
- /// Load a linear segment of items from memory, guarded by range, with a fall-back assignment of out-of-bound elements
- __device__ __forceinline__ void Load(
- InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from
- T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load
- int valid_items, ///< [in] Number of valid items to load
- T oob_default) ///< [in] Default value to assign out-of-bound items
- {
- LoadBlocked<MODIFIER>(linear_tid, block_itr, items, valid_items, oob_default);
- }
-
- };
-
-
- /**
- * BLOCK_LOAD_VECTORIZE specialization of load helper
- */
- template <int DUMMY>
- struct LoadInternal<BLOCK_LOAD_VECTORIZE, DUMMY>
- {
- /// Shared memory storage layout type
- typedef NullType TempStorage;
-
- /// Linear thread-id
- int linear_tid;
-
- /// Constructor
- __device__ __forceinline__ LoadInternal(
- TempStorage &temp_storage,
- int linear_tid)
- :
- linear_tid(linear_tid)
- {}
-
- /// Load a linear segment of items from memory, specialized for native pointer types (attempts vectorization)
- __device__ __forceinline__ void Load(
- T *block_ptr, ///< [in] The thread block's base input iterator for loading from
- T (&items)[ITEMS_PER_THREAD]) ///< [out] Data to load
- {
- LoadBlockedVectorized<MODIFIER>(linear_tid, block_ptr, items);
- }
-
- /// Load a linear segment of items from memory, specialized for opaque input iterators (skips vectorization)
- template <
- typename T,
- typename _InputIteratorRA>
- __device__ __forceinline__ void Load(
- _InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from
- T (&items)[ITEMS_PER_THREAD]) ///< [out] Data to load
- {
- LoadBlocked<MODIFIER>(linear_tid, block_itr, items);
- }
-
- /// Load a linear segment of items from memory, guarded by range (skips vectorization)
- __device__ __forceinline__ void Load(
- InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from
- T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load
- int valid_items) ///< [in] Number of valid items to load
- {
- LoadBlocked<MODIFIER>(linear_tid, block_itr, items, valid_items);
- }
-
- /// Load a linear segment of items from memory, guarded by range, with a fall-back assignment of out-of-bound elements (skips vectorization)
- __device__ __forceinline__ void Load(
- InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from
- T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load
- int valid_items, ///< [in] Number of valid items to load
- T oob_default) ///< [in] Default value to assign out-of-bound items
- {
- LoadBlocked<MODIFIER>(linear_tid, block_itr, items, valid_items, oob_default);
- }
-
- };
-
-
- /**
- * BLOCK_LOAD_TRANSPOSE specialization of load helper
- */
- template <int DUMMY>
- struct LoadInternal<BLOCK_LOAD_TRANSPOSE, DUMMY>
- {
- // BlockExchange utility type for keys
- typedef BlockExchange<T, BLOCK_THREADS, ITEMS_PER_THREAD, WARP_TIME_SLICING> BlockExchange;
-
- /// Shared memory storage layout type
- typedef typename BlockExchange::TempStorage _TempStorage;
-
- /// Alias wrapper allowing storage to be unioned
- struct TempStorage : Uninitialized<_TempStorage> {};
-
- /// Thread reference to shared storage
- _TempStorage &temp_storage;
-
- /// Linear thread-id
- int linear_tid;
-
- /// Constructor
- __device__ __forceinline__ LoadInternal(
- TempStorage &temp_storage,
- int linear_tid)
- :
- temp_storage(temp_storage.Alias()),
- linear_tid(linear_tid)
- {}
-
- /// Load a linear segment of items from memory
- __device__ __forceinline__ void Load(
- InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from
-        T               (&items)[ITEMS_PER_THREAD])    ///< [out] Data to load
- {
- LoadStriped<MODIFIER, BLOCK_THREADS>(linear_tid, block_itr, items);
- BlockExchange(temp_storage, linear_tid).StripedToBlocked(items);
- }
-
- /// Load a linear segment of items from memory, guarded by range
- __device__ __forceinline__ void Load(
- InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from
- T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load
- int valid_items) ///< [in] Number of valid items to load
- {
- LoadStriped<MODIFIER, BLOCK_THREADS>(linear_tid, block_itr, items, valid_items);
- BlockExchange(temp_storage, linear_tid).StripedToBlocked(items);
- }
-
- /// Load a linear segment of items from memory, guarded by range, with a fall-back assignment of out-of-bound elements
- __device__ __forceinline__ void Load(
- InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from
- T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load
- int valid_items, ///< [in] Number of valid items to load
- T oob_default) ///< [in] Default value to assign out-of-bound items
- {
- LoadStriped<MODIFIER, BLOCK_THREADS>(linear_tid, block_itr, items, valid_items, oob_default);
- BlockExchange(temp_storage, linear_tid).StripedToBlocked(items);
- }
-
- };
-
-
- /**
- * BLOCK_LOAD_WARP_TRANSPOSE specialization of load helper
- */
- template <int DUMMY>
- struct LoadInternal<BLOCK_LOAD_WARP_TRANSPOSE, DUMMY>
- {
- enum
- {
- WARP_THREADS = PtxArchProps::WARP_THREADS
- };
-
- // Assert BLOCK_THREADS must be a multiple of WARP_THREADS
- CUB_STATIC_ASSERT((BLOCK_THREADS % WARP_THREADS == 0), "BLOCK_THREADS must be a multiple of WARP_THREADS");
-
- // BlockExchange utility type for keys
- typedef BlockExchange<T, BLOCK_THREADS, ITEMS_PER_THREAD, WARP_TIME_SLICING> BlockExchange;
-
- /// Shared memory storage layout type
- typedef typename BlockExchange::TempStorage _TempStorage;
-
- /// Alias wrapper allowing storage to be unioned
- struct TempStorage : Uninitialized<_TempStorage> {};
-
- /// Thread reference to shared storage
- _TempStorage &temp_storage;
-
- /// Linear thread-id
- int linear_tid;
-
- /// Constructor
- __device__ __forceinline__ LoadInternal(
- TempStorage &temp_storage,
- int linear_tid)
- :
- temp_storage(temp_storage.Alias()),
- linear_tid(linear_tid)
- {}
-
- /// Load a linear segment of items from memory
- __device__ __forceinline__ void Load(
- InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from
-        T               (&items)[ITEMS_PER_THREAD])    ///< [out] Data to load
- {
- LoadWarpStriped<MODIFIER>(linear_tid, block_itr, items);
- BlockExchange(temp_storage, linear_tid).WarpStripedToBlocked(items);
- }
-
- /// Load a linear segment of items from memory, guarded by range
- __device__ __forceinline__ void Load(
- InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from
- T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load
- int valid_items) ///< [in] Number of valid items to load
- {
- LoadWarpStriped<MODIFIER>(linear_tid, block_itr, items, valid_items);
- BlockExchange(temp_storage, linear_tid).WarpStripedToBlocked(items);
- }
-
-
- /// Load a linear segment of items from memory, guarded by range, with a fall-back assignment of out-of-bound elements
- __device__ __forceinline__ void Load(
- InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from
- T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load
- int valid_items, ///< [in] Number of valid items to load
- T oob_default) ///< [in] Default value to assign out-of-bound items
- {
- LoadWarpStriped<MODIFIER>(linear_tid, block_itr, items, valid_items, oob_default);
- BlockExchange(temp_storage, linear_tid).WarpStripedToBlocked(items);
- }
- };
-
-
- /******************************************************************************
- * Type definitions
- ******************************************************************************/
-
- /// Internal load implementation to use
- typedef LoadInternal<ALGORITHM> InternalLoad;
-
-
- /// Shared memory storage layout type
- typedef typename InternalLoad::TempStorage _TempStorage;
-
-
- /******************************************************************************
- * Utility methods
- ******************************************************************************/
-
- /// Internal storage allocator
- __device__ __forceinline__ _TempStorage& PrivateStorage()
- {
- __shared__ _TempStorage private_storage;
- return private_storage;
- }
-
-
- /******************************************************************************
- * Thread fields
- ******************************************************************************/
-
- /// Thread reference to shared storage
- _TempStorage &temp_storage;
-
- /// Linear thread-id
- int linear_tid;
-
-public:
-
- /// \smemstorage{BlockLoad}
- struct TempStorage : Uninitialized<_TempStorage> {};
-
-
- /******************************************************************//**
- * \name Collective constructors
- *********************************************************************/
- //@{
-
- /**
- * \brief Collective constructor for 1D thread blocks using a private static allocation of shared memory as temporary storage. Threads are identified using <tt>threadIdx.x</tt>.
- */
- __device__ __forceinline__ BlockLoad()
- :
- temp_storage(PrivateStorage()),
- linear_tid(threadIdx.x)
- {}
-
-
- /**
- * \brief Collective constructor for 1D thread blocks using the specified memory allocation as temporary storage. Threads are identified using <tt>threadIdx.x</tt>.
- */
- __device__ __forceinline__ BlockLoad(
- TempStorage &temp_storage) ///< [in] Reference to memory allocation having layout type TempStorage
- :
- temp_storage(temp_storage.Alias()),
- linear_tid(threadIdx.x)
- {}
-
-
- /**
- * \brief Collective constructor using a private static allocation of shared memory as temporary storage. Each thread is identified using the supplied linear thread identifier
- */
- __device__ __forceinline__ BlockLoad(
-        int linear_tid)                        ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + threadIdx.x</tt> for 2D thread blocks)
- :
- temp_storage(PrivateStorage()),
- linear_tid(linear_tid)
- {}
-
-
- /**
- * \brief Collective constructor using the specified memory allocation as temporary storage. Each thread is identified using the supplied linear thread identifier.
- */
- __device__ __forceinline__ BlockLoad(
- TempStorage &temp_storage, ///< [in] Reference to memory allocation having layout type TempStorage
-        int linear_tid)                        ///< [in] <b>[optional]</b> A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + threadIdx.x</tt> for 2D thread blocks)
- :
- temp_storage(temp_storage.Alias()),
- linear_tid(linear_tid)
- {}
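A short sketch of the last two constructors in a kernel launched with a 2D thread block (16x8 = 128 threads); the kernel name is illustrative:

// Sketch only: pass an explicit linearized thread id for a 2D thread block
__global__ void Example2DBlockKernel(int *d_data)
{
    typedef cub::BlockLoad<int*, 128, 4> BlockLoad;
    __shared__ typename BlockLoad::TempStorage temp_storage;

    int linear_tid = (threadIdx.y * blockDim.x) + threadIdx.x;
    int thread_data[4];
    BlockLoad(temp_storage, linear_tid).Load(d_data, thread_data);
}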
-
-
-
- //@} end member group
- /******************************************************************//**
- * \name Data movement
- *********************************************************************/
- //@{
-
-
- /**
- * \brief Load a linear segment of items from memory.
- *
- * \blocked
- *
- * The code snippet below illustrates the loading of a linear
- * segment of 512 integers into a "blocked" arrangement across 128 threads where each
- * thread owns 4 consecutive items. The load is specialized for \p BLOCK_LOAD_WARP_TRANSPOSE,
- * meaning memory references are efficiently coalesced using a warp-striped access
- * pattern (after which items are locally reordered among threads).
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(int *d_data, ...)
- * {
- * // Specialize BlockLoad for 128 threads owning 4 integer items each
- * typedef cub::BlockLoad<int*, 128, 4, BLOCK_LOAD_WARP_TRANSPOSE> BlockLoad;
- *
- * // Allocate shared memory for BlockLoad
- * __shared__ typename BlockLoad::TempStorage temp_storage;
- *
- * // Load a segment of consecutive items that are blocked across threads
- * int thread_data[4];
- * BlockLoad(temp_storage).Load(d_data, thread_data);
- *
- * \endcode
- * \par
- * Suppose the input \p d_data is <tt>0, 1, 2, 3, 4, 5, ...</tt>.
- * The set of \p thread_data across the block of threads will be
- * <tt>{ [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }</tt>.
- *
- */
- __device__ __forceinline__ void Load(
- InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from
- T (&items)[ITEMS_PER_THREAD]) ///< [out] Data to load
- {
- InternalLoad(temp_storage, linear_tid).Load(block_itr, items);
- }
-
-
- /**
- * \brief Load a linear segment of items from memory, guarded by range.
- *
- * \blocked
- *
- * The code snippet below illustrates the guarded loading of a linear
- * segment of 512 integers into a "blocked" arrangement across 128 threads where each
- * thread owns 4 consecutive items. The load is specialized for \p BLOCK_LOAD_WARP_TRANSPOSE,
- * meaning memory references are efficiently coalesced using a warp-striped access
- * pattern (after which items are locally reordered among threads).
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(int *d_data, int valid_items, ...)
- * {
- * // Specialize BlockLoad for 128 threads owning 4 integer items each
- * typedef cub::BlockLoad<int*, 128, 4, BLOCK_LOAD_WARP_TRANSPOSE> BlockLoad;
- *
- * // Allocate shared memory for BlockLoad
- * __shared__ typename BlockLoad::TempStorage temp_storage;
- *
- * // Load a segment of consecutive items that are blocked across threads
- * int thread_data[4];
- * BlockLoad(temp_storage).Load(d_data, thread_data, valid_items);
- *
- * \endcode
- * \par
- * Suppose the input \p d_data is <tt>0, 1, 2, 3, 4, 5, 6...</tt> and \p valid_items is \p 5.
- * The set of \p thread_data across the block of threads will be
- * <tt>{ [0,1,2,3], [4,?,?,?], ..., [?,?,?,?] }</tt>, with only the first two threads
- * being unmasked to load portions of valid data (and other items remaining unassigned).
- *
- */
- __device__ __forceinline__ void Load(
- InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from
- T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load
- int valid_items) ///< [in] Number of valid items to load
- {
- InternalLoad(temp_storage, linear_tid).Load(block_itr, items, valid_items);
- }
-
-
- /**
- * \brief Load a linear segment of items from memory, guarded by range, with a fall-back assignment of out-of-bound elements
- *
- * \blocked
- *
- * The code snippet below illustrates the guarded loading of a linear
- * segment of 512 integers into a "blocked" arrangement across 128 threads where each
- * thread owns 4 consecutive items. The load is specialized for \p BLOCK_LOAD_WARP_TRANSPOSE,
- * meaning memory references are efficiently coalesced using a warp-striped access
- * pattern (after which items are locally reordered among threads).
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(int *d_data, int valid_items, ...)
- * {
- * // Specialize BlockLoad for 128 threads owning 4 integer items each
- * typedef cub::BlockLoad<int*, 128, 4, BLOCK_LOAD_WARP_TRANSPOSE> BlockLoad;
- *
- * // Allocate shared memory for BlockLoad
- * __shared__ typename BlockLoad::TempStorage temp_storage;
- *
- * // Load a segment of consecutive items that are blocked across threads
- * int thread_data[4];
- * BlockLoad(temp_storage).Load(d_data, thread_data, valid_items, -1);
- *
- * \endcode
- * \par
- * Suppose the input \p d_data is <tt>0, 1, 2, 3, 4, 5, 6...</tt>,
- * \p valid_items is \p 5, and the out-of-bounds default is \p -1.
- * The set of \p thread_data across the block of threads will be
- * <tt>{ [0,1,2,3], [4,-1,-1,-1], ..., [-1,-1,-1,-1] }</tt>, with only the first two threads
- * being unmasked to load portions of valid data (and other items are assigned \p -1)
- *
- */
- __device__ __forceinline__ void Load(
- InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from
- T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load
- int valid_items, ///< [in] Number of valid items to load
- T oob_default) ///< [in] Default value to assign out-of-bound items
- {
- InternalLoad(temp_storage, linear_tid).Load(block_itr, items, valid_items, oob_default);
- }
-
-
- //@} end member group
-
-};
-
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
diff --git a/lib/kokkos/TPL/cub/block/block_radix_rank.cuh b/lib/kokkos/TPL/cub/block/block_radix_rank.cuh
deleted file mode 100755
index 149a62c65..000000000
--- a/lib/kokkos/TPL/cub/block/block_radix_rank.cuh
+++ /dev/null
@@ -1,479 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * cub::BlockRadixRank provides operations for ranking unsigned integer types within a CUDA threadblock
- */
-
-#pragma once
-
-#include "../util_arch.cuh"
-#include "../util_type.cuh"
-#include "../thread/thread_reduce.cuh"
-#include "../thread/thread_scan.cuh"
-#include "../block/block_scan.cuh"
-#include "../util_namespace.cuh"
-
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-/**
- * \brief BlockRadixRank provides operations for ranking unsigned integer types within a CUDA threadblock.
- * \ingroup BlockModule
- *
- * \par Overview
- * BlockRadixRank ranks a [<em>blocked arrangement</em>](index.html#sec5sec4) of keys by a single radix
- * digit: for each key it computes the offset at which that key would land if the tile were stably
- * partitioned by the digit value found at the current bit position.
- *
- * \tparam BLOCK_THREADS The thread block size in threads
- * \tparam RADIX_BITS <b>[optional]</b> The number of radix bits per digit place (default: 5 bits)
- * \tparam MEMOIZE_OUTER_SCAN <b>[optional]</b> Whether or not to buffer outer raking scan partials to incur fewer shared memory reads at the expense of higher register pressure (default: true for architectures SM35 and newer, false otherwise). See BlockScanAlgorithm::BLOCK_SCAN_RAKING_MEMOIZE for more details.
- * \tparam INNER_SCAN_ALGORITHM <b>[optional]</b> The cub::BlockScanAlgorithm algorithm to use (default: cub::BLOCK_SCAN_WARP_SCANS)
- * \tparam SMEM_CONFIG <b>[optional]</b> Shared memory bank mode (default: \p cudaSharedMemBankSizeFourByte)
- *
- * \par Usage Considerations
- * - Keys must be in a form suitable for radix ranking (i.e., unsigned bits).
- * - Assumes a [<em>blocked arrangement</em>](index.html#sec5sec4) of elements across threads
- * - \smemreuse{BlockRadixRank::TempStorage}
- *
- * \par Performance Considerations
- *
- * \par Algorithm
- * These parallel radix ranking variants have <em>O</em>(<em>n</em>) work complexity and are implemented in XXX phases:
- * -# blah
- * -# blah
- *
- * \par Examples
- * \par
- * - <b>Example 1:</b> Simple radix rank of 32-bit integer keys
- *      \code
- *      #include <cub/cub.cuh>
- *
- *      template <int BLOCK_THREADS>
- *      __global__ void ExampleKernel(...)
- *      {
- *          // Specialize BlockRadixRank for BLOCK_THREADS threads and 4 radix bits per pass
- *          typedef cub::BlockRadixRank<BLOCK_THREADS, 4> BlockRadixRank;
- *          __shared__ typename BlockRadixRank::TempStorage temp_storage;
- *
- *          unsigned int thread_keys[1];
- *          int thread_ranks[1];
- *          ...
- *          BlockRadixRank(temp_storage).RankKeys(thread_keys, thread_ranks, 0);
- *      }
- *      \endcode
- */
-template <
- int BLOCK_THREADS,
- int RADIX_BITS,
- bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false,
- BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS,
- cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte>
-class BlockRadixRank
-{
-private:
-
- /******************************************************************************
- * Type definitions and constants
- ******************************************************************************/
-
- // Integer type for digit counters (to be packed into words of type PackedCounters)
- typedef unsigned short DigitCounter;
-
- // Integer type for packing DigitCounters into columns of shared memory banks
- typedef typename If<(SMEM_CONFIG == cudaSharedMemBankSizeEightByte),
- unsigned long long,
- unsigned int>::Type PackedCounter;
-
- enum
- {
- RADIX_DIGITS = 1 << RADIX_BITS,
-
- LOG_WARP_THREADS = PtxArchProps::LOG_WARP_THREADS,
- WARP_THREADS = 1 << LOG_WARP_THREADS,
- WARPS = (BLOCK_THREADS + WARP_THREADS - 1) / WARP_THREADS,
-
- BYTES_PER_COUNTER = sizeof(DigitCounter),
- LOG_BYTES_PER_COUNTER = Log2<BYTES_PER_COUNTER>::VALUE,
-
- PACKING_RATIO = sizeof(PackedCounter) / sizeof(DigitCounter),
- LOG_PACKING_RATIO = Log2<PACKING_RATIO>::VALUE,
-
- LOG_COUNTER_LANES = CUB_MAX((RADIX_BITS - LOG_PACKING_RATIO), 0), // Always at least one lane
- COUNTER_LANES = 1 << LOG_COUNTER_LANES,
-
- // The number of packed counters per thread (plus one for padding)
- RAKING_SEGMENT = COUNTER_LANES + 1,
-
- LOG_SMEM_BANKS = PtxArchProps::LOG_SMEM_BANKS,
- SMEM_BANKS = 1 << LOG_SMEM_BANKS,
- };
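As a worked instance of these constants (for illustration): with RADIX_BITS = 5 and the default four-byte bank mode, DigitCounter is a 2-byte unsigned short and PackedCounter a 4-byte unsigned int, so PACKING_RATIO = 2 and LOG_PACKING_RATIO = 1; then LOG_COUNTER_LANES = 5 - 1 = 4, COUNTER_LANES = 16, RAKING_SEGMENT = 17 packed counters per thread, and RADIX_DIGITS = 32.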
-
-
- /// BlockScan type
- typedef BlockScan<PackedCounter, BLOCK_THREADS, INNER_SCAN_ALGORITHM> BlockScan;
-
-
- /// Shared memory storage layout type for BlockRadixRank
- struct _TempStorage
- {
- // Storage for scanning local ranks
- typename BlockScan::TempStorage block_scan;
-
- union
- {
- DigitCounter digit_counters[COUNTER_LANES + 1][BLOCK_THREADS][PACKING_RATIO];
- PackedCounter raking_grid[BLOCK_THREADS][RAKING_SEGMENT];
- };
- };
-
-
- /******************************************************************************
- * Thread fields
- ******************************************************************************/
-
- /// Shared storage reference
- _TempStorage &temp_storage;
-
- /// Linear thread-id
- int linear_tid;
-
- /// Copy of raking segment, promoted to registers
- PackedCounter cached_segment[RAKING_SEGMENT];
-
-
- /******************************************************************************
- * Templated iteration
- ******************************************************************************/
-
- // General template iteration
- template <int COUNT, int MAX>
- struct Iterate
- {
- /**
- * Decode keys. Decodes the radix digit from the current digit place
- * and increments the thread's corresponding counter in shared
- * memory for that digit.
- *
- * Saves both (1) the prior value of that counter (the key's
- * thread-local exclusive prefix sum for that digit), and (2) the shared
- * memory offset of the counter (for later use).
- */
- template <typename UnsignedBits, int KEYS_PER_THREAD>
- static __device__ __forceinline__ void DecodeKeys(
- BlockRadixRank &cta, // BlockRadixRank instance
- UnsignedBits (&keys)[KEYS_PER_THREAD], // Key to decode
- DigitCounter (&thread_prefixes)[KEYS_PER_THREAD], // Prefix counter value (out parameter)
- DigitCounter* (&digit_counters)[KEYS_PER_THREAD], // Counter smem offset (out parameter)
- int current_bit) // The least-significant bit position of the current digit to extract
- {
- // Add in sub-counter offset
- UnsignedBits sub_counter = BFE(keys[COUNT], current_bit + LOG_COUNTER_LANES, LOG_PACKING_RATIO);
-
- // Add in row offset
- UnsignedBits row_offset = BFE(keys[COUNT], current_bit, LOG_COUNTER_LANES);
-
- // Pointer to smem digit counter
- digit_counters[COUNT] = &cta.temp_storage.digit_counters[row_offset][cta.linear_tid][sub_counter];
-
- // Load thread-exclusive prefix
- thread_prefixes[COUNT] = *digit_counters[COUNT];
-
- // Store inclusive prefix
- *digit_counters[COUNT] = thread_prefixes[COUNT] + 1;
-
- // Iterate next key
- Iterate<COUNT + 1, MAX>::DecodeKeys(cta, keys, thread_prefixes, digit_counters, current_bit);
- }
-
-
- // Termination
- template <int KEYS_PER_THREAD>
- static __device__ __forceinline__ void UpdateRanks(
- int (&ranks)[KEYS_PER_THREAD], // Local ranks (out parameter)
- DigitCounter (&thread_prefixes)[KEYS_PER_THREAD], // Prefix counter value
- DigitCounter* (&digit_counters)[KEYS_PER_THREAD]) // Counter smem offset
- {
- // Add in threadblock exclusive prefix
- ranks[COUNT] = thread_prefixes[COUNT] + *digit_counters[COUNT];
-
- // Iterate next key
- Iterate<COUNT + 1, MAX>::UpdateRanks(ranks, thread_prefixes, digit_counters);
- }
- };
-
-
- // Termination
- template <int MAX>
- struct Iterate<MAX, MAX>
- {
- // DecodeKeys
- template <typename UnsignedBits, int KEYS_PER_THREAD>
- static __device__ __forceinline__ void DecodeKeys(
- BlockRadixRank &cta,
- UnsignedBits (&keys)[KEYS_PER_THREAD],
- DigitCounter (&thread_prefixes)[KEYS_PER_THREAD],
- DigitCounter* (&digit_counters)[KEYS_PER_THREAD],
- int current_bit) {}
-
-
- // UpdateRanks
- template <int KEYS_PER_THREAD>
- static __device__ __forceinline__ void UpdateRanks(
- int (&ranks)[KEYS_PER_THREAD],
- DigitCounter (&thread_prefixes)[KEYS_PER_THREAD],
- DigitCounter *(&digit_counters)[KEYS_PER_THREAD]) {}
- };
-
-
- /******************************************************************************
- * Utility methods
- ******************************************************************************/
-
- /**
- * Internal storage allocator
- */
- __device__ __forceinline__ _TempStorage& PrivateStorage()
- {
- __shared__ _TempStorage private_storage;
- return private_storage;
- }
-
-
- /**
- * Performs upsweep raking reduction, returning the aggregate
- */
- __device__ __forceinline__ PackedCounter Upsweep()
- {
- PackedCounter *smem_raking_ptr = temp_storage.raking_grid[linear_tid];
- PackedCounter *raking_ptr;
-
- if (MEMOIZE_OUTER_SCAN)
- {
- // Copy data into registers
- #pragma unroll
- for (int i = 0; i < RAKING_SEGMENT; i++)
- {
- cached_segment[i] = smem_raking_ptr[i];
- }
- raking_ptr = cached_segment;
- }
- else
- {
- raking_ptr = smem_raking_ptr;
- }
-
- return ThreadReduce<RAKING_SEGMENT>(raking_ptr, Sum());
- }
-
-
- /// Performs exclusive downsweep raking scan
- __device__ __forceinline__ void ExclusiveDownsweep(
- PackedCounter raking_partial)
- {
- PackedCounter *smem_raking_ptr = temp_storage.raking_grid[linear_tid];
-
- PackedCounter *raking_ptr = (MEMOIZE_OUTER_SCAN) ?
- cached_segment :
- smem_raking_ptr;
-
- // Exclusive raking downsweep scan
- ThreadScanExclusive<RAKING_SEGMENT>(raking_ptr, raking_ptr, Sum(), raking_partial);
-
- if (MEMOIZE_OUTER_SCAN)
- {
- // Copy data back to smem
- #pragma unroll
- for (int i = 0; i < RAKING_SEGMENT; i++)
- {
- smem_raking_ptr[i] = cached_segment[i];
- }
- }
- }
-
-
- /**
- * Reset shared memory digit counters
- */
- __device__ __forceinline__ void ResetCounters()
- {
- // Reset shared memory digit counters
- #pragma unroll
- for (int LANE = 0; LANE < COUNTER_LANES + 1; LANE++)
- {
- *((PackedCounter*) temp_storage.digit_counters[LANE][linear_tid]) = 0;
- }
- }
-
-
- /**
- * Scan shared memory digit counters.
- */
- __device__ __forceinline__ void ScanCounters()
- {
- // Upsweep scan
- PackedCounter raking_partial = Upsweep();
-
- // Compute inclusive sum
- PackedCounter inclusive_partial;
- PackedCounter packed_aggregate;
- BlockScan(temp_storage.block_scan, linear_tid).InclusiveSum(raking_partial, inclusive_partial, packed_aggregate);
-
- // Propagate totals in packed fields
- #pragma unroll
- for (int PACKED = 1; PACKED < PACKING_RATIO; PACKED++)
- {
- inclusive_partial += packed_aggregate << (sizeof(DigitCounter) * 8 * PACKED);
- }
-
- // Downsweep scan with exclusive partial
- PackedCounter exclusive_partial = inclusive_partial - raking_partial;
- ExclusiveDownsweep(exclusive_partial);
- }
-
-public:
-
-    /// \smemstorage{BlockRadixRank}
- struct TempStorage : Uninitialized<_TempStorage> {};
-
-
- /******************************************************************//**
- * \name Collective constructors
- *********************************************************************/
- //@{
-
- /**
- * \brief Collective constructor for 1D thread blocks using a private static allocation of shared memory as temporary storage. Threads are identified using <tt>threadIdx.x</tt>.
- */
- __device__ __forceinline__ BlockRadixRank()
- :
- temp_storage(PrivateStorage()),
- linear_tid(threadIdx.x)
- {}
-
-
- /**
- * \brief Collective constructor for 1D thread blocks using the specified memory allocation as temporary storage. Threads are identified using <tt>threadIdx.x</tt>.
- */
- __device__ __forceinline__ BlockRadixRank(
- TempStorage &temp_storage) ///< [in] Reference to memory allocation having layout type TempStorage
- :
- temp_storage(temp_storage.Alias()),
- linear_tid(threadIdx.x)
- {}
-
-
- /**
- * \brief Collective constructor using a private static allocation of shared memory as temporary storage. Each thread is identified using the supplied linear thread identifier
- */
- __device__ __forceinline__ BlockRadixRank(
-        int linear_tid)                        ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + threadIdx.x</tt> for 2D thread blocks)
- :
- temp_storage(PrivateStorage()),
- linear_tid(linear_tid)
- {}
-
-
- /**
- * \brief Collective constructor using the specified memory allocation as temporary storage. Each thread is identified using the supplied linear thread identifier.
- */
- __device__ __forceinline__ BlockRadixRank(
- TempStorage &temp_storage, ///< [in] Reference to memory allocation having layout type TempStorage
-        int linear_tid)                        ///< [in] <b>[optional]</b> A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + threadIdx.x</tt> for 2D thread blocks)
- :
- temp_storage(temp_storage.Alias()),
- linear_tid(linear_tid)
- {}
-
-
-
- //@} end member group
- /******************************************************************//**
- * \name Raking
- *********************************************************************/
- //@{
-
- /**
- * \brief Rank keys.
- */
- template <
- typename UnsignedBits,
- int KEYS_PER_THREAD>
- __device__ __forceinline__ void RankKeys(
- UnsignedBits (&keys)[KEYS_PER_THREAD], ///< [in] Keys for this tile
- int (&ranks)[KEYS_PER_THREAD], ///< [out] For each key, the local rank within the tile
- int current_bit) ///< [in] The least-significant bit position of the current digit to extract
- {
- DigitCounter thread_prefixes[KEYS_PER_THREAD]; // For each key, the count of previous keys in this tile having the same digit
- DigitCounter* digit_counters[KEYS_PER_THREAD]; // For each key, the byte-offset of its corresponding digit counter in smem
-
- // Reset shared memory digit counters
- ResetCounters();
-
- // Decode keys and update digit counters
- Iterate<0, KEYS_PER_THREAD>::DecodeKeys(*this, keys, thread_prefixes, digit_counters, current_bit);
-
- __syncthreads();
-
- // Scan shared memory counters
- ScanCounters();
-
- __syncthreads();
-
- // Extract the local ranks of each key
- Iterate<0, KEYS_PER_THREAD>::UpdateRanks(ranks, thread_prefixes, digit_counters);
- }
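A rough sketch of one ranking pass that pairs RankKeys with cub::BlockExchange::ScatterToBlocked, mirroring the pairing used by BlockRadixSort in the next file; the kernel name and the 5-bit digit width are illustrative assumptions:

// Sketch only: 128 threads x 4 keys, one 5-bit digit pass starting at bit 0
__global__ void ExampleRankKernel(unsigned int *d_keys)
{
    typedef cub::BlockRadixRank<128, 5>              BlockRadixRank;
    typedef cub::BlockExchange<unsigned int, 128, 4> BlockExchange;

    // Reuse one shared allocation for both collectives
    __shared__ union
    {
        typename BlockRadixRank::TempStorage rank;
        typename BlockExchange::TempStorage  exchange;
    } temp_storage;

    unsigned int thread_keys[4];
    int          thread_ranks[4];
    // ... load thread_keys in a blocked arrangement ...

    // Rank keys on the digit starting at bit 0
    BlockRadixRank(temp_storage.rank, threadIdx.x).RankKeys(thread_keys, thread_ranks, 0);

    __syncthreads();

    // Scatter keys to their ranked (digit-partitioned) positions within the tile
    BlockExchange(temp_storage.exchange, threadIdx.x).ScatterToBlocked(thread_keys, thread_ranks);
}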
-
-
- /**
- * \brief Rank keys.  In addition, for each of the lower \p RADIX_DIGITS threads, the inclusive prefix count for the digit corresponding to that thread is returned in \p inclusive_digit_prefix.
- */
- template <
- typename UnsignedBits,
- int KEYS_PER_THREAD>
- __device__ __forceinline__ void RankKeys(
- UnsignedBits (&keys)[KEYS_PER_THREAD], ///< [in] Keys for this tile
- int (&ranks)[KEYS_PER_THREAD], ///< [out] For each key, the local rank within the tile (out parameter)
- int current_bit, ///< [in] The least-significant bit position of the current digit to extract
-        int             &inclusive_digit_prefix)            ///< [out] The inclusive prefix sum for the digit corresponding to threadIdx.x
- {
- // Rank keys
- RankKeys(keys, ranks, current_bit);
-
- // Get the inclusive and exclusive digit totals corresponding to the calling thread.
- if ((BLOCK_THREADS == RADIX_DIGITS) || (linear_tid < RADIX_DIGITS))
- {
- // Obtain ex/inclusive digit counts. (Unfortunately these all reside in the
- // first counter column, resulting in unavoidable bank conflicts.)
- int counter_lane = (linear_tid & (COUNTER_LANES - 1));
- int sub_counter = linear_tid >> (LOG_COUNTER_LANES);
- inclusive_digit_prefix = temp_storage.digit_counters[counter_lane + 1][0][sub_counter];
- }
- }
-};
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
-
diff --git a/lib/kokkos/TPL/cub/block/block_radix_sort.cuh b/lib/kokkos/TPL/cub/block/block_radix_sort.cuh
deleted file mode 100755
index 873d40126..000000000
--- a/lib/kokkos/TPL/cub/block/block_radix_sort.cuh
+++ /dev/null
@@ -1,608 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * The cub::BlockRadixSort class provides [<em>collective</em>](index.html#sec0) methods for radix sorting of items partitioned across a CUDA thread block.
- */
-
-
-#pragma once
-
-#include "../util_namespace.cuh"
-#include "../util_arch.cuh"
-#include "../util_type.cuh"
-#include "block_exchange.cuh"
-#include "block_radix_rank.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-/**
- * \brief The cub::BlockRadixSort class provides [<em>collective</em>](index.html#sec0) methods for sorting items partitioned across a CUDA thread block using a radix sorting method. ![](sorting_logo.png)
- * \ingroup BlockModule
- *
- * \par Overview
- * The [<em>radix sorting method</em>](http://en.wikipedia.org/wiki/Radix_sort) arranges
- * items into ascending order. It relies upon a positional representation for
- * keys, i.e., each key is comprised of an ordered sequence of symbols (e.g., digits,
- * characters, etc.) specified from least-significant to most-significant. For a
- * given input sequence of keys and a set of rules specifying a total ordering
- * of the symbolic alphabet, the radix sorting method produces a lexicographic
- * ordering of those keys.
- *
- * \par
- * BlockRadixSort can sort all of the built-in C++ numeric primitive types, e.g.:
- * <tt>unsigned char</tt>, \p int, \p double, etc. Within each key, the implementation treats fixed-length
- * bit-sequences of \p RADIX_BITS as radix digit places. Although the direct radix sorting
- * method can only be applied to unsigned integral types, BlockRadixSort
- * is able to sort signed and floating-point types via simple bit-wise transformations
- * that ensure lexicographic key ordering.
- *
- * \tparam Key Key type
- * \tparam BLOCK_THREADS The thread block size in threads
- * \tparam ITEMS_PER_THREAD The number of items per thread
- * \tparam Value <b>[optional]</b> Value type (default: cub::NullType)
- * \tparam RADIX_BITS <b>[optional]</b> The number of radix bits per digit place (default: 4 bits)
- * \tparam MEMOIZE_OUTER_SCAN <b>[optional]</b> Whether or not to buffer outer raking scan partials to incur fewer shared memory reads at the expense of higher register pressure (default: true for architectures SM35 and newer, false otherwise).
- * \tparam INNER_SCAN_ALGORITHM <b>[optional]</b> The cub::BlockScanAlgorithm algorithm to use (default: cub::BLOCK_SCAN_WARP_SCANS)
- * \tparam SMEM_CONFIG <b>[optional]</b> Shared memory bank mode (default: \p cudaSharedMemBankSizeFourByte)
- *
- * \par A Simple Example
- * \blockcollective{BlockRadixSort}
- * \par
- * The code snippet below illustrates a sort of 512 integer keys that
- * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec4) across 128 threads
- * where each thread owns 4 consecutive items.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize BlockRadixSort for 128 threads owning 4 integer items each
- * typedef cub::BlockRadixSort<int, 128, 4> BlockRadixSort;
- *
- * // Allocate shared memory for BlockRadixSort
- * __shared__ typename BlockRadixSort::TempStorage temp_storage;
- *
- * // Obtain a segment of consecutive items that are blocked across threads
- * int thread_keys[4];
- * ...
- *
- * // Collectively sort the keys
- * BlockRadixSort(temp_storage).Sort(thread_keys);
- *
- * ...
- * \endcode
- * \par
- * Suppose the set of input \p thread_keys across the block of threads is
- * <tt>{ [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }</tt>. The
- * corresponding output \p thread_keys in those threads will be
- * <tt>{ [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }</tt>.
- *
- */
-template <
- typename Key,
- int BLOCK_THREADS,
- int ITEMS_PER_THREAD,
- typename Value = NullType,
- int RADIX_BITS = 4,
- bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false,
- BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS,
- cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte>
-class BlockRadixSort
-{
-private:
-
- /******************************************************************************
- * Constants and type definitions
- ******************************************************************************/
-
- // Key traits and unsigned bits type
- typedef NumericTraits<Key> KeyTraits;
- typedef typename KeyTraits::UnsignedBits UnsignedBits;
-
- /// BlockRadixRank utility type
- typedef BlockRadixRank<BLOCK_THREADS, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG> BlockRadixRank;
-
- /// BlockExchange utility type for keys
- typedef BlockExchange<Key, BLOCK_THREADS, ITEMS_PER_THREAD> BlockExchangeKeys;
-
- /// BlockExchange utility type for values
- typedef BlockExchange<Value, BLOCK_THREADS, ITEMS_PER_THREAD> BlockExchangeValues;
-
- /// Shared memory storage layout type
- struct _TempStorage
- {
- union
- {
- typename BlockRadixRank::TempStorage ranking_storage;
- typename BlockExchangeKeys::TempStorage exchange_keys;
- typename BlockExchangeValues::TempStorage exchange_values;
- };
- };
-
- /******************************************************************************
- * Utility methods
- ******************************************************************************/
-
- /// Internal storage allocator
- __device__ __forceinline__ _TempStorage& PrivateStorage()
- {
- __shared__ _TempStorage private_storage;
- return private_storage;
- }
-
-
- /******************************************************************************
- * Thread fields
- ******************************************************************************/
-
- /// Shared storage reference
- _TempStorage &temp_storage;
-
- /// Linear thread-id
- int linear_tid;
-
-
-public:
-
-    /// \smemstorage{BlockRadixSort}
- struct TempStorage : Uninitialized<_TempStorage> {};
-
-
- /******************************************************************//**
- * \name Collective constructors
- *********************************************************************/
- //@{
-
- /**
- * \brief Collective constructor for 1D thread blocks using a private static allocation of shared memory as temporary storage. Threads are identified using <tt>threadIdx.x</tt>.
- */
- __device__ __forceinline__ BlockRadixSort()
- :
- temp_storage(PrivateStorage()),
- linear_tid(threadIdx.x)
- {}
-
-
- /**
- * \brief Collective constructor for 1D thread blocks using the specified memory allocation as temporary storage. Threads are identified using <tt>threadIdx.x</tt>.
- */
- __device__ __forceinline__ BlockRadixSort(
- TempStorage &temp_storage) ///< [in] Reference to memory allocation having layout type TempStorage
- :
- temp_storage(temp_storage.Alias()),
- linear_tid(threadIdx.x)
- {}
-
-
- /**
- * \brief Collective constructor using a private static allocation of shared memory as temporary storage. Each thread is identified using the supplied linear thread identifier
- */
- __device__ __forceinline__ BlockRadixSort(
-        int linear_tid)                        ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + threadIdx.x</tt> for 2D thread blocks)
- :
- temp_storage(PrivateStorage()),
- linear_tid(linear_tid)
- {}
-
-
- /**
- * \brief Collective constructor using the specified memory allocation as temporary storage. Each thread is identified using the supplied linear thread identifier.
- */
- __device__ __forceinline__ BlockRadixSort(
- TempStorage &temp_storage, ///< [in] Reference to memory allocation having layout type TempStorage
-        int linear_tid)                        ///< [in] <b>[optional]</b> A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + threadIdx.x</tt> for 2D thread blocks)
- :
- temp_storage(temp_storage.Alias()),
- linear_tid(linear_tid)
- {}
-
-
-
- //@} end member group
- /******************************************************************//**
- * \name Sorting (blocked arrangements)
- *********************************************************************/
- //@{
-
- /**
- * \brief Performs a block-wide radix sort over a [<em>blocked arrangement</em>](index.html#sec5sec4) of keys.
- *
- * \smemreuse
- *
- * The code snippet below illustrates a sort of 512 integer keys that
- * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec4) across 128 threads
- * where each thread owns 4 consecutive keys.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize BlockRadixSort for 128 threads owning 4 integer keys each
- * typedef cub::BlockRadixSort<int, 128, 4> BlockRadixSort;
- *
- * // Allocate shared memory for BlockRadixSort
- * __shared__ typename BlockRadixSort::TempStorage temp_storage;
- *
- * // Obtain a segment of consecutive items that are blocked across threads
- * int thread_keys[4];
- * ...
- *
- * // Collectively sort the keys
- * BlockRadixSort(temp_storage).Sort(thread_keys);
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_keys across the block of threads is
- * <tt>{ [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }</tt>.
- * The corresponding output \p thread_keys in those threads will be
- * <tt>{ [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }</tt>.
- */
- __device__ __forceinline__ void Sort(
- Key (&keys)[ITEMS_PER_THREAD], ///< [in-out] Keys to sort
- int begin_bit = 0, ///< [in] <b>[optional]</b> The beginning (least-significant) bit index needed for key comparison
- int end_bit = sizeof(Key) * 8) ///< [in] <b>[optional]</b> The past-the-end (most-significant) bit index needed for key comparison
- {
- UnsignedBits (&unsigned_keys)[ITEMS_PER_THREAD] =
- reinterpret_cast<UnsignedBits (&)[ITEMS_PER_THREAD]>(keys);
-
- // Twiddle bits if necessary
- #pragma unroll
- for (int KEY = 0; KEY < ITEMS_PER_THREAD; KEY++)
- {
- unsigned_keys[KEY] = KeyTraits::TwiddleIn(unsigned_keys[KEY]);
- }
-
- // Radix sorting passes
- while (true)
- {
- // Rank the blocked keys
- int ranks[ITEMS_PER_THREAD];
- BlockRadixRank(temp_storage.ranking_storage, linear_tid).RankKeys(unsigned_keys, ranks, begin_bit);
- begin_bit += RADIX_BITS;
-
- __syncthreads();
-
- // Exchange keys through shared memory in blocked arrangement
- BlockExchangeKeys(temp_storage.exchange_keys, linear_tid).ScatterToBlocked(keys, ranks);
-
- // Quit if done
- if (begin_bit >= end_bit) break;
-
- __syncthreads();
- }
-
- // Untwiddle bits if necessary
- #pragma unroll
- for (int KEY = 0; KEY < ITEMS_PER_THREAD; KEY++)
- {
- unsigned_keys[KEY] = KeyTraits::TwiddleOut(unsigned_keys[KEY]);
- }
- }
-
-
- /**
- * \brief Performs a block-wide radix sort across a [<em>blocked arrangement</em>](index.html#sec5sec4) of keys and values.
- *
- * BlockRadixSort can only accommodate one associated tile of values. To "truck along"
- * more than one tile of values, simply perform a key-value sort of the keys paired
- * with a temporary value array that enumerates the key indices. The reordered indices
- * can then be used as a gather-vector for exchanging other associated tile data through
- * shared memory.
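- *
- * As a rough sketch (illustrative only; <tt>thread_keys</tt>, <tt>thread_idx</tt>,
- * and <tt>temp_storage</tt> are placeholder names, and BlockRadixSort is assumed
- * to be specialized with an <tt>int</tt> value type):
- * \par
- * \code
- * // Enumerate each item's original position
- * int thread_idx[ITEMS_PER_THREAD];
- * for (int i = 0; i < ITEMS_PER_THREAD; ++i)
- *     thread_idx[i] = (threadIdx.x * ITEMS_PER_THREAD) + i;
- *
- * // Key-value sort carries the indices along with the keys
- * BlockRadixSort(temp_storage).Sort(thread_keys, thread_idx);
- *
- * // thread_idx is now a gather-vector for exchanging any other
- * // per-item tile data through shared memory
- * \endcode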
- *
- * \smemreuse
- *
- * The code snippet below illustrates a sort of 512 integer keys and values that
- * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec4) across 128 threads
- * where each thread owns 4 consecutive pairs.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize BlockRadixSort for 128 threads owning 4 integer keys and values each
- * typedef cub::BlockRadixSort<int, 128, 4, int> BlockRadixSort;
- *
- * // Allocate shared memory for BlockRadixSort
- * __shared__ typename BlockRadixSort::TempStorage temp_storage;
- *
- * // Obtain a segment of consecutive items that are blocked across threads
- * int thread_keys[4];
- * int thread_values[4];
- * ...
- *
- * // Collectively sort the keys and values among block threads
- * BlockRadixSort(temp_storage).Sort(thread_keys, thread_values);
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_keys across the block of threads is
- * <tt>{ [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }</tt>. The
- * corresponding output \p thread_keys in those threads will be
- * <tt>{ [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }</tt>.
- *
- */
- __device__ __forceinline__ void Sort(
- Key (&keys)[ITEMS_PER_THREAD], ///< [in-out] Keys to sort
- Value (&values)[ITEMS_PER_THREAD], ///< [in-out] Values to sort
- int begin_bit = 0, ///< [in] <b>[optional]</b> The beginning (least-significant) bit index needed for key comparison
- int end_bit = sizeof(Key) * 8) ///< [in] <b>[optional]</b> The past-the-end (most-significant) bit index needed for key comparison
- {
- UnsignedBits (&unsigned_keys)[ITEMS_PER_THREAD] =
- reinterpret_cast<UnsignedBits (&)[ITEMS_PER_THREAD]>(keys);
-
- // Twiddle bits if necessary
- #pragma unroll
- for (int KEY = 0; KEY < ITEMS_PER_THREAD; KEY++)
- {
- unsigned_keys[KEY] = KeyTraits::TwiddleIn(unsigned_keys[KEY]);
- }
-
- // Radix sorting passes
- while (true)
- {
- // Rank the blocked keys
- int ranks[ITEMS_PER_THREAD];
- BlockRadixRank(temp_storage.ranking_storage, linear_tid).RankKeys(unsigned_keys, ranks, begin_bit);
- begin_bit += RADIX_BITS;
-
- __syncthreads();
-
- // Exchange keys through shared memory in blocked arrangement
- BlockExchangeKeys(temp_storage.exchange_keys, linear_tid).ScatterToBlocked(keys, ranks);
-
- __syncthreads();
-
- // Exchange values through shared memory in blocked arrangement
- BlockExchangeValues(temp_storage.exchange_values, linear_tid).ScatterToBlocked(values, ranks);
-
- // Quit if done
- if (begin_bit >= end_bit) break;
-
- __syncthreads();
- }
-
- // Untwiddle bits if necessary
- #pragma unroll
- for (int KEY = 0; KEY < ITEMS_PER_THREAD; KEY++)
- {
- unsigned_keys[KEY] = KeyTraits::TwiddleOut(unsigned_keys[KEY]);
- }
- }
-
-
- //@} end member group
- /******************************************************************//**
- * \name Sorting (blocked arrangement -> striped arrangement)
- *********************************************************************/
- //@{
-
-
- /**
- * \brief Performs a radix sort across a [<em>blocked arrangement</em>](index.html#sec5sec4) of keys, leaving them in a [<em>striped arrangement</em>](index.html#sec5sec4).
- *
- * \smemreuse
- *
- * The code snippet below illustrates a sort of 512 integer keys that
- * are initially partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec4) across 128 threads
- * where each thread owns 4 consecutive keys. The final partitioning is striped.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize BlockRadixSort for 128 threads owning 4 integer keys each
- * typedef cub::BlockRadixSort<int, 128, 4> BlockRadixSort;
- *
- * // Allocate shared memory for BlockRadixSort
- * __shared__ typename BlockRadixSort::TempStorage temp_storage;
- *
- * // Obtain a segment of consecutive items that are blocked across threads
- * int thread_keys[4];
- * ...
- *
- * // Collectively sort the keys
- * BlockRadixSort(temp_storage).SortBlockedToStriped(thread_keys);
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_keys across the block of threads is
- * <tt>{ [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }</tt>. The
- * corresponding output \p thread_keys in those threads will be
- * <tt>{ [0,128,256,384], [1,129,257,385], [2,130,258,386], ..., [127,255,383,511] }</tt>.
- *
- */
- __device__ __forceinline__ void SortBlockedToStriped(
- Key (&keys)[ITEMS_PER_THREAD], ///< [in-out] Keys to sort
- int begin_bit = 0, ///< [in] <b>[optional]</b> The beginning (least-significant) bit index needed for key comparison
- int end_bit = sizeof(Key) * 8) ///< [in] <b>[optional]</b> The past-the-end (most-significant) bit index needed for key comparison
- {
- UnsignedBits (&unsigned_keys)[ITEMS_PER_THREAD] =
- reinterpret_cast<UnsignedBits (&)[ITEMS_PER_THREAD]>(keys);
-
- // Twiddle bits if necessary
- #pragma unroll
- for (int KEY = 0; KEY < ITEMS_PER_THREAD; KEY++)
- {
- unsigned_keys[KEY] = KeyTraits::TwiddleIn(unsigned_keys[KEY]);
- }
-
- // Radix sorting passes
- while (true)
- {
- // Rank the blocked keys
- int ranks[ITEMS_PER_THREAD];
- BlockRadixRank(temp_storage.ranking_storage, linear_tid).RankKeys(unsigned_keys, ranks, begin_bit);
- begin_bit += RADIX_BITS;
-
- __syncthreads();
-
- // Check if this is the last pass
- if (begin_bit >= end_bit)
- {
- // Last pass exchanges keys through shared memory in striped arrangement
- BlockExchangeKeys(temp_storage.exchange_keys, linear_tid).ScatterToStriped(keys, ranks);
-
- // Quit
- break;
- }
-
- // Exchange keys through shared memory in blocked arrangement
- BlockExchangeKeys(temp_storage.exchange_keys, linear_tid).ScatterToBlocked(keys, ranks);
-
- __syncthreads();
- }
-
- // Untwiddle bits if necessary
- #pragma unroll
- for (int KEY = 0; KEY < ITEMS_PER_THREAD; KEY++)
- {
- unsigned_keys[KEY] = KeyTraits::TwiddleOut(unsigned_keys[KEY]);
- }
- }
-
-
- /**
- * \brief Performs a radix sort across a [<em>blocked arrangement</em>](index.html#sec5sec4) of keys and values, leaving them in a [<em>striped arrangement</em>](index.html#sec5sec4).
- *
- * BlockRadixSort can only accommodate one associated tile of values. To "truck along"
- * more than one tile of values, simply perform a key-value sort of the keys paired
- * with a temporary value array that enumerates the key indices. The reordered indices
- * can then be used as a gather-vector for exchanging other associated tile data through
- * shared memory.
- *
- * \smemreuse
- *
- * The code snippet below illustrates a sort of 512 integer keys and values that
- * are initially partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec4) across 128 threads
- * where each thread owns 4 consecutive pairs. The final partitioning is striped.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize BlockRadixSort for 128 threads owning 4 integer keys and values each
- * typedef cub::BlockRadixSort<int, 128, 4, int> BlockRadixSort;
- *
- * // Allocate shared memory for BlockRadixSort
- * __shared__ typename BlockRadixSort::TempStorage temp_storage;
- *
- * // Obtain a segment of consecutive items that are blocked across threads
- * int thread_keys[4];
- * int thread_values[4];
- * ...
- *
- * // Collectively sort the keys and values among block threads
- * BlockRadixSort(temp_storage).SortBlockedToStriped(thread_keys, thread_values);
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_keys across the block of threads is
- * <tt>{ [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }</tt>. The
- * corresponding output \p thread_keys in those threads will be
- * <tt>{ [0,128,256,384], [1,129,257,385], [2,130,258,386], ..., [127,255,383,511] }</tt>.
- *
- */
- __device__ __forceinline__ void SortBlockedToStriped(
- Key (&keys)[ITEMS_PER_THREAD], ///< [in-out] Keys to sort
- Value (&values)[ITEMS_PER_THREAD], ///< [in-out] Values to sort
- int begin_bit = 0, ///< [in] <b>[optional]</b> The beginning (least-significant) bit index needed for key comparison
- int end_bit = sizeof(Key) * 8) ///< [in] <b>[optional]</b> The past-the-end (most-significant) bit index needed for key comparison
- {
- UnsignedBits (&unsigned_keys)[ITEMS_PER_THREAD] =
- reinterpret_cast<UnsignedBits (&)[ITEMS_PER_THREAD]>(keys);
-
- // Twiddle bits if necessary
- #pragma unroll
- for (int KEY = 0; KEY < ITEMS_PER_THREAD; KEY++)
- {
- unsigned_keys[KEY] = KeyTraits::TwiddleIn(unsigned_keys[KEY]);
- }
-
- // Radix sorting passes
- while (true)
- {
- // Rank the blocked keys
- int ranks[ITEMS_PER_THREAD];
- BlockRadixRank(temp_storage.ranking_storage, linear_tid).RankKeys(unsigned_keys, ranks, begin_bit);
- begin_bit += RADIX_BITS;
-
- __syncthreads();
-
- // Check if this is the last pass
- if (begin_bit >= end_bit)
- {
- // Last pass exchanges keys through shared memory in striped arrangement
- BlockExchangeKeys(temp_storage.exchange_keys, linear_tid).ScatterToStriped(keys, ranks);
-
- __syncthreads();
-
- // Last pass exchanges through shared memory in striped arrangement
- BlockExchangeValues(temp_storage.exchange_values, linear_tid).ScatterToStriped(values, ranks);
-
- // Quit
- break;
- }
-
- // Exchange keys through shared memory in blocked arrangement
- BlockExchangeKeys(temp_storage.exchange_keys, linear_tid).ScatterToBlocked(keys, ranks);
-
- __syncthreads();
-
- // Exchange values through shared memory in blocked arrangement
- BlockExchangeValues(temp_storage.exchange_values, linear_tid).ScatterToBlocked(values, ranks);
-
- __syncthreads();
- }
-
- // Untwiddle bits if necessary
- #pragma unroll
- for (int KEY = 0; KEY < ITEMS_PER_THREAD; KEY++)
- {
- unsigned_keys[KEY] = KeyTraits::TwiddleOut(unsigned_keys[KEY]);
- }
- }
-
-
- //@} end member group
-
-};
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
diff --git a/lib/kokkos/TPL/cub/block/block_raking_layout.cuh b/lib/kokkos/TPL/cub/block/block_raking_layout.cuh
deleted file mode 100755
index 878a786cd..000000000
--- a/lib/kokkos/TPL/cub/block/block_raking_layout.cuh
+++ /dev/null
@@ -1,145 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * cub::BlockRakingLayout provides a conflict-free shared memory layout abstraction for warp-raking across thread block data.
- */
-
-
-#pragma once
-
-#include "../util_macro.cuh"
-#include "../util_arch.cuh"
-#include "../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-/**
- * \brief BlockRakingLayout provides a conflict-free shared memory layout abstraction for raking across thread block data. ![](raking.png)
- * \ingroup BlockModule
- *
- * \par Overview
- * This type facilitates a shared memory usage pattern where a block of CUDA
- * threads places elements into shared memory and then reduces the active
- * parallelism to one "raking" warp of threads for serially aggregating consecutive
- * sequences of shared items. Padding is inserted to eliminate bank conflicts
- * (for most data types).
- *
- * \tparam T The data type to be exchanged.
- * \tparam BLOCK_THREADS The thread block size in threads.
- * \tparam BLOCK_STRIPS When strip-mining, the number of threadblock-strips per tile
- */
-template <
- typename T,
- int BLOCK_THREADS,
- int BLOCK_STRIPS = 1>
-struct BlockRakingLayout
-{
- //---------------------------------------------------------------------
- // Constants and typedefs
- //---------------------------------------------------------------------
-
- enum
- {
- /// The total number of elements that need to be cooperatively reduced
- SHARED_ELEMENTS =
- BLOCK_THREADS * BLOCK_STRIPS,
-
- /// Maximum number of warp-synchronous raking threads
- MAX_RAKING_THREADS =
- CUB_MIN(BLOCK_THREADS, PtxArchProps::WARP_THREADS),
-
- /// Number of raking elements per warp-synchronous raking thread (rounded up)
- SEGMENT_LENGTH =
- (SHARED_ELEMENTS + MAX_RAKING_THREADS - 1) / MAX_RAKING_THREADS,
-
- /// Never use a raking thread that will have no valid data (e.g., when BLOCK_THREADS is 62 and SEGMENT_LENGTH is 2, we should only use 31 raking threads)
- RAKING_THREADS =
- (SHARED_ELEMENTS + SEGMENT_LENGTH - 1) / SEGMENT_LENGTH,
-
- /// Pad each segment length with one element if it evenly divides the number of banks
- SEGMENT_PADDING =
- (PtxArchProps::SMEM_BANKS % SEGMENT_LENGTH == 0) ? 1 : 0,
-
- /// Total number of elements in the raking grid
- GRID_ELEMENTS =
- RAKING_THREADS * (SEGMENT_LENGTH + SEGMENT_PADDING),
-
- /// Whether raking can proceed without bounds checking (i.e., the number of reduction elements is an even multiple of the number of raking threads)
- UNGUARDED =
- (SHARED_ELEMENTS % RAKING_THREADS == 0),
- };
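-
- // Worked example (illustrative): for T = int, BLOCK_THREADS = 128, and an
- // architecture with 32 shared memory banks, the constants above evaluate to
- // SHARED_ELEMENTS = 128, SEGMENT_LENGTH = 4, RAKING_THREADS = 32,
- // SEGMENT_PADDING = 1, GRID_ELEMENTS = 160, and UNGUARDED = true.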
-
-
- /**
- * \brief Shared memory storage type
- */
- typedef T TempStorage[BlockRakingLayout::GRID_ELEMENTS];
-
-
- /**
- * \brief Returns the location for the calling thread to place data into the grid
- */
- static __device__ __forceinline__ T* PlacementPtr(
- TempStorage &temp_storage,
- int linear_tid,
- int block_strip = 0)
- {
- // Offset for partial
- unsigned int offset = (block_strip * BLOCK_THREADS) + linear_tid;
-
- // Add in one padding element for every segment
- if (SEGMENT_PADDING > 0)
- {
- offset += offset / SEGMENT_LENGTH;
- }
-
- // Return a pointer into the grid (the offset accounts for one padding element per shared memory segment)
- return temp_storage + offset;
- }
-
-
- /**
- * \brief Returns the location for the calling thread to begin sequential raking
- */
- static __device__ __forceinline__ T* RakingPtr(
- TempStorage &temp_storage,
- int linear_tid)
- {
- return temp_storage + (linear_tid * (SEGMENT_LENGTH + SEGMENT_PADDING));
- }
-};
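-
-/*
- * Illustrative sketch (not part of the original header): a raking partial-sum
- * over one int per thread using the layout above.  The function name, the
- * 128-thread block size, and the trailing warp-synchronous step are
- * assumptions made for this example only.
- *
- *   __device__ int ExampleRakingSum(int thread_partial, int linear_tid)
- *   {
- *       typedef BlockRakingLayout<int, 128> Layout;
- *       __shared__ Layout::TempStorage raking_grid;
- *
- *       // Each thread deposits its partial into the (padded) raking grid
- *       *Layout::PlacementPtr(raking_grid, linear_tid) = thread_partial;
- *       __syncthreads();
- *
- *       int sum = 0;
- *       if (linear_tid < Layout::RAKING_THREADS)
- *       {
- *           // Raking threads serially reduce their segments
- *           int *segment = Layout::RakingPtr(raking_grid, linear_tid);
- *           for (int i = 0; i < Layout::SEGMENT_LENGTH; ++i)
- *               sum += segment[i];
- *           // ... a warp-synchronous reduction of 'sum' would follow here
- *       }
- *       return sum;
- *   }
- */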
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
diff --git a/lib/kokkos/TPL/cub/block/block_reduce.cuh b/lib/kokkos/TPL/cub/block/block_reduce.cuh
deleted file mode 100755
index ffdff7377..000000000
--- a/lib/kokkos/TPL/cub/block/block_reduce.cuh
+++ /dev/null
@@ -1,563 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * The cub::BlockReduce class provides [<em>collective</em>](index.html#sec0) methods for computing a parallel reduction of items partitioned across a CUDA thread block.
- */
-
-#pragma once
-
-#include "specializations/block_reduce_raking.cuh"
-#include "specializations/block_reduce_warp_reductions.cuh"
-#include "../util_type.cuh"
-#include "../thread/thread_operators.cuh"
-#include "../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-
-/******************************************************************************
- * Algorithmic variants
- ******************************************************************************/
-
-/**
- * BlockReduceAlgorithm enumerates alternative algorithms for parallel
- * reduction across a CUDA threadblock.
- */
-enum BlockReduceAlgorithm
-{
-
- /**
- * \par Overview
- * An efficient "raking" reduction algorithm. Execution is comprised of
- * three phases:
- * -# Upsweep sequential reduction in registers (if threads contribute more
- * than one input each). Each thread then places the partial reduction
- * of its item(s) into shared memory.
- * -# Upsweep sequential reduction in shared memory. Threads within a
- * single warp rake across segments of shared partial reductions.
- * -# A warp-synchronous Kogge-Stone style reduction within the raking warp.
- *
- * \par
- * \image html block_reduce.png
- * <div class="centercaption">\p BLOCK_REDUCE_RAKING data flow for a hypothetical 16-thread threadblock and 4-thread raking warp.</div>
- *
- * \par Performance Considerations
- * - Although this variant may suffer longer turnaround latencies when the
- * GPU is under-occupied, it can often provide higher overall throughput
- * across the GPU when suitably occupied.
- */
- BLOCK_REDUCE_RAKING,
-
-
- /**
- * \par Overview
- * A quick "tiled warp-reductions" reduction algorithm. Execution is
- * comprised of three phases:
- * -# Upsweep sequential reduction in registers (if threads contribute more
- * than one input each). Each thread then places the partial reduction
- * of its item(s) into shared memory.
- * -# Compute a shallow, but inefficient warp-synchronous Kogge-Stone style
- * reduction within each warp.
- * -# A propagation phase where the warp reduction outputs in each warp are
- * updated with the aggregate from each preceding warp.
- *
- * \par
- * \image html block_scan_warpscans.png
- * <div class="centercaption">\p BLOCK_REDUCE_WARP_REDUCTIONS data flow for a hypothetical 16-thread threadblock and 4-thread raking warp.</div>
- *
- * \par Performance Considerations
- * - Although this variant may suffer lower overall throughput across the
- * GPU due to a heavy reliance on inefficient warp-reductions, it
- * can often provide lower turnaround latencies when the GPU is
- * under-occupied.
- */
- BLOCK_REDUCE_WARP_REDUCTIONS,
-};
-
-
-/******************************************************************************
- * Block reduce
- ******************************************************************************/
-
-/**
- * \brief The BlockReduce class provides [<em>collective</em>](index.html#sec0) methods for computing a parallel reduction of items partitioned across a CUDA thread block. ![](reduce_logo.png)
- * \ingroup BlockModule
- *
- * \par Overview
- * A <a href="http://en.wikipedia.org/wiki/Reduce_(higher-order_function)"><em>reduction</em></a> (or <em>fold</em>)
- * uses a binary combining operator to compute a single aggregate from a list of input elements.
- *
- * \par
- * Optionally, BlockReduce can be specialized by algorithm to accommodate different latency/throughput workload profiles:
- * -# <b>cub::BLOCK_REDUCE_RAKING</b>. An efficient "raking" reduction algorithm. [More...](\ref cub::BlockReduceAlgorithm)
- * -# <b>cub::BLOCK_REDUCE_WARP_REDUCTIONS</b>. A quick "tiled warp-reductions" reduction algorithm. [More...](\ref cub::BlockReduceAlgorithm)
- *
- * \tparam T Data type being reduced
- * \tparam BLOCK_THREADS The thread block size in threads
- * \tparam ALGORITHM <b>[optional]</b> cub::BlockReduceAlgorithm enumerator specifying the underlying algorithm to use (default: cub::BLOCK_REDUCE_RAKING)
- *
- * \par Performance Considerations
- * - Very efficient (only one synchronization barrier).
- * - Zero bank conflicts for most types.
- * - Computation is slightly more efficient (i.e., having lower instruction overhead) for:
- * - Summation (<b><em>vs.</em></b> generic reduction)
- * - \p BLOCK_THREADS is a multiple of the architecture's warp size
- * - Every thread has a valid input (i.e., full <b><em>vs.</em></b> partial-tiles)
- * - See cub::BlockReduceAlgorithm for performance details regarding algorithmic alternatives
- *
- * \par A Simple Example
- * \blockcollective{BlockReduce}
- * \par
- * The code snippet below illustrates a sum reduction of 512 integer items that
- * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec4) across 128 threads
- * where each thread owns 4 consecutive items.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize BlockReduce for 128 threads on type int
- * typedef cub::BlockReduce<int, 128> BlockReduce;
- *
- * // Allocate shared memory for BlockReduce
- * __shared__ typename BlockReduce::TempStorage temp_storage;
- *
- * // Obtain a segment of consecutive items that are blocked across threads
- * int thread_data[4];
- * ...
- *
- * // Compute the block-wide sum for thread0
- * int aggregate = BlockReduce(temp_storage).Sum(thread_data);
- *
- * \endcode
- *
- */
-template <
- typename T,
- int BLOCK_THREADS,
- BlockReduceAlgorithm ALGORITHM = BLOCK_REDUCE_RAKING>
-class BlockReduce
-{
-private:
-
- /******************************************************************************
- * Constants and typedefs
- ******************************************************************************/
-
- /// Internal specialization.
- typedef typename If<(ALGORITHM == BLOCK_REDUCE_WARP_REDUCTIONS),
- BlockReduceWarpReductions<T, BLOCK_THREADS>,
- BlockReduceRaking<T, BLOCK_THREADS> >::Type InternalBlockReduce;
-
- /// Shared memory storage layout type for BlockReduce
- typedef typename InternalBlockReduce::TempStorage _TempStorage;
-
-
- /******************************************************************************
- * Utility methods
- ******************************************************************************/
-
- /// Internal storage allocator
- __device__ __forceinline__ _TempStorage& PrivateStorage()
- {
- __shared__ _TempStorage private_storage;
- return private_storage;
- }
-
-
- /******************************************************************************
- * Thread fields
- ******************************************************************************/
-
- /// Shared storage reference
- _TempStorage &temp_storage;
-
- /// Linear thread-id
- int linear_tid;
-
-
-public:
-
- /// \smemstorage{BlockReduce}
- struct TempStorage : Uninitialized<_TempStorage> {};
-
-
- /******************************************************************//**
- * \name Collective constructors
- *********************************************************************/
- //@{
-
- /**
- * \brief Collective constructor for 1D thread blocks using a private static allocation of shared memory as temporary storage. Threads are identified using <tt>threadIdx.x</tt>.
- */
- __device__ __forceinline__ BlockReduce()
- :
- temp_storage(PrivateStorage()),
- linear_tid(threadIdx.x)
- {}
-
-
- /**
- * \brief Collective constructor for 1D thread blocks using the specified memory allocation as temporary storage. Threads are identified using <tt>threadIdx.x</tt>.
- */
- __device__ __forceinline__ BlockReduce(
- TempStorage &temp_storage) ///< [in] Reference to memory allocation having layout type TempStorage
- :
- temp_storage(temp_storage.Alias()),
- linear_tid(threadIdx.x)
- {}
-
-
- /**
- * \brief Collective constructor using a private static allocation of shared memory as temporary storage. Each thread is identified using the supplied linear thread identifier
- */
- __device__ __forceinline__ BlockReduce(
- int linear_tid) ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + linear_tid</tt> for 2D thread blocks)
- :
- temp_storage(PrivateStorage()),
- linear_tid(linear_tid)
- {}
-
-
- /**
- * \brief Collective constructor using the specified memory allocation as temporary storage. Each thread is identified using the supplied linear thread identifier.
- */
- __device__ __forceinline__ BlockReduce(
- TempStorage &temp_storage, ///< [in] Reference to memory allocation having layout type TempStorage
- int linear_tid) ///< [in] <b>[optional]</b> A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + linear_tid</tt> for 2D thread blocks)
- :
- temp_storage(temp_storage.Alias()),
- linear_tid(linear_tid)
- {}
-
-
-
- //@} end member group
- /******************************************************************//**
- * \name Generic reductions
- *********************************************************************/
- //@{
-
-
- /**
- * \brief Computes a block-wide reduction for thread<sub>0</sub> using the specified binary reduction functor. Each thread contributes one input element.
- *
- * The return value is undefined in threads other than thread<sub>0</sub>.
- *
- * Supports non-commutative reduction operators.
- *
- * \smemreuse
- *
- * The code snippet below illustrates a max reduction of 128 integer items that
- * are partitioned across 128 threads.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize BlockReduce for 128 threads on type int
- * typedef cub::BlockReduce<int, 128> BlockReduce;
- *
- * // Allocate shared memory for BlockReduce
- * __shared__ typename BlockReduce::TempStorage temp_storage;
- *
- * // Each thread obtains an input item
- * int thread_data;
- * ...
- *
- * // Compute the block-wide max for thread0
- * int aggregate = BlockReduce(temp_storage).Reduce(thread_data, cub::Max());
- *
- * \endcode
- *
- * \tparam ReductionOp <b>[inferred]</b> Binary reduction operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
- template <typename ReductionOp>
- __device__ __forceinline__ T Reduce(
- T input, ///< [in] Calling thread's input
- ReductionOp reduction_op) ///< [in] Binary reduction operator
- {
- return InternalBlockReduce(temp_storage, linear_tid).template Reduce<true>(input, BLOCK_THREADS, reduction_op);
- }
-
-
- /**
- * \brief Computes a block-wide reduction for thread<sub>0</sub> using the specified binary reduction functor. Each thread contributes an array of consecutive input elements.
- *
- * The return value is undefined in threads other than thread<sub>0</sub>.
- *
- * Supports non-commutative reduction operators.
- *
- * \blocked
- *
- * \smemreuse
- *
- * The code snippet below illustrates a max reduction of 512 integer items that
- * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec4) across 128 threads
- * where each thread owns 4 consecutive items.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize BlockReduce for 128 threads on type int
- * typedef cub::BlockReduce<int, 128> BlockReduce;
- *
- * // Allocate shared memory for BlockReduce
- * __shared__ typename BlockReduce::TempStorage temp_storage;
- *
- * // Obtain a segment of consecutive items that are blocked across threads
- * int thread_data[4];
- * ...
- *
- * // Compute the block-wide max for thread0
- * int aggregate = BlockReduce(temp_storage).Reduce(thread_data, cub::Max());
- *
- * \endcode
- *
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- * \tparam ReductionOp <b>[inferred]</b> Binary reduction operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
- template <
- int ITEMS_PER_THREAD,
- typename ReductionOp>
- __device__ __forceinline__ T Reduce(
- T (&inputs)[ITEMS_PER_THREAD], ///< [in] Calling thread's input segment
- ReductionOp reduction_op) ///< [in] Binary reduction operator
- {
- // Reduce partials
- T partial = ThreadReduce(inputs, reduction_op);
- return Reduce(partial, reduction_op);
- }
-
-
- /**
- * \brief Computes a block-wide reduction for thread<sub>0</sub> using the specified binary reduction functor. The first \p num_valid threads each contribute one input element.
- *
- * The return value is undefined in threads other than thread<sub>0</sub>.
- *
- * Supports non-commutative reduction operators.
- *
- * \blocked
- *
- * \smemreuse
- *
- * The code snippet below illustrates a max reduction of a partially-full tile of integer items that
- * are partitioned across 128 threads.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(int num_valid, ...)
- * {
- * // Specialize BlockReduce for 128 threads on type int
- * typedef cub::BlockReduce<int, 128> BlockReduce;
- *
- * // Allocate shared memory for BlockReduce
- * __shared__ typename BlockReduce::TempStorage temp_storage;
- *
- * // Each thread obtains an input item
- * int thread_data;
- * if (threadIdx.x < num_valid) thread_data = ...
- *
- * // Compute the block-wide max for thread0
- * int aggregate = BlockReduce(temp_storage).Reduce(thread_data, cub::Max(), num_valid);
- *
- * \endcode
- *
- * \tparam ReductionOp <b>[inferred]</b> Binary reduction operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
- template <typename ReductionOp>
- __device__ __forceinline__ T Reduce(
- T input, ///< [in] Calling thread's input
- ReductionOp reduction_op, ///< [in] Binary reduction operator
- int num_valid) ///< [in] Number of threads containing valid elements (may be less than BLOCK_THREADS)
- {
- // Determine if we can skip bounds checking
- if (num_valid >= BLOCK_THREADS)
- {
- return InternalBlockReduce(temp_storage, linear_tid).template Reduce<true>(input, num_valid, reduction_op);
- }
- else
- {
- return InternalBlockReduce(temp_storage, linear_tid).template Reduce<false>(input, num_valid, reduction_op);
- }
- }
-
-
- //@} end member group
- /******************************************************************//**
- * \name Summation reductions
- *********************************************************************/
- //@{
-
-
- /**
- * \brief Computes a block-wide reduction for thread<sub>0</sub> using addition (+) as the reduction operator. Each thread contributes one input element.
- *
- * The return value is undefined in threads other than thread<sub>0</sub>.
- *
- * \smemreuse
- *
- * The code snippet below illustrates a sum reduction of 128 integer items that
- * are partitioned across 128 threads.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize BlockReduce for 128 threads on type int
- * typedef cub::BlockReduce<int, 128> BlockReduce;
- *
- * // Allocate shared memory for BlockReduce
- * __shared__ typename BlockReduce::TempStorage temp_storage;
- *
- * // Each thread obtains an input item
- * int thread_data;
- * ...
- *
- * // Compute the block-wide sum for thread0
- * int aggregate = BlockReduce(temp_storage).Sum(thread_data);
- *
- * \endcode
- *
- */
- __device__ __forceinline__ T Sum(
- T input) ///< [in] Calling thread's input
- {
- return InternalBlockReduce(temp_storage, linear_tid).template Sum<true>(input, BLOCK_THREADS);
- }
-
- /**
- * \brief Computes a block-wide reduction for thread<sub>0</sub> using addition (+) as the reduction operator. Each thread contributes an array of consecutive input elements.
- *
- * The return value is undefined in threads other than thread<sub>0</sub>.
- *
- * \smemreuse
- *
- * The code snippet below illustrates a sum reduction of 512 integer items that
- * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec4) across 128 threads
- * where each thread owns 4 consecutive items.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize BlockReduce for 128 threads on type int
- * typedef cub::BlockReduce<int, 128> BlockReduce;
- *
- * // Allocate shared memory for BlockReduce
- * __shared__ typename BlockReduce::TempStorage temp_storage;
- *
- * // Obtain a segment of consecutive items that are blocked across threads
- * int thread_data[4];
- * ...
- *
- * // Compute the block-wide sum for thread0
- * int aggregate = BlockReduce(temp_storage).Sum(thread_data);
- *
- * \endcode
- *
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- */
- template <int ITEMS_PER_THREAD>
- __device__ __forceinline__ T Sum(
- T (&inputs)[ITEMS_PER_THREAD]) ///< [in] Calling thread's input segment
- {
- // Reduce partials
- T partial = ThreadReduce(inputs, cub::Sum());
- return Sum(partial);
- }
-
-
- /**
- * \brief Computes a block-wide reduction for thread<sub>0</sub> using addition (+) as the reduction operator. The first \p num_valid threads each contribute one input element.
- *
- * The return value is undefined in threads other than thread<sub>0</sub>.
- *
- * \smemreuse
- *
- * The code snippet below illustrates a sum reduction of a partially-full tile of integer items that
- * are partitioned across 128 threads.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(int num_valid, ...)
- * {
- * // Specialize BlockReduce for 128 threads on type int
- * typedef cub::BlockReduce<int, 128> BlockReduce;
- *
- * // Allocate shared memory for BlockReduce
- * __shared__ typename BlockReduce::TempStorage temp_storage;
- *
- * // Each thread obtains an input item (up to num_valid)
- * int thread_data;
- * if (threadIdx.x < num_valid)
- * thread_data = ...
- *
- * // Compute the block-wide sum for thread0
- * int aggregate = BlockReduce(temp_storage).Sum(thread_data, num_valid);
- *
- * \endcode
- *
- */
- __device__ __forceinline__ T Sum(
- T input, ///< [in] Calling thread's input
- int num_valid) ///< [in] Number of threads containing valid elements (may be less than BLOCK_THREADS)
- {
- // Determine if we can skip bounds checking
- if (num_valid >= BLOCK_THREADS)
- {
- return InternalBlockReduce(temp_storage, linear_tid).template Sum<true>(input, num_valid);
- }
- else
- {
- return InternalBlockReduce(temp_storage, linear_tid).template Sum<false>(input, num_valid);
- }
- }
-
-
- //@} end member group
-};
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
diff --git a/lib/kokkos/TPL/cub/block/block_scan.cuh b/lib/kokkos/TPL/cub/block/block_scan.cuh
deleted file mode 100755
index 1c1a2dac8..000000000
--- a/lib/kokkos/TPL/cub/block/block_scan.cuh
+++ /dev/null
@@ -1,2233 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * The cub::BlockScan class provides [<em>collective</em>](index.html#sec0) methods for computing a parallel prefix sum/scan of items partitioned across a CUDA thread block.
- */
-
-#pragma once
-
-#include "specializations/block_scan_raking.cuh"
-#include "specializations/block_scan_warp_scans.cuh"
-#include "../util_arch.cuh"
-#include "../util_type.cuh"
-#include "../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/******************************************************************************
- * Algorithmic variants
- ******************************************************************************/
-
-/**
- * \brief BlockScanAlgorithm enumerates alternative algorithms for cub::BlockScan to compute a parallel prefix scan across a CUDA thread block.
- */
-enum BlockScanAlgorithm
-{
-
- /**
- * \par Overview
- * An efficient "raking reduce-then-scan" prefix scan algorithm. Execution is comprised of five phases:
- * -# Upsweep sequential reduction in registers (if threads contribute more than one input each). Each thread then places the partial reduction of its item(s) into shared memory.
- * -# Upsweep sequential reduction in shared memory. Threads within a single warp rake across segments of shared partial reductions.
- * -# A warp-synchronous Kogge-Stone style exclusive scan within the raking warp.
- * -# Downsweep sequential exclusive scan in shared memory. Threads within a single warp rake across segments of shared partial reductions, seeded with the warp-scan output.
- * -# Downsweep sequential scan in registers (if threads contribute more than one input), seeded with the raking scan output.
- *
- * \par
- * \image html block_scan_raking.png
- * <div class="centercaption">\p BLOCK_SCAN_RAKING data flow for a hypothetical 16-thread threadblock and 4-thread raking warp.</div>
- *
- * \par Performance Considerations
- * - Although this variant may suffer longer turnaround latencies when the
- * GPU is under-occupied, it can often provide higher overall throughput
- * across the GPU when suitably occupied.
- */
- BLOCK_SCAN_RAKING,
-
-
- /**
- * \par Overview
- * Similar to cub::BLOCK_SCAN_RAKING, but with fewer shared memory reads at
- * the expense of higher register pressure. Raking threads preserve their
- * "upsweep" segment of values in registers while performing warp-synchronous
- * scan, allowing the "downsweep" not to re-read them from shared memory.
- */
- BLOCK_SCAN_RAKING_MEMOIZE,
-
-
- /**
- * \par Overview
- * A quick "tiled warpscans" prefix scan algorithm. Execution is comprised of four phases:
- * -# Upsweep sequential reduction in registers (if threads contribute more than one input each). Each thread then places the partial reduction of its item(s) into shared memory.
- * -# Compute a shallow, but inefficient warp-synchronous Kogge-Stone style scan within each warp.
- * -# A propagation phase where the warp scan outputs in each warp are updated with the aggregate from each preceding warp.
- * -# Downsweep sequential scan in registers (if threads contribute more than one input), seeded with the raking scan output.
- *
- * \par
- * \image html block_scan_warpscans.png
- * <div class="centercaption">\p BLOCK_SCAN_WARP_SCANS data flow for a hypothetical 16-thread threadblock and 4-thread raking warp.</div>
- *
- * \par Performance Considerations
- * - Although this variant may suffer lower overall throughput across the
- * GPU due to a heavy reliance on inefficient warpscans, it can
- * often provide lower turnaround latencies when the GPU is under-occupied.
- */
- BLOCK_SCAN_WARP_SCANS,
-};
-
-
-/******************************************************************************
- * Block scan
- ******************************************************************************/
-
-/**
- * \brief The BlockScan class provides [<em>collective</em>](index.html#sec0) methods for computing a parallel prefix sum/scan of items partitioned across a CUDA thread block. ![](block_scan_logo.png)
- * \ingroup BlockModule
- *
- * \par Overview
- * Given a list of input elements and a binary reduction operator, a [<em>prefix scan</em>](http://en.wikipedia.org/wiki/Prefix_sum)
- * produces an output list where each element is computed to be the reduction
- * of the elements occurring earlier in the input list. <em>Prefix sum</em>
- * connotes a prefix scan with the addition operator. The term \em inclusive indicates
- * that the <em>i</em><sup>th</sup> output reduction incorporates the <em>i</em><sup>th</sup> input.
- * The term \em exclusive indicates the <em>i</em><sup>th</sup> input is not incorporated into
- * the <em>i</em><sup>th</sup> output reduction.
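- *
- * For example, an exclusive prefix sum over the input sequence
- * <tt>[8,6,7,5]</tt> produces <tt>[0,8,14,21]</tt>, whereas the inclusive
- * variant produces <tt>[8,14,21,26]</tt>.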
- *
- * \par
- * Optionally, BlockScan can be specialized by algorithm to accommodate different latency/throughput workload profiles:
- * -# <b>cub::BLOCK_SCAN_RAKING</b>. An efficient "raking reduce-then-scan" prefix scan algorithm. [More...](\ref cub::BlockScanAlgorithm)
- * -# <b>cub::BLOCK_SCAN_WARP_SCANS</b>. A quick "tiled warpscans" prefix scan algorithm. [More...](\ref cub::BlockScanAlgorithm)
- *
- * \tparam T Data type being scanned
- * \tparam BLOCK_THREADS The thread block size in threads
- * \tparam ALGORITHM <b>[optional]</b> cub::BlockScanAlgorithm enumerator specifying the underlying algorithm to use (default: cub::BLOCK_SCAN_RAKING)
- *
- * \par A Simple Example
- * \blockcollective{BlockScan}
- * \par
- * The code snippet below illustrates an exclusive prefix sum of 512 integer items that
- * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec4) across 128 threads
- * where each thread owns 4 consecutive items.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize BlockScan for 128 threads on type int
- * typedef cub::BlockScan<int, 128> BlockScan;
- *
- * // Allocate shared memory for BlockScan
- * __shared__ typename BlockScan::TempStorage temp_storage;
- *
- * // Obtain a segment of consecutive items that are blocked across threads
- * int thread_data[4];
- * ...
- *
- * // Collectively compute the block-wide exclusive prefix sum
- * BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data);
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is
- * <tt>{ [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }</tt>.
- * The corresponding output \p thread_data in those threads will be
- * <tt>{ [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }</tt>.
- *
- * \par Performance Considerations
- * - Uses special instructions when applicable (e.g., warp \p SHFL)
- * - Uses synchronization-free communication between warp lanes when applicable
- * - Uses only one or two block-wide synchronization barriers (depending on
- * algorithm selection)
- * - Zero bank conflicts for most types
- * - Computation is slightly more efficient (i.e., having lower instruction overhead) for:
- * - Prefix sum variants (<b><em>vs.</em></b> generic scan)
- * - Exclusive variants (<b><em>vs.</em></b> inclusive)
- * - \p BLOCK_THREADS is a multiple of the architecture's warp size
- * - See cub::BlockScanAlgorithm for performance details regarding algorithmic alternatives
- *
- */
-template <
- typename T,
- int BLOCK_THREADS,
- BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING>
-class BlockScan
-{
-private:
-
- /******************************************************************************
- * Constants and typedefs
- ******************************************************************************/
-
- /**
- * Ensure the template parameterization meets the requirements of the
- * specified algorithm. Currently, the BLOCK_SCAN_WARP_SCANS policy
- * cannot be used with threadblock sizes that are not a multiple of the
- * architectural warp size.
- */
- static const BlockScanAlgorithm SAFE_ALGORITHM =
- ((ALGORITHM == BLOCK_SCAN_WARP_SCANS) && (BLOCK_THREADS % PtxArchProps::WARP_THREADS != 0)) ?
- BLOCK_SCAN_RAKING :
- ALGORITHM;
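-
- // For example, BlockScan<int, 48, BLOCK_SCAN_WARP_SCANS> falls back to
- // BLOCK_SCAN_RAKING on a 32-thread-warp target, since 48 is not a multiple
- // of the warp size.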
-
- /// Internal specialization.
- typedef typename If<(SAFE_ALGORITHM == BLOCK_SCAN_WARP_SCANS),
- BlockScanWarpScans<T, BLOCK_THREADS>,
- BlockScanRaking<T, BLOCK_THREADS, (SAFE_ALGORITHM == BLOCK_SCAN_RAKING_MEMOIZE)> >::Type InternalBlockScan;
-
-
- /// Shared memory storage layout type for BlockScan
- typedef typename InternalBlockScan::TempStorage _TempStorage;
-
-
- /******************************************************************************
- * Thread fields
- ******************************************************************************/
-
- /// Shared storage reference
- _TempStorage &temp_storage;
-
- /// Linear thread-id
- int linear_tid;
-
-
- /******************************************************************************
- * Utility methods
- ******************************************************************************/
-
- /// Internal storage allocator
- __device__ __forceinline__ _TempStorage& PrivateStorage()
- {
- __shared__ _TempStorage private_storage;
- return private_storage;
- }
-
-
-public:
-
- /// \smemstorage{BlockScan}
- struct TempStorage : Uninitialized<_TempStorage> {};
-
-
- /******************************************************************//**
- * \name Collective constructors
- *********************************************************************/
- //@{
-
- /**
- * \brief Collective constructor for 1D thread blocks using a private static allocation of shared memory as temporary storage. Threads are identified using <tt>threadIdx.x</tt>.
- */
- __device__ __forceinline__ BlockScan()
- :
- temp_storage(PrivateStorage()),
- linear_tid(threadIdx.x)
- {}
-
-
- /**
- * \brief Collective constructor for 1D thread blocks using the specified memory allocation as temporary storage. Threads are identified using <tt>threadIdx.x</tt>.
- */
- __device__ __forceinline__ BlockScan(
- TempStorage &temp_storage) ///< [in] Reference to memory allocation having layout type TempStorage
- :
- temp_storage(temp_storage.Alias()),
- linear_tid(threadIdx.x)
- {}
-
-
- /**
- * \brief Collective constructor using a private static allocation of shared memory as temporary storage. Each thread is identified using the supplied linear thread identifier
- */
- __device__ __forceinline__ BlockScan(
- int linear_tid) ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + linear_tid</tt> for 2D thread blocks)
- :
- temp_storage(PrivateStorage()),
- linear_tid(linear_tid)
- {}
-
-
- /**
- * \brief Collective constructor using the specified memory allocation as temporary storage. Each thread is identified using the supplied linear thread identifier.
- */
- __device__ __forceinline__ BlockScan(
- TempStorage &temp_storage, ///< [in] Reference to memory allocation having layout type TempStorage
- int linear_tid) ///< [in] <b>[optional]</b> A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + linear_tid</tt> for 2D thread blocks)
- :
- temp_storage(temp_storage.Alias()),
- linear_tid(linear_tid)
- {}
-
-
-
- //@} end member group
- /******************************************************************//**
- * \name Exclusive prefix sum operations
- *********************************************************************/
- //@{
-
-
- /**
- * \brief Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element.
- *
- * \blocked
- *
- * \smemreuse
- *
- * The code snippet below illustrates an exclusive prefix sum of 128 integer items that
- * are partitioned across 128 threads.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize BlockScan for 128 threads on type int
- * typedef cub::BlockScan<int, 128> BlockScan;
- *
- * // Allocate shared memory for BlockScan
- * __shared__ typename BlockScan::TempStorage temp_storage;
- *
- * // Obtain input item for each thread
- * int thread_data;
- * ...
- *
- * // Collectively compute the block-wide exclusive prefix sum
- * BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data);
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is <tt>1, 1, ..., 1</tt>. The
- * corresponding output \p thread_data in those threads will be <tt>0, 1, ..., 127</tt>.
- *
- */
- __device__ __forceinline__ void ExclusiveSum(
- T input, ///< [in] Calling thread's input item
- T &output) ///< [out] Calling thread's output item (may be aliased to \p input)
- {
- T block_aggregate;
- InternalBlockScan(temp_storage, linear_tid).ExclusiveSum(input, output, block_aggregate);
- }
-
-
- /**
- * \brief Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- *
- * \blocked
- *
- * \smemreuse
- *
- * The code snippet below illustrates an exclusive prefix sum of 128 integer items that
- * are partitioned across 128 threads.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize BlockScan for 128 threads on type int
- * typedef cub::BlockScan<int, 128> BlockScan;
- *
- * // Allocate shared memory for BlockScan
- * __shared__ typename BlockScan::TempStorage temp_storage;
- *
- * // Obtain input item for each thread
- * int thread_data;
- * ...
- *
- * // Collectively compute the block-wide exclusive prefix sum
- * int block_aggregate;
- * BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data, block_aggregate);
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is <tt>1, 1, ..., 1</tt>. The
- * corresponding output \p thread_data in those threads will be <tt>0, 1, ..., 127</tt>.
- * Furthermore the value \p 128 will be stored in \p block_aggregate for all threads.
- *
- */
- __device__ __forceinline__ void ExclusiveSum(
- T input, ///< [in] Calling thread's input item
- T &output, ///< [out] Calling thread's output item (may be aliased to \p input)
- T &block_aggregate) ///< [out] block-wide aggregate reduction of input items
- {
- InternalBlockScan(temp_storage, linear_tid).ExclusiveSum(input, output, block_aggregate);
- }
-
-
- /**
- * \brief Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Instead of using 0 as the block-wide prefix, the call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by <em>lane</em><sub>0</sub> in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- *
- * The \p block_prefix_op functor must implement a member function <tt>T operator()(T block_aggregate)</tt>.
- * The functor's input parameter \p block_aggregate is the same value also returned by the scan operation.
- * The functor will be invoked by the first warp of threads in the block; however, only the return value from
- * <em>lane</em><sub>0</sub> is applied as the block-wide prefix. Can be stateful.
- *
- * \blocked
- *
- * \smemreuse
- *
- * The code snippet below illustrates a single thread block that progressively
- * computes an exclusive prefix sum over multiple "tiles" of input using a
- * prefix functor to maintain a running total between block-wide scans. Each tile consists
- * of 128 integer items that are partitioned across 128 threads.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * // A stateful callback functor that maintains a running prefix to be applied
- * // during consecutive scan operations.
- * struct BlockPrefixOp
- * {
- * // Running prefix
- * int running_total;
- *
- * // Constructor
- * __device__ BlockPrefixOp(int running_total) : running_total(running_total) {}
- *
- * // Callback operator to be entered by the first warp of threads in the block.
- * // Thread-0 is responsible for returning a value for seeding the block-wide scan.
- * __device__ int operator()(int block_aggregate)
- * {
- * int old_prefix = running_total;
- * running_total += block_aggregate;
- * return old_prefix;
- * }
- * };
- *
- * __global__ void ExampleKernel(int *d_data, int num_items, ...)
- * {
- * // Specialize BlockScan for 128 threads
- * typedef cub::BlockScan<int, 128> BlockScan;
- *
- * // Allocate shared memory for BlockScan
- * __shared__ typename BlockScan::TempStorage temp_storage;
- *
- * // Initialize running total
- * BlockPrefixOp prefix_op(0);
- *
- * // Have the block iterate over segments of items
- * for (int block_offset = 0; block_offset < num_items; block_offset += 128)
- * {
- * // Load a segment of consecutive items that are blocked across threads
- * int thread_data = d_data[block_offset + threadIdx.x];
- *
- * // Collectively compute the block-wide exclusive prefix sum
- * int block_aggregate;
- * BlockScan(temp_storage).ExclusiveSum(
- * thread_data, thread_data, block_aggregate, prefix_op);
- * __syncthreads();
- *
- * // Store scanned items to output segment
- * d_data[block_offset + threadIdx.x] = thread_data;
- * }
- * \endcode
- * \par
- * Suppose the input \p d_data is <tt>1, 1, 1, 1, 1, 1, 1, 1, ...</tt>.
- * The corresponding output for the first segment will be <tt>0, 1, ..., 127</tt>.
- * The output for the second segment will be <tt>128, 129, ..., 255</tt>. Furthermore,
- * the value \p 128 will be stored in \p block_aggregate for all threads after each scan.
- *
- * \tparam BlockPrefixOp <b>[inferred]</b> Call-back functor type having member <tt>T operator()(T block_aggregate)</tt>
- */
- template <typename BlockPrefixOp>
- __device__ __forceinline__ void ExclusiveSum(
- T input, ///< [in] Calling thread's input item
- T &output, ///< [out] Calling thread's output item (may be aliased to \p input)
- T &block_aggregate, ///< [out] block-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value)
- BlockPrefixOp &block_prefix_op) ///< [in-out] <b>[<em>warp</em><sub>0</sub> only]</b> Call-back functor for specifying a block-wide prefix to be applied to all inputs.
- {
- InternalBlockScan(temp_storage, linear_tid).ExclusiveSum(input, output, block_aggregate, block_prefix_op);
- }
-
-
- //@} end member group
- /******************************************************************//**
- * \name Exclusive prefix sum operations (multiple data per thread)
- *********************************************************************/
- //@{
-
-
- /**
- * \brief Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements.
- *
- * \blocked
- *
- * \smemreuse
- *
- * The code snippet below illustrates an exclusive prefix sum of 512 integer items that
- * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec4) across 128 threads
- * where each thread owns 4 consecutive items.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize BlockScan for 128 threads on type int
- * typedef cub::BlockScan<int, 128> BlockScan;
- *
- * // Allocate shared memory for BlockScan
- * __shared__ typename BlockScan::TempStorage temp_storage;
- *
- * // Obtain a segment of consecutive items that are blocked across threads
- * int thread_data[4];
- * ...
- *
- * // Collectively compute the block-wide exclusive prefix sum
- * BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data);
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is <tt>{ [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }</tt>. The
- * corresponding output \p thread_data in those threads will be <tt>{ [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }</tt>.
- *
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- */
- template <int ITEMS_PER_THREAD>
- __device__ __forceinline__ void ExclusiveSum(
- T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items
- T (&output)[ITEMS_PER_THREAD]) ///< [out] Calling thread's output items (may be aliased to \p input)
- {
- // Reduce consecutive thread items in registers
- Sum scan_op;
- T thread_partial = ThreadReduce(input, scan_op);
-
- // Exclusive threadblock-scan
- ExclusiveSum(thread_partial, thread_partial);
-
- // Exclusive scan in registers with prefix
- ThreadScanExclusive(input, output, scan_op, thread_partial);
- }
-
-
- /**
- * \brief Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- *
- * \blocked
- *
- * \smemreuse
- *
- * The code snippet below illustrates an exclusive prefix sum of 512 integer items that
- * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec4) across 128 threads
- * where each thread owns 4 consecutive items.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize BlockScan for 128 threads on type int
- * typedef cub::BlockScan<int, 128> BlockScan;
- *
- * // Allocate shared memory for BlockScan
- * __shared__ typename BlockScan::TempStorage temp_storage;
- *
- * // Obtain a segment of consecutive items that are blocked across threads
- * int thread_data[4];
- * ...
- *
- * // Collectively compute the block-wide exclusive prefix sum
- * int block_aggregate;
- * BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data, block_aggregate);
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is <tt>{ [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }</tt>. The
- * corresponding output \p thread_data in those threads will be <tt>{ [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }</tt>.
- * Furthermore the value \p 512 will be stored in \p block_aggregate for all threads.
- *
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- */
- template <int ITEMS_PER_THREAD>
- __device__ __forceinline__ void ExclusiveSum(
- T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items
- T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input)
- T &block_aggregate) ///< [out] block-wide aggregate reduction of input items
- {
- // Reduce consecutive thread items in registers
- Sum scan_op;
- T thread_partial = ThreadReduce(input, scan_op);
-
- // Exclusive threadblock-scan
- ExclusiveSum(thread_partial, thread_partial, block_aggregate);
-
- // Exclusive scan in registers with prefix
- ThreadScanExclusive(input, output, scan_op, thread_partial);
- }
-
-
- /**
- * \brief Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Instead of using 0 as the block-wide prefix, the call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by <em>lane</em><sub>0</sub> in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- *
- * The \p block_prefix_op functor must implement a member function <tt>T operator()(T block_aggregate)</tt>.
- * The functor's input parameter \p block_aggregate is the same value also returned by the scan operation.
- * The functor will be invoked by the first warp of threads in the block; however, only the return value from
- * <em>lane</em><sub>0</sub> is applied as the block-wide prefix. The functor can be stateful.
- *
- * \blocked
- *
- * \smemreuse
- *
- * The code snippet below illustrates a single thread block that progressively
- * computes an exclusive prefix sum over multiple "tiles" of input using a
- * prefix functor to maintain a running total between block-wide scans. Each tile consists
- * of 512 integer items that are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec4)
- * across 128 threads where each thread owns 4 consecutive items.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * // A stateful callback functor that maintains a running prefix to be applied
- * // during consecutive scan operations.
- * struct BlockPrefixOp
- * {
- * // Running prefix
- * int running_total;
- *
- * // Constructor
- * __device__ BlockPrefixOp(int running_total) : running_total(running_total) {}
- *
- * // Callback operator to be entered by the first warp of threads in the block.
- * // Thread-0 is responsible for returning a value for seeding the block-wide scan.
- * __device__ int operator()(int block_aggregate)
- * {
- * int old_prefix = running_total;
- * running_total += block_aggregate;
- * return old_prefix;
- * }
- * };
- *
- * __global__ void ExampleKernel(int *d_data, int num_items, ...)
- * {
- * // Specialize BlockLoad, BlockStore, and BlockScan for 128 threads, 4 ints per thread
- * typedef cub::BlockLoad<int*, 128, 4, BLOCK_LOAD_TRANSPOSE> BlockLoad;
- * typedef cub::BlockStore<int*, 128, 4, BLOCK_STORE_TRANSPOSE> BlockStore;
- * typedef cub::BlockScan<int, 128> BlockScan;
- *
- * // Allocate aliased shared memory for BlockLoad, BlockStore, and BlockScan
- * __shared__ union {
- * typename BlockLoad::TempStorage load;
- * typename BlockScan::TempStorage scan;
- * typename BlockStore::TempStorage store;
- * } temp_storage;
- *
- * // Initialize running total
- * BlockPrefixOp prefix_op(0);
- *
- * // Have the block iterate over segments of items
- * for (int block_offset = 0; block_offset < num_items; block_offset += 128 * 4)
- * {
- * // Load a segment of consecutive items that are blocked across threads
- * int thread_data[4];
- * BlockLoad(temp_storage.load).Load(d_data + block_offset, thread_data);
- * __syncthreads();
- *
- * // Collectively compute the block-wide exclusive prefix sum
- * int block_aggregate;
- * BlockScan(temp_storage.scan).ExclusiveSum(
- * thread_data, thread_data, block_aggregate, prefix_op);
- * __syncthreads();
- *
- * // Store scanned items to output segment
- * BlockStore(temp_storage.store).Store(d_data + block_offset, thread_data);
- * __syncthreads();
- * }
- * }
- * \endcode
- * \par
- * Suppose the input \p d_data is <tt>1, 1, 1, 1, 1, 1, 1, 1, ...</tt>.
- * The corresponding output for the first segment will be <tt>0, 1, 2, 3, ..., 510, 511</tt>.
- * The output for the second segment will be <tt>512, 513, 514, 515, ..., 1022, 1023</tt>. Furthermore,
- * the value \p 512 will be stored in \p block_aggregate for all threads after each scan.
- *
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- * \tparam BlockPrefixOp <b>[inferred]</b> Call-back functor type having member <tt>T operator()(T block_aggregate)</tt>
- */
- template <
- int ITEMS_PER_THREAD,
- typename BlockPrefixOp>
- __device__ __forceinline__ void ExclusiveSum(
- T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items
- T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input)
- T &block_aggregate, ///< [out] block-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value)
- BlockPrefixOp &block_prefix_op) ///< [in-out] <b>[<em>warp</em><sub>0</sub> only]</b> Call-back functor for specifying a block-wide prefix to be applied to all inputs.
- {
- // Reduce consecutive thread items in registers
- Sum scan_op;
- T thread_partial = ThreadReduce(input, scan_op);
-
- // Exclusive threadblock-scan
- ExclusiveSum(thread_partial, thread_partial, block_aggregate, block_prefix_op);
-
- // Exclusive scan in registers with prefix
- ThreadScanExclusive(input, output, scan_op, thread_partial);
- }
-
-
-
- //@} end member group // Exclusive prefix sums
- /******************************************************************//**
- * \name Exclusive prefix scan operations
- *********************************************************************/
- //@{
-
-
- /**
- * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element.
- *
- * Supports non-commutative scan operators.
- *
- * \blocked
- *
- * \smemreuse
- *
- * The code snippet below illustrates an exclusive prefix max scan of 128 integer items that
- * are partitioned across 128 threads.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize BlockScan for 128 threads on type int
- * typedef cub::BlockScan<int, 128> BlockScan;
- *
- * // Allocate shared memory for BlockScan
- * __shared__ typename BlockScan::TempStorage temp_storage;
- *
- * // Obtain input item for each thread
- * int thread_data;
- * ...
- *
- * // Collectively compute the block-wide exclusive prefix max scan
- * BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, INT_MIN, cub::Max());
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is <tt>0, -1, 2, -3, ..., 126, -127</tt>. The
- * corresponding output \p thread_data in those threads will be <tt>INT_MIN, 0, 0, 2, ..., 124, 126</tt>.
- *
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
- template <typename ScanOp>
- __device__ __forceinline__ void ExclusiveScan(
- T input, ///< [in] Calling thread's input item
- T &output, ///< [out] Calling thread's output item (may be aliased to \p input)
- T identity, ///< [in] Identity value
- ScanOp scan_op) ///< [in] Binary scan operator
- {
- T block_aggregate;
- InternalBlockScan(temp_storage, linear_tid).ExclusiveScan(input, output, identity, scan_op, block_aggregate);
- }
-
-
- /**
- * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- *
- * Supports non-commutative scan operators.
- *
- * \blocked
- *
- * \smemreuse
- *
- * The code snippet below illustrates an exclusive prefix max scan of 128 integer items that
- * are partitioned across 128 threads.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize BlockScan for 128 threads on type int
- * typedef cub::BlockScan<int, 128> BlockScan;
- *
- * // Allocate shared memory for BlockScan
- * __shared__ typename BlockScan::TempStorage temp_storage;
- *
- * // Obtain input item for each thread
- * int thread_data;
- * ...
- *
- * // Collectively compute the block-wide exclusive prefix max scan
- * int block_aggregate;
- * BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, INT_MIN, cub::Max(), block_aggregate);
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is <tt>0, -1, 2, -3, ..., 126, -127</tt>. The
- * corresponding output \p thread_data in those threads will be <tt>INT_MIN, 0, 0, 2, ..., 124, 126</tt>.
- * Furthermore the value \p 126 will be stored in \p block_aggregate for all threads.
- *
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
- template <typename ScanOp>
- __device__ __forceinline__ void ExclusiveScan(
- T input, ///< [in] Calling thread's input item
- T &output, ///< [out] Calling thread's output item (may be aliased to \p input)
- const T &identity, ///< [in] Identity value
- ScanOp scan_op, ///< [in] Binary scan operator
- T &block_aggregate) ///< [out] block-wide aggregate reduction of input items
- {
- InternalBlockScan(temp_storage, linear_tid).ExclusiveScan(input, output, identity, scan_op, block_aggregate);
- }
-
-
- /**
- * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element. The call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by <em>lane</em><sub>0</sub> in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- *
- * The \p block_prefix_op functor must implement a member function <tt>T operator()(T block_aggregate)</tt>.
- * The functor's input parameter \p block_aggregate is the same value also returned by the scan operation.
- * The functor will be invoked by the first warp of threads in the block; however, only the return value from
- * <em>lane</em><sub>0</sub> is applied as the block-wide prefix. The functor can be stateful.
- *
- * Supports non-commutative scan operators.
- *
- * \blocked
- *
- * \smemreuse
- *
- * The code snippet below illustrates a single thread block that progressively
- * computes an exclusive prefix max scan over multiple "tiles" of input using a
- * prefix functor to maintain a running total between block-wide scans. Each tile consists
- * of 128 integer items that are partitioned across 128 threads.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * // A stateful callback functor that maintains a running prefix to be applied
- * // during consecutive scan operations.
- * struct BlockPrefixOp
- * {
- * // Running prefix
- * int running_total;
- *
- * // Constructor
- * __device__ BlockPrefixOp(int running_total) : running_total(running_total) {}
- *
- * // Callback operator to be entered by the first warp of threads in the block.
- * // Thread-0 is responsible for returning a value for seeding the block-wide scan.
- * __device__ int operator()(int block_aggregate)
- * {
- * int old_prefix = running_total;
- * running_total = (block_aggregate > old_prefix) ? block_aggregate : old_prefix;
- * return old_prefix;
- * }
- * };
- *
- * __global__ void ExampleKernel(int *d_data, int num_items, ...)
- * {
- * // Specialize BlockScan for 128 threads
- * typedef cub::BlockScan<int, 128> BlockScan;
- *
- * // Allocate shared memory for BlockScan
- * __shared__ typename BlockScan::TempStorage temp_storage;
- *
- * // Initialize running total
- * BlockPrefixOp prefix_op(INT_MIN);
- *
- * // Have the block iterate over segments of items
- * for (int block_offset = 0; block_offset < num_items; block_offset += 128)
- * {
- * // Load a segment of consecutive items that are blocked across threads
- * int thread_data = d_data[block_offset + threadIdx.x];
- *
- * // Collectively compute the block-wide exclusive prefix max scan
- * int block_aggregate;
- * BlockScan(temp_storage).ExclusiveScan(
- * thread_data, thread_data, INT_MIN, cub::Max(), block_aggregate, prefix_op);
- * __syncthreads();
- *
- * // Store scanned items to output segment
- * d_data[block_offset + threadIdx.x] = thread_data;
- * }
- * }
- * \endcode
- * \par
- * Suppose the input \p d_data is <tt>0, -1, 2, -3, 4, -5, ...</tt>.
- * The corresponding output for the first segment will be <tt>INT_MIN, 0, 0, 2, ..., 124, 126</tt>.
- * The output for the second segment will be <tt>126, 128, 128, 130, ..., 252, 254</tt>. Furthermore,
- * \p block_aggregate will be assigned \p 126 in all threads after the first scan, assigned \p 254 after the second
- * scan, etc.
- *
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- * \tparam BlockPrefixOp <b>[inferred]</b> Call-back functor type having member <tt>T operator()(T block_aggregate)</tt>
- */
- template <
- typename ScanOp,
- typename BlockPrefixOp>
- __device__ __forceinline__ void ExclusiveScan(
- T input, ///< [in] Calling thread's input item
- T &output, ///< [out] Calling thread's output item (may be aliased to \p input)
- T identity, ///< [in] Identity value
- ScanOp scan_op, ///< [in] Binary scan operator
- T &block_aggregate, ///< [out] block-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value)
- BlockPrefixOp &block_prefix_op) ///< [in-out] <b>[<em>warp</em><sub>0</sub> only]</b> Call-back functor for specifying a block-wide prefix to be applied to all inputs.
- {
- InternalBlockScan(temp_storage, linear_tid).ExclusiveScan(input, output, identity, scan_op, block_aggregate, block_prefix_op);
- }
-
-
- //@} end member group // Exclusive prefix scans
- /******************************************************************//**
- * \name Exclusive prefix scan operations (multiple data per thread)
- *********************************************************************/
- //@{
-
-
- /**
- * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes an array of consecutive input elements.
- *
- * Supports non-commutative scan operators.
- *
- * \blocked
- *
- * \smemreuse
- *
- * The code snippet below illustrates an exclusive prefix max scan of 512 integer items that
- * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec4) across 128 threads
- * where each thread owns 4 consecutive items.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize BlockScan for 128 threads on type int
- * typedef cub::BlockScan<int, 128> BlockScan;
- *
- * // Allocate shared memory for BlockScan
- * __shared__ typename BlockScan::TempStorage temp_storage;
- *
- * // Obtain a segment of consecutive items that are blocked across threads
- * int thread_data[4];
- * ...
- *
- * // Collectively compute the block-wide exclusive prefix max scan
- * BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, INT_MIN, cub::Max());
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is
- * <tt>{ [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }</tt>.
- * The corresponding output \p thread_data in those threads will be
- * <tt>{ [INT_MIN,0,0,2], [2,4,4,6], ..., [506,508,508,510] }</tt>.
- *
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
- template <
- int ITEMS_PER_THREAD,
- typename ScanOp>
- __device__ __forceinline__ void ExclusiveScan(
- T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items
- T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input)
- const T &identity, ///< [in] Identity value
- ScanOp scan_op) ///< [in] Binary scan operator
- {
- // Reduce consecutive thread items in registers
- T thread_partial = ThreadReduce(input, scan_op);
-
- // Exclusive threadblock-scan
- ExclusiveScan(thread_partial, thread_partial, identity, scan_op);
-
- // Exclusive scan in registers with prefix
- ThreadScanExclusive(input, output, scan_op, thread_partial);
- }
-
-
- /**
- * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- *
- * Supports non-commutative scan operators.
- *
- * \blocked
- *
- * \smemreuse
- *
- * The code snippet below illustrates an exclusive prefix max scan of 512 integer items that
- * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec4) across 128 threads
- * where each thread owns 4 consecutive items.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize BlockScan for 128 threads on type int
- * typedef cub::BlockScan<int, 128> BlockScan;
- *
- * // Allocate shared memory for BlockScan
- * __shared__ typename BlockScan::TempStorage temp_storage;
- *
- * // Obtain a segment of consecutive items that are blocked across threads
- * int thread_data[4];
- * ...
- *
- * // Collectively compute the block-wide exclusive prefix max scan
- * int block_aggregate;
- * BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, INT_MIN, cub::Max(), block_aggregate);
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is <tt>{ [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }</tt>. The
- * corresponding output \p thread_data in those threads will be <tt>{ [INT_MIN,0,0,2], [2,4,4,6], ..., [506,508,508,510] }</tt>.
- * Furthermore the value \p 510 will be stored in \p block_aggregate for all threads.
- *
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
- template <
- int ITEMS_PER_THREAD,
- typename ScanOp>
- __device__ __forceinline__ void ExclusiveScan(
- T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items
- T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input)
- const T &identity, ///< [in] Identity value
- ScanOp scan_op, ///< [in] Binary scan operator
- T &block_aggregate) ///< [out] block-wide aggregate reduction of input items
- {
- // Reduce consecutive thread items in registers
- T thread_partial = ThreadReduce(input, scan_op);
-
- // Exclusive threadblock-scan
- ExclusiveScan(thread_partial, thread_partial, identity, scan_op, block_aggregate);
-
- // Exclusive scan in registers with prefix
- ThreadScanExclusive(input, output, scan_op, thread_partial);
- }
-
-
- /**
- * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes an array of consecutive input elements. The call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by <em>lane</em><sub>0</sub> in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- *
- * The \p block_prefix_op functor must implement a member function <tt>T operator()(T block_aggregate)</tt>.
- * The functor's input parameter \p block_aggregate is the same value also returned by the scan operation.
- * The functor will be invoked by the first warp of threads in the block; however, only the return value from
- * <em>lane</em><sub>0</sub> is applied as the block-wide prefix. The functor can be stateful.
- *
- * Supports non-commutative scan operators.
- *
- * \blocked
- *
- * \smemreuse
- *
- * The code snippet below illustrates a single thread block that progressively
- * computes an exclusive prefix max scan over multiple "tiles" of input using a
- * prefix functor to maintain a running total between block-wide scans. Each tile consists
- * of 128 integer items that are partitioned across 128 threads.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * // A stateful callback functor that maintains a running prefix to be applied
- * // during consecutive scan operations.
- * struct BlockPrefixOp
- * {
- * // Running prefix
- * int running_total;
- *
- * // Constructor
- * __device__ BlockPrefixOp(int running_total) : running_total(running_total) {}
- *
- * // Callback operator to be entered by the first warp of threads in the block.
- * // Thread-0 is responsible for returning a value for seeding the block-wide scan.
- * __device__ int operator()(int block_aggregate)
- * {
- * int old_prefix = running_total;
- * running_total = (block_aggregate > old_prefix) ? block_aggregate : old_prefix;
- * return old_prefix;
- * }
- * };
- *
- * __global__ void ExampleKernel(int *d_data, int num_items, ...)
- * {
- * // Specialize BlockLoad, BlockStore, and BlockScan for 128 threads, 4 ints per thread
- * typedef cub::BlockLoad<int*, 128, 4, BLOCK_LOAD_TRANSPOSE> BlockLoad;
- * typedef cub::BlockStore<int*, 128, 4, BLOCK_STORE_TRANSPOSE> BlockStore;
- * typedef cub::BlockScan<int, 128> BlockScan;
- *
- * // Allocate aliased shared memory for BlockLoad, BlockStore, and BlockScan
- * __shared__ union {
- * typename BlockLoad::TempStorage load;
- * typename BlockScan::TempStorage scan;
- * typename BlockStore::TempStorage store;
- * } temp_storage;
- *
- * // Initialize running total
- * BlockPrefixOp prefix_op(0);
- *
- * // Have the block iterate over segments of items
- * for (int block_offset = 0; block_offset < num_items; block_offset += 128 * 4)
- * {
- * // Load a segment of consecutive items that are blocked across threads
- * int thread_data[4];
- * BlockLoad(temp_storage.load).Load(d_data + block_offset, thread_data);
- * __syncthreads();
- *
- * // Collectively compute the block-wide exclusive prefix max scan
- * int block_aggregate;
- * BlockScan(temp_storage.scan).ExclusiveScan(
- * thread_data, thread_data, INT_MIN, cub::Max(), block_aggregate, prefix_op);
- * __syncthreads();
- *
- * // Store scanned items to output segment
- * BlockStore(temp_storage.store).Store(d_data + block_offset, thread_data);
- * __syncthreads();
- * }
- * }
- * \endcode
- * \par
- * Suppose the input \p d_data is <tt>0, -1, 2, -3, 4, -5, ...</tt>.
- * The corresponding output for the first segment will be <tt>INT_MIN, 0, 0, 2, 2, 4, ..., 508, 510</tt>.
- * The output for the second segment will be <tt>510, 512, 512, 514, 514, 516, ..., 1020, 1022</tt>. Furthermore,
- * \p block_aggregate will be assigned \p 510 in all threads after the first scan, assigned \p 1022 after the second
- * scan, etc.
- *
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- * \tparam BlockPrefixOp <b>[inferred]</b> Call-back functor type having member <tt>T operator()(T block_aggregate)</tt>
- */
- template <
- int ITEMS_PER_THREAD,
- typename ScanOp,
- typename BlockPrefixOp>
- __device__ __forceinline__ void ExclusiveScan(
- T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items
- T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input)
- T identity, ///< [in] Identity value
- ScanOp scan_op, ///< [in] Binary scan operator
- T &block_aggregate, ///< [out] block-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value)
- BlockPrefixOp &block_prefix_op) ///< [in-out] <b>[<em>warp</em><sub>0</sub> only]</b> Call-back functor for specifying a block-wide prefix to be applied to all inputs.
- {
- // Reduce consecutive thread items in registers
- T thread_partial = ThreadReduce(input, scan_op);
-
- // Exclusive threadblock-scan
- ExclusiveScan(thread_partial, thread_partial, identity, scan_op, block_aggregate, block_prefix_op);
-
- // Exclusive scan in registers with prefix
- ThreadScanExclusive(input, output, scan_op, thread_partial);
- }
-
-
- //@} end member group
-
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
- /******************************************************************//**
- * \name Exclusive prefix scan operations (identityless, single datum per thread)
- *********************************************************************/
- //@{
-
-
- /**
- * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element. With no identity value, the output computed for <em>thread</em><sub>0</sub> is undefined.
- *
- * Supports non-commutative scan operators.
- *
- * \blocked
- *
- * \smemreuse
- *
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
- template <typename ScanOp>
- __device__ __forceinline__ void ExclusiveScan(
- T input, ///< [in] Calling thread's input item
- T &output, ///< [out] Calling thread's output item (may be aliased to \p input)
- ScanOp scan_op) ///< [in] Binary scan operator
- {
- T block_aggregate;
- InternalBlockScan(temp_storage, linear_tid).ExclusiveScan(input, output, scan_op, block_aggregate);
- }
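-
- // A minimal usage sketch for the identityless single-item ExclusiveScan overloads
- // in this group (they are hidden from the generated documentation, so no snippet
- // accompanies them). The kernel name, 128-thread block size, input values, and the
- // cub::Max() operator below are illustrative assumptions, not part of the library.
- //
- //     #include <cub/cub.cuh>
- //
- //     __global__ void ExampleKernel(int *d_out)
- //     {
- //         // Specialize BlockScan for 128 threads on type int
- //         typedef cub::BlockScan<int, 128> BlockScan;
- //
- //         // Allocate shared memory for BlockScan
- //         __shared__ typename BlockScan::TempStorage temp_storage;
- //
- //         // Each thread contributes its thread rank as the input item
- //         int thread_data = threadIdx.x;
- //
- //         // Identityless exclusive max scan; the output computed for thread0 is undefined
- //         BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, cub::Max());
- //
- //         // Write per-thread results (d_out[0] holds an undefined value)
- //         d_out[threadIdx.x] = thread_data;
- //     }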
-
-
- /**
- * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide \p block_aggregate of all inputs. With no identity value, the output computed for <em>thread</em><sub>0</sub> is undefined.
- *
- * Supports non-commutative scan operators.
- *
- * \blocked
- *
- * \smemreuse
- *
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
- template <typename ScanOp>
- __device__ __forceinline__ void ExclusiveScan(
- T input, ///< [in] Calling thread's input item
- T &output, ///< [out] Calling thread's output item (may be aliased to \p input)
- ScanOp scan_op, ///< [in] Binary scan operator
- T &block_aggregate) ///< [out] block-wide aggregate reduction of input items
- {
- InternalBlockScan(temp_storage, linear_tid).ExclusiveScan(input, output, scan_op, block_aggregate);
- }
-
-
- /**
- * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element. The call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by <em>lane</em><sub>0</sub> in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- *
- * The \p block_prefix_op functor must implement a member function <tt>T operator()(T block_aggregate)</tt>.
- * The functor's input parameter \p block_aggregate is the same value also returned by the scan operation.
- * The functor will be invoked by the first warp of threads in the block; however, only the return value from
- * <em>lane</em><sub>0</sub> is applied as the block-wide prefix. The functor can be stateful.
- *
- * Supports non-commutative scan operators.
- *
- * \blocked
- *
- * \smemreuse
- *
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- * \tparam BlockPrefixOp <b>[inferred]</b> Call-back functor type having member <tt>T operator()(T block_aggregate)</tt>
- */
- template <
- typename ScanOp,
- typename BlockPrefixOp>
- __device__ __forceinline__ void ExclusiveScan(
- T input, ///< [in] Calling thread's input item
- T &output, ///< [out] Calling thread's output item (may be aliased to \p input)
- ScanOp scan_op, ///< [in] Binary scan operator
- T &block_aggregate, ///< [out] block-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value)
- BlockPrefixOp &block_prefix_op) ///< [in-out] <b>[<em>warp</em><sub>0</sub> only]</b> Call-back functor for specifying a block-wide prefix to be applied to all inputs.
- {
- InternalBlockScan(temp_storage, linear_tid).ExclusiveScan(input, output, scan_op, block_aggregate, block_prefix_op);
- }
-
-
- //@} end member group
- /******************************************************************//**
- * \name Exclusive prefix scan operations (identityless, multiple data per thread)
- *********************************************************************/
- //@{
-
-
- /**
- * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes an array of consecutive input elements. With no identity value, the output computed for <em>thread</em><sub>0</sub> is undefined.
- *
- * Supports non-commutative scan operators.
- *
- * \blocked
- *
- * \smemreuse
- *
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
- template <
- int ITEMS_PER_THREAD,
- typename ScanOp>
- __device__ __forceinline__ void ExclusiveScan(
- T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items
- T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input)
- ScanOp scan_op) ///< [in] Binary scan operator
- {
- // Reduce consecutive thread items in registers
- T thread_partial = ThreadReduce(input, scan_op);
-
- // Exclusive threadblock-scan
- ExclusiveScan(thread_partial, thread_partial, scan_op);
-
- // Exclusive scan in registers with prefix
- ThreadScanExclusive(input, output, scan_op, thread_partial, (linear_tid != 0));
- }
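-
- // A minimal usage sketch for the identityless multi-item ExclusiveScan overloads
- // in this group (also hidden from the generated documentation). The kernel name,
- // 128-thread block size, 4 items per thread, input values, and the cub::Max()
- // operator below are illustrative assumptions, not part of the library.
- //
- //     #include <cub/cub.cuh>
- //
- //     __global__ void ExampleKernel(int *d_out)
- //     {
- //         // Specialize BlockScan for 128 threads on type int
- //         typedef cub::BlockScan<int, 128> BlockScan;
- //
- //         // Allocate shared memory for BlockScan
- //         __shared__ typename BlockScan::TempStorage temp_storage;
- //
- //         // Each thread owns 4 consecutive items
- //         int thread_data[4];
- //         for (int i = 0; i < 4; ++i)
- //             thread_data[i] = threadIdx.x * 4 + i;
- //
- //         // Identityless exclusive max scan over the blocked arrangement;
- //         // the first output item of thread0 is undefined
- //         BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, cub::Max());
- //
- //         // Write per-thread results (d_out[0] holds an undefined value)
- //         for (int i = 0; i < 4; ++i)
- //             d_out[threadIdx.x * 4 + i] = thread_data[i];
- //     }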
-
-
- /**
- * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide \p block_aggregate of all inputs. With no identity value, the output computed for <em>thread</em><sub>0</sub> is undefined.
- *
- * Supports non-commutative scan operators.
- *
- * \blocked
- *
- * \smemreuse
- *
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
- template <
- int ITEMS_PER_THREAD,
- typename ScanOp>
- __device__ __forceinline__ void ExclusiveScan(
- T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items
- T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input)
- ScanOp scan_op, ///< [in] Binary scan operator
- T &block_aggregate) ///< [out] block-wide aggregate reduction of input items
- {
- // Reduce consecutive thread items in registers
- T thread_partial = ThreadReduce(input, scan_op);
-
- // Exclusive threadblock-scan
- ExclusiveScan(thread_partial, thread_partial, scan_op, block_aggregate);
-
- // Exclusive scan in registers with prefix
- ThreadScanExclusive(input, output, scan_op, thread_partial, (linear_tid != 0));
- }
-
-
- /**
- * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes an array of consecutive input elements. The call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by <em>lane</em><sub>0</sub> in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- *
- * The \p block_prefix_op functor must implement a member function <tt>T operator()(T block_aggregate)</tt>.
- * The functor's input parameter \p block_aggregate is the same value also returned by the scan operation.
- * The functor will be invoked by the first warp of threads in the block; however, only the return value from
- * <em>lane</em><sub>0</sub> is applied as the block-wide prefix. The functor can be stateful.
- *
- * Supports non-commutative scan operators.
- *
- * \blocked
- *
- * \smemreuse
- *
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- * \tparam BlockPrefixOp <b>[inferred]</b> Call-back functor type having member <tt>T operator()(T block_aggregate)</tt>
- */
- template <
- int ITEMS_PER_THREAD,
- typename ScanOp,
- typename BlockPrefixOp>
- __device__ __forceinline__ void ExclusiveScan(
- T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items
- T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input)
- ScanOp scan_op, ///< [in] Binary scan operator
- T &block_aggregate, ///< [out] block-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value)
- BlockPrefixOp &block_prefix_op) ///< [in-out] <b>[<em>warp</em><sub>0</sub> only]</b> Call-back functor for specifying a block-wide prefix to be applied to all inputs.
- {
- // Reduce consecutive thread items in registers
- T thread_partial = ThreadReduce(input, scan_op);
-
- // Exclusive threadblock-scan
- ExclusiveScan(thread_partial, thread_partial, scan_op, block_aggregate, block_prefix_op);
-
- // Exclusive scan in registers with prefix
- ThreadScanExclusive(input, output, scan_op, thread_partial);
- }
-
-
- //@} end member group
-
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
- /******************************************************************//**
- * \name Inclusive prefix sum operations
- *********************************************************************/
- //@{
-
-
- /**
- * \brief Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element.
- *
- * \blocked
- *
- * \smemreuse
- *
- * The code snippet below illustrates an inclusive prefix sum of 128 integer items that
- * are partitioned across 128 threads.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize BlockScan for 128 threads on type int
- * typedef cub::BlockScan<int, 128> BlockScan;
- *
- * // Allocate shared memory for BlockScan
- * __shared__ typename BlockScan::TempStorage temp_storage;
- *
- * // Obtain input item for each thread
- * int thread_data;
- * ...
- *
- * // Collectively compute the block-wide inclusive prefix sum
- * BlockScan(temp_storage).InclusiveSum(thread_data, thread_data);
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is <tt>1, 1, ..., 1</tt>. The
- * corresponding output \p thread_data in those threads will be <tt>1, 2, ..., 128</tt>.
- *
- */
- __device__ __forceinline__ void InclusiveSum(
- T input, ///< [in] Calling thread's input item
- T &output) ///< [out] Calling thread's output item (may be aliased to \p input)
- {
- T block_aggregate;
- InternalBlockScan(temp_storage, linear_tid).InclusiveSum(input, output, block_aggregate);
- }
-
-
- /**
- * \brief Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- *
- * \blocked
- *
- * \smemreuse
- *
- * The code snippet below illustrates an inclusive prefix sum of 128 integer items that
- * are partitioned across 128 threads.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize BlockScan for 128 threads on type int
- * typedef cub::BlockScan<int, 128> BlockScan;
- *
- * // Allocate shared memory for BlockScan
- * __shared__ typename BlockScan::TempStorage temp_storage;
- *
- * // Obtain input item for each thread
- * int thread_data;
- * ...
- *
- * // Collectively compute the block-wide inclusive prefix sum
- * int block_aggregate;
- * BlockScan(temp_storage).InclusiveSum(thread_data, thread_data, block_aggregate);
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is <tt>1, 1, ..., 1</tt>. The
- * corresponding output \p thread_data in those threads will be <tt>1, 2, ..., 128</tt>.
- * Furthermore the value \p 128 will be stored in \p block_aggregate for all threads.
- *
- */
- __device__ __forceinline__ void InclusiveSum(
- T input, ///< [in] Calling thread's input item
- T &output, ///< [out] Calling thread's output item (may be aliased to \p input)
- T &block_aggregate) ///< [out] block-wide aggregate reduction of input items
- {
- InternalBlockScan(temp_storage, linear_tid).InclusiveSum(input, output, block_aggregate);
- }
-
-
-
- /**
- * \brief Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Instead of using 0 as the block-wide prefix, the call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by <em>lane</em><sub>0</sub> in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- *
- * The \p block_prefix_op functor must implement a member function <tt>T operator()(T block_aggregate)</tt>.
- * The functor's input parameter \p block_aggregate is the same value also returned by the scan operation.
- * The functor will be invoked by the first warp of threads in the block; however, only the return value from
- * <em>lane</em><sub>0</sub> is applied as the block-wide prefix. The functor can be stateful.
- *
- * \blocked
- *
- * \smemreuse
- *
- * The code snippet below illustrates a single thread block that progressively
- * computes an inclusive prefix sum over multiple "tiles" of input using a
- * prefix functor to maintain a running total between block-wide scans. Each tile consists
- * of 128 integer items that are partitioned across 128 threads.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * // A stateful callback functor that maintains a running prefix to be applied
- * // during consecutive scan operations.
- * struct BlockPrefixOp
- * {
- * // Running prefix
- * int running_total;
- *
- * // Constructor
- * __device__ BlockPrefixOp(int running_total) : running_total(running_total) {}
- *
- * // Callback operator to be entered by the first warp of threads in the block.
- * // Thread-0 is responsible for returning a value for seeding the block-wide scan.
- * __device__ int operator()(int block_aggregate)
- * {
- * int old_prefix = running_total;
- * running_total += block_aggregate;
- * return old_prefix;
- * }
- * };
- *
- * __global__ void ExampleKernel(int *d_data, int num_items, ...)
- * {
- * // Specialize BlockScan for 128 threads
- * typedef cub::BlockScan<int, 128> BlockScan;
- *
- * // Allocate shared memory for BlockScan
- * __shared__ typename BlockScan::TempStorage temp_storage;
- *
- * // Initialize running total
- * BlockPrefixOp prefix_op(0);
- *
- * // Have the block iterate over segments of items
- * for (int block_offset = 0; block_offset < num_items; block_offset += 128)
- * {
- * // Load a segment of consecutive items that are blocked across threads
- * int thread_data = d_data[block_offset + threadIdx.x];
- *
- * // Collectively compute the block-wide inclusive prefix sum
- * int block_aggregate;
- * BlockScan(temp_storage).InclusiveSum(
- * thread_data, thread_data, block_aggregate, prefix_op);
- * __syncthreads();
- *
- * // Store scanned items to output segment
- * d_data[block_offset + threadIdx.x] = thread_data;
- * }
- * }
- * \endcode
- * \par
- * Suppose the input \p d_data is <tt>1, 1, 1, 1, 1, 1, 1, 1, ...</tt>.
- * The corresponding output for the first segment will be <tt>1, 2, ..., 128</tt>.
- * The output for the second segment will be <tt>129, 130, ..., 256</tt>. Furthermore,
- * the value \p 128 will be stored in \p block_aggregate for all threads after each scan.
- *
- * \tparam BlockPrefixOp <b>[inferred]</b> Call-back functor type having member <tt>T operator()(T block_aggregate)</tt>
- */
- template <typename BlockPrefixOp>
- __device__ __forceinline__ void InclusiveSum(
- T input, ///< [in] Calling thread's input item
- T &output, ///< [out] Calling thread's output item (may be aliased to \p input)
- T &block_aggregate, ///< [out] block-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value)
- BlockPrefixOp &block_prefix_op) ///< [in-out] <b>[<em>warp</em><sub>0</sub> only]</b> Call-back functor for specifying a block-wide prefix to be applied to all inputs.
- {
- InternalBlockScan(temp_storage, linear_tid).InclusiveSum(input, output, block_aggregate, block_prefix_op);
- }
-
-
- //@} end member group
- /******************************************************************//**
- * \name Inclusive prefix sum operations (multiple data per thread)
- *********************************************************************/
- //@{
-
-
- /**
- * \brief Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements.
- *
- * \blocked
- *
- * \smemreuse
- *
- * The code snippet below illustrates an inclusive prefix sum of 512 integer items that
- * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec4) across 128 threads
- * where each thread owns 4 consecutive items.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize BlockScan for 128 threads on type int
- * typedef cub::BlockScan<int, 128> BlockScan;
- *
- * // Allocate shared memory for BlockScan
- * __shared__ typename BlockScan::TempStorage temp_storage;
- *
- * // Obtain a segment of consecutive items that are blocked across threads
- * int thread_data[4];
- * ...
- *
- * // Collectively compute the block-wide inclusive prefix sum
- * BlockScan(temp_storage).InclusiveSum(thread_data, thread_data);
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is <tt>{ [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }</tt>. The
- * corresponding output \p thread_data in those threads will be <tt>{ [1,2,3,4], [5,6,7,8], ..., [509,510,511,512] }</tt>.
- *
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- */
- template <int ITEMS_PER_THREAD>
- __device__ __forceinline__ void InclusiveSum(
- T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items
- T (&output)[ITEMS_PER_THREAD]) ///< [out] Calling thread's output items (may be aliased to \p input)
- {
- if (ITEMS_PER_THREAD == 1)
- {
- InclusiveSum(input[0], output[0]);
- }
- else
- {
- // Reduce consecutive thread items in registers
- Sum scan_op;
- T thread_partial = ThreadReduce(input, scan_op);
-
- // Exclusive threadblock-scan
- ExclusiveSum(thread_partial, thread_partial);
-
- // Inclusive scan in registers with prefix
- ThreadScanInclusive(input, output, scan_op, thread_partial, (linear_tid != 0));
- }
- }
-
-
- /**
- * \brief Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- *
- * \blocked
- *
- * \smemreuse
- *
- * The code snippet below illustrates an inclusive prefix sum of 512 integer items that
- * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec4) across 128 threads
- * where each thread owns 4 consecutive items.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize BlockScan for 128 threads on type int
- * typedef cub::BlockScan<int, 128> BlockScan;
- *
- * // Allocate shared memory for BlockScan
- * __shared__ typename BlockScan::TempStorage temp_storage;
- *
- * // Obtain a segment of consecutive items that are blocked across threads
- * int thread_data[4];
- * ...
- *
- * // Collectively compute the block-wide inclusive prefix sum
- * int block_aggregate;
- * BlockScan(temp_storage).InclusiveSum(thread_data, thread_data, block_aggregate);
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is
- * <tt>{ [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }</tt>. The
- * corresponding output \p thread_data in those threads will be
- * <tt>{ [1,2,3,4], [5,6,7,8], ..., [509,510,511,512] }</tt>.
- * Furthermore the value \p 512 will be stored in \p block_aggregate for all threads.
- *
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- */
- template <int ITEMS_PER_THREAD>
- __device__ __forceinline__ void InclusiveSum(
- T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items
- T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input)
- T &block_aggregate) ///< [out] block-wide aggregate reduction of input items
- {
- if (ITEMS_PER_THREAD == 1)
- {
- InclusiveSum(input[0], output[0], block_aggregate);
- }
- else
- {
- // Reduce consecutive thread items in registers
- Sum scan_op;
- T thread_partial = ThreadReduce(input, scan_op);
-
- // Exclusive threadblock-scan
- ExclusiveSum(thread_partial, thread_partial, block_aggregate);
-
- // Inclusive scan in registers with prefix
- ThreadScanInclusive(input, output, scan_op, thread_partial, (linear_tid != 0));
- }
- }
-
-
- /**
- * \brief Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Instead of using 0 as the block-wide prefix, the call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by <em>lane</em><sub>0</sub> in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- *
- * The \p block_prefix_op functor must implement a member function <tt>T operator()(T block_aggregate)</tt>.
- * The functor's input parameter \p block_aggregate is the same value also returned by the scan operation.
- * The functor will be invoked by the first warp of threads in the block; however, only the return value from
- * <em>lane</em><sub>0</sub> is applied as the block-wide prefix. The functor can be stateful.
- *
- * \blocked
- *
- * \smemreuse
- *
- * The code snippet below illustrates a single thread block that progressively
- * computes an inclusive prefix sum over multiple "tiles" of input using a
- * prefix functor to maintain a running total between block-wide scans. Each tile consists
- * of 512 integer items that are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec4)
- * across 128 threads where each thread owns 4 consecutive items.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * // A stateful callback functor that maintains a running prefix to be applied
- * // during consecutive scan operations.
- * struct BlockPrefixOp
- * {
- * // Running prefix
- * int running_total;
- *
- * // Constructor
- * __device__ BlockPrefixOp(int running_total) : running_total(running_total) {}
- *
- * // Callback operator to be entered by the first warp of threads in the block.
- * // Thread-0 is responsible for returning a value for seeding the block-wide scan.
- * __device__ int operator()(int block_aggregate)
- * {
- * int old_prefix = running_total;
- * running_total += block_aggregate;
- * return old_prefix;
- * }
- * };
- *
- * __global__ void ExampleKernel(int *d_data, int num_items, ...)
- * {
- * // Specialize BlockLoad, BlockStore, and BlockScan for 128 threads, 4 ints per thread
- * typedef cub::BlockLoad<int*, 128, 4, BLOCK_LOAD_TRANSPOSE> BlockLoad;
- * typedef cub::BlockStore<int*, 128, 4, BLOCK_STORE_TRANSPOSE> BlockStore;
- * typedef cub::BlockScan<int, 128> BlockScan;
- *
- * // Allocate aliased shared memory for BlockLoad, BlockStore, and BlockScan
- * __shared__ union {
- * typename BlockLoad::TempStorage load;
- * typename BlockScan::TempStorage scan;
- * typename BlockStore::TempStorage store;
- * } temp_storage;
- *
- * // Initialize running total
- * BlockPrefixOp prefix_op(0);
- *
- * // Have the block iterate over segments of items
- * for (int block_offset = 0; block_offset < num_items; block_offset += 128 * 4)
- * {
- * // Load a segment of consecutive items that are blocked across threads
- * int thread_data[4];
- * BlockLoad(temp_storage.load).Load(d_data + block_offset, thread_data);
- * __syncthreads();
- *
- * // Collectively compute the block-wide inclusive prefix sum
- * int block_aggregate;
- * BlockScan(temp_storage.scan).InclusiveSum(
- * thread_data, thread_data, block_aggregate, prefix_op);
- * __syncthreads();
- *
- * // Store scanned items to output segment
- * BlockStore(temp_storage.store).Store(d_data + block_offset, thread_data);
- * __syncthreads();
- * }
- * }
- * \endcode
- * \par
- * Suppose the input \p d_data is <tt>1, 1, 1, 1, 1, 1, 1, 1, ...</tt>.
- * The corresponding output for the first segment will be <tt>1, 2, 3, 4, ..., 511, 512</tt>.
- * The output for the second segment will be <tt>513, 514, 515, 516, ..., 1023, 1024</tt>. Furthermore,
- * the value \p 512 will be stored in \p block_aggregate for all threads after each scan.
- *
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- * \tparam BlockPrefixOp <b>[inferred]</b> Call-back functor type having member <tt>T operator()(T block_aggregate)</tt>
- */
- template <
- int ITEMS_PER_THREAD,
- typename BlockPrefixOp>
- __device__ __forceinline__ void InclusiveSum(
- T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items
- T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input)
- T &block_aggregate, ///< [out] block-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value)
- BlockPrefixOp &block_prefix_op) ///< [in-out] <b>[<em>warp</em><sub>0</sub> only]</b> Call-back functor for specifying a block-wide prefix to be applied to all inputs.
- {
- if (ITEMS_PER_THREAD == 1)
- {
- InclusiveSum(input[0], output[0], block_aggregate, block_prefix_op);
- }
- else
- {
- // Reduce consecutive thread items in registers
- Sum scan_op;
- T thread_partial = ThreadReduce(input, scan_op);
-
- // Exclusive threadblock-scan
- ExclusiveSum(thread_partial, thread_partial, block_aggregate, block_prefix_op);
-
- // Inclusive scan in registers with prefix
- ThreadScanInclusive(input, output, scan_op, thread_partial);
- }
- }
-
-
- //@} end member group
- /******************************************************************//**
- * \name Inclusive prefix scan operations
- *********************************************************************/
- //@{
-
-
- /**
- * \brief Computes an inclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element.
- *
- * Supports non-commutative scan operators.
- *
- * \blocked
- *
- * \smemreuse
- *
- * The code snippet below illustrates an inclusive prefix max scan of 128 integer items that
- * are partitioned across 128 threads.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize BlockScan for 128 threads on type int
- * typedef cub::BlockScan<int, 128> BlockScan;
- *
- * // Allocate shared memory for BlockScan
- * __shared__ typename BlockScan::TempStorage temp_storage;
- *
- * // Obtain input item for each thread
- * int thread_data;
- * ...
- *
- * // Collectively compute the block-wide inclusive prefix max scan
- * BlockScan(temp_storage).InclusiveScan(thread_data, thread_data, cub::Max());
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is <tt>0, -1, 2, -3, ..., 126, -127</tt>. The
- * corresponding output \p thread_data in those threads will be <tt>0, 0, 2, 2, ..., 126, 126</tt>.
- *
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
- template <typename ScanOp>
- __device__ __forceinline__ void InclusiveScan(
- T input, ///< [in] Calling thread's input item
- T &output, ///< [out] Calling thread's output item (may be aliased to \p input)
- ScanOp scan_op) ///< [in] Binary scan operator
- {
- T block_aggregate;
- InclusiveScan(input, output, scan_op, block_aggregate);
- }
-
-
- /**
- * \brief Computes an inclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- *
- * Supports non-commutative scan operators.
- *
- * \blocked
- *
- * \smemreuse
- *
- * The code snippet below illustrates an inclusive prefix max scan of 128 integer items that
- * are partitioned across 128 threads.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize BlockScan for 128 threads on type int
- * typedef cub::BlockScan<int, 128> BlockScan;
- *
- * // Allocate shared memory for BlockScan
- * __shared__ typename BlockScan::TempStorage temp_storage;
- *
- * // Obtain input item for each thread
- * int thread_data;
- * ...
- *
- * // Collectively compute the block-wide inclusive prefix max scan
- * int block_aggregate;
- * BlockScan(temp_storage).InclusiveScan(thread_data, thread_data, cub::Max(), block_aggregate);
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is <tt>0, -1, 2, -3, ..., 126, -127</tt>. The
- * corresponding output \p thread_data in those threads will be <tt>0, 0, 2, 2, ..., 126, 126</tt>.
- * Furthermore the value \p 126 will be stored in \p block_aggregate for all threads.
- *
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
- template <typename ScanOp>
- __device__ __forceinline__ void InclusiveScan(
- T input, ///< [in] Calling thread's input item
- T &output, ///< [out] Calling thread's output item (may be aliased to \p input)
- ScanOp scan_op, ///< [in] Binary scan operator
- T &block_aggregate) ///< [out] block-wide aggregate reduction of input items
- {
- InternalBlockScan(temp_storage, linear_tid).InclusiveScan(input, output, scan_op, block_aggregate);
- }
-
-
- /**
- * \brief Computes an inclusive block-wide prefix scan using the specified binary \p scan_op functor.  Each thread contributes one input element.  The call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by <em>lane</em><sub>0</sub> in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs.  Also provides every thread with the block-wide \p block_aggregate of all inputs.
- *
- * The \p block_prefix_op functor must implement a member function <tt>T operator()(T block_aggregate)</tt>.
- * The functor's input parameter \p block_aggregate is the same value also returned by the scan operation.
- * The functor will be invoked by the first warp of threads in the block, however only the return value from
- * <em>lane</em><sub>0</sub> is applied as the block-wide prefix. Can be stateful.
- *
- * Supports non-commutative scan operators.
- *
- * \blocked
- *
- * \smemreuse
- *
- * The code snippet below illustrates a single thread block that progressively
- * computes an inclusive prefix max scan over multiple "tiles" of input using a
- * prefix functor to maintain a running total between block-wide scans. Each tile consists
- * of 128 integer items that are partitioned across 128 threads.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * // A stateful callback functor that maintains a running prefix to be applied
- * // during consecutive scan operations.
- * struct BlockPrefixOp
- * {
- * // Running prefix
- * int running_total;
- *
- * // Constructor
- * __device__ BlockPrefixOp(int running_total) : running_total(running_total) {}
- *
- * // Callback operator to be entered by the first warp of threads in the block.
- * // Thread-0 is responsible for returning a value for seeding the block-wide scan.
- * __device__ int operator()(int block_aggregate)
- * {
- * int old_prefix = running_total;
- * running_total = (block_aggregate > old_prefix) ? block_aggregate : old_prefix;
- * return old_prefix;
- * }
- * };
- *
- * __global__ void ExampleKernel(int *d_data, int num_items, ...)
- * {
- * // Specialize BlockScan for 128 threads
- * typedef cub::BlockScan<int, 128> BlockScan;
- *
- * // Allocate shared memory for BlockScan
- * __shared__ typename BlockScan::TempStorage temp_storage;
- *
- * // Initialize running total
- * BlockPrefixOp prefix_op(INT_MIN);
- *
- * // Have the block iterate over segments of items
- * for (int block_offset = 0; block_offset < num_items; block_offset += 128)
- * {
- * // Load a segment of consecutive items that are blocked across threads
- *         int thread_data = d_data[block_offset + threadIdx.x];
- *
- * // Collectively compute the block-wide inclusive prefix max scan
- * int block_aggregate;
- * BlockScan(temp_storage).InclusiveScan(
- * thread_data, thread_data, cub::Max(), block_aggregate, prefix_op);
- * __syncthreads();
- *
- * // Store scanned items to output segment
- *         d_data[block_offset + threadIdx.x] = thread_data;
- * }
- * \endcode
- * \par
- * Suppose the input \p d_data is <tt>0, -1, 2, -3, 4, -5, ...</tt>.
- * The corresponding output for the first segment will be <tt>0, 0, 2, 2, ..., 126, 126</tt>.
- * The output for the second segment will be <tt>128, 128, 130, 130, ..., 254, 254</tt>. Furthermore,
- * \p block_aggregate will be assigned \p 126 in all threads after the first scan, assigned \p 254 after the second
- * scan, etc.
- *
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- * \tparam BlockPrefixOp <b>[inferred]</b> Call-back functor type having member <tt>T operator()(T block_aggregate)</tt>
- */
- template <
- typename ScanOp,
- typename BlockPrefixOp>
- __device__ __forceinline__ void InclusiveScan(
- T input, ///< [in] Calling thread's input item
- T &output, ///< [out] Calling thread's output item (may be aliased to \p input)
- ScanOp scan_op, ///< [in] Binary scan operator
- T &block_aggregate, ///< [out] block-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value)
- BlockPrefixOp &block_prefix_op) ///< [in-out] <b>[<em>warp</em><sub>0</sub> only]</b> Call-back functor for specifying a block-wide prefix to be applied to all inputs.
- {
- InternalBlockScan(temp_storage, linear_tid).InclusiveScan(input, output, scan_op, block_aggregate, block_prefix_op);
- }
-
-
- //@} end member group
- /******************************************************************//**
- * \name Inclusive prefix scan operations (multiple data per thread)
- *********************************************************************/
- //@{
-
-
- /**
- * \brief Computes an inclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes an array of consecutive input elements.
- *
- * Supports non-commutative scan operators.
- *
- * \blocked
- *
- * \smemreuse
- *
- * The code snippet below illustrates an inclusive prefix max scan of 512 integer items that
- * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec4) across 128 threads
- * where each thread owns 4 consecutive items.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize BlockScan for 128 threads on type int
- * typedef cub::BlockScan<int, 128> BlockScan;
- *
- * // Allocate shared memory for BlockScan
- * __shared__ typename BlockScan::TempStorage temp_storage;
- *
- * // Obtain a segment of consecutive items that are blocked across threads
- * int thread_data[4];
- * ...
- *
- * // Collectively compute the block-wide inclusive prefix max scan
- * BlockScan(temp_storage).InclusiveScan(thread_data, thread_data, cub::Max());
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is <tt>{ [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }</tt>. The
- * corresponding output \p thread_data in those threads will be <tt>{ [0,0,2,2], [4,4,6,6], ..., [508,508,510,510] }</tt>.
- *
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
- template <
- int ITEMS_PER_THREAD,
- typename ScanOp>
- __device__ __forceinline__ void InclusiveScan(
- T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items
- T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input)
- ScanOp scan_op) ///< [in] Binary scan operator
- {
- if (ITEMS_PER_THREAD == 1)
- {
- InclusiveScan(input[0], output[0], scan_op);
- }
- else
- {
- // Reduce consecutive thread items in registers
- T thread_partial = ThreadReduce(input, scan_op);
-
- // Exclusive threadblock-scan
- ExclusiveScan(thread_partial, thread_partial, scan_op);
-
- // Inclusive scan in registers with prefix
- ThreadScanInclusive(input, output, scan_op, thread_partial, (linear_tid != 0));
- }
- }
-
-
- /**
- * \brief Computes an inclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- *
- * Supports non-commutative scan operators.
- *
- * \blocked
- *
- * \smemreuse
- *
- * The code snippet below illustrates an inclusive prefix max scan of 512 integer items that
- * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec4) across 128 threads
- * where each thread owns 4 consecutive items.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize BlockScan for 128 threads on type int
- * typedef cub::BlockScan<int, 128> BlockScan;
- *
- * // Allocate shared memory for BlockScan
- * __shared__ typename BlockScan::TempStorage temp_storage;
- *
- * // Obtain a segment of consecutive items that are blocked across threads
- * int thread_data[4];
- * ...
- *
- * // Collectively compute the block-wide inclusive prefix max scan
- * int block_aggregate;
- * BlockScan(temp_storage).InclusiveScan(thread_data, thread_data, cub::Max(), block_aggregate);
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is
- * <tt>{ [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }</tt>.
- * The corresponding output \p thread_data in those threads will be
- * <tt>{ [0,0,2,2], [4,4,6,6], ..., [508,508,510,510] }</tt>.
- * Furthermore the value \p 510 will be stored in \p block_aggregate for all threads.
- *
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
- template <
- int ITEMS_PER_THREAD,
- typename ScanOp>
- __device__ __forceinline__ void InclusiveScan(
- T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items
- T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input)
- ScanOp scan_op, ///< [in] Binary scan operator
- T &block_aggregate) ///< [out] block-wide aggregate reduction of input items
- {
- if (ITEMS_PER_THREAD == 1)
- {
- InclusiveScan(input[0], output[0], scan_op, block_aggregate);
- }
- else
- {
- // Reduce consecutive thread items in registers
- T thread_partial = ThreadReduce(input, scan_op);
-
- // Exclusive threadblock-scan
- ExclusiveScan(thread_partial, thread_partial, scan_op, block_aggregate);
-
- // Inclusive scan in registers with prefix
- ThreadScanInclusive(input, output, scan_op, thread_partial, (linear_tid != 0));
- }
- }
-
-
- /**
- * \brief Computes an inclusive block-wide prefix scan using the specified binary \p scan_op functor.  Each thread contributes an array of consecutive input elements.  The call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by <em>lane</em><sub>0</sub> in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs.  Also provides every thread with the block-wide \p block_aggregate of all inputs.
- *
- * The \p block_prefix_op functor must implement a member function <tt>T operator()(T block_aggregate)</tt>.
- * The functor's input parameter \p block_aggregate is the same value also returned by the scan operation.
- * The functor will be invoked by the first warp of threads in the block, however only the return value from
- * <em>lane</em><sub>0</sub> is applied as the block-wide prefix. Can be stateful.
- *
- * Supports non-commutative scan operators.
- *
- * \blocked
- *
- * \smemreuse
- *
- * The code snippet below illustrates a single thread block that progressively
- * computes an inclusive prefix max scan over multiple "tiles" of input using a
- * prefix functor to maintain a running total between block-wide scans. Each tile consists
- * of 128 integer items that are partitioned across 128 threads.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * // A stateful callback functor that maintains a running prefix to be applied
- * // during consecutive scan operations.
- * struct BlockPrefixOp
- * {
- * // Running prefix
- * int running_total;
- *
- * // Constructor
- * __device__ BlockPrefixOp(int running_total) : running_total(running_total) {}
- *
- * // Callback operator to be entered by the first warp of threads in the block.
- * // Thread-0 is responsible for returning a value for seeding the block-wide scan.
- * __device__ int operator()(int block_aggregate)
- * {
- * int old_prefix = running_total;
- * running_total = (block_aggregate > old_prefix) ? block_aggregate : old_prefix;
- * return old_prefix;
- * }
- * };
- *
- * __global__ void ExampleKernel(int *d_data, int num_items, ...)
- * {
- * // Specialize BlockLoad, BlockStore, and BlockScan for 128 threads, 4 ints per thread
- * typedef cub::BlockLoad<int*, 128, 4, BLOCK_LOAD_TRANSPOSE> BlockLoad;
- * typedef cub::BlockStore<int*, 128, 4, BLOCK_STORE_TRANSPOSE> BlockStore;
- * typedef cub::BlockScan<int, 128> BlockScan;
- *
- * // Allocate aliased shared memory for BlockLoad, BlockStore, and BlockScan
- * __shared__ union {
- * typename BlockLoad::TempStorage load;
- * typename BlockScan::TempStorage scan;
- * typename BlockStore::TempStorage store;
- * } temp_storage;
- *
- * // Initialize running total
- * BlockPrefixOp prefix_op(0);
- *
- * // Have the block iterate over segments of items
- * for (int block_offset = 0; block_offset < num_items; block_offset += 128 * 4)
- * {
- * // Load a segment of consecutive items that are blocked across threads
- * int thread_data[4];
- * BlockLoad(temp_storage.load).Load(d_data + block_offset, thread_data);
- * __syncthreads();
- *
- * // Collectively compute the block-wide inclusive prefix max scan
- * int block_aggregate;
- * BlockScan(temp_storage.scan).InclusiveScan(
- * thread_data, thread_data, cub::Max(), block_aggregate, prefix_op);
- * __syncthreads();
- *
- * // Store scanned items to output segment
- * BlockStore(temp_storage.store).Store(d_data + block_offset, thread_data);
- * __syncthreads();
- * }
- * \endcode
- * \par
- * Suppose the input \p d_data is <tt>0, -1, 2, -3, 4, -5, ...</tt>.
- * The corresponding output for the first segment will be <tt>0, 0, 2, 2, 4, 4, ..., 510, 510</tt>.
- * The output for the second segment will be <tt>512, 512, 514, 514, 516, 516, ..., 1022, 1022</tt>. Furthermore,
- * \p block_aggregate will be assigned \p 510 in all threads after the first scan, assigned \p 1022 after the second
- * scan, etc.
- *
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- * \tparam BlockPrefixOp <b>[inferred]</b> Call-back functor type having member <tt>T operator()(T block_aggregate)</tt>
- */
- template <
- int ITEMS_PER_THREAD,
- typename ScanOp,
- typename BlockPrefixOp>
- __device__ __forceinline__ void InclusiveScan(
- T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items
- T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input)
- ScanOp scan_op, ///< [in] Binary scan operator
- T &block_aggregate, ///< [out] block-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value)
- BlockPrefixOp &block_prefix_op) ///< [in-out] <b>[<em>warp</em><sub>0</sub> only]</b> Call-back functor for specifying a block-wide prefix to be applied to all inputs.
- {
- if (ITEMS_PER_THREAD == 1)
- {
- InclusiveScan(input[0], output[0], scan_op, block_aggregate, block_prefix_op);
- }
- else
- {
- // Reduce consecutive thread items in registers
- T thread_partial = ThreadReduce(input, scan_op);
-
- // Exclusive threadblock-scan
- ExclusiveScan(thread_partial, thread_partial, scan_op, block_aggregate, block_prefix_op);
-
- // Inclusive scan in registers with prefix
- ThreadScanInclusive(input, output, scan_op, thread_partial);
- }
- }
-
- //@} end member group
-
-
-};
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
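For context only (not part of the deleted block_scan.cuh above): a minimal host-side driver for the tiled prefix-sum kernel sketched in the last docblock.  The kernel name ExampleKernel, its (int *d_data, int num_items) signature, and the 128-thread / 4-items-per-thread tiling are assumptions carried over from the example, so this is a sketch rather than code from the file.

#include <cuda_runtime.h>
#include <vector>

// Assumed to be the kernel from the docblock above: one 128-thread block scans
// num_items integers in 512-item tiles, carrying a running prefix between tiles.
__global__ void ExampleKernel(int *d_data, int num_items);

int main()
{
    const int num_items = 1024;                 // two full 512-item tiles
    std::vector<int> h_data(num_items, 1);      // all ones, as in the example

    int *d_data = 0;
    cudaMalloc(&d_data, num_items * sizeof(int));
    cudaMemcpy(d_data, h_data.data(), num_items * sizeof(int), cudaMemcpyHostToDevice);

    // A single block of 128 threads walks all tiles; the stateful prefix functor
    // makes the scan continue seamlessly across tile boundaries.
    ExampleKernel<<<1, 128>>>(d_data, num_items);
    cudaDeviceSynchronize();

    cudaMemcpy(h_data.data(), d_data, num_items * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);                           // h_data now holds 1, 2, 3, ..., 1024
    return 0;
}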
diff --git a/lib/kokkos/TPL/cub/block/block_store.cuh b/lib/kokkos/TPL/cub/block/block_store.cuh
deleted file mode 100755
index fb990de1c..000000000
--- a/lib/kokkos/TPL/cub/block/block_store.cuh
+++ /dev/null
@@ -1,926 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * Operations for writing linear segments of data from the CUDA thread block
- */
-
-#pragma once
-
-#include <iterator>
-
-#include "../util_namespace.cuh"
-#include "../util_macro.cuh"
-#include "../util_type.cuh"
-#include "../util_vector.cuh"
-#include "../thread/thread_store.cuh"
-#include "block_exchange.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-/**
- * \addtogroup IoModule
- * @{
- */
-
-
-/******************************************************************//**
- * \name Blocked I/O
- *********************************************************************/
-//@{
-
-/**
- * \brief Store a blocked arrangement of items across a thread block into a linear segment of items using the specified cache modifier.
- *
- * \blocked
- *
- * \tparam MODIFIER cub::PtxStoreModifier cache modifier.
- * \tparam T <b>[inferred]</b> The data type to store.
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- * \tparam OutputIteratorRA <b>[inferred]</b> The random-access iterator type for output (may be a simple pointer type).
- */
-template <
- PtxStoreModifier MODIFIER,
- typename T,
- int ITEMS_PER_THREAD,
- typename OutputIteratorRA>
-__device__ __forceinline__ void StoreBlocked(
-    int             linear_tid,                 ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + threadIdx.x</tt> for 2D thread blocks)
- OutputIteratorRA block_itr, ///< [in] The thread block's base output iterator for storing to
- T (&items)[ITEMS_PER_THREAD]) ///< [in] Data to store
-{
- // Store directly in thread-blocked order
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- ThreadStore<MODIFIER>(block_itr + (linear_tid * ITEMS_PER_THREAD) + ITEM, items[ITEM]);
- }
-}
-
-
-/**
- * \brief Store a blocked arrangement of items across a thread block into a linear segment of items using the specified cache modifier, guarded by range
- *
- * \blocked
- *
- * \tparam MODIFIER cub::PtxStoreModifier cache modifier.
- * \tparam T <b>[inferred]</b> The data type to store.
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- * \tparam OutputIteratorRA <b>[inferred]</b> The random-access iterator type for output (may be a simple pointer type).
- */
-template <
- PtxStoreModifier MODIFIER,
- typename T,
- int ITEMS_PER_THREAD,
- typename OutputIteratorRA>
-__device__ __forceinline__ void StoreBlocked(
-    int             linear_tid,                 ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + threadIdx.x</tt> for 2D thread blocks)
- OutputIteratorRA block_itr, ///< [in] The thread block's base output iterator for storing to
- T (&items)[ITEMS_PER_THREAD], ///< [in] Data to store
- int valid_items) ///< [in] Number of valid items to write
-{
- // Store directly in thread-blocked order
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- if (ITEM + (linear_tid * ITEMS_PER_THREAD) < valid_items)
- {
- ThreadStore<MODIFIER>(block_itr + (linear_tid * ITEMS_PER_THREAD) + ITEM, items[ITEM]);
- }
- }
-}
-
-
-
-//@} end member group
-/******************************************************************//**
- * \name Striped I/O
- *********************************************************************/
-//@{
-
-
-/**
- * \brief Store a striped arrangement of data across the thread block into a linear segment of items using the specified cache modifier.
- *
- * \striped
- *
- * \tparam MODIFIER cub::PtxStoreModifier cache modifier.
- * \tparam BLOCK_THREADS The thread block size in threads
- * \tparam T <b>[inferred]</b> The data type to store.
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- * \tparam OutputIteratorRA <b>[inferred]</b> The random-access iterator type for output (may be a simple pointer type).
- */
-template <
- PtxStoreModifier MODIFIER,
- int BLOCK_THREADS,
- typename T,
- int ITEMS_PER_THREAD,
- typename OutputIteratorRA>
-__device__ __forceinline__ void StoreStriped(
-    int             linear_tid,                 ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + threadIdx.x</tt> for 2D thread blocks)
- OutputIteratorRA block_itr, ///< [in] The thread block's base output iterator for storing to
- T (&items)[ITEMS_PER_THREAD]) ///< [in] Data to store
-{
- // Store directly in striped order
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- ThreadStore<MODIFIER>(block_itr + (ITEM * BLOCK_THREADS) + linear_tid, items[ITEM]);
- }
-}
-
-
-/**
- * \brief Store a striped arrangement of data across the thread block into a linear segment of items using the specified cache modifier, guarded by range
- *
- * \striped
- *
- * \tparam MODIFIER cub::PtxStoreModifier cache modifier.
- * \tparam BLOCK_THREADS The thread block size in threads
- * \tparam T <b>[inferred]</b> The data type to store.
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- * \tparam OutputIteratorRA <b>[inferred]</b> The random-access iterator type for output (may be a simple pointer type).
- */
-template <
- PtxStoreModifier MODIFIER,
- int BLOCK_THREADS,
- typename T,
- int ITEMS_PER_THREAD,
- typename OutputIteratorRA>
-__device__ __forceinline__ void StoreStriped(
-    int             linear_tid,                 ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + threadIdx.x</tt> for 2D thread blocks)
- OutputIteratorRA block_itr, ///< [in] The thread block's base output iterator for storing to
- T (&items)[ITEMS_PER_THREAD], ///< [in] Data to store
- int valid_items) ///< [in] Number of valid items to write
-{
- // Store directly in striped order
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- if ((ITEM * BLOCK_THREADS) + linear_tid < valid_items)
- {
- ThreadStore<MODIFIER>(block_itr + (ITEM * BLOCK_THREADS) + linear_tid, items[ITEM]);
- }
- }
-}
-
-
-
-//@} end member group
-/******************************************************************//**
- * \name Warp-striped I/O
- *********************************************************************/
-//@{
-
-
-/**
- * \brief Store a warp-striped arrangement of data across the thread block into a linear segment of items using the specified cache modifier.
- *
- * \warpstriped
- *
- * \par Usage Considerations
- * The number of threads in the thread block must be a multiple of the architecture's warp size.
- *
- * \tparam MODIFIER cub::PtxStoreModifier cache modifier.
- * \tparam T <b>[inferred]</b> The data type to store.
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- * \tparam OutputIteratorRA <b>[inferred]</b> The random-access iterator type for output (may be a simple pointer type).
- */
-template <
- PtxStoreModifier MODIFIER,
- typename T,
- int ITEMS_PER_THREAD,
- typename OutputIteratorRA>
-__device__ __forceinline__ void StoreWarpStriped(
-    int             linear_tid,                 ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + threadIdx.x</tt> for 2D thread blocks)
- OutputIteratorRA block_itr, ///< [in] The thread block's base output iterator for storing to
-    T               (&items)[ITEMS_PER_THREAD]) ///< [in] Data to store
-{
- int tid = linear_tid & (PtxArchProps::WARP_THREADS - 1);
- int wid = linear_tid >> PtxArchProps::LOG_WARP_THREADS;
- int warp_offset = wid * PtxArchProps::WARP_THREADS * ITEMS_PER_THREAD;
-
- // Store directly in warp-striped order
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- ThreadStore<MODIFIER>(block_itr + warp_offset + tid + (ITEM * PtxArchProps::WARP_THREADS), items[ITEM]);
- }
-}
-
-
-/**
- * \brief Store a warp-striped arrangement of data across the thread block into a linear segment of items using the specified cache modifier, guarded by range
- *
- * \warpstriped
- *
- * \par Usage Considerations
- * The number of threads in the thread block must be a multiple of the architecture's warp size.
- *
- * \tparam MODIFIER cub::PtxStoreModifier cache modifier.
- * \tparam T <b>[inferred]</b> The data type to store.
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- * \tparam OutputIteratorRA <b>[inferred]</b> The random-access iterator type for output (may be a simple pointer type).
- */
-template <
- PtxStoreModifier MODIFIER,
- typename T,
- int ITEMS_PER_THREAD,
- typename OutputIteratorRA>
-__device__ __forceinline__ void StoreWarpStriped(
-    int             linear_tid,                 ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + threadIdx.x</tt> for 2D thread blocks)
- OutputIteratorRA block_itr, ///< [in] The thread block's base output iterator for storing to
- T (&items)[ITEMS_PER_THREAD], ///< [in] Data to store
- int valid_items) ///< [in] Number of valid items to write
-{
- int tid = linear_tid & (PtxArchProps::WARP_THREADS - 1);
- int wid = linear_tid >> PtxArchProps::LOG_WARP_THREADS;
- int warp_offset = wid * PtxArchProps::WARP_THREADS * ITEMS_PER_THREAD;
-
- // Store directly in warp-striped order
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- if (warp_offset + tid + (ITEM * PtxArchProps::WARP_THREADS) < valid_items)
- {
- ThreadStore<MODIFIER>(block_itr + warp_offset + tid + (ITEM * PtxArchProps::WARP_THREADS), items[ITEM]);
- }
- }
-}
-
-
-
-//@} end member group
-/******************************************************************//**
- * \name Blocked, vectorized I/O
- *********************************************************************/
-//@{
-
-/**
- * \brief Store a blocked arrangement of items across a thread block into a linear segment of items using the specified cache modifier.
- *
- * \blocked
- *
- * The output offset (\p block_ptr + \p block_offset) must be quad-item aligned,
- * which is the default starting offset returned by \p cudaMalloc()
- *
- * \par
- * The following conditions will prevent vectorization and storing will fall back to cub::BLOCK_STORE_DIRECT:
- * - \p ITEMS_PER_THREAD is odd
- * - The data type \p T is not a built-in primitive or CUDA vector type (e.g., \p short, \p int2, \p double, \p float2, etc.)
- *
- * \tparam MODIFIER cub::PtxStoreModifier cache modifier.
- * \tparam T <b>[inferred]</b> The data type to store.
- * \tparam ITEMS_PER_THREAD <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
- *
- */
-template <
- PtxStoreModifier MODIFIER,
- typename T,
- int ITEMS_PER_THREAD>
-__device__ __forceinline__ void StoreBlockedVectorized(
-    int             linear_tid,                 ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + threadIdx.x</tt> for 2D thread blocks)
-    T               *block_ptr,                 ///< [in] The thread block's base output pointer for storing to
- T (&items)[ITEMS_PER_THREAD]) ///< [in] Data to store
-{
- enum
- {
- // Maximum CUDA vector size is 4 elements
- MAX_VEC_SIZE = CUB_MIN(4, ITEMS_PER_THREAD),
-
- // Vector size must be a power of two and an even divisor of the items per thread
- VEC_SIZE = ((((MAX_VEC_SIZE - 1) & MAX_VEC_SIZE) == 0) && ((ITEMS_PER_THREAD % MAX_VEC_SIZE) == 0)) ?
- MAX_VEC_SIZE :
- 1,
-
- VECTORS_PER_THREAD = ITEMS_PER_THREAD / VEC_SIZE,
- };
-
- // Vector type
- typedef typename VectorHelper<T, VEC_SIZE>::Type Vector;
-
- // Alias global pointer
- Vector *block_ptr_vectors = reinterpret_cast<Vector *>(block_ptr);
-
- // Alias pointers (use "raw" array here which should get optimized away to prevent conservative PTXAS lmem spilling)
- Vector raw_vector[VECTORS_PER_THREAD];
- T *raw_items = reinterpret_cast<T*>(raw_vector);
-
- // Copy
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- raw_items[ITEM] = items[ITEM];
- }
-
- // Direct-store using vector types
- StoreBlocked<MODIFIER>(linear_tid, block_ptr_vectors, raw_vector);
-}
-
-
-//@} end member group
-
-
-/** @} */ // end group IoModule
-
-
-//-----------------------------------------------------------------------------
-// Generic BlockStore abstraction
-//-----------------------------------------------------------------------------
-
-/**
- * \brief cub::BlockStoreAlgorithm enumerates alternative algorithms for cub::BlockStore to write a blocked arrangement of items across a CUDA thread block to a linear segment of memory.
- */
-enum BlockStoreAlgorithm
-{
- /**
- * \par Overview
- *
- * A [<em>blocked arrangement</em>](index.html#sec5sec4) of data is written
- * directly to memory. The thread block writes items in a parallel "raking" fashion:
- * thread<sub><em>i</em></sub> writes the <em>i</em><sup>th</sup> segment of consecutive elements.
- *
- * \par Performance Considerations
- * - The utilization of memory transactions (coalescing) decreases as the
- *   access stride between threads increases (i.e., the number of items per thread).
- */
- BLOCK_STORE_DIRECT,
-
- /**
- * \par Overview
- *
- * A [<em>blocked arrangement</em>](index.html#sec5sec4) of data is written directly
- * to memory using CUDA's built-in vectorized stores as a coalescing optimization.
- * The thread block writes items in a parallel "raking" fashion: thread<sub><em>i</em></sub> uses vector stores to
- * write the <em>i</em><sup>th</sup> segment of consecutive elements.
- *
- * For example, <tt>st.global.v4.s32</tt> instructions will be generated when \p T = \p int and \p ITEMS_PER_THREAD > 4.
- *
- * \par Performance Considerations
- *   - The utilization of memory transactions (coalescing) remains high until the
- *     access stride between threads (i.e., the number of items per thread) exceeds the
- * maximum vector store width (typically 4 items or 64B, whichever is lower).
- * - The following conditions will prevent vectorization and writing will fall back to cub::BLOCK_STORE_DIRECT:
- * - \p ITEMS_PER_THREAD is odd
- * - The \p OutputIteratorRA is not a simple pointer type
- * - The block output offset is not quadword-aligned
- * - The data type \p T is not a built-in primitive or CUDA vector type (e.g., \p short, \p int2, \p double, \p float2, etc.)
- */
- BLOCK_STORE_VECTORIZE,
-
- /**
- * \par Overview
- * A [<em>blocked arrangement</em>](index.html#sec5sec4) is locally
- * transposed into a [<em>striped arrangement</em>](index.html#sec5sec4)
- *   which is then written to memory.  More specifically, cub::BlockExchange is
- * used to locally reorder the items into a
- * [<em>striped arrangement</em>](index.html#sec5sec4), after which the
- * thread block writes items in a parallel "strip-mining" fashion: consecutive
- * items owned by thread<sub><em>i</em></sub> are written to memory with
- * stride \p BLOCK_THREADS between them.
- *
- * \par Performance Considerations
- * - The utilization of memory transactions (coalescing) remains high regardless
- * of items written per thread.
- *   - The local reordering incurs slightly higher latency and lower throughput than the
- * direct cub::BLOCK_STORE_DIRECT and cub::BLOCK_STORE_VECTORIZE alternatives.
- */
- BLOCK_STORE_TRANSPOSE,
-
- /**
- * \par Overview
- * A [<em>blocked arrangement</em>](index.html#sec5sec4) is locally
- * transposed into a [<em>warp-striped arrangement</em>](index.html#sec5sec4)
- *   which is then written to memory.  More specifically, cub::BlockExchange is used
- * to locally reorder the items into a
- * [<em>warp-striped arrangement</em>](index.html#sec5sec4), after which
- * each warp writes its own contiguous segment in a parallel "strip-mining" fashion:
- * consecutive items owned by lane<sub><em>i</em></sub> are written to memory
- * with stride \p WARP_THREADS between them.
- *
- * \par Performance Considerations
- * - The utilization of memory transactions (coalescing) remains high regardless
- * of items written per thread.
- *   - The local reordering incurs slightly higher latency and lower throughput than the
- * direct cub::BLOCK_STORE_DIRECT and cub::BLOCK_STORE_VECTORIZE alternatives.
- */
- BLOCK_STORE_WARP_TRANSPOSE,
-};
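To make the "blocked" and "striped" arrangement terms used by the enumerators above concrete, here is a small illustrative device function.  It is not part of the deleted file; the function name and template parameters are for illustration only, mirroring the StoreBlocked/StoreStriped helpers earlier in this header.

// Illustration only: where thread i's j-th item lands in memory under the two layouts.
template <int BLOCK_THREADS, int ITEMS_PER_THREAD, typename T>
__device__ void illustrate_arrangements(int linear_tid, T *out, T (&items)[ITEMS_PER_THREAD])
{
    #pragma unroll
    for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
    {
        // Blocked: a thread's items are contiguous; the stride between threads grows with
        // ITEMS_PER_THREAD, which is what degrades coalescing for BLOCK_STORE_DIRECT.
        out[(linear_tid * ITEMS_PER_THREAD) + ITEM] = items[ITEM];

        // Striped (what BLOCK_STORE_TRANSPOSE exchanges into before writing): adjacent
        // threads write adjacent addresses, so memory transactions stay coalesced.
        // out[(ITEM * BLOCK_THREADS) + linear_tid] = items[ITEM];
    }
}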
-
-
-
-/**
- * \addtogroup BlockModule
- * @{
- */
-
-
-/**
- * \brief The BlockStore class provides [<em>collective</em>](index.html#sec0) data movement methods for writing a [<em>blocked arrangement</em>](index.html#sec5sec4) of items partitioned across a CUDA thread block to a linear segment of memory. ![](block_store_logo.png)
- *
- * \par Overview
- * The BlockStore class provides a single data movement abstraction that can be specialized
- * to implement different cub::BlockStoreAlgorithm strategies. This facilitates different
- * performance policies for different architectures, data types, granularity sizes, etc.
- *
- * \par Optionally, BlockStore can be specialized by different data movement strategies:
- * -# <b>cub::BLOCK_STORE_DIRECT</b>. A [<em>blocked arrangement</em>](index.html#sec5sec4) of data is written
- * directly to memory. [More...](\ref cub::BlockStoreAlgorithm)
- * -# <b>cub::BLOCK_STORE_VECTORIZE</b>. A [<em>blocked arrangement</em>](index.html#sec5sec4)
- * of data is written directly to memory using CUDA's built-in vectorized stores as a
- * coalescing optimization. [More...](\ref cub::BlockStoreAlgorithm)
- * -# <b>cub::BLOCK_STORE_TRANSPOSE</b>. A [<em>blocked arrangement</em>](index.html#sec5sec4)
- * is locally transposed into a [<em>striped arrangement</em>](index.html#sec5sec4) which is
- * then written to memory. [More...](\ref cub::BlockStoreAlgorithm)
- * -# <b>cub::BLOCK_STORE_WARP_TRANSPOSE</b>. A [<em>blocked arrangement</em>](index.html#sec5sec4)
- * is locally transposed into a [<em>warp-striped arrangement</em>](index.html#sec5sec4) which is
- * then written to memory. [More...](\ref cub::BlockStoreAlgorithm)
- *
- * \tparam OutputIteratorRA     The output iterator type (may be a simple pointer type).
- * \tparam BLOCK_THREADS The thread block size in threads.
- * \tparam ITEMS_PER_THREAD The number of consecutive items partitioned onto each thread.
- * \tparam ALGORITHM <b>[optional]</b> cub::BlockStoreAlgorithm tuning policy enumeration. default: cub::BLOCK_STORE_DIRECT.
- * \tparam MODIFIER <b>[optional]</b> cub::PtxStoreModifier cache modifier. default: cub::STORE_DEFAULT.
- * \tparam WARP_TIME_SLICING <b>[optional]</b> For transposition-based cub::BlockStoreAlgorithm parameterizations that utilize shared memory: When \p true, only use enough shared memory for a single warp's worth of data, time-slicing the block-wide exchange over multiple synchronized rounds (default: false)
- *
- * \par A Simple Example
- * \blockcollective{BlockStore}
- * \par
- * The code snippet below illustrates the storing of a "blocked" arrangement
- * of 512 integers across 128 threads (where each thread owns 4 consecutive items)
- * into a linear segment of memory. The store is specialized for \p BLOCK_STORE_WARP_TRANSPOSE,
- * meaning items are locally reordered among threads so that memory references will be
- * efficiently coalesced using a warp-striped access pattern.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(int *d_data, ...)
- * {
- * // Specialize BlockStore for 128 threads owning 4 integer items each
- * typedef cub::BlockStore<int*, 128, 4, BLOCK_STORE_WARP_TRANSPOSE> BlockStore;
- *
- * // Allocate shared memory for BlockStore
- * __shared__ typename BlockStore::TempStorage temp_storage;
- *
- * // Obtain a segment of consecutive items that are blocked across threads
- * int thread_data[4];
- * ...
- *
- *     // Store items to linear memory
- * BlockStore(temp_storage).Store(d_data, thread_data);
- *
- * \endcode
- * \par
- * Suppose the set of \p thread_data across the block of threads is
- * <tt>{ [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }</tt>.
- * The output \p d_data will be <tt>0, 1, 2, 3, 4, 5, ...</tt>.
- *
- */
-template <
- typename OutputIteratorRA,
- int BLOCK_THREADS,
- int ITEMS_PER_THREAD,
- BlockStoreAlgorithm ALGORITHM = BLOCK_STORE_DIRECT,
- PtxStoreModifier MODIFIER = STORE_DEFAULT,
- bool WARP_TIME_SLICING = false>
-class BlockStore
-{
-private:
- /******************************************************************************
-     * Constants and type definitions
- ******************************************************************************/
-
-    // Data type of output iterator
- typedef typename std::iterator_traits<OutputIteratorRA>::value_type T;
-
-
- /******************************************************************************
- * Algorithmic variants
- ******************************************************************************/
-
- /// Store helper
- template <BlockStoreAlgorithm _POLICY, int DUMMY = 0>
- struct StoreInternal;
-
-
- /**
- * BLOCK_STORE_DIRECT specialization of store helper
- */
- template <int DUMMY>
- struct StoreInternal<BLOCK_STORE_DIRECT, DUMMY>
- {
- /// Shared memory storage layout type
- typedef NullType TempStorage;
-
- /// Linear thread-id
- int linear_tid;
-
- /// Constructor
- __device__ __forceinline__ StoreInternal(
- TempStorage &temp_storage,
- int linear_tid)
- :
- linear_tid(linear_tid)
- {}
-
- /// Store items into a linear segment of memory
- __device__ __forceinline__ void Store(
- OutputIteratorRA block_itr, ///< [in] The thread block's base output iterator for storing to
- T (&items)[ITEMS_PER_THREAD]) ///< [in] Data to store
- {
- StoreBlocked<MODIFIER>(linear_tid, block_itr, items);
- }
-
- /// Store items into a linear segment of memory, guarded by range
- __device__ __forceinline__ void Store(
- OutputIteratorRA block_itr, ///< [in] The thread block's base output iterator for storing to
- T (&items)[ITEMS_PER_THREAD], ///< [in] Data to store
- int valid_items) ///< [in] Number of valid items to write
- {
- StoreBlocked<MODIFIER>(linear_tid, block_itr, items, valid_items);
- }
- };
-
-
- /**
- * BLOCK_STORE_VECTORIZE specialization of store helper
- */
- template <int DUMMY>
- struct StoreInternal<BLOCK_STORE_VECTORIZE, DUMMY>
- {
- /// Shared memory storage layout type
- typedef NullType TempStorage;
-
- /// Linear thread-id
- int linear_tid;
-
- /// Constructor
- __device__ __forceinline__ StoreInternal(
- TempStorage &temp_storage,
- int linear_tid)
- :
- linear_tid(linear_tid)
- {}
-
- /// Store items into a linear segment of memory, specialized for native pointer types (attempts vectorization)
- __device__ __forceinline__ void Store(
- T *block_ptr, ///< [in] The thread block's base output iterator for storing to
- T (&items)[ITEMS_PER_THREAD]) ///< [in] Data to store
- {
- StoreBlockedVectorized<MODIFIER>(linear_tid, block_ptr, items);
- }
-
- /// Store items into a linear segment of memory, specialized for opaque input iterators (skips vectorization)
- template <typename _OutputIteratorRA>
- __device__ __forceinline__ void Store(
- _OutputIteratorRA block_itr, ///< [in] The thread block's base output iterator for storing to
- T (&items)[ITEMS_PER_THREAD]) ///< [in] Data to store
- {
- StoreBlocked<MODIFIER>(linear_tid, block_itr, items);
- }
-
- /// Store items into a linear segment of memory, guarded by range
- __device__ __forceinline__ void Store(
- OutputIteratorRA block_itr, ///< [in] The thread block's base output iterator for storing to
- T (&items)[ITEMS_PER_THREAD], ///< [in] Data to store
- int valid_items) ///< [in] Number of valid items to write
- {
- StoreBlocked<MODIFIER>(linear_tid, block_itr, items, valid_items);
- }
- };
-
-
- /**
- * BLOCK_STORE_TRANSPOSE specialization of store helper
- */
- template <int DUMMY>
- struct StoreInternal<BLOCK_STORE_TRANSPOSE, DUMMY>
- {
- // BlockExchange utility type for keys
- typedef BlockExchange<T, BLOCK_THREADS, ITEMS_PER_THREAD, WARP_TIME_SLICING> BlockExchange;
-
- /// Shared memory storage layout type
- typedef typename BlockExchange::TempStorage _TempStorage;
-
- /// Alias wrapper allowing storage to be unioned
- struct TempStorage : Uninitialized<_TempStorage> {};
-
- /// Thread reference to shared storage
- _TempStorage &temp_storage;
-
- /// Linear thread-id
- int linear_tid;
-
- /// Constructor
- __device__ __forceinline__ StoreInternal(
- TempStorage &temp_storage,
- int linear_tid)
- :
- temp_storage(temp_storage.Alias()),
- linear_tid(linear_tid)
- {}
-
- /// Store items into a linear segment of memory
- __device__ __forceinline__ void Store(
- OutputIteratorRA block_itr, ///< [in] The thread block's base output iterator for storing to
- T (&items)[ITEMS_PER_THREAD]) ///< [in] Data to store
- {
- BlockExchange(temp_storage).BlockedToStriped(items);
- StoreStriped<MODIFIER, BLOCK_THREADS>(linear_tid, block_itr, items);
- }
-
- /// Store items into a linear segment of memory, guarded by range
- __device__ __forceinline__ void Store(
- OutputIteratorRA block_itr, ///< [in] The thread block's base output iterator for storing to
- T (&items)[ITEMS_PER_THREAD], ///< [in] Data to store
- int valid_items) ///< [in] Number of valid items to write
- {
- BlockExchange(temp_storage).BlockedToStriped(items);
- StoreStriped<MODIFIER, BLOCK_THREADS>(linear_tid, block_itr, items, valid_items);
- }
- };
-
-
- /**
- * BLOCK_STORE_WARP_TRANSPOSE specialization of store helper
- */
- template <int DUMMY>
- struct StoreInternal<BLOCK_STORE_WARP_TRANSPOSE, DUMMY>
- {
- enum
- {
- WARP_THREADS = PtxArchProps::WARP_THREADS
- };
-
- // Assert BLOCK_THREADS must be a multiple of WARP_THREADS
- CUB_STATIC_ASSERT((BLOCK_THREADS % WARP_THREADS == 0), "BLOCK_THREADS must be a multiple of WARP_THREADS");
-
- // BlockExchange utility type for keys
- typedef BlockExchange<T, BLOCK_THREADS, ITEMS_PER_THREAD, WARP_TIME_SLICING> BlockExchange;
-
- /// Shared memory storage layout type
- typedef typename BlockExchange::TempStorage _TempStorage;
-
- /// Alias wrapper allowing storage to be unioned
- struct TempStorage : Uninitialized<_TempStorage> {};
-
- /// Thread reference to shared storage
- _TempStorage &temp_storage;
-
- /// Linear thread-id
- int linear_tid;
-
- /// Constructor
- __device__ __forceinline__ StoreInternal(
- TempStorage &temp_storage,
- int linear_tid)
- :
- temp_storage(temp_storage.Alias()),
- linear_tid(linear_tid)
- {}
-
- /// Store items into a linear segment of memory
- __device__ __forceinline__ void Store(
- OutputIteratorRA block_itr, ///< [in] The thread block's base output iterator for storing to
- T (&items)[ITEMS_PER_THREAD]) ///< [in] Data to store
- {
- BlockExchange(temp_storage).BlockedToWarpStriped(items);
- StoreWarpStriped<MODIFIER>(linear_tid, block_itr, items);
- }
-
- /// Store items into a linear segment of memory, guarded by range
- __device__ __forceinline__ void Store(
- OutputIteratorRA block_itr, ///< [in] The thread block's base output iterator for storing to
- T (&items)[ITEMS_PER_THREAD], ///< [in] Data to store
- int valid_items) ///< [in] Number of valid items to write
- {
- BlockExchange(temp_storage).BlockedToWarpStriped(items);
- StoreWarpStriped<MODIFIER>(linear_tid, block_itr, items, valid_items);
- }
- };
-
- /******************************************************************************
- * Type definitions
- ******************************************************************************/
-
-    /// Internal store implementation to use
- typedef StoreInternal<ALGORITHM> InternalStore;
-
-
- /// Shared memory storage layout type
- typedef typename InternalStore::TempStorage _TempStorage;
-
-
- /******************************************************************************
- * Utility methods
- ******************************************************************************/
-
- /// Internal storage allocator
- __device__ __forceinline__ _TempStorage& PrivateStorage()
- {
- __shared__ _TempStorage private_storage;
- return private_storage;
- }
-
-
- /******************************************************************************
- * Thread fields
- ******************************************************************************/
-
- /// Thread reference to shared storage
- _TempStorage &temp_storage;
-
- /// Linear thread-id
- int linear_tid;
-
-public:
-
-
- /// \smemstorage{BlockStore}
- struct TempStorage : Uninitialized<_TempStorage> {};
-
-
- /******************************************************************//**
- * \name Collective constructors
- *********************************************************************/
- //@{
-
- /**
- * \brief Collective constructor for 1D thread blocks using a private static allocation of shared memory as temporary storage. Threads are identified using <tt>threadIdx.x</tt>.
- */
- __device__ __forceinline__ BlockStore()
- :
- temp_storage(PrivateStorage()),
- linear_tid(threadIdx.x)
- {}
-
-
- /**
- * \brief Collective constructor for 1D thread blocks using the specified memory allocation as temporary storage. Threads are identified using <tt>threadIdx.x</tt>.
- */
- __device__ __forceinline__ BlockStore(
- TempStorage &temp_storage) ///< [in] Reference to memory allocation having layout type TempStorage
- :
- temp_storage(temp_storage.Alias()),
- linear_tid(threadIdx.x)
- {}
-
-
- /**
- * \brief Collective constructor using a private static allocation of shared memory as temporary storage. Each thread is identified using the supplied linear thread identifier
- */
- __device__ __forceinline__ BlockStore(
-        int linear_tid)                        ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + threadIdx.x</tt> for 2D thread blocks)
- :
- temp_storage(PrivateStorage()),
- linear_tid(linear_tid)
- {}
-
-
- /**
- * \brief Collective constructor using the specified memory allocation as temporary storage. Each thread is identified using the supplied linear thread identifier.
- */
- __device__ __forceinline__ BlockStore(
- TempStorage &temp_storage, ///< [in] Reference to memory allocation having layout type TempStorage
-        int linear_tid)                        ///< [in] <b>[optional]</b> A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + threadIdx.x</tt> for 2D thread blocks)
- :
- temp_storage(temp_storage.Alias()),
- linear_tid(linear_tid)
- {}
-
-
- //@} end member group
- /******************************************************************//**
- * \name Data movement
- *********************************************************************/
- //@{
-
-
- /**
- * \brief Store items into a linear segment of memory.
- *
- * \blocked
- *
- * The code snippet below illustrates the storing of a "blocked" arrangement
- * of 512 integers across 128 threads (where each thread owns 4 consecutive items)
- * into a linear segment of memory. The store is specialized for \p BLOCK_STORE_WARP_TRANSPOSE,
- * meaning items are locally reordered among threads so that memory references will be
- * efficiently coalesced using a warp-striped access pattern.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(int *d_data, ...)
- * {
- * // Specialize BlockStore for 128 threads owning 4 integer items each
- * typedef cub::BlockStore<int*, 128, 4, BLOCK_STORE_WARP_TRANSPOSE> BlockStore;
- *
- * // Allocate shared memory for BlockStore
- * __shared__ typename BlockStore::TempStorage temp_storage;
- *
- * // Obtain a segment of consecutive items that are blocked across threads
- * int thread_data[4];
- * ...
- *
- *     // Store items to linear memory
- * BlockStore(temp_storage).Store(d_data, thread_data);
- *
- * \endcode
- * \par
- * Suppose the set of \p thread_data across the block of threads is
- * <tt>{ [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }</tt>.
- * The output \p d_data will be <tt>0, 1, 2, 3, 4, 5, ...</tt>.
- *
- */
- __device__ __forceinline__ void Store(
- OutputIteratorRA block_itr, ///< [in] The thread block's base output iterator for storing to
- T (&items)[ITEMS_PER_THREAD]) ///< [in] Data to store
- {
- InternalStore(temp_storage, linear_tid).Store(block_itr, items);
- }
-
- /**
- * \brief Store items into a linear segment of memory, guarded by range.
- *
- * \blocked
- *
- * The code snippet below illustrates the guarded storing of a "blocked" arrangement
- * of 512 integers across 128 threads (where each thread owns 4 consecutive items)
- * into a linear segment of memory. The store is specialized for \p BLOCK_STORE_WARP_TRANSPOSE,
- * meaning items are locally reordered among threads so that memory references will be
- * efficiently coalesced using a warp-striped access pattern.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(int *d_data, int valid_items, ...)
- * {
- * // Specialize BlockStore for 128 threads owning 4 integer items each
- * typedef cub::BlockStore<int*, 128, 4, BLOCK_STORE_WARP_TRANSPOSE> BlockStore;
- *
- * // Allocate shared memory for BlockStore
- * __shared__ typename BlockStore::TempStorage temp_storage;
- *
- * // Obtain a segment of consecutive items that are blocked across threads
- * int thread_data[4];
- * ...
- *
- *     // Store items to linear memory
- * BlockStore(temp_storage).Store(d_data, thread_data, valid_items);
- *
- * \endcode
- * \par
- * Suppose the set of \p thread_data across the block of threads is
- * <tt>{ [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }</tt> and \p valid_items is \p 5.
- * The output \p d_data will be <tt>0, 1, 2, 3, 4, ?, ?, ?, ...</tt>, with
- * only the first two threads being unmasked to store portions of valid data.
- *
- */
- __device__ __forceinline__ void Store(
- OutputIteratorRA block_itr, ///< [in] The thread block's base output iterator for storing to
- T (&items)[ITEMS_PER_THREAD], ///< [in] Data to store
- int valid_items) ///< [in] Number of valid items to write
- {
- InternalStore(temp_storage, linear_tid).Store(block_itr, items, valid_items);
- }
-};
-
-/** @} */ // end group BlockModule
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
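As a usage note on the range-guarded Store() overload documented in the deleted block_store.cuh above: a common pattern is to loop a block over full tiles and fall back to the guarded overload only for the final, partially filled tile.  A minimal sketch under assumed names (the kernel name, placeholder data, and the 128-thread / 4-items-per-thread tiling are illustrative, not taken from the file):

#include <cub/cub.cuh>

__global__ void StoreTilesKernel(int *d_data, int num_items)
{
    // Same specialization as the docblock examples: 128 threads, 4 ints per thread
    typedef cub::BlockStore<int*, 128, 4, cub::BLOCK_STORE_WARP_TRANSPOSE> BlockStore;
    __shared__ typename BlockStore::TempStorage temp_storage;

    const int TILE_ITEMS = 128 * 4;
    for (int block_offset = 0; block_offset < num_items; block_offset += TILE_ITEMS)
    {
        int thread_data[4] = {0, 0, 0, 0};   // placeholder; a real kernel computes or loads these

        int valid_items = num_items - block_offset;   // may be < TILE_ITEMS on the last tile
        if (valid_items >= TILE_ITEMS)
            BlockStore(temp_storage).Store(d_data + block_offset, thread_data);
        else
            BlockStore(temp_storage).Store(d_data + block_offset, thread_data, valid_items);

        __syncthreads();   // temp_storage is reused by the next tile's exchange
    }
}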
diff --git a/lib/kokkos/TPL/cub/block/specializations/block_histogram_atomic.cuh b/lib/kokkos/TPL/cub/block/specializations/block_histogram_atomic.cuh
deleted file mode 100755
index ecc980098..000000000
--- a/lib/kokkos/TPL/cub/block/specializations/block_histogram_atomic.cuh
+++ /dev/null
@@ -1,85 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * The cub::BlockHistogramAtomic class provides atomic-based methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block.
- */
-
-#pragma once
-
-#include "../../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/**
- * \brief The BlockHistogramAtomic class provides atomic-based methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block.
- */
-template <
- typename T,
- int BLOCK_THREADS,
- int ITEMS_PER_THREAD,
- int BINS>
-struct BlockHistogramAtomic
-{
- /// Shared memory storage layout type
- struct TempStorage {};
-
-
- /// Constructor
- __device__ __forceinline__ BlockHistogramAtomic(
- TempStorage &temp_storage,
- int linear_tid)
- {}
-
-
- /// Composite data onto an existing histogram
- template <
- typename HistoCounter>
- __device__ __forceinline__ void Composite(
- T (&items)[ITEMS_PER_THREAD], ///< [in] Calling thread's input values to histogram
- HistoCounter histogram[BINS]) ///< [out] Reference to shared/global memory histogram
- {
- // Update histogram
- #pragma unroll
- for (int i = 0; i < ITEMS_PER_THREAD; ++i)
- {
- atomicAdd(histogram + items[i], 1);
- }
- }
-
-};
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
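A minimal sketch of the same atomic strategy in plain CUDA, assuming the samples are already bin indices in [0, BINS); the kernel and buffer names are illustrative and not part of the deleted file:

// Block-wide histogram via shared-memory atomics, then a single flush to global
// memory; this mirrors the Composite() loop above but is not the CUB class itself.
template <int BLOCK_THREADS, int ITEMS_PER_THREAD, int BINS>
__global__ void BlockHistogramAtomicSketch(const int *d_samples, unsigned int *d_histogram)
{
    __shared__ unsigned int smem_histo[BINS];

    // Cooperatively zero the shared histogram
    for (int bin = threadIdx.x; bin < BINS; bin += BLOCK_THREADS)
        smem_histo[bin] = 0;
    __syncthreads();

    // Composite this thread's items with one atomicAdd per sample
    int base = (blockIdx.x * BLOCK_THREADS + threadIdx.x) * ITEMS_PER_THREAD;
    for (int i = 0; i < ITEMS_PER_THREAD; ++i)
        atomicAdd(&smem_histo[d_samples[base + i]], 1u);
    __syncthreads();

    // Flush the block-local counts into the global histogram
    for (int bin = threadIdx.x; bin < BINS; bin += BLOCK_THREADS)
        atomicAdd(&d_histogram[bin], smem_histo[bin]);
}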
diff --git a/lib/kokkos/TPL/cub/block/specializations/block_histogram_sort.cuh b/lib/kokkos/TPL/cub/block/specializations/block_histogram_sort.cuh
deleted file mode 100755
index e81edec6c..000000000
--- a/lib/kokkos/TPL/cub/block/specializations/block_histogram_sort.cuh
+++ /dev/null
@@ -1,197 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * The cub::BlockHistogramSort class provides sorting-based methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block.
- */
-
-#pragma once
-
-#include "../../block/block_radix_sort.cuh"
-#include "../../block/block_discontinuity.cuh"
-#include "../../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-
-/**
- * \brief The BlockHistogramSort class provides sorting-based methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block.
- */
-template <
- typename T,
- int BLOCK_THREADS,
- int ITEMS_PER_THREAD,
- int BINS>
-struct BlockHistogramSort
-{
- // Parameterize BlockRadixSort type for our thread block
- typedef BlockRadixSort<T, BLOCK_THREADS, ITEMS_PER_THREAD> BlockRadixSortT;
-
- // Parameterize BlockDiscontinuity type for our thread block
- typedef BlockDiscontinuity<T, BLOCK_THREADS> BlockDiscontinuityT;
-
- // Shared memory
- union _TempStorage
- {
- // Storage for sorting bin values
- typename BlockRadixSortT::TempStorage sort;
-
- struct
- {
- // Storage for detecting discontinuities in the tile of sorted bin values
- typename BlockDiscontinuityT::TempStorage flag;
-
- // Storage for noting begin/end offsets of bin runs in the tile of sorted bin values
- unsigned int run_begin[BINS];
- unsigned int run_end[BINS];
- };
- };
-
-
- /// Alias wrapper allowing storage to be unioned
- struct TempStorage : Uninitialized<_TempStorage> {};
-
-
- // Thread fields
- _TempStorage &temp_storage;
- int linear_tid;
-
-
- /// Constructor
- __device__ __forceinline__ BlockHistogramSort(
- TempStorage &temp_storage,
- int linear_tid)
- :
- temp_storage(temp_storage.Alias()),
- linear_tid(linear_tid)
- {}
-
-
- // Discontinuity functor
- struct DiscontinuityOp
- {
- // Reference to temp_storage
- _TempStorage &temp_storage;
-
- // Constructor
- __device__ __forceinline__ DiscontinuityOp(_TempStorage &temp_storage) :
- temp_storage(temp_storage)
- {}
-
- // Discontinuity predicate
- __device__ __forceinline__ bool operator()(const T &a, const T &b, unsigned int b_index)
- {
- if (a != b)
- {
- // Note the begin/end offsets in shared storage
- temp_storage.run_begin[b] = b_index;
- temp_storage.run_end[a] = b_index;
-
- return true;
- }
- else
- {
- return false;
- }
- }
- };
-
-
- // Composite data onto an existing histogram
- template <
- typename HistoCounter>
- __device__ __forceinline__ void Composite(
- T (&items)[ITEMS_PER_THREAD], ///< [in] Calling thread's input values to histogram
- HistoCounter histogram[BINS]) ///< [out] Reference to shared/global memory histogram
- {
- enum { TILE_SIZE = BLOCK_THREADS * ITEMS_PER_THREAD };
-
- // Sort bytes in blocked arrangement
- BlockRadixSortT(temp_storage.sort, linear_tid).Sort(items);
-
- __syncthreads();
-
- // Initialize the shared memory's run_begin and run_end for each bin
- int histo_offset = 0;
-
- #pragma unroll
- for(; histo_offset + BLOCK_THREADS <= BINS; histo_offset += BLOCK_THREADS)
- {
- temp_storage.run_begin[histo_offset + linear_tid] = TILE_SIZE;
- temp_storage.run_end[histo_offset + linear_tid] = TILE_SIZE;
- }
- // Finish up with guarded initialization if necessary
- if ((BINS % BLOCK_THREADS != 0) && (histo_offset + linear_tid < BINS))
- {
- temp_storage.run_begin[histo_offset + linear_tid] = TILE_SIZE;
- temp_storage.run_end[histo_offset + linear_tid] = TILE_SIZE;
- }
-
- __syncthreads();
-
- int flags[ITEMS_PER_THREAD]; // unused
-
- // Compute head flags to demarcate contiguous runs of the same bin in the sorted tile
- DiscontinuityOp flag_op(temp_storage);
- BlockDiscontinuityT(temp_storage.flag, linear_tid).FlagHeads(flags, items, flag_op);
-
- // Update begin for first item
- if (linear_tid == 0) temp_storage.run_begin[items[0]] = 0;
-
- __syncthreads();
-
- // Composite into histogram
- histo_offset = 0;
-
- #pragma unroll
- for(; histo_offset + BLOCK_THREADS <= BINS; histo_offset += BLOCK_THREADS)
- {
- int thread_offset = histo_offset + linear_tid;
- HistoCounter count = temp_storage.run_end[thread_offset] - temp_storage.run_begin[thread_offset];
- histogram[thread_offset] += count;
- }
- // Finish up with guarded composition if necessary
- if ((BINS % BLOCK_THREADS != 0) && (histo_offset + linear_tid < BINS))
- {
- int thread_offset = histo_offset + linear_tid;
- HistoCounter count = temp_storage.run_end[thread_offset] - temp_storage.run_begin[thread_offset];
- histogram[thread_offset] += count;
- }
- }
-
-};
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
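The sorting specialization is normally reached through the public cub::BlockHistogram front end; a hedged sketch, assuming cub.cuh is on the include path and that the BLOCK_HISTO_SORT selector bundled with this CUB version is used (thread-block geometry and kernel name are illustrative):

#include <cub/cub.cuh>

__global__ void BlockHistogramSortSketch(const unsigned char *d_samples, unsigned int *d_histogram)
{
    const int BLOCK_THREADS    = 128;
    const int ITEMS_PER_THREAD = 4;
    const int BINS             = 256;

    // Request the sort-based algorithm; cub::BlockHistogram dispatches to BlockHistogramSort
    typedef cub::BlockHistogram<unsigned char, BLOCK_THREADS, ITEMS_PER_THREAD, BINS,
                                cub::BLOCK_HISTO_SORT> BlockHistogramT;

    __shared__ typename BlockHistogramT::TempStorage temp_storage;
    __shared__ unsigned int smem_histo[BINS];

    // Load a blocked arrangement of samples (bounds handling omitted for brevity)
    unsigned char samples[ITEMS_PER_THREAD];
    int base = (blockIdx.x * BLOCK_THREADS + threadIdx.x) * ITEMS_PER_THREAD;
    for (int i = 0; i < ITEMS_PER_THREAD; ++i)
        samples[i] = d_samples[base + i];

    // Zero the shared histogram, then composite this tile into it
    BlockHistogramT(temp_storage).InitHistogram(smem_histo);
    __syncthreads();
    BlockHistogramT(temp_storage).Composite(samples, smem_histo);
    __syncthreads();

    // Write the block-local histogram back to global memory
    for (int bin = threadIdx.x; bin < BINS; bin += BLOCK_THREADS)
        atomicAdd(&d_histogram[bin], smem_histo[bin]);
}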
diff --git a/lib/kokkos/TPL/cub/block/specializations/block_reduce_raking.cuh b/lib/kokkos/TPL/cub/block/specializations/block_reduce_raking.cuh
deleted file mode 100755
index 434d25a87..000000000
--- a/lib/kokkos/TPL/cub/block/specializations/block_reduce_raking.cuh
+++ /dev/null
@@ -1,214 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * cub::BlockReduceRaking provides raking-based methods of parallel reduction across a CUDA threadblock
- */
-
-#pragma once
-
-#include "../../block/block_raking_layout.cuh"
-#include "../../warp/warp_reduce.cuh"
-#include "../../thread/thread_reduce.cuh"
-#include "../../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/**
- * \brief BlockReduceRaking provides raking-based methods of parallel reduction across a CUDA threadblock
- */
-template <
- typename T, ///< Data type being reduced
- int BLOCK_THREADS> ///< The thread block size in threads
-struct BlockReduceRaking
-{
- /// Layout type for padded threadblock raking grid
- typedef BlockRakingLayout<T, BLOCK_THREADS, 1> BlockRakingLayout;
-
- /// WarpReduce utility type
- typedef typename WarpReduce<T, 1, BlockRakingLayout::RAKING_THREADS>::InternalWarpReduce WarpReduce;
-
- /// Constants
- enum
- {
- /// Number of raking threads
- RAKING_THREADS = BlockRakingLayout::RAKING_THREADS,
-
- /// Number of raking elements per warp synchronous raking thread
- SEGMENT_LENGTH = BlockRakingLayout::SEGMENT_LENGTH,
-
- /// Cooperative work can be entirely warp synchronous
- WARP_SYNCHRONOUS = (RAKING_THREADS == BLOCK_THREADS),
-
- /// Whether or not warp-synchronous reduction should be unguarded (i.e., the number of warp-reduction elements is a power of two)
- WARP_SYNCHRONOUS_UNGUARDED = ((RAKING_THREADS & (RAKING_THREADS - 1)) == 0),
-
- /// Whether or not accesses into smem are unguarded
- RAKING_UNGUARDED = BlockRakingLayout::UNGUARDED,
-
- };
-
-
- /// Shared memory storage layout type
- struct _TempStorage
- {
- typename WarpReduce::TempStorage warp_storage; ///< Storage for warp-synchronous reduction
- typename BlockRakingLayout::TempStorage raking_grid; ///< Padded threadblock raking grid
- };
-
-
- /// Alias wrapper allowing storage to be unioned
- struct TempStorage : Uninitialized<_TempStorage> {};
-
-
- // Thread fields
- _TempStorage &temp_storage;
- int linear_tid;
-
-
- /// Constructor
- __device__ __forceinline__ BlockReduceRaking(
- TempStorage &temp_storage,
- int linear_tid)
- :
- temp_storage(temp_storage.Alias()),
- linear_tid(linear_tid)
- {}
-
-
- /// Computes a threadblock-wide reduction using addition (+) as the reduction operator. The first num_valid threads each contribute one reduction partial. The return value is only valid for thread<sub>0</sub>.
- template <bool FULL_TILE>
- __device__ __forceinline__ T Sum(
- T partial, ///< [in] Calling thread's input partial reductions
- int num_valid) ///< [in] Number of valid elements (may be less than BLOCK_THREADS)
- {
- cub::Sum reduction_op;
-
- if (WARP_SYNCHRONOUS)
- {
- // Short-circuit directly to warp synchronous reduction (unguarded if the number of active threads is a power-of-two)
- partial = WarpReduce(temp_storage.warp_storage, 0, linear_tid).template Sum<FULL_TILE, SEGMENT_LENGTH>(
- partial,
- num_valid);
- }
- else
- {
- // Place partial into shared memory grid.
- *BlockRakingLayout::PlacementPtr(temp_storage.raking_grid, linear_tid) = partial;
-
- __syncthreads();
-
- // Reduce parallelism to one warp
- if (linear_tid < RAKING_THREADS)
- {
- // Raking reduction in grid
- T *raking_segment = BlockRakingLayout::RakingPtr(temp_storage.raking_grid, linear_tid);
- partial = raking_segment[0];
-
- #pragma unroll
- for (int ITEM = 1; ITEM < SEGMENT_LENGTH; ITEM++)
- {
- // Update partial if addend is in range
- if ((FULL_TILE && RAKING_UNGUARDED) || ((linear_tid * SEGMENT_LENGTH) + ITEM < num_valid))
- {
- partial = reduction_op(partial, raking_segment[ITEM]);
- }
- }
-
- partial = WarpReduce(temp_storage.warp_storage, 0, linear_tid).template Sum<FULL_TILE && RAKING_UNGUARDED, SEGMENT_LENGTH>(
- partial,
- num_valid);
- }
- }
-
- return partial;
- }
-
-
- /// Computes a threadblock-wide reduction using the specified reduction operator. The first num_valid threads each contribute one reduction partial. The return value is only valid for thread<sub>0</sub>.
- template <
- bool FULL_TILE,
- typename ReductionOp>
- __device__ __forceinline__ T Reduce(
- T partial, ///< [in] Calling thread's input partial reductions
- int num_valid, ///< [in] Number of valid elements (may be less than BLOCK_THREADS)
- ReductionOp reduction_op) ///< [in] Binary reduction operator
- {
- if (WARP_SYNCHRONOUS)
- {
- // Short-circuit directly to warp synchronous reduction (unguarded if the number of active threads is a power-of-two)
- partial = WarpReduce(temp_storage.warp_storage, 0, linear_tid).template Reduce<FULL_TILE, SEGMENT_LENGTH>(
- partial,
- num_valid,
- reduction_op);
- }
- else
- {
- // Place partial into shared memory grid.
- *BlockRakingLayout::PlacementPtr(temp_storage.raking_grid, linear_tid) = partial;
-
- __syncthreads();
-
- // Reduce parallelism to one warp
- if (linear_tid < RAKING_THREADS)
- {
- // Raking reduction in grid
- T *raking_segment = BlockRakingLayout::RakingPtr(temp_storage.raking_grid, linear_tid);
- partial = raking_segment[0];
-
- #pragma unroll
- for (int ITEM = 1; ITEM < SEGMENT_LENGTH; ITEM++)
- {
- // Update partial if addend is in range
- if ((FULL_TILE && RAKING_UNGUARDED) || ((linear_tid * SEGMENT_LENGTH) + ITEM < num_valid))
- {
- partial = reduction_op(partial, raking_segment[ITEM]);
- }
- }
-
- partial = WarpReduce(temp_storage.warp_storage, 0, linear_tid).template Reduce<FULL_TILE && RAKING_UNGUARDED, SEGMENT_LENGTH>(
- partial,
- num_valid,
- reduction_op);
- }
- }
-
- return partial;
- }
-
-};
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
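In practice this specialization is used through the public cub::BlockReduce front end (BLOCK_REDUCE_RAKING is its default algorithm); a minimal sketch, assuming cub.cuh is on the include path, with illustrative kernel and buffer names:

#include <cub/cub.cuh>

__global__ void BlockSumSketch(const int *d_in, int *d_block_sums)
{
    const int BLOCK_THREADS = 128;
    typedef cub::BlockReduce<int, BLOCK_THREADS> BlockReduceT;
    __shared__ typename BlockReduceT::TempStorage temp_storage;

    // One item per thread, blocked across the grid
    int thread_data = d_in[blockIdx.x * BLOCK_THREADS + threadIdx.x];

    // Raking block-wide sum; the result is only valid in thread 0
    int aggregate = BlockReduceT(temp_storage).Sum(thread_data);

    if (threadIdx.x == 0)
        d_block_sums[blockIdx.x] = aggregate;
}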
diff --git a/lib/kokkos/TPL/cub/block/specializations/block_reduce_warp_reductions.cuh b/lib/kokkos/TPL/cub/block/specializations/block_reduce_warp_reductions.cuh
deleted file mode 100755
index 0e316dd17..000000000
--- a/lib/kokkos/TPL/cub/block/specializations/block_reduce_warp_reductions.cuh
+++ /dev/null
@@ -1,198 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * cub::BlockReduceWarpReductions provides variants of warp-reduction-based parallel reduction across a CUDA threadblock
- */
-
-#pragma once
-
-#include "../../warp/warp_reduce.cuh"
-#include "../../util_arch.cuh"
-#include "../../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/**
- * \brief BlockReduceWarpReductions provides variants of warp-reduction-based parallel reduction across a CUDA threadblock
- */
-template <
- typename T, ///< Data type being reduced
- int BLOCK_THREADS> ///< The thread block size in threads
-struct BlockReduceWarpReductions
-{
- /// Constants
- enum
- {
- /// Number of active warps
- WARPS = (BLOCK_THREADS + PtxArchProps::WARP_THREADS - 1) / PtxArchProps::WARP_THREADS,
-
- /// The logical warp size for warp reductions
- LOGICAL_WARP_SIZE = CUB_MIN(BLOCK_THREADS, PtxArchProps::WARP_THREADS),
-
- /// Whether or not the logical warp size evenly divides the threadblock size
- EVEN_WARP_MULTIPLE = (BLOCK_THREADS % LOGICAL_WARP_SIZE == 0)
- };
-
-
- /// WarpReduce utility type
- typedef typename WarpReduce<T, WARPS, LOGICAL_WARP_SIZE>::InternalWarpReduce WarpReduce;
-
-
- /// Shared memory storage layout type
- struct _TempStorage
- {
- typename WarpReduce::TempStorage warp_reduce; ///< Buffer for warp-synchronous reduction
- T warp_aggregates[WARPS]; ///< Shared totals from each warp-synchronous reduction
- T block_prefix; ///< Shared prefix for the entire threadblock
- };
-
- /// Alias wrapper allowing storage to be unioned
- struct TempStorage : Uninitialized<_TempStorage> {};
-
-
- // Thread fields
- _TempStorage &temp_storage;
- int linear_tid;
- int warp_id;
- int lane_id;
-
-
- /// Constructor
- __device__ __forceinline__ BlockReduceWarpReductions(
- TempStorage &temp_storage,
- int linear_tid)
- :
- temp_storage(temp_storage.Alias()),
- linear_tid(linear_tid),
- warp_id((BLOCK_THREADS <= PtxArchProps::WARP_THREADS) ?
- 0 :
- linear_tid / PtxArchProps::WARP_THREADS),
- lane_id((BLOCK_THREADS <= PtxArchProps::WARP_THREADS) ?
- linear_tid :
- linear_tid % PtxArchProps::WARP_THREADS)
- {}
-
-
- /// Returns block-wide aggregate in <em>thread</em><sub>0</sub>.
- template <
- bool FULL_TILE,
- typename ReductionOp>
- __device__ __forceinline__ T ApplyWarpAggregates(
- ReductionOp reduction_op, ///< [in] Binary scan operator
- T warp_aggregate, ///< [in] <b>[<em>lane</em><sub>0</sub>s only]</b> Warp-wide aggregate reduction of input items
- int num_valid) ///< [in] Number of valid elements (may be less than BLOCK_THREADS)
- {
- // Share lane aggregates
- if (lane_id == 0)
- {
- temp_storage.warp_aggregates[warp_id] = warp_aggregate;
- }
-
- __syncthreads();
-
- // Update total aggregate in warp 0, lane 0
- if (linear_tid == 0)
- {
- #pragma unroll
- for (int SUCCESSOR_WARP = 1; SUCCESSOR_WARP < WARPS; SUCCESSOR_WARP++)
- {
- if (FULL_TILE || (SUCCESSOR_WARP * LOGICAL_WARP_SIZE < num_valid))
- {
- warp_aggregate = reduction_op(warp_aggregate, temp_storage.warp_aggregates[SUCCESSOR_WARP]);
- }
- }
- }
-
- return warp_aggregate;
- }
-
-
- /// Computes a threadblock-wide reduction using addition (+) as the reduction operator. The first num_valid threads each contribute one reduction partial. The return value is only valid for thread<sub>0</sub>.
- template <bool FULL_TILE>
- __device__ __forceinline__ T Sum(
- T input, ///< [in] Calling thread's input partial reductions
- int num_valid) ///< [in] Number of valid elements (may be less than BLOCK_THREADS)
- {
- cub::Sum reduction_op;
- unsigned int warp_offset = warp_id * LOGICAL_WARP_SIZE;
- unsigned int warp_num_valid = (FULL_TILE && EVEN_WARP_MULTIPLE) ?
- LOGICAL_WARP_SIZE :
- (warp_offset < num_valid) ?
- num_valid - warp_offset :
- 0;
-
- // Warp reduction in every warp
- T warp_aggregate = WarpReduce(temp_storage.warp_reduce, warp_id, lane_id).template Sum<(FULL_TILE && EVEN_WARP_MULTIPLE), 1>(
- input,
- warp_num_valid);
-
- // Update outputs and block_aggregate with warp-wide aggregates from lane-0s
- return ApplyWarpAggregates<FULL_TILE>(reduction_op, warp_aggregate, num_valid);
- }
-
-
- /// Computes a threadblock-wide reduction using the specified reduction operator. The first num_valid threads each contribute one reduction partial. The return value is only valid for thread<sub>0</sub>.
- template <
- bool FULL_TILE,
- typename ReductionOp>
- __device__ __forceinline__ T Reduce(
- T input, ///< [in] Calling thread's input partial reductions
- int num_valid, ///< [in] Number of valid elements (may be less than BLOCK_THREADS)
- ReductionOp reduction_op) ///< [in] Binary reduction operator
- {
- unsigned int warp_id = (WARPS == 1) ? 0 : (linear_tid / LOGICAL_WARP_SIZE);
- unsigned int warp_offset = warp_id * LOGICAL_WARP_SIZE;
- unsigned int warp_num_valid = (FULL_TILE && EVEN_WARP_MULTIPLE) ?
- LOGICAL_WARP_SIZE :
- (warp_offset < num_valid) ?
- num_valid - warp_offset :
- 0;
-
- // Warp reduction in every warp
- T warp_aggregate = WarpReduce(temp_storage.warp_reduce, warp_id, lane_id).template Reduce<(FULL_TILE && EVEN_WARP_MULTIPLE), 1>(
- input,
- warp_num_valid,
- reduction_op);
-
- // Update outputs and block_aggregate with warp-wide aggregates from lane-0s
- return ApplyWarpAggregates<FULL_TILE>(reduction_op, warp_aggregate, num_valid);
- }
-
-};
-
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
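A hedged sketch of reaching this specialization through the public cub::BlockReduce interface with the BLOCK_REDUCE_WARP_REDUCTIONS selector, including the partially full last tile that the num_valid guards above exist for; everything outside the cub:: names is an illustrative assumption:

#include <cub/cub.cuh>

__global__ void GuardedBlockSumSketch(const int *d_in, int *d_block_sums, int num_items)
{
    const int BLOCK_THREADS = 128;
    typedef cub::BlockReduce<int, BLOCK_THREADS, cub::BLOCK_REDUCE_WARP_REDUCTIONS> BlockReduceT;
    __shared__ typename BlockReduceT::TempStorage temp_storage;

    int block_offset = blockIdx.x * BLOCK_THREADS;
    int num_valid    = min(BLOCK_THREADS, num_items - block_offset);

    // Threads beyond num_valid contribute nothing
    int thread_data = 0;
    if (threadIdx.x < num_valid)
        thread_data = d_in[block_offset + threadIdx.x];

    // Only the first num_valid threads contribute; the result is valid in thread 0
    int aggregate = BlockReduceT(temp_storage).Sum(thread_data, num_valid);

    if (threadIdx.x == 0)
        d_block_sums[blockIdx.x] = aggregate;
}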
diff --git a/lib/kokkos/TPL/cub/block/specializations/block_scan_raking.cuh b/lib/kokkos/TPL/cub/block/specializations/block_scan_raking.cuh
deleted file mode 100755
index 75e15d95c..000000000
--- a/lib/kokkos/TPL/cub/block/specializations/block_scan_raking.cuh
+++ /dev/null
@@ -1,761 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-
-/**
- * \file
- * cub::BlockScanRaking provides variants of raking-based parallel prefix scan across a CUDA threadblock.
- */
-
-#pragma once
-
-#include "../../util_arch.cuh"
-#include "../../block/block_raking_layout.cuh"
-#include "../../thread/thread_reduce.cuh"
-#include "../../thread/thread_scan.cuh"
-#include "../../warp/warp_scan.cuh"
-#include "../../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/**
- * \brief BlockScanRaking provides variants of raking-based parallel prefix scan across a CUDA threadblock.
- */
-template <
- typename T, ///< Data type being scanned
- int BLOCK_THREADS, ///< The thread block size in threads
- bool MEMOIZE> ///< Whether or not to buffer outer raking scan partials to incur fewer shared memory reads at the expense of higher register pressure
-struct BlockScanRaking
-{
- /// Layout type for padded threadblock raking grid
- typedef BlockRakingLayout<T, BLOCK_THREADS> BlockRakingLayout;
-
- /// Constants
- enum
- {
- /// Number of active warps
- WARPS = (BLOCK_THREADS + PtxArchProps::WARP_THREADS - 1) / PtxArchProps::WARP_THREADS,
-
- /// Number of raking threads
- RAKING_THREADS = BlockRakingLayout::RAKING_THREADS,
-
- /// Number of raking elements per warp synchronous raking thread
- SEGMENT_LENGTH = BlockRakingLayout::SEGMENT_LENGTH,
-
- /// Cooperative work can be entirely warp synchronous
- WARP_SYNCHRONOUS = (BLOCK_THREADS == RAKING_THREADS),
- };
-
- /// WarpScan utility type
- typedef WarpScan<T, 1, RAKING_THREADS> WarpScan;
-
- /// Shared memory storage layout type
- struct _TempStorage
- {
- typename WarpScan::TempStorage warp_scan; ///< Buffer for warp-synchronous scan
- typename BlockRakingLayout::TempStorage raking_grid; ///< Padded threadblock raking grid
- T block_aggregate; ///< Block aggregate
- };
-
-
- /// Alias wrapper allowing storage to be unioned
- struct TempStorage : Uninitialized<_TempStorage> {};
-
-
- // Thread fields
- _TempStorage &temp_storage;
- int linear_tid;
- T cached_segment[SEGMENT_LENGTH];
-
-
- /// Constructor
- __device__ __forceinline__ BlockScanRaking(
- TempStorage &temp_storage,
- int linear_tid)
- :
- temp_storage(temp_storage.Alias()),
- linear_tid(linear_tid)
- {}
-
- /// Performs upsweep raking reduction, returning the aggregate
- template <typename ScanOp>
- __device__ __forceinline__ T Upsweep(
- ScanOp scan_op)
- {
- T *smem_raking_ptr = BlockRakingLayout::RakingPtr(temp_storage.raking_grid, linear_tid);
- T *raking_ptr;
-
- if (MEMOIZE)
- {
- // Copy data into registers
- #pragma unroll
- for (int i = 0; i < SEGMENT_LENGTH; i++)
- {
- cached_segment[i] = smem_raking_ptr[i];
- }
- raking_ptr = cached_segment;
- }
- else
- {
- raking_ptr = smem_raking_ptr;
- }
-
- T raking_partial = raking_ptr[0];
-
- #pragma unroll
- for (int i = 1; i < SEGMENT_LENGTH; i++)
- {
- if ((BlockRakingLayout::UNGUARDED) || (((linear_tid * SEGMENT_LENGTH) + i) < BLOCK_THREADS))
- {
- raking_partial = scan_op(raking_partial, raking_ptr[i]);
- }
- }
-
- return raking_partial;
- }
-
-
- /// Performs exclusive downsweep raking scan
- template <typename ScanOp>
- __device__ __forceinline__ void ExclusiveDownsweep(
- ScanOp scan_op,
- T raking_partial,
- bool apply_prefix = true)
- {
- T *smem_raking_ptr = BlockRakingLayout::RakingPtr(temp_storage.raking_grid, linear_tid);
-
- T *raking_ptr = (MEMOIZE) ?
- cached_segment :
- smem_raking_ptr;
-
- ThreadScanExclusive<SEGMENT_LENGTH>(raking_ptr, raking_ptr, scan_op, raking_partial, apply_prefix);
-
- if (MEMOIZE)
- {
- // Copy data back to smem
- #pragma unroll
- for (int i = 0; i < SEGMENT_LENGTH; i++)
- {
- smem_raking_ptr[i] = cached_segment[i];
- }
- }
- }
-
-
- /// Performs inclusive downsweep raking scan
- template <typename ScanOp>
- __device__ __forceinline__ void InclusiveDownsweep(
- ScanOp scan_op,
- T raking_partial,
- bool apply_prefix = true)
- {
- T *smem_raking_ptr = BlockRakingLayout::RakingPtr(temp_storage.raking_grid, linear_tid);
-
- T *raking_ptr = (MEMOIZE) ?
- cached_segment :
- smem_raking_ptr;
-
- ThreadScanInclusive<SEGMENT_LENGTH>(raking_ptr, raking_ptr, scan_op, raking_partial, apply_prefix);
-
- if (MEMOIZE)
- {
- // Copy data back to smem
- #pragma unroll
- for (int i = 0; i < SEGMENT_LENGTH; i++)
- {
- smem_raking_ptr[i] = cached_segment[i];
- }
- }
- }
-
-
- /// Computes an exclusive threadblock-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- template <typename ScanOp>
- __device__ __forceinline__ void ExclusiveScan(
- T input, ///< [in] Calling thread's input items
- T &output, ///< [out] Calling thread's output items (may be aliased to \p input)
- const T &identity, ///< [in] Identity value
- ScanOp scan_op, ///< [in] Binary scan operator
- T &block_aggregate) ///< [out] Threadblock-wide aggregate reduction of input items
- {
- if (WARP_SYNCHRONOUS)
- {
- // Short-circuit directly to warp scan
- WarpScan(temp_storage.warp_scan, 0, linear_tid).ExclusiveScan(
- input,
- output,
- identity,
- scan_op,
- block_aggregate);
- }
- else
- {
- // Place thread partial into shared memory raking grid
- T *placement_ptr = BlockRakingLayout::PlacementPtr(temp_storage.raking_grid, linear_tid);
- *placement_ptr = input;
-
- __syncthreads();
-
- // Reduce parallelism down to just raking threads
- if (linear_tid < RAKING_THREADS)
- {
- // Raking upsweep reduction in grid
- T raking_partial = Upsweep(scan_op);
-
- // Exclusive warp synchronous scan
- WarpScan(temp_storage.warp_scan, 0, linear_tid).ExclusiveScan(
- raking_partial,
- raking_partial,
- identity,
- scan_op,
- temp_storage.block_aggregate);
-
- // Exclusive raking downsweep scan
- ExclusiveDownsweep(scan_op, raking_partial);
- }
-
- __syncthreads();
-
- // Grab thread prefix from shared memory
- output = *placement_ptr;
-
- // Retrieve block aggregate
- block_aggregate = temp_storage.block_aggregate;
- }
- }
-
-
- /// Computes an exclusive threadblock-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element. The call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by <em>lane</em><sub>0</sub> in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- template <
- typename ScanOp,
- typename BlockPrefixOp>
- __device__ __forceinline__ void ExclusiveScan(
- T input, ///< [in] Calling thread's input item
- T &output, ///< [out] Calling thread's output item (may be aliased to \p input)
- T identity, ///< [in] Identity value
- ScanOp scan_op, ///< [in] Binary scan operator
- T &block_aggregate, ///< [out] Threadblock-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value)
- BlockPrefixOp &block_prefix_op) ///< [in-out] <b>[<em>warp</em><sub>0</sub> only]</b> Call-back functor for specifying a threadblock-wide prefix to be applied to all inputs.
- {
- if (WARP_SYNCHRONOUS)
- {
- // Short-circuit directly to warp scan
- WarpScan(temp_storage.warp_scan, 0, linear_tid).ExclusiveScan(
- input,
- output,
- identity,
- scan_op,
- block_aggregate,
- block_prefix_op);
- }
- else
- {
- // Place thread partial into shared memory raking grid
- T *placement_ptr = BlockRakingLayout::PlacementPtr(temp_storage.raking_grid, linear_tid);
- *placement_ptr = input;
-
- __syncthreads();
-
- // Reduce parallelism down to just raking threads
- if (linear_tid < RAKING_THREADS)
- {
- // Raking upsweep reduction in grid
- T raking_partial = Upsweep(scan_op);
-
- // Exclusive warp synchronous scan
- WarpScan(temp_storage.warp_scan, 0, linear_tid).ExclusiveScan(
- raking_partial,
- raking_partial,
- identity,
- scan_op,
- temp_storage.block_aggregate,
- block_prefix_op);
-
- // Exclusive raking downsweep scan
- ExclusiveDownsweep(scan_op, raking_partial);
- }
-
- __syncthreads();
-
- // Grab thread prefix from shared memory
- output = *placement_ptr;
-
- // Retrieve block aggregate
- block_aggregate = temp_storage.block_aggregate;
- }
- }
-
-
- /// Computes an exclusive threadblock-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide \p block_aggregate of all inputs. With no identity value, the output computed for <em>thread</em><sub>0</sub> is undefined.
- template <typename ScanOp>
- __device__ __forceinline__ void ExclusiveScan(
- T input, ///< [in] Calling thread's input item
- T &output, ///< [out] Calling thread's output item (may be aliased to \p input)
- ScanOp scan_op, ///< [in] Binary scan operator
- T &block_aggregate) ///< [out] Threadblock-wide aggregate reduction of input items
- {
- if (WARP_SYNCHRONOUS)
- {
- // Short-circuit directly to warp scan
- WarpScan(temp_storage.warp_scan, 0, linear_tid).ExclusiveScan(
- input,
- output,
- scan_op,
- block_aggregate);
- }
- else
- {
- // Place thread partial into shared memory raking grid
- T *placement_ptr = BlockRakingLayout::PlacementPtr(temp_storage.raking_grid, linear_tid);
- *placement_ptr = input;
-
- __syncthreads();
-
- // Reduce parallelism down to just raking threads
- if (linear_tid < RAKING_THREADS)
- {
- // Raking upsweep reduction in grid
- T raking_partial = Upsweep(scan_op);
-
- // Exclusive warp synchronous scan
- WarpScan(temp_storage.warp_scan, 0, linear_tid).ExclusiveScan(
- raking_partial,
- raking_partial,
- scan_op,
- temp_storage.block_aggregate);
-
- // Exclusive raking downsweep scan
- ExclusiveDownsweep(scan_op, raking_partial, (linear_tid != 0));
- }
-
- __syncthreads();
-
- // Grab thread prefix from shared memory
- output = *placement_ptr;
-
- // Retrieve block aggregate
- block_aggregate = temp_storage.block_aggregate;
- }
- }
-
-
- /// Computes an exclusive threadblock-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element. The call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by <em>lane</em><sub>0</sub> in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- template <
- typename ScanOp,
- typename BlockPrefixOp>
- __device__ __forceinline__ void ExclusiveScan(
- T input, ///< [in] Calling thread's input item
- T &output, ///< [out] Calling thread's output item (may be aliased to \p input)
- ScanOp scan_op, ///< [in] Binary scan operator
- T &block_aggregate, ///< [out] Threadblock-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value)
- BlockPrefixOp &block_prefix_op) ///< [in-out] <b>[<em>warp</em><sub>0</sub> only]</b> Call-back functor for specifying a threadblock-wide prefix to be applied to all inputs.
- {
- if (WARP_SYNCHRONOUS)
- {
- // Short-circuit directly to warp scan
- WarpScan(temp_storage.warp_scan, 0, linear_tid).ExclusiveScan(
- input,
- output,
- scan_op,
- block_aggregate,
- block_prefix_op);
- }
- else
- {
- // Place thread partial into shared memory raking grid
- T *placement_ptr = BlockRakingLayout::PlacementPtr(temp_storage.raking_grid, linear_tid);
- *placement_ptr = input;
-
- __syncthreads();
-
- // Reduce parallelism down to just raking threads
- if (linear_tid < RAKING_THREADS)
- {
- // Raking upsweep reduction in grid
- T raking_partial = Upsweep(scan_op);
-
- // Exclusive warp synchronous scan
- WarpScan(temp_storage.warp_scan, 0, linear_tid).ExclusiveScan(
- raking_partial,
- raking_partial,
- scan_op,
- temp_storage.block_aggregate,
- block_prefix_op);
-
- // Exclusive raking downsweep scan
- ExclusiveDownsweep(scan_op, raking_partial);
- }
-
- __syncthreads();
-
- // Grab thread prefix from shared memory
- output = *placement_ptr;
-
- // Retrieve block aggregate
- block_aggregate = temp_storage.block_aggregate;
- }
- }
-
-
- /// Computes an exclusive threadblock-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- __device__ __forceinline__ void ExclusiveSum(
- T input, ///< [in] Calling thread's input item
- T &output, ///< [out] Calling thread's output item (may be aliased to \p input)
- T &block_aggregate) ///< [out] Threadblock-wide aggregate reduction of input items
- {
- if (WARP_SYNCHRONOUS)
- {
- // Short-circuit directly to warp scan
- WarpScan(temp_storage.warp_scan, 0, linear_tid).ExclusiveSum(
- input,
- output,
- block_aggregate);
- }
- else
- {
- // Raking scan
- Sum scan_op;
-
- // Place thread partial into shared memory raking grid
- T *placement_ptr = BlockRakingLayout::PlacementPtr(temp_storage.raking_grid, linear_tid);
- *placement_ptr = input;
-
- __syncthreads();
-
- // Reduce parallelism down to just raking threads
- if (linear_tid < RAKING_THREADS)
- {
- // Raking upsweep reduction in grid
- T raking_partial = Upsweep(scan_op);
-
- // Exclusive warp synchronous scan
- WarpScan(temp_storage.warp_scan, 0, linear_tid).ExclusiveSum(
- raking_partial,
- raking_partial,
- temp_storage.block_aggregate);
-
- // Exclusive raking downsweep scan
- ExclusiveDownsweep(scan_op, raking_partial);
- }
-
- __syncthreads();
-
- // Grab thread prefix from shared memory
- output = *placement_ptr;
-
- // Retrieve block aggregate
- block_aggregate = temp_storage.block_aggregate;
- }
- }
-
-
- /// Computes an exclusive threadblock-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Instead of using 0 as the threadblock-wide prefix, the call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by <em>lane</em><sub>0</sub> in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- template <typename BlockPrefixOp>
- __device__ __forceinline__ void ExclusiveSum(
- T input, ///< [in] Calling thread's input item
- T &output, ///< [out] Calling thread's output item (may be aliased to \p input)
- T &block_aggregate, ///< [out] Threadblock-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value)
- BlockPrefixOp &block_prefix_op) ///< [in-out] <b>[<em>warp</em><sub>0</sub> only]</b> Call-back functor for specifying a threadblock-wide prefix to be applied to all inputs.
- {
- if (WARP_SYNCHRONOUS)
- {
- // Short-circuit directly to warp scan
- WarpScan(temp_storage.warp_scan, 0, linear_tid).ExclusiveSum(
- input,
- output,
- block_aggregate,
- block_prefix_op);
- }
- else
- {
- // Raking scan
- Sum scan_op;
-
- // Place thread partial into shared memory raking grid
- T *placement_ptr = BlockRakingLayout::PlacementPtr(temp_storage.raking_grid, linear_tid);
- *placement_ptr = input;
-
- __syncthreads();
-
- // Reduce parallelism down to just raking threads
- if (linear_tid < RAKING_THREADS)
- {
- // Raking upsweep reduction in grid
- T raking_partial = Upsweep(scan_op);
-
- // Exclusive warp synchronous scan
- WarpScan(temp_storage.warp_scan, 0, linear_tid).ExclusiveSum(
- raking_partial,
- raking_partial,
- temp_storage.block_aggregate,
- block_prefix_op);
-
- // Exclusive raking downsweep scan
- ExclusiveDownsweep(scan_op, raking_partial);
- }
-
- __syncthreads();
-
- // Grab thread prefix from shared memory
- output = *placement_ptr;
-
- // Retrieve block aggregate
- block_aggregate = temp_storage.block_aggregate;
- }
- }
-
-
- /// Computes an inclusive threadblock-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- template <typename ScanOp>
- __device__ __forceinline__ void InclusiveScan(
- T input, ///< [in] Calling thread's input item
- T &output, ///< [out] Calling thread's output item (may be aliased to \p input)
- ScanOp scan_op, ///< [in] Binary scan operator
- T &block_aggregate) ///< [out] Threadblock-wide aggregate reduction of input items
- {
- if (WARP_SYNCHRONOUS)
- {
- // Short-circuit directly to warp scan
- WarpScan(temp_storage.warp_scan, 0, linear_tid).InclusiveScan(
- input,
- output,
- scan_op,
- block_aggregate);
- }
- else
- {
- // Place thread partial into shared memory raking grid
- T *placement_ptr = BlockRakingLayout::PlacementPtr(temp_storage.raking_grid, linear_tid);
- *placement_ptr = input;
-
- __syncthreads();
-
- // Reduce parallelism down to just raking threads
- if (linear_tid < RAKING_THREADS)
- {
- // Raking upsweep reduction in grid
- T raking_partial = Upsweep(scan_op);
-
- // Exclusive warp synchronous scan
- WarpScan(temp_storage.warp_scan, 0, linear_tid).ExclusiveScan(
- raking_partial,
- raking_partial,
- scan_op,
- temp_storage.block_aggregate);
-
- // Inclusive raking downsweep scan
- InclusiveDownsweep(scan_op, raking_partial, (linear_tid != 0));
- }
-
- __syncthreads();
-
- // Grab thread prefix from shared memory
- output = *placement_ptr;
-
- // Retrieve block aggregate
- block_aggregate = temp_storage.block_aggregate;
- }
- }
-
-
- /// Computes an inclusive threadblock-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element. The call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by <em>lane</em><sub>0</sub> in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- template <
- typename ScanOp,
- typename BlockPrefixOp>
- __device__ __forceinline__ void InclusiveScan(
- T input, ///< [in] Calling thread's input item
- T &output, ///< [out] Calling thread's output item (may be aliased to \p input)
- ScanOp scan_op, ///< [in] Binary scan operator
- T &block_aggregate, ///< [out] Threadblock-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value)
- BlockPrefixOp &block_prefix_op) ///< [in-out] <b>[<em>warp</em><sub>0</sub> only]</b> Call-back functor for specifying a threadblock-wide prefix to be applied to all inputs.
- {
- if (WARP_SYNCHRONOUS)
- {
- // Short-circuit directly to warp scan
- WarpScan(temp_storage.warp_scan, 0, linear_tid).InclusiveScan(
- input,
- output,
- scan_op,
- block_aggregate,
- block_prefix_op);
- }
- else
- {
- // Place thread partial into shared memory raking grid
- T *placement_ptr = BlockRakingLayout::PlacementPtr(temp_storage.raking_grid, linear_tid);
- *placement_ptr = input;
-
- __syncthreads();
-
- // Reduce parallelism down to just raking threads
- if (linear_tid < RAKING_THREADS)
- {
- // Raking upsweep reduction in grid
- T raking_partial = Upsweep(scan_op);
-
- // Warp synchronous scan
- WarpScan(temp_storage.warp_scan, 0, linear_tid).ExclusiveScan(
- raking_partial,
- raking_partial,
- scan_op,
- temp_storage.block_aggregate,
- block_prefix_op);
-
- // Inclusive raking downsweep scan
- InclusiveDownsweep(scan_op, raking_partial);
- }
-
- __syncthreads();
-
- // Grab thread prefix from shared memory
- output = *placement_ptr;
-
- // Retrieve block aggregate
- block_aggregate = temp_storage.block_aggregate;
- }
- }
-
-
- /// Computes an inclusive threadblock-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- __device__ __forceinline__ void InclusiveSum(
- T input, ///< [in] Calling thread's input item
- T &output, ///< [out] Calling thread's output item (may be aliased to \p input)
- T &block_aggregate) ///< [out] Threadblock-wide aggregate reduction of input items
- {
- if (WARP_SYNCHRONOUS)
- {
- // Short-circuit directly to warp scan
- WarpScan(temp_storage.warp_scan, 0, linear_tid).InclusiveSum(
- input,
- output,
- block_aggregate);
- }
- else
- {
- // Raking scan
- Sum scan_op;
-
- // Place thread partial into shared memory raking grid
- T *placement_ptr = BlockRakingLayout::PlacementPtr(temp_storage.raking_grid, linear_tid);
- *placement_ptr = input;
-
- __syncthreads();
-
- // Reduce parallelism down to just raking threads
- if (linear_tid < RAKING_THREADS)
- {
- // Raking upsweep reduction in grid
- T raking_partial = Upsweep(scan_op);
-
- // Exclusive warp synchronous scan
- WarpScan(temp_storage.warp_scan, 0, linear_tid).ExclusiveSum(
- raking_partial,
- raking_partial,
- temp_storage.block_aggregate);
-
- // Inclusive raking downsweep scan
- InclusiveDownsweep(scan_op, raking_partial, (linear_tid != 0));
- }
-
- __syncthreads();
-
- // Grab thread prefix from shared memory
- output = *placement_ptr;
-
- // Retrieve block aggregate
- block_aggregate = temp_storage.block_aggregate;
- }
- }
-
-
- /// Computes an inclusive threadblock-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Instead of using 0 as the threadblock-wide prefix, the call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by <em>lane</em><sub>0</sub> in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- template <typename BlockPrefixOp>
- __device__ __forceinline__ void InclusiveSum(
- T input, ///< [in] Calling thread's input item
- T &output, ///< [out] Calling thread's output item (may be aliased to \p input)
- T &block_aggregate, ///< [out] Threadblock-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value)
- BlockPrefixOp &block_prefix_op) ///< [in-out] <b>[<em>warp</em><sub>0</sub> only]</b> Call-back functor for specifying a threadblock-wide prefix to be applied to all inputs.
- {
- if (WARP_SYNCHRONOUS)
- {
- // Short-circuit directly to warp scan
- WarpScan(temp_storage.warp_scan, 0, linear_tid).InclusiveSum(
- input,
- output,
- block_aggregate,
- block_prefix_op);
- }
- else
- {
- // Raking scan
- Sum scan_op;
-
- // Place thread partial into shared memory raking grid
- T *placement_ptr = BlockRakingLayout::PlacementPtr(temp_storage.raking_grid, linear_tid);
- *placement_ptr = input;
-
- __syncthreads();
-
- // Reduce parallelism down to just raking threads
- if (linear_tid < RAKING_THREADS)
- {
- // Raking upsweep reduction in grid
- T raking_partial = Upsweep(scan_op);
-
- // Warp synchronous scan
- WarpScan(temp_storage.warp_scan, 0, linear_tid).ExclusiveSum(
- raking_partial,
- raking_partial,
- temp_storage.block_aggregate,
- block_prefix_op);
-
- // Inclusive raking downsweep scan
- InclusiveDownsweep(scan_op, raking_partial);
- }
-
- __syncthreads();
-
- // Grab thread prefix from shared memory
- output = *placement_ptr;
-
- // Retrieve block aggregate
- block_aggregate = temp_storage.block_aggregate;
- }
- }
-
-};
-
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
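The raking scan is normally driven through the public cub::BlockScan front end (BLOCK_SCAN_RAKING is its default algorithm); a minimal sketch, assuming cub.cuh is on the include path, with illustrative kernel and buffer names:

#include <cub/cub.cuh>

__global__ void BlockExclusiveSumSketch(const int *d_in, int *d_out, int *d_block_aggregates)
{
    const int BLOCK_THREADS = 128;
    typedef cub::BlockScan<int, BLOCK_THREADS> BlockScanT;
    __shared__ typename BlockScanT::TempStorage temp_storage;

    int tid = blockIdx.x * BLOCK_THREADS + threadIdx.x;
    int thread_data = d_in[tid];
    int block_aggregate;

    // Exclusive prefix sum across the block, plus the block-wide total
    BlockScanT(temp_storage).ExclusiveSum(thread_data, thread_data, block_aggregate);

    d_out[tid] = thread_data;
    if (threadIdx.x == 0)
        d_block_aggregates[blockIdx.x] = block_aggregate;
}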
diff --git a/lib/kokkos/TPL/cub/block/specializations/block_scan_warp_scans.cuh b/lib/kokkos/TPL/cub/block/specializations/block_scan_warp_scans.cuh
deleted file mode 100755
index f7af3613d..000000000
--- a/lib/kokkos/TPL/cub/block/specializations/block_scan_warp_scans.cuh
+++ /dev/null
@@ -1,342 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * cub::BlockScanWarpScans provides warpscan-based variants of parallel prefix scan across a CUDA threadblock.
- */
-
-#pragma once
-
-#include "../../util_arch.cuh"
-#include "../../warp/warp_scan.cuh"
-#include "../../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-/**
- * \brief BlockScanWarpScans provides warpscan-based variants of parallel prefix scan across a CUDA threadblock.
- */
-template <
- typename T,
- int BLOCK_THREADS>
-struct BlockScanWarpScans
-{
- /// Constants
- enum
- {
- /// Number of active warps
- WARPS = (BLOCK_THREADS + PtxArchProps::WARP_THREADS - 1) / PtxArchProps::WARP_THREADS,
- };
-
- /// WarpScan utility type
- typedef WarpScan<T, WARPS, PtxArchProps::WARP_THREADS> WarpScan;
-
- /// Shared memory storage layout type
- struct _TempStorage
- {
- typename WarpScan::TempStorage warp_scan; ///< Buffer for warp-synchronous scan
- T warp_aggregates[WARPS]; ///< Shared totals from each warp-synchronous scan
- T block_prefix; ///< Shared prefix for the entire threadblock
- };
-
-
- /// Alias wrapper allowing storage to be unioned
- struct TempStorage : Uninitialized<_TempStorage> {};
-
-
- // Thread fields
- _TempStorage &temp_storage;
- int linear_tid;
- int warp_id;
- int lane_id;
-
-
- /// Constructor
- __device__ __forceinline__ BlockScanWarpScans(
- TempStorage &temp_storage,
- int linear_tid)
- :
- temp_storage(temp_storage.Alias()),
- linear_tid(linear_tid),
- warp_id((BLOCK_THREADS <= PtxArchProps::WARP_THREADS) ?
- 0 :
- linear_tid / PtxArchProps::WARP_THREADS),
- lane_id((BLOCK_THREADS <= PtxArchProps::WARP_THREADS) ?
- linear_tid :
- linear_tid % PtxArchProps::WARP_THREADS)
- {}
-
-
- /// Update the calling thread's partial reduction with the warp-wide aggregates from preceding warps. Also computes the threadblock-wide \p block_aggregate of all inputs for every thread.
- template <typename ScanOp>
- __device__ __forceinline__ void ApplyWarpAggregates(
- T &partial, ///< [out] The calling thread's partial reduction
- ScanOp scan_op, ///< [in] Binary scan operator
- T warp_aggregate, ///< [in] <b>[<em>lane</em><sub>0</sub>s only]</b> Warp-wide aggregate reduction of input items
- T &block_aggregate, ///< [out] Threadblock-wide aggregate reduction of input items
- bool lane_valid = true) ///< [in] Whether or not the partial belonging to the current thread is valid
- {
- // Share lane aggregates
- temp_storage.warp_aggregates[warp_id] = warp_aggregate;
-
- __syncthreads();
-
- block_aggregate = temp_storage.warp_aggregates[0];
-
- #pragma unroll
- for (int WARP = 1; WARP < WARPS; WARP++)
- {
- if (warp_id == WARP)
- {
- partial = (lane_valid) ?
- scan_op(block_aggregate, partial) : // fold it in our valid partial
- block_aggregate; // replace our invalid partial with the aggregate
- }
-
- block_aggregate = scan_op(block_aggregate, temp_storage.warp_aggregates[WARP]);
- }
- }
-
-
- /// Computes an exclusive threadblock-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- template <typename ScanOp>
- __device__ __forceinline__ void ExclusiveScan(
- T input, ///< [in] Calling thread's input items
- T &output, ///< [out] Calling thread's output items (may be aliased to \p input)
- const T &identity, ///< [in] Identity value
- ScanOp scan_op, ///< [in] Binary scan operator
- T &block_aggregate) ///< [out] Threadblock-wide aggregate reduction of input items
- {
- T warp_aggregate;
- WarpScan(temp_storage.warp_scan, warp_id, lane_id).ExclusiveScan(input, output, identity, scan_op, warp_aggregate);
-
- // Update outputs and block_aggregate with warp-wide aggregates
- ApplyWarpAggregates(output, scan_op, warp_aggregate, block_aggregate);
- }
-
-
- /// Computes an exclusive threadblock-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element. The call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by <em>lane</em><sub>0</sub> in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- template <
- typename ScanOp,
- typename BlockPrefixOp>
- __device__ __forceinline__ void ExclusiveScan(
- T input, ///< [in] Calling thread's input item
- T &output, ///< [out] Calling thread's output item (may be aliased to \p input)
- T identity, ///< [in] Identity value
- ScanOp scan_op, ///< [in] Binary scan operator
- T &block_aggregate, ///< [out] Threadblock-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value)
- BlockPrefixOp &block_prefix_op) ///< [in-out] <b>[<em>warp</em><sub>0</sub> only]</b> Call-back functor for specifying a threadblock-wide prefix to be applied to all inputs.
- {
- ExclusiveScan(input, output, identity, scan_op, block_aggregate);
-
- // Compute and share threadblock prefix
- if (warp_id == 0)
- {
- temp_storage.block_prefix = block_prefix_op(block_aggregate);
- }
-
- __syncthreads();
-
- // Incorporate threadblock prefix into outputs
- output = scan_op(temp_storage.block_prefix, output);
- }
-
-
- /// Computes an exclusive threadblock-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide \p block_aggregate of all inputs. With no identity value, the output computed for <em>thread</em><sub>0</sub> is undefined.
- template <typename ScanOp>
- __device__ __forceinline__ void ExclusiveScan(
- T input, ///< [in] Calling thread's input item
- T &output, ///< [out] Calling thread's output item (may be aliased to \p input)
- ScanOp scan_op, ///< [in] Binary scan operator
- T &block_aggregate) ///< [out] Threadblock-wide aggregate reduction of input items
- {
- T warp_aggregate;
- WarpScan(temp_storage.warp_scan, warp_id, lane_id).ExclusiveScan(input, output, scan_op, warp_aggregate);
-
- // Update outputs and block_aggregate with warp-wide aggregates
- ApplyWarpAggregates(output, scan_op, warp_aggregate, block_aggregate, (lane_id > 0));
- }
-
-
- /// Computes an exclusive threadblock-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element. The call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by <em>lane</em><sub>0</sub> in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- template <
- typename ScanOp,
- typename BlockPrefixOp>
- __device__ __forceinline__ void ExclusiveScan(
- T input, ///< [in] Calling thread's input item
- T &output, ///< [out] Calling thread's output item (may be aliased to \p input)
- ScanOp scan_op, ///< [in] Binary scan operator
- T &block_aggregate, ///< [out] Threadblock-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value)
- BlockPrefixOp &block_prefix_op) ///< [in-out] <b>[<em>warp</em><sub>0</sub> only]</b> Call-back functor for specifying a threadblock-wide prefix to be applied to all inputs.
- {
- ExclusiveScan(input, output, scan_op, block_aggregate);
-
- // Compute and share threadblock prefix
- if (warp_id == 0)
- {
- temp_storage.block_prefix = block_prefix_op(block_aggregate);
- }
-
- __syncthreads();
-
- // Incorporate threadblock prefix into outputs
- output = (linear_tid == 0) ?
- temp_storage.block_prefix :
- scan_op(temp_storage.block_prefix, output);
- }
-
-
- /// Computes an exclusive threadblock-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- __device__ __forceinline__ void ExclusiveSum(
- T input, ///< [in] Calling thread's input item
- T &output, ///< [out] Calling thread's output item (may be aliased to \p input)
- T &block_aggregate) ///< [out] Threadblock-wide aggregate reduction of input items
- {
- T warp_aggregate;
- WarpScan(temp_storage.warp_scan, warp_id, lane_id).ExclusiveSum(input, output, warp_aggregate);
-
- // Update outputs and block_aggregate with warp-wide aggregates from lane-0s
- ApplyWarpAggregates(output, Sum(), warp_aggregate, block_aggregate);
- }
-
-
- /// Computes an exclusive threadblock-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Instead of using 0 as the threadblock-wide prefix, the call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by <em>lane</em><sub>0</sub> in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- template <typename BlockPrefixOp>
- __device__ __forceinline__ void ExclusiveSum(
- T input, ///< [in] Calling thread's input item
- T &output, ///< [out] Calling thread's output item (may be aliased to \p input)
- T &block_aggregate, ///< [out] Threadblock-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value)
- BlockPrefixOp &block_prefix_op) ///< [in-out] <b>[<em>warp</em><sub>0</sub> only]</b> Call-back functor for specifying a threadblock-wide prefix to be applied to all inputs.
- {
- ExclusiveSum(input, output, block_aggregate);
-
- // Compute and share threadblock prefix
- if (warp_id == 0)
- {
- temp_storage.block_prefix = block_prefix_op(block_aggregate);
- }
-
- __syncthreads();
-
- // Incorporate threadblock prefix into outputs
- Sum scan_op;
- output = scan_op(temp_storage.block_prefix, output);
- }
-
-
- /// Computes an inclusive threadblock-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- template <typename ScanOp>
- __device__ __forceinline__ void InclusiveScan(
- T input, ///< [in] Calling thread's input item
- T &output, ///< [out] Calling thread's output item (may be aliased to \p input)
- ScanOp scan_op, ///< [in] Binary scan operator
- T &block_aggregate) ///< [out] Threadblock-wide aggregate reduction of input items
- {
- T warp_aggregate;
- WarpScan(temp_storage.warp_scan, warp_id, lane_id).InclusiveScan(input, output, scan_op, warp_aggregate);
-
- // Update outputs and block_aggregate with warp-wide aggregates from lane-0s
- ApplyWarpAggregates(output, scan_op, warp_aggregate, block_aggregate);
-
- }
-
-
-    /// Computes an inclusive threadblock-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element. The call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by <em>lane</em><sub>0</sub> in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- template <
- typename ScanOp,
- typename BlockPrefixOp>
- __device__ __forceinline__ void InclusiveScan(
- T input, ///< [in] Calling thread's input item
- T &output, ///< [out] Calling thread's output item (may be aliased to \p input)
- ScanOp scan_op, ///< [in] Binary scan operator
- T &block_aggregate, ///< [out] Threadblock-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value)
- BlockPrefixOp &block_prefix_op) ///< [in-out] <b>[<em>warp</em><sub>0</sub> only]</b> Call-back functor for specifying a threadblock-wide prefix to be applied to all inputs.
- {
- InclusiveScan(input, output, scan_op, block_aggregate);
-
- // Compute and share threadblock prefix
- if (warp_id == 0)
- {
- temp_storage.block_prefix = block_prefix_op(block_aggregate);
- }
-
- __syncthreads();
-
- // Incorporate threadblock prefix into outputs
- output = scan_op(temp_storage.block_prefix, output);
- }
-
-
-    /// Computes an inclusive threadblock-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- __device__ __forceinline__ void InclusiveSum(
- T input, ///< [in] Calling thread's input item
- T &output, ///< [out] Calling thread's output item (may be aliased to \p input)
- T &block_aggregate) ///< [out] Threadblock-wide aggregate reduction of input items
- {
- T warp_aggregate;
- WarpScan(temp_storage.warp_scan, warp_id, lane_id).InclusiveSum(input, output, warp_aggregate);
-
- // Update outputs and block_aggregate with warp-wide aggregates from lane-0s
- ApplyWarpAggregates(output, Sum(), warp_aggregate, block_aggregate);
- }
-
-
-    /// Computes an inclusive threadblock-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Instead of using 0 as the threadblock-wide prefix, the call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by <em>lane</em><sub>0</sub> in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs.
- template <typename BlockPrefixOp>
- __device__ __forceinline__ void InclusiveSum(
- T input, ///< [in] Calling thread's input item
- T &output, ///< [out] Calling thread's output item (may be aliased to \p input)
- T &block_aggregate, ///< [out] Threadblock-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value)
- BlockPrefixOp &block_prefix_op) ///< [in-out] <b>[<em>warp</em><sub>0</sub> only]</b> Call-back functor for specifying a threadblock-wide prefix to be applied to all inputs.
- {
- InclusiveSum(input, output, block_aggregate);
-
- // Compute and share threadblock prefix
- if (warp_id == 0)
- {
- temp_storage.block_prefix = block_prefix_op(block_aggregate);
- }
-
- __syncthreads();
-
- // Incorporate threadblock prefix into outputs
- Sum scan_op;
- output = scan_op(temp_storage.block_prefix, output);
- }
-
-};
-
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
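For reference, every scan variant deleted above follows the same contract: each thread contributes one item, per-warp scans are combined into a block-wide aggregate, and an optional call-back functor supplies a prefix that logically precedes the block's inputs. The following host-side sketch models that contract for an exclusive scan; it is illustrative only (generic operator, made-up names), not part of the removed CUB code.

    #include <cstdio>
    #include <vector>

    // Reference semantics of an exclusive scan seeded by a block prefix:
    //   out[i] = prefix op in[0] op ... op in[i-1]   (so out[0] == prefix op identity)
    // The reported aggregate covers the block's own inputs only, matching the
    // "exclusive of the block_prefix_op value" wording in the deleted interfaces.
    template <typename T, typename Op>
    void exclusive_scan_with_prefix(const std::vector<T> &in, std::vector<T> &out,
                                    Op op, T prefix, T identity, T &block_aggregate)
    {
        T running = identity;                  // scan of the block's own inputs
        for (size_t i = 0; i < in.size(); ++i) {
            out[i] = op(prefix, running);      // prefix is folded into every output
            running = op(running, in[i]);
        }
        block_aggregate = running;             // reduction of the inputs only
    }

    int main()
    {
        std::vector<int> in = {3, 1, 4, 1, 5}, out(in.size());
        int aggregate = 0;
        exclusive_scan_with_prefix(in, out, [](int a, int b) { return a + b; },
                                   /*prefix=*/100, /*identity=*/0, aggregate);
        for (int v : out) std::printf("%d ", v);     // 100 103 104 108 109
        std::printf("| aggregate=%d\n", aggregate);  // | aggregate=14
        return 0;
    }

In the deleted device code the same result is produced cooperatively: each warp scans its own items, the lane-0 warp aggregates are combined into block_aggregate, and warp 0 invokes the call-back to obtain the prefix that is then folded into every thread's output.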
diff --git a/lib/kokkos/TPL/cub/cub.cuh b/lib/kokkos/TPL/cub/cub.cuh
deleted file mode 100755
index dbb77da22..000000000
--- a/lib/kokkos/TPL/cub/cub.cuh
+++ /dev/null
@@ -1,84 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * CUB umbrella include file
- */
-
-#pragma once
-
-
-// Block
-#include "block/block_histogram.cuh"
-#include "block/block_discontinuity.cuh"
-#include "block/block_exchange.cuh"
-#include "block/block_load.cuh"
-#include "block/block_radix_rank.cuh"
-#include "block/block_radix_sort.cuh"
-#include "block/block_reduce.cuh"
-#include "block/block_scan.cuh"
-#include "block/block_store.cuh"
-
-// Device
-#include "device/device_histogram.cuh"
-#include "device/device_radix_sort.cuh"
-#include "device/device_reduce.cuh"
-#include "device/device_scan.cuh"
-
-// Grid
-//#include "grid/grid_barrier.cuh"
-#include "grid/grid_even_share.cuh"
-#include "grid/grid_mapping.cuh"
-#include "grid/grid_queue.cuh"
-
-// Host
-#include "host/spinlock.cuh"
-
-// Thread
-#include "thread/thread_load.cuh"
-#include "thread/thread_operators.cuh"
-#include "thread/thread_reduce.cuh"
-#include "thread/thread_scan.cuh"
-#include "thread/thread_store.cuh"
-
-// Warp
-#include "warp/warp_reduce.cuh"
-#include "warp/warp_scan.cuh"
-
-// Util
-#include "util_allocator.cuh"
-#include "util_arch.cuh"
-#include "util_debug.cuh"
-#include "util_device.cuh"
-#include "util_macro.cuh"
-#include "util_ptx.cuh"
-#include "util_type.cuh"
-#include "util_iterator.cuh"
-#include "util_vector.cuh"
-
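The umbrella header removed above simply aggregates the block-, device-, grid-, thread-, warp-, and utility-level headers listed in it. As a usage illustration only (assuming the two-phase temporary-storage idiom of cub::DeviceScan from this CUB generation; the wrapper function below is not from the patch), a device-wide exclusive sum is typically invoked twice, first to query its scratch size and then to run:

    #include <cuda_runtime.h>
    #include "cub.cuh"   // umbrella header; pulls in cub::DeviceScan et al.

    // Sketch: device-wide exclusive prefix sum over n ints already on the GPU.
    // The first call writes only temp_storage_bytes; the second does the scan.
    cudaError_t exclusive_sum(const int *d_in, int *d_out, int n)
    {
        void   *d_temp_storage     = nullptr;
        size_t  temp_storage_bytes = 0;

        cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes,
                                      d_in, d_out, n);        // size query only
        cudaMalloc(&d_temp_storage, temp_storage_bytes);
        cudaError_t err =
            cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes,
                                          d_in, d_out, n);    // actual scan
        cudaFree(d_temp_storage);
        return err;
    }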
diff --git a/lib/kokkos/TPL/cub/device/block/block_histo_tiles.cuh b/lib/kokkos/TPL/cub/device/block/block_histo_tiles.cuh
deleted file mode 100755
index e1165d60c..000000000
--- a/lib/kokkos/TPL/cub/device/block/block_histo_tiles.cuh
+++ /dev/null
@@ -1,322 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * cub::BlockHistogramTiles implements a stateful abstraction of CUDA thread blocks for histogramming multiple tiles as part of a device-wide histogram.
- */
-
-#pragma once
-
-#include <iterator>
-
-#include "specializations/block_histo_tiles_gatomic.cuh"
-#include "specializations/block_histo_tiles_satomic.cuh"
-#include "specializations/block_histo_tiles_sort.cuh"
-#include "../../util_type.cuh"
-#include "../../grid/grid_mapping.cuh"
-#include "../../grid/grid_even_share.cuh"
-#include "../../grid/grid_queue.cuh"
-#include "../../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/******************************************************************************
- * Algorithmic variants
- ******************************************************************************/
-
-
-/**
- * \brief BlockHistogramTilesAlgorithm enumerates alternative algorithms for BlockHistogramTiles.
- */
-enum BlockHistogramTilesAlgorithm
-{
-
- /**
- * \par Overview
- * A two-kernel approach in which:
- * -# Thread blocks in the first kernel aggregate their own privatized
- * histograms using block-wide sorting (see BlockHistogramAlgorithm::BLOCK_HISTO_SORT).
- * -# A single thread block in the second kernel reduces them into the output histogram(s).
- *
- * \par Performance Considerations
- * Delivers consistent throughput regardless of sample bin distribution.
- *
- * However, because histograms are privatized in shared memory, a large
- * number of bins (e.g., thousands) may adversely affect occupancy and
- * performance (or even the ability to launch).
- */
- GRID_HISTO_SORT,
-
-
- /**
- * \par Overview
- * A two-kernel approach in which:
- * -# Thread blocks in the first kernel aggregate their own privatized
- * histograms using shared-memory \p atomicAdd().
- * -# A single thread block in the second kernel reduces them into the
- * output histogram(s).
- *
- * \par Performance Considerations
- * Performance is strongly tied to the hardware implementation of atomic
- * addition, and may be significantly degraded for non-uniformly-random
- * input distributions where many concurrent updates are likely to be
- * made to the same bin counter.
- *
- * However, because histograms are privatized in shared memory, a large
- * number of bins (e.g., thousands) may adversely affect occupancy and
- * performance (or even the ability to launch).
- */
- GRID_HISTO_SHARED_ATOMIC,
-
-
- /**
- * \par Overview
- * A single-kernel approach in which thread blocks update the output histogram(s) directly
- * using global-memory \p atomicAdd().
- *
- * \par Performance Considerations
- * Performance is strongly tied to the hardware implementation of atomic
- * addition, and may be significantly degraded for non-uniformly-random
- * input distributions where many concurrent updates are likely to be
- * made to the same bin counter.
- *
- * Performance is not significantly impacted when computing histograms having large
- * numbers of bins (e.g., thousands).
- */
- GRID_HISTO_GLOBAL_ATOMIC,
-
-};
-
-
-/******************************************************************************
- * Tuning policy
- ******************************************************************************/
-
-/**
- * Tuning policy for BlockHistogramTiles
- */
-template <
- int _BLOCK_THREADS,
- int _ITEMS_PER_THREAD,
- BlockHistogramTilesAlgorithm _GRID_ALGORITHM,
- GridMappingStrategy _GRID_MAPPING,
- int _SM_OCCUPANCY>
-struct BlockHistogramTilesPolicy
-{
- enum
- {
- BLOCK_THREADS = _BLOCK_THREADS,
- ITEMS_PER_THREAD = _ITEMS_PER_THREAD,
- SM_OCCUPANCY = _SM_OCCUPANCY,
- };
-
- static const BlockHistogramTilesAlgorithm GRID_ALGORITHM = _GRID_ALGORITHM;
- static const GridMappingStrategy GRID_MAPPING = _GRID_MAPPING;
-};
-
-
-
-/******************************************************************************
- * Thread block abstractions
- ******************************************************************************/
-
-
-/**
- * Implements a stateful abstraction of CUDA thread blocks for histogramming multiple tiles as part of a device-wide histogram
- */
-template <
- typename BlockHistogramTilesPolicy, ///< Tuning policy
- int BINS, ///< Number of histogram bins per channel
- int CHANNELS, ///< Number of channels interleaved in the input data (may be greater than the number of active channels being histogrammed)
- int ACTIVE_CHANNELS, ///< Number of channels actively being histogrammed
- typename InputIteratorRA, ///< The input iterator type (may be a simple pointer type). Must have a value type that can be cast as an integer in the range [0..BINS-1]
- typename HistoCounter, ///< Integral type for counting sample occurrences per histogram bin
- typename SizeT> ///< Integer type for offsets
-struct BlockHistogramTiles
-{
- //---------------------------------------------------------------------
- // Types and constants
- //---------------------------------------------------------------------
-
- // Histogram grid algorithm
- static const BlockHistogramTilesAlgorithm GRID_ALGORITHM = BlockHistogramTilesPolicy::GRID_ALGORITHM;
-
- // Alternative internal implementation types
- typedef BlockHistogramTilesSort< BlockHistogramTilesPolicy, BINS, CHANNELS, ACTIVE_CHANNELS, InputIteratorRA, HistoCounter, SizeT> BlockHistogramTilesSortT;
- typedef BlockHistogramTilesSharedAtomic< BlockHistogramTilesPolicy, BINS, CHANNELS, ACTIVE_CHANNELS, InputIteratorRA, HistoCounter, SizeT> BlockHistogramTilesSharedAtomicT;
- typedef BlockHistogramTilesGlobalAtomic< BlockHistogramTilesPolicy, BINS, CHANNELS, ACTIVE_CHANNELS, InputIteratorRA, HistoCounter, SizeT> BlockHistogramTilesGlobalAtomicT;
-
- // Internal block sweep histogram type
- typedef typename If<(GRID_ALGORITHM == GRID_HISTO_SORT),
- BlockHistogramTilesSortT,
- typename If<(GRID_ALGORITHM == GRID_HISTO_SHARED_ATOMIC),
- BlockHistogramTilesSharedAtomicT,
- BlockHistogramTilesGlobalAtomicT>::Type>::Type InternalBlockDelegate;
-
- enum
- {
- TILE_ITEMS = InternalBlockDelegate::TILE_ITEMS,
- };
-
-
- // Temporary storage type
- typedef typename InternalBlockDelegate::TempStorage TempStorage;
-
- //---------------------------------------------------------------------
- // Per-thread fields
- //---------------------------------------------------------------------
-
- // Internal block delegate
- InternalBlockDelegate internal_delegate;
-
-
- //---------------------------------------------------------------------
- // Interface
- //---------------------------------------------------------------------
-
- /**
- * Constructor
- */
- __device__ __forceinline__ BlockHistogramTiles(
- TempStorage &temp_storage, ///< Reference to temp_storage
- InputIteratorRA d_in, ///< Input data to reduce
- HistoCounter* (&d_out_histograms)[ACTIVE_CHANNELS]) ///< Reference to output histograms
- :
- internal_delegate(temp_storage, d_in, d_out_histograms)
- {}
-
-
- /**
- * \brief Reduce a consecutive segment of input tiles
- */
- __device__ __forceinline__ void ConsumeTiles(
- SizeT block_offset, ///< [in] Threadblock begin offset (inclusive)
- SizeT block_oob) ///< [in] Threadblock end offset (exclusive)
- {
- // Consume subsequent full tiles of input
- while (block_offset + TILE_ITEMS <= block_oob)
- {
- internal_delegate.ConsumeTile<true>(block_offset);
- block_offset += TILE_ITEMS;
- }
-
- // Consume a partially-full tile
- if (block_offset < block_oob)
- {
- int valid_items = block_oob - block_offset;
- internal_delegate.ConsumeTile<false>(block_offset, valid_items);
- }
-
- // Aggregate output
- internal_delegate.AggregateOutput();
- }
-
-
- /**
- * Reduce a consecutive segment of input tiles
- */
- __device__ __forceinline__ void ConsumeTiles(
- SizeT num_items, ///< [in] Total number of global input items
- GridEvenShare<SizeT> &even_share, ///< [in] GridEvenShare descriptor
- GridQueue<SizeT> &queue, ///< [in,out] GridQueue descriptor
- Int2Type<GRID_MAPPING_EVEN_SHARE> is_even_share) ///< [in] Marker type indicating this is an even-share mapping
- {
- even_share.BlockInit();
- ConsumeTiles(even_share.block_offset, even_share.block_oob);
- }
-
-
- /**
-     * Dequeue and reduce tiles of items as part of an inter-block scan
- */
- __device__ __forceinline__ void ConsumeTiles(
- int num_items, ///< Total number of input items
- GridQueue<SizeT> queue) ///< Queue descriptor for assigning tiles of work to thread blocks
- {
- // Shared block offset
- __shared__ SizeT shared_block_offset;
-
- // We give each thread block at least one tile of input.
- SizeT block_offset = blockIdx.x * TILE_ITEMS;
- SizeT even_share_base = gridDim.x * TILE_ITEMS;
-
- // Process full tiles of input
- while (block_offset + TILE_ITEMS <= num_items)
- {
- internal_delegate.ConsumeTile<true>(block_offset);
-
- // Dequeue up to TILE_ITEMS
- if (threadIdx.x == 0)
- shared_block_offset = queue.Drain(TILE_ITEMS) + even_share_base;
-
- __syncthreads();
-
- block_offset = shared_block_offset;
-
- __syncthreads();
- }
-
- // Consume a partially-full tile
- if (block_offset < num_items)
- {
- int valid_items = num_items - block_offset;
- internal_delegate.ConsumeTile<false>(block_offset, valid_items);
- }
-
- // Aggregate output
- internal_delegate.AggregateOutput();
- }
-
-
- /**
-     * Dequeue and reduce tiles of items as part of an inter-block scan
- */
- __device__ __forceinline__ void ConsumeTiles(
- SizeT num_items, ///< [in] Total number of global input items
- GridEvenShare<SizeT> &even_share, ///< [in] GridEvenShare descriptor
- GridQueue<SizeT> &queue, ///< [in,out] GridQueue descriptor
- Int2Type<GRID_MAPPING_DYNAMIC> is_dynamic) ///< [in] Marker type indicating this is a dynamic mapping
- {
- ConsumeTiles(num_items, queue);
- }
-
-
-};
-
-
-
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
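The GRID_HISTO_SHARED_ATOMIC strategy described in the enum above privatizes one histogram per thread block in shared memory and merges the privatized counts afterwards, which is why very large bin counts hurt occupancy while heavily skewed inputs hurt atomic throughput. The kernel below is an illustrative sketch of that idea written for this note, not the deleted CUB implementation (which merges the privatized histograms in a separate kernel rather than with global atomics); it assumes each sample value is already in [0, bins).

    #include <cuda_runtime.h>

    // Each block builds a privatized histogram in shared memory with atomicAdd,
    // then flushes its counts to the global histogram.
    // Launch: shared_atomic_histogram<<<grid, block, bins * sizeof(unsigned int)>>>(...)
    __global__ void shared_atomic_histogram(const unsigned char *in, int n,
                                            unsigned int *global_hist, int bins)
    {
        extern __shared__ unsigned int block_hist[];    // one counter per bin

        // Zero the privatized histogram
        for (int b = threadIdx.x; b < bins; b += blockDim.x)
            block_hist[b] = 0;
        __syncthreads();

        // Accumulate samples with shared-memory atomics (fast, but contended
        // when many threads in the block hit the same bin)
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            atomicAdd(&block_hist[in[i]], 1u);
        __syncthreads();

        // Merge the privatized counts into the global result
        for (int b = threadIdx.x; b < bins; b += blockDim.x)
            atomicAdd(&global_hist[b], block_hist[b]);
    }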
diff --git a/lib/kokkos/TPL/cub/device/block/block_partition_tiles.cuh b/lib/kokkos/TPL/cub/device/block/block_partition_tiles.cuh
deleted file mode 100755
index 4597773af..000000000
--- a/lib/kokkos/TPL/cub/device/block/block_partition_tiles.cuh
+++ /dev/null
@@ -1,381 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * cub::BlockPartitionTiles implements a stateful abstraction of CUDA thread blocks for participating in device-wide list partitioning.
- */
-
-#pragma once
-
-#include <iterator>
-
-#include "scan_tiles_types.cuh"
-#include "../../thread/thread_operators.cuh"
-#include "../../block/block_load.cuh"
-#include "../../block/block_store.cuh"
-#include "../../block/block_scan.cuh"
-#include "../../grid/grid_queue.cuh"
-#include "../../util_vector.cuh"
-#include "../../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/******************************************************************************
- * Tuning policy types
- ******************************************************************************/
-
-/**
- * Tuning policy for BlockPartitionTiles
- */
-template <
- int _PARTITIONS,
- int _BLOCK_THREADS,
- int _ITEMS_PER_THREAD,
- PtxLoadModifier _LOAD_MODIFIER,
- BlockScanAlgorithm _SCAN_ALGORITHM>
-struct BlockPartitionTilesPolicy
-{
- enum
- {
- PARTITIONS = _PARTITIONS,
- BLOCK_THREADS = _BLOCK_THREADS,
- ITEMS_PER_THREAD = _ITEMS_PER_THREAD,
- };
-
- static const PtxLoadModifier LOAD_MODIFIER = _LOAD_MODIFIER;
- static const BlockScanAlgorithm SCAN_ALGORITHM = _SCAN_ALGORITHM;
-};
-
-
-
-/**
- * Tuple type for scanning partition membership flags
- */
-template <
- typename SizeT,
- int PARTITIONS>
-struct PartitionScanTuple;
-
-
-/**
- * Tuple type for scanning partition membership flags (specialized for 1 output partition)
- */
-template <typename SizeT>
-struct PartitionScanTuple<SizeT, 1> : VectorHelper<SizeT, 1>::Type
-{
- __device__ __forceinline__ PartitionScanTuple operator+(const PartitionScanTuple &other)
- {
- PartitionScanTuple retval;
- retval.x = x + other.x;
- return retval;
- }
-
- template <typename PredicateOp, typename T>
- __device__ __forceinline__ void SetFlags(PredicateOp pred_op, T val)
- {
- this->x = pred_op(val);
- }
-
-    template <typename PredicateOp, typename T, typename OutputIteratorRA>
- __device__ __forceinline__ void Scatter(PredicateOp pred_op, T val, OutputIteratorRA d_out, SizeT num_items)
- {
- if (pred_op(val))
- d_out[this->x - 1] = val;
- }
-
-};
-
-
-/**
- * Tuple type for scanning partition membership flags (specialized for 2 output partitions)
- */
-template <typename SizeT>
-struct PartitionScanTuple<SizeT, 2> : VectorHelper<SizeT, 2>::Type
-{
- __device__ __forceinline__ PartitionScanTuple operator+(const PartitionScanTuple &other)
- {
- PartitionScanTuple retval;
- retval.x = x + other.x;
- retval.y = y + other.y;
- return retval;
- }
-
- template <typename PredicateOp, typename T>
- __device__ __forceinline__ void SetFlags(PredicateOp pred_op, T val)
- {
- bool pred = pred_op(val);
- this->x = pred;
- this->y = !pred;
- }
-
-    template <typename PredicateOp, typename T, typename OutputIteratorRA>
- __device__ __forceinline__ void Scatter(PredicateOp pred_op, T val, OutputIteratorRA d_out, SizeT num_items)
- {
- SizeT scatter_offset = (pred_op(val)) ?
- this->x - 1 :
- num_items - this->y;
-
- d_out[scatter_offset] = val;
- }
-};
-
-
-
-
-/******************************************************************************
- * Thread block abstractions
- ******************************************************************************/
-
-/**
- * \brief BlockPartitionTiles implements a stateful abstraction of CUDA thread blocks for participating in device-wide list partitioning.
- *
- * Implements a single-pass "domino" strategy with adaptive prefix lookback.
- */
-template <
- typename BlockPartitionTilesPolicy, ///< Tuning policy
- typename InputIteratorRA, ///< Input iterator type
- typename OutputIteratorRA, ///< Output iterator type
- typename PredicateOp, ///< Partition predicate functor type
- typename SizeT> ///< Offset integer type
-struct BlockPartitionTiles
-{
- //---------------------------------------------------------------------
- // Types and constants
- //---------------------------------------------------------------------
-
- // Constants
- enum
- {
- PARTITIONS = BlockPartitionTilesPolicy::PARTITIONS,
- BLOCK_THREADS = BlockPartitionTilesPolicy::BLOCK_THREADS,
- ITEMS_PER_THREAD = BlockPartitionTilesPolicy::ITEMS_PER_THREAD,
- TILE_ITEMS = BLOCK_THREADS * ITEMS_PER_THREAD,
- };
-
- // Load modifier
- static const PtxLoadModifier LOAD_MODIFIER = BlockPartitionTilesPolicy::LOAD_MODIFIER;
-
- // Data type of input iterator
- typedef typename std::iterator_traits<InputIteratorRA>::value_type T;
-
- // Tuple type for scanning partition membership flags
- typedef PartitionScanTuple<SizeT, PARTITIONS> PartitionScanTuple;
-
- // Tile status descriptor type
- typedef ScanTileDescriptor<PartitionScanTuple> ScanTileDescriptorT;
-
- // Block scan type for scanning membership flag scan_tuples
- typedef BlockScan<
- PartitionScanTuple,
- BlockPartitionTilesPolicy::BLOCK_THREADS,
- BlockPartitionTilesPolicy::SCAN_ALGORITHM> BlockScanT;
-
- // Callback type for obtaining inter-tile prefix during block scan
- typedef DeviceScanBlockPrefixOp<PartitionScanTuple, Sum> InterblockPrefixOp;
-
- // Shared memory type for this threadblock
- struct TempStorage
- {
- typename InterblockPrefixOp::TempStorage prefix; // Smem needed for cooperative prefix callback
- typename BlockScanT::TempStorage scan; // Smem needed for tile scanning
- SizeT tile_idx; // Shared tile index
- };
-
-
- //---------------------------------------------------------------------
- // Per-thread fields
- //---------------------------------------------------------------------
-
- TempStorage &temp_storage; ///< Reference to temp_storage
- InputIteratorRA d_in; ///< Input data
- OutputIteratorRA d_out; ///< Output data
- ScanTileDescriptorT *d_tile_status; ///< Global list of tile status
- PredicateOp pred_op; ///< Unary predicate operator indicating membership in the first partition
- SizeT num_items; ///< Total number of input items
-
-
- //---------------------------------------------------------------------
- // Constructor
- //---------------------------------------------------------------------
-
- // Constructor
- __device__ __forceinline__
- BlockPartitionTiles(
- TempStorage &temp_storage, ///< Reference to temp_storage
- InputIteratorRA d_in, ///< Input data
- OutputIteratorRA d_out, ///< Output data
- ScanTileDescriptorT *d_tile_status, ///< Global list of tile status
- PredicateOp pred_op, ///< Unary predicate operator indicating membership in the first partition
- SizeT num_items) ///< Total number of input items
- :
- temp_storage(temp_storage.Alias()),
- d_in(d_in),
- d_out(d_out),
- d_tile_status(d_tile_status),
- pred_op(pred_op),
- num_items(num_items)
- {}
-
-
- //---------------------------------------------------------------------
- // Domino scan
- //---------------------------------------------------------------------
-
- /**
- * Process a tile of input
- */
- template <bool FULL_TILE>
- __device__ __forceinline__ void ConsumeTile(
- int tile_idx, ///< Tile index
- SizeT block_offset, ///< Tile offset
- PartitionScanTuple &partition_ends) ///< Running total
- {
- T items[ITEMS_PER_THREAD];
- PartitionScanTuple scan_tuples[ITEMS_PER_THREAD];
-
- // Load items
- int valid_items = num_items - block_offset;
- if (FULL_TILE)
- LoadStriped<LOAD_MODIFIER, BLOCK_THREADS>(threadIdx.x, d_in + block_offset, items);
- else
- LoadStriped<LOAD_MODIFIER, BLOCK_THREADS>(threadIdx.x, d_in + block_offset, items, valid_items);
-
- // Prevent hoisting
-// __syncthreads();
-// __threadfence_block();
-
-        // Set partition membership flags in scan_tuples
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ++ITEM)
- {
- scan_tuples[ITEM].SetFlags(pred_op, items[ITEM]);
- }
-
-        // Perform inclusive scan over scan_tuples
- PartitionScanTuple block_aggregate;
- if (tile_idx == 0)
- {
- BlockScanT(temp_storage.scan).InclusiveScan(scan_tuples, scan_tuples, Sum(), block_aggregate);
- partition_ends = block_aggregate;
-
- // Update tile status if there are successor tiles
- if (FULL_TILE && (threadIdx.x == 0))
- ScanTileDescriptorT::SetPrefix(d_tile_status, block_aggregate);
- }
- else
- {
- InterblockPrefixOp prefix_op(d_tile_status, temp_storage.prefix, Sum(), tile_idx);
- BlockScanT(temp_storage.scan).InclusiveScan(scan_tuples, scan_tuples, Sum(), block_aggregate, prefix_op);
- partition_ends = prefix_op.inclusive_prefix;
- }
-
- // Scatter items
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ++ITEM)
- {
- // Scatter if not out-of-bounds
- if (FULL_TILE || (threadIdx.x + (ITEM * BLOCK_THREADS) < valid_items))
- {
- scan_tuples[ITEM].Scatter(pred_op, items[ITEM], d_out, num_items);
- }
- }
- }
-
-
- /**
- * Dequeue and scan tiles of items as part of a domino scan
- */
- __device__ __forceinline__ void ConsumeTiles(
- GridQueue<int> queue, ///< [in] Queue descriptor for assigning tiles of work to thread blocks
- SizeT num_tiles, ///< [in] Total number of input tiles
- PartitionScanTuple &partition_ends, ///< [out] Running partition end offsets
- bool &is_last_tile) ///< [out] Whether or not this block handled the last tile (i.e., partition_ends is valid for the entire input)
- {
-#if CUB_PTX_ARCH < 200
-
- // No concurrent kernels allowed and blocks are launched in increasing order, so just assign one tile per block (up to 65K blocks)
- int tile_idx = blockIdx.x;
- SizeT block_offset = SizeT(TILE_ITEMS) * tile_idx;
-
- if (block_offset + TILE_ITEMS <= num_items)
- {
- ConsumeTile<true>(tile_idx, block_offset, partition_ends);
- }
- else if (block_offset < num_items)
- {
- ConsumeTile<false>(tile_idx, block_offset, partition_ends);
- }
- is_last_tile = (tile_idx == num_tiles - 1);
-
-#else
-
- // Get first tile
- if (threadIdx.x == 0)
- temp_storage.tile_idx = queue.Drain(1);
-
- __syncthreads();
-
- int tile_idx = temp_storage.tile_idx;
- SizeT block_offset = SizeT(TILE_ITEMS) * tile_idx;
-
- while (block_offset + TILE_ITEMS <= num_items)
- {
- // Consume full tile
- ConsumeTile<true>(tile_idx, block_offset, partition_ends);
- is_last_tile = (tile_idx == num_tiles - 1);
-
- // Get next tile
- if (threadIdx.x == 0)
- temp_storage.tile_idx = queue.Drain(1);
-
- __syncthreads();
-
- tile_idx = temp_storage.tile_idx;
- block_offset = SizeT(TILE_ITEMS) * tile_idx;
- }
-
- // Consume a partially-full tile
- if (block_offset < num_items)
- {
- ConsumeTile<false>(tile_idx, block_offset, partition_ends);
- is_last_tile = (tile_idx == num_tiles - 1);
- }
-#endif
- }
-};
-
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
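The two-partition scatter rule encoded by the deleted PartitionScanTuple<SizeT, 2> is easiest to see sequentially: after an inclusive scan of per-item (selected, rejected) flag tuples, a selected item lands at index x-1 and a rejected item at num_items - y, so rejected items fill the tail in reverse order. A host-side model of that rule (illustrative names, not CUB code):

    #include <cstdio>
    #include <vector>

    // Scatter according to running (selected, rejected) counts, mirroring the
    // offset rule of PartitionScanTuple<SizeT, 2>::Scatter from the deleted file.
    template <typename T, typename Pred>
    void partition_scatter(const std::vector<T> &in, std::vector<T> &out, Pred pred)
    {
        size_t x = 0, y = 0;                    // inclusive counts scanned so far
        const size_t num_items = in.size();
        for (const T &val : in) {
            const bool selected = pred(val);
            x += selected;
            y += !selected;
            const size_t offset = selected ? x - 1 : num_items - y;
            out[offset] = val;
        }
    }

    int main()
    {
        std::vector<int> in = {5, 2, 8, 1, 9, 3}, out(in.size());
        partition_scatter(in, out, [](int v) { return v >= 5; });
        for (int v : out) std::printf("%d ", v);   // 5 8 9 3 1 2
        return 0;
    }

In the deleted device code the running counts come from a block-wide inclusive scan of the flag tuples, and the single-pass "domino" prefix (DeviceScanBlockPrefixOp over the tile status descriptors) carries them across tiles so the whole partition completes in one pass over the input.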
diff --git a/lib/kokkos/TPL/cub/device/block/block_radix_sort_downsweep_tiles.cuh b/lib/kokkos/TPL/cub/device/block/block_radix_sort_downsweep_tiles.cuh
deleted file mode 100755
index 91d628e00..000000000
--- a/lib/kokkos/TPL/cub/device/block/block_radix_sort_downsweep_tiles.cuh
+++ /dev/null
@@ -1,713 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * BlockRadixSortDownsweepTiles implements a stateful abstraction of CUDA thread blocks for participating in device-wide radix sort downsweep.
- */
-
-
-#pragma once
-
-#include "../../thread/thread_load.cuh"
-#include "../../block/block_load.cuh"
-#include "../../block/block_store.cuh"
-#include "../../block/block_radix_rank.cuh"
-#include "../../block/block_exchange.cuh"
-#include "../../util_type.cuh"
-#include "../../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/******************************************************************************
- * Tuning policy types
- ******************************************************************************/
-
-/**
- * Types of scattering strategies
- */
-enum RadixSortScatterAlgorithm
-{
- RADIX_SORT_SCATTER_DIRECT, ///< Scatter directly from registers to global bins
- RADIX_SORT_SCATTER_TWO_PHASE, ///< First scatter from registers into shared memory bins, then into global bins
-};
-
-
-/**
- * Tuning policy for BlockRadixSortDownsweepTiles
- */
-template <
- int _BLOCK_THREADS, ///< The number of threads per CTA
- int _ITEMS_PER_THREAD, ///< The number of consecutive downsweep keys to process per thread
- BlockLoadAlgorithm _LOAD_ALGORITHM, ///< The BlockLoad algorithm to use
- PtxLoadModifier _LOAD_MODIFIER, ///< The PTX cache-modifier to use for loads
- bool _EXCHANGE_TIME_SLICING, ///< Whether or not to time-slice key/value exchanges through shared memory to lower shared memory pressure
- bool _MEMOIZE_OUTER_SCAN, ///< Whether or not to buffer outer raking scan partials to incur fewer shared memory reads at the expense of higher register pressure. See BlockScanAlgorithm::BLOCK_SCAN_RAKING_MEMOIZE for more details.
- BlockScanAlgorithm _INNER_SCAN_ALGORITHM, ///< The cub::BlockScanAlgorithm algorithm to use
- RadixSortScatterAlgorithm _SCATTER_ALGORITHM, ///< The scattering strategy to use
- cudaSharedMemConfig _SMEM_CONFIG, ///< Shared memory bank mode (default: \p cudaSharedMemBankSizeFourByte)
- int _RADIX_BITS> ///< The number of radix bits, i.e., log2(bins)
-struct BlockRadixSortDownsweepTilesPolicy
-{
- enum
- {
- BLOCK_THREADS = _BLOCK_THREADS,
- ITEMS_PER_THREAD = _ITEMS_PER_THREAD,
- EXCHANGE_TIME_SLICING = _EXCHANGE_TIME_SLICING,
- RADIX_BITS = _RADIX_BITS,
- MEMOIZE_OUTER_SCAN = _MEMOIZE_OUTER_SCAN,
- TILE_ITEMS = BLOCK_THREADS * ITEMS_PER_THREAD,
- };
-
- static const BlockLoadAlgorithm LOAD_ALGORITHM = _LOAD_ALGORITHM;
- static const PtxLoadModifier LOAD_MODIFIER = _LOAD_MODIFIER;
- static const BlockScanAlgorithm INNER_SCAN_ALGORITHM = _INNER_SCAN_ALGORITHM;
- static const RadixSortScatterAlgorithm SCATTER_ALGORITHM = _SCATTER_ALGORITHM;
- static const cudaSharedMemConfig SMEM_CONFIG = _SMEM_CONFIG;
-
- typedef BlockRadixSortDownsweepTilesPolicy<
- BLOCK_THREADS,
- ITEMS_PER_THREAD,
- LOAD_ALGORITHM,
- LOAD_MODIFIER,
- EXCHANGE_TIME_SLICING,
- MEMOIZE_OUTER_SCAN,
- INNER_SCAN_ALGORITHM,
- SCATTER_ALGORITHM,
- SMEM_CONFIG,
- CUB_MAX(1, RADIX_BITS - 1)> AltPolicy;
-};
-
-
-/******************************************************************************
- * Thread block abstractions
- ******************************************************************************/
-
-/**
- * CTA-wide "downsweep" abstraction for distributing keys from
- * a range of input tiles.
- */
-template <
- typename BlockRadixSortDownsweepTilesPolicy,
- typename Key,
- typename Value,
- typename SizeT>
-struct BlockRadixSortDownsweepTiles
-{
- //---------------------------------------------------------------------
- // Type definitions and constants
- //---------------------------------------------------------------------
-
- // Appropriate unsigned-bits representation of Key
- typedef typename Traits<Key>::UnsignedBits UnsignedBits;
-
- static const UnsignedBits MIN_KEY = Traits<Key>::MIN_KEY;
- static const UnsignedBits MAX_KEY = Traits<Key>::MAX_KEY;
-
- static const BlockLoadAlgorithm LOAD_ALGORITHM = BlockRadixSortDownsweepTilesPolicy::LOAD_ALGORITHM;
- static const PtxLoadModifier LOAD_MODIFIER = BlockRadixSortDownsweepTilesPolicy::LOAD_MODIFIER;
- static const BlockScanAlgorithm INNER_SCAN_ALGORITHM = BlockRadixSortDownsweepTilesPolicy::INNER_SCAN_ALGORITHM;
- static const RadixSortScatterAlgorithm SCATTER_ALGORITHM = BlockRadixSortDownsweepTilesPolicy::SCATTER_ALGORITHM;
- static const cudaSharedMemConfig SMEM_CONFIG = BlockRadixSortDownsweepTilesPolicy::SMEM_CONFIG;
-
- enum
- {
- BLOCK_THREADS = BlockRadixSortDownsweepTilesPolicy::BLOCK_THREADS,
- ITEMS_PER_THREAD = BlockRadixSortDownsweepTilesPolicy::ITEMS_PER_THREAD,
- EXCHANGE_TIME_SLICING = BlockRadixSortDownsweepTilesPolicy::EXCHANGE_TIME_SLICING,
- RADIX_BITS = BlockRadixSortDownsweepTilesPolicy::RADIX_BITS,
- MEMOIZE_OUTER_SCAN = BlockRadixSortDownsweepTilesPolicy::MEMOIZE_OUTER_SCAN,
- TILE_ITEMS = BLOCK_THREADS * ITEMS_PER_THREAD,
-
- RADIX_DIGITS = 1 << RADIX_BITS,
- KEYS_ONLY = Equals<Value, NullType>::VALUE,
-
- WARP_THREADS = PtxArchProps::LOG_WARP_THREADS,
- WARPS = (BLOCK_THREADS + WARP_THREADS - 1) / WARP_THREADS,
-
- BYTES_PER_SIZET = sizeof(SizeT),
- LOG_BYTES_PER_SIZET = Log2<BYTES_PER_SIZET>::VALUE,
-
- LOG_SMEM_BANKS = PtxArchProps::LOG_SMEM_BANKS,
- SMEM_BANKS = 1 << LOG_SMEM_BANKS,
-
- DIGITS_PER_SCATTER_PASS = BLOCK_THREADS / SMEM_BANKS,
- SCATTER_PASSES = RADIX_DIGITS / DIGITS_PER_SCATTER_PASS,
-
- LOG_STORE_TXN_THREADS = LOG_SMEM_BANKS,
- STORE_TXN_THREADS = 1 << LOG_STORE_TXN_THREADS,
- };
-
- // BlockRadixRank type
- typedef BlockRadixRank<
- BLOCK_THREADS,
- RADIX_BITS,
- MEMOIZE_OUTER_SCAN,
- INNER_SCAN_ALGORITHM,
- SMEM_CONFIG> BlockRadixRank;
-
- // BlockLoad type (keys)
- typedef BlockLoad<
- UnsignedBits*,
- BLOCK_THREADS,
- ITEMS_PER_THREAD,
- LOAD_ALGORITHM,
- LOAD_MODIFIER,
- EXCHANGE_TIME_SLICING> BlockLoadKeys;
-
- // BlockLoad type (values)
- typedef BlockLoad<
- Value*,
- BLOCK_THREADS,
- ITEMS_PER_THREAD,
- LOAD_ALGORITHM,
- LOAD_MODIFIER,
- EXCHANGE_TIME_SLICING> BlockLoadValues;
-
- // BlockExchange type (keys)
- typedef BlockExchange<
- UnsignedBits,
- BLOCK_THREADS,
- ITEMS_PER_THREAD,
- EXCHANGE_TIME_SLICING> BlockExchangeKeys;
-
- // BlockExchange type (values)
- typedef BlockExchange<
- Value,
- BLOCK_THREADS,
- ITEMS_PER_THREAD,
- EXCHANGE_TIME_SLICING> BlockExchangeValues;
-
-
- /**
- * Shared memory storage layout
- */
- struct _TempStorage
- {
- SizeT relative_bin_offsets[RADIX_DIGITS + 1];
- bool short_circuit;
-
- union
- {
- typename BlockRadixRank::TempStorage ranking;
- typename BlockLoadKeys::TempStorage load_keys;
- typename BlockLoadValues::TempStorage load_values;
- typename BlockExchangeKeys::TempStorage exchange_keys;
- typename BlockExchangeValues::TempStorage exchange_values;
- };
- };
-
-
- /// Alias wrapper allowing storage to be unioned
- struct TempStorage : Uninitialized<_TempStorage> {};
-
-
- //---------------------------------------------------------------------
- // Thread fields
- //---------------------------------------------------------------------
-
- // Shared storage for this CTA
- _TempStorage &temp_storage;
-
- // Input and output device pointers
- UnsignedBits *d_keys_in;
- UnsignedBits *d_keys_out;
- Value *d_values_in;
- Value *d_values_out;
-
- // The global scatter base offset for each digit (valid in the first RADIX_DIGITS threads)
- SizeT bin_offset;
-
- // The least-significant bit position of the current digit to extract
- int current_bit;
-
-    // Whether to short-circuit
- bool short_circuit;
-
-
-
- //---------------------------------------------------------------------
- // Utility methods
- //---------------------------------------------------------------------
-
- /**
-     * Decodes given keys to look up digit offsets in shared memory
- */
- __device__ __forceinline__ void DecodeRelativeBinOffsets(
- UnsignedBits (&twiddled_keys)[ITEMS_PER_THREAD],
- SizeT (&relative_bin_offsets)[ITEMS_PER_THREAD])
- {
- #pragma unroll
- for (int KEY = 0; KEY < ITEMS_PER_THREAD; KEY++)
- {
- UnsignedBits digit = BFE(twiddled_keys[KEY], current_bit, RADIX_BITS);
-
- // Lookup base digit offset from shared memory
- relative_bin_offsets[KEY] = temp_storage.relative_bin_offsets[digit];
- }
- }
-
-
- /**
- * Scatter ranked items to global memory
- */
- template <bool FULL_TILE, typename T>
- __device__ __forceinline__ void ScatterItems(
- T (&items)[ITEMS_PER_THREAD],
- int (&local_ranks)[ITEMS_PER_THREAD],
- SizeT (&relative_bin_offsets)[ITEMS_PER_THREAD],
- T *d_out,
- SizeT valid_items)
- {
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ++ITEM)
- {
- // Scatter if not out-of-bounds
- if (FULL_TILE || (local_ranks[ITEM] < valid_items))
- {
- d_out[relative_bin_offsets[ITEM] + local_ranks[ITEM]] = items[ITEM];
- }
- }
- }
-
-
- /**
- * Scatter ranked keys directly to global memory
- */
- template <bool FULL_TILE>
- __device__ __forceinline__ void ScatterKeys(
- UnsignedBits (&twiddled_keys)[ITEMS_PER_THREAD],
- SizeT (&relative_bin_offsets)[ITEMS_PER_THREAD],
- int (&ranks)[ITEMS_PER_THREAD],
- SizeT valid_items,
- Int2Type<RADIX_SORT_SCATTER_DIRECT> scatter_algorithm)
- {
- // Compute scatter offsets
- DecodeRelativeBinOffsets(twiddled_keys, relative_bin_offsets);
-
- // Untwiddle keys before outputting
- UnsignedBits keys[ITEMS_PER_THREAD];
-
- #pragma unroll
- for (int KEY = 0; KEY < ITEMS_PER_THREAD; KEY++)
- {
- keys[KEY] = Traits<Key>::TwiddleOut(twiddled_keys[KEY]);
- }
-
- // Scatter to global
- ScatterItems<FULL_TILE>(keys, ranks, relative_bin_offsets, d_keys_out, valid_items);
- }
-
-
- /**
- * Scatter ranked keys through shared memory, then to global memory
- */
- template <bool FULL_TILE>
- __device__ __forceinline__ void ScatterKeys(
- UnsignedBits (&twiddled_keys)[ITEMS_PER_THREAD],
- SizeT (&relative_bin_offsets)[ITEMS_PER_THREAD],
- int (&ranks)[ITEMS_PER_THREAD],
- SizeT valid_items,
- Int2Type<RADIX_SORT_SCATTER_TWO_PHASE> scatter_algorithm)
- {
- // Exchange keys through shared memory
- BlockExchangeKeys(temp_storage.exchange_keys).ScatterToStriped(twiddled_keys, ranks);
-
- // Compute striped local ranks
- int local_ranks[ITEMS_PER_THREAD];
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ++ITEM)
- {
- local_ranks[ITEM] = threadIdx.x + (ITEM * BLOCK_THREADS);
- }
-
- // Scatter directly
- ScatterKeys<FULL_TILE>(
- twiddled_keys,
- relative_bin_offsets,
- local_ranks,
- valid_items,
- Int2Type<RADIX_SORT_SCATTER_DIRECT>());
- }
-
-
- /**
- * Scatter ranked values directly to global memory
- */
- template <bool FULL_TILE>
- __device__ __forceinline__ void ScatterValues(
- Value (&values)[ITEMS_PER_THREAD],
- SizeT (&relative_bin_offsets)[ITEMS_PER_THREAD],
- int (&ranks)[ITEMS_PER_THREAD],
- SizeT valid_items,
- Int2Type<RADIX_SORT_SCATTER_DIRECT> scatter_algorithm)
- {
- // Scatter to global
- ScatterItems<FULL_TILE>(values, ranks, relative_bin_offsets, d_values_out, valid_items);
- }
-
-
- /**
- * Scatter ranked values through shared memory, then to global memory
- */
- template <bool FULL_TILE>
- __device__ __forceinline__ void ScatterValues(
- Value (&values)[ITEMS_PER_THREAD],
- SizeT (&relative_bin_offsets)[ITEMS_PER_THREAD],
- int (&ranks)[ITEMS_PER_THREAD],
- SizeT valid_items,
- Int2Type<RADIX_SORT_SCATTER_TWO_PHASE> scatter_algorithm)
- {
- __syncthreads();
-
-        // Exchange values through shared memory
- BlockExchangeValues(temp_storage.exchange_values).ScatterToStriped(values, ranks);
-
- // Compute striped local ranks
- int local_ranks[ITEMS_PER_THREAD];
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ++ITEM)
- {
- local_ranks[ITEM] = threadIdx.x + (ITEM * BLOCK_THREADS);
- }
-
- // Scatter directly
- ScatterValues<FULL_TILE>(
- values,
- relative_bin_offsets,
- local_ranks,
- valid_items,
- Int2Type<RADIX_SORT_SCATTER_DIRECT>());
- }
-
-
- /**
- * Load a tile of items (specialized for full tile)
- */
- template <typename BlockLoadT, typename T>
- __device__ __forceinline__ void LoadItems(
- BlockLoadT &block_loader,
- T (&items)[ITEMS_PER_THREAD],
- T *d_in,
- SizeT valid_items,
- Int2Type<true> is_full_tile)
- {
- block_loader.Load(d_in, items);
- }
-
-
- /**
- * Load a tile of items (specialized for partial tile)
- */
- template <typename BlockLoadT, typename T>
- __device__ __forceinline__ void LoadItems(
- BlockLoadT &block_loader,
- T (&items)[ITEMS_PER_THREAD],
- T *d_in,
- SizeT valid_items,
- Int2Type<false> is_full_tile)
- {
- block_loader.Load(d_in, items, valid_items);
- }
-
-
- /**
- * Truck along associated values
- */
- template <bool FULL_TILE, typename _Value>
- __device__ __forceinline__ void GatherScatterValues(
- _Value (&values)[ITEMS_PER_THREAD],
- SizeT (&relative_bin_offsets)[ITEMS_PER_THREAD],
- int (&ranks)[ITEMS_PER_THREAD],
- SizeT block_offset,
- SizeT valid_items)
- {
- BlockLoadValues loader(temp_storage.load_values);
- LoadItems(
- loader,
- values,
- d_values_in + block_offset,
- valid_items,
- Int2Type<FULL_TILE>());
-
- ScatterValues<FULL_TILE>(
- values,
- relative_bin_offsets,
- ranks,
- valid_items,
- Int2Type<SCATTER_ALGORITHM>());
- }
-
-
- /**
- * Truck along associated values (specialized for key-only sorting)
- */
- template <bool FULL_TILE>
- __device__ __forceinline__ void GatherScatterValues(
- NullType (&values)[ITEMS_PER_THREAD],
- SizeT (&relative_bin_offsets)[ITEMS_PER_THREAD],
- int (&ranks)[ITEMS_PER_THREAD],
- SizeT block_offset,
- SizeT valid_items)
- {}
-
-
- /**
- * Process tile
- */
- template <bool FULL_TILE>
- __device__ __forceinline__ void ProcessTile(
- SizeT block_offset,
- const SizeT &valid_items = TILE_ITEMS)
- {
- // Per-thread tile data
- UnsignedBits keys[ITEMS_PER_THREAD]; // Keys
- UnsignedBits twiddled_keys[ITEMS_PER_THREAD]; // Twiddled keys
- int ranks[ITEMS_PER_THREAD]; // For each key, the local rank within the CTA
- SizeT relative_bin_offsets[ITEMS_PER_THREAD]; // For each key, the global scatter base offset of the corresponding digit
-
- if (LOAD_ALGORITHM != BLOCK_LOAD_DIRECT) __syncthreads();
-
- // Assign max-key to all keys
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ++ITEM)
- {
- keys[ITEM] = MAX_KEY;
- }
-
- // Load tile of keys
- BlockLoadKeys loader(temp_storage.load_keys);
- LoadItems(
- loader,
- keys,
- d_keys_in + block_offset,
- valid_items,
- Int2Type<FULL_TILE>());
-
- __syncthreads();
-
- // Twiddle key bits if necessary
- #pragma unroll
- for (int KEY = 0; KEY < ITEMS_PER_THREAD; KEY++)
- {
- twiddled_keys[KEY] = Traits<Key>::TwiddleIn(keys[KEY]);
- }
-
- // Rank the twiddled keys
- int inclusive_digit_prefix;
- BlockRadixRank(temp_storage.ranking).RankKeys(
- twiddled_keys,
- ranks,
- current_bit,
- inclusive_digit_prefix);
-
- // Update global scatter base offsets for each digit
- if ((BLOCK_THREADS == RADIX_DIGITS) || (threadIdx.x < RADIX_DIGITS))
- {
- int exclusive_digit_prefix;
-
- // Get exclusive digit prefix from inclusive prefix
-#if CUB_PTX_ARCH >= 300
- exclusive_digit_prefix = ShuffleUp(inclusive_digit_prefix, 1);
- if (threadIdx.x == 0)
- exclusive_digit_prefix = 0;
-#else
- volatile int* exchange = reinterpret_cast<int *>(temp_storage.relative_bin_offsets);
- exchange[threadIdx.x] = 0;
- exchange[threadIdx.x + 1] = inclusive_digit_prefix;
- exclusive_digit_prefix = exchange[threadIdx.x];
-#endif
-
- bin_offset -= exclusive_digit_prefix;
- temp_storage.relative_bin_offsets[threadIdx.x] = bin_offset;
- bin_offset += inclusive_digit_prefix;
- }
-
- __syncthreads();
-
- // Scatter keys
- ScatterKeys<FULL_TILE>(twiddled_keys, relative_bin_offsets, ranks, valid_items, Int2Type<SCATTER_ALGORITHM>());
-
- // Gather/scatter values
- Value values[ITEMS_PER_THREAD];
- GatherScatterValues<FULL_TILE>(values, relative_bin_offsets, ranks, block_offset, valid_items);
- }
-
-
- /**
- * Copy tiles within the range of input
- */
- template <typename T>
- __device__ __forceinline__ void Copy(
- T *d_in,
- T *d_out,
- SizeT block_offset,
- SizeT block_oob)
- {
- // Simply copy the input
- while (block_offset + TILE_ITEMS <= block_oob)
- {
- T items[ITEMS_PER_THREAD];
-
- LoadStriped<LOAD_DEFAULT, BLOCK_THREADS>(threadIdx.x, d_in + block_offset, items);
- __syncthreads();
- StoreStriped<STORE_DEFAULT, BLOCK_THREADS>(threadIdx.x, d_out + block_offset, items);
-
- block_offset += TILE_ITEMS;
- }
-
- // Clean up last partial tile with guarded-I/O
- if (block_offset < block_oob)
- {
- SizeT valid_items = block_oob - block_offset;
-
- T items[ITEMS_PER_THREAD];
-
- LoadStriped<LOAD_DEFAULT, BLOCK_THREADS>(threadIdx.x, d_in + block_offset, items, valid_items);
- __syncthreads();
- StoreStriped<STORE_DEFAULT, BLOCK_THREADS>(threadIdx.x, d_out + block_offset, items, valid_items);
- }
- }
-
-
- /**
- * Copy tiles within the range of input (specialized for NullType)
- */
- __device__ __forceinline__ void Copy(
- NullType *d_in,
- NullType *d_out,
- SizeT block_offset,
- SizeT block_oob)
- {}
-
-
- //---------------------------------------------------------------------
- // Interface
- //---------------------------------------------------------------------
-
- /**
- * Constructor
- */
- __device__ __forceinline__ BlockRadixSortDownsweepTiles(
- TempStorage &temp_storage,
- SizeT bin_offset,
- Key *d_keys_in,
- Key *d_keys_out,
- Value *d_values_in,
- Value *d_values_out,
- int current_bit)
- :
- temp_storage(temp_storage.Alias()),
- bin_offset(bin_offset),
- d_keys_in(reinterpret_cast<UnsignedBits*>(d_keys_in)),
- d_keys_out(reinterpret_cast<UnsignedBits*>(d_keys_out)),
- d_values_in(d_values_in),
- d_values_out(d_values_out),
- current_bit(current_bit),
- short_circuit(false)
- {}
-
-
- /**
- * Constructor
- */
- __device__ __forceinline__ BlockRadixSortDownsweepTiles(
- TempStorage &temp_storage,
- SizeT num_items,
- SizeT *d_spine,
- Key *d_keys_in,
- Key *d_keys_out,
- Value *d_values_in,
- Value *d_values_out,
- int current_bit)
- :
- temp_storage(temp_storage.Alias()),
- d_keys_in(reinterpret_cast<UnsignedBits*>(d_keys_in)),
- d_keys_out(reinterpret_cast<UnsignedBits*>(d_keys_out)),
- d_values_in(d_values_in),
- d_values_out(d_values_out),
- current_bit(current_bit)
- {
- // Load digit bin offsets (each of the first RADIX_DIGITS threads will load an offset for that digit)
- if (threadIdx.x < RADIX_DIGITS)
- {
-            // Short-circuit if the first block's histogram has bin counts of only zeros or problem-size
- SizeT first_block_bin_offset = d_spine[gridDim.x * threadIdx.x];
- int predicate = ((first_block_bin_offset == 0) || (first_block_bin_offset == num_items));
- this->temp_storage.short_circuit = WarpAll(predicate);
-
- // Load my block's bin offset for my bin
- bin_offset = d_spine[(gridDim.x * threadIdx.x) + blockIdx.x];
- }
-
- __syncthreads();
-
- short_circuit = this->temp_storage.short_circuit;
- }
-
-
- /**
- * Distribute keys from a segment of input tiles.
- */
- __device__ __forceinline__ void ProcessTiles(
- SizeT block_offset,
- const SizeT &block_oob)
- {
- if (short_circuit)
- {
- // Copy keys
- Copy(d_keys_in, d_keys_out, block_offset, block_oob);
-
- // Copy values
- Copy(d_values_in, d_values_out, block_offset, block_oob);
- }
- else
- {
- // Process full tiles of tile_items
- while (block_offset + TILE_ITEMS <= block_oob)
- {
- ProcessTile<true>(block_offset);
- block_offset += TILE_ITEMS;
- }
-
- // Clean up last partial tile with guarded-I/O
- if (block_offset < block_oob)
- {
- ProcessTile<false>(block_offset, block_oob - block_offset);
- }
- }
- }
-};
-
-
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
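Together with the upsweep file whose deletion begins below, the downsweep code above implements one pass of a least-significant-digit radix sort: the upsweep histograms the current digit per block, a spine scan turns the counts into global base offsets (d_spine), and the downsweep ranks and scatters each key to base[digit] + rank. A sequential model of one such pass (illustrative only; the real kernels also bit-twiddle keys so signed and floating-point orderings map onto unsigned comparison):

    #include <cstdio>
    #include <cstdint>
    #include <vector>

    // One counting-sort pass over radix_bits of the key: the serial analogue of
    // upsweep (histogram), spine (exclusive scan), and downsweep (stable scatter).
    void radix_pass(std::vector<uint32_t> &keys, int current_bit, int radix_bits)
    {
        const uint32_t radix_digits = 1u << radix_bits;
        const uint32_t mask = radix_digits - 1;

        std::vector<size_t> bin_count(radix_digits, 0), bin_offset(radix_digits, 0);
        for (uint32_t k : keys)                          // upsweep: digit histogram
            ++bin_count[(k >> current_bit) & mask];

        for (uint32_t d = 1; d < radix_digits; ++d)      // spine: exclusive scan
            bin_offset[d] = bin_offset[d - 1] + bin_count[d - 1];

        std::vector<uint32_t> out(keys.size());
        for (uint32_t k : keys)                          // downsweep: stable scatter
            out[bin_offset[(k >> current_bit) & mask]++] = k;
        keys.swap(out);
    }

    int main()
    {
        std::vector<uint32_t> keys = {0x31, 0x12, 0x23, 0x11};
        for (int bit = 0; bit < 8; bit += 4)             // two 4-bit passes
            radix_pass(keys, bit, 4);
        for (uint32_t k : keys) std::printf("0x%02x ", k); // 0x11 0x12 0x23 0x31
        return 0;
    }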
diff --git a/lib/kokkos/TPL/cub/device/block/block_radix_sort_upsweep_tiles.cuh b/lib/kokkos/TPL/cub/device/block/block_radix_sort_upsweep_tiles.cuh
deleted file mode 100755
index 22f8c9c75..000000000
--- a/lib/kokkos/TPL/cub/device/block/block_radix_sort_upsweep_tiles.cuh
+++ /dev/null
@@ -1,464 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * BlockRadixSortUpsweepTiles implements a stateful abstraction of CUDA thread blocks for participating in device-wide radix sort upsweep.
- */
-
-#pragma once
-
-#include "../../thread/thread_reduce.cuh"
-#include "../../thread/thread_load.cuh"
-#include "../../block/block_load.cuh"
-#include "../../util_type.cuh"
-#include "../../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-/******************************************************************************
- * Tuning policy types
- ******************************************************************************/
-
-/**
- * Tuning policy for BlockRadixSortUpsweepTiles
- */
-template <
- int _BLOCK_THREADS, ///< The number of threads per CTA
- int _ITEMS_PER_THREAD, ///< The number of items to load per thread per tile
- PtxLoadModifier _LOAD_MODIFIER, ///< Load cache-modifier
- int _RADIX_BITS> ///< The number of radix bits, i.e., log2(bins)
-struct BlockRadixSortUpsweepTilesPolicy
-{
- enum
- {
- BLOCK_THREADS = _BLOCK_THREADS,
- ITEMS_PER_THREAD = _ITEMS_PER_THREAD,
- RADIX_BITS = _RADIX_BITS,
- TILE_ITEMS = BLOCK_THREADS * ITEMS_PER_THREAD,
- };
-
- static const PtxLoadModifier LOAD_MODIFIER = _LOAD_MODIFIER;
-
- typedef BlockRadixSortUpsweepTilesPolicy<
- BLOCK_THREADS,
- ITEMS_PER_THREAD,
- LOAD_MODIFIER,
- CUB_MAX(1, RADIX_BITS - 1)> AltPolicy;
-};
-
-
-/******************************************************************************
- * Thread block abstractions
- ******************************************************************************/
-
-/**
- * \brief BlockRadixSortUpsweepTiles implements a stateful abstraction of CUDA thread blocks for participating in device-wide radix sort upsweep.
- *
- * Computes radix digit histograms over a range of input tiles.
- */
-template <
- typename BlockRadixSortUpsweepTilesPolicy,
- typename Key,
- typename SizeT>
-struct BlockRadixSortUpsweepTiles
-{
-
- //---------------------------------------------------------------------
- // Type definitions and constants
- //---------------------------------------------------------------------
-
- typedef typename Traits<Key>::UnsignedBits UnsignedBits;
-
- // Integer type for digit counters (to be packed into words of PackedCounters)
- typedef unsigned char DigitCounter;
-
- // Integer type for packing DigitCounters into columns of shared memory banks
- typedef unsigned int PackedCounter;
-
- static const PtxLoadModifier LOAD_MODIFIER = BlockRadixSortUpsweepTilesPolicy::LOAD_MODIFIER;
-
- enum
- {
- RADIX_BITS = BlockRadixSortUpsweepTilesPolicy::RADIX_BITS,
- BLOCK_THREADS = BlockRadixSortUpsweepTilesPolicy::BLOCK_THREADS,
- KEYS_PER_THREAD = BlockRadixSortUpsweepTilesPolicy::ITEMS_PER_THREAD,
-
- RADIX_DIGITS = 1 << RADIX_BITS,
-
- LOG_WARP_THREADS = PtxArchProps::LOG_WARP_THREADS,
- WARP_THREADS = 1 << LOG_WARP_THREADS,
- WARPS = (BLOCK_THREADS + WARP_THREADS - 1) / WARP_THREADS,
-
- TILE_ITEMS = BLOCK_THREADS * KEYS_PER_THREAD,
-
- BYTES_PER_COUNTER = sizeof(DigitCounter),
- LOG_BYTES_PER_COUNTER = Log2<BYTES_PER_COUNTER>::VALUE,
-
- PACKING_RATIO = sizeof(PackedCounter) / sizeof(DigitCounter),
- LOG_PACKING_RATIO = Log2<PACKING_RATIO>::VALUE,
-
- LOG_COUNTER_LANES = CUB_MAX(0, RADIX_BITS - LOG_PACKING_RATIO),
- COUNTER_LANES = 1 << LOG_COUNTER_LANES,
-
- // To prevent counter overflow, we must periodically unpack and aggregate the
- // digit counters back into registers. Each counter lane is assigned to a
- // warp for aggregation.
-
- LANES_PER_WARP = CUB_MAX(1, (COUNTER_LANES + WARPS - 1) / WARPS),
-
- // Unroll tiles in batches without risk of counter overflow
- UNROLL_COUNT = CUB_MIN(64, 255 / KEYS_PER_THREAD),
- UNROLLED_ELEMENTS = UNROLL_COUNT * TILE_ITEMS,
- };
-
-
-
- /**
- * Shared memory storage layout
- */
- struct _TempStorage
- {
- union
- {
- DigitCounter digit_counters[COUNTER_LANES][BLOCK_THREADS][PACKING_RATIO];
- PackedCounter packed_counters[COUNTER_LANES][BLOCK_THREADS];
- SizeT digit_partials[RADIX_DIGITS][WARP_THREADS + 1];
- };
- };
-
-
- /// Alias wrapper allowing storage to be unioned
- struct TempStorage : Uninitialized<_TempStorage> {};
-
-
- //---------------------------------------------------------------------
- // Thread fields (aggregate state bundle)
- //---------------------------------------------------------------------
-
- // Shared storage for this CTA
- _TempStorage &temp_storage;
-
- // Thread-local counters for periodically aggregating composite-counter lanes
- SizeT local_counts[LANES_PER_WARP][PACKING_RATIO];
-
- // Input and output device pointers
- UnsignedBits *d_keys_in;
-
- // The least-significant bit position of the current digit to extract
- int current_bit;
-
-
-
- //---------------------------------------------------------------------
- // Helper structure for templated iteration
- //---------------------------------------------------------------------
-
- // Iterate
- template <int COUNT, int MAX>
- struct Iterate
- {
- enum {
- HALF = (MAX / 2),
- };
-
- // BucketKeys
- static __device__ __forceinline__ void BucketKeys(
- BlockRadixSortUpsweepTiles &cta,
- UnsignedBits keys[KEYS_PER_THREAD])
- {
- cta.Bucket(keys[COUNT]);
-
- // Next
- Iterate<COUNT + 1, MAX>::BucketKeys(cta, keys);
- }
-
- // ProcessTiles
- static __device__ __forceinline__ void ProcessTiles(BlockRadixSortUpsweepTiles &cta, SizeT block_offset)
- {
- // Next
- Iterate<1, HALF>::ProcessTiles(cta, block_offset);
- Iterate<1, MAX - HALF>::ProcessTiles(cta, block_offset + (HALF * TILE_ITEMS));
- }
- };
-
- // Terminate
- template <int MAX>
- struct Iterate<MAX, MAX>
- {
- // BucketKeys
- static __device__ __forceinline__ void BucketKeys(BlockRadixSortUpsweepTiles &cta, UnsignedBits keys[KEYS_PER_THREAD]) {}
-
- // ProcessTiles
- static __device__ __forceinline__ void ProcessTiles(BlockRadixSortUpsweepTiles &cta, SizeT block_offset)
- {
- cta.ProcessFullTile(block_offset);
- }
- };
-
-
- //---------------------------------------------------------------------
- // Utility methods
- //---------------------------------------------------------------------
-
- /**
- * Decode a key and increment corresponding smem digit counter
- */
- __device__ __forceinline__ void Bucket(UnsignedBits key)
- {
- // Perform transform op
- UnsignedBits converted_key = Traits<Key>::TwiddleIn(key);
-
- // Add in sub-counter offset
- UnsignedBits sub_counter = BFE(converted_key, current_bit, LOG_PACKING_RATIO);
-
- // Add in row offset
- UnsignedBits row_offset = BFE(converted_key, current_bit + LOG_PACKING_RATIO, LOG_COUNTER_LANES);
-
- // Increment counter
- temp_storage.digit_counters[row_offset][threadIdx.x][sub_counter]++;
-
- }
-
-
- /**
- * Reset composite counters
- */
- __device__ __forceinline__ void ResetDigitCounters()
- {
- #pragma unroll
- for (int LANE = 0; LANE < COUNTER_LANES; LANE++)
- {
- temp_storage.packed_counters[LANE][threadIdx.x] = 0;
- }
- }
-
-
- /**
- * Reset the unpacked counters in each thread
- */
- __device__ __forceinline__ void ResetUnpackedCounters()
- {
- #pragma unroll
- for (int LANE = 0; LANE < LANES_PER_WARP; LANE++)
- {
- #pragma unroll
- for (int UNPACKED_COUNTER = 0; UNPACKED_COUNTER < PACKING_RATIO; UNPACKED_COUNTER++)
- {
- local_counts[LANE][UNPACKED_COUNTER] = 0;
- }
- }
- }
-
-
- /**
- * Extracts and aggregates the digit counters for each counter lane
- * owned by this warp
- */
- __device__ __forceinline__ void UnpackDigitCounts()
- {
- unsigned int warp_id = threadIdx.x >> LOG_WARP_THREADS;
- unsigned int warp_tid = threadIdx.x & (WARP_THREADS - 1);
-
- #pragma unroll
- for (int LANE = 0; LANE < LANES_PER_WARP; LANE++)
- {
- const int counter_lane = (LANE * WARPS) + warp_id;
- if (counter_lane < COUNTER_LANES)
- {
- #pragma unroll
- for (int PACKED_COUNTER = 0; PACKED_COUNTER < BLOCK_THREADS; PACKED_COUNTER += WARP_THREADS)
- {
- #pragma unroll
- for (int UNPACKED_COUNTER = 0; UNPACKED_COUNTER < PACKING_RATIO; UNPACKED_COUNTER++)
- {
- SizeT counter = temp_storage.digit_counters[counter_lane][warp_tid + PACKED_COUNTER][UNPACKED_COUNTER];
- local_counts[LANE][UNPACKED_COUNTER] += counter;
- }
- }
- }
- }
- }
-
-
- /**
- * Places unpacked counters into smem for final digit reduction
- */
- __device__ __forceinline__ void ReduceUnpackedCounts(SizeT &bin_count)
- {
- unsigned int warp_id = threadIdx.x >> LOG_WARP_THREADS;
- unsigned int warp_tid = threadIdx.x & (WARP_THREADS - 1);
-
- // Place unpacked digit counters in shared memory
- #pragma unroll
- for (int LANE = 0; LANE < LANES_PER_WARP; LANE++)
- {
- int counter_lane = (LANE * WARPS) + warp_id;
- if (counter_lane < COUNTER_LANES)
- {
- int digit_row = counter_lane << LOG_PACKING_RATIO;
-
- #pragma unroll
- for (int UNPACKED_COUNTER = 0; UNPACKED_COUNTER < PACKING_RATIO; UNPACKED_COUNTER++)
- {
- temp_storage.digit_partials[digit_row + UNPACKED_COUNTER][warp_tid] =
- local_counts[LANE][UNPACKED_COUNTER];
- }
- }
- }
-
- __syncthreads();
-
- // Rake-reduce bin_count reductions
- if (threadIdx.x < RADIX_DIGITS)
- {
- bin_count = ThreadReduce<WARP_THREADS>(
- temp_storage.digit_partials[threadIdx.x],
- Sum());
- }
- }
-
-
- /**
- * Processes a single, full tile
- */
- __device__ __forceinline__ void ProcessFullTile(SizeT block_offset)
- {
- // Tile of keys
- UnsignedBits keys[KEYS_PER_THREAD];
-
- LoadStriped<LOAD_MODIFIER, BLOCK_THREADS>(threadIdx.x, d_keys_in + block_offset, keys);
-
- // Prevent hoisting
-// __threadfence_block();
-// __syncthreads();
-
- // Bucket tile of keys
- Iterate<0, KEYS_PER_THREAD>::BucketKeys(*this, keys);
- }
-
-
- /**
- * Processes a single load (may have some threads masked off)
- */
- __device__ __forceinline__ void ProcessPartialTile(
- SizeT block_offset,
- const SizeT &block_oob)
- {
- // Process partial tile if necessary using single loads
- block_offset += threadIdx.x;
- while (block_offset < block_oob)
- {
- // Load and bucket key
- UnsignedBits key = ThreadLoad<LOAD_MODIFIER>(d_keys_in + block_offset);
- Bucket(key);
- block_offset += BLOCK_THREADS;
- }
- }
-
-
- //---------------------------------------------------------------------
- // Interface
- //---------------------------------------------------------------------
-
- /**
- * Constructor
- */
- __device__ __forceinline__ BlockRadixSortUpsweepTiles(
- TempStorage &temp_storage,
- Key *d_keys_in,
- int current_bit)
- :
- temp_storage(temp_storage.Alias()),
- d_keys_in(reinterpret_cast<UnsignedBits*>(d_keys_in)),
- current_bit(current_bit)
- {}
-
-
- /**
- * Compute radix digit histograms from a segment of input tiles.
- */
- __device__ __forceinline__ void ProcessTiles(
- SizeT block_offset,
- const SizeT &block_oob,
- SizeT &bin_count) ///< [out] The digit count for the tid'th bin (valid in the first RADIX_DIGITS threads)
- {
- // Reset digit counters in smem and unpacked counters in registers
- ResetDigitCounters();
- ResetUnpackedCounters();
-
- // Unroll batches of full tiles
- while (block_offset + UNROLLED_ELEMENTS <= block_oob)
- {
- Iterate<0, UNROLL_COUNT>::ProcessTiles(*this, block_offset);
- block_offset += UNROLLED_ELEMENTS;
-
- __syncthreads();
-
- // Aggregate back into local_count registers to prevent overflow
- UnpackDigitCounts();
-
- __syncthreads();
-
- // Reset composite counters in lanes
- ResetDigitCounters();
- }
-
- // Unroll single full tiles
- while (block_offset + TILE_ITEMS <= block_oob)
- {
- ProcessFullTile(block_offset);
- block_offset += TILE_ITEMS;
- }
-
- // Process partial tile if necessary
- ProcessPartialTile(
- block_offset,
- block_oob);
-
- __syncthreads();
-
- // Aggregate back into local_count registers
- UnpackDigitCounts();
-
- __syncthreads();
-
- // Final raking reduction of counts by bin
- ReduceUnpackedCounts(bin_count);
- }
-
-};
-
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
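At its core, the upsweep pass above computes a digit histogram: each key's RADIX_BITS-wide digit at current_bit selects a bin to increment, and the packed 8-bit shared-memory counters with periodic unpacking only make that counting cheap and overflow-safe. A minimal serial sketch of the core arithmetic, with key twiddling for signed/floating-point keys omitted and illustrative names:

#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Histogram the RADIX_BITS-wide digit found at current_bit in every key.
constexpr int RADIX_BITS   = 4;
constexpr int RADIX_DIGITS = 1 << RADIX_BITS;

std::array<std::size_t, RADIX_DIGITS> digit_histogram(
    const std::vector<std::uint32_t> &keys, int current_bit)
{
    std::array<std::size_t, RADIX_DIGITS> bins{};
    for (std::uint32_t key : keys)
        ++bins[(key >> current_bit) & (RADIX_DIGITS - 1)];
    return bins;
}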
diff --git a/lib/kokkos/TPL/cub/device/block/block_reduce_by_key_tiles.cuh b/lib/kokkos/TPL/cub/device/block/block_reduce_by_key_tiles.cuh
deleted file mode 100755
index 99e1980b6..000000000
--- a/lib/kokkos/TPL/cub/device/block/block_reduce_by_key_tiles.cuh
+++ /dev/null
@@ -1,399 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * cub::BlockReduceByKeyiles implements a stateful abstraction of CUDA thread blocks for participating in device-wide reduce-value-by-key.
- */
-
-#pragma once
-
-#include <iterator>
-
-#include "scan_tiles_types.cuh"
-#include "../../block/block_load.cuh"
-#include "../../block/block_discontinuity.cuh"
-#include "../../block/block_scan.cuh"
-#include "../../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/******************************************************************************
- * Utility data types
- ******************************************************************************/
-
-/// Scan tuple data type for reduce-value-by-key
-template <typename Value, typename SizeT>
-struct ReduceByKeyuple
-{
- Value value; // Initially set as value, contains segment aggregate after prefix scan
- SizeT flag; // Initially set as a tail flag, contains scatter offset after prefix scan
-};
-
-
-/// Binary reduce-by-key scan operator
-template <typename ReductionOp>
-struct ReduceByKeyScanOp
-{
- /// Reduction functor
- ReductionOp reduction_op;
-
- /// Constructor
- ReduceByKeyScanOp(ReductionOp reduction_op) : reduction_op(reduction_op)
- {}
-
- /// Binary scan operator
- template <typename ReduceByKeyuple>
- __device__ __forceinline__ ReduceByKeyuple operator()(
- const ReduceByKeyuple &first,
- const ReduceByKeyuple &second)
- {
- ReduceByKeyuple retval;
- retval.value = (second.flag) ? second.value : reduction_op(first.value, second.value);
- retval.flag = first.flag + second.flag;
- return retval;
- }
-};
-
-
-
-/******************************************************************************
- * Tuning policy types
- ******************************************************************************/
-
-/**
- * Tuning policy for BlockReduceByKeyiles
- */
-template <
- int _BLOCK_THREADS,
- int _ITEMS_PER_THREAD,
- BlockLoadAlgorithm _LOAD_ALGORITHM,
- bool _LOAD_WARP_TIME_SLICING,
- PtxLoadModifier _LOAD_MODIFIER,
- BlockScanAlgorithm _SCAN_ALGORITHM>
-struct BlockReduceByKeyilesPolicy
-{
- enum
- {
- BLOCK_THREADS = _BLOCK_THREADS,
- ITEMS_PER_THREAD = _ITEMS_PER_THREAD,
- LOAD_WARP_TIME_SLICING = _LOAD_WARP_TIME_SLICING,
- };
-
- static const BlockLoadAlgorithm LOAD_ALGORITHM = _LOAD_ALGORITHM;
- static const PtxLoadModifier LOAD_MODIFIER = _LOAD_MODIFIER;
- static const BlockScanAlgorithm SCAN_ALGORITHM = _SCAN_ALGORITHM;
-};
-
-
-/******************************************************************************
- * Thread block abstractions
- ******************************************************************************/
-
-/**
- * \brief BlockReduceByKeyiles implements a stateful abstraction of CUDA thread blocks for participating in device-wide reduce-by-key.
- */
-template <
- typename BlockReduceByKeyilesPolicy, ///< Tuning policy
- typename KeyInputIteratorRA, ///< Random-access input iterator type for keys
- typename KeyOutputIteratorRA, ///< Random-access output iterator type for keys
- typename ValueInputIteratorRA, ///< Random-access input iterator type for values
- typename ValueOutputIteratorRA, ///< Random-access output iterator type for values
- typename ReductionOp, ///< Reduction functor type
- typename SizeT> ///< Offset integer type
-struct BlockReduceByKeyiles
-{
- //---------------------------------------------------------------------
- // Types and constants
- //---------------------------------------------------------------------
-
- // Data types of input iterators
- typedef typename std::iterator_traits<KeyInputIteratorRA>::value_type Key; // Key data type
- typedef typename std::iterator_traits<ValueInputIteratorRA>::value_type Value; // Value data type
-
- // Constants
- enum
- {
- BLOCK_THREADS = BlockReduceByKeyilesPolicy::BLOCK_THREADS,
- ITEMS_PER_THREAD = BlockReduceByKeyilesPolicy::ITEMS_PER_THREAD,
- TILE_ITEMS = BLOCK_THREADS * ITEMS_PER_THREAD,
- STATUS_PADDING = PtxArchProps::WARP_THREADS,
- };
-
- // Block load type for keys
- typedef BlockLoad<
- KeyInputIteratorRA,
- BlockReduceByKeyilesPolicy::BLOCK_THREADS,
- BlockReduceByKeyilesPolicy::ITEMS_PER_THREAD,
- BlockReduceByKeyilesPolicy::LOAD_ALGORITHM,
- BlockReduceByKeyilesPolicy::LOAD_MODIFIER,
- BlockReduceByKeyilesPolicy::LOAD_WARP_TIME_SLICING> BlockLoadKeys;
-
- // Block load type for values
- typedef BlockLoad<
- ValueInputIteratorRA,
- BlockReduceByKeyilesPolicy::BLOCK_THREADS,
- BlockReduceByKeyilesPolicy::ITEMS_PER_THREAD,
- BlockReduceByKeyilesPolicy::LOAD_ALGORITHM,
- BlockReduceByKeyilesPolicy::LOAD_MODIFIER,
- BlockReduceByKeyilesPolicy::LOAD_WARP_TIME_SLICING> BlockLoadValues;
-
- // Block discontinuity type for setting tail flags
- typedef BlockDiscontinuity<Key, BLOCK_THREADS> BlockDiscontinuityKeys;
-
- // Scan tuple type
- typedef ReduceByKeyuple<Value, SizeT> ScanTuple;
-
- // Tile status descriptor type
- typedef ScanTileDescriptor<ScanTuple> ScanTileDescriptorT;
-
- // Block scan functor type
- typedef ReduceByKeyScanOp<ReductionOp> ScanOp;
-
- // Block scan prefix callback type
- typedef DeviceScanBlockPrefixOp<ScanTuple, ScanOp> PrefixCallback;
-
- // Block scan type
- typedef BlockScan<
- ScanTuple,
- BlockReduceByKeyilesPolicy::BLOCK_THREADS,
- BlockReduceByKeyilesPolicy::SCAN_ALGORITHM> BlockScanT;
-
- /// Shared memory type for this threadblock
- struct _TempStorage
- {
- union
- {
- typename BlockLoadKeys::TempStorage load_keys; // Smem needed for loading tiles of keys
- typename BlockLoadValues::TempStorage load_values; // Smem needed for loading tiles of values
- struct
- {
- typename BlockScanT::TempStorage scan; // Smem needed for tile scanning
- typename PrefixCallback::TempStorage prefix; // Smem needed for cooperative prefix callback
- };
- };
-
- typename BlockDiscontinuityKeys::TempStorage flagging; // Smem needed for discontinuity flagging
- SizeT tile_idx; // Shared tile index
- };
-
- /// Alias wrapper allowing storage to be unioned
- struct TempStorage : Uninitialized<_TempStorage> {};
-
-
-
- //---------------------------------------------------------------------
- // Per-thread fields
- //---------------------------------------------------------------------
-
- _TempStorage &temp_storage; ///< Reference to temp_storage
- KeyInputIteratorRA d_keys_in; ///< Key input data
- KeyOutputIteratorRA d_keys_out; ///< Key output data
- ValueInputIteratorRA d_values_in; ///< Value input data
- ValueOutputIteratorRA d_values_out; ///< Value output data
- ScanTileDescriptorT *d_tile_status; ///< Global list of tile status
- ScanOp scan_op; ///< Binary scan operator
- int num_tiles; ///< Total number of input tiles for the entire problem
- SizeT num_items; ///< Total number of scan items for the entire problem
-
-
- //---------------------------------------------------------------------
- // Interface
- //---------------------------------------------------------------------
-
- // Constructor
- __device__ __forceinline__
- BlockReduceByKeyiles(
- TempStorage &temp_storage, ///< Reference to temp_storage
- KeyInputIteratorRA d_keys_in, ///< Key input data
- KeyOutputIteratorRA d_keys_out, ///< Key output data
- ValueInputIteratorRA d_values_in, ///< Value input data
- ValueOutputIteratorRA d_values_out, ///< Value output data
- ScanTileDescriptorT *d_tile_status, ///< Global list of tile status
- ReductionOp reduction_op, ///< Binary scan operator
- int num_tiles, ///< Total number of input tiles for the entire problem
- SizeT num_items) ///< Total number of scan items for the entire problem
- :
- temp_storage(temp_storage.Alias()),
- d_keys_in(d_keys_in),
- d_keys_out(d_keys_out),
- d_values_in(d_values_in),
- d_values_out(d_values_out),
- d_tile_status(d_tile_status),
- scan_op(reduction_op),
- num_tiles(num_tiles),
- num_items(num_items)
- {}
-
-
- /**
- * Process a tile of input
- */
- template <bool FULL_TILE>
- __device__ __forceinline__ void ConsumeTile(
- int tile_idx, ///< Tile index
- SizeT block_offset, ///< Tile offset
- int valid_items = TILE_ITEMS) ///< Number of valid items in the tile
- {
- Key keys[ITEMS_PER_THREAD];
- Value values[ITEMS_PER_THREAD];
- int tail_flags[ITEMS_PER_THREAD];
- ScanTuple scan_tuples[ITEMS_PER_THREAD];
-
- // Load keys
- if (FULL_TILE)
- BlockLoadKeys(temp_storage.load_keys).Load(d_keys_in + block_offset, keys);
- else
- BlockLoadKeys(temp_storage.load_keys).Load(d_keys_in + block_offset, keys, valid_items);
-
- // Set tail flags
- if (tile_idx == num_tiles - 1)
- {
- // Last tile
- BlockDiscontinuityKeys(temp_storage.flagging).FlagTails(tail_flags, keys, Equality());
- }
- else
- {
- // Preceding tiles require the first element of the next tile
- Key tile_suffix_item;
- if (threadIdx.x == 0)
- tile_suffix_item = d_keys_in[block_offset + TILE_ITEMS];
-
- BlockDiscontinuityKeys(temp_storage.flagging).FlagTails(tail_flags, keys, Equality(), tile_suffix_item);
- }
-
- __syncthreads();
-
- // Load values
- if (FULL_TILE)
- BlockLoadValues(temp_storage.load_values).Load(d_values_in + block_offset, values);
- else
- BlockLoadValues(temp_storage.load_values).Load(d_values_in + block_offset, values, valid_items);
-
- // Assemble scan tuples
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ++ITEM)
- {
- scan_tuples[ITEM].value = values[ITEM];
- scan_tuples[ITEM].flag = tail_flags[ITEM];
- }
-
- __syncthreads();
-
- // Perform inclusive prefix scan
- ScanTuple block_aggregate;
- if (tile_idx == 0)
- {
- // Without prefix callback
- BlockScanT(temp_storage.scan).InclusiveScan(scan_tuples, scan_tuples, scan_op, block_aggregate);
-
- // Update tile status
- if (threadIdx.x == 0)
- ScanTileDescriptorT::SetPrefix(d_tile_status, block_aggregate);
- }
- else
- {
- // With prefix callback
- PrefixCallback prefix_op(d_tile_status, temp_storage.prefix, scan_op, tile_idx);
- BlockScanT(temp_storage.scan).InclusiveScan(scan_tuples, scan_tuples, scan_op, block_aggregate, prefix_op);
- }
-
- // Scatter flagged keys and values to output
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ++ITEM)
- {
- int tile_item = (threadIdx.x * ITEMS_PER_THREAD) + ITEM;
-
- // Set the tail flag on the last item in a partially-full tile
- if (!FULL_TILE && (tile_item == valid_items - 1))
- tail_flags[ITEM] = 1;
-
- // Decrement scatter offset
- scan_tuples[ITEM].flag--;
-
- // Scatter key and aggregate value if flagged and in range
- if ((FULL_TILE || (tile_item < valid_items)) && (tail_flags[ITEM]))
- {
- d_keys_out[scan_tuples[ITEM].flag] = keys[ITEM];
- d_values_out[scan_tuples[ITEM].flag] = scan_tuples[ITEM].value;
- }
- }
- }
-
-
-
- /**
- * Dequeue and scan tiles of elements
- */
- __device__ __forceinline__ void ProcessTiles(GridQueue<int> queue) ///< Queue descriptor for assigning tiles of work to thread blocks
- {
- // We give each thread block at least one tile of input
- int tile_idx = blockIdx.x;
-
- // Consume full tiles of input
- SizeT block_offset = SizeT(TILE_ITEMS) * tile_idx;
- while (block_offset + TILE_ITEMS <= num_items)
- {
- ConsumeTile<true>(tile_idx, block_offset);
-
- // Get next tile
-#if CUB_PTX_ARCH < 200
- // No concurrent kernels allowed, so just stripe tiles
- tile_idx += gridDim.x;
-#else
- // Concurrent kernels are allowed, so we must only use active blocks to dequeue tile indices
- if (threadIdx.x == 0)
- temp_storage.tile_idx = queue.Drain(1) + gridDim.x;
-
- __syncthreads();
-
- tile_idx = temp_storage.tile_idx;
-#endif
- block_offset = SizeT(TILE_ITEMS) * tile_idx;
- }
-
- // Consume a partially-full tile
- if (block_offset < num_items)
- {
- // Consume a partially-full tile
- int valid_items = num_items - block_offset;
- ConsumeTile<false>(tile_idx, block_offset, valid_items);
- }
- }
-
-};
-
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
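The reduce-by-key machinery above hinges on the (value, flag) scan operator: when the right-hand operand carries a segment-tail flag, its value restarts the running aggregate, and flags accumulate so they can later serve as scatter offsets. A stand-alone sketch of that operator, fixed to int values and addition for brevity; Tuple and combine are illustrative names:

#include <cstddef>

// A set tail flag on the right-hand operand starts a fresh running
// aggregate; flags themselves simply add up across the scan.
struct Tuple
{
    int         value;
    std::size_t flag;
};

Tuple combine(const Tuple &first, const Tuple &second)
{
    Tuple out;
    out.value = second.flag ? second.value : first.value + second.value;
    out.flag  = first.flag + second.flag;
    return out;
}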
diff --git a/lib/kokkos/TPL/cub/device/block/block_reduce_tiles.cuh b/lib/kokkos/TPL/cub/device/block/block_reduce_tiles.cuh
deleted file mode 100755
index a83c098ae..000000000
--- a/lib/kokkos/TPL/cub/device/block/block_reduce_tiles.cuh
+++ /dev/null
@@ -1,375 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * cub::BlockReduceTiles implements a stateful abstraction of CUDA thread blocks for participating in device-wide reduction.
- */
-
-#pragma once
-
-#include <iterator>
-
-#include "../../block/block_load.cuh"
-#include "../../block/block_reduce.cuh"
-#include "../../grid/grid_mapping.cuh"
-#include "../../grid/grid_queue.cuh"
-#include "../../grid/grid_even_share.cuh"
-#include "../../util_vector.cuh"
-#include "../../util_namespace.cuh"
-
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/******************************************************************************
- * Tuning policy types
- ******************************************************************************/
-
-/**
- * Tuning policy for BlockReduceTiles
- */
-template <
- int _BLOCK_THREADS, ///< Threads per thread block
- int _ITEMS_PER_THREAD, ///< Items per thread per tile of input
- int _VECTOR_LOAD_LENGTH, ///< Number of items per vectorized load
- BlockReduceAlgorithm _BLOCK_ALGORITHM, ///< Cooperative block-wide reduction algorithm to use
- PtxLoadModifier _LOAD_MODIFIER, ///< PTX load modifier
- GridMappingStrategy _GRID_MAPPING> ///< How to map tiles of input onto thread blocks
-struct BlockReduceTilesPolicy
-{
- enum
- {
- BLOCK_THREADS = _BLOCK_THREADS,
- ITEMS_PER_THREAD = _ITEMS_PER_THREAD,
- VECTOR_LOAD_LENGTH = _VECTOR_LOAD_LENGTH,
- };
-
- static const BlockReduceAlgorithm BLOCK_ALGORITHM = _BLOCK_ALGORITHM;
- static const GridMappingStrategy GRID_MAPPING = _GRID_MAPPING;
- static const PtxLoadModifier LOAD_MODIFIER = _LOAD_MODIFIER;
-};
-
-
-
-/******************************************************************************
- * Thread block abstractions
- ******************************************************************************/
-
-/**
- * \brief BlockReduceTiles implements a stateful abstraction of CUDA thread blocks for participating in device-wide reduction.
- *
- * Each thread reduces only the values it loads. For the first tile it
- * consumes, this partial reduction is assigned to \p thread_aggregate;
- * for every subsequent tile it is accumulated into \p thread_aggregate.
- */
-template <
- typename BlockReduceTilesPolicy,
- typename InputIteratorRA,
- typename SizeT,
- typename ReductionOp>
-struct BlockReduceTiles
-{
-
- //---------------------------------------------------------------------
- // Types and constants
- //---------------------------------------------------------------------
-
- typedef typename std::iterator_traits<InputIteratorRA>::value_type T; // Type of input iterator
- typedef VectorHelper<T, BlockReduceTilesPolicy::VECTOR_LOAD_LENGTH> VecHelper; // Helper type for vectorizing loads of T
- typedef typename VecHelper::Type VectorT; // Vector of T
-
- // Constants
- enum
- {
- BLOCK_THREADS = BlockReduceTilesPolicy::BLOCK_THREADS,
- ITEMS_PER_THREAD = BlockReduceTilesPolicy::ITEMS_PER_THREAD,
- TILE_ITEMS = BLOCK_THREADS * ITEMS_PER_THREAD,
- VECTOR_LOAD_LENGTH = BlockReduceTilesPolicy::VECTOR_LOAD_LENGTH,
-
- // Can vectorize according to the policy if the input iterator is a native pointer to a built-in primitive
- CAN_VECTORIZE = (BlockReduceTilesPolicy::VECTOR_LOAD_LENGTH > 1) &&
- (IsPointer<InputIteratorRA>::VALUE) &&
- (VecHelper::BUILT_IN),
-
- };
-
- static const PtxLoadModifier LOAD_MODIFIER = BlockReduceTilesPolicy::LOAD_MODIFIER;
- static const BlockReduceAlgorithm BLOCK_ALGORITHM = BlockReduceTilesPolicy::BLOCK_ALGORITHM;
-
- // Parameterized BlockReduce primitive
- typedef BlockReduce<T, BLOCK_THREADS, BlockReduceTilesPolicy::BLOCK_ALGORITHM> BlockReduceT;
-
- /// Shared memory type required by this thread block
- typedef typename BlockReduceT::TempStorage _TempStorage;
-
- /// Alias wrapper allowing storage to be unioned
- struct TempStorage : Uninitialized<_TempStorage> {};
-
-
- //---------------------------------------------------------------------
- // Per-thread fields
- //---------------------------------------------------------------------
-
- T thread_aggregate; ///< Each thread's partial reduction
- _TempStorage& temp_storage; ///< Reference to temp_storage
- InputIteratorRA d_in; ///< Input data to reduce
- ReductionOp reduction_op; ///< Binary reduction operator
- int first_tile_size; ///< Size of first tile consumed
- bool input_aligned; ///< Whether or not input is vector-aligned
-
-
- //---------------------------------------------------------------------
- // Interface
- //---------------------------------------------------------------------
-
- /**
- * Constructor
- */
- __device__ __forceinline__ BlockReduceTiles(
- TempStorage& temp_storage, ///< Reference to temp_storage
- InputIteratorRA d_in, ///< Input data to reduce
- ReductionOp reduction_op) ///< Binary reduction operator
- :
- temp_storage(temp_storage.Alias()),
- d_in(d_in),
- reduction_op(reduction_op),
- first_tile_size(0),
- input_aligned(CAN_VECTORIZE && ((size_t(d_in) & (sizeof(VectorT) - 1)) == 0))
- {}
-
-
- /**
- * Process a single tile of input
- */
- template <bool FULL_TILE>
- __device__ __forceinline__ void ConsumeTile(
- SizeT block_offset, ///< The offset of the tile to consume
- int valid_items = TILE_ITEMS) ///< The number of valid items in the tile
- {
- if (FULL_TILE)
- {
- T stripe_partial;
-
- // Load full tile
- if (input_aligned)
- {
- // Alias items as an array of VectorT and load it in striped fashion
- enum { WORDS = ITEMS_PER_THREAD / VECTOR_LOAD_LENGTH };
-
- VectorT vec_items[WORDS];
-
- // Load striped into vec items
- VectorT* alias_ptr = reinterpret_cast<VectorT*>(d_in + block_offset + (threadIdx.x * VECTOR_LOAD_LENGTH));
-
- #pragma unroll
- for (int i = 0; i < WORDS; ++i)
- vec_items[i] = alias_ptr[BLOCK_THREADS * i];
-
- // Reduce items within each thread stripe
- stripe_partial = ThreadReduce<ITEMS_PER_THREAD>(
- reinterpret_cast<T*>(vec_items),
- reduction_op);
- }
- else
- {
- T items[ITEMS_PER_THREAD];
-
- // Load items in striped fashion
- LoadStriped<LOAD_MODIFIER, BLOCK_THREADS>(threadIdx.x, d_in + block_offset, items);
-
- // Reduce items within each thread stripe
- stripe_partial = ThreadReduce(items, reduction_op);
- }
-
- // Update running thread aggregate
- thread_aggregate = (first_tile_size) ?
- reduction_op(thread_aggregate, stripe_partial) : // Update
- stripe_partial; // Assign
- }
- else
- {
-
- // Partial tile
- int thread_offset = threadIdx.x;
-
- if (!first_tile_size && (thread_offset < valid_items))
- {
- // Assign thread_aggregate
- thread_aggregate = ThreadLoad<LOAD_MODIFIER>(d_in + block_offset + thread_offset);
- thread_offset += BLOCK_THREADS;
- }
-
- while (thread_offset < valid_items)
- {
- // Update thread aggregate
- T item = ThreadLoad<LOAD_MODIFIER>(d_in + block_offset + thread_offset);
- thread_aggregate = reduction_op(thread_aggregate, item);
- thread_offset += BLOCK_THREADS;
- }
- }
-
- // Set first tile size if necessary
- if (!first_tile_size)
- first_tile_size = valid_items;
- }
-
-
- //---------------------------------------------------------------------
- // Consume a contiguous segment of tiles
- //---------------------------------------------------------------------
-
- /**
- * \brief Reduce a contiguous segment of input tiles
- */
- __device__ __forceinline__ void ConsumeTiles(
- SizeT block_offset, ///< [in] Threadblock begin offset (inclusive)
- SizeT block_oob, ///< [in] Threadblock end offset (exclusive)
- T &block_aggregate) ///< [out] Running total
- {
- // Consume subsequent full tiles of input
- while (block_offset + TILE_ITEMS <= block_oob)
- {
- ConsumeTile<true>(block_offset);
- block_offset += TILE_ITEMS;
- }
-
- // Consume a partially-full tile
- if (block_offset < block_oob)
- {
- int valid_items = block_oob - block_offset;
- ConsumeTile<false>(block_offset, valid_items);
- }
-
- // Compute block-wide reduction
- block_aggregate = (first_tile_size < TILE_ITEMS) ?
- BlockReduceT(temp_storage).Reduce(thread_aggregate, reduction_op, first_tile_size) :
- BlockReduceT(temp_storage).Reduce(thread_aggregate, reduction_op);
- }
-
-
- /**
- * Reduce a contiguous segment of input tiles
- */
- __device__ __forceinline__ void ConsumeTiles(
- SizeT num_items, ///< [in] Total number of global input items
- GridEvenShare<SizeT> &even_share, ///< [in] GridEvenShare descriptor
- GridQueue<SizeT> &queue, ///< [in,out] GridQueue descriptor
- T &block_aggregate, ///< [out] Running total
- Int2Type<GRID_MAPPING_EVEN_SHARE> is_even_share) ///< [in] Marker type indicating this is an even-share mapping
- {
- // Initialize even-share descriptor for this thread block
- even_share.BlockInit();
-
- // Consume input tiles
- ConsumeTiles(even_share.block_offset, even_share.block_oob, block_aggregate);
- }
-
-
- //---------------------------------------------------------------------
- // Dynamically consume tiles
- //---------------------------------------------------------------------
-
- /**
- * Dequeue and reduce tiles of items as part of an inter-block reduction
- */
- __device__ __forceinline__ void ConsumeTiles(
- int num_items, ///< Total number of input items
- GridQueue<SizeT> queue, ///< Queue descriptor for assigning tiles of work to thread blocks
- T &block_aggregate) ///< [out] Running total
- {
- // Shared dequeue offset
- __shared__ SizeT dequeue_offset;
-
- // We give each thread block at least one tile of input.
- SizeT block_offset = blockIdx.x * TILE_ITEMS;
- SizeT even_share_base = gridDim.x * TILE_ITEMS;
-
- if (block_offset + TILE_ITEMS <= num_items)
- {
- // Consume full tile of input
- ConsumeTile<true>(block_offset);
-
- // Dequeue more tiles
- while (true)
- {
- // Dequeue a tile of items
- if (threadIdx.x == 0)
- dequeue_offset = queue.Drain(TILE_ITEMS) + even_share_base;
-
- __syncthreads();
-
- // Grab tile offset and check if we're done with full tiles
- block_offset = dequeue_offset;
-
- __syncthreads();
-
- if (block_offset + TILE_ITEMS > num_items)
- break;
-
- // Consume a full tile
- ConsumeTile<true>(block_offset);
- }
- }
-
- if (block_offset < num_items)
- {
- int valid_items = num_items - block_offset;
- ConsumeTile<false>(block_offset, valid_items);
- }
-
- // Compute block-wide reduction
- block_aggregate = (first_tile_size < TILE_ITEMS) ?
- BlockReduceT(temp_storage).Reduce(thread_aggregate, reduction_op, first_tile_size) :
- BlockReduceT(temp_storage).Reduce(thread_aggregate, reduction_op);
- }
-
-
- /**
- * Dequeue and reduce tiles of items as part of an inter-block reduction
- */
- __device__ __forceinline__ void ConsumeTiles(
- SizeT num_items, ///< [in] Total number of global input items
- GridEvenShare<SizeT> &even_share, ///< [in] GridEvenShare descriptor
- GridQueue<SizeT> &queue, ///< [in,out] GridQueue descriptor
- T &block_aggregate, ///< [out] Running total
- Int2Type<GRID_MAPPING_DYNAMIC> is_dynamic) ///< [in] Marker type indicating this is a dynamic mapping
- {
- ConsumeTiles(num_items, queue, block_aggregate);
- }
-
-};
-
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
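Note that BlockReduceTiles above never needs an identity element of T: the first item a thread sees is assigned to its running aggregate and every later item is folded in with the reduction operator. A serial sketch of that assign-then-accumulate rule; reduce_range is an illustrative name:

#include <vector>

// The first item is assigned, later items are folded in with the operator,
// so T needs no identity element.
template <typename T, typename ReductionOp>
T reduce_range(const std::vector<T> &items, ReductionOp op)
{
    bool seeded = false;
    T aggregate{};                       // ignored until seeded
    for (const T &item : items)
    {
        aggregate = seeded ? op(aggregate, item) : item;
        seeded = true;
    }
    return aggregate;                    // meaningless if items is empty
}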
diff --git a/lib/kokkos/TPL/cub/device/block/block_scan_tiles.cuh b/lib/kokkos/TPL/cub/device/block/block_scan_tiles.cuh
deleted file mode 100755
index 980220480..000000000
--- a/lib/kokkos/TPL/cub/device/block/block_scan_tiles.cuh
+++ /dev/null
@@ -1,509 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * cub::BlockScanTiles implements a stateful abstraction of CUDA thread blocks for participating in device-wide prefix scan.
- */
-
-#pragma once
-
-#include <iterator>
-
-#include "scan_tiles_types.cuh"
-#include "../../block/block_load.cuh"
-#include "../../block/block_store.cuh"
-#include "../../block/block_scan.cuh"
-#include "../../grid/grid_queue.cuh"
-#include "../../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/******************************************************************************
- * Tuning policy types
- ******************************************************************************/
-
-/**
- * Tuning policy for BlockScanTiles
- */
-template <
- int _BLOCK_THREADS,
- int _ITEMS_PER_THREAD,
- BlockLoadAlgorithm _LOAD_ALGORITHM,
- bool _LOAD_WARP_TIME_SLICING,
- PtxLoadModifier _LOAD_MODIFIER,
- BlockStoreAlgorithm _STORE_ALGORITHM,
- bool _STORE_WARP_TIME_SLICING,
- BlockScanAlgorithm _SCAN_ALGORITHM>
-struct BlockScanTilesPolicy
-{
- enum
- {
- BLOCK_THREADS = _BLOCK_THREADS,
- ITEMS_PER_THREAD = _ITEMS_PER_THREAD,
- LOAD_WARP_TIME_SLICING = _LOAD_WARP_TIME_SLICING,
- STORE_WARP_TIME_SLICING = _STORE_WARP_TIME_SLICING,
- };
-
- static const BlockLoadAlgorithm LOAD_ALGORITHM = _LOAD_ALGORITHM;
- static const PtxLoadModifier LOAD_MODIFIER = _LOAD_MODIFIER;
- static const BlockStoreAlgorithm STORE_ALGORITHM = _STORE_ALGORITHM;
- static const BlockScanAlgorithm SCAN_ALGORITHM = _SCAN_ALGORITHM;
-};
-
-
-/******************************************************************************
- * Thread block abstractions
- ******************************************************************************/
-
-/**
- * \brief BlockScanTiles implements a stateful abstraction of CUDA thread blocks for participating in device-wide prefix scan.
- *
- * Implements a single-pass "domino" strategy with adaptive prefix lookback.
- */
-template <
- typename BlockScanTilesPolicy, ///< Tuning policy
- typename InputIteratorRA, ///< Input iterator type
- typename OutputIteratorRA, ///< Output iterator type
- typename ScanOp, ///< Scan functor type
- typename Identity, ///< Identity element type (cub::NullType for inclusive scan)
- typename SizeT> ///< Offset integer type
-struct BlockScanTiles
-{
- //---------------------------------------------------------------------
- // Types and constants
- //---------------------------------------------------------------------
-
- // Data type of input iterator
- typedef typename std::iterator_traits<InputIteratorRA>::value_type T;
-
- // Constants
- enum
- {
- INCLUSIVE = Equals<Identity, NullType>::VALUE, // Inclusive scan if no identity type is provided
- BLOCK_THREADS = BlockScanTilesPolicy::BLOCK_THREADS,
- ITEMS_PER_THREAD = BlockScanTilesPolicy::ITEMS_PER_THREAD,
- TILE_ITEMS = BLOCK_THREADS * ITEMS_PER_THREAD,
- };
-
- // Block load type
- typedef BlockLoad<
- InputIteratorRA,
- BlockScanTilesPolicy::BLOCK_THREADS,
- BlockScanTilesPolicy::ITEMS_PER_THREAD,
- BlockScanTilesPolicy::LOAD_ALGORITHM,
- BlockScanTilesPolicy::LOAD_MODIFIER,
- BlockScanTilesPolicy::LOAD_WARP_TIME_SLICING> BlockLoadT;
-
- // Block store type
- typedef BlockStore<
- OutputIteratorRA,
- BlockScanTilesPolicy::BLOCK_THREADS,
- BlockScanTilesPolicy::ITEMS_PER_THREAD,
- BlockScanTilesPolicy::STORE_ALGORITHM,
- STORE_DEFAULT,
- BlockScanTilesPolicy::STORE_WARP_TIME_SLICING> BlockStoreT;
-
- // Tile status descriptor type
- typedef ScanTileDescriptor<T> ScanTileDescriptorT;
-
- // Block scan type
- typedef BlockScan<
- T,
- BlockScanTilesPolicy::BLOCK_THREADS,
- BlockScanTilesPolicy::SCAN_ALGORITHM> BlockScanT;
-
- // Callback type for obtaining inter-tile prefix during block scan
- typedef DeviceScanBlockPrefixOp<T, ScanOp> InterblockPrefixOp;
-
- // Shared memory type for this threadblock
- struct _TempStorage
- {
- union
- {
- typename BlockLoadT::TempStorage load; // Smem needed for tile loading
- typename BlockStoreT::TempStorage store; // Smem needed for tile storing
- struct
- {
- typename InterblockPrefixOp::TempStorage prefix; // Smem needed for cooperative prefix callback
- typename BlockScanT::TempStorage scan; // Smem needed for tile scanning
- };
- };
-
- SizeT tile_idx; // Shared tile index
- };
-
- // Alias wrapper allowing storage to be unioned
- struct TempStorage : Uninitialized<_TempStorage> {};
-
-
- //---------------------------------------------------------------------
- // Per-thread fields
- //---------------------------------------------------------------------
-
- _TempStorage &temp_storage; ///< Reference to temp_storage
- InputIteratorRA d_in; ///< Input data
- OutputIteratorRA d_out; ///< Output data
- ScanOp scan_op; ///< Binary scan operator
- Identity identity; ///< Identity element
-
-
-
- //---------------------------------------------------------------------
- // Block scan utility methods (first tile)
- //---------------------------------------------------------------------
-
- /**
- * Exclusive scan specialization
- */
- template <typename _ScanOp, typename _Identity>
- __device__ __forceinline__
- void ScanBlock(T (&items)[ITEMS_PER_THREAD], _ScanOp scan_op, _Identity identity, T& block_aggregate)
- {
- BlockScanT(temp_storage.scan).ExclusiveScan(items, items, identity, scan_op, block_aggregate);
- }
-
- /**
- * Exclusive sum specialization
- */
- template <typename _Identity>
- __device__ __forceinline__
- void ScanBlock(T (&items)[ITEMS_PER_THREAD], Sum scan_op, _Identity identity, T& block_aggregate)
- {
- BlockScanT(temp_storage.scan).ExclusiveSum(items, items, block_aggregate);
- }
-
- /**
- * Inclusive scan specialization
- */
- template <typename _ScanOp>
- __device__ __forceinline__
- void ScanBlock(T (&items)[ITEMS_PER_THREAD], _ScanOp scan_op, NullType identity, T& block_aggregate)
- {
- BlockScanT(temp_storage.scan).InclusiveScan(items, items, scan_op, block_aggregate);
- }
-
- /**
- * Inclusive sum specialization
- */
- __device__ __forceinline__
- void ScanBlock(T (&items)[ITEMS_PER_THREAD], Sum scan_op, NullType identity, T& block_aggregate)
- {
- BlockScanT(temp_storage.scan).InclusiveSum(items, items, block_aggregate);
- }
-
- //---------------------------------------------------------------------
- // Block scan utility methods (subsequent tiles)
- //---------------------------------------------------------------------
-
- /**
- * Exclusive scan specialization (with prefix from predecessors)
- */
- template <typename _ScanOp, typename _Identity, typename PrefixCallback>
- __device__ __forceinline__
- void ScanBlock(T (&items)[ITEMS_PER_THREAD], _ScanOp scan_op, _Identity identity, T& block_aggregate, PrefixCallback &prefix_op)
- {
- BlockScanT(temp_storage.scan).ExclusiveScan(items, items, identity, scan_op, block_aggregate, prefix_op);
- }
-
- /**
- * Exclusive sum specialization (with prefix from predecessors)
- */
- template <typename _Identity, typename PrefixCallback>
- __device__ __forceinline__
- void ScanBlock(T (&items)[ITEMS_PER_THREAD], Sum scan_op, _Identity identity, T& block_aggregate, PrefixCallback &prefix_op)
- {
- BlockScanT(temp_storage.scan).ExclusiveSum(items, items, block_aggregate, prefix_op);
- }
-
- /**
- * Inclusive scan specialization (with prefix from predecessors)
- */
- template <typename _ScanOp, typename PrefixCallback>
- __device__ __forceinline__
- void ScanBlock(T (&items)[ITEMS_PER_THREAD], _ScanOp scan_op, NullType identity, T& block_aggregate, PrefixCallback &prefix_op)
- {
- BlockScanT(temp_storage.scan).InclusiveScan(items, items, scan_op, block_aggregate, prefix_op);
- }
-
- /**
- * Inclusive sum specialization (with prefix from predecessors)
- */
- template <typename PrefixCallback>
- __device__ __forceinline__
- void ScanBlock(T (&items)[ITEMS_PER_THREAD], Sum scan_op, NullType identity, T& block_aggregate, PrefixCallback &prefix_op)
- {
- BlockScanT(temp_storage.scan).InclusiveSum(items, items, block_aggregate, prefix_op);
- }
-
- //---------------------------------------------------------------------
- // Constructor
- //---------------------------------------------------------------------
-
- // Constructor
- __device__ __forceinline__
- BlockScanTiles(
- TempStorage &temp_storage, ///< Reference to temp_storage
- InputIteratorRA d_in, ///< Input data
- OutputIteratorRA d_out, ///< Output data
- ScanOp scan_op, ///< Binary scan operator
- Identity identity) ///< Identity element
- :
- temp_storage(temp_storage.Alias()),
- d_in(d_in),
- d_out(d_out),
- scan_op(scan_op),
- identity(identity)
- {}
-
-
- //---------------------------------------------------------------------
- // Domino scan
- //---------------------------------------------------------------------
-
- /**
- * Process a tile of input (domino scan)
- */
- template <bool FULL_TILE>
- __device__ __forceinline__ void ConsumeTile(
- SizeT num_items, ///< Total number of input items
- int tile_idx, ///< Tile index
- SizeT block_offset, ///< Tile offset
- ScanTileDescriptorT *d_tile_status) ///< Global list of tile status
- {
- // Load items
- T items[ITEMS_PER_THREAD];
-
- if (FULL_TILE)
- BlockLoadT(temp_storage.load).Load(d_in + block_offset, items);
- else
- BlockLoadT(temp_storage.load).Load(d_in + block_offset, items, num_items - block_offset);
-
- __syncthreads();
-
- T block_aggregate;
- if (tile_idx == 0)
- {
- ScanBlock(items, scan_op, identity, block_aggregate);
-
- // Update tile status if there are successor tiles
- if (FULL_TILE && (threadIdx.x == 0))
- ScanTileDescriptorT::SetPrefix(d_tile_status, block_aggregate);
- }
- else
- {
- InterblockPrefixOp prefix_op(d_tile_status, temp_storage.prefix, scan_op, tile_idx);
- ScanBlock(items, scan_op, identity, block_aggregate, prefix_op);
- }
-
- __syncthreads();
-
- // Store items
- if (FULL_TILE)
- BlockStoreT(temp_storage.store).Store(d_out + block_offset, items);
- else
- BlockStoreT(temp_storage.store).Store(d_out + block_offset, items, num_items - block_offset);
- }
-
- /**
- * Dequeue and scan tiles of items as part of a domino scan
- */
- __device__ __forceinline__ void ConsumeTiles(
- int num_items, ///< Total number of input items
- GridQueue<int> queue, ///< Queue descriptor for assigning tiles of work to thread blocks
- ScanTileDescriptorT *d_tile_status) ///< Global list of tile status
- {
-#if CUB_PTX_ARCH < 200
-
- // No concurrent kernels allowed and blocks are launched in increasing order, so just assign one tile per block (up to 65K blocks)
- int tile_idx = blockIdx.x;
- SizeT block_offset = SizeT(TILE_ITEMS) * tile_idx;
-
- if (block_offset + TILE_ITEMS <= num_items)
- ConsumeTile<true>(num_items, tile_idx, block_offset, d_tile_status);
- else if (block_offset < num_items)
- ConsumeTile<false>(num_items, tile_idx, block_offset, d_tile_status);
-
-#else
-
- // Get first tile
- if (threadIdx.x == 0)
- temp_storage.tile_idx = queue.Drain(1);
-
- __syncthreads();
-
- int tile_idx = temp_storage.tile_idx;
- SizeT block_offset = SizeT(TILE_ITEMS) * tile_idx;
-
- while (block_offset + TILE_ITEMS <= num_items)
- {
- // Consume full tile
- ConsumeTile<true>(num_items, tile_idx, block_offset, d_tile_status);
-
- // Get next tile
- if (threadIdx.x == 0)
- temp_storage.tile_idx = queue.Drain(1);
-
- __syncthreads();
-
- tile_idx = temp_storage.tile_idx;
- block_offset = SizeT(TILE_ITEMS) * tile_idx;
- }
-
- // Consume a partially-full tile
- if (block_offset < num_items)
- {
- ConsumeTile<false>(num_items, tile_idx, block_offset, d_tile_status);
- }
-#endif
-
- }
-
-
- //---------------------------------------------------------------------
- // Even-share scan
- //---------------------------------------------------------------------
-
- /**
- * Process a tile of input
- */
- template <
- bool FULL_TILE,
- bool FIRST_TILE>
- __device__ __forceinline__ void ConsumeTile(
- SizeT block_offset, ///< Tile offset
- RunningBlockPrefixOp<T> &prefix_op, ///< Running prefix operator
- int valid_items = TILE_ITEMS) ///< Number of valid items in the tile
- {
- // Load items
- T items[ITEMS_PER_THREAD];
-
- if (FULL_TILE)
- BlockLoadT(temp_storage.load).Load(d_in + block_offset, items);
- else
- BlockLoadT(temp_storage.load).Load(d_in + block_offset, items, valid_items);
-
- __syncthreads();
-
- // Block scan
- T block_aggregate;
- if (FIRST_TILE)
- {
- ScanBlock(items, scan_op, identity, block_aggregate);
- prefix_op.running_total = block_aggregate;
- }
- else
- {
- ScanBlock(items, scan_op, identity, block_aggregate, prefix_op);
- }
-
- __syncthreads();
-
- // Store items
- if (FULL_TILE)
- BlockStoreT(temp_storage.store).Store(d_out + block_offset, items);
- else
- BlockStoreT(temp_storage.store).Store(d_out + block_offset, items, valid_items);
- }
-
-
- /**
- * Scan a consecutive share of input tiles
- */
- __device__ __forceinline__ void ConsumeTiles(
- SizeT block_offset, ///< [in] Threadblock begin offset (inclusive)
- SizeT block_oob) ///< [in] Threadblock end offset (exclusive)
- {
- RunningBlockPrefixOp<T> prefix_op;
-
- if (block_offset + TILE_ITEMS <= block_oob)
- {
- // Consume first tile of input (full)
- ConsumeTile<true, true>(block_offset, prefix_op);
- block_offset += TILE_ITEMS;
-
- // Consume subsequent full tiles of input
- while (block_offset + TILE_ITEMS <= block_oob)
- {
- ConsumeTile<true, false>(block_offset, prefix_op);
- block_offset += TILE_ITEMS;
- }
-
- // Consume a partially-full tile
- if (block_offset < block_oob)
- {
- int valid_items = block_oob - block_offset;
- ConsumeTile<false, false>(block_offset, prefix_op, valid_items);
- }
- }
- else
- {
- // Consume the first tile of input (partially-full)
- int valid_items = block_oob - block_offset;
- ConsumeTile<false, true>(block_offset, prefix_op, valid_items);
- }
- }
-
-
- /**
- * Scan a consecutive share of input tiles, seeded with the specified prefix value
- */
- __device__ __forceinline__ void ConsumeTiles(
- SizeT block_offset, ///< [in] Threadblock begin offset (inclusive)
- SizeT block_oob, ///< [in] Threadblock end offset (exclusive)
- T prefix) ///< [in] The prefix to apply to the scan segment
- {
- RunningBlockPrefixOp<T> prefix_op;
- prefix_op.running_total = prefix;
-
- // Consume full tiles of input
- while (block_offset + TILE_ITEMS <= block_oob)
- {
- ConsumeTile<true, false>(block_offset, prefix_op);
- block_offset += TILE_ITEMS;
- }
-
- // Consume a partially-full tile
- if (block_offset < block_oob)
- {
- int valid_items = block_oob - block_offset;
- ConsumeTile<false, false>(block_offset, prefix_op, valid_items);
- }
- }
-
-};
-
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
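The even-share scan path of BlockScanTiles above carries a running total from tile to tile via RunningBlockPrefixOp, and the seeded ConsumeTiles overload simply starts that total at a caller-supplied prefix. A serial sketch of the same seeded, tile-by-tile inclusive scan; int values and addition are assumed, and scan_tiles_with_prefix is an illustrative name:

#include <algorithm>
#include <cstddef>
#include <vector>

// Scan each tile locally and carry the running total into the next tile.
void scan_tiles_with_prefix(std::vector<int> &data, std::size_t tile_items, int prefix)
{
    int running_total = prefix;
    for (std::size_t offset = 0; offset < data.size(); offset += tile_items)
    {
        std::size_t end = std::min(offset + tile_items, data.size());
        for (std::size_t i = offset; i < end; ++i)   // seeded inclusive scan of one tile
        {
            running_total += data[i];
            data[i] = running_total;
        }
    }
}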
diff --git a/lib/kokkos/TPL/cub/device/block/scan_tiles_types.cuh b/lib/kokkos/TPL/cub/device/block/scan_tiles_types.cuh
deleted file mode 100755
index 2b933d0af..000000000
--- a/lib/kokkos/TPL/cub/device/block/scan_tiles_types.cuh
+++ /dev/null
@@ -1,318 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * Utility types for device-wide scan
- */
-
-#pragma once
-
-#include <iterator>
-
-#include "../../thread/thread_load.cuh"
-#include "../../thread/thread_store.cuh"
-#include "../../warp/warp_reduce.cuh"
-#include "../../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/**
- * Enumerations of tile status
- */
-enum ScanTileStatus
-{
- SCAN_TILE_OOB, // Out-of-bounds (e.g., padding)
- SCAN_TILE_INVALID, // Not yet processed
- SCAN_TILE_PARTIAL, // Tile aggregate is available
- SCAN_TILE_PREFIX, // Inclusive tile prefix is available
-};
-
-
-/**
- * Data type of tile status descriptor.
- *
- * Specialized for scan status and value types that can be combined into the same
- * machine word that can be read/written coherently in a single access.
- */
-template <
- typename T,
- bool SINGLE_WORD = (PowerOfTwo<sizeof(T)>::VALUE && (sizeof(T) <= 8))>
-struct ScanTileDescriptor
-{
- // Status word type
- typedef typename If<(sizeof(T) == 8),
- long long,
- typename If<(sizeof(T) == 4),
- int,
- typename If<(sizeof(T) == 2),
- short,
- char>::Type>::Type>::Type StatusWord;
-
- // Vector word type
- typedef typename If<(sizeof(T) == 8),
- longlong2,
- typename If<(sizeof(T) == 4),
- int2,
- typename If<(sizeof(T) == 2),
- int,
- short>::Type>::Type>::Type VectorWord;
-
- T value;
- StatusWord status;
-
- static __device__ __forceinline__ void SetPrefix(ScanTileDescriptor *ptr, T prefix)
- {
- ScanTileDescriptor tile_descriptor;
- tile_descriptor.status = SCAN_TILE_PREFIX;
- tile_descriptor.value = prefix;
-
- VectorWord alias;
- *reinterpret_cast<ScanTileDescriptor*>(&alias) = tile_descriptor;
- ThreadStore<STORE_CG>(reinterpret_cast<VectorWord*>(ptr), alias);
- }
-
- static __device__ __forceinline__ void SetPartial(ScanTileDescriptor *ptr, T partial)
- {
- ScanTileDescriptor tile_descriptor;
- tile_descriptor.status = SCAN_TILE_PARTIAL;
- tile_descriptor.value = partial;
-
- VectorWord alias;
- *reinterpret_cast<ScanTileDescriptor*>(&alias) = tile_descriptor;
- ThreadStore<STORE_CG>(reinterpret_cast<VectorWord*>(ptr), alias);
- }
-
- static __device__ __forceinline__ void WaitForValid(
- ScanTileDescriptor *ptr,
- int &status,
- T &value)
- {
- ScanTileDescriptor tile_descriptor;
- while (true)
- {
- VectorWord alias = ThreadLoad<LOAD_CG>(reinterpret_cast<VectorWord*>(ptr));
-
- tile_descriptor = *reinterpret_cast<ScanTileDescriptor*>(&alias);
- if (tile_descriptor.status != SCAN_TILE_INVALID) break;
-
- __threadfence_block();
- }
-
- status = tile_descriptor.status;
- value = tile_descriptor.value;
- }
-
-};
-
-
-/**
- * Data type of tile status descriptor.
- *
- * Specialized for scan status and value types that cannot be fused into
- * the same machine word.
- */
-template <typename T>
-struct ScanTileDescriptor<T, false>
-{
- T prefix_value;
- T partial_value;
-
- /// Workaround for the fact that win32 doesn't guarantee 16B alignment for 16B values of T
- union
- {
- int status;
- Uninitialized<T> padding;
- };
-
- static __device__ __forceinline__ void SetPrefix(ScanTileDescriptor *ptr, T prefix)
- {
- ThreadStore<STORE_CG>(&ptr->prefix_value, prefix);
- __threadfence_block();
-// __threadfence(); // __threadfence_block seems sufficient on current architectures to prevent reordering
- ThreadStore<STORE_CG>(&ptr->status, (int) SCAN_TILE_PREFIX);
-
- }
-
- static __device__ __forceinline__ void SetPartial(ScanTileDescriptor *ptr, T partial)
- {
- ThreadStore<STORE_CG>(&ptr->partial_value, partial);
- __threadfence_block();
-// __threadfence(); // __threadfence_block seems sufficient on current architectures to prevent reordering
- ThreadStore<STORE_CG>(&ptr->status, (int) SCAN_TILE_PARTIAL);
- }
-
- static __device__ __forceinline__ void WaitForValid(
- ScanTileDescriptor *ptr,
- int &status,
- T &value)
- {
- while (true)
- {
- status = ThreadLoad<LOAD_CG>(&ptr->status);
- if (status != SCAN_TILE_INVALID) break;
-
- __threadfence_block();
- }
-
- value = (status == SCAN_TILE_PARTIAL) ?
- ThreadLoad<LOAD_CG>(&ptr->partial_value) :
- ThreadLoad<LOAD_CG>(&ptr->prefix_value);
- }
-};
-
-
-/**
- * Stateful prefix functor that provides the running prefix for
- * the current tile by using the callback warp to wait on
- * aggregates/prefixes from predecessor tiles to become available
- */
-template <
- typename T,
- typename ScanOp>
-struct DeviceScanBlockPrefixOp
-{
- // Parameterized warp reduce
- typedef WarpReduce<T> WarpReduceT;
-
- // Storage type
- typedef typename WarpReduceT::TempStorage _TempStorage;
-
- // Alias wrapper allowing storage to be unioned
- typedef Uninitialized<_TempStorage> TempStorage;
-
- // Tile status descriptor type
- typedef ScanTileDescriptor<T> ScanTileDescriptorT;
-
- // Fields
- ScanTileDescriptorT *d_tile_status; ///< Pointer to array of tile status
- _TempStorage &temp_storage; ///< Reference to a warp-reduction instance
- ScanOp scan_op; ///< Binary scan operator
- int tile_idx; ///< The current tile index
- T inclusive_prefix; ///< Inclusive prefix for the tile
-
- // Constructor
- __device__ __forceinline__
- DeviceScanBlockPrefixOp(
- ScanTileDescriptorT *d_tile_status,
- TempStorage &temp_storage,
- ScanOp scan_op,
- int tile_idx) :
- d_tile_status(d_tile_status),
- temp_storage(temp_storage.Alias()),
- scan_op(scan_op),
- tile_idx(tile_idx) {}
-
-
- // Block until all predecessors within the specified window have non-invalid status
- __device__ __forceinline__
- void ProcessWindow(
- int predecessor_idx,
- int &predecessor_status,
- T &window_aggregate)
- {
- T value;
- ScanTileDescriptorT::WaitForValid(d_tile_status + predecessor_idx, predecessor_status, value);
-
- // Perform a segmented reduction to get the prefix for the current window
- int flag = (predecessor_status != SCAN_TILE_PARTIAL);
- window_aggregate = WarpReduceT(temp_storage).TailSegmentedReduce(value, flag, scan_op);
- }
-
-
- // Prefix functor (called by the first warp)
- __device__ __forceinline__
- T operator()(T block_aggregate)
- {
- // Update our status with our tile-aggregate
- if (threadIdx.x == 0)
- {
- ScanTileDescriptorT::SetPartial(d_tile_status + tile_idx, block_aggregate);
- }
-
- // Wait for the window of predecessor tiles to become valid
- int predecessor_idx = tile_idx - threadIdx.x - 1;
- int predecessor_status;
- T window_aggregate;
- ProcessWindow(predecessor_idx, predecessor_status, window_aggregate);
-
- // The exclusive tile prefix starts out as the current window aggregate
- T exclusive_prefix = window_aggregate;
-
- // Keep sliding the window back until we come across a tile whose inclusive prefix is known
- while (WarpAll(predecessor_status != SCAN_TILE_PREFIX))
- {
- predecessor_idx -= PtxArchProps::WARP_THREADS;
-
- // Update exclusive tile prefix with the window prefix
- ProcessWindow(predecessor_idx, predecessor_status, window_aggregate);
- exclusive_prefix = scan_op(window_aggregate, exclusive_prefix);
- }
-
- // Compute the inclusive tile prefix and update the status for this tile
- if (threadIdx.x == 0)
- {
- inclusive_prefix = scan_op(exclusive_prefix, block_aggregate);
- ScanTileDescriptorT::SetPrefix(
- d_tile_status + tile_idx,
- inclusive_prefix);
- }
-
- // Return exclusive_prefix
- return exclusive_prefix;
- }
-};
-
-
-// Running scan prefix callback type for single-block scans.
-// Maintains a running prefix that can be applied to consecutive
-// scan operations.
-template <typename T>
-struct RunningBlockPrefixOp
-{
- // Running prefix
- T running_total;
-
- // Callback operator.
- __device__ T operator()(T block_aggregate)
- {
- T old_prefix = running_total;
- running_total += block_aggregate;
- return old_prefix;
- }
-};
-
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
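The DeviceScanBlockPrefixOp defined above implements a decoupled look-back: a tile first posts its own partial aggregate, then walks backwards over predecessor descriptors, folding in partial aggregates until it reaches a tile whose inclusive prefix is already posted. Stripped of the warp-parallel window reduction, the control flow reduces to the following sequential C++ sketch, assuming addition as the scan operator (hypothetical names, for illustration only):

    #include <vector>

    enum Status { INVALID, PARTIAL, PREFIX };
    struct Descriptor { Status status; int value; };   // value: tile aggregate or inclusive prefix

    // Exclusive prefix for tile `tile_idx`, assuming every predecessor has
    // already posted at least a PARTIAL aggregate.
    int LookBack(const std::vector<Descriptor> &tiles, int tile_idx)
    {
        int exclusive_prefix = 0;
        for (int pred = tile_idx - 1; pred >= 0; --pred)
        {
            exclusive_prefix += tiles[pred].value;      // fold in the predecessor's value
            if (tiles[pred].status == PREFIX)           // its inclusive prefix is known: stop
                break;
        }
        return exclusive_prefix;
    }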
diff --git a/lib/kokkos/TPL/cub/device/block/specializations/block_histo_tiles_gatomic.cuh b/lib/kokkos/TPL/cub/device/block/specializations/block_histo_tiles_gatomic.cuh
deleted file mode 100755
index 5896dbcf6..000000000
--- a/lib/kokkos/TPL/cub/device/block/specializations/block_histo_tiles_gatomic.cuh
+++ /dev/null
@@ -1,184 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * cub::BlockHistogramTilesGlobalAtomic implements a stateful abstraction of CUDA thread blocks for histogramming multiple tiles as part of device-wide histogram.
- */
-
-#pragma once
-
-#include <iterator>
-
-#include "../../../util_type.cuh"
-#include "../../../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-
-/**
- * BlockHistogramTilesGlobalAtomic implements a stateful abstraction of CUDA thread blocks for histogramming multiple tiles as part of device-wide histogram using global atomics
- */
-template <
- typename BlockHistogramTilesPolicy, ///< Tuning policy
- int BINS, ///< Number of histogram bins per channel
- int CHANNELS, ///< Number of channels interleaved in the input data (may be greater than the number of active channels being histogrammed)
- int ACTIVE_CHANNELS, ///< Number of channels actively being histogrammed
- typename InputIteratorRA, ///< The input iterator type (may be a simple pointer type). Must have a value type that can be cast as an integer in the range [0..BINS-1]
- typename HistoCounter, ///< Integral type for counting sample occurrences per histogram bin
- typename SizeT> ///< Integer type for offsets
-struct BlockHistogramTilesGlobalAtomic
-{
- //---------------------------------------------------------------------
- // Types and constants
- //---------------------------------------------------------------------
-
- // Sample type
- typedef typename std::iterator_traits<InputIteratorRA>::value_type SampleT;
-
- // Constants
- enum
- {
- BLOCK_THREADS = BlockHistogramTilesPolicy::BLOCK_THREADS,
- ITEMS_PER_THREAD = BlockHistogramTilesPolicy::ITEMS_PER_THREAD,
- TILE_CHANNEL_ITEMS = BLOCK_THREADS * ITEMS_PER_THREAD,
- TILE_ITEMS = TILE_CHANNEL_ITEMS * CHANNELS,
- };
-
- // Shared memory type required by this thread block
- typedef NullType TempStorage;
-
-
- //---------------------------------------------------------------------
- // Per-thread fields
- //---------------------------------------------------------------------
-
- /// Reference to output histograms
- HistoCounter* (&d_out_histograms)[ACTIVE_CHANNELS];
-
- /// Input data to reduce
- InputIteratorRA d_in;
-
-
- //---------------------------------------------------------------------
- // Interface
- //---------------------------------------------------------------------
-
- /**
- * Constructor
- */
- __device__ __forceinline__ BlockHistogramTilesGlobalAtomic(
- TempStorage &temp_storage, ///< Reference to temp_storage
- InputIteratorRA d_in, ///< Input data to reduce
- HistoCounter* (&d_out_histograms)[ACTIVE_CHANNELS]) ///< Reference to output histograms
- :
- d_in(d_in),
- d_out_histograms(d_out_histograms)
- {}
-
-
- /**
- * Process a single tile of input
- */
- template <bool FULL_TILE>
- __device__ __forceinline__ void ConsumeTile(
- SizeT block_offset, ///< The offset of the tile to consume
- int valid_items = TILE_ITEMS) ///< The number of valid items in the tile
- {
- if (FULL_TILE)
- {
- // Full tile of samples to read and composite
- SampleT items[ITEMS_PER_THREAD][CHANNELS];
-
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- #pragma unroll
- for (int CHANNEL = 0; CHANNEL < CHANNELS; ++CHANNEL)
- {
- if (CHANNEL < ACTIVE_CHANNELS)
- {
- items[ITEM][CHANNEL] = d_in[block_offset + (ITEM * BLOCK_THREADS * CHANNELS) + (threadIdx.x * CHANNELS) + CHANNEL];
- }
- }
- }
-
- __threadfence_block();
-
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- #pragma unroll
- for (int CHANNEL = 0; CHANNEL < CHANNELS; ++CHANNEL)
- {
- if (CHANNEL < ACTIVE_CHANNELS)
- {
- atomicAdd(d_out_histograms[CHANNEL] + items[ITEM][CHANNEL], 1);
- }
- }
- }
- }
- else
- {
- // Only a partially-full tile of samples to read and composite
- int bounds = valid_items - (threadIdx.x * CHANNELS);
-
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ++ITEM)
- {
- #pragma unroll
- for (int CHANNEL = 0; CHANNEL < CHANNELS; ++CHANNEL)
- {
- if (((ACTIVE_CHANNELS == CHANNELS) || (CHANNEL < ACTIVE_CHANNELS)) && ((ITEM * BLOCK_THREADS * CHANNELS) + CHANNEL < bounds))
- {
- SampleT item = d_in[block_offset + (ITEM * BLOCK_THREADS * CHANNELS) + (threadIdx.x * CHANNELS) + CHANNEL];
- atomicAdd(d_out_histograms[CHANNEL] + item, 1);
- }
- }
- }
-
- }
- }
-
-
- /**
- * Aggregate results into output
- */
- __device__ __forceinline__ void AggregateOutput()
- {}
-};
-
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
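The global-atomic specialization above keeps no privatized storage at all: each thread reads its samples and bumps the corresponding global bin with atomicAdd, so AggregateOutput has nothing left to do. A free-standing CUDA sketch of the same strategy for a single channel of 8-bit samples (illustrative kernel, not part of the deleted file):

    __global__ void HistogramGlobalAtomic(const unsigned char *samples,
                                          unsigned int *bins,   // 256 bins, assumed pre-zeroed
                                          int num_samples)
    {
        int idx    = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = gridDim.x * blockDim.x;
        for (int i = idx; i < num_samples; i += stride)
            atomicAdd(&bins[samples[i]], 1u);                   // one global read-modify-write per sample
    }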
diff --git a/lib/kokkos/TPL/cub/device/block/specializations/block_histo_tiles_satomic.cuh b/lib/kokkos/TPL/cub/device/block/specializations/block_histo_tiles_satomic.cuh
deleted file mode 100755
index c55d78953..000000000
--- a/lib/kokkos/TPL/cub/device/block/specializations/block_histo_tiles_satomic.cuh
+++ /dev/null
@@ -1,237 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * cub::BlockHistogramTilesSharedAtomic implements a stateful abstraction of CUDA thread blocks for histogramming multiple tiles as part of device-wide histogram using shared atomics
- */
-
-#pragma once
-
-#include <iterator>
-
-#include "../../../util_type.cuh"
-#include "../../../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/**
- * BlockHistogramTilesSharedAtomic implements a stateful abstraction of CUDA thread blocks for histogramming multiple tiles as part of device-wide histogram using shared atomics
- */
-template <
- typename BlockHistogramTilesPolicy, ///< Tuning policy
- int BINS, ///< Number of histogram bins
- int CHANNELS, ///< Number of channels interleaved in the input data (may be greater than the number of active channels being histogrammed)
- int ACTIVE_CHANNELS, ///< Number of channels actively being histogrammed
- typename InputIteratorRA, ///< The input iterator type (may be a simple pointer type). Must have a value type that can be cast as an integer in the range [0..BINS-1]
- typename HistoCounter, ///< Integral type for counting sample occurrences per histogram bin
- typename SizeT> ///< Integer type for offsets
-struct BlockHistogramTilesSharedAtomic
-{
- //---------------------------------------------------------------------
- // Types and constants
- //---------------------------------------------------------------------
-
- // Sample type
- typedef typename std::iterator_traits<InputIteratorRA>::value_type SampleT;
-
- // Constants
- enum
- {
- BLOCK_THREADS = BlockHistogramTilesPolicy::BLOCK_THREADS,
- ITEMS_PER_THREAD = BlockHistogramTilesPolicy::ITEMS_PER_THREAD,
- TILE_CHANNEL_ITEMS = BLOCK_THREADS * ITEMS_PER_THREAD,
- TILE_ITEMS = TILE_CHANNEL_ITEMS * CHANNELS,
- };
-
- /// Shared memory type required by this thread block
- struct _TempStorage
- {
- HistoCounter histograms[ACTIVE_CHANNELS][BINS + 1]; // One word of padding between channel histograms to prevent warps working on different histograms from hammering on the same bank
- };
-
-
- /// Alias wrapper allowing storage to be unioned
- struct TempStorage : Uninitialized<_TempStorage> {};
-
-
- //---------------------------------------------------------------------
- // Per-thread fields
- //---------------------------------------------------------------------
-
- /// Reference to temp_storage
- _TempStorage &temp_storage;
-
- /// Reference to output histograms
- HistoCounter* (&d_out_histograms)[ACTIVE_CHANNELS];
-
- /// Input data to reduce
- InputIteratorRA d_in;
-
-
- //---------------------------------------------------------------------
- // Interface
- //---------------------------------------------------------------------
-
- /**
- * Constructor
- */
- __device__ __forceinline__ BlockHistogramTilesSharedAtomic(
- TempStorage &temp_storage, ///< Reference to temp_storage
- InputIteratorRA d_in, ///< Input data to reduce
- HistoCounter* (&d_out_histograms)[ACTIVE_CHANNELS]) ///< Reference to output histograms
- :
- temp_storage(temp_storage.Alias()),
- d_in(d_in),
- d_out_histograms(d_out_histograms)
- {
- // Initialize histogram bin counts to zeros
- #pragma unroll
- for (int CHANNEL = 0; CHANNEL < ACTIVE_CHANNELS; ++CHANNEL)
- {
- int histo_offset = 0;
-
- #pragma unroll
- for(; histo_offset + BLOCK_THREADS <= BINS; histo_offset += BLOCK_THREADS)
- {
- this->temp_storage.histograms[CHANNEL][histo_offset + threadIdx.x] = 0;
- }
- // Finish up with guarded initialization if necessary
- if ((BINS % BLOCK_THREADS != 0) && (histo_offset + threadIdx.x < BINS))
- {
- this->temp_storage.histograms[CHANNEL][histo_offset + threadIdx.x] = 0;
- }
- }
- }
-
-
- /**
- * Process a single tile of input
- */
- template <bool FULL_TILE>
- __device__ __forceinline__ void ConsumeTile(
- SizeT block_offset, ///< The offset of the tile to consume
- int valid_items = TILE_ITEMS) ///< The number of valid items in the tile
- {
- if (FULL_TILE)
- {
- // Full tile of samples to read and composite
- SampleT items[ITEMS_PER_THREAD][CHANNELS];
-
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- #pragma unroll
- for (int CHANNEL = 0; CHANNEL < CHANNELS; ++CHANNEL)
- {
- if (CHANNEL < ACTIVE_CHANNELS)
- {
- items[ITEM][CHANNEL] = d_in[block_offset + (ITEM * BLOCK_THREADS * CHANNELS) + (threadIdx.x * CHANNELS) + CHANNEL];
- }
- }
- }
-
- __threadfence_block();
-
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- #pragma unroll
- for (int CHANNEL = 0; CHANNEL < CHANNELS; ++CHANNEL)
- {
- if (CHANNEL < ACTIVE_CHANNELS)
- {
- atomicAdd(temp_storage.histograms[CHANNEL] + items[ITEM][CHANNEL], 1);
- }
- }
- }
-
- __threadfence_block();
- }
- else
- {
- // Only a partially-full tile of samples to read and composite
- int bounds = valid_items - (threadIdx.x * CHANNELS);
-
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ++ITEM)
- {
- #pragma unroll
- for (int CHANNEL = 0; CHANNEL < CHANNELS; ++CHANNEL)
- {
- if (((ACTIVE_CHANNELS == CHANNELS) || (CHANNEL < ACTIVE_CHANNELS)) && ((ITEM * BLOCK_THREADS * CHANNELS) + CHANNEL < bounds))
- {
- SampleT item = d_in[block_offset + (ITEM * BLOCK_THREADS * CHANNELS) + (threadIdx.x * CHANNELS) + CHANNEL];
- atomicAdd(temp_storage.histograms[CHANNEL] + item, 1);
- }
- }
- }
-
- }
- }
-
-
- /**
- * Aggregate results into output
- */
- __device__ __forceinline__ void AggregateOutput()
- {
- // Barrier to ensure shared memory histograms are coherent
- __syncthreads();
-
- // Copy shared memory histograms to output
- #pragma unroll
- for (int CHANNEL = 0; CHANNEL < ACTIVE_CHANNELS; ++CHANNEL)
- {
- int channel_offset = (blockIdx.x * BINS);
- int histo_offset = 0;
-
- #pragma unroll
- for(; histo_offset + BLOCK_THREADS <= BINS; histo_offset += BLOCK_THREADS)
- {
- d_out_histograms[CHANNEL][channel_offset + histo_offset + threadIdx.x] = temp_storage.histograms[CHANNEL][histo_offset + threadIdx.x];
- }
- // Finish up with a guarded copy if necessary
- if ((BINS % BLOCK_THREADS != 0) && (histo_offset + threadIdx.x < BINS))
- {
- d_out_histograms[CHANNEL][channel_offset + histo_offset + threadIdx.x] = temp_storage.histograms[CHANNEL][histo_offset + threadIdx.x];
- }
- }
- }
-};
-
-
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
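The shared-atomic specialization above privatizes the histogram in shared memory: the constructor zeroes the per-block bins, ConsumeTile composites samples with shared-memory atomicAdd, and AggregateOutput copies the block-private counts out for a later aggregation pass. A compact CUDA sketch of the same privatization idea for one 256-bin channel (for brevity it flushes with global atomicAdd instead of writing per-block histograms; names are illustrative):

    __global__ void HistogramSharedAtomic(const unsigned char *samples,
                                          unsigned int *bins,   // 256 bins, assumed pre-zeroed
                                          int num_samples)
    {
        __shared__ unsigned int s_bins[256];
        for (int b = threadIdx.x; b < 256; b += blockDim.x)     // zero the privatized histogram
            s_bins[b] = 0;
        __syncthreads();

        int idx    = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = gridDim.x * blockDim.x;
        for (int i = idx; i < num_samples; i += stride)
            atomicAdd(&s_bins[samples[i]], 1u);                 // cheap shared-memory atomics
        __syncthreads();

        for (int b = threadIdx.x; b < 256; b += blockDim.x)     // flush block-private counts
            atomicAdd(&bins[b], s_bins[b]);
    }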
diff --git a/lib/kokkos/TPL/cub/device/block/specializations/block_histo_tiles_sort.cuh b/lib/kokkos/TPL/cub/device/block/specializations/block_histo_tiles_sort.cuh
deleted file mode 100755
index 0f821309c..000000000
--- a/lib/kokkos/TPL/cub/device/block/specializations/block_histo_tiles_sort.cuh
+++ /dev/null
@@ -1,364 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * cub::BlockHistogramTilesSort implements a stateful abstraction of CUDA thread blocks for histogramming multiple tiles as part of device-wide histogram using local sorting
- */
-
-#pragma once
-
-#include <iterator>
-
-#include "../../../block/block_radix_sort.cuh"
-#include "../../../block/block_discontinuity.cuh"
-#include "../../../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/**
- * BlockHistogramTilesSort implements a stateful abstraction of CUDA thread blocks for histogramming multiple tiles as part of device-wide histogram using local sorting
- */
-template <
- typename BlockHistogramTilesPolicy, ///< Tuning policy
- int BINS, ///< Number of histogram bins per channel
- int CHANNELS, ///< Number of channels interleaved in the input data (may be greater than the number of active channels being histogrammed)
- int ACTIVE_CHANNELS, ///< Number of channels actively being histogrammed
- typename InputIteratorRA, ///< The input iterator type (may be a simple pointer type). Must have a value type that can be cast as an integer in the range [0..BINS-1]
- typename HistoCounter, ///< Integral type for counting sample occurrences per histogram bin
- typename SizeT> ///< Integer type for offsets
-struct BlockHistogramTilesSort
-{
- //---------------------------------------------------------------------
- // Types and constants
- //---------------------------------------------------------------------
-
- // Sample type
- typedef typename std::iterator_traits<InputIteratorRA>::value_type SampleT;
-
- // Constants
- enum
- {
- BLOCK_THREADS = BlockHistogramTilesPolicy::BLOCK_THREADS,
- ITEMS_PER_THREAD = BlockHistogramTilesPolicy::ITEMS_PER_THREAD,
- TILE_CHANNEL_ITEMS = BLOCK_THREADS * ITEMS_PER_THREAD,
- TILE_ITEMS = TILE_CHANNEL_ITEMS * CHANNELS,
-
- STRIPED_COUNTERS_PER_THREAD = (BINS + BLOCK_THREADS - 1) / BLOCK_THREADS,
- };
-
- // Parameterize BlockRadixSort type for our thread block
- typedef BlockRadixSort<SampleT, BLOCK_THREADS, ITEMS_PER_THREAD> BlockRadixSortT;
-
- // Parameterize BlockDiscontinuity type for our thread block
- typedef BlockDiscontinuity<SampleT, BLOCK_THREADS> BlockDiscontinuityT;
-
- /// Shared memory type required by this thread block
- union _TempStorage
- {
- // Storage for sorting bin values
- typename BlockRadixSortT::TempStorage sort;
-
- struct
- {
- // Storage for detecting discontinuities in the tile of sorted bin values
- typename BlockDiscontinuityT::TempStorage flag;
-
- // Storage for noting begin/end offsets of bin runs in the tile of sorted bin values
- int run_begin[BLOCK_THREADS * STRIPED_COUNTERS_PER_THREAD];
- int run_end[BLOCK_THREADS * STRIPED_COUNTERS_PER_THREAD];
- };
- };
-
-
- /// Alias wrapper allowing storage to be unioned
- struct TempStorage : Uninitialized<_TempStorage> {};
-
-
- // Discontinuity functor
- struct DiscontinuityOp
- {
- // Reference to temp_storage
- _TempStorage &temp_storage;
-
- // Constructor
- __device__ __forceinline__ DiscontinuityOp(_TempStorage &temp_storage) :
- temp_storage(temp_storage)
- {}
-
- // Discontinuity predicate
- __device__ __forceinline__ bool operator()(const SampleT &a, const SampleT &b, int b_index)
- {
- if (a != b)
- {
- // Note the begin/end offsets in shared storage
- temp_storage.run_begin[b] = b_index;
- temp_storage.run_end[a] = b_index;
-
- return true;
- }
- else
- {
- return false;
- }
- }
- };
-
-
- //---------------------------------------------------------------------
- // Per-thread fields
- //---------------------------------------------------------------------
-
- /// Reference to temp_storage
- _TempStorage &temp_storage;
-
- /// Histogram counters striped across threads
- HistoCounter thread_counters[ACTIVE_CHANNELS][STRIPED_COUNTERS_PER_THREAD];
-
- /// Reference to output histograms
- HistoCounter* (&d_out_histograms)[ACTIVE_CHANNELS];
-
- /// Input data to reduce
- InputIteratorRA d_in;
-
-
- //---------------------------------------------------------------------
- // Interface
- //---------------------------------------------------------------------
-
- /**
- * Constructor
- */
- __device__ __forceinline__ BlockHistogramTilesSort(
- TempStorage &temp_storage, ///< Reference to temp_storage
- InputIteratorRA d_in, ///< Input data to reduce
- HistoCounter* (&d_out_histograms)[ACTIVE_CHANNELS]) ///< Reference to output histograms
- :
- temp_storage(temp_storage.Alias()),
- d_in(d_in),
- d_out_histograms(d_out_histograms)
- {
- // Initialize histogram counters striped across threads
- #pragma unroll
- for (int CHANNEL = 0; CHANNEL < ACTIVE_CHANNELS; ++CHANNEL)
- {
- #pragma unroll
- for (int COUNTER = 0; COUNTER < STRIPED_COUNTERS_PER_THREAD; ++COUNTER)
- {
- thread_counters[CHANNEL][COUNTER] = 0;
- }
- }
- }
-
-
- /**
- * Composite a tile of input items
- */
- __device__ __forceinline__ void Composite(
- SampleT (&items)[ITEMS_PER_THREAD], ///< Tile of samples
- HistoCounter thread_counters[STRIPED_COUNTERS_PER_THREAD]) ///< Histogram counters striped across threads
- {
- // Sort bytes in blocked arrangement
- BlockRadixSortT(temp_storage.sort).Sort(items);
-
- __syncthreads();
-
- // Initialize the shared memory's run_begin and run_end for each bin
- #pragma unroll
- for (int COUNTER = 0; COUNTER < STRIPED_COUNTERS_PER_THREAD; ++COUNTER)
- {
- temp_storage.run_begin[(COUNTER * BLOCK_THREADS) + threadIdx.x] = TILE_CHANNEL_ITEMS;
- temp_storage.run_end[(COUNTER * BLOCK_THREADS) + threadIdx.x] = TILE_CHANNEL_ITEMS;
- }
-
- __syncthreads();
-
- // Note the begin/end run offsets of bin runs in the sorted tile
- int flags[ITEMS_PER_THREAD]; // unused
- DiscontinuityOp flag_op(temp_storage);
- BlockDiscontinuityT(temp_storage.flag).FlagHeads(flags, items, flag_op);
-
- // Update begin for first item
- if (threadIdx.x == 0) temp_storage.run_begin[items[0]] = 0;
-
- __syncthreads();
-
- // Composite into histogram
- // Accumulate each bin's run length into the striped per-thread counters
- #pragma unroll
- for (int COUNTER = 0; COUNTER < STRIPED_COUNTERS_PER_THREAD; ++COUNTER)
- {
- int bin = (COUNTER * BLOCK_THREADS) + threadIdx.x;
- HistoCounter run_length = temp_storage.run_end[bin] - temp_storage.run_begin[bin];
-
- thread_counters[COUNTER] += run_length;
- }
- }
-
-
- /**
- * Process one channel within a tile.
- */
- template <bool FULL_TILE>
- __device__ __forceinline__ void ConsumeTileChannel(
- int channel,
- SizeT block_offset,
- int valid_items)
- {
- // Load items in striped fashion
- if (FULL_TILE)
- {
- // Full tile of samples to read and composite
- SampleT items[ITEMS_PER_THREAD];
-
- // Unguarded loads
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- items[ITEM] = d_in[channel + block_offset + (ITEM * BLOCK_THREADS * CHANNELS) + (threadIdx.x * CHANNELS)];
- }
-
- // Composite our histogram data
- Composite(items, thread_counters[channel]);
- }
- else
- {
- // Only a partially-full tile of samples to read and composite
- SampleT items[ITEMS_PER_THREAD];
-
- // Assign the zero-bin to out-of-bounds items, and keep track of how many oob items to subtract out of it later
- int bounds = (valid_items - (threadIdx.x * CHANNELS));
-
- #pragma unroll
- for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
- {
- items[ITEM] = ((ITEM * BLOCK_THREADS * CHANNELS) < bounds) ?
- d_in[channel + block_offset + (ITEM * BLOCK_THREADS * CHANNELS) + (threadIdx.x * CHANNELS)] :
- 0;
- }
-
- // Composite our histogram data
- Composite(items, thread_counters[channel]);
-
- __syncthreads();
-
- // Correct the overcounting in the zero-bin from invalid (out-of-bounds) items
- if (threadIdx.x == 0)
- {
- int extra = (TILE_ITEMS - valid_items) / CHANNELS;
- thread_counters[channel][0] -= extra;
- }
- }
- }
-
-
- /**
- * Template iteration over channels (to silence not-unrolled warnings for SM10-13). Inductive step.
- */
- template <bool FULL_TILE, int CHANNEL, int END>
- struct IterateChannels
- {
- /**
- * Process one channel within a tile.
- */
- static __device__ __forceinline__ void ConsumeTileChannel(
- BlockHistogramTilesSort *cta,
- SizeT block_offset,
- int valid_items)
- {
- __syncthreads();
-
- cta->ConsumeTileChannel<FULL_TILE>(CHANNEL, block_offset, valid_items);
-
- IterateChannels<FULL_TILE, CHANNEL + 1, END>::ConsumeTileChannel(cta, block_offset, valid_items);
- }
- };
-
-
- /**
- * Template iteration over channels (to silence not-unrolled warnings for SM10-13). Base step.
- */
- template <bool FULL_TILE, int END>
- struct IterateChannels<FULL_TILE, END, END>
- {
- static __device__ __forceinline__ void ConsumeTileChannel(BlockHistogramTilesSort *cta, SizeT block_offset, int valid_items) {}
- };
-
-
- /**
- * Process a single tile of input
- */
- template <bool FULL_TILE>
- __device__ __forceinline__ void ConsumeTile(
- SizeT block_offset, ///< The offset of the tile to consume
- int valid_items = TILE_ITEMS) ///< The number of valid items in the tile
- {
- // First channel
- ConsumeTileChannel<FULL_TILE>(0, block_offset, valid_items);
-
- // Iterate through remaining channels
- IterateChannels<FULL_TILE, 1, ACTIVE_CHANNELS>::ConsumeTileChannel(this, block_offset, valid_items);
- }
-
-
- /**
- * Aggregate results into output
- */
- __device__ __forceinline__ void AggregateOutput()
- {
- // Copy counters striped across threads into the histogram output
- #pragma unroll
- for (int CHANNEL = 0; CHANNEL < ACTIVE_CHANNELS; ++CHANNEL)
- {
- int channel_offset = (blockIdx.x * BINS);
-
- #pragma unroll
- for (int COUNTER = 0; COUNTER < STRIPED_COUNTERS_PER_THREAD; ++COUNTER)
- {
- int bin = (COUNTER * BLOCK_THREADS) + threadIdx.x;
-
- if ((STRIPED_COUNTERS_PER_THREAD * BLOCK_THREADS == BINS) || (bin < BINS))
- {
- d_out_histograms[CHANNEL][channel_offset + bin] = thread_counters[CHANNEL][COUNTER];
- }
- }
- }
- }
-};
-
-
-
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
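The sort-based specialization above avoids atomics entirely: each tile is radix-sorted so that equal sample values form contiguous runs, discontinuity flags record where each run begins and ends, and the run lengths are added to per-thread counters. The per-tile idea, reduced to a sequential host-side C++ sketch (illustrative only; the deleted code keeps counters striped across threads and handles multiple channels):

    #include <algorithm>
    #include <vector>

    // Composite one tile by sorting: after the sort, every bin occupies one
    // contiguous run, so its count within the tile is (run_end - run_begin).
    void CompositeTileBySorting(std::vector<unsigned char> tile,   // taken by value: sorted locally
                                unsigned int counts[256])
    {
        std::sort(tile.begin(), tile.end());
        size_t run_begin = 0;
        for (size_t i = 1; i <= tile.size(); ++i)
        {
            if (i == tile.size() || tile[i] != tile[run_begin])    // discontinuity: the run ends here
            {
                counts[tile[run_begin]] += static_cast<unsigned int>(i - run_begin);
                run_begin = i;
            }
        }
    }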
diff --git a/lib/kokkos/TPL/cub/device/device_histogram.cuh b/lib/kokkos/TPL/cub/device/device_histogram.cuh
deleted file mode 100755
index 6f5a74d1f..000000000
--- a/lib/kokkos/TPL/cub/device/device_histogram.cuh
+++ /dev/null
@@ -1,1062 +0,0 @@
-
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * cub::DeviceHistogram provides device-wide parallel operations for constructing histogram(s) from sample data residing within global memory.
- */
-
-#pragma once
-
-#include <stdio.h>
-#include <iterator>
-
-#include "block/block_histo_tiles.cuh"
-#include "../grid/grid_even_share.cuh"
-#include "../grid/grid_queue.cuh"
-#include "../util_debug.cuh"
-#include "../util_device.cuh"
-#include "../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/******************************************************************************
- * Kernel entry points
- *****************************************************************************/
-
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
-
-/**
- * Initialization pass kernel entry point (multi-block). Prepares queue descriptors and zeroes global counters.
- */
-template <
- int BINS, ///< Number of histogram bins per channel
- int ACTIVE_CHANNELS, ///< Number of channels actively being histogrammed
- typename SizeT, ///< Integer type used for global array indexing
- typename HistoCounter> ///< Integral type for counting sample occurrences per histogram bin
-__launch_bounds__ (BINS, 1)
-__global__ void InitHistoKernel(
- GridQueue<SizeT> grid_queue, ///< [in] Descriptor for performing dynamic mapping of tile data to thread blocks
- ArrayWrapper<HistoCounter*, ACTIVE_CHANNELS> d_out_histograms, ///< [out] Histogram counter data having logical dimensions <tt>HistoCounter[ACTIVE_CHANNELS][BINS]</tt>
- SizeT num_samples) ///< [in] Total number of samples \p d_samples for all channels
-{
- d_out_histograms.array[blockIdx.x][threadIdx.x] = 0;
- if (threadIdx.x == 0) grid_queue.ResetDrain(num_samples);
-}
-
-
-/**
- * Histogram pass kernel entry point (multi-block). Computes privatized histograms, one per thread block.
- */
-template <
- typename BlockHistogramTilesPolicy, ///< Tuning policy for cub::BlockHistogramTiles abstraction
- int BINS, ///< Number of histogram bins per channel
- int CHANNELS, ///< Number of channels interleaved in the input data (may be greater than the number of channels being actively histogrammed)
- int ACTIVE_CHANNELS, ///< Number of channels actively being histogrammed
- typename InputIteratorRA, ///< The input iterator type (may be a simple pointer type). Must have a value type that is assignable to <tt>unsigned char</tt>
- typename HistoCounter, ///< Integral type for counting sample occurrences per histogram bin
- typename SizeT> ///< Integer type used for global array indexing
-__launch_bounds__ (int(BlockHistogramTilesPolicy::BLOCK_THREADS), BlockHistogramTilesPolicy::SM_OCCUPANCY)
-__global__ void MultiBlockHistogramKernel(
- InputIteratorRA d_samples, ///< [in] Array of sample data. The samples from different channels are assumed to be interleaved (e.g., an array of 32b pixels where each pixel consists of four RGBA 8b samples).
- ArrayWrapper<HistoCounter*, ACTIVE_CHANNELS> d_out_histograms, ///< [out] Histogram counter data having logical dimensions <tt>HistoCounter[ACTIVE_CHANNELS][gridDim.x][BINS]</tt>
- SizeT num_samples, ///< [in] Total number of samples \p d_samples for all channels
- GridEvenShare<SizeT> even_share, ///< [in] Descriptor for how to map an even-share of tiles across thread blocks
- GridQueue<SizeT> queue) ///< [in] Descriptor for performing dynamic mapping of tile data to thread blocks
-{
- // Constants
- enum
- {
- BLOCK_THREADS = BlockHistogramTilesPolicy::BLOCK_THREADS,
- ITEMS_PER_THREAD = BlockHistogramTilesPolicy::ITEMS_PER_THREAD,
- TILE_SIZE = BLOCK_THREADS * ITEMS_PER_THREAD,
- };
-
- // Thread block type for compositing input tiles
- typedef BlockHistogramTiles<BlockHistogramTilesPolicy, BINS, CHANNELS, ACTIVE_CHANNELS, InputIteratorRA, HistoCounter, SizeT> BlockHistogramTilesT;
-
- // Shared memory for BlockHistogramTiles
- __shared__ typename BlockHistogramTilesT::TempStorage temp_storage;
-
- // Consume input tiles
- BlockHistogramTilesT(temp_storage, d_samples, d_out_histograms.array).ConsumeTiles(
- num_samples,
- even_share,
- queue,
- Int2Type<BlockHistogramTilesPolicy::GRID_MAPPING>());
-}
-
-
-/**
- * Block-aggregation pass kernel entry point (single-block). Aggregates privatized threadblock histograms from a previous multi-block histogram pass.
- */
-template <
- int BINS, ///< Number of histogram bins per channel
- int ACTIVE_CHANNELS, ///< Number of channels actively being histogrammed
- typename HistoCounter> ///< Integral type for counting sample occurrences per histogram bin
-__launch_bounds__ (BINS, 1)
-__global__ void AggregateHistoKernel(
- HistoCounter* d_block_histograms, ///< [in] Histogram counter data having logical dimensions <tt>HistoCounter[ACTIVE_CHANNELS][num_threadblocks][BINS]</tt>
- ArrayWrapper<HistoCounter*, ACTIVE_CHANNELS> d_out_histograms, ///< [out] Histogram counter data having logical dimensions <tt>HistoCounter[ACTIVE_CHANNELS][BINS]</tt>
- int num_threadblocks) ///< [in] Number of threadblock histograms per channel in \p d_block_histograms
-{
- // Accumulate threadblock-histograms from the channel
- HistoCounter bin_aggregate = 0;
-
- int block_offset = blockIdx.x * (num_threadblocks * BINS);
- int block_oob = block_offset + (num_threadblocks * BINS);
-
-#if CUB_PTX_ARCH >= 200
- #pragma unroll 32
-#endif
- while (block_offset < block_oob)
- {
- bin_aggregate += d_block_histograms[block_offset + threadIdx.x];
- block_offset += BINS;
- }
-
- // Output
- d_out_histograms.array[blockIdx.x][threadIdx.x] = bin_aggregate;
-}
-
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
-
-
-/******************************************************************************
- * DeviceHistogram
- *****************************************************************************/
-
-/**
- * \brief DeviceHistogram provides device-wide parallel operations for constructing histogram(s) from sample data residing within global memory. ![](histogram_logo.png)
- * \ingroup DeviceModule
- *
- * \par Overview
- * A <a href="http://en.wikipedia.org/wiki/Histogram"><em>histogram</em></a>
- * counts the number of observations that fall into each of the disjoint categories (known as <em>bins</em>).
- *
- * \par Usage Considerations
- * \cdp_class{DeviceHistogram}
- *
- * \par Performance
- *
- * \image html histo_perf.png
- *
- */
-struct DeviceHistogram
-{
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
- /******************************************************************************
- * Constants and typedefs
- ******************************************************************************/
-
- /// Generic structure for encapsulating dispatch properties. Mirrors the constants within BlockHistogramTilesPolicy.
- struct KernelDispachParams
- {
- // Policy fields
- int block_threads;
- int items_per_thread;
- BlockHistogramTilesAlgorithm block_algorithm;
- GridMappingStrategy grid_mapping;
- int subscription_factor;
-
- // Derived fields
- int channel_tile_size;
-
- template <typename BlockHistogramTilesPolicy>
- __host__ __device__ __forceinline__
- void Init(int subscription_factor = 1)
- {
- block_threads = BlockHistogramTilesPolicy::BLOCK_THREADS;
- items_per_thread = BlockHistogramTilesPolicy::ITEMS_PER_THREAD;
- block_algorithm = BlockHistogramTilesPolicy::GRID_ALGORITHM;
- grid_mapping = BlockHistogramTilesPolicy::GRID_MAPPING;
- this->subscription_factor = subscription_factor;
-
- channel_tile_size = block_threads * items_per_thread;
- }
-
- __host__ __device__ __forceinline__
- void Print()
- {
- printf("%d, %d, %d, %d, %d",
- block_threads,
- items_per_thread,
- block_algorithm,
- grid_mapping,
- subscription_factor);
- }
-
- };
-
-
- /******************************************************************************
- * Tuning policies
- ******************************************************************************/
-
- /// Specializations of tuned policy types for different PTX architectures
- template <
- int CHANNELS,
- int ACTIVE_CHANNELS,
- BlockHistogramTilesAlgorithm GRID_ALGORITHM,
- int ARCH>
- struct TunedPolicies;
-
- /// SM35 tune
- template <int CHANNELS, int ACTIVE_CHANNELS, BlockHistogramTilesAlgorithm GRID_ALGORITHM>
- struct TunedPolicies<CHANNELS, ACTIVE_CHANNELS, GRID_ALGORITHM, 350>
- {
- typedef BlockHistogramTilesPolicy<
- (GRID_ALGORITHM == GRID_HISTO_SORT) ? 128 : 256,
- (GRID_ALGORITHM == GRID_HISTO_SORT) ? 12 : (30 / ACTIVE_CHANNELS),
- GRID_ALGORITHM,
- (GRID_ALGORITHM == GRID_HISTO_SORT) ? GRID_MAPPING_DYNAMIC : GRID_MAPPING_EVEN_SHARE,
- (GRID_ALGORITHM == GRID_HISTO_SORT) ? 8 : 1> MultiBlockPolicy;
- enum { SUBSCRIPTION_FACTOR = 7 };
- };
-
- /// SM30 tune
- template <int CHANNELS, int ACTIVE_CHANNELS, BlockHistogramTilesAlgorithm GRID_ALGORITHM>
- struct TunedPolicies<CHANNELS, ACTIVE_CHANNELS, GRID_ALGORITHM, 300>
- {
- typedef BlockHistogramTilesPolicy<
- 128,
- (GRID_ALGORITHM == GRID_HISTO_SORT) ? 20 : (22 / ACTIVE_CHANNELS),
- GRID_ALGORITHM,
- (GRID_ALGORITHM == GRID_HISTO_SORT) ? GRID_MAPPING_DYNAMIC : GRID_MAPPING_EVEN_SHARE,
- 1> MultiBlockPolicy;
- enum { SUBSCRIPTION_FACTOR = 1 };
- };
-
- /// SM20 tune
- template <int CHANNELS, int ACTIVE_CHANNELS, BlockHistogramTilesAlgorithm GRID_ALGORITHM>
- struct TunedPolicies<CHANNELS, ACTIVE_CHANNELS, GRID_ALGORITHM, 200>
- {
- typedef BlockHistogramTilesPolicy<
- 128,
- (GRID_ALGORITHM == GRID_HISTO_SORT) ? 21 : (23 / ACTIVE_CHANNELS),
- GRID_ALGORITHM,
- GRID_MAPPING_DYNAMIC,
- 1> MultiBlockPolicy;
- enum { SUBSCRIPTION_FACTOR = 1 };
- };
-
- /// SM10 tune
- template <int CHANNELS, int ACTIVE_CHANNELS, BlockHistogramTilesAlgorithm GRID_ALGORITHM>
- struct TunedPolicies<CHANNELS, ACTIVE_CHANNELS, GRID_ALGORITHM, 100>
- {
- typedef BlockHistogramTilesPolicy<
- 128,
- 7,
- GRID_HISTO_SORT, // (use sort regardless because atomics are perf-useless)
- GRID_MAPPING_EVEN_SHARE,
- 1> MultiBlockPolicy;
- enum { SUBSCRIPTION_FACTOR = 1 };
- };
-
-
- /// Tuning policy for the PTX architecture that DeviceHistogram operations will get dispatched to
- template <
- int CHANNELS,
- int ACTIVE_CHANNELS,
- BlockHistogramTilesAlgorithm GRID_ALGORITHM>
- struct PtxDefaultPolicies
- {
- static const int PTX_TUNE_ARCH = (CUB_PTX_ARCH >= 350) ?
- 350 :
- (CUB_PTX_ARCH >= 300) ?
- 300 :
- (CUB_PTX_ARCH >= 200) ?
- 200 :
- 100;
-
- // Tuned policy set for the current PTX compiler pass
- typedef TunedPolicies<CHANNELS, ACTIVE_CHANNELS, GRID_ALGORITHM, PTX_TUNE_ARCH> PtxTunedPolicies;
-
- // Subscription factor for the current PTX compiler pass
- static const int SUBSCRIPTION_FACTOR = PtxTunedPolicies::SUBSCRIPTION_FACTOR;
-
- // MultiBlockPolicy that opaquely derives from the specialization corresponding to the current PTX compiler pass
- struct MultiBlockPolicy : PtxTunedPolicies::MultiBlockPolicy {};
-
- /**
- * Initialize dispatch params with the policies corresponding to the PTX assembly we will use
- */
- static void InitDispatchParams(int ptx_version, KernelDispachParams &multi_block_dispatch_params)
- {
- if (ptx_version >= 350)
- {
- typedef TunedPolicies<CHANNELS, ACTIVE_CHANNELS, GRID_ALGORITHM, 350> TunedPolicies;
- multi_block_dispatch_params.Init<typename TunedPolicies::MultiBlockPolicy>(TunedPolicies::SUBSCRIPTION_FACTOR);
- }
- else if (ptx_version >= 300)
- {
- typedef TunedPolicies<CHANNELS, ACTIVE_CHANNELS, GRID_ALGORITHM, 300> TunedPolicies;
- multi_block_dispatch_params.Init<typename TunedPolicies::MultiBlockPolicy>(TunedPolicies::SUBSCRIPTION_FACTOR);
- }
- else if (ptx_version >= 200)
- {
- typedef TunedPolicies<CHANNELS, ACTIVE_CHANNELS, GRID_ALGORITHM, 200> TunedPolicies;
- multi_block_dispatch_params.Init<typename TunedPolicies::MultiBlockPolicy>(TunedPolicies::SUBSCRIPTION_FACTOR);
- }
- else
- {
- typedef TunedPolicies<CHANNELS, ACTIVE_CHANNELS, GRID_ALGORITHM, 100> TunedPolicies;
- multi_block_dispatch_params.Init<typename TunedPolicies::MultiBlockPolicy>(TunedPolicies::SUBSCRIPTION_FACTOR);
- }
- }
- };
-
-
- /******************************************************************************
- * Utility methods
- ******************************************************************************/
-
- /**
- * Internal dispatch routine for invoking a device-wide, multi-channel histogram
- */
- template <
- int BINS, ///< Number of histogram bins per channel
- int CHANNELS, ///< Number of channels interleaved in the input data (may be greater than the number of channels being actively histogrammed)
- int ACTIVE_CHANNELS, ///< Number of channels actively being histogrammed
- typename InitHistoKernelPtr, ///< Function type of cub::InitHistoKernel
- typename MultiBlockHistogramKernelPtr, ///< Function type of cub::MultiBlockHistogramKernel
- typename AggregateHistoKernelPtr, ///< Function type of cub::AggregateHistoKernel
- typename InputIteratorRA, ///< The input iterator type (may be a simple pointer type). Must have a value type that is assignable to <tt>unsigned char</tt>
- typename HistoCounter, ///< Integral type for counting sample occurrences per histogram bin
- typename SizeT> ///< Integer type used for global array indexing
- __host__ __device__ __forceinline__
- static cudaError_t Dispatch(
- void *d_temp_storage, ///< [in] %Device allocation of temporary storage. When NULL, the required allocation size is returned in \p temp_storage_bytes and no work is done.
- size_t &temp_storage_bytes, ///< [in,out] Size in bytes of \p d_temp_storage allocation.
- InitHistoKernelPtr init_kernel, ///< [in] Kernel function pointer to parameterization of cub::InitHistoKernel
- MultiBlockHistogramKernelPtr multi_block_kernel, ///< [in] Kernel function pointer to parameterization of cub::MultiBlockHistogramKernel
- AggregateHistoKernelPtr aggregate_kernel, ///< [in] Kernel function pointer to parameterization of cub::AggregateHistoKernel
- KernelDispachParams &multi_block_dispatch_params, ///< [in] Dispatch parameters that match the policy that \p multi_block_kernel was compiled for
- InputIteratorRA d_samples, ///< [in] Input samples to histogram
- HistoCounter *d_histograms[ACTIVE_CHANNELS], ///< [out] Array of channel histograms, each having BINS counters of integral type \p HistoCounter.
- SizeT num_samples, ///< [in] Number of samples to process
- cudaStream_t stream = 0, ///< [in] <b>[optional]</b> CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
- bool stream_synchronous = false) ///< [in] <b>[optional]</b> Whether or not to synchronize the stream after every kernel launch to check for errors. Default is \p false.
- {
-#ifndef CUB_RUNTIME_ENABLED
-
- // Kernel launch not supported from this device
- return CubDebug(cudaErrorNotSupported);
-
-#else
-
- cudaError error = cudaSuccess;
- do
- {
- // Get device ordinal
- int device_ordinal;
- if (CubDebug(error = cudaGetDevice(&device_ordinal))) break;
-
- // Get SM count
- int sm_count;
- if (CubDebug(error = cudaDeviceGetAttribute (&sm_count, cudaDevAttrMultiProcessorCount, device_ordinal))) break;
-
- // Get a rough estimate of multi_block_kernel SM occupancy based upon the maximum SM occupancy of the targeted PTX architecture
- int multi_block_sm_occupancy = CUB_MIN(
- ArchProps<CUB_PTX_ARCH>::MAX_SM_THREADBLOCKS,
- ArchProps<CUB_PTX_ARCH>::MAX_SM_THREADS / multi_block_dispatch_params.block_threads);
-
-#ifndef __CUDA_ARCH__
- // We're on the host, so come up with a more accurate estimate of multi_block_kernel SM occupancy from actual device properties
- Device device_props;
- if (CubDebug(error = device_props.Init(device_ordinal))) break;
-
- if (CubDebug(error = device_props.MaxSmOccupancy(
- multi_block_sm_occupancy,
- multi_block_kernel,
- multi_block_dispatch_params.block_threads))) break;
-#endif
-
- // Get device occupancy for multi_block_kernel
- int multi_block_occupancy = multi_block_sm_occupancy * sm_count;
-
- // Even-share work distribution
- GridEvenShare<SizeT> even_share;
-
- // Get tile size for multi_block_kernel
- int multi_block_tile_size = multi_block_dispatch_params.channel_tile_size * CHANNELS;
-
- // Get grid size for multi_block_kernel
- int multi_block_grid_size;
- switch (multi_block_dispatch_params.grid_mapping)
- {
- case GRID_MAPPING_EVEN_SHARE:
-
- // Work is distributed evenly
- even_share.GridInit(
- num_samples,
- multi_block_occupancy * multi_block_dispatch_params.subscription_factor,
- multi_block_tile_size);
- multi_block_grid_size = even_share.grid_size;
- break;
-
- case GRID_MAPPING_DYNAMIC:
-
- // Work is distributed dynamically
- int num_tiles = (num_samples + multi_block_tile_size - 1) / multi_block_tile_size;
- multi_block_grid_size = (num_tiles < multi_block_occupancy) ?
- num_tiles : // Not enough to fill the device with threadblocks
- multi_block_occupancy; // Fill the device with threadblocks
- break;
- };
-
- // Temporary storage allocation requirements
- void* allocations[2];
- size_t allocation_sizes[2] =
- {
- ACTIVE_CHANNELS * multi_block_grid_size * sizeof(HistoCounter) * BINS, // bytes needed for privatized histograms
- GridQueue<int>::AllocationSize() // bytes needed for grid queue descriptor
- };
-
- if (CubDebug(error = AliasTemporaries(d_temp_storage, temp_storage_bytes, allocations, allocation_sizes))) break;
-
- // Return if the caller is simply requesting the size of the storage allocation
- if (d_temp_storage == NULL)
- return cudaSuccess;
-
- // Privatized per-block reductions
- HistoCounter *d_block_histograms = (HistoCounter*) allocations[0];
-
- // Grid queue descriptor
- GridQueue<SizeT> queue(allocations[1]);
-
- // Setup array wrapper for histogram channel output (because we can't pass static arrays as kernel parameters)
- ArrayWrapper<HistoCounter*, ACTIVE_CHANNELS> d_histo_wrapper;
- for (int CHANNEL = 0; CHANNEL < ACTIVE_CHANNELS; ++CHANNEL)
- d_histo_wrapper.array[CHANNEL] = d_histograms[CHANNEL];
-
- // Setup array wrapper for temporary histogram channel output (because we can't pass static arrays as kernel parameters)
- ArrayWrapper<HistoCounter*, ACTIVE_CHANNELS> d_temp_histo_wrapper;
- for (int CHANNEL = 0; CHANNEL < ACTIVE_CHANNELS; ++CHANNEL)
- d_temp_histo_wrapper.array[CHANNEL] = d_block_histograms + (CHANNEL * multi_block_grid_size * BINS);
-
- // Log init_kernel configuration
- if (stream_synchronous) CubLog("Invoking init_kernel<<<%d, %d, 0, %lld>>>()\n", ACTIVE_CHANNELS, BINS, (long long) stream);
-
- // Invoke init_kernel to initialize counters and queue descriptor
- init_kernel<<<ACTIVE_CHANNELS, BINS, 0, stream>>>(queue, d_histo_wrapper, num_samples);
-
- // Sync the stream if specified
- if (stream_synchronous && (CubDebug(error = SyncStream(stream)))) break;
-
- // Whether we need privatized histograms (i.e., non-global atomics and multi-block)
- bool privatized_temporaries = (multi_block_grid_size > 1) && (multi_block_dispatch_params.block_algorithm != GRID_HISTO_GLOBAL_ATOMIC);
-
- // Log multi_block_kernel configuration
- if (stream_synchronous) CubLog("Invoking multi_block_kernel<<<%d, %d, 0, %lld>>>(), %d items per thread, %d SM occupancy\n",
- multi_block_grid_size, multi_block_dispatch_params.block_threads, (long long) stream, multi_block_dispatch_params.items_per_thread, multi_block_sm_occupancy);
-
- // Invoke multi_block_kernel
- multi_block_kernel<<<multi_block_grid_size, multi_block_dispatch_params.block_threads, 0, stream>>>(
- d_samples,
- (privatized_temporaries) ?
- d_temp_histo_wrapper :
- d_histo_wrapper,
- num_samples,
- even_share,
- queue);
-
- // Sync the stream if specified
- if (stream_synchronous && (CubDebug(error = SyncStream(stream)))) break;
-
- // Aggregate privatized block histograms if necessary
- if (privatized_temporaries)
- {
- // Log aggregate_kernel configuration
- if (stream_synchronous) CubLog("Invoking aggregate_kernel<<<%d, %d, 0, %lld>>>()\n",
- ACTIVE_CHANNELS, BINS, (long long) stream);
-
- // Invoke aggregate_kernel
- aggregate_kernel<<<ACTIVE_CHANNELS, BINS, 0, stream>>>(
- d_block_histograms,
- d_histo_wrapper,
- multi_block_grid_size);
-
- // Sync the stream if specified
- if (stream_synchronous && (CubDebug(error = SyncStream(stream)))) break;
- }
- }
- while (0);
-
- return error;
-#endif // CUB_RUNTIME_ENABLED
- }
-
-
- /**
- * \brief Computes a device-wide histogram
- *
- * \tparam GRID_ALGORITHM cub::BlockHistogramTilesAlgorithm enumerator specifying the underlying algorithm to use
- * \tparam CHANNELS Number of channels interleaved in the input data (may be greater than the number of channels being actively histogrammed)
- * \tparam ACTIVE_CHANNELS <b>[inferred]</b> Number of channels actively being histogrammed
- * \tparam InputIteratorRA <b>[inferred]</b> Random-access iterator type for input (may be a simple pointer type). Must have a value type that is assignable to <tt>unsigned char</tt>
- * \tparam HistoCounter <b>[inferred]</b> Integral type for counting sample occurrences per histogram bin
- */
- template <
- BlockHistogramTilesAlgorithm GRID_ALGORITHM,
- int BINS, ///< Number of histogram bins per channel
- int CHANNELS, ///< Number of channels interleaved in the input data (may be greater than the number of channels being actively histogrammed)
- int ACTIVE_CHANNELS, ///< Number of channels actively being histogrammed
- typename InputIteratorRA, ///< The input iterator type (may be a simple pointer type). Must have a value type that is assignable to <tt>unsigned char</tt>
- typename HistoCounter> ///< Integral type for counting sample occurrences per histogram bin
- __host__ __device__ __forceinline__
- static cudaError_t Dispatch(
- void *d_temp_storage, ///< [in] %Device allocation of temporary storage. When NULL, the required allocation size is returned in \p temp_storage_bytes and no work is done.
- size_t &temp_storage_bytes, ///< [in,out] Size in bytes of \p d_temp_storage allocation.
- InputIteratorRA d_samples, ///< [in] Input samples to histogram
- HistoCounter *d_histograms[ACTIVE_CHANNELS], ///< [out] Array of channel histograms, each having BINS counters of integral type \p HistoCounter.
- int num_samples, ///< [in] Number of samples to process
- cudaStream_t stream = 0, ///< [in] <b>[optional]</b> CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
- bool stream_synchronous = false) ///< [in] <b>[optional]</b> Whether or not to synchronize the stream after every kernel launch to check for errors. Default is \p false.
- {
- // Type used for array indexing
- typedef int SizeT;
-
- // Tuning policies for the PTX architecture that will get dispatched to
- typedef PtxDefaultPolicies<CHANNELS, ACTIVE_CHANNELS, GRID_ALGORITHM> PtxDefaultPolicies;
- typedef typename PtxDefaultPolicies::MultiBlockPolicy MultiBlockPolicy;
-
- cudaError error = cudaSuccess;
- do
- {
- // Declare dispatch parameters
- KernelDispachParams multi_block_dispatch_params;
-
- #ifdef __CUDA_ARCH__
-
- // We're on the device, so initialize the dispatch parameters with the PtxDefaultPolicies directly
- multi_block_dispatch_params.Init<MultiBlockPolicy>(PtxDefaultPolicies::SUBSCRIPTION_FACTOR);
-
- #else
-
- // We're on the host, so lookup and initialize the dispatch parameters with the policies that match the device's PTX version
- int ptx_version;
- if (CubDebug(error = PtxVersion(ptx_version))) break;
- PtxDefaultPolicies::InitDispatchParams(ptx_version, multi_block_dispatch_params);
-
- #endif
-
- Dispatch<BINS, CHANNELS, ACTIVE_CHANNELS>(
- d_temp_storage,
- temp_storage_bytes,
- InitHistoKernel<BINS, ACTIVE_CHANNELS, SizeT, HistoCounter>,
- MultiBlockHistogramKernel<MultiBlockPolicy, BINS, CHANNELS, ACTIVE_CHANNELS, InputIteratorRA, HistoCounter, SizeT>,
- AggregateHistoKernel<BINS, ACTIVE_CHANNELS, HistoCounter>,
- multi_block_dispatch_params,
- d_samples,
- d_histograms,
- num_samples,
- stream,
- stream_synchronous);
-
- if (CubDebug(error)) break;
- }
- while (0);
-
- return error;
- }
-
- #endif // DOXYGEN_SHOULD_SKIP_THIS
-
-
- /******************************************************************//**
- * \name Single-channel samples
- *********************************************************************/
- //@{
-
-
- /**
- * \brief Computes a device-wide histogram. Uses fast block-sorting to compute the histogram. Delivers consistent throughput regardless of sample diversity, but occupancy may be limited by histogram bin count.
- *
- * However, because histograms are privatized in shared memory, a large
- * number of bins (e.g., thousands) may adversely affect occupancy and
- * performance (or even the ability to launch).
- *
- * \devicestorage
- *
- * \cdp
- *
- * \iterator
- *
- * \par
- * The code snippet below illustrates the computation of a 256-bin histogram of
- * single-channel <tt>unsigned char</tt> samples.
- * \par
- * \code
- * #include <cub/cub.cuh>
- * ...
- *
- * // Declare and initialize device pointers for input samples and 256-bin output histogram
- * unsigned char *d_samples;
- * unsigned int *d_histogram;
- * int num_items = ...
- * ...
- *
- * // Wrap d_samples device pointer in a random-access texture iterator
- * cub::TexIteratorRA<unsigned char> d_samples_tex_itr;
- * d_samples_tex_itr.BindTexture(d_samples, num_items * sizeof(unsigned char));
- *
- * // Determine temporary device storage requirements for histogram computation
- * void *d_temp_storage = NULL;
- * size_t temp_storage_bytes = 0;
- * cub::DeviceHistogram::SingleChannelSorting<256>(d_temp_storage, temp_storage_bytes, d_samples_tex_itr, d_histogram, num_items);
- *
- * // Allocate temporary storage for histogram computation
- * cudaMalloc(&d_temp_storage, temp_storage_bytes);
- *
- * // Compute histogram
- * cub::DeviceHistogram::SingleChannelSorting<256>(d_temp_storage, temp_storage_bytes, d_samples_tex_itr, d_histogram, num_items);
- *
- * // Unbind texture iterator
- * d_samples_tex_itr.UnbindTexture();
- *
- * \endcode
- *
- * \tparam BINS Number of histogram bins per channel
- * \tparam InputIteratorRA <b>[inferred]</b> Random-access iterator type for input (may be a simple pointer type). Must have a value type that can be cast as an integer in the range [0..BINS-1]
- * \tparam HistoCounter <b>[inferred]</b> Integral type for counting sample occurrences per histogram bin
- */
- template <
- int BINS,
- typename InputIteratorRA,
- typename HistoCounter>
- __host__ __device__ __forceinline__
- static cudaError_t SingleChannelSorting(
- void *d_temp_storage, ///< [in] %Device allocation of temporary storage. When NULL, the required allocation size is returned in \p temp_storage_bytes and no work is done.
- size_t &temp_storage_bytes, ///< [in,out] Size in bytes of \p d_temp_storage allocation.
- InputIteratorRA d_samples, ///< [in] Input samples
- HistoCounter* d_histogram, ///< [out] Array of BINS counters of integral type \p HistoCounter.
- int num_samples, ///< [in] Number of samples to process
- cudaStream_t stream = 0, ///< [in] <b>[optional]</b> CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
- bool stream_synchronous = false) ///< [in] <b>[optional]</b> Whether or not to synchronize the stream after every kernel launch to check for errors. Default is \p false.
- {
- return Dispatch<GRID_HISTO_SORT, BINS, 1, 1>(
- d_temp_storage, temp_storage_bytes, d_samples, &d_histogram, num_samples, stream, stream_synchronous);
- }
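-
- // Note (illustrative, not part of the documented snippet above): because InputIteratorRA
- // may be a simple pointer type, the texture-iterator wrapping is optional and samples can
- // be passed directly, e.g. (hypothetical buffer names, error checking omitted):
- //
- //     unsigned char *d_samples;        // device array of num_samples samples
- //     unsigned int  *d_histogram;      // device array of 256 counters
- //     int    num_samples = ...
- //     void  *d_temp_storage = NULL;
- //     size_t temp_storage_bytes = 0;
- //     cub::DeviceHistogram::SingleChannelSorting<256>(
- //         d_temp_storage, temp_storage_bytes, d_samples, d_histogram, num_samples);
- //     cudaMalloc(&d_temp_storage, temp_storage_bytes);
- //     cub::DeviceHistogram::SingleChannelSorting<256>(
- //         d_temp_storage, temp_storage_bytes, d_samples, d_histogram, num_samples);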
-
-
- /**
- * \brief Computes a device-wide histogram. Uses shared-memory atomic read-modify-write operations to compute the histogram. Input samples having lower diversity can cause performance to be degraded, and occupancy may be limited by histogram bin count.
- *
- * However, because histograms are privatized in shared memory, a large
- * number of bins (e.g., thousands) may adversely affect occupancy and
- * performance (or even the ability to launch).
- *
- * \devicestorage
- *
- * \cdp
- *
- * \iterator
- *
- * \par
- * The code snippet below illustrates the computation of a 256-bin histogram of
- * single-channel <tt>unsigned char</tt> samples.
- * \par
- * \code
- * #include <cub/cub.cuh>
- * ...
- *
- * // Declare and initialize device pointers for input samples and 256-bin output histogram
- * unsigned char *d_samples;
- * unsigned int *d_histogram;
- * int num_items = ...
- * ...
- *
- * // Wrap d_samples device pointer in a random-access texture iterator
- * cub::TexIteratorRA<unsigned char> d_samples_tex_itr;
- * d_samples_tex_itr.BindTexture(d_samples, num_items * sizeof(unsigned char));
- *
- * // Determine temporary device storage requirements for histogram computation
- * void *d_temp_storage = NULL;
- * size_t temp_storage_bytes = 0;
- * cub::DeviceHistogram::SingleChannelSharedAtomic<256>(d_temp_storage, temp_storage_bytes, d_samples_tex_itr, d_histogram, num_items);
- *
- * // Allocate temporary storage for histogram computation
- * cudaMalloc(&d_temp_storage, temp_storage_bytes);
- *
- * // Compute histogram
- * cub::DeviceHistogram::SingleChannelSharedAtomic<256>(d_temp_storage, temp_storage_bytes, d_samples_tex_itr, d_histogram, num_items);
- *
- * // Unbind texture iterator
- * d_samples_tex_itr.UnbindTexture();
- *
- * \endcode
- *
- * \tparam BINS Number of histogram bins per channel
- * \tparam InputIteratorRA <b>[inferred]</b> Random-access iterator type for input (may be a simple pointer type). Must have a value type that can be cast as an integer in the range [0..BINS-1]
- * \tparam HistoCounter <b>[inferred]</b> Integral type for counting sample occurrences per histogram bin
- */
- template <
- int BINS,
- typename InputIteratorRA,
- typename HistoCounter>
- __host__ __device__ __forceinline__
- static cudaError_t SingleChannelSharedAtomic(
- void *d_temp_storage, ///< [in] %Device allocation of temporary storage. When NULL, the required allocation size is returned in \p temp_storage_bytes and no work is done.
- size_t &temp_storage_bytes, ///< [in,out] Size in bytes of \p d_temp_storage allocation.
- InputIteratorRA d_samples, ///< [in] Input samples
- HistoCounter* d_histogram, ///< [out] Array of BINS counters of integral type \p HistoCounter.
- int num_samples, ///< [in] Number of samples to process
- cudaStream_t stream = 0, ///< [in] <b>[optional]</b> CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
- bool stream_synchronous = false) ///< [in] <b>[optional]</b> Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is \p false.
- {
- return Dispatch<GRID_HISTO_SHARED_ATOMIC, BINS, 1, 1>(
- d_temp_storage, temp_storage_bytes, d_samples, &d_histogram, num_samples, stream, stream_synchronous);
- }
-
-
- /**
- * \brief Computes a device-wide histogram. Uses global-memory atomic read-modify-write operations to compute the histogram. Input samples having lower diversity can cause performance to be degraded.
- *
- * Performance is not significantly impacted when computing histograms having large
- * numbers of bins (e.g., thousands).
- *
- * \devicestorage
- *
- * \cdp
- *
- * \iterator
- *
- * \par
- * The code snippet below illustrates the computation of a 256-bin histogram of
- * single-channel <tt>unsigned char</tt> samples.
- * \par
- * \code
- * #include <cub/cub.cuh>
- * ...
- *
- * // Declare and initialize device pointers for input samples and 256-bin output histogram
- * unsigned char *d_samples;
- * unsigned int *d_histogram;
- * int num_items = ...
- * ...
- *
- * // Wrap d_samples device pointer in a random-access texture iterator
- * cub::TexIteratorRA<unsigned char> d_samples_tex_itr;
- * d_samples_tex_itr.BindTexture(d_samples, num_items * sizeof(unsigned char));
- *
- * // Determine temporary device storage requirements for histogram computation
- * void *d_temp_storage = NULL;
- * size_t temp_storage_bytes = 0;
- * cub::DeviceHistogram::SingleChannelGlobalAtomic<256>(d_temp_storage, temp_storage_bytes, d_samples_tex_itr, d_histogram, num_items);
- *
- * // Allocate temporary storage for histogram computation
- * cudaMalloc(&d_temp_storage, temp_storage_bytes);
- *
- * // Compute histogram
- * cub::DeviceHistogram::SingleChannelGlobalAtomic<256>(d_temp_storage, temp_storage_bytes, d_samples_tex_itr, d_histogram, num_items);
- *
- * // Unbind texture iterator
- * d_samples_tex_itr.UnbindTexture();
- *
- * \endcode
- *
- * \tparam BINS Number of histogram bins per channel
- * \tparam InputIteratorRA <b>[inferred]</b> Random-access iterator type for input (may be a simple pointer type). Must have a value type that can be cast as an integer in the range [0..BINS-1]
- * \tparam HistoCounter <b>[inferred]</b> Integral type for counting sample occurrences per histogram bin
- */
- template <
- int BINS,
- typename InputIteratorRA,
- typename HistoCounter>
- __host__ __device__ __forceinline__
- static cudaError_t SingleChannelGlobalAtomic(
- void *d_temp_storage, ///< [in] %Device allocation of temporary storage. When NULL, the required allocation size is returned in \p temp_storage_bytes and no work is done.
- size_t &temp_storage_bytes, ///< [in,out] Size in bytes of \p d_temp_storage allocation.
- InputIteratorRA d_samples, ///< [in] Input samples
- HistoCounter* d_histogram, ///< [out] Array of BINS counters of integral type \p HistoCounter.
- int num_samples, ///< [in] Number of samples to process
- cudaStream_t stream = 0, ///< [in] <b>[optional]</b> CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
- bool stream_synchronous = false) ///< [in] <b>[optional]</b> Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is \p false.
- {
- return Dispatch<GRID_HISTO_GLOBAL_ATOMIC, BINS, 1, 1>(
- d_temp_storage, temp_storage_bytes, d_samples, &d_histogram, num_samples, stream, stream_synchronous);
- }
-
-
- //@} end member group
- /******************************************************************//**
- * \name Interleaved multi-channel samples
- *********************************************************************/
- //@{
-
-
- /**
- * \brief Computes a device-wide histogram from multi-channel data. Uses fast block-sorting to compute the histogram. Delivers consistent throughput regardless of sample diversity, but occupancy may be limited by histogram bin count.
- *
- * However, because histograms are privatized in shared memory, a large
- * number of bins (e.g., thousands) may adversely affect occupancy and
- * performance (or even the ability to launch).
- *
- * The total number of samples across all channels (\p num_samples) must be a whole multiple of \p CHANNELS.
- *
- * \devicestorage
- *
- * \cdp
- *
- * \iterator
- *
- * \par
- * The code snippet below illustrates the computation of three 256-bin histograms from
- * interleaved quad-channel <tt>unsigned char</tt> samples (e.g., RGB histograms from RGBA samples).
- * \par
- * \code
- * #include <cub/cub.cuh>
- * ...
- *
- * // Declare and initialize device pointers for input samples and
- * // three 256-bin output histograms
- * unsigned char *d_samples;
- * unsigned int *d_histograms[3];
- * int num_items = ...
- * ...
- *
- * // Wrap d_samples device pointer in a random-access texture iterator
- * cub::TexIteratorRA<unsigned char> d_samples_tex_itr;
- * d_samples_tex_itr.BindTexture(d_samples, num_items * sizeof(unsigned char));
- *
- * // Determine temporary device storage requirements for histogram computation
- * void *d_temp_storage = NULL;
- * size_t temp_storage_bytes = 0;
- * cub::DeviceHistogram::MultiChannelSorting<256>(d_temp_storage, temp_storage_bytes, d_samples_tex_itr, d_histograms, num_items);
- *
- * // Allocate temporary storage for histogram computation
- * cudaMalloc(&d_temp_storage, temp_storage_bytes);
- *
- * // Compute histograms
- * cub::DeviceHistogram::MultiChannelSorting<256>(d_temp_storage, temp_storage_bytes, d_samples_tex_itr, d_histograms, num_items);
- *
- * // Unbind texture iterator
- * d_samples_tex_itr.UnbindTexture();
- *
- * \endcode
- *
- * \tparam BINS Number of histogram bins per channel
- * \tparam CHANNELS Number of channels interleaved in the input data (may be greater than the number of channels being actively histogrammed)
- * \tparam ACTIVE_CHANNELS <b>[inferred]</b> Number of channels actively being histogrammed
- * \tparam InputIteratorRA <b>[inferred]</b> Random-access iterator type for input (may be a simple pointer type). Must have a value type that can be cast as an integer in the range [0..BINS-1]
- * \tparam HistoCounter <b>[inferred]</b> Integral type for counting sample occurrences per histogram bin
- */
- template <
- int BINS,
- int CHANNELS,
- int ACTIVE_CHANNELS,
- typename InputIteratorRA,
- typename HistoCounter>
- __host__ __device__ __forceinline__
- static cudaError_t MultiChannelSorting(
- void *d_temp_storage, ///< [in] %Device allocation of temporary storage. When NULL, the required allocation size is returned in \p temp_storage_bytes and no work is done.
- size_t &temp_storage_bytes, ///< [in,out] Size in bytes of \p d_temp_storage allocation.
- InputIteratorRA d_samples, ///< [in] Input samples. The samples from different channels are assumed to be interleaved (e.g., an array of 32b pixels where each pixel consists of four RGBA 8b samples).
- HistoCounter *d_histograms[ACTIVE_CHANNELS], ///< [out] Array of channel histogram counter arrays, each having BINS counters of integral type \p HistoCounter.
- int num_samples, ///< [in] Total number of samples to process in all channels, including non-active channels
- cudaStream_t stream = 0, ///< [in] <b>[optional]</b> CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
- bool stream_synchronous = false) ///< [in] <b>[optional]</b> Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is \p false.
- {
- return Dispatch<GRID_HISTO_SORT, BINS, CHANNELS, ACTIVE_CHANNELS>(
- d_temp_storage, temp_storage_bytes, d_samples, d_histograms, num_samples, stream, stream_synchronous);
- }
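-
- // Note on sample counts (illustrative): for an interleaved RGBA image of width * height
- // pixels stored as 8-bit samples, CHANNELS = 4 and num_samples = 4 * width * height, so
- // each of the ACTIVE_CHANNELS histograms receives width * height samples.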
-
-
- /**
- * \brief Computes a device-wide histogram from multi-channel data. Uses shared-memory atomic read-modify-write operations to compute the histogram. Input samples having lower diversity can cause performance to be degraded, and occupancy may be limited by histogram bin count.
- *
- * However, because histograms are privatized in shared memory, a large
- * number of bins (e.g., thousands) may adversely affect occupancy and
- * performance (or even the ability to launch).
- *
- * The total number of samples across all channels (\p num_samples) must be a whole multiple of \p CHANNELS.
- *
- * \devicestorage
- *
- * \cdp
- *
- * \iterator
- *
- * \par
- * The code snippet below illustrates the computation of three 256-bin histograms from
- * interleaved quad-channel <tt>unsigned char</tt> samples (e.g., RGB histograms from RGBA samples).
- * \par
- * \code
- * #include <cub/cub.cuh>
- * ...
- *
- * // Declare and initialize device pointers for input samples and
- * // three 256-bin output histograms
- * unsigned char *d_samples;
- * unsigned int *d_histograms[3];
- * int num_items = ...
- * ...
- *
- * // Wrap d_samples device pointer in a random-access texture iterator
- * cub::TexIteratorRA<unsigned char> d_samples_tex_itr;
- * d_samples_tex_itr.BindTexture(d_samples, num_items * sizeof(unsigned char));
- *
- * // Determine temporary device storage requirements for histogram computation
- * void *d_temp_storage = NULL;
- * size_t temp_storage_bytes = 0;
- * cub::DeviceHistogram::MultiChannelSharedAtomic<256>(d_temp_storage, temp_storage_bytes, d_samples_tex_itr, d_histograms, num_items);
- *
- * // Allocate temporary storage for histogram computation
- * cudaMalloc(&d_temp_storage, temp_storage_bytes);
- *
- * // Compute histograms
- * cub::DeviceHistogram::MultiChannelSharedAtomic<256>(d_temp_storage, temp_storage_bytes, d_samples_tex_itr, d_histograms, num_items);
- *
- * // Unbind texture iterator
- * d_samples_tex_itr.UnbindTexture();
- *
- * \endcode
- *
- * \tparam BINS Number of histogram bins per channel
- * \tparam CHANNELS Number of channels interleaved in the input data (may be greater than the number of channels being actively histogrammed)
- * \tparam ACTIVE_CHANNELS <b>[inferred]</b> Number of channels actively being histogrammed
- * \tparam InputIteratorRA <b>[inferred]</b> Random-access iterator type for input (may be a simple pointer type). Must have a value type that can be cast as an integer in the range [0..BINS-1]
- * \tparam HistoCounter <b>[inferred]</b> Integral type for counting sample occurrences per histogram bin
- */
- template <
- int BINS,
- int CHANNELS,
- int ACTIVE_CHANNELS,
- typename InputIteratorRA,
- typename HistoCounter>
- __host__ __device__ __forceinline__
- static cudaError_t MultiChannelSharedAtomic(
- void *d_temp_storage, ///< [in] %Device allocation of temporary storage. When NULL, the required allocation size is returned in \p temp_storage_bytes and no work is done.
- size_t &temp_storage_bytes, ///< [in,out] Size in bytes of \p d_temp_storage allocation.
- InputIteratorRA d_samples, ///< [in] Input samples. The samples from different channels are assumed to be interleaved (e.g., an array of 32b pixels where each pixel consists of four RGBA 8b samples).
- HistoCounter *d_histograms[ACTIVE_CHANNELS], ///< [out] Array of channel histogram counter arrays, each having BINS counters of integral type \p HistoCounter.
- int num_samples, ///< [in] Total number of samples to process in all channels, including non-active channels
- cudaStream_t stream = 0, ///< [in] <b>[optional]</b> CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
- bool stream_synchronous = false) ///< [in] <b>[optional]</b> Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is \p false.
- {
- return Dispatch<GRID_HISTO_SHARED_ATOMIC, BINS, CHANNELS, ACTIVE_CHANNELS>(
- d_temp_storage, temp_storage_bytes, d_samples, d_histograms, num_samples, stream, stream_synchronous);
- }
-
-
- /**
- * \brief Computes a device-wide histogram from multi-channel data. Uses global-memory atomic read-modify-write operations to compute the histogram. Input samples having lower diversity can cause performance to be degraded.
- *
- * Performance is not significantly impacted when computing histograms having large
- * numbers of bins (e.g., thousands).
- *
- * The total number of samples across all channels (\p num_samples) must be a whole multiple of \p CHANNELS.
- *
- * \devicestorage
- *
- * \cdp
- *
- * \iterator
- *
- * Performance is often improved when referencing input samples through a texture-caching iterator, e.g., cub::TexIteratorRA or cub::TexTransformIteratorRA.
- *
- * \par
- * The code snippet below illustrates the computation of three 256-bin histograms from
- * interleaved quad-channel <tt>unsigned char</tt> samples (e.g., RGB histograms from RGBA samples).
- * \par
- * \code
- * #include <cub/cub.cuh>
- * ...
- *
- * // Declare and initialize device pointers for input samples and
- * // three 256-bin output histograms
- * unsigned char *d_samples;
- * unsigned int *d_histograms[3];
- * int num_items = ...
- * ...
- *
- * // Wrap d_samples device pointer in a random-access texture iterator
- * cub::TexIteratorRA<unsigned char> d_samples_tex_itr;
- * d_samples_tex_itr.BindTexture(d_samples, num_items * sizeof(unsigned char));
- *
- * // Determine temporary device storage requirements for histogram computation
- * void *d_temp_storage = NULL;
- * size_t temp_storage_bytes = 0;
- * cub::DeviceHistogram::MultiChannelGlobalAtomic<256>(d_temp_storage, temp_storage_bytes, d_samples_tex_itr, d_histograms, num_items);
- *
- * // Allocate temporary storage for histogram computation
- * cudaMalloc(&d_temp_storage, temp_storage_bytes);
- *
- * // Compute histograms
- * cub::DeviceHistogram::MultiChannelGlobalAtomic<256>(d_temp_storage, temp_storage_bytes, d_samples_tex_itr, d_histograms, num_items);
- *
- * // Unbind texture iterator
- * d_samples_tex_itr.UnbindTexture();
- *
- * \endcode
- *
- * \tparam BINS Number of histogram bins per channel
- * \tparam CHANNELS Number of channels interleaved in the input data (may be greater than the number of channels being actively histogrammed)
- * \tparam ACTIVE_CHANNELS <b>[inferred]</b> Number of channels actively being histogrammed
- * \tparam InputIteratorRA <b>[inferred]</b> Random-access iterator type for input (may be a simple pointer type). Must have a value type that can be cast as an integer in the range [0..BINS-1]
- * \tparam HistoCounter <b>[inferred]</b> Integral type for counting sample occurrences per histogram bin
- */
- template <
- int BINS,
- int CHANNELS,
- int ACTIVE_CHANNELS,
- typename InputIteratorRA,
- typename HistoCounter>
- __host__ __device__ __forceinline__
- static cudaError_t MultiChannelGlobalAtomic(
- void *d_temp_storage, ///< [in] %Device allocation of temporary storage. When NULL, the required allocation size is returned in \p temp_storage_bytes and no work is done.
- size_t &temp_storage_bytes, ///< [in,out] Size in bytes of \p d_temp_storage allocation.
- InputIteratorRA d_samples, ///< [in] Input samples. The samples from different channels are assumed to be interleaved (e.g., an array of 32b pixels where each pixel consists of four RGBA 8b samples).
- HistoCounter *d_histograms[ACTIVE_CHANNELS], ///< [out] Array of channel histogram counter arrays, each having BINS counters of integral type \p HistoCounter.
- int num_samples, ///< [in] Total number of samples to process in all channels, including non-active channels
- cudaStream_t stream = 0, ///< [in] <b>[optional]</b> CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
- bool stream_synchronous = false) ///< [in] <b>[optional]</b> Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is \p false.
- {
- return Dispatch<GRID_HISTO_GLOBAL_ATOMIC, BINS, CHANNELS, ACTIVE_CHANNELS>(
- d_temp_storage, temp_storage_bytes, d_samples, d_histograms, num_samples, stream, stream_synchronous);
- }
-
- //@} end member group
-
-};
-
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
-
diff --git a/lib/kokkos/TPL/cub/device/device_radix_sort.cuh b/lib/kokkos/TPL/cub/device/device_radix_sort.cuh
deleted file mode 100755
index 087d546bc..000000000
--- a/lib/kokkos/TPL/cub/device/device_radix_sort.cuh
+++ /dev/null
@@ -1,890 +0,0 @@
-
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * cub::DeviceRadixSort provides operations for computing a device-wide, parallel radix sort across data items residing within global memory.
- */
-
-#pragma once
-
-#include <stdio.h>
-#include <iterator>
-
-#include "block/block_radix_sort_upsweep_tiles.cuh"
-#include "block/block_radix_sort_downsweep_tiles.cuh"
-#include "block/block_scan_tiles.cuh"
-#include "../grid/grid_even_share.cuh"
-#include "../util_debug.cuh"
-#include "../util_device.cuh"
-#include "../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
-
-
-
-/******************************************************************************
- * Kernel entry points
- *****************************************************************************/
-
-/**
- * Upsweep pass kernel entry point (multi-block). Computes privatized digit histograms, one per block.
- */
-template <
- typename BlockRadixSortUpsweepTilesPolicy, ///< Tuning policy for cub::BlockRadixSortUpsweepTiles abstraction
- typename Key, ///< Key type
- typename SizeT> ///< Integer type used for global array indexing
-__launch_bounds__ (int(BlockRadixSortUpsweepTilesPolicy::BLOCK_THREADS), 1)
-__global__ void RadixSortUpsweepKernel(
- Key *d_keys, ///< [in] Input keys buffer
- SizeT *d_spine, ///< [out] Privatized (per block) digit histograms (striped, i.e., 0s counts from each block, then 1s counts from each block, etc.)
- SizeT num_items, ///< [in] Total number of input data items
- int current_bit, ///< [in] Bit position of current radix digit
- bool use_primary_bit_granularity, ///< [in] Whether or not to use the primary policy (or the embedded alternate policy for smaller bit granularity)
- bool first_pass, ///< [in] Whether this is the first digit pass
- GridEvenShare<SizeT> even_share) ///< [in] Descriptor for how to map an even-share of tiles across thread blocks
-{
-
- // Alternate policy for when fewer bits remain
- typedef typename BlockRadixSortUpsweepTilesPolicy::AltPolicy AltPolicy;
-
- // Parameterize two versions of BlockRadixSortUpsweepTiles type for the current configuration
- typedef BlockRadixSortUpsweepTiles<BlockRadixSortUpsweepTilesPolicy, Key, SizeT> BlockRadixSortUpsweepTilesT; // Primary
- typedef BlockRadixSortUpsweepTiles<AltPolicy, Key, SizeT> AltBlockRadixSortUpsweepTilesT; // Alternate (smaller bit granularity)
-
- // Shared memory storage
- __shared__ union
- {
- typename BlockRadixSortUpsweepTilesT::TempStorage pass_storage;
- typename AltBlockRadixSortUpsweepTilesT::TempStorage alt_pass_storage;
- } temp_storage;
-
- // Initialize even-share descriptor for this thread block
- even_share.BlockInit();
-
- // Process input tiles (each of the first RADIX_DIGITS threads will compute a count for that digit)
- if (use_primary_bit_granularity)
- {
- // Primary granularity
- SizeT bin_count;
- BlockRadixSortUpsweepTilesT(temp_storage.pass_storage, d_keys, current_bit).ProcessTiles(
- even_share.block_offset,
- even_share.block_oob,
- bin_count);
-
- // Write out digit counts (striped)
- if (threadIdx.x < BlockRadixSortUpsweepTilesT::RADIX_DIGITS)
- {
- d_spine[(gridDim.x * threadIdx.x) + blockIdx.x] = bin_count;
- }
- }
- else
- {
- // Alternate granularity
- // Process input tiles (each of the first RADIX_DIGITS threads will compute a count for that digit)
- SizeT bin_count;
- AltBlockRadixSortUpsweepTilesT(temp_storage.alt_pass_storage, d_keys, current_bit).ProcessTiles(
- even_share.block_offset,
- even_share.block_oob,
- bin_count);
-
- // Write out digit counts (striped)
- if (threadIdx.x < AltBlockRadixSortUpsweepTilesT::RADIX_DIGITS)
- {
- d_spine[(gridDim.x * threadIdx.x) + blockIdx.x] = bin_count;
- }
- }
-}
-
-
-/**
- * Spine scan kernel entry point (single-block). Computes an exclusive prefix sum over the privatized digit histograms
- */
-template <
- typename BlockScanTilesPolicy, ///< Tuning policy for cub::BlockScanTiles abstraction
- typename SizeT> ///< Integer type used for global array indexing
-__launch_bounds__ (int(BlockScanTilesPolicy::BLOCK_THREADS), 1)
-__global__ void RadixSortScanKernel(
- SizeT *d_spine, ///< [in,out] Privatized (per block) digit histograms (striped, i.e., 0s counts from each block, then 1s counts from each block, etc.)
- int num_counts) ///< [in] Total number of bin-counts
-{
- // Parameterize the BlockScanTiles type for the current configuration
- typedef BlockScanTiles<BlockScanTilesPolicy, SizeT*, SizeT*, cub::Sum, SizeT, SizeT> BlockScanTilesT;
-
- // Shared memory storage
- __shared__ typename BlockScanTilesT::TempStorage temp_storage;
-
- // Block scan instance
- BlockScanTilesT block_scan(temp_storage, d_spine, d_spine, cub::Sum(), SizeT(0));
-
- // Process full input tiles
- int block_offset = 0;
- RunningBlockPrefixOp<SizeT> prefix_op;
- prefix_op.running_total = 0;
- while (block_offset < num_counts)
- {
- block_scan.ConsumeTile<true, false>(block_offset, prefix_op);
- block_offset += BlockScanTilesT::TILE_ITEMS;
- }
-}
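-
-// Worked example (illustrative): with 2 thread blocks and a hypothetical 2-bit digit pass,
-// the upsweep writes striped per-block digit counts such as
-//     d_spine = [3, 2,   1, 2,   0, 1,   2, 1]    // digit-0 counts per block, then digit-1, ...
-// and the exclusive prefix sum computed here turns them into the global scatter offsets
-//     d_spine = [0, 3,   5, 6,   8, 8,   9, 11]
-// that the downsweep pass uses to place each block's keys for each digit value.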
-
-
-/**
- * Downsweep pass kernel entry point (multi-block). Scatters keys (and values) into corresponding bins for the current digit place.
- */
-template <
- typename BlockRadixSortDownsweepTilesPolicy, ///< Tuning policy for cub::BlockRadixSortDownsweepTiles abstraction
- typename Key, ///< Key type
- typename Value, ///< Value type
- typename SizeT> ///< Integer type used for global array indexing
-__launch_bounds__ (int(BlockRadixSortDownsweepTilesPolicy::BLOCK_THREADS))
-__global__ void RadixSortDownsweepKernel(
- Key *d_keys_in, ///< [in] Input keys ping buffer
- Key *d_keys_out, ///< [out] Output keys pong buffer
- Value *d_values_in, ///< [in] Input values ping buffer
- Value *d_values_out, ///< [out] Output values pong buffer
- SizeT *d_spine, ///< [in] Scan of privatized (per block) digit histograms (striped, i.e., 0s counts from each block, then 1s counts from each block, etc.)
- SizeT num_items, ///< [in] Total number of input data items
- int current_bit, ///< [in] Bit position of current radix digit
- bool use_primary_bit_granularity, ///< [in] Whether or not to use the primary policy (or the embedded alternate policy for smaller bit granularity)
- bool first_pass, ///< [in] Whether this is the first digit pass
- bool last_pass, ///< [in] Whether this is the last digit pass
- GridEvenShare<SizeT> even_share) ///< [in] Descriptor for how to map an even-share of tiles across thread blocks
-{
-
- // Alternate policy for when fewer bits remain
- typedef typename BlockRadixSortDownsweepTilesPolicy::AltPolicy AltPolicy;
-
- // Parameterize two versions of BlockRadixSortDownsweepTiles type for the current configuration
- typedef BlockRadixSortDownsweepTiles<BlockRadixSortDownsweepTilesPolicy, Key, Value, SizeT> BlockRadixSortDownsweepTilesT;
- typedef BlockRadixSortDownsweepTiles<AltPolicy, Key, Value, SizeT> AltBlockRadixSortDownsweepTilesT;
-
- // Shared memory storage
- __shared__ union
- {
- typename BlockRadixSortDownsweepTilesT::TempStorage pass_storage;
- typename AltBlockRadixSortDownsweepTilesT::TempStorage alt_pass_storage;
-
- } temp_storage;
-
- // Initialize even-share descriptor for this thread block
- even_share.BlockInit();
-
- if (use_primary_bit_granularity)
- {
- // Process input tiles
- BlockRadixSortDownsweepTilesT(temp_storage.pass_storage, num_items, d_spine, d_keys_in, d_keys_out, d_values_in, d_values_out, current_bit).ProcessTiles(
- even_share.block_offset,
- even_share.block_oob);
- }
- else
- {
- // Process input tiles
- AltBlockRadixSortDownsweepTilesT(temp_storage.alt_pass_storage, num_items, d_spine, d_keys_in, d_keys_out, d_values_in, d_values_out, current_bit).ProcessTiles(
- even_share.block_offset,
- even_share.block_oob);
- }
-}
-
-
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
-
-
-
-
-/******************************************************************************
- * DeviceRadixSort
- *****************************************************************************/
-
-/**
- * \brief DeviceRadixSort provides operations for computing a device-wide, parallel radix sort across data items residing within global memory. ![](sorting_logo.png)
- * \ingroup DeviceModule
- *
- * \par Overview
- * The [<em>radix sorting method</em>](http://en.wikipedia.org/wiki/Radix_sort) arranges
- * items into ascending order. It relies upon a positional representation for
- * keys, i.e., each key is composed of an ordered sequence of symbols (e.g., digits,
- * characters, etc.) specified from least-significant to most-significant. For a
- * given input sequence of keys and a set of rules specifying a total ordering
- * of the symbolic alphabet, the radix sorting method produces a lexicographic
- * ordering of those keys.
- *
- * \par
- * DeviceRadixSort can sort all of the built-in C++ numeric primitive types, e.g.:
- * <tt>unsigned char</tt>, \p int, \p double, etc. Although the direct radix sorting
- * method can only be applied to unsigned integral types, DeviceRadixSort
- * is able to sort signed and floating-point types via simple bit-wise transformations
- * that ensure lexicographic key ordering.
- *
- * \par Usage Considerations
- * \cdp_class{DeviceRadixSort}
- *
- * \par Performance
- *
- * \image html lsd_sort_perf.png
- *
- */
-struct DeviceRadixSort
-{
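- // The "simple bit-wise transformations" mentioned in the overview above can be sketched
- // as follows (an illustrative rendition of the standard float-key trick, not the exact
- // helper CUB applies internally through its key-traits machinery):
- //
- //     // Map an IEEE-754 float to an unsigned int whose ascending unsigned order
- //     // matches the ascending order of the original float values.
- //     __host__ __device__ unsigned int TwiddleFloatIn(float f)
- //     {
- //         unsigned int u;
- //         memcpy(&u, &f, sizeof(u));                       // bit-copy, no conversion
- //         return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
- //     }
-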
- #ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
-
- /******************************************************************************
- * Constants and typedefs
- ******************************************************************************/
-
- /// Generic structure for encapsulating dispatch properties codified in block policy.
- struct KernelDispachParams
- {
- int block_threads;
- int items_per_thread;
- cudaSharedMemConfig smem_config;
- int radix_bits;
- int alt_radix_bits;
- int subscription_factor;
- int tile_size;
-
- template <typename SortBlockPolicy>
- __host__ __device__ __forceinline__
- void InitUpsweepPolicy(int subscription_factor = 1)
- {
- block_threads = SortBlockPolicy::BLOCK_THREADS;
- items_per_thread = SortBlockPolicy::ITEMS_PER_THREAD;
- radix_bits = SortBlockPolicy::RADIX_BITS;
- alt_radix_bits = SortBlockPolicy::AltPolicy::RADIX_BITS;
- smem_config = cudaSharedMemBankSizeFourByte;
- this->subscription_factor = subscription_factor;
- tile_size = block_threads * items_per_thread;
- }
-
- template <typename ScanBlockPolicy>
- __host__ __device__ __forceinline__
- void InitScanPolicy()
- {
- block_threads = ScanBlockPolicy::BLOCK_THREADS;
- items_per_thread = ScanBlockPolicy::ITEMS_PER_THREAD;
- radix_bits = 0;
- alt_radix_bits = 0;
- smem_config = cudaSharedMemBankSizeFourByte;
- subscription_factor = 0;
- tile_size = block_threads * items_per_thread;
- }
-
- template <typename SortBlockPolicy>
- __host__ __device__ __forceinline__
- void InitDownsweepPolicy(int subscription_factor = 1)
- {
- block_threads = SortBlockPolicy::BLOCK_THREADS;
- items_per_thread = SortBlockPolicy::ITEMS_PER_THREAD;
- radix_bits = SortBlockPolicy::RADIX_BITS;
- alt_radix_bits = SortBlockPolicy::AltPolicy::RADIX_BITS;
- smem_config = SortBlockPolicy::SMEM_CONFIG;
- this->subscription_factor = subscription_factor;
- tile_size = block_threads * items_per_thread;
- }
- };
-
-
-
- /******************************************************************************
- * Tuning policies
- ******************************************************************************/
-
- /// Specializations of tuned policy types for different PTX architectures
- template <typename Key, typename Value, typename SizeT, int ARCH>
- struct TunedPolicies;
-
- /// SM35 tune
- template <typename Key, typename Value, typename SizeT>
- struct TunedPolicies<Key, Value, SizeT, 350>
- {
- enum {
- KEYS_ONLY = (Equals<Value, NullType>::VALUE),
- SCALE_FACTOR = (CUB_MAX(sizeof(Key), sizeof(Value)) + 3) / 4,
- RADIX_BITS = 5,
- };
-
- // UpsweepPolicy
- typedef BlockRadixSortUpsweepTilesPolicy <64, CUB_MAX(1, 18 / SCALE_FACTOR), LOAD_LDG, RADIX_BITS> UpsweepPolicyKeys;
- typedef BlockRadixSortUpsweepTilesPolicy <128, CUB_MAX(1, 15 / SCALE_FACTOR), LOAD_LDG, RADIX_BITS> UpsweepPolicyPairs;
- typedef typename If<KEYS_ONLY, UpsweepPolicyKeys, UpsweepPolicyPairs>::Type UpsweepPolicy;
-/*
- // 4bit
- typedef BlockRadixSortUpsweepTilesPolicy <128, 15, LOAD_LDG, RADIX_BITS> UpsweepPolicyKeys;
- typedef BlockRadixSortUpsweepTilesPolicy <256, 13, LOAD_LDG, RADIX_BITS> UpsweepPolicyPairs;
-*/
- // ScanPolicy
- typedef BlockScanTilesPolicy <1024, 4, BLOCK_LOAD_VECTORIZE, false, LOAD_DEFAULT, BLOCK_STORE_VECTORIZE, false, BLOCK_SCAN_RAKING_MEMOIZE> ScanPolicy;
-
- // DownsweepPolicy
- typedef BlockRadixSortDownsweepTilesPolicy <64, CUB_MAX(1, 18 / SCALE_FACTOR), BLOCK_LOAD_DIRECT, LOAD_LDG, false, true, BLOCK_SCAN_WARP_SCANS, RADIX_SORT_SCATTER_TWO_PHASE, cudaSharedMemBankSizeEightByte, RADIX_BITS> DownsweepPolicyKeys;
- typedef BlockRadixSortDownsweepTilesPolicy <128, CUB_MAX(1, 15 / SCALE_FACTOR), BLOCK_LOAD_DIRECT, LOAD_LDG, false, true, BLOCK_SCAN_WARP_SCANS, RADIX_SORT_SCATTER_TWO_PHASE, cudaSharedMemBankSizeEightByte, RADIX_BITS> DownsweepPolicyPairs;
- typedef typename If<KEYS_ONLY, DownsweepPolicyKeys, DownsweepPolicyPairs>::Type DownsweepPolicy;
-
-/*
- // 4bit
- typedef BlockRadixSortDownsweepTilesPolicy <128, 15, BLOCK_LOAD_DIRECT, LOAD_LDG, false, true, BLOCK_SCAN_WARP_SCANS, RADIX_SORT_SCATTER_TWO_PHASE, cudaSharedMemBankSizeEightByte, RADIX_BITS> DownsweepPolicyKeys;
- typedef BlockRadixSortDownsweepTilesPolicy <256, 13, BLOCK_LOAD_DIRECT, LOAD_LDG, false, true, BLOCK_SCAN_WARP_SCANS, RADIX_SORT_SCATTER_TWO_PHASE, cudaSharedMemBankSizeEightByte, RADIX_BITS> DownsweepPolicyPairs;
-*/
- enum { SUBSCRIPTION_FACTOR = 7 };
- };
-
-
- /// SM20 tune
- template <typename Key, typename Value, typename SizeT>
- struct TunedPolicies<Key, Value, SizeT, 200>
- {
- enum {
- KEYS_ONLY = (Equals<Value, NullType>::VALUE),
- SCALE_FACTOR = (CUB_MAX(sizeof(Key), sizeof(Value)) + 3) / 4,
- RADIX_BITS = 5,
- };
-
- // UpsweepPolicy
- typedef BlockRadixSortUpsweepTilesPolicy <64, CUB_MAX(1, 18 / SCALE_FACTOR), LOAD_DEFAULT, RADIX_BITS> UpsweepPolicyKeys;
- typedef BlockRadixSortUpsweepTilesPolicy <128, CUB_MAX(1, 13 / SCALE_FACTOR), LOAD_DEFAULT, RADIX_BITS> UpsweepPolicyPairs;
- typedef typename If<KEYS_ONLY, UpsweepPolicyKeys, UpsweepPolicyPairs>::Type UpsweepPolicy;
-
- // ScanPolicy
- typedef BlockScanTilesPolicy <512, 4, BLOCK_LOAD_VECTORIZE, false, LOAD_DEFAULT, BLOCK_STORE_VECTORIZE, false, BLOCK_SCAN_RAKING_MEMOIZE> ScanPolicy;
-
- // DownsweepPolicy
- typedef BlockRadixSortDownsweepTilesPolicy <64, CUB_MAX(1, 18 / SCALE_FACTOR), BLOCK_LOAD_WARP_TRANSPOSE, LOAD_DEFAULT, false, false, BLOCK_SCAN_WARP_SCANS, RADIX_SORT_SCATTER_TWO_PHASE, cudaSharedMemBankSizeFourByte, RADIX_BITS> DownsweepPolicyKeys;
- typedef BlockRadixSortDownsweepTilesPolicy <128, CUB_MAX(1, 13 / SCALE_FACTOR), BLOCK_LOAD_WARP_TRANSPOSE, LOAD_DEFAULT, false, false, BLOCK_SCAN_WARP_SCANS, RADIX_SORT_SCATTER_TWO_PHASE, cudaSharedMemBankSizeFourByte, RADIX_BITS> DownsweepPolicyPairs;
- typedef typename If<KEYS_ONLY, DownsweepPolicyKeys, DownsweepPolicyPairs>::Type DownsweepPolicy;
-
- enum { SUBSCRIPTION_FACTOR = 3 };
- };
-
-
- /// SM10 tune
- template <typename Key, typename Value, typename SizeT>
- struct TunedPolicies<Key, Value, SizeT, 100>
- {
- enum {
- RADIX_BITS = 4,
- };
-
- // UpsweepPolicy
- typedef BlockRadixSortUpsweepTilesPolicy <64, 9, LOAD_DEFAULT, RADIX_BITS> UpsweepPolicy;
-
- // ScanPolicy
- typedef BlockScanTilesPolicy <256, 4, BLOCK_LOAD_VECTORIZE, false, LOAD_DEFAULT, BLOCK_STORE_VECTORIZE, false, BLOCK_SCAN_RAKING_MEMOIZE> ScanPolicy;
-
- // DownsweepPolicy
- typedef BlockRadixSortDownsweepTilesPolicy <64, 9, BLOCK_LOAD_WARP_TRANSPOSE, LOAD_DEFAULT, false, false, BLOCK_SCAN_WARP_SCANS, RADIX_SORT_SCATTER_TWO_PHASE, cudaSharedMemBankSizeFourByte, RADIX_BITS> DownsweepPolicy;
-
- enum { SUBSCRIPTION_FACTOR = 3 };
- };
-
-
-
- /******************************************************************************
- * Default policy initializer
- ******************************************************************************/
-
- /// Tuning policy for the PTX architecture that DeviceRadixSort operations will get dispatched to
- template <typename Key, typename Value, typename SizeT>
- struct PtxDefaultPolicies
- {
-
- static const int PTX_TUNE_ARCH = (CUB_PTX_ARCH >= 350) ?
- 350 :
- (CUB_PTX_ARCH >= 200) ?
- 200 :
- 100;
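- // e.g., a compilation pass targeting compute capability 3.0 (CUB_PTX_ARCH == 300) falls
- // back to the SM20 tuning, while SM35 and newer use the SM35 tuning.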
-
- // Tuned policy set for the current PTX compiler pass
- typedef TunedPolicies<Key, Value, SizeT, PTX_TUNE_ARCH> PtxTunedPolicies;
-
- // UpsweepPolicy that opaquely derives from the specialization corresponding to the current PTX compiler pass
- struct UpsweepPolicy : PtxTunedPolicies::UpsweepPolicy {};
-
- // ScanPolicy that opaquely derives from the specialization corresponding to the current PTX compiler pass
- struct ScanPolicy : PtxTunedPolicies::ScanPolicy {};
-
- // DownsweepPolicy that opaquely derives from the specialization corresponding to the current PTX compiler pass
- struct DownsweepPolicy : PtxTunedPolicies::DownsweepPolicy {};
-
- // Subscription factor for the current PTX compiler pass
- enum { SUBSCRIPTION_FACTOR = PtxTunedPolicies::SUBSCRIPTION_FACTOR };
-
-
- /**
- * Initialize dispatch params with the policies corresponding to the PTX assembly we will use
- */
- static void InitDispatchParams(
- int ptx_version,
- KernelDispachParams &upsweep_dispatch_params,
- KernelDispachParams &scan_dispatch_params,
- KernelDispachParams &downsweep_dispatch_params)
- {
- if (ptx_version >= 350)
- {
- typedef TunedPolicies<Key, Value, SizeT, 350> TunedPolicies;
- upsweep_dispatch_params.InitUpsweepPolicy<typename TunedPolicies::UpsweepPolicy>(TunedPolicies::SUBSCRIPTION_FACTOR);
- scan_dispatch_params.InitScanPolicy<typename TunedPolicies::ScanPolicy>();
- downsweep_dispatch_params.InitDownsweepPolicy<typename TunedPolicies::DownsweepPolicy>(TunedPolicies::SUBSCRIPTION_FACTOR);
- }
- else if (ptx_version >= 200)
- {
- typedef TunedPolicies<Key, Value, SizeT, 200> TunedPolicies;
- upsweep_dispatch_params.InitUpsweepPolicy<typename TunedPolicies::UpsweepPolicy>(TunedPolicies::SUBSCRIPTION_FACTOR);
- scan_dispatch_params.InitScanPolicy<typename TunedPolicies::ScanPolicy>();
- downsweep_dispatch_params.InitDownsweepPolicy<typename TunedPolicies::DownsweepPolicy>(TunedPolicies::SUBSCRIPTION_FACTOR);
- }
- else
- {
- typedef TunedPolicies<Key, Value, SizeT, 100> TunedPolicies;
- upsweep_dispatch_params.InitUpsweepPolicy<typename TunedPolicies::UpsweepPolicy>(TunedPolicies::SUBSCRIPTION_FACTOR);
- scan_dispatch_params.InitScanPolicy<typename TunedPolicies::ScanPolicy>();
- downsweep_dispatch_params.InitDownsweepPolicy<typename TunedPolicies::DownsweepPolicy>(TunedPolicies::SUBSCRIPTION_FACTOR);
- }
- }
- };
-
-
-
- /******************************************************************************
- * Utility methods
- ******************************************************************************/
-
- /**
- * Internal dispatch routine for computing a device-wide radix sort using upsweep, spine-scan, and downsweep kernel invocations for each digit place.
- */
- template <
- typename UpsweepKernelPtr, ///< Function type of cub::RadixSortUpsweepKernel
- typename SpineKernelPtr, ///< Function type of cub::RadixSortScanKernel
- typename DownsweepKernelPtr, ///< Function type of cub::RadixSortDownsweepKernel
- typename Key, ///< Key type
- typename Value, ///< Value type
- typename SizeT> ///< Integer type used for global array indexing
- __host__ __device__ __forceinline__
- static cudaError_t Dispatch(
- void *d_temp_storage, ///< [in] %Device allocation of temporary storage. When NULL, the required allocation size is returned in \p temp_storage_bytes and no work is done.
- size_t &temp_storage_bytes, ///< [in,out] Size in bytes of \p d_temp_storage allocation.
- UpsweepKernelPtr upsweep_kernel, ///< [in] Kernel function pointer to parameterization of cub::RadixSortUpsweepKernel
- SpineKernelPtr scan_kernel, ///< [in] Kernel function pointer to parameterization of cub::RadixSortScanKernel
- DownsweepKernelPtr downsweep_kernel, ///< [in] Kernel function pointer to parameterization of cub::RadixSortDownsweepKernel
- KernelDispachParams &upsweep_dispatch_params, ///< [in] Dispatch parameters that match the policy that \p upsweep_kernel was compiled for
- KernelDispachParams &scan_dispatch_params, ///< [in] Dispatch parameters that match the policy that \p scan_kernel was compiled for
- KernelDispachParams &downsweep_dispatch_params, ///< [in] Dispatch parameters that match the policy that \p downsweep_kernel was compiled for
- DoubleBuffer<Key> &d_keys, ///< [in,out] Double-buffer whose current buffer contains the unsorted input keys and, upon return, is updated to point to the sorted output keys
- DoubleBuffer<Value> &d_values, ///< [in,out] Double-buffer whose current buffer contains the unsorted input values and, upon return, is updated to point to the sorted output values
- SizeT num_items, ///< [in] Number of items to sort
- int begin_bit = 0, ///< [in] <b>[optional]</b> The beginning (least-significant) bit index needed for key comparison
- int end_bit = sizeof(Key) * 8, ///< [in] <b>[optional]</b> The past-the-end (most-significant) bit index needed for key comparison
- cudaStream_t stream = 0, ///< [in] <b>[optional]</b> CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
- bool stream_synchronous = false) ///< [in] <b>[optional]</b> Whether or not to synchronize the stream after every kernel launch to check for errors. Default is \p false.
- {
-#ifndef CUB_RUNTIME_ENABLED
-
- // Kernel launch not supported from this device
- return CubDebug(cudaErrorNotSupported);
-
-#else
-
- cudaError error = cudaSuccess;
- do
- {
- // Get device ordinal
- int device_ordinal;
- if (CubDebug(error = cudaGetDevice(&device_ordinal))) break;
-
- // Get SM count
- int sm_count;
- if (CubDebug(error = cudaDeviceGetAttribute (&sm_count, cudaDevAttrMultiProcessorCount, device_ordinal))) break;
-
- // Get a rough estimate of downsweep_kernel SM occupancy based upon the maximum SM occupancy of the targeted PTX architecture
- int downsweep_sm_occupancy = CUB_MIN(
- ArchProps<CUB_PTX_ARCH>::MAX_SM_THREADBLOCKS,
- ArchProps<CUB_PTX_ARCH>::MAX_SM_THREADS / downsweep_dispatch_params.block_threads);
- int upsweep_sm_occupancy = downsweep_sm_occupancy;
-
-#ifndef __CUDA_ARCH__
- // We're on the host, so come up with more accurate estimates of SM occupancy from actual device properties
- Device device_props;
- if (CubDebug(error = device_props.Init(device_ordinal))) break;
-
- if (CubDebug(error = device_props.MaxSmOccupancy(
- downsweep_sm_occupancy,
- downsweep_kernel,
- downsweep_dispatch_params.block_threads))) break;
-
- if (CubDebug(error = device_props.MaxSmOccupancy(
- upsweep_sm_occupancy,
- upsweep_kernel,
- upsweep_dispatch_params.block_threads))) break;
-#endif
- // Get device occupancies
- int downsweep_occupancy = downsweep_sm_occupancy * sm_count;
-
- // Get even-share work distribution descriptor
- GridEvenShare<SizeT> even_share;
- int max_downsweep_grid_size = downsweep_occupancy * downsweep_dispatch_params.subscription_factor;
- int downsweep_grid_size;
- even_share.GridInit(num_items, max_downsweep_grid_size, downsweep_dispatch_params.tile_size);
- downsweep_grid_size = even_share.grid_size;
-
- // Get number of spine elements (round up to nearest spine scan kernel tile size)
- int bins = 1 << downsweep_dispatch_params.radix_bits;
- int spine_size = downsweep_grid_size * bins;
- int spine_tiles = (spine_size + scan_dispatch_params.tile_size - 1) / scan_dispatch_params.tile_size;
- spine_size = spine_tiles * scan_dispatch_params.tile_size;
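-
- // Illustrative sizing (assumed numbers): with radix_bits = 5 (32 bins), a downsweep grid
- // of 100 blocks, and a 4096-element scan tile, spine_size = 3200 rounds up to 4096 so the
- // single-block scan kernel consumes whole tiles.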
-
- int alt_bins = 1 << downsweep_dispatch_params.alt_radix_bits;
- int alt_spine_size = downsweep_grid_size * alt_bins;
- int alt_spine_tiles = (alt_spine_size + scan_dispatch_params.tile_size - 1) / scan_dispatch_params.tile_size;
- alt_spine_size = alt_spine_tiles * scan_dispatch_params.tile_size;
-
- // Temporary storage allocation requirements
- void* allocations[1];
- size_t allocation_sizes[1] =
- {
- spine_size * sizeof(SizeT), // bytes needed for privatized block digit histograms
- };
-
- // Alias temporaries (or set the necessary size of the storage allocation)
- if (CubDebug(error = AliasTemporaries(d_temp_storage, temp_storage_bytes, allocations, allocation_sizes))) break;
-
- // Return if the caller is simply requesting the size of the storage allocation
- if (d_temp_storage == NULL)
- return cudaSuccess;
-
- // Privatized per-block digit histograms
- SizeT *d_spine = (SizeT*) allocations[0];
-
-#ifndef __CUDA_ARCH__
- // Get current smem bank configuration
- cudaSharedMemConfig original_smem_config;
- if (CubDebug(error = cudaDeviceGetSharedMemConfig(&original_smem_config))) break;
- cudaSharedMemConfig current_smem_config = original_smem_config;
-#endif
- // Iterate over digit places
- int current_bit = begin_bit;
- while (current_bit < end_bit)
- {
- // Use the primary bit granularity if the number of bits remaining is a whole multiple of the primary bit granularity
- int bits_remaining = end_bit - current_bit;
- bool use_primary_bit_granularity = (bits_remaining % downsweep_dispatch_params.radix_bits == 0);
- int radix_bits = (use_primary_bit_granularity) ?
- downsweep_dispatch_params.radix_bits :
- downsweep_dispatch_params.alt_radix_bits;
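-
- // Illustrative pass schedule (values depend on the tuned policy): with 32-bit keys,
- // radix_bits = 5 and alt_radix_bits = 4, the passes consume 4 + 4 + 4 + 5 + 5 + 5 + 5 = 32
- // bits; the alternate granularity is used until the remaining bit count becomes a whole
- // multiple of the primary digit width.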
-
-#ifndef __CUDA_ARCH__
- // Update smem config if necessary
- if (current_smem_config != upsweep_dispatch_params.smem_config)
- {
- if (CubDebug(error = cudaDeviceSetSharedMemConfig(upsweep_dispatch_params.smem_config))) break;
- current_smem_config = upsweep_dispatch_params.smem_config;
- }
-#endif
-
- // Log upsweep_kernel configuration
- if (stream_synchronous)
- CubLog("Invoking upsweep_kernel<<<%d, %d, 0, %lld>>>(), %d smem config, %d items per thread, %d SM occupancy, selector %d, current bit %d, bit_grain %d\n",
- downsweep_grid_size, upsweep_dispatch_params.block_threads, (long long) stream, upsweep_dispatch_params.smem_config, upsweep_dispatch_params.items_per_thread, upsweep_sm_occupancy, d_keys.selector, current_bit, radix_bits);
-
- // Invoke upsweep_kernel with same grid size as downsweep_kernel
- upsweep_kernel<<<downsweep_grid_size, upsweep_dispatch_params.block_threads, 0, stream>>>(
- d_keys.d_buffers[d_keys.selector],
- d_spine,
- num_items,
- current_bit,
- use_primary_bit_granularity,
- (current_bit == begin_bit),
- even_share);
-
- // Sync the stream if specified
- if (stream_synchronous && (CubDebug(error = SyncStream(stream)))) break;
-
- // Log scan_kernel configuration
- if (stream_synchronous) CubLog("Invoking scan_kernel<<<%d, %d, 0, %lld>>>(), %d items per thread\n",
- 1, scan_dispatch_params.block_threads, (long long) stream, scan_dispatch_params.items_per_thread);
-
- // Invoke scan_kernel
- scan_kernel<<<1, scan_dispatch_params.block_threads, 0, stream>>>(
- d_spine,
- (use_primary_bit_granularity) ? spine_size : alt_spine_size);
-
- // Sync the stream if specified
- if (stream_synchronous && (CubDebug(error = SyncStream(stream)))) break;
-
-#ifndef __CUDA_ARCH__
- // Update smem config if necessary
- if (current_smem_config != downsweep_dispatch_params.smem_config)
- {
- if (CubDebug(error = cudaDeviceSetSharedMemConfig(downsweep_dispatch_params.smem_config))) break;
- current_smem_config = downsweep_dispatch_params.smem_config;
- }
-#endif
-
- // Log downsweep_kernel configuration
- if (stream_synchronous) CubLog("Invoking downsweep_kernel<<<%d, %d, 0, %lld>>>(), %d smem config, %d items per thread, %d SM occupancy\n",
- downsweep_grid_size, downsweep_dispatch_params.block_threads, (long long) stream, downsweep_dispatch_params.smem_config, downsweep_dispatch_params.items_per_thread, downsweep_sm_occupancy);
-
- // Invoke downsweep_kernel
- downsweep_kernel<<<downsweep_grid_size, downsweep_dispatch_params.block_threads, 0, stream>>>(
- d_keys.d_buffers[d_keys.selector],
- d_keys.d_buffers[d_keys.selector ^ 1],
- d_values.d_buffers[d_values.selector],
- d_values.d_buffers[d_values.selector ^ 1],
- d_spine,
- num_items,
- current_bit,
- use_primary_bit_granularity,
- (current_bit == begin_bit),
- (current_bit + downsweep_dispatch_params.radix_bits >= end_bit),
- even_share);
-
- // Sync the stream if specified
- if (stream_synchronous && (CubDebug(error = SyncStream(stream)))) break;
-
- // Invert selectors
- d_keys.selector ^= 1;
- d_values.selector ^= 1;
-
- // Update current bit position
- current_bit += radix_bits;
- }
-
-#ifndef __CUDA_ARCH__
- // Reset smem config if necessary
- if (current_smem_config != original_smem_config)
- {
- if (CubDebug(error = cudaDeviceSetSharedMemConfig(original_smem_config))) break;
- }
-#endif
-
- }
- while (0);
-
- return error;
-
-#endif // CUB_RUNTIME_ENABLED
- }
-
-
-
- #endif // DOXYGEN_SHOULD_SKIP_THIS
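/*
 * A minimal sequential sketch (plain C++, for illustration only) of the pass
 * loop implemented by the dispatch routine above: each pass histograms the
 * keys by the current digit ("upsweep"), prefix-sums the histogram ("scan"),
 * scatters the keys into the alternate buffer ("downsweep"), and then flips
 * the buffer selector. The function name lsd_radix_sort_sketch is
 * hypothetical and not part of CUB.
 */
#include <cstddef>
#include <cstdint>
#include <vector>

inline void lsd_radix_sort_sketch(std::vector<uint32_t> &keys, int radix_bits = 4)
{
    const int bins = 1 << radix_bits;
    std::vector<uint32_t> alt(keys.size());
    std::vector<uint32_t> *bufs[2] = { &keys, &alt };
    int selector = 0;

    for (int current_bit = 0; current_bit < 32; current_bit += radix_bits)
    {
        std::vector<uint32_t> &in  = *bufs[selector];
        std::vector<uint32_t> &out = *bufs[selector ^ 1];

        // "Upsweep": per-digit histogram of the current digit place
        std::vector<size_t> hist(bins, 0);
        for (uint32_t k : in)
            ++hist[(k >> current_bit) & (bins - 1)];

        // "Scan": exclusive prefix sum turns counts into scatter offsets
        size_t running = 0;
        for (int d = 0; d < bins; ++d)
        {
            size_t count = hist[d];
            hist[d] = running;
            running += count;
        }

        // "Downsweep": stable scatter into the alternate buffer
        for (uint32_t k : in)
            out[hist[(k >> current_bit) & (bins - 1)]++] = k;

        // Invert the selector, as the device code does after each pass
        selector ^= 1;
    }

    // If an odd number of passes ran, copy the result back into the caller's buffer
    if (selector != 0)
        keys = alt;
}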
-
- /******************************************************************************
- * Interface
- ******************************************************************************/
-
-
- /**
- * \brief Sorts key-value pairs.
- *
- * \par
- * The sorting operation requires a pair of key buffers and a pair of value
- * buffers. Each pair is wrapped in a DoubleBuffer structure whose member
- * DoubleBuffer::Current() references the active buffer. The currently-active
- * buffer may be changed by the sorting operation.
- *
- * \devicestorage
- *
- * \cdp
- *
- * \par
- * The code snippet below illustrates the sorting of a device vector of \p int keys
- * with associated vector of \p int values.
- * \par
- * \code
- * #include <cub/cub.cuh>
- * ...
- *
- * // Create a set of DoubleBuffers to wrap pairs of device pointers for
- * // sorting data (keys, values, and equivalently-sized alternate buffers)
- * int num_items = ...
- * cub::DoubleBuffer<int> d_keys(d_key_buf, d_key_alt_buf);
- * cub::DoubleBuffer<int> d_values(d_value_buf, d_value_alt_buf);
- *
- * // Determine temporary device storage requirements for sorting operation
- * void *d_temp_storage = NULL;
- * size_t temp_storage_bytes = 0;
-     * cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes, d_keys, d_values, num_items);
- *
- * // Allocate temporary storage for sorting operation
- * cudaMalloc(&d_temp_storage, temp_storage_bytes);
- *
- * // Run sorting operation
-     * cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes, d_keys, d_values, num_items);
- *
- * // Sorted keys and values are referenced by d_keys.Current() and d_values.Current()
- *
- * \endcode
- *
- * \tparam Key <b>[inferred]</b> Key type
- * \tparam Value <b>[inferred]</b> Value type
- */
- template <
- typename Key,
- typename Value>
- __host__ __device__ __forceinline__
- static cudaError_t SortPairs(
- void *d_temp_storage, ///< [in] %Device allocation of temporary storage. When NULL, the required allocation size is returned in \p temp_storage_bytes and no work is done.
- size_t &temp_storage_bytes, ///< [in,out] Size in bytes of \p d_temp_storage allocation.
- DoubleBuffer<Key> &d_keys, ///< [in,out] Double-buffer of keys whose current buffer contains the unsorted input keys and, upon return, is updated to point to the sorted output keys
- DoubleBuffer<Value> &d_values, ///< [in,out] Double-buffer of values whose current buffer contains the unsorted input values and, upon return, is updated to point to the sorted output values
-        int                     num_items,                          ///< [in] Number of items to sort
- int begin_bit = 0, ///< [in] <b>[optional]</b> The first (least-significant) bit index needed for key comparison
- int end_bit = sizeof(Key) * 8, ///< [in] <b>[optional]</b> The past-the-end (most-significant) bit index needed for key comparison
- cudaStream_t stream = 0, ///< [in] <b>[optional]</b> CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
- bool stream_synchronous = false) ///< [in] <b>[optional]</b> Whether or not to synchronize the stream after every kernel launch to check for errors. Default is \p false.
- {
- // Type used for array indexing
- typedef int SizeT;
-
-        // Tuning policies
- typedef PtxDefaultPolicies<Key, Value, SizeT> PtxDefaultPolicies; // Wrapper of default kernel policies
- typedef typename PtxDefaultPolicies::UpsweepPolicy UpsweepPolicy; // Upsweep kernel policy
- typedef typename PtxDefaultPolicies::ScanPolicy ScanPolicy; // Scan kernel policy
- typedef typename PtxDefaultPolicies::DownsweepPolicy DownsweepPolicy; // Downsweep kernel policy
-
- cudaError error = cudaSuccess;
- do
- {
- // Declare dispatch parameters
- KernelDispachParams upsweep_dispatch_params;
- KernelDispachParams scan_dispatch_params;
- KernelDispachParams downsweep_dispatch_params;
-
-#ifdef __CUDA_ARCH__
- // We're on the device, so initialize the dispatch parameters with the PtxDefaultPolicies directly
- upsweep_dispatch_params.InitUpsweepPolicy<UpsweepPolicy>(PtxDefaultPolicies::SUBSCRIPTION_FACTOR);
- scan_dispatch_params.InitScanPolicy<ScanPolicy>();
- downsweep_dispatch_params.InitDownsweepPolicy<DownsweepPolicy>(PtxDefaultPolicies::SUBSCRIPTION_FACTOR);
-#else
- // We're on the host, so lookup and initialize the dispatch parameters with the policies that match the device's PTX version
- int ptx_version;
- if (CubDebug(error = PtxVersion(ptx_version))) break;
- PtxDefaultPolicies::InitDispatchParams(
- ptx_version,
- upsweep_dispatch_params,
- scan_dispatch_params,
- downsweep_dispatch_params);
-#endif
- // Dispatch
- if (CubDebug(error = Dispatch(
- d_temp_storage,
- temp_storage_bytes,
- RadixSortUpsweepKernel<UpsweepPolicy, Key, SizeT>,
- RadixSortScanKernel<ScanPolicy, SizeT>,
- RadixSortDownsweepKernel<DownsweepPolicy, Key, Value, SizeT>,
- upsweep_dispatch_params,
- scan_dispatch_params,
- downsweep_dispatch_params,
- d_keys,
- d_values,
- num_items,
- begin_bit,
- end_bit,
- stream,
- stream_synchronous))) break;
- }
- while (0);
-
- return error;
- }
-
-
- /**
-     * \brief Sorts keys.
- *
- * \par
- * The sorting operation requires a pair of key buffers. The pair is
- * wrapped in a DoubleBuffer structure whose member DoubleBuffer::Current()
- * references the active buffer. The currently-active buffer may be changed
- * by the sorting operation.
- *
- * \devicestorage
- *
- * \cdp
- *
- * \par
- * The code snippet below illustrates the sorting of a device vector of \p int keys.
- * \par
- * \code
- * #include <cub/cub.cuh>
- * ...
- *
- * // Create a set of DoubleBuffers to wrap pairs of device pointers for
- * // sorting data (keys and equivalently-sized alternate buffer)
- * int num_items = ...
- * cub::DoubleBuffer<int> d_keys(d_key_buf, d_key_alt_buf);
- *
- * // Determine temporary device storage requirements for sorting operation
- * void *d_temp_storage = NULL;
- * size_t temp_storage_bytes = 0;
- * cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes, d_keys, num_items);
- *
- * // Allocate temporary storage for sorting operation
- * cudaMalloc(&d_temp_storage, temp_storage_bytes);
- *
- * // Run sorting operation
- * cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes, d_keys, num_items);
- *
- * // Sorted keys are referenced by d_keys.Current()
- *
- * \endcode
- *
- * \tparam Key <b>[inferred]</b> Key type
- */
- template <typename Key>
- __host__ __device__ __forceinline__
- static cudaError_t SortKeys(
- void *d_temp_storage, ///< [in] %Device allocation of temporary storage. When NULL, the required allocation size is returned in \p temp_storage_bytes and no work is done.
- size_t &temp_storage_bytes, ///< [in,out] Size in bytes of \p d_temp_storage allocation.
- DoubleBuffer<Key> &d_keys, ///< [in,out] Double-buffer of keys whose current buffer contains the unsorted input keys and, upon return, is updated to point to the sorted output keys
-        int                     num_items,                          ///< [in] Number of items to sort
- int begin_bit = 0, ///< [in] <b>[optional]</b> The first (least-significant) bit index needed for key comparison
- int end_bit = sizeof(Key) * 8, ///< [in] <b>[optional]</b> The past-the-end (most-significant) bit index needed for key comparison
- cudaStream_t stream = 0, ///< [in] <b>[optional]</b> CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
- bool stream_synchronous = false) ///< [in] <b>[optional]</b> Whether or not to synchronize the stream after every kernel launch to check for errors. Default is \p false.
- {
- DoubleBuffer<NullType> d_values;
- return SortPairs(d_temp_storage, temp_storage_bytes, d_keys, d_values, num_items, begin_bit, end_bit, stream, stream_synchronous);
- }
-
-};
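/*
 * A minimal sketch (for illustration only) of the double-buffer idea behind
 * DoubleBuffer::Current() as used in the examples above: the sorting passes
 * ping-pong between two buffers, so callers must read the result from
 * Current() afterwards rather than from the buffer they originally passed in.
 * SimpleDoubleBuffer is a hypothetical, simplified stand-in, not
 * cub::DoubleBuffer itself.
 */
template <typename T>
struct SimpleDoubleBuffer
{
    T   *d_buffers[2];  // the two ping-pong buffers
    int  selector;      // index of the buffer currently holding valid data

    SimpleDoubleBuffer(T *current, T *alternate) : selector(0)
    {
        d_buffers[0] = current;
        d_buffers[1] = alternate;
    }

    T *Current()   { return d_buffers[selector]; }      // valid data
    T *Alternate() { return d_buffers[selector ^ 1]; }  // scratch for the next pass
};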
-
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
-
diff --git a/lib/kokkos/TPL/cub/device/device_reduce.cuh b/lib/kokkos/TPL/cub/device/device_reduce.cuh
deleted file mode 100755
index 069af8c1f..000000000
--- a/lib/kokkos/TPL/cub/device/device_reduce.cuh
+++ /dev/null
@@ -1,775 +0,0 @@
-
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * cub::DeviceReduce provides operations for computing a device-wide, parallel reduction across data items residing within global memory.
- */
-
-#pragma once
-
-#include <stdio.h>
-#include <iterator>
-
-#include "block/block_reduce_tiles.cuh"
-#include "../thread/thread_operators.cuh"
-#include "../grid/grid_even_share.cuh"
-#include "../grid/grid_queue.cuh"
-#include "../util_debug.cuh"
-#include "../util_device.cuh"
-#include "../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
-
-
-
-
-
-/******************************************************************************
- * Kernel entry points
- *****************************************************************************/
-
-/**
- * Reduction pass kernel entry point (multi-block). Computes privatized reductions, one per thread block.
- */
-template <
- typename BlockReduceTilesPolicy, ///< Tuning policy for cub::BlockReduceTiles abstraction
- typename InputIteratorRA, ///< Random-access iterator type for input (may be a simple pointer type)
- typename OutputIteratorRA, ///< Random-access iterator type for output (may be a simple pointer type)
- typename SizeT, ///< Integer type used for global array indexing
- typename ReductionOp> ///< Binary reduction operator type having member <tt>T operator()(const T &a, const T &b)</tt>
-__launch_bounds__ (int(BlockReduceTilesPolicy::BLOCK_THREADS), 1)
-__global__ void ReducePrivatizedKernel(
- InputIteratorRA d_in, ///< [in] Input data to reduce
- OutputIteratorRA d_out, ///< [out] Output location for result
- SizeT num_items, ///< [in] Total number of input data items
- GridEvenShare<SizeT> even_share, ///< [in] Descriptor for how to map an even-share of tiles across thread blocks
- GridQueue<SizeT> queue, ///< [in] Descriptor for performing dynamic mapping of tile data to thread blocks
- ReductionOp reduction_op) ///< [in] Binary reduction operator
-{
- // Data type
- typedef typename std::iterator_traits<InputIteratorRA>::value_type T;
-
- // Thread block type for reducing input tiles
- typedef BlockReduceTiles<BlockReduceTilesPolicy, InputIteratorRA, SizeT, ReductionOp> BlockReduceTilesT;
-
- // Block-wide aggregate
- T block_aggregate;
-
- // Shared memory storage
- __shared__ typename BlockReduceTilesT::TempStorage temp_storage;
-
- // Consume input tiles
- BlockReduceTilesT(temp_storage, d_in, reduction_op).ConsumeTiles(
- num_items,
- even_share,
- queue,
- block_aggregate,
- Int2Type<BlockReduceTilesPolicy::GRID_MAPPING>());
-
- // Output result
- if (threadIdx.x == 0)
- {
- d_out[blockIdx.x] = block_aggregate;
- }
-}
-
-
-/**
- * Reduction pass kernel entry point (single-block). Aggregates privatized threadblock reductions from a previous multi-block reduction pass.
- */
-template <
- typename BlockReduceTilesPolicy, ///< Tuning policy for cub::BlockReduceTiles abstraction
- typename InputIteratorRA, ///< Random-access iterator type for input (may be a simple pointer type)
- typename OutputIteratorRA, ///< Random-access iterator type for output (may be a simple pointer type)
- typename SizeT, ///< Integer type used for global array indexing
- typename ReductionOp> ///< Binary reduction operator type having member <tt>T operator()(const T &a, const T &b)</tt>
-__launch_bounds__ (int(BlockReduceTilesPolicy::BLOCK_THREADS), 1)
-__global__ void ReduceSingleKernel(
- InputIteratorRA d_in, ///< [in] Input data to reduce
- OutputIteratorRA d_out, ///< [out] Output location for result
- SizeT num_items, ///< [in] Total number of input data items
- ReductionOp reduction_op) ///< [in] Binary reduction operator
-{
- // Data type
- typedef typename std::iterator_traits<InputIteratorRA>::value_type T;
-
- // Thread block type for reducing input tiles
- typedef BlockReduceTiles<BlockReduceTilesPolicy, InputIteratorRA, SizeT, ReductionOp> BlockReduceTilesT;
-
- // Block-wide aggregate
- T block_aggregate;
-
- // Shared memory storage
- __shared__ typename BlockReduceTilesT::TempStorage temp_storage;
-
- // Consume input tiles
- BlockReduceTilesT(temp_storage, d_in, reduction_op).ConsumeTiles(
- SizeT(0),
- SizeT(num_items),
- block_aggregate);
-
- // Output result
- if (threadIdx.x == 0)
- {
- d_out[blockIdx.x] = block_aggregate;
- }
-}
-
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
-
-/******************************************************************************
- * DeviceReduce
- *****************************************************************************/
-
-/**
- * \brief DeviceReduce provides operations for computing a device-wide, parallel reduction across data items residing within global memory. ![](reduce_logo.png)
- * \ingroup DeviceModule
- *
- * \par Overview
- * A <a href="http://en.wikipedia.org/wiki/Reduce_(higher-order_function)"><em>reduction</em></a> (or <em>fold</em>)
- * uses a binary combining operator to compute a single aggregate from a list of input elements.
- *
- * \par Usage Considerations
- * \cdp_class{DeviceReduce}
- *
- * \par Performance
- *
- * \image html reduction_perf.png
- *
- */
-struct DeviceReduce
-{
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
-
- /******************************************************************************
- * Constants and typedefs
- ******************************************************************************/
-
- /// Generic structure for encapsulating dispatch properties codified in block policy.
- struct KernelDispachParams
- {
- int block_threads;
- int items_per_thread;
- int vector_load_length;
- BlockReduceAlgorithm block_algorithm;
- PtxLoadModifier load_modifier;
- GridMappingStrategy grid_mapping;
- int subscription_factor;
- int tile_size;
-
- template <typename BlockPolicy>
- __host__ __device__ __forceinline__
- void Init(int subscription_factor = 1)
- {
- block_threads = BlockPolicy::BLOCK_THREADS;
- items_per_thread = BlockPolicy::ITEMS_PER_THREAD;
- vector_load_length = BlockPolicy::VECTOR_LOAD_LENGTH;
- block_algorithm = BlockPolicy::BLOCK_ALGORITHM;
- load_modifier = BlockPolicy::LOAD_MODIFIER;
- grid_mapping = BlockPolicy::GRID_MAPPING;
- this->subscription_factor = subscription_factor;
- tile_size = block_threads * items_per_thread;
- }
-
- __host__ __device__ __forceinline__
- void Print()
- {
- printf("%d threads, %d per thread, %d veclen, %d algo, %d loadmod, %d mapping, %d subscription",
- block_threads,
- items_per_thread,
- vector_load_length,
- block_algorithm,
- load_modifier,
- grid_mapping,
- subscription_factor);
- }
-
- };
-
-
- /******************************************************************************
- * Tuning policies
- ******************************************************************************/
-
- /// Specializations of tuned policy types for different PTX architectures
- template <
- typename T,
- typename SizeT,
- int ARCH>
- struct TunedPolicies;
-
- /// SM35 tune
- template <typename T, typename SizeT>
- struct TunedPolicies<T, SizeT, 350>
- {
- // PrivatizedPolicy (1B): GTX Titan: 206.0 GB/s @ 192M 1B items
- typedef BlockReduceTilesPolicy<128, 12, 1, BLOCK_REDUCE_RAKING, LOAD_LDG, GRID_MAPPING_DYNAMIC> PrivatizedPolicy1B;
-
- // PrivatizedPolicy (4B): GTX Titan: 254.2 GB/s @ 48M 4B items
- typedef BlockReduceTilesPolicy<512, 20, 1, BLOCK_REDUCE_RAKING, LOAD_DEFAULT, GRID_MAPPING_EVEN_SHARE> PrivatizedPolicy4B;
-
- // PrivatizedPolicy
- typedef typename If<(sizeof(T) < 4),
- PrivatizedPolicy1B,
- PrivatizedPolicy4B>::Type PrivatizedPolicy;
-
- // SinglePolicy
- typedef BlockReduceTilesPolicy<256, 8, 1, BLOCK_REDUCE_WARP_REDUCTIONS, LOAD_DEFAULT, GRID_MAPPING_EVEN_SHARE> SinglePolicy;
-
- enum { SUBSCRIPTION_FACTOR = 7 };
-
- };
-
- /// SM30 tune
- template <typename T, typename SizeT>
- struct TunedPolicies<T, SizeT, 300>
- {
- // PrivatizedPolicy: GTX670: 154.0 @ 48M 32-bit T
- typedef BlockReduceTilesPolicy<256, 2, 1, BLOCK_REDUCE_WARP_REDUCTIONS, LOAD_DEFAULT, GRID_MAPPING_EVEN_SHARE> PrivatizedPolicy;
-
- // SinglePolicy
- typedef BlockReduceTilesPolicy<256, 24, 4, BLOCK_REDUCE_WARP_REDUCTIONS, LOAD_DEFAULT, GRID_MAPPING_EVEN_SHARE> SinglePolicy;
-
- enum { SUBSCRIPTION_FACTOR = 1 };
- };
-
- /// SM20 tune
- template <typename T, typename SizeT>
- struct TunedPolicies<T, SizeT, 200>
- {
- // PrivatizedPolicy (1B): GTX 580: 158.1 GB/s @ 192M 1B items
- typedef BlockReduceTilesPolicy<192, 24, 4, BLOCK_REDUCE_RAKING, LOAD_DEFAULT, GRID_MAPPING_EVEN_SHARE> PrivatizedPolicy1B;
-
- // PrivatizedPolicy (4B): GTX 580: 178.9 GB/s @ 48M 4B items
- typedef BlockReduceTilesPolicy<128, 8, 4, BLOCK_REDUCE_RAKING, LOAD_DEFAULT, GRID_MAPPING_DYNAMIC> PrivatizedPolicy4B;
-
- // PrivatizedPolicy
- typedef typename If<(sizeof(T) < 4),
- PrivatizedPolicy1B,
- PrivatizedPolicy4B>::Type PrivatizedPolicy;
-
- // SinglePolicy
- typedef BlockReduceTilesPolicy<192, 7, 1, BLOCK_REDUCE_RAKING, LOAD_DEFAULT, GRID_MAPPING_EVEN_SHARE> SinglePolicy;
-
- enum { SUBSCRIPTION_FACTOR = 2 };
- };
-
- /// SM13 tune
- template <typename T, typename SizeT>
- struct TunedPolicies<T, SizeT, 130>
- {
- // PrivatizedPolicy
- typedef BlockReduceTilesPolicy<128, 8, 2, BLOCK_REDUCE_RAKING, LOAD_DEFAULT, GRID_MAPPING_EVEN_SHARE> PrivatizedPolicy;
-
- // SinglePolicy
- typedef BlockReduceTilesPolicy<32, 4, 4, BLOCK_REDUCE_RAKING, LOAD_DEFAULT, GRID_MAPPING_EVEN_SHARE> SinglePolicy;
-
- enum { SUBSCRIPTION_FACTOR = 1 };
- };
-
- /// SM10 tune
- template <typename T, typename SizeT>
- struct TunedPolicies<T, SizeT, 100>
- {
- // PrivatizedPolicy
- typedef BlockReduceTilesPolicy<128, 8, 2, BLOCK_REDUCE_RAKING, LOAD_DEFAULT, GRID_MAPPING_EVEN_SHARE> PrivatizedPolicy;
-
- // SinglePolicy
- typedef BlockReduceTilesPolicy<32, 4, 4, BLOCK_REDUCE_RAKING, LOAD_DEFAULT, GRID_MAPPING_EVEN_SHARE> SinglePolicy;
-
- enum { SUBSCRIPTION_FACTOR = 1 };
- };
-
-
-
- /******************************************************************************
- * Default policy initializer
- ******************************************************************************/
-
- /// Tuning policy for the PTX architecture that DeviceReduce operations will get dispatched to
- template <typename T, typename SizeT>
- struct PtxDefaultPolicies
- {
- static const int PTX_TUNE_ARCH = (CUB_PTX_ARCH >= 350) ?
- 350 :
- (CUB_PTX_ARCH >= 300) ?
- 300 :
- (CUB_PTX_ARCH >= 200) ?
- 200 :
- (CUB_PTX_ARCH >= 130) ?
- 130 :
- 100;
-
- // Tuned policy set for the current PTX compiler pass
- typedef TunedPolicies<T, SizeT, PTX_TUNE_ARCH> PtxTunedPolicies;
-
- // Subscription factor for the current PTX compiler pass
- static const int SUBSCRIPTION_FACTOR = PtxTunedPolicies::SUBSCRIPTION_FACTOR;
-
- // PrivatizedPolicy that opaquely derives from the specialization corresponding to the current PTX compiler pass
- struct PrivatizedPolicy : PtxTunedPolicies::PrivatizedPolicy {};
-
- // SinglePolicy that opaquely derives from the specialization corresponding to the current PTX compiler pass
- struct SinglePolicy : PtxTunedPolicies::SinglePolicy {};
-
-
- /**
- * Initialize dispatch params with the policies corresponding to the PTX assembly we will use
- */
- static void InitDispatchParams(
- int ptx_version,
- KernelDispachParams &privatized_dispatch_params,
- KernelDispachParams &single_dispatch_params)
- {
- if (ptx_version >= 350)
- {
- typedef TunedPolicies<T, SizeT, 350> TunedPolicies;
- privatized_dispatch_params.Init<typename TunedPolicies::PrivatizedPolicy>(TunedPolicies::SUBSCRIPTION_FACTOR);
- single_dispatch_params.Init<typename TunedPolicies::SinglePolicy >();
- }
- else if (ptx_version >= 300)
- {
- typedef TunedPolicies<T, SizeT, 300> TunedPolicies;
- privatized_dispatch_params.Init<typename TunedPolicies::PrivatizedPolicy>(TunedPolicies::SUBSCRIPTION_FACTOR);
- single_dispatch_params.Init<typename TunedPolicies::SinglePolicy >();
- }
- else if (ptx_version >= 200)
- {
- typedef TunedPolicies<T, SizeT, 200> TunedPolicies;
- privatized_dispatch_params.Init<typename TunedPolicies::PrivatizedPolicy>(TunedPolicies::SUBSCRIPTION_FACTOR);
- single_dispatch_params.Init<typename TunedPolicies::SinglePolicy >();
- }
- else if (ptx_version >= 130)
- {
- typedef TunedPolicies<T, SizeT, 130> TunedPolicies;
- privatized_dispatch_params.Init<typename TunedPolicies::PrivatizedPolicy>(TunedPolicies::SUBSCRIPTION_FACTOR);
- single_dispatch_params.Init<typename TunedPolicies::SinglePolicy >();
- }
- else
- {
- typedef TunedPolicies<T, SizeT, 100> TunedPolicies;
- privatized_dispatch_params.Init<typename TunedPolicies::PrivatizedPolicy>(TunedPolicies::SUBSCRIPTION_FACTOR);
- single_dispatch_params.Init<typename TunedPolicies::SinglePolicy >();
- }
- }
- };
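/*
 * A minimal sketch (for illustration only) of the selection idea behind
 * PtxDefaultPolicies::InitDispatchParams above: pick tuned launch parameters
 * from the runtime PTX version, falling back to older tunings for older
 * architectures. The values correspond to the 4-byte privatized-pass tunings
 * listed above; TunedParamsSketch and select_tuning_sketch are hypothetical
 * names, not part of CUB.
 */
struct TunedParamsSketch
{
    int block_threads;
    int items_per_thread;
};

inline TunedParamsSketch select_tuning_sketch(int ptx_version)
{
    if (ptx_version >= 350) return { 512, 20 };  // SM35 4-byte tuning
    if (ptx_version >= 300) return { 256,  2 };  // SM30 tuning
    if (ptx_version >= 200) return { 128,  8 };  // SM20 4-byte tuning
    return { 128, 8 };                           // SM13/SM10 tuning
}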
-
-
-
- /******************************************************************************
- * Utility methods
- ******************************************************************************/
-
- /**
-     * Internal dispatch routine for computing a device-wide reduction using two stages of kernel invocations (a sequential sketch of this scheme follows the routine).
- */
- template <
- typename ReducePrivatizedKernelPtr, ///< Function type of cub::ReducePrivatizedKernel
- typename ReduceSingleKernelPtr, ///< Function type of cub::ReduceSingleKernel
- typename ResetDrainKernelPtr, ///< Function type of cub::ResetDrainKernel
- typename InputIteratorRA, ///< Random-access iterator type for input (may be a simple pointer type)
- typename OutputIteratorRA, ///< Random-access iterator type for output (may be a simple pointer type)
- typename SizeT, ///< Integer type used for global array indexing
- typename ReductionOp> ///< Binary reduction operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- __host__ __device__ __forceinline__
- static cudaError_t Dispatch(
- void *d_temp_storage, ///< [in] %Device allocation of temporary storage. When NULL, the required allocation size is returned in \p temp_storage_bytes and no work is done.
- size_t &temp_storage_bytes, ///< [in,out] Size in bytes of \p d_temp_storage allocation.
- ReducePrivatizedKernelPtr privatized_kernel, ///< [in] Kernel function pointer to parameterization of cub::ReducePrivatizedKernel
- ReduceSingleKernelPtr single_kernel, ///< [in] Kernel function pointer to parameterization of cub::ReduceSingleKernel
- ResetDrainKernelPtr prepare_drain_kernel, ///< [in] Kernel function pointer to parameterization of cub::ResetDrainKernel
- KernelDispachParams &privatized_dispatch_params, ///< [in] Dispatch parameters that match the policy that \p privatized_kernel_ptr was compiled for
- KernelDispachParams &single_dispatch_params, ///< [in] Dispatch parameters that match the policy that \p single_kernel was compiled for
- InputIteratorRA d_in, ///< [in] Input data to reduce
- OutputIteratorRA d_out, ///< [out] Output location for result
- SizeT num_items, ///< [in] Number of items to reduce
- ReductionOp reduction_op, ///< [in] Binary reduction operator
- cudaStream_t stream = 0, ///< [in] <b>[optional]</b> CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
- bool stream_synchronous = false) ///< [in] <b>[optional]</b> Whether or not to synchronize the stream after every kernel launch to check for errors. Default is \p false.
- {
-#ifndef CUB_RUNTIME_ENABLED
-
- // Kernel launch not supported from this device
- return CubDebug(cudaErrorNotSupported );
-
-#else
-
- // Data type of input iterator
- typedef typename std::iterator_traits<InputIteratorRA>::value_type T;
-
- cudaError error = cudaSuccess;
- do
- {
- if ((privatized_kernel == NULL) || (num_items <= (single_dispatch_params.tile_size)))
- {
- // Dispatch a single-block reduction kernel
-
- // Return if the caller is simply requesting the size of the storage allocation
- if (d_temp_storage == NULL)
- {
- temp_storage_bytes = 1;
- return cudaSuccess;
- }
-
- // Log single_kernel configuration
- if (stream_synchronous) CubLog("Invoking ReduceSingle<<<1, %d, 0, %lld>>>(), %d items per thread\n",
- single_dispatch_params.block_threads, (long long) stream, single_dispatch_params.items_per_thread);
-
- // Invoke single_kernel
-                single_kernel<<<1, single_dispatch_params.block_threads, 0, stream>>>(
- d_in,
- d_out,
- num_items,
- reduction_op);
-
- // Sync the stream if specified
- if (stream_synchronous && (CubDebug(error = SyncStream(stream)))) break;
-
- }
- else
- {
-                // Dispatch two kernels: a multi-block kernel to compute
-                // privatized per-block reductions, and then a single-block
-                // kernel to reduce those partial results
-
- // Get device ordinal
- int device_ordinal;
- if (CubDebug(error = cudaGetDevice(&device_ordinal))) break;
-
- // Get SM count
- int sm_count;
- if (CubDebug(error = cudaDeviceGetAttribute (&sm_count, cudaDevAttrMultiProcessorCount, device_ordinal))) break;
-
- // Get a rough estimate of privatized_kernel SM occupancy based upon the maximum SM occupancy of the targeted PTX architecture
- int privatized_sm_occupancy = CUB_MIN(
- ArchProps<CUB_PTX_ARCH>::MAX_SM_THREADBLOCKS,
- ArchProps<CUB_PTX_ARCH>::MAX_SM_THREADS / privatized_dispatch_params.block_threads);
-
-#ifndef __CUDA_ARCH__
- // We're on the host, so come up with a more accurate estimate of privatized_kernel SM occupancy from actual device properties
- Device device_props;
- if (CubDebug(error = device_props.Init(device_ordinal))) break;
-
- if (CubDebug(error = device_props.MaxSmOccupancy(
- privatized_sm_occupancy,
- privatized_kernel,
- privatized_dispatch_params.block_threads))) break;
-#endif
-
- // Get device occupancy for privatized_kernel
- int privatized_occupancy = privatized_sm_occupancy * sm_count;
-
- // Even-share work distribution
- GridEvenShare<SizeT> even_share;
-
- // Get grid size for privatized_kernel
- int privatized_grid_size;
- switch (privatized_dispatch_params.grid_mapping)
- {
- case GRID_MAPPING_EVEN_SHARE:
-
- // Work is distributed evenly
- even_share.GridInit(
- num_items,
- privatized_occupancy * privatized_dispatch_params.subscription_factor,
- privatized_dispatch_params.tile_size);
- privatized_grid_size = even_share.grid_size;
- break;
-
- case GRID_MAPPING_DYNAMIC:
-
- // Work is distributed dynamically
- int num_tiles = (num_items + privatized_dispatch_params.tile_size - 1) / privatized_dispatch_params.tile_size;
- privatized_grid_size = (num_tiles < privatized_occupancy) ?
- num_tiles : // Not enough to fill the device with threadblocks
- privatized_occupancy; // Fill the device with threadblocks
- break;
- };
-
- // Temporary storage allocation requirements
- void* allocations[2];
- size_t allocation_sizes[2] =
- {
- privatized_grid_size * sizeof(T), // bytes needed for privatized block reductions
- GridQueue<int>::AllocationSize() // bytes needed for grid queue descriptor
- };
-
- // Alias temporaries (or set the necessary size of the storage allocation)
- if (CubDebug(error = AliasTemporaries(d_temp_storage, temp_storage_bytes, allocations, allocation_sizes))) break;
-
- // Return if the caller is simply requesting the size of the storage allocation
- if (d_temp_storage == NULL)
- return cudaSuccess;
-
- // Privatized per-block reductions
- T *d_block_reductions = (T*) allocations[0];
-
- // Grid queue descriptor
- GridQueue<SizeT> queue(allocations[1]);
-
- // Prepare the dynamic queue descriptor if necessary
- if (privatized_dispatch_params.grid_mapping == GRID_MAPPING_DYNAMIC)
- {
- // Prepare queue using a kernel so we know it gets prepared once per operation
- if (stream_synchronous) CubLog("Invoking prepare_drain_kernel<<<1, 1, 0, %lld>>>()\n", (long long) stream);
-
- // Invoke prepare_drain_kernel
- prepare_drain_kernel<<<1, 1, 0, stream>>>(queue, num_items);
-
- // Sync the stream if specified
- if (stream_synchronous && (CubDebug(error = SyncStream(stream)))) break;
- }
-
- // Log privatized_kernel configuration
- if (stream_synchronous) CubLog("Invoking privatized_kernel<<<%d, %d, 0, %lld>>>(), %d items per thread, %d SM occupancy\n",
- privatized_grid_size, privatized_dispatch_params.block_threads, (long long) stream, privatized_dispatch_params.items_per_thread, privatized_sm_occupancy);
-
- // Invoke privatized_kernel
- privatized_kernel<<<privatized_grid_size, privatized_dispatch_params.block_threads, 0, stream>>>(
- d_in,
- d_block_reductions,
- num_items,
- even_share,
- queue,
- reduction_op);
-
- // Sync the stream if specified
- if (stream_synchronous && (CubDebug(error = SyncStream(stream)))) break;
-
- // Log single_kernel configuration
- if (stream_synchronous) CubLog("Invoking single_kernel<<<%d, %d, 0, %lld>>>(), %d items per thread\n",
- 1, single_dispatch_params.block_threads, (long long) stream, single_dispatch_params.items_per_thread);
-
- // Invoke single_kernel
- single_kernel<<<1, single_dispatch_params.block_threads, 0, stream>>>(
- d_block_reductions,
- d_out,
- privatized_grid_size,
- reduction_op);
-
- // Sync the stream if specified
- if (stream_synchronous && (CubDebug(error = SyncStream(stream)))) break;
- }
- }
- while (0);
-
- return error;
-
-#endif // CUB_RUNTIME_ENABLED
- }
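/*
 * A minimal sequential sketch (plain C++, for illustration only) of the
 * two-stage scheme dispatched above: a "privatized" pass reduces disjoint
 * tiles of the input into one partial result per block, and a single-block
 * pass then reduces those partials into the final aggregate. The function
 * name two_stage_reduce_sketch is hypothetical and not part of CUB; a
 * non-empty input is assumed.
 */
#include <algorithm>
#include <cstddef>
#include <vector>

template <typename T, typename ReductionOp>
T two_stage_reduce_sketch(const std::vector<T> &in, ReductionOp reduction_op, size_t tile_size)
{
    // Stage 1: one partial reduction per tile (one per "thread block")
    std::vector<T> partials;
    for (size_t tile_begin = 0; tile_begin < in.size(); tile_begin += tile_size)
    {
        size_t tile_end = std::min(in.size(), tile_begin + tile_size);
        T partial = in[tile_begin];
        for (size_t i = tile_begin + 1; i < tile_end; ++i)
            partial = reduction_op(partial, in[i]);
        partials.push_back(partial);
    }

    // Stage 2: a single pass reduces the per-tile partials to the final result
    T aggregate = partials[0];
    for (size_t i = 1; i < partials.size(); ++i)
        aggregate = reduction_op(aggregate, partials[i]);
    return aggregate;
}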
-
-
-
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
- /******************************************************************************
- * Interface
- ******************************************************************************/
-
- /**
- * \brief Computes a device-wide reduction using the specified binary \p reduction_op functor.
- *
- * \par
- * Does not support non-commutative reduction operators.
- *
- * \devicestorage
- *
- * \cdp
- *
- * \iterator
- *
- * \par
- * The code snippet below illustrates the max reduction of a device vector of \p int items.
- * \par
- * \code
- * #include <cub/cub.cuh>
- * ...
- *
- * // Declare and initialize device pointers for input and output
- * int *d_reduce_input, *d_aggregate;
- * int num_items = ...
- * ...
- *
- * // Determine temporary device storage requirements for reduction
- * void *d_temp_storage = NULL;
- * size_t temp_storage_bytes = 0;
- * cub::DeviceReduce::Reduce(d_temp_storage, temp_storage_bytes, d_reduce_input, d_aggregate, num_items, cub::Max());
- *
- * // Allocate temporary storage for reduction
- * cudaMalloc(&d_temp_storage, temp_storage_bytes);
- *
- * // Run reduction (max)
- * cub::DeviceReduce::Reduce(d_temp_storage, temp_storage_bytes, d_reduce_input, d_aggregate, num_items, cub::Max());
- *
- * \endcode
- *
- * \tparam InputIteratorRA <b>[inferred]</b> Random-access iterator type for input (may be a simple pointer type)
- * \tparam OutputIteratorRA <b>[inferred]</b> Random-access iterator type for output (may be a simple pointer type)
- * \tparam ReductionOp <b>[inferred]</b> Binary reduction operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
- template <
- typename InputIteratorRA,
- typename OutputIteratorRA,
- typename ReductionOp>
- __host__ __device__ __forceinline__
- static cudaError_t Reduce(
- void *d_temp_storage, ///< [in] %Device allocation of temporary storage. When NULL, the required allocation size is returned in \p temp_storage_bytes and no work is done.
- size_t &temp_storage_bytes, ///< [in,out] Size in bytes of \p d_temp_storage allocation.
- InputIteratorRA d_in, ///< [in] Input data to reduce
- OutputIteratorRA d_out, ///< [out] Output location for result
- int num_items, ///< [in] Number of items to reduce
- ReductionOp reduction_op, ///< [in] Binary reduction operator
- cudaStream_t stream = 0, ///< [in] <b>[optional]</b> CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
- bool stream_synchronous = false) ///< [in] <b>[optional]</b> Whether or not to synchronize the stream after every kernel launch to check for errors. Default is \p false.
- {
- // Type used for array indexing
- typedef int SizeT;
-
- // Data type of input iterator
- typedef typename std::iterator_traits<InputIteratorRA>::value_type T;
-
-        // Tuning policies
- typedef PtxDefaultPolicies<T, SizeT> PtxDefaultPolicies; // Wrapper of default kernel policies
- typedef typename PtxDefaultPolicies::PrivatizedPolicy PrivatizedPolicy; // Multi-block kernel policy
- typedef typename PtxDefaultPolicies::SinglePolicy SinglePolicy; // Single-block kernel policy
-
- cudaError error = cudaSuccess;
- do
- {
- // Declare dispatch parameters
- KernelDispachParams privatized_dispatch_params;
- KernelDispachParams single_dispatch_params;
-
-#ifdef __CUDA_ARCH__
- // We're on the device, so initialize the dispatch parameters with the PtxDefaultPolicies directly
- privatized_dispatch_params.Init<PrivatizedPolicy>(PtxDefaultPolicies::SUBSCRIPTION_FACTOR);
- single_dispatch_params.Init<SinglePolicy>();
-#else
- // We're on the host, so lookup and initialize the dispatch parameters with the policies that match the device's PTX version
- int ptx_version;
- if (CubDebug(error = PtxVersion(ptx_version))) break;
- PtxDefaultPolicies::InitDispatchParams(ptx_version, privatized_dispatch_params, single_dispatch_params);
-#endif
-
- // Dispatch
- if (CubDebug(error = Dispatch(
- d_temp_storage,
- temp_storage_bytes,
- ReducePrivatizedKernel<PrivatizedPolicy, InputIteratorRA, T*, SizeT, ReductionOp>,
- ReduceSingleKernel<SinglePolicy, T*, OutputIteratorRA, SizeT, ReductionOp>,
- ResetDrainKernel<SizeT>,
- privatized_dispatch_params,
- single_dispatch_params,
- d_in,
- d_out,
- num_items,
- reduction_op,
- stream,
- stream_synchronous))) break;
- }
- while (0);
-
- return error;
- }
-
-
- /**
- * \brief Computes a device-wide sum using the addition ('+') operator.
- *
- * \par
- * Does not support non-commutative reduction operators.
- *
- * \devicestorage
- *
- * \cdp
- *
- * \iterator
- *
- * \par
- * The code snippet below illustrates the sum reduction of a device vector of \p int items.
- * \par
- * \code
- * #include <cub/cub.cuh>
- * ...
- *
- * // Declare and initialize device pointers for input and output
- * int *d_reduce_input, *d_aggregate;
- * int num_items = ...
- * ...
- *
- * // Determine temporary device storage requirements for summation
- * void *d_temp_storage = NULL;
- * size_t temp_storage_bytes = 0;
- * cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_reduce_input, d_aggregate, num_items);
- *
- * // Allocate temporary storage for summation
- * cudaMalloc(&d_temp_storage, temp_storage_bytes);
- *
- * // Run reduction summation
- * cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_reduce_input, d_aggregate, num_items);
- *
- * \endcode
- *
- * \tparam InputIteratorRA <b>[inferred]</b> Random-access iterator type for input (may be a simple pointer type)
- * \tparam OutputIteratorRA <b>[inferred]</b> Random-access iterator type for output (may be a simple pointer type)
- */
- template <
- typename InputIteratorRA,
- typename OutputIteratorRA>
- __host__ __device__ __forceinline__
- static cudaError_t Sum(
- void *d_temp_storage, ///< [in] %Device allocation of temporary storage. When NULL, the required allocation size is returned in \p temp_storage_bytes and no work is done.
- size_t &temp_storage_bytes, ///< [in,out] Size in bytes of \p d_temp_storage allocation.
- InputIteratorRA d_in, ///< [in] Input data to reduce
- OutputIteratorRA d_out, ///< [out] Output location for result
- int num_items, ///< [in] Number of items to reduce
- cudaStream_t stream = 0, ///< [in] <b>[optional]</b> CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
- bool stream_synchronous = false) ///< [in] <b>[optional]</b> Whether or not to synchronize the stream after every kernel launch to check for errors. Default is \p false.
- {
- return Reduce(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, cub::Sum(), stream, stream_synchronous);
- }
-
-
-};
-
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
-
diff --git a/lib/kokkos/TPL/cub/device/device_reduce_by_key.cuh b/lib/kokkos/TPL/cub/device/device_reduce_by_key.cuh
deleted file mode 100755
index f05f75154..000000000
--- a/lib/kokkos/TPL/cub/device/device_reduce_by_key.cuh
+++ /dev/null
@@ -1,633 +0,0 @@
-
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * cub::DeviceReduceByKey provides operations for computing device-wide reductions of consecutive values whose corresponding keys are equal, for data residing within global memory.
- */
-
-#pragma once
-
-#include <stdio.h>
-#include <iterator>
-
-#include "block/block_reduce_by_key_tiles.cuh"
-#include "device_scan.cuh"
-#include "../thread/thread_operators.cuh"
-#include "../grid/grid_queue.cuh"
-#include "../util_iterator.cuh"
-#include "../util_debug.cuh"
-#include "../util_device.cuh"
-#include "../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/******************************************************************************
- * Kernel entry points
- *****************************************************************************/
-
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
-
-/**
- * Reduce-by-key kernel entry point (multi-block)
- */
-template <
-    typename            BlockSweepScanPolicy,           ///< Tuning policy for the cub::BlockSweepScan abstraction used by this kernel
- typename InputIteratorRA, ///< Random-access iterator type for input (may be a simple pointer type)
- typename OutputIteratorRA, ///< Random-access iterator type for output (may be a simple pointer type)
- typename T, ///< The scan data type
- typename ReductionOp, ///< Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- typename Identity, ///< Identity value type (cub::NullType for inclusive scans)
- typename SizeT> ///< Integer type used for global array indexing
-__launch_bounds__ (int(BlockSweepScanPolicy::BLOCK_THREADS))
-__global__ void MultiBlockScanKernel(
- InputIteratorRA d_in, ///< Input data
- OutputIteratorRA d_out, ///< Output data
- ScanTileDescriptor<T> *d_tile_status, ///< Global list of tile status
- ReductionOp reduction_op, ///< Binary scan operator
- Identity identity, ///< Identity element
- SizeT num_items, ///< Total number of scan items for the entire problem
- GridQueue<int> queue) ///< Descriptor for performing dynamic mapping of tile data to thread blocks
-{
- enum
- {
- TILE_STATUS_PADDING = PtxArchProps::WARP_THREADS,
- };
-
- // Thread block type for scanning input tiles
- typedef BlockSweepScan<
- BlockSweepScanPolicy,
- InputIteratorRA,
- OutputIteratorRA,
- ReductionOp,
- Identity,
- SizeT> BlockSweepScanT;
-
- // Shared memory for BlockSweepScan
- __shared__ typename BlockSweepScanT::TempStorage temp_storage;
-
- // Process tiles
- BlockSweepScanT(temp_storage, d_in, d_out, reduction_op, identity).ConsumeTiles(
- num_items,
- queue,
- d_tile_status + TILE_STATUS_PADDING);
-}
-
-
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
-
-
-/******************************************************************************
- * DeviceReduceByKey
- *****************************************************************************/
-
-/**
- * \addtogroup DeviceModule
- * @{
- */
-
-/**
- * \brief DeviceReduceByKey provides operations for computing device-wide reductions of consecutive values whose corresponding keys are equal, for data residing within global memory.
- */
-struct DeviceReduceByKey
-{
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
- /******************************************************************************
- * Constants and typedefs
- ******************************************************************************/
-
- /// Generic structure for encapsulating dispatch properties. Mirrors the constants within BlockSweepScanPolicy.
- struct KernelDispachParams
- {
- // Policy fields
- int block_threads;
- int items_per_thread;
- BlockLoadAlgorithm load_policy;
- BlockStoreAlgorithm store_policy;
- BlockScanAlgorithm scan_algorithm;
-
- // Other misc
- int tile_size;
-
- template <typename BlockSweepScanPolicy>
- __host__ __device__ __forceinline__
- void Init()
- {
- block_threads = BlockSweepScanPolicy::BLOCK_THREADS;
- items_per_thread = BlockSweepScanPolicy::ITEMS_PER_THREAD;
- load_policy = BlockSweepScanPolicy::LOAD_ALGORITHM;
- store_policy = BlockSweepScanPolicy::STORE_ALGORITHM;
- scan_algorithm = BlockSweepScanPolicy::SCAN_ALGORITHM;
-
- tile_size = block_threads * items_per_thread;
- }
-
- __host__ __device__ __forceinline__
- void Print()
- {
- printf("%d, %d, %d, %d, %d",
- block_threads,
- items_per_thread,
- load_policy,
- store_policy,
- scan_algorithm);
- }
-
- };
-
-
- /******************************************************************************
- * Tuning policies
- ******************************************************************************/
-
-
- /// Specializations of tuned policy types for different PTX architectures
- template <
- typename T,
- typename SizeT,
- int ARCH>
- struct TunedPolicies;
-
- /// SM35 tune
- template <typename T, typename SizeT>
- struct TunedPolicies<T, SizeT, 350>
- {
- typedef BlockSweepScanPolicy<128, 16, BLOCK_LOAD_DIRECT, false, LOAD_LDG, BLOCK_STORE_WARP_TRANSPOSE, true, BLOCK_SCAN_RAKING_MEMOIZE> MultiBlockPolicy;
- };
-
- /// SM30 tune
- template <typename T, typename SizeT>
- struct TunedPolicies<T, SizeT, 300>
- {
- typedef BlockSweepScanPolicy<256, 9, BLOCK_LOAD_WARP_TRANSPOSE, false, LOAD_DEFAULT, BLOCK_STORE_WARP_TRANSPOSE, false, BLOCK_SCAN_RAKING_MEMOIZE> MultiBlockPolicy;
- };
-
- /// SM20 tune
- template <typename T, typename SizeT>
- struct TunedPolicies<T, SizeT, 200>
- {
- typedef BlockSweepScanPolicy<128, 15, BLOCK_LOAD_WARP_TRANSPOSE, false, LOAD_DEFAULT, BLOCK_STORE_WARP_TRANSPOSE, false, BLOCK_SCAN_RAKING_MEMOIZE> MultiBlockPolicy;
- };
-
- /// SM10 tune
- template <typename T, typename SizeT>
- struct TunedPolicies<T, SizeT, 100>
- {
- typedef BlockSweepScanPolicy<128, 7, BLOCK_LOAD_TRANSPOSE, false, LOAD_DEFAULT, BLOCK_STORE_TRANSPOSE, false, BLOCK_SCAN_RAKING> MultiBlockPolicy;
- };
-
-
- /// Tuning policy for the PTX architecture that DeviceReduceByKey operations will get dispatched to
- template <typename T, typename SizeT>
- struct PtxDefaultPolicies
- {
- static const int PTX_TUNE_ARCH = (CUB_PTX_ARCH >= 350) ?
- 350 :
- (CUB_PTX_ARCH >= 300) ?
- 300 :
- (CUB_PTX_ARCH >= 200) ?
- 200 :
- 100;
-
- // Tuned policy set for the current PTX compiler pass
- typedef TunedPolicies<T, SizeT, PTX_TUNE_ARCH> PtxTunedPolicies;
-
- // MultiBlockPolicy that opaquely derives from the specialization corresponding to the current PTX compiler pass
- struct MultiBlockPolicy : PtxTunedPolicies::MultiBlockPolicy {};
-
- /**
- * Initialize dispatch params with the policies corresponding to the PTX assembly we will use
- */
- static void InitDispatchParams(int ptx_version, KernelDispachParams &multi_block_dispatch_params)
- {
- if (ptx_version >= 350)
- {
- typedef TunedPolicies<T, SizeT, 350> TunedPolicies;
- multi_block_dispatch_params.Init<typename TunedPolicies::MultiBlockPolicy>();
- }
- else if (ptx_version >= 300)
- {
- typedef TunedPolicies<T, SizeT, 300> TunedPolicies;
- multi_block_dispatch_params.Init<typename TunedPolicies::MultiBlockPolicy>();
- }
- else if (ptx_version >= 200)
- {
- typedef TunedPolicies<T, SizeT, 200> TunedPolicies;
- multi_block_dispatch_params.Init<typename TunedPolicies::MultiBlockPolicy>();
- }
- else
- {
- typedef TunedPolicies<T, SizeT, 100> TunedPolicies;
- multi_block_dispatch_params.Init<typename TunedPolicies::MultiBlockPolicy>();
- }
- }
- };
-
-
- /******************************************************************************
- * Utility methods
- ******************************************************************************/
-
- /**
- * Internal dispatch routine
- */
- template <
- typename InitScanKernelPtr, ///< Function type of cub::InitScanKernel
- typename MultiBlockScanKernelPtr, ///< Function type of cub::MultiBlockScanKernel
- typename InputIteratorRA, ///< Random-access iterator type for input (may be a simple pointer type)
- typename OutputIteratorRA, ///< Random-access iterator type for output (may be a simple pointer type)
- typename ReductionOp, ///< Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- typename Identity, ///< Identity value type (cub::NullType for inclusive scans)
- typename SizeT> ///< Integer type used for global array indexing
- __host__ __device__ __forceinline__
- static cudaError_t Dispatch(
- void *d_temp_storage, ///< [in] %Device allocation of temporary storage. When NULL, the required allocation size is returned in \p temp_storage_bytes and no work is done.
- size_t &temp_storage_bytes, ///< [in,out] Size in bytes of \p d_temp_storage allocation.
- InitScanKernelPtr init_kernel, ///< [in] Kernel function pointer to parameterization of cub::InitScanKernel
- MultiBlockScanKernelPtr multi_block_kernel, ///< [in] Kernel function pointer to parameterization of cub::MultiBlockScanKernel
- KernelDispachParams &multi_block_dispatch_params, ///< [in] Dispatch parameters that match the policy that \p multi_block_kernel was compiled for
- InputIteratorRA d_in, ///< [in] Iterator pointing to scan input
- OutputIteratorRA d_out, ///< [in] Iterator pointing to scan output
- ReductionOp reduction_op, ///< [in] Binary scan operator
- Identity identity, ///< [in] Identity element
- SizeT num_items, ///< [in] Total number of items to scan
- cudaStream_t stream = 0, ///< [in] <b>[optional]</b> CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
- bool stream_synchronous = false) ///< [in] <b>[optional]</b> Whether or not to synchronize the stream after every kernel launch to check for errors. Default is \p false.
- {
-
-#ifndef CUB_RUNTIME_ENABLED
-
- // Kernel launch not supported from this device
- return CubDebug(cudaErrorNotSupported );
-
-#else
-
- enum
- {
- TILE_STATUS_PADDING = 32,
- };
-
- // Data type
- typedef typename std::iterator_traits<InputIteratorRA>::value_type T;
-
- cudaError error = cudaSuccess;
- do
- {
- // Number of input tiles
- int num_tiles = (num_items + multi_block_dispatch_params.tile_size - 1) / multi_block_dispatch_params.tile_size;
-
- // Temporary storage allocation requirements
- void* allocations[2];
- size_t allocation_sizes[2] =
- {
- (num_tiles + TILE_STATUS_PADDING) * sizeof(ScanTileDescriptor<T>), // bytes needed for tile status descriptors
- GridQueue<int>::AllocationSize() // bytes needed for grid queue descriptor
- };
-
- // Alias temporaries (or set the necessary size of the storage allocation)
- if (CubDebug(error = AliasTemporaries(d_temp_storage, temp_storage_bytes, allocations, allocation_sizes))) break;
-
- // Return if the caller is simply requesting the size of the storage allocation
- if (d_temp_storage == NULL)
- return cudaSuccess;
-
- // Global list of tile status
- ScanTileDescriptor<T> *d_tile_status = (ScanTileDescriptor<T>*) allocations[0];
-
- // Grid queue descriptor
- GridQueue<int> queue(allocations[1]);
-
- // Get GPU id
- int device_ordinal;
- if (CubDebug(error = cudaGetDevice(&device_ordinal))) break;
-
- // Get SM count
- int sm_count;
- if (CubDebug(error = cudaDeviceGetAttribute (&sm_count, cudaDevAttrMultiProcessorCount, device_ordinal))) break;
-
- // Log init_kernel configuration
- int init_kernel_threads = 128;
- int init_grid_size = (num_tiles + init_kernel_threads - 1) / init_kernel_threads;
- if (stream_synchronous) CubLog("Invoking init_kernel<<<%d, %d, 0, %lld>>>()\n", init_grid_size, init_kernel_threads, (long long) stream);
-
- // Invoke init_kernel to initialize tile descriptors and queue descriptors
- init_kernel<<<init_grid_size, init_kernel_threads, 0, stream>>>(
- queue,
- d_tile_status,
- num_tiles);
-
- // Sync the stream if specified
-#ifndef __CUDA_ARCH__
- if (stream_synchronous && CubDebug(error = cudaStreamSynchronize(stream))) break;
-#else
- if (stream_synchronous && CubDebug(error = cudaDeviceSynchronize())) break;
-#endif
-
- // Get a rough estimate of multi_block_kernel SM occupancy based upon the maximum SM occupancy of the targeted PTX architecture
- int multi_sm_occupancy = CUB_MIN(
- ArchProps<CUB_PTX_ARCH>::MAX_SM_THREADBLOCKS,
- ArchProps<CUB_PTX_ARCH>::MAX_SM_THREADS / multi_block_dispatch_params.block_threads);
-
-#ifndef __CUDA_ARCH__
-
- // We're on the host, so come up with a more accurate estimate of multi_block_kernel SM occupancy from actual device properties
- Device device_props;
- if (CubDebug(error = device_props.Init(device_ordinal))) break;
-
- if (CubDebug(error = device_props.MaxSmOccupancy(
- multi_sm_occupancy,
- multi_block_kernel,
- multi_block_dispatch_params.block_threads))) break;
-
-#endif
- // Get device occupancy for multi_block_kernel
- int multi_block_occupancy = multi_sm_occupancy * sm_count;
-
- // Get grid size for multi_block_kernel
- int multi_block_grid_size = (num_tiles < multi_block_occupancy) ?
- num_tiles : // Not enough to fill the device with threadblocks
- multi_block_occupancy; // Fill the device with threadblocks
-
- // Log multi_block_kernel configuration
- if (stream_synchronous) CubLog("Invoking multi_block_kernel<<<%d, %d, 0, %lld>>>(), %d items per thread, %d SM occupancy\n",
- multi_block_grid_size, multi_block_dispatch_params.block_threads, (long long) stream, multi_block_dispatch_params.items_per_thread, multi_sm_occupancy);
-
- // Invoke multi_block_kernel
- multi_block_kernel<<<multi_block_grid_size, multi_block_dispatch_params.block_threads, 0, stream>>>(
- d_in,
- d_out,
- d_tile_status,
- reduction_op,
- identity,
- num_items,
- queue);
-
- // Sync the stream if specified
-#ifndef __CUDA_ARCH__
- if (stream_synchronous && CubDebug(error = cudaStreamSynchronize(stream))) break;
-#else
- if (stream_synchronous && CubDebug(error = cudaDeviceSynchronize())) break;
-#endif
- }
- while (0);
-
- return error;
-
-#endif // CUB_RUNTIME_ENABLED
- }
-
-
-
- /**
- * Internal scan dispatch routine for using default tuning policies
- */
- template <
- typename InputIteratorRA, ///< Random-access iterator type for input (may be a simple pointer type)
- typename OutputIteratorRA, ///< Random-access iterator type for output (may be a simple pointer type)
- typename ReductionOp, ///< Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- typename Identity, ///< Identity value type (cub::NullType for inclusive scans)
- typename SizeT> ///< Integer type used for global array indexing
- __host__ __device__ __forceinline__
- static cudaError_t Dispatch(
- void *d_temp_storage, ///< [in] %Device allocation of temporary storage. When NULL, the required allocation size is returned in \p temp_storage_bytes and no work is done.
- size_t &temp_storage_bytes, ///< [in,out] Size in bytes of \p d_temp_storage allocation.
- InputIteratorRA d_in, ///< [in] Iterator pointing to scan input
- OutputIteratorRA d_out, ///< [in] Iterator pointing to scan output
- ReductionOp reduction_op, ///< [in] Binary scan operator
- Identity identity, ///< [in] Identity element
- SizeT num_items, ///< [in] Total number of items to scan
- cudaStream_t stream = 0, ///< [in] <b>[optional]</b> CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
- bool stream_synchronous = false) ///< [in] <b>[optional]</b> Whether or not to synchronize the stream after every kernel launch to check for errors. Default is \p false.
- {
- // Data type
- typedef typename std::iterator_traits<InputIteratorRA>::value_type T;
-
-        // Tuning policies for the PTX architecture that will get dispatched to
- typedef PtxDefaultPolicies<T, SizeT> PtxDefaultPolicies;
- typedef typename PtxDefaultPolicies::MultiBlockPolicy MultiBlockPolicy;
-
- cudaError error = cudaSuccess;
- do
- {
- // Declare dispatch parameters
- KernelDispachParams multi_block_dispatch_params;
-
-#ifdef __CUDA_ARCH__
- // We're on the device, so initialize the dispatch parameters with the PtxDefaultPolicies directly
- multi_block_dispatch_params.Init<MultiBlockPolicy>();
-#else
- // We're on the host, so lookup and initialize the dispatch parameters with the policies that match the device's PTX version
- int ptx_version;
- if (CubDebug(error = PtxVersion(ptx_version))) break;
- PtxDefaultPolicies::InitDispatchParams(ptx_version, multi_block_dispatch_params);
-#endif
-
- Dispatch(
- d_temp_storage,
- temp_storage_bytes,
- InitScanKernel<T, SizeT>,
- MultiBlockScanKernel<MultiBlockPolicy, InputIteratorRA, OutputIteratorRA, T, ReductionOp, Identity, SizeT>,
- multi_block_dispatch_params,
- d_in,
- d_out,
- reduction_op,
- identity,
- num_items,
- stream,
- stream_synchronous);
-
- if (CubDebug(error)) break;
- }
- while (0);
-
- return error;
- }
-
- #endif // DOXYGEN_SHOULD_SKIP_THIS
-
-
- /******************************************************************//**
- * Interface
- *********************************************************************/
-
-
- /**
- * \brief Computes device-wide reductions of consecutive values whose corresponding keys are equal.
- *
- * The resulting output lists of value-aggregates and their corresponding keys are compacted.
- *
- * \devicestorage
- *
- * \tparam KeyInputIteratorRA <b>[inferred]</b> Random-access input iterator type for keys input (may be a simple pointer type)
- * \tparam KeyOutputIteratorRA <b>[inferred]</b> Random-access output iterator type for keys output (may be a simple pointer type)
- * \tparam ValueInputIteratorRA <b>[inferred]</b> Random-access input iterator type for values input (may be a simple pointer type)
- * \tparam ValueOutputIteratorRA <b>[inferred]</b> Random-access output iterator type for values output (may be a simple pointer type)
- * \tparam ReductionOp <b>[inferred]</b> Binary reduction operator type having member <tt>T operator()(const T &a, const T &b)</tt>, where \p T is the value type of \p ValueInputIteratorRA
- */
- template <
- typename KeyInputIteratorRA,
- typename KeyOutputIteratorRA,
- typename ValueInputIteratorRA,
- typename ValueOutputIteratorRA,
- typename ReductionOp>
- __host__ __device__ __forceinline__
- static cudaError_t ReduceValues(
- void *d_temp_storage, ///< [in] %Device allocation of temporary storage. When NULL, the required allocation size is returned in \p temp_storage_bytes and no work is done.
- size_t &temp_storage_bytes, ///< [in,out] Size in bytes of \p d_temp_storage allocation.
- KeyInputIteratorRA d_keys_in, ///< [in] Key input data
- KeyOutputIteratorRA d_keys_out, ///< [out] Key output data (compacted)
- ValueInputIteratorRA d_values_in, ///< [in] Value input data
- ValueOutputIteratorRA d_values_out, ///< [out] Value output data (compacted)
- int num_items, ///< [in] Total number of input pairs
- ReductionOp reduction_op, ///< [in] Binary value reduction operator
- cudaStream_t stream = 0, ///< [in] <b>[optional]</b> CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
- bool stream_synchronous = false) ///< [in] <b>[optional]</b> Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is \p false.
- {
- return Dispatch(d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out, d_values_in, d_values_out, reduction_op, num_items, stream, stream_synchronous);
- }
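-
-    /*
-     * Usage sketch for the ReduceValues interface above (an illustration, not part
-     * of the original header).  It assumes the enclosing struct is reachable as
-     * cub::DeviceReduceByKey (a placeholder qualifier, since the struct declaration
-     * lies above this excerpt) and follows the same two-pass temporary-storage
-     * pattern used by the DeviceScan examples further below:
-     *
-     *     int *d_keys_in, *d_keys_out, *d_values_in, *d_values_out;  // device arrays
-     *     int num_items = ...;
-     *
-     *     // First call with d_temp_storage == NULL only queries the required size
-     *     void   *d_temp_storage     = NULL;
-     *     size_t  temp_storage_bytes = 0;
-     *     cub::DeviceReduceByKey::ReduceValues(d_temp_storage, temp_storage_bytes,
-     *         d_keys_in, d_keys_out, d_values_in, d_values_out, num_items, cub::Max());
-     *
-     *     // Allocate the scratch space, then run the per-key (segmented) reduction
-     *     cudaMalloc(&d_temp_storage, temp_storage_bytes);
-     *     cub::DeviceReduceByKey::ReduceValues(d_temp_storage, temp_storage_bytes,
-     *         d_keys_in, d_keys_out, d_values_in, d_values_out, num_items, cub::Max());
-     */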
-
-
- /**
- * \brief Computes device-wide sums of consecutive values whose corresponding keys are equal.
- *
- * The resulting output lists of value-aggregates and their corresponding keys are compacted.
- *
- * \devicestorage
- *
- * \tparam KeyInputIteratorRA <b>[inferred]</b> Random-access input iterator type for keys input (may be a simple pointer type)
- * \tparam KeyOutputIteratorRA <b>[inferred]</b> Random-access output iterator type for keys output (may be a simple pointer type)
- * \tparam ValueInputIteratorRA <b>[inferred]</b> Random-access input iterator type for values input (may be a simple pointer type)
- * \tparam ValueOutputIteratorRA <b>[inferred]</b> Random-access output iterator type for values output (may be a simple pointer type)
- * \tparam ReductionOp <b>[inferred]</b> Binary reduction operator type having member <tt>T operator()(const T &a, const T &b)</tt>, where \p T is the value type of \p ValueInputIteratorRA
- */
- template <
- typename KeyInputIteratorRA,
- typename KeyOutputIteratorRA,
- typename ValueInputIteratorRA,
- typename ValueOutputIteratorRA>
- __host__ __device__ __forceinline__
- static cudaError_t SumValues(
- void *d_temp_storage, ///< [in] %Device allocation of temporary storage. When NULL, the required allocation size is returned in \p temp_storage_bytes and no work is done.
- size_t &temp_storage_bytes, ///< [in,out] Size in bytes of \p d_temp_storage allocation.
- KeyInputIteratorRA d_keys_in, ///< [in] Key input data
-        KeyOutputIteratorRA         d_keys_out,                     ///< [out] Key output data (compacted)
- ValueInputIteratorRA d_values_in, ///< [in] Value input data
-        ValueOutputIteratorRA       d_values_out,                   ///< [out] Value output data (compacted)
- int num_items, ///< [in] Total number of input pairs
- cudaStream_t stream = 0, ///< [in] <b>[optional]</b> CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
- bool stream_synchronous = false) ///< [in] <b>[optional]</b> Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is \p false.
- {
- return ReduceValues(d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out, d_values_in, d_values_out, cub::Sum(), num_items, stream, stream_synchronous);
- }
-
-
- /**
- * \brief Computes the "run-length" of each group of consecutive, equal-valued keys.
- *
- * The resulting output lists of run-length counts and their corresponding keys are compacted.
- *
- * \devicestorage
- *
- * \tparam KeyInputIteratorRA <b>[inferred]</b> Random-access input iterator type for keys input (may be a simple pointer type)
- * \tparam KeyOutputIteratorRA <b>[inferred]</b> Random-access output iterator type for keys output (may be a simple pointer type)
- * \tparam CountOutputIteratorRA <b>[inferred]</b> Random-access output iterator type for output of key-counts whose value type must be convertible to an integer type (may be a simple pointer type)
- */
- template <
- typename KeyInputIteratorRA,
- typename KeyOutputIteratorRA,
- typename CountOutputIteratorRA>
- __host__ __device__ __forceinline__
- static cudaError_t RunLengths(
- void *d_temp_storage, ///< [in] %Device allocation of temporary storage. When NULL, the required allocation size is returned in \p temp_storage_bytes and no work is done.
- size_t &temp_storage_bytes, ///< [in,out] Size in bytes of \p d_temp_storage allocation.
- KeyInputIteratorRA d_keys_in, ///< [in] Key input data
-        KeyOutputIteratorRA         d_keys_out,                     ///< [out] Key output data (compacted)
-        CountOutputIteratorRA       d_counts_out,                   ///< [out] Run-length counts output data (compacted)
- int num_items, ///< [in] Total number of keys
- cudaStream_t stream = 0, ///< [in] <b>[optional]</b> CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
- bool stream_synchronous = false) ///< [in] <b>[optional]</b> Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is \p false.
- {
- typedef typename std::iterator_traits<CountOutputIteratorRA>::value_type CountT;
- return SumValues(d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out, ConstantIteratorRA<CountT>(1), d_counts_out, num_items, stream, stream_synchronous);
- }
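-
-    /*
-     * Illustration of RunLengths (not part of the original header): for the key
-     * sequence [0, 0, 1, 3, 3, 3] it writes the compacted keys [0, 1, 3] to
-     * d_keys_out and the counts [2, 1, 3] to d_counts_out.  As the one-line
-     * implementation above shows, this is simply SumValues applied to a
-     * ConstantIteratorRA<CountT>(1) value stream.
-     */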
-
-
- /**
- * \brief Removes duplicates within each group of consecutive, equal-valued keys. Only the first key from each group (and corresponding value) is kept.
- *
- * The resulting keys are compacted.
- *
- * \devicestorage
- *
- * \tparam KeyInputIteratorRA <b>[inferred]</b> Random-access input iterator type for keys input (may be a simple pointer type)
- * \tparam KeyOutputIteratorRA <b>[inferred]</b> Random-access output iterator type for keys output (may be a simple pointer type)
- * \tparam ValueInputIteratorRA <b>[inferred]</b> Random-access input iterator type for values input (may be a simple pointer type)
- * \tparam ValueOutputIteratorRA <b>[inferred]</b> Random-access output iterator type for values output (may be a simple pointer type)
- * \tparam ReductionOp <b>[inferred]</b> Binary reduction operator type having member <tt>T operator()(const T &a, const T &b)</tt>, where \p T is the value type of \p ValueInputIteratorRA
- */
- template <
- typename KeyInputIteratorRA,
- typename KeyOutputIteratorRA,
- typename ValueInputIteratorRA,
- typename ValueOutputIteratorRA,
- typename ReductionOp>
- __host__ __device__ __forceinline__
- static cudaError_t Unique(
- void *d_temp_storage, ///< [in] %Device allocation of temporary storage. When NULL, the required allocation size is returned in \p temp_storage_bytes and no work is done.
- size_t &temp_storage_bytes, ///< [in,out] Size in bytes of \p d_temp_storage allocation.
- KeyInputIteratorRA d_keys_in, ///< [in] Key input data
- KeyOutputIteratorRA d_keys_out, ///< [out] Key output data (compacted)
- ValueInputIteratorRA d_values_in, ///< [in] Value input data
- ValueOutputIteratorRA d_values_out, ///< [out] Value output data (compacted)
- int num_items, ///< [in] Total number of input pairs
- cudaStream_t stream = 0, ///< [in] <b>[optional]</b> CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
- bool stream_synchronous = false) ///< [in] <b>[optional]</b> Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is \p false.
- {
- return Dispatch(d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out, d_values_in, d_values_out, reduction_op, num_items, stream, stream_synchronous);
- }
-
-
-
-};
-
-
-/** @} */ // DeviceModule
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
-
diff --git a/lib/kokkos/TPL/cub/device/device_reorder.cuh b/lib/kokkos/TPL/cub/device/device_reorder.cuh
deleted file mode 100755
index cba3bb48f..000000000
--- a/lib/kokkos/TPL/cub/device/device_reorder.cuh
+++ /dev/null
@@ -1,550 +0,0 @@
-
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * cub::DeviceReorder provides device-wide operations for partitioning and filtering lists of items residing within global memory.
- */
-
-#pragma once
-
-#include <stdio.h>
-#include <iterator>
-
-#include "device_scan.cuh"
-#include "block/block_partition_tiles.cuh"
-#include "../grid/grid_queue.cuh"
-#include "../util_debug.cuh"
-#include "../util_device.cuh"
-#include "../util_vector.cuh"
-#include "../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/******************************************************************************
- * Kernel entry points
- *****************************************************************************/
-
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
-/**
- * Partition kernel entry point (multi-block)
- */
-template <
- typename BlockPartitionTilesPolicy, ///< Tuning policy for cub::BlockPartitionTiles abstraction
- typename InputIteratorRA, ///< Random-access iterator type for input (may be a simple pointer type)
- typename OutputIteratorRA, ///< Random-access iterator type for output (may be a simple pointer type)
- typename LengthOutputIterator, ///< Output iterator type for recording the length of the first partition (may be a simple pointer type)
- typename PredicateOp, ///< Unary predicate operator indicating membership in the first partition type having member <tt>bool operator()(const T &val)</tt>
- typename SizeT> ///< Integer type used for global array indexing
-__launch_bounds__ (int(BlockPartitionTilesPolicy::BLOCK_THREADS))
-__global__ void PartitionKernel(
- InputIteratorRA d_in, ///< Input data
- OutputIteratorRA d_out, ///< Output data
- LengthOutputIterator d_partition_length, ///< Number of items in the first partition
- ScanTileDescriptor<PartitionScanTuple<SizeT, BlockPartitionTilesPolicy::PARTITOINS> > *d_tile_status, ///< Global list of tile status
- PredicateOp pred_op, ///< Unary predicate operator indicating membership in the first partition
- SizeT num_items, ///< Total number of input items for the entire problem
-    int                         num_tiles,                      ///< Total number of input tiles for the entire problem
- GridQueue<int> queue) ///< Descriptor for performing dynamic mapping of tile data to thread blocks
-{
- enum
- {
- TILE_STATUS_PADDING = PtxArchProps::WARP_THREADS,
- };
-
- typedef PartitionScanTuple<SizeT, BlockPartitionTilesPolicy::PARTITOINS> PartitionScanTuple;
-
- // Thread block type for scanning input tiles
- typedef BlockPartitionTiles<
- BlockPartitionTilesPolicy,
- InputIteratorRA,
- OutputIteratorRA,
- PredicateOp,
- SizeT> BlockPartitionTilesT;
-
- // Shared memory for BlockPartitionTiles
- __shared__ typename BlockPartitionTilesT::TempStorage temp_storage;
-
- // Process tiles
- PartitionScanTuple partition_ends; // Ending offsets for partitions (one-after)
- bool is_last_tile; // Whether or not this block handled the last tile (i.e., partition_ends is valid for the entire input)
- BlockPartitionTilesT(temp_storage, d_in, d_out, d_tile_status + TILE_STATUS_PADDING, pred_op, num_items).ConsumeTiles(
- queue,
- num_tiles,
- partition_ends,
- is_last_tile);
-
- // Record the length of the first partition
- if (is_last_tile && (threadIdx.x == 0))
- {
- *d_partition_length = partition_ends.x;
- }
-}
-
-
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
-
-
-/******************************************************************************
- * DeviceReorder
- *****************************************************************************/
-
-/**
- * \addtogroup DeviceModule
- * @{
- */
-
-/**
- * \brief DeviceReorder provides device-wide operations for partitioning and filtering lists of items residing within global memory
- */
-struct DeviceReorder
-{
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
- /******************************************************************************
- * Constants and typedefs
- ******************************************************************************/
-
- /// Generic structure for encapsulating dispatch properties. Mirrors the constants within BlockPartitionTilesPolicy.
- struct KernelDispachParams
- {
- int block_threads;
- int items_per_thread;
- BlockScanAlgorithm scan_algorithm;
- int tile_size;
-
- template <typename BlockPartitionTilesPolicy>
- __host__ __device__ __forceinline__
- void Init()
- {
- block_threads = BlockPartitionTilesPolicy::BLOCK_THREADS;
- items_per_thread = BlockPartitionTilesPolicy::ITEMS_PER_THREAD;
- scan_algorithm = BlockPartitionTilesPolicy::SCAN_ALGORITHM;
- tile_size = block_threads * items_per_thread;
- }
- };
-
-
- /******************************************************************************
- * Tuning policies
- ******************************************************************************/
-
-
- /// Specializations of tuned policy types for different PTX architectures
- template <
- int PARTITIONS,
- typename T,
- typename SizeT,
- int ARCH>
- struct TunedPolicies;
-
- /// SM35 tune
- template <int PARTITIONS, typename T, typename SizeT>
- struct TunedPolicies<PARTITIONS, T, SizeT, 350>
- {
- enum {
- NOMINAL_4B_ITEMS_PER_THREAD = 16,
- ITEMS_PER_THREAD = CUB_MIN(NOMINAL_4B_ITEMS_PER_THREAD, CUB_MAX(1, (NOMINAL_4B_ITEMS_PER_THREAD * 4 / sizeof(T)))),
- };
-
- typedef BlockPartitionTilesPolicy<PARTITIONS, 128, ITEMS_PER_THREAD, LOAD_LDG, BLOCK_SCAN_RAKING_MEMOIZE> PartitionPolicy;
- };
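-
-    // Worked example of the ITEMS_PER_THREAD scaling above: with
-    // NOMINAL_4B_ITEMS_PER_THREAD = 16, an 8-byte T yields 16 * 4 / 8 = 8 items per
-    // thread, a 4-byte T keeps the nominal 16, and a 1-byte T computes 64 but is
-    // clamped back to 16 by the outer CUB_MIN; the inner CUB_MAX(1, ...) keeps
-    // very large value types at one item per thread.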
-
- /// SM30 tune
- template <int PARTITIONS, typename T, typename SizeT>
- struct TunedPolicies<PARTITIONS, T, SizeT, 300>
- {
- enum {
- NOMINAL_4B_ITEMS_PER_THREAD = 9,
- ITEMS_PER_THREAD = CUB_MIN(NOMINAL_4B_ITEMS_PER_THREAD, CUB_MAX(1, (NOMINAL_4B_ITEMS_PER_THREAD * 4 / sizeof(T)))),
- };
-
- typedef BlockPartitionTilesPolicy<PARTITIONS, 256, ITEMS_PER_THREAD, LOAD_DEFAULT, BLOCK_SCAN_RAKING_MEMOIZE> PartitionPolicy;
- };
-
- /// SM20 tune
- template <int PARTITIONS, typename T, typename SizeT>
- struct TunedPolicies<PARTITIONS, T, SizeT, 200>
- {
- enum {
- NOMINAL_4B_ITEMS_PER_THREAD = 15,
- ITEMS_PER_THREAD = CUB_MIN(NOMINAL_4B_ITEMS_PER_THREAD, CUB_MAX(1, (NOMINAL_4B_ITEMS_PER_THREAD * 4 / sizeof(T)))),
- };
-
- typedef BlockPartitionTilesPolicy<PARTITIONS, 128, ITEMS_PER_THREAD, LOAD_DEFAULT, BLOCK_SCAN_RAKING_MEMOIZE> PartitionPolicy;
- };
-
- /// SM10 tune
- template <int PARTITIONS, typename T, typename SizeT>
- struct TunedPolicies<PARTITIONS, T, SizeT, 100>
- {
- enum {
- NOMINAL_4B_ITEMS_PER_THREAD = 7,
- ITEMS_PER_THREAD = CUB_MIN(NOMINAL_4B_ITEMS_PER_THREAD, CUB_MAX(1, (NOMINAL_4B_ITEMS_PER_THREAD * 4 / sizeof(T)))),
- };
- typedef BlockPartitionTilesPolicy<PARTITIONS, 128, ITEMS_PER_THREAD, LOAD_DEFAULT, BLOCK_SCAN_RAKING> PartitionPolicy;
- };
-
-
- /// Tuning policy for the PTX architecture that DevicePartition operations will get dispatched to
- template <int PARTITIONS, typename T, typename SizeT>
- struct PtxDefaultPolicies
- {
- static const int PTX_TUNE_ARCH = (CUB_PTX_ARCH >= 350) ?
- 350 :
- (CUB_PTX_ARCH >= 300) ?
- 300 :
- (CUB_PTX_ARCH >= 200) ?
- 200 :
- 100;
-
- // Tuned policy set for the current PTX compiler pass
- typedef TunedPolicies<PARTITIONS, T, SizeT, PTX_TUNE_ARCH> PtxTunedPolicies;
-
- // PartitionPolicy that opaquely derives from the specialization corresponding to the current PTX compiler pass
- struct PartitionPolicy : PtxTunedPolicies::PartitionPolicy {};
-
- /**
- * Initialize dispatch params with the policies corresponding to the PTX assembly we will use
- */
- static void InitDispatchParams(int ptx_version, KernelDispachParams &scan_dispatch_params)
- {
- if (ptx_version >= 350)
- {
- typedef TunedPolicies<PARTITIONS, T, SizeT, 350> TunedPolicies;
- scan_dispatch_params.Init<typename TunedPolicies::PartitionPolicy>();
- }
- else if (ptx_version >= 300)
- {
- typedef TunedPolicies<PARTITIONS, T, SizeT, 300> TunedPolicies;
- scan_dispatch_params.Init<typename TunedPolicies::PartitionPolicy>();
- }
- else if (ptx_version >= 200)
- {
- typedef TunedPolicies<PARTITIONS, T, SizeT, 200> TunedPolicies;
- scan_dispatch_params.Init<typename TunedPolicies::PartitionPolicy>();
- }
- else
- {
- typedef TunedPolicies<PARTITIONS, T, SizeT, 100> TunedPolicies;
- scan_dispatch_params.Init<typename TunedPolicies::PartitionPolicy>();
- }
- }
- };
-
-
- /******************************************************************************
- * Utility methods
- ******************************************************************************/
-
- /**
- * Internal dispatch routine
- */
- template <
- typename ScanInitKernelPtr, ///< Function type of cub::ScanInitKernel
- typename PartitionKernelPtr, ///< Function type of cub::PartitionKernel
- typename InputIteratorRA, ///< Random-access iterator type for input (may be a simple pointer type)
- typename OutputIteratorRA, ///< Random-access iterator type for output (may be a simple pointer type)
- typename LengthOutputIterator, ///< Output iterator type for recording the length of the first partition (may be a simple pointer type)
- typename PredicateOp, ///< Unary predicate operator indicating membership in the first partition type having member <tt>bool operator()(const T &val)</tt>
- typename SizeT> ///< Integer type used for global array indexing
- __host__ __device__ __forceinline__
- static cudaError_t Dispatch(
- int ptx_version, ///< [in] PTX version
- void *d_temp_storage, ///< [in] %Device allocation of temporary storage. When NULL, the required allocation size is returned in \p temp_storage_bytes and no work is done.
- size_t &temp_storage_bytes, ///< [in,out] Size in bytes of \p d_temp_storage allocation.
-        ScanInitKernelPtr           init_kernel,                    ///< [in] Kernel function pointer to parameterization of cub::ScanInitKernel
- PartitionKernelPtr partition_kernel, ///< [in] Kernel function pointer to parameterization of cub::PartitionKernel
- KernelDispachParams &scan_dispatch_params, ///< [in] Dispatch parameters that match the policy that \p partition_kernel was compiled for
- InputIteratorRA d_in, ///< [in] Iterator pointing to scan input
- OutputIteratorRA d_out, ///< [in] Iterator pointing to scan output
- LengthOutputIterator d_partition_length, ///< [out] Output iterator referencing the location where the pivot offset (i.e., the length of the first partition) is to be recorded
- PredicateOp pred_op, ///< [in] Unary predicate operator indicating membership in the first partition
- SizeT num_items, ///< [in] Total number of items to partition
- cudaStream_t stream = 0, ///< [in] <b>[optional]</b> CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
- bool stream_synchronous = false) ///< [in] <b>[optional]</b> Whether or not to synchronize the stream after every kernel launch to check for errors. Default is \p false.
- {
-
-#ifndef CUB_RUNTIME_ENABLED
-
- // Kernel launch not supported from this device
- return CubDebug(cudaErrorNotSupported);
-
-#else
-
- enum
- {
- TILE_STATUS_PADDING = 32,
- };
-
- // Data type
- typedef typename std::iterator_traits<InputIteratorRA>::value_type T;
-
- // Scan tuple type and tile status descriptor type
- typedef typename VectorHelper<SizeT, 2>::Type ScanTuple;
- typedef ScanTileDescriptor<ScanTuple> ScanTileDescriptorT;
-
- cudaError error = cudaSuccess;
- do
- {
- // Number of input tiles
- int num_tiles = (num_items + scan_dispatch_params.tile_size - 1) / scan_dispatch_params.tile_size;
-
- // Temporary storage allocation requirements
- void* allocations[2];
- size_t allocation_sizes[2] =
- {
- (num_tiles + TILE_STATUS_PADDING) * sizeof(ScanTileDescriptorT), // bytes needed for tile status descriptors
- GridQueue<int>::AllocationSize() // bytes needed for grid queue descriptor
- };
-
- // Alias temporaries (or set the necessary size of the storage allocation)
- if (CubDebug(error = AliasTemporaries(d_temp_storage, temp_storage_bytes, allocations, allocation_sizes))) break;
-
- // Return if the caller is simply requesting the size of the storage allocation
- if (d_temp_storage == NULL)
- return cudaSuccess;
-
- // Global list of tile status
- ScanTileDescriptorT *d_tile_status = (ScanTileDescriptorT*) allocations[0];
-
- // Grid queue descriptor
- GridQueue<int> queue(allocations[1]);
-
- // Log init_kernel configuration
- int init_kernel_threads = 128;
- int init_grid_size = (num_tiles + init_kernel_threads - 1) / init_kernel_threads;
- if (stream_synchronous) CubLog("Invoking init_kernel<<<%d, %d, 0, %lld>>>()\n", init_grid_size, init_kernel_threads, (long long) stream);
-
- // Invoke init_kernel to initialize tile descriptors and queue descriptors
- init_kernel<<<init_grid_size, init_kernel_threads, 0, stream>>>(
- queue,
- d_tile_status,
- num_tiles);
-
- // Sync the stream if specified
- if (stream_synchronous && (CubDebug(error = SyncStream(stream)))) break;
-
- // Get grid size for multi-block kernel
- int scan_grid_size;
- int multi_sm_occupancy = -1;
- if (ptx_version < 200)
- {
- // We don't have atomics (or don't have fast ones), so just assign one
- // block per tile (limited to 65K tiles)
- scan_grid_size = num_tiles;
- }
- else
- {
- // We have atomics and can thus reuse blocks across multiple tiles using a queue descriptor.
- // Get GPU id
- int device_ordinal;
- if (CubDebug(error = cudaGetDevice(&device_ordinal))) break;
-
- // Get SM count
- int sm_count;
- if (CubDebug(error = cudaDeviceGetAttribute (&sm_count, cudaDevAttrMultiProcessorCount, device_ordinal))) break;
-
- // Get a rough estimate of partition_kernel SM occupancy based upon the maximum SM occupancy of the targeted PTX architecture
- multi_sm_occupancy = CUB_MIN(
- ArchProps<CUB_PTX_ARCH>::MAX_SM_THREADBLOCKS,
- ArchProps<CUB_PTX_ARCH>::MAX_SM_THREADS / scan_dispatch_params.block_threads);
-
-#ifndef __CUDA_ARCH__
-            // We're on the host, so refine the occupancy estimate using the actual device properties
- Device device_props;
- if (CubDebug(error = device_props.Init(device_ordinal))) break;
-
- if (CubDebug(error = device_props.MaxSmOccupancy(
- multi_sm_occupancy,
- partition_kernel,
- scan_dispatch_params.block_threads))) break;
-#endif
- // Get device occupancy for partition_kernel
- int scan_occupancy = multi_sm_occupancy * sm_count;
-
- // Get grid size for partition_kernel
- scan_grid_size = (num_tiles < scan_occupancy) ?
- num_tiles : // Not enough to fill the device with threadblocks
- scan_occupancy; // Fill the device with threadblocks
- }
-
- // Log partition_kernel configuration
- if (stream_synchronous) CubLog("Invoking partition_kernel<<<%d, %d, 0, %lld>>>(), %d items per thread, %d SM occupancy\n",
- scan_grid_size, scan_dispatch_params.block_threads, (long long) stream, scan_dispatch_params.items_per_thread, multi_sm_occupancy);
-
- // Invoke partition_kernel
- partition_kernel<<<scan_grid_size, scan_dispatch_params.block_threads, 0, stream>>>(
- d_in,
- d_out,
- d_partition_length,
- d_tile_status,
- pred_op,
- num_items,
- num_tiles,
- queue);
-
- // Sync the stream if specified
- if (stream_synchronous && (CubDebug(error = SyncStream(stream)))) break;
- }
- while (0);
-
- return error;
-
-#endif // CUB_RUNTIME_ENABLED
- }
-
-
-
- /**
- * Internal partition dispatch routine for using default tuning policies
- */
- template <
-        int                             PARTITIONS,                 ///< Number of partitions we are keeping
- typename InputIteratorRA, ///< Random-access iterator type for input (may be a simple pointer type)
- typename OutputIteratorRA, ///< Random-access iterator type for output (may be a simple pointer type)
- typename LengthOutputIterator, ///< Output iterator type for recording the length of the first partition (may be a simple pointer type)
- typename PredicateOp, ///< Unary predicate operator indicating membership in the first partition type having member <tt>bool operator()(const T &val)</tt>
- typename SizeT> ///< Integer type used for global array indexing
- __host__ __device__ __forceinline__
- static cudaError_t Dispatch(
- void *d_temp_storage, ///< [in] %Device allocation of temporary storage. When NULL, the required allocation size is returned in \p temp_storage_bytes and no work is done.
- size_t &temp_storage_bytes, ///< [in,out] Size in bytes of \p d_temp_storage allocation.
- InputIteratorRA d_in, ///< [in] Iterator pointing to input items
- OutputIteratorRA d_out, ///< [in] Iterator pointing to output items
- LengthOutputIterator d_partition_length, ///< [out] Output iterator referencing the location where the pivot offset (i.e., the length of the first partition) is to be recorded
- PredicateOp pred_op, ///< [in] Unary predicate operator indicating membership in the first partition
- SizeT num_items, ///< [in] Total number of items to partition
- cudaStream_t stream = 0, ///< [in] <b>[optional]</b> CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
- bool stream_synchronous = false) ///< [in] <b>[optional]</b> Whether or not to synchronize the stream after every kernel launch to check for errors. Default is \p false.
- {
- // Data type
- typedef typename std::iterator_traits<InputIteratorRA>::value_type T;
-
-        // Tuning policies
- typedef PtxDefaultPolicies<PARTITIONS, T, SizeT> PtxDefaultPolicies; // Wrapper of default kernel policies
- typedef typename PtxDefaultPolicies::PartitionPolicy PartitionPolicy; // Partition kernel policy
-
- cudaError error = cudaSuccess;
- do
- {
- // Declare dispatch parameters
- KernelDispachParams scan_dispatch_params;
-
- int ptx_version;
-#ifdef __CUDA_ARCH__
- // We're on the device, so initialize the dispatch parameters with the PtxDefaultPolicies directly
- scan_dispatch_params.Init<PartitionPolicy>();
- ptx_version = CUB_PTX_ARCH;
-#else
- // We're on the host, so lookup and initialize the dispatch parameters with the policies that match the device's PTX version
- if (CubDebug(error = PtxVersion(ptx_version))) break;
- PtxDefaultPolicies::InitDispatchParams(ptx_version, scan_dispatch_params);
-#endif
-
- Dispatch(
- ptx_version,
- d_temp_storage,
- temp_storage_bytes,
- ScanInitKernel<T, SizeT>,
- PartitionKernel<PartitionPolicy, InputIteratorRA, OutputIteratorRA, LengthOutputIterator, PredicateOp, SizeT>,
- scan_dispatch_params,
- d_in,
- d_out,
- d_partition_length,
- pred_op,
- num_items,
- stream,
- stream_synchronous);
-
- if (CubDebug(error)) break;
- }
- while (0);
-
- return error;
- }
-
- #endif // DOXYGEN_SHOULD_SKIP_THIS
-
-
- /**
- * \brief Splits a list of input items into two partitions within the given output list using the specified predicate. The relative ordering of inputs is not necessarily preserved.
- *
- * An item \p val is placed in the first partition if <tt>pred_op(val) == true</tt>, otherwise
- * it is placed in the second partition. The offset of the partitioning pivot (equivalent to
-     * the total length of the first partition as well as the starting offset of the second) is
-     * recorded to \p d_pivot_offset.
- *
- * The length of the output referenced by \p d_out is assumed to be the same as that of \p d_in.
- *
- * \devicestorage
- *
- * \tparam InputIteratorRA <b>[inferred]</b> Random-access iterator type for input (may be a simple pointer type)
- * \tparam OutputIteratorRA <b>[inferred]</b> Random-access iterator type for output (may be a simple pointer type)
-     * \tparam LengthOutputIterator   <b>[inferred]</b> Output iterator type for recording the length of the first partition (may be a simple pointer type)
- * \tparam PredicateOp <b>[inferred]</b> Unary predicate operator indicating membership in the first partition type having member <tt>bool operator()(const T &val)</tt>
- */
- template <
- typename InputIteratorRA,
- typename OutputIteratorRA,
- typename LengthOutputIterator,
- typename PredicateOp>
- __host__ __device__ __forceinline__
- static cudaError_t Partition(
- void *d_temp_storage, ///< [in] %Device allocation of temporary storage. When NULL, the required allocation size is returned in \p temp_storage_bytes and no work is done.
- size_t &temp_storage_bytes, ///< [in,out] Size in bytes of \p d_temp_storage allocation.
- InputIteratorRA d_in, ///< [in] Iterator pointing to input items
- OutputIteratorRA d_out, ///< [in] Iterator pointing to output items
- LengthOutputIterator d_pivot_offset, ///< [out] Output iterator referencing the location where the pivot offset is to be recorded
- PredicateOp pred_op, ///< [in] Unary predicate operator indicating membership in the first partition
- int num_items, ///< [in] Total number of items to partition
- cudaStream_t stream = 0, ///< [in] <b>[optional]</b> CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
- bool stream_synchronous = false) ///< [in] <b>[optional]</b> Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is \p false.
- {
-        // Forward to the default-policy dispatch as a two-way partition (PARTITIONS = 2),
-        // passing the pivot-offset iterator and the caller's predicate through
-        return Dispatch<2>(d_temp_storage, temp_storage_bytes, d_in, d_out, d_pivot_offset, pred_op, num_items, stream, stream_synchronous);
- }
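-
-    /*
-     * Usage sketch for Partition (an illustration, not part of the original
-     * header).  IsPositive is a hypothetical predicate functor introduced only
-     * for this example; the call pattern mirrors the signature documented above:
-     *
-     *     struct IsPositive
-     *     {
-     *         __host__ __device__ __forceinline__
-     *         bool operator()(const int &val) const { return val > 0; }
-     *     };
-     *
-     *     int *d_in, *d_out, *d_pivot_offset;   // device input, output, and pivot slot
-     *     int num_items = ...;
-     *
-     *     // First call with d_temp_storage == NULL only queries the required size
-     *     void   *d_temp_storage     = NULL;
-     *     size_t  temp_storage_bytes = 0;
-     *     cub::DeviceReorder::Partition(d_temp_storage, temp_storage_bytes,
-     *         d_in, d_out, d_pivot_offset, IsPositive(), num_items);
-     *
-     *     // Allocate the scratch space, then partition for real
-     *     cudaMalloc(&d_temp_storage, temp_storage_bytes);
-     *     cub::DeviceReorder::Partition(d_temp_storage, temp_storage_bytes,
-     *         d_in, d_out, d_pivot_offset, IsPositive(), num_items);
-     */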
-
-
-};
-
-
-/** @} */ // DeviceModule
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
-
diff --git a/lib/kokkos/TPL/cub/device/device_scan.cuh b/lib/kokkos/TPL/cub/device/device_scan.cuh
deleted file mode 100755
index c0640c857..000000000
--- a/lib/kokkos/TPL/cub/device/device_scan.cuh
+++ /dev/null
@@ -1,812 +0,0 @@
-
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * cub::DeviceScan provides operations for computing a device-wide, parallel prefix scan across data items residing within global memory.
- */
-
-#pragma once
-
-#include <stdio.h>
-#include <iterator>
-
-#include "block/block_scan_tiles.cuh"
-#include "../thread/thread_operators.cuh"
-#include "../grid/grid_queue.cuh"
-#include "../util_debug.cuh"
-#include "../util_device.cuh"
-#include "../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/******************************************************************************
- * Kernel entry points
- *****************************************************************************/
-
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
-
-/**
- * Initialization kernel for tile status initialization (multi-block)
- */
-template <
- typename T, ///< Scan value type
- typename SizeT> ///< Integer type used for global array indexing
-__global__ void ScanInitKernel(
- GridQueue<SizeT> grid_queue, ///< [in] Descriptor for performing dynamic mapping of input tiles to thread blocks
- ScanTileDescriptor<T> *d_tile_status, ///< [out] Tile status words
- int num_tiles) ///< [in] Number of tiles
-{
- typedef ScanTileDescriptor<T> ScanTileDescriptorT;
-
- enum
- {
- TILE_STATUS_PADDING = PtxArchProps::WARP_THREADS,
- };
-
- // Reset queue descriptor
- if ((blockIdx.x == 0) && (threadIdx.x == 0)) grid_queue.ResetDrain(num_tiles);
-
- // Initialize tile status
- int tile_offset = (blockIdx.x * blockDim.x) + threadIdx.x;
- if (tile_offset < num_tiles)
- {
- // Not-yet-set
- d_tile_status[TILE_STATUS_PADDING + tile_offset].status = SCAN_TILE_INVALID;
- }
-
- if ((blockIdx.x == 0) && (threadIdx.x < TILE_STATUS_PADDING))
- {
- // Padding
- d_tile_status[threadIdx.x].status = SCAN_TILE_OOB;
- }
-}
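-
-/*
- * Layout illustration for the kernel above (not in the original): with
- * TILE_STATUS_PADDING = 32 (one warp) and, say, num_tiles = 4, entries
- * d_tile_status[0..31] are marked SCAN_TILE_OOB padding and entries
- * d_tile_status[32..35] are marked SCAN_TILE_INVALID, presumably so that the
- * scan kernel can look a full warp behind any tile without bounds checks.
- */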
-
-
-/**
- * Scan kernel entry point (multi-block)
- */
-template <
- typename BlockScanTilesPolicy, ///< Tuning policy for cub::BlockScanTiles abstraction
- typename InputIteratorRA, ///< Random-access iterator type for input (may be a simple pointer type)
- typename OutputIteratorRA, ///< Random-access iterator type for output (may be a simple pointer type)
- typename T, ///< The scan data type
- typename ScanOp, ///< Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- typename Identity, ///< Identity value type (cub::NullType for inclusive scans)
- typename SizeT> ///< Integer type used for global array indexing
-__launch_bounds__ (int(BlockScanTilesPolicy::BLOCK_THREADS))
-__global__ void ScanKernel(
- InputIteratorRA d_in, ///< Input data
- OutputIteratorRA d_out, ///< Output data
- ScanTileDescriptor<T> *d_tile_status, ///< Global list of tile status
- ScanOp scan_op, ///< Binary scan operator
- Identity identity, ///< Identity element
- SizeT num_items, ///< Total number of scan items for the entire problem
- GridQueue<int> queue) ///< Descriptor for performing dynamic mapping of tile data to thread blocks
-{
- enum
- {
- TILE_STATUS_PADDING = PtxArchProps::WARP_THREADS,
- };
-
- // Thread block type for scanning input tiles
- typedef BlockScanTiles<
- BlockScanTilesPolicy,
- InputIteratorRA,
- OutputIteratorRA,
- ScanOp,
- Identity,
- SizeT> BlockScanTilesT;
-
- // Shared memory for BlockScanTiles
- __shared__ typename BlockScanTilesT::TempStorage temp_storage;
-
- // Process tiles
- BlockScanTilesT(temp_storage, d_in, d_out, scan_op, identity).ConsumeTiles(
- num_items,
- queue,
- d_tile_status + TILE_STATUS_PADDING);
-}
-
-
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
-
-
-/******************************************************************************
- * DeviceScan
- *****************************************************************************/
-
-/**
- * \brief DeviceScan provides operations for computing a device-wide, parallel prefix scan across data items residing within global memory. ![](device_scan.png)
- * \ingroup DeviceModule
- *
- * \par Overview
- * Given a list of input elements and a binary reduction operator, a [<em>prefix scan</em>](http://en.wikipedia.org/wiki/Prefix_sum)
- * produces an output list where each element is computed to be the reduction
- * of the elements occurring earlier in the input list. <em>Prefix sum</em>
- * connotes a prefix scan with the addition operator. The term \em inclusive indicates
- * that the <em>i</em><sup>th</sup> output reduction incorporates the <em>i</em><sup>th</sup> input.
- * The term \em exclusive indicates the <em>i</em><sup>th</sup> input is not incorporated into
- * the <em>i</em><sup>th</sup> output reduction.
- *
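- * \par
- * For example, given the input list <tt>[8, 6, 7, 5]</tt> and the addition
- * operator, an inclusive prefix sum produces <tt>[8, 14, 21, 26]</tt>, whereas an
- * exclusive prefix sum (with identity 0) produces <tt>[0, 8, 14, 21]</tt>.
- *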
- * \par Usage Considerations
- * \cdp_class{DeviceScan}
- *
- * \par Performance
- *
- * \image html scan_perf.png
- *
- */
-struct DeviceScan
-{
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
- /******************************************************************************
- * Constants and typedefs
- ******************************************************************************/
-
- /// Generic structure for encapsulating dispatch properties. Mirrors the constants within BlockScanTilesPolicy.
- struct KernelDispachParams
- {
- // Policy fields
- int block_threads;
- int items_per_thread;
- BlockLoadAlgorithm load_policy;
- BlockStoreAlgorithm store_policy;
- BlockScanAlgorithm scan_algorithm;
-
- // Other misc
- int tile_size;
-
- template <typename BlockScanTilesPolicy>
- __host__ __device__ __forceinline__
- void Init()
- {
- block_threads = BlockScanTilesPolicy::BLOCK_THREADS;
- items_per_thread = BlockScanTilesPolicy::ITEMS_PER_THREAD;
- load_policy = BlockScanTilesPolicy::LOAD_ALGORITHM;
- store_policy = BlockScanTilesPolicy::STORE_ALGORITHM;
- scan_algorithm = BlockScanTilesPolicy::SCAN_ALGORITHM;
-
- tile_size = block_threads * items_per_thread;
- }
-
- __host__ __device__ __forceinline__
- void Print()
- {
- printf("%d, %d, %d, %d, %d",
- block_threads,
- items_per_thread,
- load_policy,
- store_policy,
- scan_algorithm);
- }
-
- };
-
-
- /******************************************************************************
- * Tuning policies
- ******************************************************************************/
-
-
- /// Specializations of tuned policy types for different PTX architectures
- template <
- typename T,
- typename SizeT,
- int ARCH>
- struct TunedPolicies;
-
- /// SM35 tune
- template <typename T, typename SizeT>
- struct TunedPolicies<T, SizeT, 350>
- {
- enum {
- NOMINAL_4B_ITEMS_PER_THREAD = 16,
- ITEMS_PER_THREAD = CUB_MIN(NOMINAL_4B_ITEMS_PER_THREAD, CUB_MAX(1, (NOMINAL_4B_ITEMS_PER_THREAD * 4 / sizeof(T)))),
- };
-
- // ScanPolicy: GTX Titan: 29.1B items/s (232.4 GB/s) @ 48M 32-bit T
- typedef BlockScanTilesPolicy<128, ITEMS_PER_THREAD, BLOCK_LOAD_DIRECT, false, LOAD_LDG, BLOCK_STORE_WARP_TRANSPOSE, true, BLOCK_SCAN_RAKING_MEMOIZE> ScanPolicy;
- };
-
- /// SM30 tune
- template <typename T, typename SizeT>
- struct TunedPolicies<T, SizeT, 300>
- {
- enum {
- NOMINAL_4B_ITEMS_PER_THREAD = 9,
- ITEMS_PER_THREAD = CUB_MIN(NOMINAL_4B_ITEMS_PER_THREAD, CUB_MAX(1, (NOMINAL_4B_ITEMS_PER_THREAD * 4 / sizeof(T)))),
- };
-
- typedef BlockScanTilesPolicy<256, ITEMS_PER_THREAD, BLOCK_LOAD_WARP_TRANSPOSE, false, LOAD_DEFAULT, BLOCK_STORE_WARP_TRANSPOSE, false, BLOCK_SCAN_RAKING_MEMOIZE> ScanPolicy;
- };
-
- /// SM20 tune
- template <typename T, typename SizeT>
- struct TunedPolicies<T, SizeT, 200>
- {
- enum {
- NOMINAL_4B_ITEMS_PER_THREAD = 15,
- ITEMS_PER_THREAD = CUB_MIN(NOMINAL_4B_ITEMS_PER_THREAD, CUB_MAX(1, (NOMINAL_4B_ITEMS_PER_THREAD * 4 / sizeof(T)))),
- };
-
- // ScanPolicy: GTX 580: 20.3B items/s (162.3 GB/s) @ 48M 32-bit T
- typedef BlockScanTilesPolicy<128, ITEMS_PER_THREAD, BLOCK_LOAD_WARP_TRANSPOSE, false, LOAD_DEFAULT, BLOCK_STORE_WARP_TRANSPOSE, false, BLOCK_SCAN_RAKING_MEMOIZE> ScanPolicy;
- };
-
- /// SM10 tune
- template <typename T, typename SizeT>
- struct TunedPolicies<T, SizeT, 100>
- {
- enum {
- NOMINAL_4B_ITEMS_PER_THREAD = 7,
- ITEMS_PER_THREAD = CUB_MIN(NOMINAL_4B_ITEMS_PER_THREAD, CUB_MAX(1, (NOMINAL_4B_ITEMS_PER_THREAD * 4 / sizeof(T)))),
- };
- typedef BlockScanTilesPolicy<128, ITEMS_PER_THREAD, BLOCK_LOAD_TRANSPOSE, false, LOAD_DEFAULT, BLOCK_STORE_TRANSPOSE, false, BLOCK_SCAN_RAKING> ScanPolicy;
- };
-
-
- /// Tuning policy for the PTX architecture that DeviceScan operations will get dispatched to
- template <typename T, typename SizeT>
- struct PtxDefaultPolicies
- {
- static const int PTX_TUNE_ARCH = (CUB_PTX_ARCH >= 350) ?
- 350 :
- (CUB_PTX_ARCH >= 300) ?
- 300 :
- (CUB_PTX_ARCH >= 200) ?
- 200 :
- 100;
-
- // Tuned policy set for the current PTX compiler pass
- typedef TunedPolicies<T, SizeT, PTX_TUNE_ARCH> PtxTunedPolicies;
-
- // ScanPolicy that opaquely derives from the specialization corresponding to the current PTX compiler pass
- struct ScanPolicy : PtxTunedPolicies::ScanPolicy {};
-
- /**
- * Initialize dispatch params with the policies corresponding to the PTX assembly we will use
- */
- static void InitDispatchParams(int ptx_version, KernelDispachParams &scan_dispatch_params)
- {
- if (ptx_version >= 350)
- {
- typedef TunedPolicies<T, SizeT, 350> TunedPolicies;
- scan_dispatch_params.Init<typename TunedPolicies::ScanPolicy>();
- }
- else if (ptx_version >= 300)
- {
- typedef TunedPolicies<T, SizeT, 300> TunedPolicies;
- scan_dispatch_params.Init<typename TunedPolicies::ScanPolicy>();
- }
- else if (ptx_version >= 200)
- {
- typedef TunedPolicies<T, SizeT, 200> TunedPolicies;
- scan_dispatch_params.Init<typename TunedPolicies::ScanPolicy>();
- }
- else
- {
- typedef TunedPolicies<T, SizeT, 100> TunedPolicies;
- scan_dispatch_params.Init<typename TunedPolicies::ScanPolicy>();
- }
- }
- };
-
-
- /******************************************************************************
- * Utility methods
- ******************************************************************************/
-
- /**
- * Internal dispatch routine
- */
- template <
- typename ScanInitKernelPtr, ///< Function type of cub::ScanInitKernel
- typename ScanKernelPtr, ///< Function type of cub::ScanKernel
- typename InputIteratorRA, ///< Random-access iterator type for input (may be a simple pointer type)
- typename OutputIteratorRA, ///< Random-access iterator type for output (may be a simple pointer type)
- typename ScanOp, ///< Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- typename Identity, ///< Identity value type (cub::NullType for inclusive scans)
- typename SizeT> ///< Integer type used for global array indexing
- __host__ __device__ __forceinline__
- static cudaError_t Dispatch(
- int ptx_version, ///< [in] PTX version
- void *d_temp_storage, ///< [in] %Device allocation of temporary storage. When NULL, the required allocation size is returned in \p temp_storage_bytes and no work is done.
- size_t &temp_storage_bytes, ///< [in,out] Size in bytes of \p d_temp_storage allocation.
- ScanInitKernelPtr init_kernel, ///< [in] Kernel function pointer to parameterization of cub::ScanInitKernel
- ScanKernelPtr scan_kernel, ///< [in] Kernel function pointer to parameterization of cub::ScanKernel
- KernelDispachParams &scan_dispatch_params, ///< [in] Dispatch parameters that match the policy that \p scan_kernel was compiled for
- InputIteratorRA d_in, ///< [in] Iterator pointing to scan input
- OutputIteratorRA d_out, ///< [in] Iterator pointing to scan output
- ScanOp scan_op, ///< [in] Binary scan operator
- Identity identity, ///< [in] Identity element
- SizeT num_items, ///< [in] Total number of items to scan
- cudaStream_t stream = 0, ///< [in] <b>[optional]</b> CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
- bool stream_synchronous = false) ///< [in] <b>[optional]</b> Whether or not to synchronize the stream after every kernel launch to check for errors. Default is \p false.
- {
-
-#ifndef CUB_RUNTIME_ENABLED
-
- // Kernel launch not supported from this device
- return CubDebug(cudaErrorNotSupported);
-
-#else
-
- enum
- {
- TILE_STATUS_PADDING = 32,
- INIT_KERNEL_THREADS = 128
- };
-
- // Data type
- typedef typename std::iterator_traits<InputIteratorRA>::value_type T;
-
- // Tile status descriptor type
- typedef ScanTileDescriptor<T> ScanTileDescriptorT;
-
- cudaError error = cudaSuccess;
- do
- {
- // Number of input tiles
- int num_tiles = (num_items + scan_dispatch_params.tile_size - 1) / scan_dispatch_params.tile_size;
-
- // Temporary storage allocation requirements
- void* allocations[2];
- size_t allocation_sizes[2] =
- {
- (num_tiles + TILE_STATUS_PADDING) * sizeof(ScanTileDescriptorT), // bytes needed for tile status descriptors
- GridQueue<int>::AllocationSize() // bytes needed for grid queue descriptor
- };
-
- // Alias temporaries (or set the necessary size of the storage allocation)
- if (CubDebug(error = AliasTemporaries(d_temp_storage, temp_storage_bytes, allocations, allocation_sizes))) break;
-
- // Return if the caller is simply requesting the size of the storage allocation
- if (d_temp_storage == NULL)
- return cudaSuccess;
-
- // Global list of tile status
- ScanTileDescriptorT *d_tile_status = (ScanTileDescriptorT*) allocations[0];
-
- // Grid queue descriptor
- GridQueue<int> queue(allocations[1]);
-
- // Log init_kernel configuration
- int init_grid_size = (num_tiles + INIT_KERNEL_THREADS - 1) / INIT_KERNEL_THREADS;
- if (stream_synchronous) CubLog("Invoking init_kernel<<<%d, %d, 0, %lld>>>()\n", init_grid_size, INIT_KERNEL_THREADS, (long long) stream);
-
- // Invoke init_kernel to initialize tile descriptors and queue descriptors
- init_kernel<<<init_grid_size, INIT_KERNEL_THREADS, 0, stream>>>(
- queue,
- d_tile_status,
- num_tiles);
-
- // Sync the stream if specified
- if (stream_synchronous && (CubDebug(error = SyncStream(stream)))) break;
-
- // Get grid size for multi-block kernel
- int scan_grid_size;
- int multi_sm_occupancy = -1;
- if (ptx_version < 200)
- {
- // We don't have atomics (or don't have fast ones), so just assign one
- // block per tile (limited to 65K tiles)
- scan_grid_size = num_tiles;
- }
- else
- {
- // We have atomics and can thus reuse blocks across multiple tiles using a queue descriptor.
- // Get GPU id
- int device_ordinal;
- if (CubDebug(error = cudaGetDevice(&device_ordinal))) break;
-
- // Get SM count
- int sm_count;
- if (CubDebug(error = cudaDeviceGetAttribute (&sm_count, cudaDevAttrMultiProcessorCount, device_ordinal))) break;
-
- // Get a rough estimate of scan_kernel SM occupancy based upon the maximum SM occupancy of the targeted PTX architecture
- multi_sm_occupancy = CUB_MIN(
- ArchProps<CUB_PTX_ARCH>::MAX_SM_THREADBLOCKS,
- ArchProps<CUB_PTX_ARCH>::MAX_SM_THREADS / scan_dispatch_params.block_threads);
-
-#ifndef __CUDA_ARCH__
-            // We're on the host, so refine the occupancy estimate using the actual device properties
- Device device_props;
- if (CubDebug(error = device_props.Init(device_ordinal))) break;
-
- if (CubDebug(error = device_props.MaxSmOccupancy(
- multi_sm_occupancy,
- scan_kernel,
- scan_dispatch_params.block_threads))) break;
-#endif
- // Get device occupancy for scan_kernel
- int scan_occupancy = multi_sm_occupancy * sm_count;
-
- // Get grid size for scan_kernel
- scan_grid_size = (num_tiles < scan_occupancy) ?
- num_tiles : // Not enough to fill the device with threadblocks
- scan_occupancy; // Fill the device with threadblocks
- }
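-
-            // Illustration of the sizing above (assumed figures): with 15 SMs and an
-            // estimated occupancy of 8 blocks per SM, scan_occupancy = 120, so a
-            // 500-tile input is processed by a 120-block grid that drains the shared
-            // GridQueue, while a 64-tile input simply gets one block per tile.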
-
- // Log scan_kernel configuration
- if (stream_synchronous) CubLog("Invoking scan_kernel<<<%d, %d, 0, %lld>>>(), %d items per thread, %d SM occupancy\n",
- scan_grid_size, scan_dispatch_params.block_threads, (long long) stream, scan_dispatch_params.items_per_thread, multi_sm_occupancy);
-
- // Invoke scan_kernel
- scan_kernel<<<scan_grid_size, scan_dispatch_params.block_threads, 0, stream>>>(
- d_in,
- d_out,
- d_tile_status,
- scan_op,
- identity,
- num_items,
- queue);
-
- // Sync the stream if specified
- if (stream_synchronous && (CubDebug(error = SyncStream(stream)))) break;
- }
- while (0);
-
- return error;
-
-#endif // CUB_RUNTIME_ENABLED
- }
-
-
-
- /**
- * Internal scan dispatch routine for using default tuning policies
- */
- template <
- typename InputIteratorRA, ///< Random-access iterator type for input (may be a simple pointer type)
- typename OutputIteratorRA, ///< Random-access iterator type for output (may be a simple pointer type)
- typename ScanOp, ///< Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- typename Identity, ///< Identity value type (cub::NullType for inclusive scans)
- typename SizeT> ///< Integer type used for global array indexing
- __host__ __device__ __forceinline__
- static cudaError_t Dispatch(
- void *d_temp_storage, ///< [in] %Device allocation of temporary storage. When NULL, the required allocation size is returned in \p temp_storage_bytes and no work is done.
- size_t &temp_storage_bytes, ///< [in,out] Size in bytes of \p d_temp_storage allocation.
- InputIteratorRA d_in, ///< [in] Iterator pointing to scan input
- OutputIteratorRA d_out, ///< [in] Iterator pointing to scan output
- ScanOp scan_op, ///< [in] Binary scan operator
- Identity identity, ///< [in] Identity element
- SizeT num_items, ///< [in] Total number of items to scan
- cudaStream_t stream = 0, ///< [in] <b>[optional]</b> CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
- bool stream_synchronous = false) ///< [in] <b>[optional]</b> Whether or not to synchronize the stream after every kernel launch to check for errors. Default is \p false.
- {
- // Data type
- typedef typename std::iterator_traits<InputIteratorRA>::value_type T;
-
-        // Tuning policies
- typedef PtxDefaultPolicies<T, SizeT> PtxDefaultPolicies; // Wrapper of default kernel policies
- typedef typename PtxDefaultPolicies::ScanPolicy ScanPolicy; // Scan kernel policy
-
- cudaError error = cudaSuccess;
- do
- {
- // Declare dispatch parameters
- KernelDispachParams scan_dispatch_params;
-
- int ptx_version;
-#ifdef __CUDA_ARCH__
- // We're on the device, so initialize the dispatch parameters with the PtxDefaultPolicies directly
- scan_dispatch_params.Init<ScanPolicy>();
- ptx_version = CUB_PTX_ARCH;
-#else
- // We're on the host, so lookup and initialize the dispatch parameters with the policies that match the device's PTX version
- if (CubDebug(error = PtxVersion(ptx_version))) break;
- PtxDefaultPolicies::InitDispatchParams(ptx_version, scan_dispatch_params);
-#endif
-
- Dispatch(
- ptx_version,
- d_temp_storage,
- temp_storage_bytes,
- ScanInitKernel<T, SizeT>,
- ScanKernel<ScanPolicy, InputIteratorRA, OutputIteratorRA, T, ScanOp, Identity, SizeT>,
- scan_dispatch_params,
- d_in,
- d_out,
- scan_op,
- identity,
- num_items,
- stream,
- stream_synchronous);
-
- if (CubDebug(error)) break;
- }
- while (0);
-
- return error;
- }
-
- #endif // DOXYGEN_SHOULD_SKIP_THIS
-
-
- /******************************************************************//**
- * \name Exclusive scans
- *********************************************************************/
- //@{
-
- /**
- * \brief Computes a device-wide exclusive prefix sum.
- *
- * \devicestorage
- *
- * \cdp
- *
- * \iterator
- *
- * \par
- * The code snippet below illustrates the exclusive prefix sum of a device vector of \p int items.
- * \par
- * \code
- * #include <cub/cub.cuh>
- * ...
- *
- * // Declare and initialize device pointers for input and output
- * int *d_scan_input, *d_scan_output;
- * int num_items = ...
- *
- * ...
- *
- * // Determine temporary device storage requirements for exclusive prefix sum
- * void *d_temp_storage = NULL;
- * size_t temp_storage_bytes = 0;
- * cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes, d_scan_input, d_scan_output, num_items);
- *
- * // Allocate temporary storage for exclusive prefix sum
- * cudaMalloc(&d_temp_storage, temp_storage_bytes);
- *
- * // Run exclusive prefix sum
- * cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes, d_scan_input, d_scan_output, num_items);
- *
- * \endcode
- *
- * \tparam InputIteratorRA <b>[inferred]</b> Random-access iterator type for input (may be a simple pointer type)
- * \tparam OutputIteratorRA <b>[inferred]</b> Random-access iterator type for output (may be a simple pointer type)
- */
- template <
- typename InputIteratorRA,
- typename OutputIteratorRA>
- __host__ __device__ __forceinline__
- static cudaError_t ExclusiveSum(
- void *d_temp_storage, ///< [in] %Device allocation of temporary storage. When NULL, the required allocation size is returned in \p temp_storage_bytes and no work is done.
- size_t &temp_storage_bytes, ///< [in,out] Size in bytes of \p d_temp_storage allocation.
- InputIteratorRA d_in, ///< [in] Iterator pointing to scan input
- OutputIteratorRA d_out, ///< [in] Iterator pointing to scan output
- int num_items, ///< [in] Total number of items to scan
- cudaStream_t stream = 0, ///< [in] <b>[optional]</b> CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
- bool stream_synchronous = false) ///< [in] <b>[optional]</b> Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is \p false.
- {
- typedef typename std::iterator_traits<InputIteratorRA>::value_type T;
- return Dispatch(d_temp_storage, temp_storage_bytes, d_in, d_out, Sum(), T(), num_items, stream, stream_synchronous);
- }
-
-
- /**
- * \brief Computes a device-wide exclusive prefix scan using the specified binary \p scan_op functor.
- *
- * \par
- * Supports non-commutative scan operators.
- *
- * \devicestorage
- *
- * \cdp
- *
- * \iterator
- *
- * \par
- * The code snippet below illustrates the exclusive prefix scan of a device vector of \p int items.
- * \par
- * \code
- * #include <cub/cub.cuh>
- * ...
- *
- * // Declare and initialize device pointers for input and output
- * int *d_scan_input, *d_scan_output;
- * int num_items = ...
- *
- * ...
- *
- * // Determine temporary device storage requirements for exclusive prefix scan
- * void *d_temp_storage = NULL;
- * size_t temp_storage_bytes = 0;
-     * cub::DeviceScan::ExclusiveScan(d_temp_storage, temp_storage_bytes, d_scan_input, d_scan_output, cub::Max(), INT_MIN, num_items);
- *
- * // Allocate temporary storage for exclusive prefix scan
- * cudaMalloc(&d_temp_storage, temp_storage_bytes);
- *
- * // Run exclusive prefix scan (max)
-     * cub::DeviceScan::ExclusiveScan(d_temp_storage, temp_storage_bytes, d_scan_input, d_scan_output, cub::Max(), INT_MIN, num_items);
- *
- * \endcode
- *
- * \tparam InputIteratorRA <b>[inferred]</b> Random-access iterator type for input (may be a simple pointer type)
- * \tparam OutputIteratorRA <b>[inferred]</b> Random-access iterator type for output (may be a simple pointer type)
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- * \tparam Identity <b>[inferred]</b> Type of the \p identity value used to seed the exclusive scan
- */
- template <
- typename InputIteratorRA,
- typename OutputIteratorRA,
- typename ScanOp,
- typename Identity>
- __host__ __device__ __forceinline__
- static cudaError_t ExclusiveScan(
- void *d_temp_storage, ///< [in] %Device allocation of temporary storage. When NULL, the required allocation size is returned in \p temp_storage_bytes and no work is done.
- size_t &temp_storage_bytes, ///< [in,out] Size in bytes of \p d_temp_storage allocation.
- InputIteratorRA d_in, ///< [in] Iterator pointing to scan input
- OutputIteratorRA d_out, ///< [in] Iterator pointing to scan output
- ScanOp scan_op, ///< [in] Binary scan operator
- Identity identity, ///< [in] Identity element
- int num_items, ///< [in] Total number of items to scan
- cudaStream_t stream = 0, ///< [in] <b>[optional]</b> CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
- bool stream_synchronous = false) ///< [in] <b>[optional]</b> Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is \p false.
- {
- return Dispatch(d_temp_storage, temp_storage_bytes, d_in, d_out, scan_op, identity, num_items, stream, stream_synchronous);
- }
-
-
- //@} end member group
- /******************************************************************//**
- * \name Inclusive scans
- *********************************************************************/
- //@{
-
-
- /**
- * \brief Computes a device-wide inclusive prefix sum.
- *
- * \devicestorage
- *
- * \cdp
- *
- * \iterator
- *
- * \par
- * The code snippet below illustrates the inclusive prefix sum of a device vector of \p int items.
- * \par
- * \code
- * #include <cub/cub.cuh>
- * ...
- *
- * // Declare and initialize device pointers for input and output
- * int *d_scan_input, *d_scan_output;
- * int num_items = ...
- * ...
- *
- * // Determine temporary device storage requirements for inclusive prefix sum
- * void *d_temp_storage = NULL;
- * size_t temp_storage_bytes = 0;
- * cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, d_scan_input, d_scan_output, num_items);
- *
- * // Allocate temporary storage for inclusive prefix sum
- * cudaMalloc(&d_temp_storage, temp_storage_bytes);
- *
- * // Run inclusive prefix sum
- * cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, d_scan_input, d_scan_output, num_items);
- *
- * \endcode
- *
- * \tparam InputIteratorRA <b>[inferred]</b> Random-access iterator type for input (may be a simple pointer type)
- * \tparam OutputIteratorRA <b>[inferred]</b> Random-access iterator type for output (may be a simple pointer type)
- */
- template <
- typename InputIteratorRA,
- typename OutputIteratorRA>
- __host__ __device__ __forceinline__
- static cudaError_t InclusiveSum(
- void *d_temp_storage, ///< [in] %Device allocation of temporary storage. When NULL, the required allocation size is returned in \p temp_storage_bytes and no work is done.
- size_t &temp_storage_bytes, ///< [in,out] Size in bytes of \p d_temp_storage allocation.
- InputIteratorRA d_in, ///< [in] Iterator pointing to scan input
- OutputIteratorRA d_out, ///< [in] Iterator pointing to scan output
- int num_items, ///< [in] Total number of items to scan
- cudaStream_t stream = 0, ///< [in] <b>[optional]</b> CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
- bool stream_synchronous = false) ///< [in] <b>[optional]</b> Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is \p false.
- {
- return Dispatch(d_temp_storage, temp_storage_bytes, d_in, d_out, Sum(), NullType(), num_items, stream, stream_synchronous);
- }
-
-
- /**
- * \brief Computes a device-wide inclusive prefix scan using the specified binary \p scan_op functor.
- *
- * \par
- * Supports non-commutative scan operators.
- *
- * \devicestorage
- *
- * \cdp
- *
- * \iterator
- *
- * \par
- * The code snippet below illustrates the inclusive prefix scan of a device vector of \p int items.
- * \par
- * \code
- * #include <cub/cub.cuh>
- * ...
- *
- * // Declare and initialize device pointers for input and output
- * int *d_scan_input, *d_scan_output;
- * int num_items = ...
- * ...
- *
- * // Determine temporary device storage requirements for inclusive prefix scan
- * void *d_temp_storage = NULL;
- * size_t temp_storage_bytes = 0;
- * cub::DeviceScan::InclusiveScan(d_temp_storage, temp_storage_bytes, d_scan_input, d_scan_output, cub::Max(), num_items);
- *
- * // Allocate temporary storage for inclusive prefix scan
- * cudaMalloc(&d_temp_storage, temp_storage_bytes);
- *
- * // Run inclusive prefix scan (max)
- * cub::DeviceScan::InclusiveScan(d_temp_storage, temp_storage_bytes, d_scan_input, d_scan_output, cub::Max(), num_items);
- *
- * \endcode
- *
- * \tparam InputIteratorRA <b>[inferred]</b> Random-access iterator type for input (may be a simple pointer type)
- * \tparam OutputIteratorRA <b>[inferred]</b> Random-access iterator type for output (may be a simple pointer type)
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
- template <
- typename InputIteratorRA,
- typename OutputIteratorRA,
- typename ScanOp>
- __host__ __device__ __forceinline__
- static cudaError_t InclusiveScan(
- void *d_temp_storage, ///< [in] %Device allocation of temporary storage. When NULL, the required allocation size is returned in \p temp_storage_bytes and no work is done.
- size_t &temp_storage_bytes, ///< [in,out] Size in bytes of \p d_temp_storage allocation.
- InputIteratorRA d_in, ///< [in] Iterator pointing to scan input
- OutputIteratorRA d_out, ///< [in] Iterator pointing to scan output
- ScanOp scan_op, ///< [in] Binary scan operator
- int num_items, ///< [in] Total number of items to scan
- cudaStream_t stream = 0, ///< [in] <b>[optional]</b> CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
- bool stream_synchronous = false) ///< [in] <b>[optional]</b> Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is \p false.
- {
- return Dispatch(d_temp_storage, temp_storage_bytes, d_in, d_out, scan_op, NullType(), num_items, stream, stream_synchronous);
- }
-
-};
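ExclusiveScan and InclusiveScan accept any functor exposing <tt>T operator()(const T &a, const T &b)</tt>. A hedged sketch of what such a user-defined operator might look like (the MinOp name and the commented call are illustrative, not library API):

#include <cub/cub.cuh>

// Hypothetical user-defined scan operator: running minimum
struct MinOp
{
    template <typename T>
    __host__ __device__ __forceinline__ T operator()(const T &a, const T &b)
    {
        return (b < a) ? b : a;
    }
};

// Usage sketch, following the temp-storage query/allocate pattern shown above:
//   cub::DeviceScan::ExclusiveScan(d_temp_storage, temp_storage_bytes,
//                                  d_scan_input, d_scan_output, MinOp(), INT_MAX, num_items);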
-
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
-
diff --git a/lib/kokkos/TPL/cub/grid/grid_barrier.cuh b/lib/kokkos/TPL/cub/grid/grid_barrier.cuh
deleted file mode 100755
index ebdc4b552..000000000
--- a/lib/kokkos/TPL/cub/grid/grid_barrier.cuh
+++ /dev/null
@@ -1,211 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * cub::GridBarrier implements a software global barrier among thread blocks within a CUDA grid
- */
-
-#pragma once
-
-#include "../util_debug.cuh"
-#include "../util_namespace.cuh"
-#include "../thread/thread_load.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/**
- * \addtogroup GridModule
- * @{
- */
-
-
-/**
- * \brief GridBarrier implements a software global barrier among thread blocks within a CUDA grid
- */
-class GridBarrier
-{
-protected :
-
- typedef unsigned int SyncFlag;
-
- // Counters in global device memory
- SyncFlag* d_sync;
-
-public:
-
- /**
- * Constructor
- */
- GridBarrier() : d_sync(NULL) {}
-
-
- /**
- * Synchronize
- */
- __device__ __forceinline__ void Sync() const
- {
- volatile SyncFlag *d_vol_sync = d_sync;
-
- // Threadfence and syncthreads to make sure global writes are visible before
- // thread-0 reports in with its sync counter
- __threadfence();
- __syncthreads();
-
- if (blockIdx.x == 0)
- {
- // Report in ourselves
- if (threadIdx.x == 0)
- {
- d_vol_sync[blockIdx.x] = 1;
- }
-
- __syncthreads();
-
- // Wait for everyone else to report in
- for (int peer_block = threadIdx.x; peer_block < gridDim.x; peer_block += blockDim.x)
- {
- while (ThreadLoad<LOAD_CG>(d_sync + peer_block) == 0)
- {
- __threadfence_block();
- }
- }
-
- __syncthreads();
-
- // Let everyone know it's safe to proceed
- for (int peer_block = threadIdx.x; peer_block < gridDim.x; peer_block += blockDim.x)
- {
- d_vol_sync[peer_block] = 0;
- }
- }
- else
- {
- if (threadIdx.x == 0)
- {
- // Report in
- d_vol_sync[blockIdx.x] = 1;
-
- // Wait for acknowledgment
- while (ThreadLoad<LOAD_CG>(d_sync + blockIdx.x) == 1)
- {
- __threadfence_block();
- }
- }
-
- __syncthreads();
- }
- }
-};
-
-
-/**
- * \brief GridBarrierLifetime extends GridBarrier to provide lifetime management of the temporary device storage needed for cooperation.
- *
- * Uses RAII for lifetime, i.e., device resources are reclaimed when
- * the destructor is called.
- */
-class GridBarrierLifetime : public GridBarrier
-{
-protected:
-
- // Number of bytes backed by d_sync
- size_t sync_bytes;
-
-public:
-
- /**
- * Constructor
- */
- GridBarrierLifetime() : GridBarrier(), sync_bytes(0) {}
-
-
- /**
- * DeviceFrees and resets the progress counters
- */
- cudaError_t HostReset()
- {
- cudaError_t retval = cudaSuccess;
- if (d_sync)
- {
- CubDebug(retval = cudaFree(d_sync));
- d_sync = NULL;
- }
- sync_bytes = 0;
- return retval;
- }
-
-
- /**
- * Destructor
- */
- virtual ~GridBarrierLifetime()
- {
- HostReset();
- }
-
-
- /**
- * Sets up the progress counters for the next kernel launch (lazily
- * allocating and initializing them if necessary)
- */
- cudaError_t Setup(int sweep_grid_size)
- {
- cudaError_t retval = cudaSuccess;
- do {
- size_t new_sync_bytes = sweep_grid_size * sizeof(SyncFlag);
- if (new_sync_bytes > sync_bytes)
- {
- if (d_sync)
- {
- if (CubDebug(retval = cudaFree(d_sync))) break;
- }
-
- sync_bytes = new_sync_bytes;
-
- // Allocate and initialize to zero
- if (CubDebug(retval = cudaMalloc((void**) &d_sync, sync_bytes))) break;
- if (CubDebug(retval = cudaMemset(d_sync, 0, new_sync_bytes))) break;
- }
- } while (0);
-
- return retval;
- }
-};
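A hedged sketch of how these two classes are typically combined (kernel, function, and buffer names are illustrative): the host owns a GridBarrierLifetime, sizes it for the launch with Setup(), and passes it, sliced to GridBarrier, into a kernel that calls Sync() between two grid-wide phases. The pattern is only valid when every thread block of the launch can be resident on the device simultaneously.

#include <cub/cub.cuh>

// Illustrative kernel: phase 1 publishes a per-block value, phase 2 reads a neighbor's
__global__ void TwoPhaseKernel(cub::GridBarrier barrier, int *d_block_results)
{
    // Phase 1: one thread per block writes this block's slot
    if (threadIdx.x == 0)
        d_block_results[blockIdx.x] = blockIdx.x * 2;

    // Grid-wide software barrier (only safe if every block is co-resident on the device)
    barrier.Sync();

    // Phase 2: any block may now read any other block's result
    int neighbor = d_block_results[(blockIdx.x + 1) % gridDim.x];
    (void) neighbor;
}

// Host-side sketch (error checking omitted)
void LaunchTwoPhase(int grid_size)
{
    int *d_block_results;
    cudaMalloc(&d_block_results, grid_size * sizeof(int));

    cub::GridBarrierLifetime barrier;
    barrier.Setup(grid_size);                        // lazily allocates and zeroes the sync counters

    TwoPhaseKernel<<<grid_size, 256>>>(barrier, d_block_results);
    cudaDeviceSynchronize();
    cudaFree(d_block_results);
}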
-
-
-/** @} */ // end group GridModule
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
diff --git a/lib/kokkos/TPL/cub/grid/grid_even_share.cuh b/lib/kokkos/TPL/cub/grid/grid_even_share.cuh
deleted file mode 100755
index defe9e0a6..000000000
--- a/lib/kokkos/TPL/cub/grid/grid_even_share.cuh
+++ /dev/null
@@ -1,197 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * cub::GridEvenShare is a descriptor utility for distributing input among CUDA threadblocks in an "even-share" fashion. Each threadblock gets roughly the same number of fixed-size work units (grains).
- */
-
-
-#pragma once
-
-#include "../util_namespace.cuh"
-#include "../util_macro.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/**
- * \addtogroup GridModule
- * @{
- */
-
-
-/**
- * \brief GridEvenShare is a descriptor utility for distributing input among CUDA threadblocks in an "even-share" fashion. Each threadblock gets roughly the same number of fixed-size work units (grains).
- *
- * \par Overview
- * GridEvenShare indicates which sections of input are to be mapped onto which threadblocks.
- * Threadblocks may receive one of three different amounts of work: "big", "normal",
- * and "last". The "big" workloads are one scheduling grain larger than "normal". The "last" work unit
- * for the last threadblock may be partially-full if the input is not an even multiple of
- * the scheduling grain size.
- *
- * \par
- * Before invoking a child grid, a parent thread will typically construct and initialize an instance of
- * GridEvenShare using \p GridInit(). The instance can be passed to child threadblocks which can
- * initialize their per-threadblock offsets using \p BlockInit().
- *
- * \tparam SizeT Integer type for array indexing
- */
-template <typename SizeT>
-class GridEvenShare
-{
-private:
-
- SizeT total_grains;
- int big_blocks;
- SizeT big_share;
- SizeT normal_share;
- SizeT normal_base_offset;
-
-
-public:
-
- /// Total number of input items
- SizeT num_items;
-
- /// Grid size in threadblocks
- int grid_size;
-
- /// Offset into input marking the beginning of the owning thread block's segment of input tiles
- SizeT block_offset;
-
- /// Offset into input marking the end (one-past) of the owning thread block's segment of input tiles
- SizeT block_oob;
-
- /**
- * \brief Block-based constructor for single-block grids.
- */
- __device__ __forceinline__ GridEvenShare(SizeT num_items) :
- num_items(num_items),
- grid_size(1),
- block_offset(0),
- block_oob(num_items) {}
-
-
- /**
- * \brief Default constructor. Zero-initializes block-specific fields.
- */
- __host__ __device__ __forceinline__ GridEvenShare() :
- num_items(0),
- grid_size(0),
- block_offset(0),
- block_oob(0) {}
-
-
- /**
- * \brief Initializes the grid-specific members \p num_items and \p grid_size. To be called prior to kernel launch.
- */
- __host__ __device__ __forceinline__ void GridInit(
- SizeT num_items, ///< Total number of input items
- int max_grid_size, ///< Maximum grid size allowable (actual grid size may be less if not warranted by the number of input items)
- int schedule_granularity) ///< Granularity by which the input can be parcelled out and distributed among threadblocks. Usually the thread block's native tile size (or a multiple thereof).
- {
- this->num_items = num_items;
- this->block_offset = 0;
- this->block_oob = 0;
- this->total_grains = (num_items + schedule_granularity - 1) / schedule_granularity;
- this->grid_size = CUB_MIN(total_grains, max_grid_size);
- SizeT grains_per_block = total_grains / grid_size;
- this->big_blocks = total_grains - (grains_per_block * grid_size); // leftover grains go to big blocks
- this->normal_share = grains_per_block * schedule_granularity;
- this->normal_base_offset = big_blocks * schedule_granularity;
- this->big_share = normal_share + schedule_granularity;
- }
-
-
- /**
- * \brief Initializes the threadblock-specific details (i.e., to be called by each threadblock after startup)
- */
- __device__ __forceinline__ void BlockInit()
- {
- if (blockIdx.x < big_blocks)
- {
- // This threadblock gets a big share of grains (grains_per_block + 1)
- block_offset = (blockIdx.x * big_share);
- block_oob = block_offset + big_share;
- }
- else if (blockIdx.x < total_grains)
- {
- // This threadblock gets a normal share of grains (grains_per_block)
- block_offset = normal_base_offset + (blockIdx.x * normal_share);
- block_oob = block_offset + normal_share;
- }
-
- // Last threadblock
- if (blockIdx.x == grid_size - 1)
- {
- block_oob = num_items;
- }
- }
-
-
- /**
- * Print to stdout
- */
- __host__ __device__ __forceinline__ void Print()
- {
- printf(
-#ifdef __CUDA_ARCH__
- "\tthreadblock(%d) "
- "block_offset(%lu) "
- "block_oob(%lu) "
-#endif
- "num_items(%lu) "
- "total_grains(%lu) "
- "big_blocks(%lu) "
- "big_share(%lu) "
- "normal_share(%lu)\n",
-#ifdef __CUDA_ARCH__
- blockIdx.x,
- (unsigned long) block_offset,
- (unsigned long) block_oob,
-#endif
- (unsigned long) num_items,
- (unsigned long) total_grains,
- (unsigned long) big_blocks,
- (unsigned long) big_share,
- (unsigned long) normal_share);
- }
-};
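A hedged usage sketch (kernel and tile-size names are illustrative): the host fills in the grid-wide fields with GridInit(), each thread block derives its own range with BlockInit(), and then walks that range in tile-sized steps.

#include <cub/cub.cuh>

template <int TILE_ITEMS>
__global__ void EvenShareKernel(cub::GridEvenShare<int> even_share, const int *d_in)
{
    // Derive this block's [block_offset, block_oob) range of input offsets
    even_share.BlockInit();

    for (int offset = even_share.block_offset; offset < even_share.block_oob; offset += TILE_ITEMS)
    {
        // ... consume one tile of up to TILE_ITEMS items starting at d_in[offset] ...
        // (only the last block's last tile can be partially full)
    }
}

// Host-side sketch (error checking omitted)
void LaunchEvenShare(const int *d_in, int num_items, int max_grid_size)
{
    const int TILE_ITEMS = 256;

    cub::GridEvenShare<int> even_share;
    even_share.GridInit(num_items, max_grid_size, TILE_ITEMS);   // partition input into grains of TILE_ITEMS

    EvenShareKernel<TILE_ITEMS><<<even_share.grid_size, 256>>>(even_share, d_in);
    cudaDeviceSynchronize();
}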
-
-
-
-/** @} */ // end group GridModule
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
diff --git a/lib/kokkos/TPL/cub/grid/grid_mapping.cuh b/lib/kokkos/TPL/cub/grid/grid_mapping.cuh
deleted file mode 100755
index 419f9ac0e..000000000
--- a/lib/kokkos/TPL/cub/grid/grid_mapping.cuh
+++ /dev/null
@@ -1,95 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * cub::GridMappingStrategy enumerates alternative strategies for mapping constant-sized tiles of device-wide data onto a grid of CUDA thread blocks.
- */
-
-#pragma once
-
-#include "../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/**
- * \addtogroup GridModule
- * @{
- */
-
-
-/******************************************************************************
- * Mapping policies
- *****************************************************************************/
-
-
-/**
- * \brief cub::GridMappingStrategy enumerates alternative strategies for mapping constant-sized tiles of device-wide data onto a grid of CUDA thread blocks.
- */
-enum GridMappingStrategy
-{
- /**
- * \brief An "even-share" strategy for assigning input tiles to thread blocks.
- *
- * \par Overview
- * The input is evenly partitioned into \p p segments, where \p p is
- * constant and corresponds loosely to the number of thread blocks that may
- * actively reside on the target device. Each segment is comprised of
- * consecutive tiles, where a tile is a small, constant-sized unit of input
- * to be processed to completion before the thread block terminates or
- * obtains more work. The kernel invokes \p p thread blocks, each
- * of which iteratively consumes a segment of <em>n</em>/<em>p</em> elements
- * in tile-size increments.
- */
- GRID_MAPPING_EVEN_SHARE,
-
- /**
- * \brief A dynamic "queue-based" strategy for assigning input tiles to thread blocks.
- *
- * \par Overview
- * The input is treated as a queue to be dynamically consumed by a grid of
- * thread blocks. Work is atomically dequeued in tiles, where a tile is a
- * unit of input to be processed to completion before the thread block
- * terminates or obtains more work. The grid size \p p is constant,
- * loosely corresponding to the number of thread blocks that may actively
- * reside on the target device.
- */
- GRID_MAPPING_DYNAMIC,
-};
-
-
-/** @} */ // end group GridModule
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
diff --git a/lib/kokkos/TPL/cub/grid/grid_queue.cuh b/lib/kokkos/TPL/cub/grid/grid_queue.cuh
deleted file mode 100755
index 009260d87..000000000
--- a/lib/kokkos/TPL/cub/grid/grid_queue.cuh
+++ /dev/null
@@ -1,207 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * cub::GridQueue is a descriptor utility for dynamic queue management.
- */
-
-#pragma once
-
-#include "../util_namespace.cuh"
-#include "../util_debug.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/**
- * \addtogroup GridModule
- * @{
- */
-
-
-/**
- * \brief GridQueue is a descriptor utility for dynamic queue management.
- *
- * \par Overview
- * GridQueue descriptors provide abstractions for "filling" or
- * "draining" globally-shared vectors.
- *
- * \par
- * A "filling" GridQueue works by atomically-adding to a zero-initialized counter,
- * returning a unique offset for the calling thread to write its items.
- * The GridQueue maintains the total "fill-size". The fill counter must be reset
- * using GridQueue::ResetFill by the host or kernel instance prior to the kernel instance that
- * will be filling.
- *
- * \par
- * Similarly a "draining" GridQueue works by works by atomically-incrementing a
- * zero-initialized counter, returning a unique offset for the calling thread to
- * read its items. Threads can safely drain until the array's logical fill-size is
- * exceeded. The drain counter must be reset using GridQueue::ResetDrain or
- * GridQueue::ResetDrainAfterFill by the host or kernel instance prior to the kernel instance that
- * will be draining. (For dynamic work distribution of existing data, the corresponding fill-size
- * is simply the number of elements in the array.)
- *
- * \par
- * Iterative work management can be implemented simply with a pair of flip-flopping
- * work buffers, each with an associated set of fill and drain GridQueue descriptors.
- *
- * \tparam SizeT Integer type for array indexing
- */
-template <typename SizeT>
-class GridQueue
-{
-private:
-
- /// Counter indices
- enum
- {
- FILL = 0,
- DRAIN = 1,
- };
-
- /// Pair of counters
- SizeT *d_counters;
-
-public:
-
- /// Returns the device allocation size in bytes needed to construct a GridQueue instance
- __host__ __device__ __forceinline__
- static size_t AllocationSize()
- {
- return sizeof(SizeT) * 2;
- }
-
-
- /// Constructs an invalid GridQueue descriptor around the device storage allocation
- __host__ __device__ __forceinline__ GridQueue(
- void *d_storage) ///< Device allocation to back the GridQueue. Must be at least as big as <tt>AllocationSize()</tt>.
- :
- d_counters((SizeT*) d_storage)
- {}
-
-
- /// This operation resets the drain so that it may advance to meet the existing fill-size. To be called by the host or by a kernel prior to that which will be draining.
- __host__ __device__ __forceinline__ cudaError_t ResetDrainAfterFill(cudaStream_t stream = 0)
- {
-#ifdef __CUDA_ARCH__
- d_counters[DRAIN] = 0;
- return cudaSuccess;
-#else
- return ResetDrain(0, stream);
-#endif
- }
-
- /// This operation sets the fill-size and resets the drain counter, preparing the GridQueue for draining in the next kernel instance. To be called by the host or by a kernel prior to that which will be draining.
- __host__ __device__ __forceinline__ cudaError_t ResetDrain(
- SizeT fill_size,
- cudaStream_t stream = 0)
- {
-#ifdef __CUDA_ARCH__
- d_counters[FILL] = fill_size;
- d_counters[DRAIN] = 0;
- return cudaSuccess;
-#else
- SizeT counters[2];
- counters[FILL] = fill_size;
- counters[DRAIN] = 0;
- return CubDebug(cudaMemcpyAsync(d_counters, counters, sizeof(SizeT) * 2, cudaMemcpyHostToDevice, stream));
-#endif
- }
-
-
- /// This operation resets the fill counter. To be called by the host or by a kernel prior to that which will be filling.
- __host__ __device__ __forceinline__ cudaError_t ResetFill()
- {
-#ifdef __CUDA_ARCH__
- d_counters[FILL] = 0;
- return cudaSuccess;
-#else
- return CubDebug(cudaMemset(d_counters + FILL, 0, sizeof(SizeT)));
-#endif
- }
-
-
- /// Returns the fill-size established by the parent or by the previous kernel.
- __host__ __device__ __forceinline__ cudaError_t FillSize(
- SizeT &fill_size,
- cudaStream_t stream = 0)
- {
-#ifdef __CUDA_ARCH__
- fill_size = d_counters[FILL];
- return cudaSuccess;
-#else
- return CubDebug(cudaMemcpyAsync(&fill_size, d_counters + FILL, sizeof(SizeT), cudaMemcpyDeviceToHost, stream));
-#endif
- }
-
-
- /// Drain num_items. Returns offset from which to read items.
- __device__ __forceinline__ SizeT Drain(SizeT num_items)
- {
- return atomicAdd(d_counters + DRAIN, num_items);
- }
-
-
- /// Fill num_items. Returns offset from which to write items.
- __device__ __forceinline__ SizeT Fill(SizeT num_items)
- {
- return atomicAdd(d_counters + FILL, num_items);
- }
-};
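A hedged sketch of the dynamic drain pattern (kernel and tile-size names are illustrative): the host sizes the queue's fill counter with ResetDrain(num_items), and each block repeatedly dequeues tile-sized chunks with Drain() until the work is exhausted.

#include <cub/cub.cuh>

template <int TILE_ITEMS>
__global__ void DrainKernel(cub::GridQueue<int> queue, const int *d_in, int num_items)
{
    __shared__ int tile_offset;

    while (true)
    {
        // One thread per block atomically dequeues the next tile of work
        if (threadIdx.x == 0)
            tile_offset = queue.Drain(TILE_ITEMS);
        __syncthreads();

        if (tile_offset >= num_items)
            break;                                   // queue exhausted

        // ... process d_in[tile_offset .. min(tile_offset + TILE_ITEMS, num_items)) ...
        __syncthreads();                             // done with tile_offset before it is overwritten
    }
}

// Host-side sketch (error checking omitted)
void LaunchDrain(const int *d_in, int num_items, int grid_size)
{
    const int TILE_ITEMS = 256;

    void *d_queue_storage;
    cudaMalloc(&d_queue_storage, cub::GridQueue<int>::AllocationSize());

    cub::GridQueue<int> queue(d_queue_storage);
    queue.ResetDrain(num_items);                     // fill-size = num_items, drain counter = 0

    DrainKernel<TILE_ITEMS><<<grid_size, 256>>>(queue, d_in, num_items);
    cudaDeviceSynchronize();
    cudaFree(d_queue_storage);
}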
-
-
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
-
-/**
- * Reset grid queue (call with 1 block of 1 thread)
- */
-template <typename SizeT>
-__global__ void ResetDrainKernel(
- GridQueue<SizeT> grid_queue,
- SizeT num_items)
-{
- grid_queue.ResetDrain(num_items);
-}
-
-
-
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
-
-/** @} */ // end group GridModule
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
-
diff --git a/lib/kokkos/TPL/cub/host/spinlock.cuh b/lib/kokkos/TPL/cub/host/spinlock.cuh
deleted file mode 100755
index 5621b6f1a..000000000
--- a/lib/kokkos/TPL/cub/host/spinlock.cuh
+++ /dev/null
@@ -1,123 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * Simple x86/x64 atomic spinlock, portable across MS Windows (cl.exe) & Linux (g++)
- */
-
-
-#pragma once
-
-#if defined(_WIN32) || defined(_WIN64)
- #include <intrin.h>
- #include <windows.h>
- #undef small // Windows is terrible for polluting macro namespace
-
- /**
- * Compiler read/write barrier
- */
- #pragma intrinsic(_ReadWriteBarrier)
-
-#endif
-
-#include "../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-#if defined(_MSC_VER)
-
- // Microsoft VC++
- typedef long Spinlock;
-
-#else
-
- // GNU g++
- typedef int Spinlock;
-
- /**
- * Compiler read/write barrier
- */
- __forceinline__ void _ReadWriteBarrier()
- {
- __sync_synchronize();
- }
-
- /**
- * Atomic exchange
- */
- __forceinline__ long _InterlockedExchange(volatile int * const Target, const int Value)
- {
- // NOTE: __sync_lock_test_and_set would be an acquire barrier, so we force a full barrier
- _ReadWriteBarrier();
- return __sync_lock_test_and_set(Target, Value);
- }
-
- /**
- * Pause instruction to prevent excess processor bus usage
- */
- __forceinline__ void YieldProcessor()
- {
-#ifndef __arm__
- asm volatile("pause\n": : :"memory");
-#endif // __arm__
- }
-
-#endif // defined(_MSC_VER)
-
-/**
- * Return when the specified spinlock has been acquired
- */
-__forceinline__ void Lock(volatile Spinlock *lock)
-{
- while (1)
- {
- if (!_InterlockedExchange(lock, 1)) return;
- while (*lock) YieldProcessor();
- }
-}
-
-
-/**
- * Release the specified spinlock
- */
-__forceinline__ void Unlock(volatile Spinlock *lock)
-{
- _ReadWriteBarrier();
- *lock = 0;
-}
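A hedged host-side usage sketch (the include path and the protected counter are illustrative): a zero-initialized Spinlock guards a short critical section shared among host threads.

#include <cub/host/spinlock.cuh>

static cub::Spinlock g_lock = 0;     // zero means unlocked
static long long g_shared_count = 0; // state protected by g_lock

void IncrementShared()
{
    cub::Lock(&g_lock);              // spins (with YieldProcessor) until acquired
    ++g_shared_count;                // critical section
    cub::Unlock(&g_lock);            // compiler barrier, then release
}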
-
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
-
diff --git a/lib/kokkos/TPL/cub/thread/thread_load.cuh b/lib/kokkos/TPL/cub/thread/thread_load.cuh
deleted file mode 100755
index ee112b9d5..000000000
--- a/lib/kokkos/TPL/cub/thread/thread_load.cuh
+++ /dev/null
@@ -1,429 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * Thread utilities for reading memory using PTX cache modifiers.
- */
-
-#pragma once
-
-#include <cuda.h>
-
-#include <iterator>
-
-#include "../util_ptx.cuh"
-#include "../util_type.cuh"
-#include "../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-/**
- * \addtogroup IoModule
- * @{
- */
-
-//-----------------------------------------------------------------------------
-// Tags and constants
-//-----------------------------------------------------------------------------
-
-/**
- * \brief Enumeration of PTX cache-modifiers for memory load operations.
- */
-enum PtxLoadModifier
-{
- LOAD_DEFAULT, ///< Default (no modifier)
- LOAD_CA, ///< Cache at all levels
- LOAD_CG, ///< Cache at global level
- LOAD_CS, ///< Cache streaming (likely to be accessed once)
- LOAD_CV, ///< Cache as volatile (including cached system lines)
- LOAD_LDG, ///< Cache as texture
- LOAD_VOLATILE, ///< Volatile (any memory space)
-};
-
-
-/**
- * \name Simple I/O
- * @{
- */
-
-/**
- * \brief Thread utility for reading memory using cub::PtxLoadModifier cache modifiers.
- *
- * Cache modifiers are only in effect for built-in types (i.e., C++
- * primitives and CUDA vector-types).
- *
- * For example:
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * // 32-bit load using cache-global modifier:
- * int *d_in;
- * int val = cub::ThreadLoad<cub::LOAD_CA>(d_in + threadIdx.x);
- *
- * // 16-bit load using default modifier
- * short *d_in;
- * short val = cub::ThreadLoad<cub::LOAD_DEFAULT>(d_in + threadIdx.x);
- *
- * // 256-bit load using cache-volatile modifier
- * double4 *d_in;
- * double4 val = cub::ThreadLoad<cub::LOAD_CV>(d_in + threadIdx.x);
- *
- * // Struct load using the default cache modifier (LOAD_CS is ignored for non-primitive types)
- * struct TestFoo { bool a; short b; };
- * TestFoo *d_struct;
- * TestFoo val = cub::ThreadLoad<cub::LOAD_CS>(d_struct + threadIdx.x);
- * \endcode
- *
- */
-template <
- PtxLoadModifier MODIFIER,
- typename InputIteratorRA>
-__device__ __forceinline__ typename std::iterator_traits<InputIteratorRA>::value_type ThreadLoad(InputIteratorRA itr);
-
-
-//@} end member group
-
-
-
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
-
-/**
- * Define a int4 (16B) ThreadLoad specialization for the given PTX load modifier
- */
-#define CUB_LOAD_16(cub_modifier, ptx_modifier) \
- template<> \
- __device__ __forceinline__ int4 ThreadLoad<cub_modifier, int4*>(int4* ptr) \
- { \
- int4 retval; \
- asm volatile ("ld."#ptx_modifier".v4.s32 {%0, %1, %2, %3}, [%4];" : \
- "=r"(retval.x), \
- "=r"(retval.y), \
- "=r"(retval.z), \
- "=r"(retval.w) : \
- _CUB_ASM_PTR_(ptr)); \
- return retval; \
- } \
- template<> \
- __device__ __forceinline__ longlong2 ThreadLoad<cub_modifier, longlong2*>(longlong2* ptr) \
- { \
- longlong2 retval; \
- asm volatile ("ld."#ptx_modifier".v2.s64 {%0, %1}, [%2];" : \
- "=l"(retval.x), \
- "=l"(retval.y) : \
- _CUB_ASM_PTR_(ptr)); \
- return retval; \
- }
-
-/**
- * Define a int2 (8B) ThreadLoad specialization for the given PTX load modifier
- */
-#define CUB_LOAD_8(cub_modifier, ptx_modifier) \
- template<> \
- __device__ __forceinline__ short4 ThreadLoad<cub_modifier, short4*>(short4* ptr) \
- { \
- short4 retval; \
- asm volatile ("ld."#ptx_modifier".v4.s16 {%0, %1, %2, %3}, [%4];" : \
- "=h"(retval.x), \
- "=h"(retval.y), \
- "=h"(retval.z), \
- "=h"(retval.w) : \
- _CUB_ASM_PTR_(ptr)); \
- return retval; \
- } \
- template<> \
- __device__ __forceinline__ int2 ThreadLoad<cub_modifier, int2*>(int2* ptr) \
- { \
- int2 retval; \
- asm volatile ("ld."#ptx_modifier".v2.s32 {%0, %1}, [%2];" : \
- "=r"(retval.x), \
- "=r"(retval.y) : \
- _CUB_ASM_PTR_(ptr)); \
- return retval; \
- } \
- template<> \
- __device__ __forceinline__ long long ThreadLoad<cub_modifier, long long*>(long long* ptr) \
- { \
- long long retval; \
- asm volatile ("ld."#ptx_modifier".s64 %0, [%1];" : \
- "=l"(retval) : \
- _CUB_ASM_PTR_(ptr)); \
- return retval; \
- }
-
-/**
- * Define a int (4B) ThreadLoad specialization for the given PTX load modifier
- */
-#define CUB_LOAD_4(cub_modifier, ptx_modifier) \
- template<> \
- __device__ __forceinline__ int ThreadLoad<cub_modifier, int*>(int* ptr) \
- { \
- int retval; \
- asm volatile ("ld."#ptx_modifier".s32 %0, [%1];" : \
- "=r"(retval) : \
- _CUB_ASM_PTR_(ptr)); \
- return retval; \
- }
-
-
-/**
- * Define a short (2B) ThreadLoad specialization for the given PTX load modifier
- */
-#define CUB_LOAD_2(cub_modifier, ptx_modifier) \
- template<> \
- __device__ __forceinline__ short ThreadLoad<cub_modifier, short*>(short* ptr) \
- { \
- short retval; \
- asm volatile ("ld."#ptx_modifier".s16 %0, [%1];" : \
- "=h"(retval) : \
- _CUB_ASM_PTR_(ptr)); \
- return retval; \
- }
-
-
-/**
- * Define a char (1B) ThreadLoad specialization for the given PTX load modifier
- */
-#define CUB_LOAD_1(cub_modifier, ptx_modifier) \
- template<> \
- __device__ __forceinline__ char ThreadLoad<cub_modifier, char*>(char* ptr) \
- { \
- short retval; \
- asm volatile ( \
- "{" \
- " .reg .s8 datum;" \
- " ld."#ptx_modifier".s8 datum, [%1];" \
- " cvt.s16.s8 %0, datum;" \
- "}" : \
- "=h"(retval) : \
- _CUB_ASM_PTR_(ptr)); \
- return (char) retval; \
- }
-
-
-/**
- * Define powers-of-two ThreadLoad specializations for the given PTX load modifier
- */
-#define CUB_LOAD_ALL(cub_modifier, ptx_modifier) \
- CUB_LOAD_16(cub_modifier, ptx_modifier) \
- CUB_LOAD_8(cub_modifier, ptx_modifier) \
- CUB_LOAD_4(cub_modifier, ptx_modifier) \
- CUB_LOAD_2(cub_modifier, ptx_modifier) \
- CUB_LOAD_1(cub_modifier, ptx_modifier) \
-
-
-/**
- * Define ThreadLoad specializations for the various PTX load modifiers
- */
-#if CUB_PTX_ARCH >= 200
- CUB_LOAD_ALL(LOAD_CA, ca)
- CUB_LOAD_ALL(LOAD_CG, cg)
- CUB_LOAD_ALL(LOAD_CS, cs)
- CUB_LOAD_ALL(LOAD_CV, cv)
-#else
- // LOAD_CV on SM10-13 uses "volatile.global" to ensure reads from last level
- CUB_LOAD_ALL(LOAD_CV, volatile.global)
-#endif
-#if CUB_PTX_ARCH >= 350
- CUB_LOAD_ALL(LOAD_LDG, global.nc)
-#endif
-
-
-/// Helper structure for templated load iteration (inductive case)
-template <PtxLoadModifier MODIFIER, int COUNT, int MAX>
-struct IterateThreadLoad
-{
- template <typename T>
- static __device__ __forceinline__ void Load(T *ptr, T *vals)
- {
- vals[COUNT] = ThreadLoad<MODIFIER>(ptr + COUNT);
- IterateThreadLoad<MODIFIER, COUNT + 1, MAX>::Load(ptr, vals);
- }
-};
-
-/// Helper structure for templated load iteration (termination case)
-template <PtxLoadModifier MODIFIER, int MAX>
-struct IterateThreadLoad<MODIFIER, MAX, MAX>
-{
- template <typename T>
- static __device__ __forceinline__ void Load(T *ptr, T *vals) {}
-};
-
-
-
-/**
- * Load with LOAD_DEFAULT on iterator types
- */
-template <typename InputIteratorRA>
-__device__ __forceinline__ typename std::iterator_traits<InputIteratorRA>::value_type ThreadLoad(
- InputIteratorRA itr,
- Int2Type<LOAD_DEFAULT> modifier,
- Int2Type<false> is_pointer)
-{
- return *itr;
-}
-
-
-/**
- * Load with LOAD_DEFAULT on pointer types
- */
-template <typename T>
-__device__ __forceinline__ T ThreadLoad(
- T *ptr,
- Int2Type<LOAD_DEFAULT> modifier,
- Int2Type<true> is_pointer)
-{
- return *ptr;
-}
-
-
-/**
- * Load with LOAD_VOLATILE on primitive pointer types
- */
-template <typename T>
-__device__ __forceinline__ T ThreadLoadVolatile(
- T *ptr,
- Int2Type<true> is_primitive)
-{
- T retval = *reinterpret_cast<volatile T*>(ptr);
-
-#if (CUB_PTX_ARCH <= 130)
- if (sizeof(T) == 1) __threadfence_block();
-#endif
-
- return retval;
-}
-
-
-/**
- * Load with LOAD_VOLATILE on non-primitive pointer types
- */
-template <typename T>
-__device__ __forceinline__ T ThreadLoadVolatile(
- T *ptr,
- Int2Type<false> is_primitive)
-{
- typedef typename WordAlignment<T>::VolatileWord VolatileWord; // Word type for memcopying
- enum { NUM_WORDS = sizeof(T) / sizeof(VolatileWord) };
-
- // Memcopy from aliased source into array of uninitialized words
- typename WordAlignment<T>::UninitializedVolatileWords words;
-
- #pragma unroll
- for (int i = 0; i < NUM_WORDS; ++i)
- words.buf[i] = reinterpret_cast<volatile VolatileWord*>(ptr)[i];
-
- // Load from words
- return *reinterpret_cast<T*>(words.buf);
-}
-
-
-/**
- * Load with LOAD_VOLATILE on pointer types
- */
-template <typename T>
-__device__ __forceinline__ T ThreadLoad(
- T *ptr,
- Int2Type<LOAD_VOLATILE> modifier,
- Int2Type<true> is_pointer)
-{
- return ThreadLoadVolatile(ptr, Int2Type<Traits<T>::PRIMITIVE>());
-}
-
-
-#if (CUB_PTX_ARCH <= 130)
-
-/**
- * Load with LOAD_CG uses LOAD_CV in pre-SM20 PTX to ensure coherent reads when run on newer architectures with L1
- */
-template <typename T>
-__device__ __forceinline__ T ThreadLoad(
- T *ptr,
- Int2Type<LOAD_CG> modifier,
- Int2Type<true> is_pointer)
-{
- return ThreadLoad<LOAD_CV>(ptr);
-}
-
-#endif // (CUB_PTX_ARCH <= 130)
-
-
-/**
- * Load with arbitrary MODIFIER on pointer types
- */
-template <typename T, int MODIFIER>
-__device__ __forceinline__ T ThreadLoad(
- T *ptr,
- Int2Type<MODIFIER> modifier,
- Int2Type<true> is_pointer)
-{
- typedef typename WordAlignment<T>::DeviceWord DeviceWord;
- enum { NUM_WORDS = sizeof(T) / sizeof(DeviceWord) };
-
- // Memcopy from aliased source into array of uninitialized words
- typename WordAlignment<T>::UninitializedDeviceWords words;
-
- IterateThreadLoad<PtxLoadModifier(MODIFIER), 0, NUM_WORDS>::Load(
- reinterpret_cast<DeviceWord*>(ptr),
- words.buf);
-
- // Load from words
- return *reinterpret_cast<T*>(words.buf);
-}
-
-
-/**
- * Generic ThreadLoad definition
- */
-template <
- PtxLoadModifier MODIFIER,
- typename InputIteratorRA>
-__device__ __forceinline__ typename std::iterator_traits<InputIteratorRA>::value_type ThreadLoad(InputIteratorRA itr)
-{
- return ThreadLoad(
- itr,
- Int2Type<MODIFIER>(),
- Int2Type<IsPointer<InputIteratorRA>::VALUE>());
-}
-
-
-
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
-
-/** @} */ // end group IoModule
-
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
diff --git a/lib/kokkos/TPL/cub/thread/thread_operators.cuh b/lib/kokkos/TPL/cub/thread/thread_operators.cuh
deleted file mode 100755
index bfb3d7c1b..000000000
--- a/lib/kokkos/TPL/cub/thread/thread_operators.cuh
+++ /dev/null
@@ -1,145 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * Simple binary operator functor types
- */
-
-/******************************************************************************
- * Simple functor operators
- ******************************************************************************/
-
-#pragma once
-
-#include "../util_macro.cuh"
-#include "../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/**
- * \addtogroup ThreadModule
- * @{
- */
-
-/**
- * \brief Default equality functor
- */
-struct Equality
-{
- /// Boolean equality operator, returns <tt>(a == b)</tt>
- template <typename T>
- __host__ __device__ __forceinline__ bool operator()(const T &a, const T &b)
- {
- return a == b;
- }
-};
-
-
-/**
- * \brief Default inequality functor
- */
-struct Inequality
-{
- /// Boolean inequality operator, returns <tt>(a != b)</tt>
- template <typename T>
- __host__ __device__ __forceinline__ bool operator()(const T &a, const T &b)
- {
- return a != b;
- }
-};
-
-
-/**
- * \brief Default sum functor
- */
-struct Sum
-{
- /// Binary sum operator, returns <tt>a + b</tt>
- template <typename T>
- __host__ __device__ __forceinline__ T operator()(const T &a, const T &b)
- {
- return a + b;
- }
-};
-
-
-/**
- * \brief Default max functor
- */
-struct Max
-{
- /// Binary max operator, returns <tt>(a > b) ? a : b</tt>
- template <typename T>
- __host__ __device__ __forceinline__ T operator()(const T &a, const T &b)
- {
- return CUB_MAX(a, b);
- }
-};
-
-
-/**
- * \brief Default min functor
- */
-struct Min
-{
- /// Binary min operator, returns <tt>(a < b) ? a : b</tt>
- template <typename T>
- __host__ __device__ __forceinline__ T operator()(const T &a, const T &b)
- {
- return CUB_MIN(a, b);
- }
-};
-
-
-/**
- * \brief Default cast functor
- */
-template <typename B>
-struct Cast
-{
- /// Cast operator, returns <tt>(B) a</tt>
- template <typename A>
- __host__ __device__ __forceinline__ B operator()(const A &a)
- {
- return (B) a;
- }
-};
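These functors are ordinary callable objects usable from host or device code; a brief illustrative host-side check (not part of CUB):

#include <cub/cub.cuh>
#include <cassert>

void FunctorSketch()
{
    assert(cub::Sum()(3, 4) == 7);
    assert(cub::Max()(3, 4) == 4);
    assert(cub::Min()(3, 4) == 3);
    assert(cub::Equality()(5, 5));
    assert(cub::Cast<float>()(3) == 3.0f);
}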
-
-
-
-/** @} */ // end group ThreadModule
-
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
diff --git a/lib/kokkos/TPL/cub/thread/thread_reduce.cuh b/lib/kokkos/TPL/cub/thread/thread_reduce.cuh
deleted file mode 100755
index 374fd77ae..000000000
--- a/lib/kokkos/TPL/cub/thread/thread_reduce.cuh
+++ /dev/null
@@ -1,145 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * Thread utilities for sequential reduction over statically-sized array types
- */
-
-#pragma once
-
-#include "../thread/thread_operators.cuh"
-#include "../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-/**
- * \addtogroup ThreadModule
- * @{
- */
-
-/**
- * \name Sequential reduction over statically-sized array types
- * @{
- */
-
-/**
- * \brief Perform a sequential reduction over \p LENGTH elements of the \p input array, seeded with the specified \p prefix. The aggregate is returned.
- *
- * \tparam LENGTH Length of input array
- * \tparam T <b>[inferred]</b> The data type to be reduced.
- * \tparam ReductionOp <b>[inferred]</b> Binary reduction operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
-template <
- int LENGTH,
- typename T,
- typename ReductionOp>
-__device__ __forceinline__ T ThreadReduce(
- T* input, ///< [in] Input array
- ReductionOp reduction_op, ///< [in] Binary reduction operator
- T prefix) ///< [in] Prefix to seed reduction with
-{
- #pragma unroll
- for (int i = 0; i < LENGTH; ++i)
- {
- prefix = reduction_op(prefix, input[i]);
- }
-
- return prefix;
-}
-
-
-/**
- * \brief Perform a sequential reduction over \p LENGTH elements of the \p input array. The aggregate is returned.
- *
- * \tparam LENGTH Length of input array
- * \tparam T <b>[inferred]</b> The data type to be reduced.
- * \tparam ReductionOp <b>[inferred]</b> Binary reduction operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
-template <
- int LENGTH,
- typename T,
- typename ReductionOp>
-__device__ __forceinline__ T ThreadReduce(
- T* input, ///< [in] Input array
- ReductionOp reduction_op) ///< [in] Binary reduction operator
-{
- T prefix = input[0];
- return ThreadReduce<LENGTH - 1>(input + 1, reduction_op, prefix);
-}
-
-
-/**
- * \brief Perform a sequential reduction over the statically-sized \p input array, seeded with the specified \p prefix. The aggregate is returned.
- *
- * \tparam LENGTH <b>[inferred]</b> Length of \p input array
- * \tparam T <b>[inferred]</b> The data type to be reduced.
- * \tparam ReductionOp <b>[inferred]</b> Binary reduction operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
-template <
- int LENGTH,
- typename T,
- typename ReductionOp>
-__device__ __forceinline__ T ThreadReduce(
- T (&input)[LENGTH], ///< [in] Input array
- ReductionOp reduction_op, ///< [in] Binary reduction operator
- T prefix) ///< [in] Prefix to seed reduction with
-{
- return ThreadReduce<LENGTH>(input, reduction_op, prefix);
-}
-
-
-/**
- * \brief Serial reduction with the specified operator
- *
- * \tparam LENGTH <b>[inferred]</b> Length of \p input array
- * \tparam T <b>[inferred]</b> The data type to be reduced.
- * \tparam ReductionOp <b>[inferred]</b> Binary reduction operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
-template <
- int LENGTH,
- typename T,
- typename ReductionOp>
-__device__ __forceinline__ T ThreadReduce(
- T (&input)[LENGTH], ///< [in] Input array
- ReductionOp reduction_op) ///< [in] Binary reduction operator
-{
- return ThreadReduce<LENGTH>((T*) input, reduction_op);
-}
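A hedged device-code sketch (array contents illustrative) of reducing a small register-resident array with the functors from thread_operators.cuh:

#include <cub/cub.cuh>

__device__ void ThreadReduceSketch()
{
    int items[4] = {3, 1, 4, 1};

    // Sum of all four items (LENGTH inferred from the array reference): 9
    int total = cub::ThreadReduce(items, cub::Sum());

    // Maximum, seeded with a prefix of 0: 4
    int largest = cub::ThreadReduce(items, cub::Max(), 0);

    (void) total; (void) largest;
}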
-
-
-//@} end member group
-
-/** @} */ // end group ThreadModule
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
diff --git a/lib/kokkos/TPL/cub/thread/thread_scan.cuh b/lib/kokkos/TPL/cub/thread/thread_scan.cuh
deleted file mode 100755
index b43bbcf00..000000000
--- a/lib/kokkos/TPL/cub/thread/thread_scan.cuh
+++ /dev/null
@@ -1,231 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * Thread utilities for sequential prefix scan over statically-sized array types
- */
-
-#pragma once
-
-#include "../thread/thread_operators.cuh"
-#include "../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-/**
- * \addtogroup ThreadModule
- * @{
- */
-
-/**
- * \name Sequential prefix scan over statically-sized array types
- * @{
- */
-
-/**
- * \brief Perform a sequential exclusive prefix scan over \p LENGTH elements of the \p input array, seeded with the specified \p prefix. The aggregate is returned.
- *
- * \tparam LENGTH Length of \p input and \p output arrays
- * \tparam T <b>[inferred]</b> The data type to be scanned.
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
-template <
- int LENGTH,
- typename T,
- typename ScanOp>
-__device__ __forceinline__ T ThreadScanExclusive(
- T *input, ///< [in] Input array
- T *output, ///< [out] Output array (may be aliased to \p input)
- ScanOp scan_op, ///< [in] Binary scan operator
- T prefix, ///< [in] Prefix to seed scan with
- bool apply_prefix = true) ///< [in] Whether or not the calling thread should apply its prefix. If not, the first output element is undefined. (Handy for preventing thread-0 from applying a prefix.)
-{
- T inclusive = input[0];
- if (apply_prefix)
- {
- inclusive = scan_op(prefix, inclusive);
- }
- output[0] = prefix;
- T exclusive = inclusive;
-
- #pragma unroll
- for (int i = 1; i < LENGTH; ++i)
- {
- inclusive = scan_op(exclusive, input[i]);
- output[i] = exclusive;
- exclusive = inclusive;
- }
-
- return inclusive;
-}
-
-
-/**
- * \brief Perform a sequential exclusive prefix scan over the statically-sized \p input array, seeded with the specified \p prefix. The aggregate is returned.
- *
- * \tparam LENGTH <b>[inferred]</b> Length of \p input and \p output arrays
- * \tparam T <b>[inferred]</b> The data type to be scanned.
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
-template <
- int LENGTH,
- typename T,
- typename ScanOp>
-__device__ __forceinline__ T ThreadScanExclusive(
- T (&input)[LENGTH], ///< [in] Input array
- T (&output)[LENGTH], ///< [out] Output array (may be aliased to \p input)
- ScanOp scan_op, ///< [in] Binary scan operator
- T prefix, ///< [in] Prefix to seed scan with
- bool apply_prefix = true) ///< [in] Whether or not the calling thread should apply its prefix. (Handy for preventing thread-0 from applying a prefix.)
-{
- return ThreadScanExclusive<LENGTH>((T*) input, (T*) output, scan_op, prefix, apply_prefix);
-}
-
-
-/**
- * \brief Perform a sequential inclusive prefix scan over \p LENGTH elements of the \p input array. The aggregate is returned.
- *
- * \tparam LENGTH Length of \p input and \p output arrays
- * \tparam T <b>[inferred]</b> The data type to be scanned.
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
-template <
- int LENGTH,
- typename T,
- typename ScanOp>
-__device__ __forceinline__ T ThreadScanInclusive(
- T *input, ///< [in] Input array
- T *output, ///< [out] Output array (may be aliased to \p input)
- ScanOp scan_op) ///< [in] Binary scan operator
-{
- T inclusive = input[0];
- output[0] = inclusive;
-
- // Continue scan
- #pragma unroll
- for (int i = 1; i < LENGTH; ++i)
- {
- inclusive = scan_op(inclusive, input[i]);
- output[i] = inclusive;
- }
-
- return inclusive;
-}
-
-
-/**
- * \brief Perform a sequential inclusive prefix scan over the statically-sized \p input array. The aggregate is returned.
- *
- * \tparam LENGTH <b>[inferred]</b> Length of \p input and \p output arrays
- * \tparam T <b>[inferred]</b> The data type to be scanned.
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
-template <
- int LENGTH,
- typename T,
- typename ScanOp>
-__device__ __forceinline__ T ThreadScanInclusive(
- T (&input)[LENGTH], ///< [in] Input array
- T (&output)[LENGTH], ///< [out] Output array (may be aliased to \p input)
- ScanOp scan_op) ///< [in] Binary scan operator
-{
- return ThreadScanInclusive<LENGTH>((T*) input, (T*) output, scan_op);
-}
-
-
-/**
- * \brief Perform a sequential inclusive prefix scan over \p LENGTH elements of the \p input array, seeded with the specified \p prefix. The aggregate is returned.
- *
- * \tparam LENGTH Length of \p input and \p output arrays
- * \tparam T <b>[inferred]</b> The data type to be scanned.
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
-template <
- int LENGTH,
- typename T,
- typename ScanOp>
-__device__ __forceinline__ T ThreadScanInclusive(
- T *input, ///< [in] Input array
- T *output, ///< [out] Output array (may be aliased to \p input)
- ScanOp scan_op, ///< [in] Binary scan operator
- T prefix, ///< [in] Prefix to seed scan with
- bool apply_prefix = true) ///< [in] Whether or not the calling thread should apply its prefix. (Handy for preventing thread-0 from applying a prefix.)
-{
- T inclusive = input[0];
- if (apply_prefix)
- {
- inclusive = scan_op(prefix, inclusive);
- }
- output[0] = inclusive;
-
- // Continue scan
- #pragma unroll
- for (int i = 1; i < LENGTH; ++i)
- {
- inclusive = scan_op(inclusive, input[i]);
- output[i] = inclusive;
- }
-
- return inclusive;
-}
-
-
-/**
- * \brief Perform a sequential inclusive prefix scan over the statically-sized \p input array, seeded with the specified \p prefix. The aggregate is returned.
- *
- * \tparam LENGTH <b>[inferred]</b> Length of \p input and \p output arrays
- * \tparam T <b>[inferred]</b> The data type to be scanned.
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
-template <
- int LENGTH,
- typename T,
- typename ScanOp>
-__device__ __forceinline__ T ThreadScanInclusive(
- T (&input)[LENGTH], ///< [in] Input array
- T (&output)[LENGTH], ///< [out] Output array (may be aliased to \p input)
- ScanOp scan_op, ///< [in] Binary scan operator
- T prefix, ///< [in] Prefix to seed scan with
- bool apply_prefix = true) ///< [in] Whether or not the calling thread should apply its prefix. (Handy for preventing thread-0 from applying a prefix.)
-{
- return ThreadScanInclusive<LENGTH>((T*) input, (T*) output, scan_op, prefix, apply_prefix);
-}
-
-
-//@} end member group
-
-/** @} */ // end group ThreadModule
-
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
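The two scan flavors in the deleted thread_scan.cuh differ only in what is written to output[i]: the exclusive variant stores the running total computed before element i, the inclusive variant stores it after. Here is a small host-side sketch of the exclusive case; SeqScanExclusive is an illustrative name, not the CUB API, and the block is not part of the patch.

#include <cstdio>

// Exclusive scan seeded with a prefix: out[i] = prefix op x[0] op ... op x[i-1].
// Returns the inclusive aggregate, as the deleted ThreadScanExclusive does.
template <int LENGTH, typename T, typename ScanOp>
T SeqScanExclusive(const T (&in)[LENGTH], T (&out)[LENGTH], ScanOp op, T prefix)
{
    T running = prefix;
    for (int i = 0; i < LENGTH; ++i)
    {
        T next = op(running, in[i]);
        out[i] = running;   // value *before* element i
        running = next;
    }
    return running;         // inclusive aggregate
}

int main()
{
    int in[4] = {1, 2, 3, 4}, out[4];
    int total = SeqScanExclusive(in, out, [](int a, int b) { return a + b; }, 0);
    std::printf("%d %d %d %d | %d\n", out[0], out[1], out[2], out[3], total); // 0 1 3 6 | 10
    return 0;
}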
diff --git a/lib/kokkos/TPL/cub/thread/thread_store.cuh b/lib/kokkos/TPL/cub/thread/thread_store.cuh
deleted file mode 100755
index 8d39e07b1..000000000
--- a/lib/kokkos/TPL/cub/thread/thread_store.cuh
+++ /dev/null
@@ -1,412 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * Thread utilities for writing memory using PTX cache modifiers.
- */
-
-#pragma once
-
-#include <cuda.h>
-
-#include "../util_ptx.cuh"
-#include "../util_type.cuh"
-#include "../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-/**
- * \addtogroup IoModule
- * @{
- */
-
-
-//-----------------------------------------------------------------------------
-// Tags and constants
-//-----------------------------------------------------------------------------
-
-/**
- * \brief Enumeration of PTX cache-modifiers for memory store operations.
- */
-enum PtxStoreModifier
-{
- STORE_DEFAULT, ///< Default (no modifier)
- STORE_WB, ///< Cache write-back all coherent levels
- STORE_CG, ///< Cache at global level
- STORE_CS, ///< Cache streaming (likely to be accessed once)
- STORE_WT, ///< Cache write-through (to system memory)
- STORE_VOLATILE, ///< Volatile shared (any memory space)
-};
-
-
-/**
- * \name Simple I/O
- * @{
- */
-
-/**
- * \brief Thread utility for writing memory using cub::PtxStoreModifier cache modifiers.
- *
- * Cache modifiers will only be effective for built-in types (i.e., C++
- * primitives and CUDA vector-types).
- *
- * For example:
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * // 32-bit store using cache-global modifier:
- * int *d_out;
- * int val;
- * cub::ThreadStore<cub::STORE_CG>(d_out + threadIdx.x, val);
- *
- * // 16-bit store using default modifier
- * short *d_out;
- * short val;
- * cub::ThreadStore<cub::STORE_DEFAULT>(d_out + threadIdx.x, val);
- *
- * // 256-bit store using write-through modifier
- * double4 *d_out;
- * double4 val;
- * cub::ThreadStore<cub::STORE_WT>(d_out + threadIdx.x, val);
- *
- * // Store of a non-primitive struct type using default cache modifier (ignoring STORE_CS)
- * struct TestFoo { bool a; short b; };
- * TestFoo *d_struct;
- * TestFoo val;
- * cub::ThreadStore<cub::STORE_CS>(d_struct + threadIdx.x, val);
- * \endcode
- *
- */
-template <
- PtxStoreModifier MODIFIER,
- typename OutputIteratorRA,
- typename T>
-__device__ __forceinline__ void ThreadStore(OutputIteratorRA itr, T val);
-
-
-//@} end member group
-
-
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
-
-/**
- * Define an int4 (16B) ThreadStore specialization for the given PTX store modifier
- */
-#define CUB_STORE_16(cub_modifier, ptx_modifier) \
- template<> \
- __device__ __forceinline__ void ThreadStore<cub_modifier, int4*, int4>(int4* ptr, int4 val) \
- { \
- asm volatile ("st."#ptx_modifier".v4.s32 [%0], {%1, %2, %3, %4};" : : \
- _CUB_ASM_PTR_(ptr), \
- "r"(val.x), \
- "r"(val.y), \
- "r"(val.z), \
- "r"(val.w)); \
- } \
- template<> \
- __device__ __forceinline__ void ThreadStore<cub_modifier, longlong2*, longlong2>(longlong2* ptr, longlong2 val) \
- { \
- asm volatile ("st."#ptx_modifier".v2.s64 [%0], {%1, %2};" : : \
- _CUB_ASM_PTR_(ptr), \
- "l"(val.x), \
- "l"(val.y)); \
- }
-
-
-/**
- * Define an int2 (8B) ThreadStore specialization for the given PTX store modifier
- */
-#define CUB_STORE_8(cub_modifier, ptx_modifier) \
- template<> \
- __device__ __forceinline__ void ThreadStore<cub_modifier, short4*, short4>(short4* ptr, short4 val) \
- { \
- asm volatile ("st."#ptx_modifier".v4.s16 [%0], {%1, %2, %3, %4};" : : \
- _CUB_ASM_PTR_(ptr), \
- "h"(val.x), \
- "h"(val.y), \
- "h"(val.z), \
- "h"(val.w)); \
- } \
- template<> \
- __device__ __forceinline__ void ThreadStore<cub_modifier, int2*, int2>(int2* ptr, int2 val) \
- { \
- asm volatile ("st."#ptx_modifier".v2.s32 [%0], {%1, %2};" : : \
- _CUB_ASM_PTR_(ptr), \
- "r"(val.x), \
- "r"(val.y)); \
- } \
- template<> \
- __device__ __forceinline__ void ThreadStore<cub_modifier, long long*, long long>(long long* ptr, long long val) \
- { \
- asm volatile ("st."#ptx_modifier".s64 [%0], %1;" : : \
- _CUB_ASM_PTR_(ptr), \
- "l"(val)); \
- }
-
-/**
- * Define an int (4B) ThreadStore specialization for the given PTX store modifier
- */
-#define CUB_STORE_4(cub_modifier, ptx_modifier) \
- template<> \
- __device__ __forceinline__ void ThreadStore<cub_modifier, int*, int>(int* ptr, int val) \
- { \
- asm volatile ("st."#ptx_modifier".s32 [%0], %1;" : : \
- _CUB_ASM_PTR_(ptr), \
- "r"(val)); \
- }
-
-
-/**
- * Define a short (2B) ThreadStore specialization for the given PTX store modifier
- */
-#define CUB_STORE_2(cub_modifier, ptx_modifier) \
- template<> \
- __device__ __forceinline__ void ThreadStore<cub_modifier, short*, short>(short* ptr, short val) \
- { \
- asm volatile ("st."#ptx_modifier".s16 [%0], %1;" : : \
- _CUB_ASM_PTR_(ptr), \
- "h"(val)); \
- }
-
-
-/**
- * Define a char (1B) ThreadStore specialization for the given PTX store modifier
- */
-#define CUB_STORE_1(cub_modifier, ptx_modifier) \
- template<> \
- __device__ __forceinline__ void ThreadStore<cub_modifier, char*, char>(char* ptr, char val) \
- { \
- asm volatile ( \
- "{" \
- " .reg .s8 datum;" \
- " cvt.s8.s16 datum, %1;" \
- " st."#ptx_modifier".s8 [%0], datum;" \
- "}" : : \
- _CUB_ASM_PTR_(ptr), \
- "h"(short(val))); \
- }
-
-/**
- * Define powers-of-two ThreadStore specializations for the given PTX store modifier
- */
-#define CUB_STORE_ALL(cub_modifier, ptx_modifier) \
- CUB_STORE_16(cub_modifier, ptx_modifier) \
- CUB_STORE_8(cub_modifier, ptx_modifier) \
- CUB_STORE_4(cub_modifier, ptx_modifier) \
- CUB_STORE_2(cub_modifier, ptx_modifier) \
- CUB_STORE_1(cub_modifier, ptx_modifier) \
-
-
-/**
- * Define ThreadStore specializations for the various PTX store modifiers
- */
-#if CUB_PTX_ARCH >= 200
- CUB_STORE_ALL(STORE_WB, ca)
- CUB_STORE_ALL(STORE_CG, cg)
- CUB_STORE_ALL(STORE_CS, cs)
- CUB_STORE_ALL(STORE_WT, cv)
-#else
- // STORE_WT on SM10-13 uses "volatile.global" to ensure writes to last level
- CUB_STORE_ALL(STORE_WT, volatile.global)
-#endif
-
-
-
-/// Helper structure for templated store iteration (inductive case)
-template <PtxStoreModifier MODIFIER, int COUNT, int MAX>
-struct IterateThreadStore
-{
- template <typename T>
- static __device__ __forceinline__ void Store(T *ptr, T *vals)
- {
- ThreadStore<MODIFIER>(ptr + COUNT, vals[COUNT]);
- IterateThreadStore<MODIFIER, COUNT + 1, MAX>::Store(ptr, vals);
- }
-};
-
-/// Helper structure for templated store iteration (termination case)
-template <PtxStoreModifier MODIFIER, int MAX>
-struct IterateThreadStore<MODIFIER, MAX, MAX>
-{
- template <typename T>
- static __device__ __forceinline__ void Store(T *ptr, T *vals) {}
-};
-
-
-
-
-/**
- * Store with STORE_DEFAULT on iterator types
- */
-template <typename OutputIteratorRA, typename T>
-__device__ __forceinline__ void ThreadStore(
- OutputIteratorRA itr,
- T val,
- Int2Type<STORE_DEFAULT> modifier,
- Int2Type<false> is_pointer)
-{
- *itr = val;
-}
-
-
-/**
- * Store with STORE_DEFAULT on pointer types
- */
-template <typename T>
-__device__ __forceinline__ void ThreadStore(
- T *ptr,
- T val,
- Int2Type<STORE_DEFAULT> modifier,
- Int2Type<true> is_pointer)
-{
- *ptr = val;
-}
-
-
-/**
- * Store with STORE_VOLATILE on primitive pointer types
- */
-template <typename T>
-__device__ __forceinline__ void ThreadStoreVolatile(
- T *ptr,
- T val,
- Int2Type<true> is_primitive)
-{
- *reinterpret_cast<volatile T*>(ptr) = val;
-}
-
-
-/**
- * Store with STORE_VOLATILE on non-primitive pointer types
- */
-template <typename T>
-__device__ __forceinline__ void ThreadStoreVolatile(
- T *ptr,
- T val,
- Int2Type<false> is_primitive)
-{
- typedef typename WordAlignment<T>::VolatileWord VolatileWord; // Word type for memcopying
- enum { NUM_WORDS = sizeof(T) / sizeof(VolatileWord) };
-
- // Store into array of uninitialized words
- typename WordAlignment<T>::UninitializedVolatileWords words;
- *reinterpret_cast<T*>(words.buf) = val;
-
- // Memcopy words to aliased destination
- #pragma unroll
- for (int i = 0; i < NUM_WORDS; ++i)
- reinterpret_cast<volatile VolatileWord*>(ptr)[i] = words.buf[i];
-}
-
-
-/**
- * Store with STORE_VOLATILE on pointer types
- */
-template <typename T>
-__device__ __forceinline__ void ThreadStore(
- T *ptr,
- T val,
- Int2Type<STORE_VOLATILE> modifier,
- Int2Type<true> is_pointer)
-{
- ThreadStoreVolatile(ptr, val, Int2Type<Traits<T>::PRIMITIVE>());
-}
-
-
-#if (CUB_PTX_ARCH <= 350)
-
-/**
- * Store with STORE_CG on pointer types (uses STORE_DEFAULT on current architectures)
- */
-template <typename T>
-__device__ __forceinline__ void ThreadStore(
- T *ptr,
- T val,
- Int2Type<STORE_CG> modifier,
- Int2Type<true> is_pointer)
-{
- ThreadStore<STORE_DEFAULT>(ptr, val);
-}
-
-#endif // (CUB_PTX_ARCH <= 350)
-
-
-/**
- * Store with arbitrary MODIFIER on pointer types
- */
-template <typename T, int MODIFIER>
-__device__ __forceinline__ void ThreadStore(
- T *ptr,
- T val,
- Int2Type<MODIFIER> modifier,
- Int2Type<true> is_pointer)
-{
- typedef typename WordAlignment<T>::DeviceWord DeviceWord; // Word type for memcopying
- enum { NUM_WORDS = sizeof(T) / sizeof(DeviceWord) };
-
- // Store into array of uninitialized words
- typename WordAlignment<T>::UninitializedDeviceWords words;
- *reinterpret_cast<T*>(words.buf) = val;
-
- // Memcopy words to aliased destination
- IterateThreadStore<PtxStoreModifier(MODIFIER), 0, NUM_WORDS>::Store(
- reinterpret_cast<DeviceWord*>(ptr),
- words.buf);
-}
-
-
-/**
- * Generic ThreadStore definition
- */
-template <PtxStoreModifier MODIFIER, typename OutputIteratorRA, typename T>
-__device__ __forceinline__ void ThreadStore(OutputIteratorRA itr, T val)
-{
- ThreadStore(
- itr,
- val,
- Int2Type<MODIFIER>(),
- Int2Type<IsPointer<OutputIteratorRA>::VALUE>());
-}
-
-
-
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
-
-/** @} */ // end group IoModule
-
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
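Beyond the PTX inline assembly, the interesting part of the deleted thread_store.cuh is the compile-time dispatch: the generic ThreadStore builds an Int2Type tag from IsPointer<OutputIteratorRA> and lets overload resolution choose the pointer or iterator path. The self-contained sketch below reproduces only that pattern; Int2Type, IsPointer, Store, and StoreImpl are re-implemented here for illustration and are not the CUB definitions.

#include <cstdio>

// Minimal stand-ins for the Int2Type / IsPointer tag-dispatch idiom.
template <int N> struct Int2Type { enum { VALUE = N }; };

template <typename T> struct IsPointer     { enum { VALUE = 0 }; };
template <typename T> struct IsPointer<T*> { enum { VALUE = 1 }; };

// Pointer path: plain dereference-and-assign.
template <typename T>
void StoreImpl(T* ptr, T val, Int2Type<1>) { *ptr = val; }

// Iterator path: rely on the iterator's operator*.
template <typename It, typename T>
void StoreImpl(It it, T val, Int2Type<0>) { *it = val; }

// Public entry point: the tag selects the overload at compile time.
template <typename OutputIt, typename T>
void Store(OutputIt it, T val)
{
    StoreImpl(it, val, Int2Type<IsPointer<OutputIt>::VALUE>());
}

int main()
{
    int x = 0;
    Store(&x, 42);              // resolves to the pointer path
    std::printf("%d\n", x);     // prints 42
    return 0;
}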
diff --git a/lib/kokkos/TPL/cub/util_allocator.cuh b/lib/kokkos/TPL/cub/util_allocator.cuh
deleted file mode 100755
index ae40f3305..000000000
--- a/lib/kokkos/TPL/cub/util_allocator.cuh
+++ /dev/null
@@ -1,661 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/******************************************************************************
- * Simple caching allocator for device memory allocations. The allocator is
- * thread-safe and capable of managing device allocations on multiple devices.
- ******************************************************************************/
-
-#pragma once
-
-#ifndef __CUDA_ARCH__
- #include <set> // NVCC (EDG, really) takes FOREVER to compile std::map
- #include <map>
-#endif
-
-#include <math.h>
-
-#include "util_namespace.cuh"
-#include "util_debug.cuh"
-
-#include "host/spinlock.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/**
- * \addtogroup UtilModule
- * @{
- */
-
-
-/******************************************************************************
- * CachingDeviceAllocator (host use)
- ******************************************************************************/
-
-/**
- * \brief A simple caching allocator for device memory allocations.
- *
- * \par Overview
- * The allocator is thread-safe and is capable of managing cached device allocations
- * on multiple devices. It behaves as follows:
- *
- * \par
- * - Allocations categorized by bin size.
- * - Bin sizes progress geometrically in accordance with the growth factor
- * \p bin_growth provided during construction. Unused device allocations within
- * a larger bin cache are not reused for allocation requests that categorize to
- * smaller bin sizes.
- * - Allocation requests below (\p bin_growth ^ \p min_bin) are rounded up to
- * (\p bin_growth ^ \p min_bin).
- * - Allocations above (\p bin_growth ^ \p max_bin) are not rounded up to the nearest
- * bin and are simply freed when they are deallocated instead of being returned
- * to a bin-cache.
- * - %If the total storage of cached allocations on a given device will exceed
- * \p max_cached_bytes, allocations for that device are simply freed when they are
- * deallocated instead of being returned to their bin-cache.
- *
- * \par
- * For example, the default-constructed CachingDeviceAllocator is configured with:
- * - \p bin_growth = 8
- * - \p min_bin = 3
- * - \p max_bin = 7
- * - \p max_cached_bytes = 6MB - 1B
- *
- * \par
- * which delineates five bin-sizes: 512B, 4KB, 32KB, 256KB, and 2MB
- * and sets a maximum of 6,291,455 cached bytes per device
- *
- */
-struct CachingDeviceAllocator
-{
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
-
- //---------------------------------------------------------------------
- // Type definitions and constants
- //---------------------------------------------------------------------
-
- enum
- {
- /// Invalid device ordinal
- INVALID_DEVICE_ORDINAL = -1,
- };
-
- /**
- * Integer pow function for unsigned base and exponent
- */
- static unsigned int IntPow(
- unsigned int base,
- unsigned int exp)
- {
- unsigned int retval = 1;
- while (exp > 0)
- {
- if (exp & 1) {
- retval = retval * base; // multiply the result by the current base
- }
- base = base * base; // square the base
- exp = exp >> 1; // divide the exponent in half
- }
- return retval;
- }
-
-
- /**
- * Round up to the nearest power of the given base
- */
- static void NearestPowerOf(
- unsigned int &power,
- size_t &rounded_bytes,
- unsigned int base,
- size_t value)
- {
- power = 0;
- rounded_bytes = 1;
-
- while (rounded_bytes < value)
- {
- rounded_bytes *= base;
- power++;
- }
- }
-
- /**
- * Descriptor for device memory allocations
- */
- struct BlockDescriptor
- {
- int device; // device ordinal
- void* d_ptr; // Device pointer
- size_t bytes; // Size of allocation in bytes
- unsigned int bin; // Bin enumeration
-
- // Constructor
- BlockDescriptor(void *d_ptr, int device) :
- d_ptr(d_ptr),
- bytes(0),
- bin(0),
- device(device) {}
-
- // Constructor
- BlockDescriptor(size_t bytes, unsigned int bin, int device) :
- d_ptr(NULL),
- bytes(bytes),
- bin(bin),
- device(device) {}
-
- // Comparison functor for comparing device pointers
- static bool PtrCompare(const BlockDescriptor &a, const BlockDescriptor &b)
- {
- if (a.device < b.device) {
- return true;
- } else if (a.device > b.device) {
- return false;
- } else {
- return (a.d_ptr < b.d_ptr);
- }
- }
-
- // Comparison functor for comparing allocation sizes
- static bool SizeCompare(const BlockDescriptor &a, const BlockDescriptor &b)
- {
- if (a.device < b.device) {
- return true;
- } else if (a.device > b.device) {
- return false;
- } else {
- return (a.bytes < b.bytes);
- }
- }
- };
-
- /// BlockDescriptor comparator function interface
- typedef bool (*Compare)(const BlockDescriptor &, const BlockDescriptor &);
-
-#ifndef __CUDA_ARCH__ // Only define STL container members in host code
-
- /// Set type for cached blocks (ordered by size)
- typedef std::multiset<BlockDescriptor, Compare> CachedBlocks;
-
- /// Set type for live blocks (ordered by ptr)
- typedef std::multiset<BlockDescriptor, Compare> BusyBlocks;
-
- /// Map type of device ordinals to the number of cached bytes cached by each device
- typedef std::map<int, size_t> GpuCachedBytes;
-
-#endif // __CUDA_ARCH__
-
- //---------------------------------------------------------------------
- // Fields
- //---------------------------------------------------------------------
-
- Spinlock spin_lock; /// Spinlock for thread-safety
-
- unsigned int bin_growth; /// Geometric growth factor for bin-sizes
- unsigned int min_bin; /// Minimum bin enumeration
- unsigned int max_bin; /// Maximum bin enumeration
-
- size_t min_bin_bytes; /// Minimum bin size
- size_t max_bin_bytes; /// Maximum bin size
- size_t max_cached_bytes; /// Maximum aggregate cached bytes per device
-
- bool debug; /// Whether or not to print (de)allocation events to stdout
- bool skip_cleanup; /// Whether or not to skip a call to FreeAllCached() when destructor is called. (The CUDA runtime may have already shut down for statically declared allocators)
-
-#ifndef __CUDA_ARCH__ // Only define STL container members in host code
-
- GpuCachedBytes cached_bytes; /// Map of device ordinal to aggregate cached bytes on that device
- CachedBlocks cached_blocks; /// Set of cached device allocations available for reuse
- BusyBlocks live_blocks; /// Set of live device allocations currently in use
-
-#endif // __CUDA_ARCH__
-
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
- //---------------------------------------------------------------------
- // Methods
- //---------------------------------------------------------------------
-
- /**
- * \brief Constructor.
- */
- CachingDeviceAllocator(
- unsigned int bin_growth, ///< Geometric growth factor for bin-sizes
- unsigned int min_bin, ///< Minimum bin
- unsigned int max_bin, ///< Maximum bin
- size_t max_cached_bytes) ///< Maximum aggregate cached bytes per device
- :
- #ifndef __CUDA_ARCH__ // Only define STL container members in host code
- cached_blocks(BlockDescriptor::SizeCompare),
- live_blocks(BlockDescriptor::PtrCompare),
- #endif
- debug(false),
- spin_lock(0),
- bin_growth(bin_growth),
- min_bin(min_bin),
- max_bin(max_bin),
- min_bin_bytes(IntPow(bin_growth, min_bin)),
- max_bin_bytes(IntPow(bin_growth, max_bin)),
- max_cached_bytes(max_cached_bytes)
- {}
-
-
- /**
- * \brief Default constructor.
- *
- * Configured with:
- * \par
- * - \p bin_growth = 8
- * - \p min_bin = 3
- * - \p max_bin = 7
- * - \p max_cached_bytes = ((\p bin_growth ^ \p max_bin) * 3) - 1 = 6,291,455 bytes
- *
- * which delineates five bin-sizes: 512B, 4KB, 32KB, 256KB, and 2MB and
- * sets a maximum of 6,291,455 cached bytes per device
- */
- CachingDeviceAllocator(bool skip_cleanup = false) :
- #ifndef __CUDA_ARCH__ // Only define STL container members in host code
- cached_blocks(BlockDescriptor::SizeCompare),
- live_blocks(BlockDescriptor::PtrCompare),
- #endif
- skip_cleanup(skip_cleanup),
- debug(false),
- spin_lock(0),
- bin_growth(8),
- min_bin(3),
- max_bin(7),
- min_bin_bytes(IntPow(bin_growth, min_bin)),
- max_bin_bytes(IntPow(bin_growth, max_bin)),
- max_cached_bytes((max_bin_bytes * 3) - 1)
- {}
-
-
- /**
- * \brief Sets the limit on the number of bytes this allocator is allowed to cache per device.
- */
- cudaError_t SetMaxCachedBytes(
- size_t max_cached_bytes)
- {
- #ifdef __CUDA_ARCH__
- // Caching functionality only defined on host
- return CubDebug(cudaErrorInvalidConfiguration);
- #else
-
- // Lock
- Lock(&spin_lock);
-
- this->max_cached_bytes = max_cached_bytes;
-
- if (debug) CubLog("New max_cached_bytes(%lld)\n", (long long) max_cached_bytes);
-
- // Unlock
- Unlock(&spin_lock);
-
- return cudaSuccess;
-
- #endif // __CUDA_ARCH__
- }
-
-
- /**
- * \brief Provides a suitable allocation of device memory for the given size on the specified device
- */
- cudaError_t DeviceAllocate(
- void** d_ptr,
- size_t bytes,
- int device)
- {
- #ifdef __CUDA_ARCH__
- // Caching functionality only defined on host
- return CubDebug(cudaErrorInvalidConfiguration);
- #else
-
- bool locked = false;
- int entrypoint_device = INVALID_DEVICE_ORDINAL;
- cudaError_t error = cudaSuccess;
-
- // Round up to nearest bin size
- unsigned int bin;
- size_t bin_bytes;
- NearestPowerOf(bin, bin_bytes, bin_growth, bytes);
- if (bin < min_bin) {
- bin = min_bin;
- bin_bytes = min_bin_bytes;
- }
-
- // Check if bin is greater than our maximum bin
- if (bin > max_bin)
- {
- // Allocate the request exactly and give out-of-range bin
- bin = (unsigned int) -1;
- bin_bytes = bytes;
- }
-
- BlockDescriptor search_key(bin_bytes, bin, device);
-
- // Lock
- if (!locked) {
- Lock(&spin_lock);
- locked = true;
- }
-
- do {
- // Find a free block big enough within the same bin on the same device
- CachedBlocks::iterator block_itr = cached_blocks.lower_bound(search_key);
- if ((block_itr != cached_blocks.end()) &&
- (block_itr->device == device) &&
- (block_itr->bin == search_key.bin))
- {
- // Reuse existing cache block. Insert into live blocks.
- search_key = *block_itr;
- live_blocks.insert(search_key);
-
- // Remove from free blocks
- cached_blocks.erase(block_itr);
- cached_bytes[device] -= search_key.bytes;
-
- if (debug) CubLog("\tdevice %d reused cached block (%lld bytes). %lld available blocks cached (%lld bytes), %lld live blocks outstanding.\n",
- device, (long long) search_key.bytes, (long long) cached_blocks.size(), (long long) cached_bytes[device], (long long) live_blocks.size());
- }
- else
- {
- // Need to allocate a new cache block. Unlock.
- if (locked) {
- Unlock(&spin_lock);
- locked = false;
- }
-
- // Set to specified device
- if (CubDebug(error = cudaGetDevice(&entrypoint_device))) break;
- if (CubDebug(error = cudaSetDevice(device))) break;
-
- // Allocate
- if (CubDebug(error = cudaMalloc(&search_key.d_ptr, search_key.bytes))) break;
-
- // Lock
- if (!locked) {
- Lock(&spin_lock);
- locked = true;
- }
-
- // Insert into live blocks
- live_blocks.insert(search_key);
-
- if (debug) CubLog("\tdevice %d allocating new device block %lld bytes. %lld available blocks cached (%lld bytes), %lld live blocks outstanding.\n",
- device, (long long) search_key.bytes, (long long) cached_blocks.size(), (long long) cached_bytes[device], (long long) live_blocks.size());
- }
- } while(0);
-
- // Unlock
- if (locked) {
- Unlock(&spin_lock);
- locked = false;
- }
-
- // Copy device pointer to output parameter (NULL on error)
- *d_ptr = search_key.d_ptr;
-
- // Attempt to revert back to previous device if necessary
- if (entrypoint_device != INVALID_DEVICE_ORDINAL)
- {
- if (CubDebug(error = cudaSetDevice(entrypoint_device))) return error;
- }
-
- return error;
-
- #endif // __CUDA_ARCH__
- }
-
-
- /**
- * \brief Provides a suitable allocation of device memory for the given size on the current device
- */
- cudaError_t DeviceAllocate(
- void** d_ptr,
- size_t bytes)
- {
- #ifdef __CUDA_ARCH__
- // Caching functionality only defined on host
- return CubDebug(cudaErrorInvalidConfiguration);
- #else
- cudaError_t error = cudaSuccess;
- do {
- int current_device;
- if (CubDebug(error = cudaGetDevice(&current_device))) break;
- if (CubDebug(error = DeviceAllocate(d_ptr, bytes, current_device))) break;
- } while(0);
-
- return error;
-
- #endif // __CUDA_ARCH__
- }
-
-
- /**
- * \brief Frees a live allocation of device memory on the specified device, returning it to the allocator
- */
- cudaError_t DeviceFree(
- void* d_ptr,
- int device)
- {
- #ifdef __CUDA_ARCH__
- // Caching functionality only defined on host
- return CubDebug(cudaErrorInvalidConfiguration);
- #else
-
- bool locked = false;
- int entrypoint_device = INVALID_DEVICE_ORDINAL;
- cudaError_t error = cudaSuccess;
-
- BlockDescriptor search_key(d_ptr, device);
-
- // Lock
- if (!locked) {
- Lock(&spin_lock);
- locked = true;
- }
-
- do {
- // Find corresponding block descriptor
- BusyBlocks::iterator block_itr = live_blocks.find(search_key);
- if (block_itr == live_blocks.end())
- {
- // Cannot find pointer
- if (CubDebug(error = cudaErrorUnknown)) break;
- }
- else
- {
- // Remove from live blocks
- search_key = *block_itr;
- live_blocks.erase(block_itr);
-
- // Check if we should keep the returned allocation
- if (cached_bytes[device] + search_key.bytes <= max_cached_bytes)
- {
- // Insert returned allocation into free blocks
- cached_blocks.insert(search_key);
- cached_bytes[device] += search_key.bytes;
-
- if (debug) CubLog("\tdevice %d returned %lld bytes. %lld available blocks cached (%lld bytes), %lld live blocks outstanding.\n",
- device, (long long) search_key.bytes, (long long) cached_blocks.size(), (long long) cached_bytes[device], (long long) live_blocks.size());
- }
- else
- {
- // Free the returned allocation. Unlock.
- if (locked) {
- Unlock(&spin_lock);
- locked = false;
- }
-
- // Set to specified device
- if (CubDebug(error = cudaGetDevice(&entrypoint_device))) break;
- if (CubDebug(error = cudaSetDevice(device))) break;
-
- // Free device memory
- if (CubDebug(error = cudaFree(d_ptr))) break;
-
- if (debug) CubLog("\tdevice %d freed %lld bytes. %lld available blocks cached (%lld bytes), %lld live blocks outstanding.\n",
- device, (long long) search_key.bytes, (long long) cached_blocks.size(), (long long) cached_bytes[device], (long long) live_blocks.size());
- }
- }
- } while (0);
-
- // Unlock
- if (locked) {
- Unlock(&spin_lock);
- locked = false;
- }
-
- // Attempt to revert back to entry-point device if necessary
- if (entrypoint_device != INVALID_DEVICE_ORDINAL)
- {
- if (CubDebug(error = cudaSetDevice(entrypoint_device))) return error;
- }
-
- return error;
-
- #endif // __CUDA_ARCH__
- }
-
-
- /**
- * \brief Frees a live allocation of device memory on the current device, returning it to the allocator
- */
- cudaError_t DeviceFree(
- void* d_ptr)
- {
- #ifdef __CUDA_ARCH__
- // Caching functionality only defined on host
- return CubDebug(cudaErrorInvalidConfiguration);
- #else
-
- int current_device;
- cudaError_t error = cudaSuccess;
-
- do {
- if (CubDebug(error = cudaGetDevice(&current_device))) break;
- if (CubDebug(error = DeviceFree(d_ptr, current_device))) break;
- } while(0);
-
- return error;
-
- #endif // __CUDA_ARCH__
- }
-
-
- /**
- * \brief Frees all cached device allocations on all devices
- */
- cudaError_t FreeAllCached()
- {
- #ifdef __CUDA_ARCH__
- // Caching functionality only defined on host
- return CubDebug(cudaErrorInvalidConfiguration);
- #else
-
- cudaError_t error = cudaSuccess;
- bool locked = false;
- int entrypoint_device = INVALID_DEVICE_ORDINAL;
- int current_device = INVALID_DEVICE_ORDINAL;
-
- // Lock
- if (!locked) {
- Lock(&spin_lock);
- locked = true;
- }
-
- while (!cached_blocks.empty())
- {
- // Get first block
- CachedBlocks::iterator begin = cached_blocks.begin();
-
- // Get entry-point device ordinal if necessary
- if (entrypoint_device == INVALID_DEVICE_ORDINAL)
- {
- if (CubDebug(error = cudaGetDevice(&entrypoint_device))) break;
- }
-
- // Set current device ordinal if necessary
- if (begin->device != current_device)
- {
- if (CubDebug(error = cudaSetDevice(begin->device))) break;
- current_device = begin->device;
- }
-
- // Free device memory
- if (CubDebug(error = cudaFree(begin->d_ptr))) break;
-
- // Reduce balance and erase entry (read the size first; the iterator is invalid after erase)
- size_t begin_bytes = begin->bytes;
- cached_bytes[current_device] -= begin_bytes;
- cached_blocks.erase(begin);
-
- if (debug) CubLog("\tdevice %d freed %lld bytes. %lld available blocks cached (%lld bytes), %lld live blocks outstanding.\n",
- current_device, (long long) begin_bytes, (long long) cached_blocks.size(), (long long) cached_bytes[current_device], (long long) live_blocks.size());
- }
-
- // Unlock
- if (locked) {
- Unlock(&spin_lock);
- locked = false;
- }
-
- // Attempt to revert back to entry-point device if necessary
- if (entrypoint_device != INVALID_DEVICE_ORDINAL)
- {
- if (CubDebug(error = cudaSetDevice(entrypoint_device))) return error;
- }
-
- return error;
-
- #endif // __CUDA_ARCH__
- }
-
-
- /**
- * \brief Destructor
- */
- virtual ~CachingDeviceAllocator()
- {
- if (!skip_cleanup)
- FreeAllCached();
- }
-
-};
-
-
-
-
-/** @} */ // end group UtilModule
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
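The bin arithmetic in the deleted CachingDeviceAllocator is easy to check by hand. The host-only sketch below reproduces the rounding performed by NearestPowerOf and the min/max-bin clamping done in DeviceAllocate, under the default configuration bin_growth = 8, min_bin = 3, max_bin = 7; it is illustrative code, not the CUB API.

#include <cstdio>
#include <cstddef>

// Round a request up to the nearest power of `base`, mirroring NearestPowerOf above.
static void NearestPowerOf(unsigned &power, std::size_t &rounded, unsigned base, std::size_t value)
{
    power = 0;
    rounded = 1;
    while (rounded < value) { rounded *= base; ++power; }
}

int main()
{
    const unsigned bin_growth = 8, min_bin = 3, max_bin = 7;
    const std::size_t requests[] = {100, 4000, 300000, 20000000};

    for (std::size_t bytes : requests)
    {
        unsigned bin;
        std::size_t bin_bytes;
        NearestPowerOf(bin, bin_bytes, bin_growth, bytes);
        if (bin < min_bin) { bin = min_bin; bin_bytes = 512; }   // smallest bin: 8^3 = 512B
        if (bin > max_bin) { bin_bytes = bytes; }                // too large to cache: served exactly
        std::printf("request %9zu -> bin %u, %9zu bytes%s\n",
                    bytes, bin, bin_bytes,
                    (bin > max_bin) ? " (never cached)" : "");
    }
    return 0;
}

Requests above the 2MB bin fall outside the cache and are allocated exactly, matching the behavior described in the allocator overview above.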
diff --git a/lib/kokkos/TPL/cub/util_arch.cuh b/lib/kokkos/TPL/cub/util_arch.cuh
deleted file mode 100755
index 232a33c4f..000000000
--- a/lib/kokkos/TPL/cub/util_arch.cuh
+++ /dev/null
@@ -1,295 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * Static architectural properties by SM version.
- */
-
-
-/******************************************************************************
- * Static architectural properties by SM version.
- *
- * "Device" reflects the PTX architecture targeted by the active compiler
- * pass. It provides useful compile-time statics within device code. E.g.,:
- *
- * __shared__ int smem[Device::WARP_THREADS];
- *
- * int padded_offset = threadIdx.x + (threadIdx.x >> Device::LOG_SMEM_BANKS);
- *
- ******************************************************************************/
-
-#pragma once
-
-#include "util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/**
- * \addtogroup UtilModule
- * @{
- */
-
-
-/// CUB_PTX_ARCH reflects the PTX version targeted by the active compiler pass (or zero during the host pass).
-#ifndef __CUDA_ARCH__
- #define CUB_PTX_ARCH 0
-#else
- #define CUB_PTX_ARCH __CUDA_ARCH__
-#endif
-
-
-/// Whether or not the source targeted by the active compiler pass is allowed to invoke device kernels or methods from the CUDA runtime API.
-#if !defined(__CUDA_ARCH__) || defined(CUB_CDP)
-#define CUB_RUNTIME_ENABLED
-#endif
-
-
-/// Execution space for destructors
-#if ((CUB_PTX_ARCH > 0) && (CUB_PTX_ARCH < 200))
- #define CUB_DESTRUCTOR __host__
-#else
- #define CUB_DESTRUCTOR __host__ __device__
-#endif
-
-
-/**
- * \brief Structure for statically reporting CUDA device properties, parameterized by SM architecture.
- *
- * The default specialization is for SM10.
- */
-template <int SM_ARCH>
-struct ArchProps
-{
- enum
- {
- LOG_WARP_THREADS =
- 5, /// Log of the number of threads per warp
- WARP_THREADS =
- 1 << LOG_WARP_THREADS, /// Number of threads per warp
- LOG_SMEM_BANKS =
- 4, /// Log of the number of smem banks
- SMEM_BANKS =
- 1 << LOG_SMEM_BANKS, /// The number of smem banks
- SMEM_BANK_BYTES =
- 4, /// Size of smem bank words
- SMEM_BYTES =
- 16 * 1024, /// Maximum SM shared memory
- SMEM_ALLOC_UNIT =
- 512, /// Smem allocation size in bytes
- REGS_BY_BLOCK =
- true, /// Whether or not the architecture allocates registers by block (or by warp)
- REG_ALLOC_UNIT =
- 256, /// Number of registers allocated at a time per block (or by warp)
- WARP_ALLOC_UNIT =
- 2, /// Granularity of warps for which registers are allocated
- MAX_SM_THREADS =
- 768, /// Maximum number of threads per SM
- MAX_SM_THREADBLOCKS =
- 8, /// Maximum number of thread blocks per SM
- MAX_BLOCK_THREADS =
- 512, /// Maximum number of thread per thread block
- MAX_SM_REGISTERS =
- 8 * 1024, /// Maximum number of registers per SM
- };
-};
-
-
-
-
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
-/**
- * Architecture properties for SM30
- */
-template <>
-struct ArchProps<300>
-{
- enum
- {
- LOG_WARP_THREADS = 5, // 32 threads per warp
- WARP_THREADS = 1 << LOG_WARP_THREADS,
- LOG_SMEM_BANKS = 5, // 32 banks
- SMEM_BANKS = 1 << LOG_SMEM_BANKS,
- SMEM_BANK_BYTES = 4, // 4 byte bank words
- SMEM_BYTES = 48 * 1024, // 48KB shared memory
- SMEM_ALLOC_UNIT = 256, // 256B smem allocation segment size
- REGS_BY_BLOCK = false, // Allocates registers by warp
- REG_ALLOC_UNIT = 256, // 256 registers allocated at a time per warp
- WARP_ALLOC_UNIT = 4, // Registers are allocated at a granularity of every 4 warps per threadblock
- MAX_SM_THREADS = 2048, // 2K max threads per SM
- MAX_SM_THREADBLOCKS = 16, // 16 max threadblocks per SM
- MAX_BLOCK_THREADS = 1024, // 1024 max threads per threadblock
- MAX_SM_REGISTERS = 64 * 1024, // 64K max registers per SM
- };
-
- // Callback utility
- template <typename T>
- static __host__ __device__ __forceinline__ void Callback(T &target, int sm_version)
- {
- target.template Callback<ArchProps>();
- }
-};
-
-
-/**
- * Architecture properties for SM20
- */
-template <>
-struct ArchProps<200>
-{
- enum
- {
- LOG_WARP_THREADS = 5, // 32 threads per warp
- WARP_THREADS = 1 << LOG_WARP_THREADS,
- LOG_SMEM_BANKS = 5, // 32 banks
- SMEM_BANKS = 1 << LOG_SMEM_BANKS,
- SMEM_BANK_BYTES = 4, // 4 byte bank words
- SMEM_BYTES = 48 * 1024, // 48KB shared memory
- SMEM_ALLOC_UNIT = 128, // 128B smem allocation segment size
- REGS_BY_BLOCK = false, // Allocates registers by warp
- REG_ALLOC_UNIT = 64, // 64 registers allocated at a time per warp
- WARP_ALLOC_UNIT = 2, // Registers are allocated at a granularity of every 2 warps per threadblock
- MAX_SM_THREADS = 1536, // 1536 max threads per SM
- MAX_SM_THREADBLOCKS = 8, // 8 max threadblocks per SM
- MAX_BLOCK_THREADS = 1024, // 1024 max threads per threadblock
- MAX_SM_REGISTERS = 32 * 1024, // 32K max registers per SM
- };
-
- // Callback utility
- template <typename T>
- static __host__ __device__ __forceinline__ void Callback(T &target, int sm_version)
- {
- if (sm_version > 200) {
- ArchProps<300>::Callback(target, sm_version);
- } else {
- target.template Callback<ArchProps>();
- }
- }
-};
-
-
-/**
- * Architecture properties for SM12
- */
-template <>
-struct ArchProps<120>
-{
- enum
- {
- LOG_WARP_THREADS = 5, // 32 threads per warp
- WARP_THREADS = 1 << LOG_WARP_THREADS,
- LOG_SMEM_BANKS = 4, // 16 banks
- SMEM_BANKS = 1 << LOG_SMEM_BANKS,
- SMEM_BANK_BYTES = 4, // 4 byte bank words
- SMEM_BYTES = 16 * 1024, // 16KB shared memory
- SMEM_ALLOC_UNIT = 512, // 512B smem allocation segment size
- REGS_BY_BLOCK = true, // Allocates registers by threadblock
- REG_ALLOC_UNIT = 512, // 512 registers allocated at time per threadblock
- WARP_ALLOC_UNIT = 2, // Registers are allocated at a granularity of every 2 warps per threadblock
- MAX_SM_THREADS = 1024, // 1024 max threads per SM
- MAX_SM_THREADBLOCKS = 8, // 8 max threadblocks per SM
- MAX_BLOCK_THREADS = 512, // 512 max threads per threadblock
- MAX_SM_REGISTERS = 16 * 1024, // 16K max registers per SM
- };
-
- // Callback utility
- template <typename T>
- static __host__ __device__ __forceinline__ void Callback(T &target, int sm_version)
- {
- if (sm_version > 120) {
- ArchProps<200>::Callback(target, sm_version);
- } else {
- target.template Callback<ArchProps>();
- }
- }
-};
-
-
-/**
- * Architecture properties for SM10. Derives from the default ArchProps specialization.
- */
-template <>
-struct ArchProps<100> : ArchProps<0>
-{
- // Callback utility
- template <typename T>
- static __host__ __device__ __forceinline__ void Callback(T &target, int sm_version)
- {
- if (sm_version > 100) {
- ArchProps<120>::Callback(target, sm_version);
- } else {
- target.template Callback<ArchProps>();
- }
- }
-};
-
-
-/**
- * Architecture properties for SM35
- */
-template <>
-struct ArchProps<350> : ArchProps<300> {}; // Derives from SM30
-
-/**
- * Architecture properties for SM21
- */
-template <>
-struct ArchProps<210> : ArchProps<200> {}; // Derives from SM20
-
-/**
- * Architecture properties for SM13
- */
-template <>
-struct ArchProps<130> : ArchProps<120> {}; // Derives from SM12
-
-/**
- * Architecture properties for SM11
- */
-template <>
-struct ArchProps<110> : ArchProps<100> {}; // Derives from SM10
-
-
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
-
-/**
- * \brief The architectural properties for the PTX version targeted by the active compiler pass.
- */
-struct PtxArchProps : ArchProps<CUB_PTX_ARCH> {};
-
-
-/** @} */ // end group UtilModule
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
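The ArchProps specializations above are chained through their Callback methods: dispatch starts at the oldest entry and walks forward until it reaches the newest specialization not exceeding the queried SM version, which then hands its compile-time constants back to the caller. The following stripped-down plain C++ sketch shows only that control flow; Props, Apply, and the constants are invented for illustration and are not the CUB types.

#include <cstdio>

// A tiny version of the ArchProps "chained callback" dispatch.
template <int VER> struct Props;

template <> struct Props<300>
{
    enum { SMEM_BYTES = 48 * 1024 };
    template <typename T> static void Callback(T &target, int) { target.template Apply<Props>(); }
};

template <> struct Props<200>
{
    enum { SMEM_BYTES = 48 * 1024 };
    template <typename T> static void Callback(T &target, int ver)
    {
        if (ver > 200) Props<300>::Callback(target, ver);
        else           target.template Apply<Props>();
    }
};

template <> struct Props<100>
{
    enum { SMEM_BYTES = 16 * 1024 };
    template <typename T> static void Callback(T &target, int ver)
    {
        if (ver > 100) Props<200>::Callback(target, ver);
        else           target.template Apply<Props>();
    }
};

struct Device
{
    int smem_bytes;
    template <typename P> void Apply() { smem_bytes = P::SMEM_BYTES; }
};

int main()
{
    Device d;
    Props<100>::Callback(d, 350);                  // start at the oldest entry, as Init() does
    std::printf("smem bytes: %d\n", d.smem_bytes); // prints 49152
    return 0;
}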
diff --git a/lib/kokkos/TPL/cub/util_debug.cuh b/lib/kokkos/TPL/cub/util_debug.cuh
deleted file mode 100755
index 2ac67d7d0..000000000
--- a/lib/kokkos/TPL/cub/util_debug.cuh
+++ /dev/null
@@ -1,115 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * Error and event logging routines.
- *
- * The following macro definitions are supported:
- * - \p CubLog. Simple event messages are printed to \p stdout.
- */
-
-#pragma once
-
-#include <stdio.h>
-#include "util_namespace.cuh"
-#include "util_arch.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/**
- * \addtogroup UtilModule
- * @{
- */
-
-
-/// CUB error reporting macro (prints error messages to stderr)
-#if (defined(DEBUG) || defined(_DEBUG))
- #define CUB_STDERR
-#endif
-
-
-
-/**
- * \brief %If \p CUB_STDERR is defined and \p error is not \p cudaSuccess, the corresponding error message is printed to \p stderr (or \p stdout in device code) along with the supplied source context.
- *
- * \return The CUDA error.
- */
-__host__ __device__ __forceinline__ cudaError_t Debug(
- cudaError_t error,
- const char* filename,
- int line)
-{
-#ifdef CUB_STDERR
- if (error)
- {
- #if (CUB_PTX_ARCH == 0)
- fprintf(stderr, "CUDA error %d [%s, %d]: %s\n", error, filename, line, cudaGetErrorString(error));
- fflush(stderr);
- #elif (CUB_PTX_ARCH >= 200)
- printf("CUDA error %d [block %d, thread %d, %s, %d]\n", error, blockIdx.x, threadIdx.x, filename, line);
- #endif
- }
-#endif
- return error;
-}
-
-
-/**
- * \brief Debug macro
- */
-#define CubDebug(e) cub::Debug((e), __FILE__, __LINE__)
-
-
-/**
- * \brief Debug macro with exit
- */
-#define CubDebugExit(e) if (cub::Debug((e), __FILE__, __LINE__)) { exit(1); }
-
-
-/**
- * \brief Log macro for printf statements.
- */
-#if (CUB_PTX_ARCH == 0)
- #define CubLog(format, ...) printf(format,__VA_ARGS__);
-#elif (CUB_PTX_ARCH >= 200)
- #define CubLog(format, ...) printf("[block %d, thread %d]: " format, blockIdx.x, threadIdx.x, __VA_ARGS__);
-#endif
-
-
-
-
-/** @} */ // end group UtilModule
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
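The error-handling convention in the deleted util_debug.cuh is to wrap every runtime call in CubDebug, which prints the failing file and line when CUB_STDERR is defined and hands the error code back unchanged, so call sites keep a simple check-and-break structure. Below is a self-contained host-side mock of that pattern; err_t, Debug, CHECK, and might_fail are stand-ins invented for this sketch and are not CUB or CUDA symbols.

#include <cstdio>

// Self-contained mock of the CubDebug pattern: wrap each call, print the
// source location on failure, and pass the error code through unchanged.
typedef int err_t;                  // stand-in for cudaError_t
static const err_t success = 0;

static err_t Debug(err_t error, const char *file, int line)
{
    if (error) std::fprintf(stderr, "error %d [%s, %d]\n", error, file, line);
    return error;
}
#define CHECK(e) Debug((e), __FILE__, __LINE__)

static err_t might_fail(int x) { return (x < 0) ? 1 : success; }

int main()
{
    err_t error = success;
    do
    {
        if (CHECK(error = might_fail(3)))  break;   // succeeds, no output
        if (CHECK(error = might_fail(-1))) break;   // fails, prints file and line
    } while (0);
    return error;
}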
diff --git a/lib/kokkos/TPL/cub/util_device.cuh b/lib/kokkos/TPL/cub/util_device.cuh
deleted file mode 100755
index 0631b924a..000000000
--- a/lib/kokkos/TPL/cub/util_device.cuh
+++ /dev/null
@@ -1,378 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * Properties of a given CUDA device and the corresponding PTX bundle
- */
-
-#pragma once
-
-#include "util_arch.cuh"
-#include "util_debug.cuh"
-#include "util_namespace.cuh"
-#include "util_macro.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/**
- * \addtogroup UtilModule
- * @{
- */
-
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
-
-/**
- * Empty kernel for querying PTX manifest metadata (e.g., version) for the current device
- */
-template <typename T>
-__global__ void EmptyKernel(void) { }
-
-
-/**
- * Alias temporaries to externally-allocated device storage (or simply return the amount of storage needed).
- */
-template <int ALLOCATIONS>
-__host__ __device__ __forceinline__
-cudaError_t AliasTemporaries(
- void *d_temp_storage, ///< [in] %Device allocation of temporary storage. When NULL, the required allocation size is returned in \p temp_storage_bytes and no work is done.
- size_t &temp_storage_bytes, ///< [in,out] Size in bytes of \p d_temp_storage allocation
- void* (&allocations)[ALLOCATIONS], ///< [in,out] Pointers to device allocations needed
- size_t (&allocation_sizes)[ALLOCATIONS]) ///< [in] Sizes in bytes of device allocations needed
-{
- const int ALIGN_BYTES = 256;
- const int ALIGN_MASK = ~(ALIGN_BYTES - 1);
-
- // Compute exclusive prefix sum over allocation requests
- size_t bytes_needed = 0;
- for (int i = 0; i < ALLOCATIONS; ++i)
- {
- size_t allocation_bytes = (allocation_sizes[i] + ALIGN_BYTES - 1) & ALIGN_MASK;
- allocation_sizes[i] = bytes_needed;
- bytes_needed += allocation_bytes;
- }
-
- // Check if the caller is simply requesting the size of the storage allocation
- if (!d_temp_storage)
- {
- temp_storage_bytes = bytes_needed;
- return cudaSuccess;
- }
-
- // Check if enough storage provided
- if (temp_storage_bytes < bytes_needed)
- {
- return CubDebug(cudaErrorMemoryAllocation);
- }
-
- // Alias
- for (int i = 0; i < ALLOCATIONS; ++i)
- {
- allocations[i] = static_cast<char*>(d_temp_storage) + allocation_sizes[i];
- }
-
- return cudaSuccess;
-}
-
-
-
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
-
-
-/**
- * \brief Retrieves the PTX version (major * 100 + minor * 10)
- */
-__host__ __device__ __forceinline__ cudaError_t PtxVersion(int &ptx_version)
-{
-#ifndef CUB_RUNTIME_ENABLED
-
- // CUDA API calls not supported from this device
- return cudaErrorInvalidConfiguration;
-
-#else
-
- cudaError_t error = cudaSuccess;
- do
- {
- cudaFuncAttributes empty_kernel_attrs;
- if (CubDebug(error = cudaFuncGetAttributes(&empty_kernel_attrs, EmptyKernel<void>))) break;
- ptx_version = empty_kernel_attrs.ptxVersion * 10;
- }
- while (0);
-
- return error;
-
-#endif
-}
-
-
-/**
- * Synchronize the stream if specified
- */
-__host__ __device__ __forceinline__
-static cudaError_t SyncStream(cudaStream_t stream)
-{
-#ifndef __CUDA_ARCH__
- return cudaStreamSynchronize(stream);
-#else
- // Device can't yet sync on a specific stream
- return cudaDeviceSynchronize();
-#endif
-}
-
-
-
-/**
- * \brief Properties of a given CUDA device and the corresponding PTX bundle
- */
-class Device
-{
-private:
-
- /// Type definition of the EmptyKernel kernel entry point
- typedef void (*EmptyKernelPtr)();
-
- /// Force EmptyKernel<void> to be generated if this class is used
- __host__ __device__ __forceinline__
- EmptyKernelPtr Empty()
- {
- return EmptyKernel<void>;
- }
-
-public:
-
- // Version information
- int sm_version; ///< SM version of target device (SM version X.YZ in XYZ integer form)
- int ptx_version; ///< Bundled PTX version for target device (PTX version X.YZ in XYZ integer form)
-
- // Target device properties
- int sm_count; ///< Number of SMs
- int warp_threads; ///< Number of threads per warp
- int smem_bank_bytes; ///< Number of bytes per SM bank
- int smem_banks; ///< Number of smem banks
- int smem_bytes; ///< Smem bytes per SM
- int smem_alloc_unit; ///< Smem segment size
- bool regs_by_block; ///< Whether registers are allocated by threadblock (or by warp)
- int reg_alloc_unit; ///< Granularity of register allocation within the SM
- int warp_alloc_unit; ///< Granularity of warp allocation within the SM
- int max_sm_threads; ///< Maximum number of threads per SM
- int max_sm_blocks; ///< Maximum number of threadblocks per SM
- int max_block_threads; ///< Maximum number of threads per threadblock
- int max_sm_registers; ///< Maximum number of registers per SM
- int max_sm_warps; ///< Maximum number of warps per SM
-
- /**
- * Callback for initializing device properties
- */
- template <typename ArchProps>
- __host__ __device__ __forceinline__ void Callback()
- {
- warp_threads = ArchProps::WARP_THREADS;
- smem_bank_bytes = ArchProps::SMEM_BANK_BYTES;
- smem_banks = ArchProps::SMEM_BANKS;
- smem_bytes = ArchProps::SMEM_BYTES;
- smem_alloc_unit = ArchProps::SMEM_ALLOC_UNIT;
- regs_by_block = ArchProps::REGS_BY_BLOCK;
- reg_alloc_unit = ArchProps::REG_ALLOC_UNIT;
- warp_alloc_unit = ArchProps::WARP_ALLOC_UNIT;
- max_sm_threads = ArchProps::MAX_SM_THREADS;
- max_sm_blocks = ArchProps::MAX_SM_THREADBLOCKS;
- max_block_threads = ArchProps::MAX_BLOCK_THREADS;
- max_sm_registers = ArchProps::MAX_SM_REGISTERS;
- max_sm_warps = max_sm_threads / warp_threads;
- }
-
-
-public:
-
- /**
- * Initializer. Properties are retrieved for the specified GPU ordinal.
- */
- __host__ __device__ __forceinline__
- cudaError_t Init(int device_ordinal)
- {
- #ifndef CUB_RUNTIME_ENABLED
-
- // CUDA API calls not supported from this device
- return CubDebug(cudaErrorInvalidConfiguration);
-
- #else
-
- cudaError_t error = cudaSuccess;
- do
- {
- // Fill in SM version
- int major, minor;
- if (CubDebug(error = cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device_ordinal))) break;
- if (CubDebug(error = cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, device_ordinal))) break;
- sm_version = major * 100 + minor * 10;
-
- // Fill in static SM properties
- // Initialize our device properties via callback from static device properties
- ArchProps<100>::Callback(*this, sm_version);
-
- // Fill in SM count
- if (CubDebug(error = cudaDeviceGetAttribute (&sm_count, cudaDevAttrMultiProcessorCount, device_ordinal))) break;
-
- // Fill in PTX version
- #if CUB_PTX_ARCH > 0
- ptx_version = CUB_PTX_ARCH;
- #else
- if (CubDebug(error = PtxVersion(ptx_version))) break;
- #endif
-
- }
- while (0);
-
- return error;
-
- #endif
- }
-
-
- /**
- * Initializer. Properties are retrieved for the current GPU ordinal.
- */
- __host__ __device__ __forceinline__
- cudaError_t Init()
- {
- #ifndef CUB_RUNTIME_ENABLED
-
- // CUDA API calls not supported from this device
- return CubDebug(cudaErrorInvalidConfiguration);
-
- #else
-
- cudaError_t error = cudaSuccess;
- do
- {
- int device_ordinal;
- if ((error = CubDebug(cudaGetDevice(&device_ordinal)))) break;
- if ((error = Init(device_ordinal))) break;
- }
- while (0);
- return error;
-
- #endif
- }
-
-
- /**
- * Computes maximum SM occupancy in thread blocks for the given kernel
- */
- template <typename KernelPtr>
- __host__ __device__ __forceinline__
- cudaError_t MaxSmOccupancy(
- int &max_sm_occupancy, ///< [out] maximum number of thread blocks that can reside on a single SM
- KernelPtr kernel_ptr, ///< [in] Kernel pointer for which to compute SM occupancy
- int block_threads) ///< [in] Number of threads per thread block
- {
- #ifndef CUB_RUNTIME_ENABLED
-
- // CUDA API calls not supported from this device
- return CubDebug(cudaErrorInvalidConfiguration);
-
- #else
-
- cudaError_t error = cudaSuccess;
- do
- {
- // Get kernel attributes
- cudaFuncAttributes kernel_attrs;
- if (CubDebug(error = cudaFuncGetAttributes(&kernel_attrs, kernel_ptr))) break;
-
- // Number of warps per threadblock
- int block_warps = (block_threads + warp_threads - 1) / warp_threads;
-
- // Max warp occupancy
- int max_warp_occupancy = (block_warps > 0) ?
- max_sm_warps / block_warps :
- max_sm_blocks;
-
- // Maximum register occupancy
- int max_reg_occupancy;
- if ((block_threads == 0) || (kernel_attrs.numRegs == 0))
- {
- // Prevent divide-by-zero
- max_reg_occupancy = max_sm_blocks;
- }
- else if (regs_by_block)
- {
- // Allocates registers by threadblock
- int block_regs = CUB_ROUND_UP_NEAREST(kernel_attrs.numRegs * warp_threads * block_warps, reg_alloc_unit);
- max_reg_occupancy = max_sm_registers / block_regs;
- }
- else
- {
- // Allocates registers by warp
- int sm_sides = warp_alloc_unit;
- int sm_registers_per_side = max_sm_registers / sm_sides;
- int regs_per_warp = CUB_ROUND_UP_NEAREST(kernel_attrs.numRegs * warp_threads, reg_alloc_unit);
- int warps_per_side = sm_registers_per_side / regs_per_warp;
- int warps = warps_per_side * sm_sides;
- max_reg_occupancy = warps / block_warps;
- }
-
- // Shared memory per threadblock
- int block_allocated_smem = CUB_ROUND_UP_NEAREST(
- kernel_attrs.sharedSizeBytes,
- smem_alloc_unit);
-
- // Max shared memory occupancy
- int max_smem_occupancy = (block_allocated_smem > 0) ?
- (smem_bytes / block_allocated_smem) :
- max_sm_blocks;
-
- // Max occupancy
- max_sm_occupancy = CUB_MIN(
- CUB_MIN(max_sm_blocks, max_warp_occupancy),
- CUB_MIN(max_smem_occupancy, max_reg_occupancy));
-
-// printf("max_smem_occupancy(%d), max_warp_occupancy(%d), max_reg_occupancy(%d)", max_smem_occupancy, max_warp_occupancy, max_reg_occupancy);
-
- } while (0);
-
- return error;
-
- #endif
- }
-
-};
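A short sketch of how the Device class above might be used to estimate per-SM occupancy; ExampleKernel and the 256-thread block size are hypothetical:

// Hypothetical kernel, used only to have a concrete function pointer to query.
__global__ void ExampleKernel(int *d_data)
{
    if (d_data) d_data[threadIdx.x] = threadIdx.x;
}

// Query the current device and estimate how many 256-thread blocks of
// ExampleKernel can reside on one SM at a time.
cudaError_t QueryOccupancy(int &blocks_per_sm)
{
    cub::Device device_props;
    cudaError_t error = device_props.Init();   // properties of the current device
    if (error != cudaSuccess) return error;
    return device_props.MaxSmOccupancy(blocks_per_sm, ExampleKernel, 256);
}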
-
-
-/** @} */ // end group UtilModule
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
diff --git a/lib/kokkos/TPL/cub/util_iterator.cuh b/lib/kokkos/TPL/cub/util_iterator.cuh
deleted file mode 100755
index 08b574ca5..000000000
--- a/lib/kokkos/TPL/cub/util_iterator.cuh
+++ /dev/null
@@ -1,718 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * Random-access iterator types
- */
-
-#pragma once
-
-#include "thread/thread_load.cuh"
-#include "util_device.cuh"
-#include "util_debug.cuh"
-#include "util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/******************************************************************************
- * Texture references
- *****************************************************************************/
-
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
-// Anonymous namespace
-namespace {
-
-/// Templated texture reference type
-template <typename T>
-struct TexIteratorRef
-{
- // Texture reference type
- typedef texture<T, cudaTextureType1D, cudaReadModeElementType> TexRef;
-
- static TexRef ref;
-
- /**
- * Bind texture
- */
- static cudaError_t BindTexture(void *d_in)
- {
- cudaChannelFormatDesc tex_desc = cudaCreateChannelDesc<T>();
- if (d_in)
- return (CubDebug(cudaBindTexture(NULL, ref, d_in, tex_desc)));
-
- return cudaSuccess;
- }
-
- /**
- * Unbind textures
- */
- static cudaError_t UnbindTexture()
- {
- return CubDebug(cudaUnbindTexture(ref));
- }
-};
-
-// Texture reference definitions
-template <typename Value>
-typename TexIteratorRef<Value>::TexRef TexIteratorRef<Value>::ref = 0;
-
-} // Anonymous namespace
-
-
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
-
-
-
-
-
-
-/**
- * \addtogroup UtilModule
- * @{
- */
-
-
-/******************************************************************************
- * Iterators
- *****************************************************************************/
-
-/**
- * \brief A simple random-access iterator pointing to a range of constant values
- *
- * \par Overview
- * ConstantIteratorRA is a random-access iterator that, when dereferenced, always
- * returns the supplied constant of type \p OutputType.
- *
- * \tparam OutputType The value type of this iterator
- */
-template <typename OutputType>
-class ConstantIteratorRA
-{
-public:
-
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
- typedef ConstantIteratorRA self_type;
- typedef OutputType value_type;
- typedef OutputType reference;
- typedef OutputType* pointer;
- typedef std::random_access_iterator_tag iterator_category;
- typedef int difference_type;
-
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
-private:
-
- OutputType val;
-
-public:
-
- /// Constructor
- __host__ __device__ __forceinline__ ConstantIteratorRA(
- const OutputType &val) ///< Constant value for the iterator instance to report
- :
- val(val)
- {}
-
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
- __host__ __device__ __forceinline__ self_type operator++()
- {
- self_type i = *this;
- return i;
- }
-
- __host__ __device__ __forceinline__ self_type operator++(int junk)
- {
- return *this;
- }
-
- __host__ __device__ __forceinline__ reference operator*()
- {
- return val;
- }
-
- template <typename SizeT>
- __host__ __device__ __forceinline__ self_type operator+(SizeT n)
- {
- return ConstantIteratorRA(val);
- }
-
- template <typename SizeT>
- __host__ __device__ __forceinline__ self_type operator-(SizeT n)
- {
- return ConstantIteratorRA(val);
- }
-
- template <typename SizeT>
- __host__ __device__ __forceinline__ reference operator[](SizeT n)
- {
- return val;
- }
-
- __host__ __device__ __forceinline__ pointer operator->()
- {
- return &val;
- }
-
- __host__ __device__ __forceinline__ bool operator==(const self_type& rhs)
- {
- return (val == rhs.val);
- }
-
- __host__ __device__ __forceinline__ bool operator!=(const self_type& rhs)
- {
- return (val != rhs.val);
- }
-
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
-};
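A minimal sketch of the behaviour described in the overview above; the values in the comments follow directly from the definition:

// Every dereference of a ConstantIteratorRA yields the wrapped constant,
// no matter how the iterator is advanced or subscripted.
__host__ __device__ void ConstantIteratorExample()
{
    cub::ConstantIteratorRA<float> ones(1.0f);
    float a = *ones;          // 1.0f
    float b = ones[100];      // 1.0f (the offset is ignored)
    float c = *(ones + 42);   // 1.0f
    (void) a; (void) b; (void) c;
}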
-
-
-
-/**
- * \brief A simple random-access transform iterator for applying a transformation operator.
- *
- * \par Overview
- * TransformIteratorRA is a random-access iterator that wraps both a native
- * device pointer of type <tt>InputType*</tt> and a unary conversion functor of
- * type \p ConversionOp. \p OutputType references are made by pulling \p InputType
- * values through the \p ConversionOp instance.
- *
- * \tparam InputType The value type of the pointer being wrapped
- * \tparam ConversionOp Unary functor type for mapping objects of type \p InputType to type \p OutputType. Must have member <tt>OutputType operator()(const InputType &datum)</tt>.
- * \tparam OutputType The value type of this iterator
- */
-template <typename OutputType, typename ConversionOp, typename InputType>
-class TransformIteratorRA
-{
-public:
-
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
- typedef TransformIteratorRA self_type;
- typedef OutputType value_type;
- typedef OutputType reference;
- typedef OutputType* pointer;
- typedef std::random_access_iterator_tag iterator_category;
- typedef int difference_type;
-
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
-private:
-
- ConversionOp conversion_op;
- InputType* ptr;
-
-public:
-
- /**
- * \brief Constructor
- * @param ptr Native pointer to wrap
- * @param conversion_op Unary transformation functor
- */
- __host__ __device__ __forceinline__ TransformIteratorRA(InputType* ptr, ConversionOp conversion_op) :
- conversion_op(conversion_op),
- ptr(ptr) {}
-
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
- __host__ __device__ __forceinline__ self_type operator++()
- {
- self_type i = *this;
- ptr++;
- return i;
- }
-
- __host__ __device__ __forceinline__ self_type operator++(int junk)
- {
- ptr++;
- return *this;
- }
-
- __host__ __device__ __forceinline__ reference operator*()
- {
- return conversion_op(*ptr);
- }
-
- template <typename SizeT>
- __host__ __device__ __forceinline__ self_type operator+(SizeT n)
- {
- TransformIteratorRA retval(ptr + n, conversion_op);
- return retval;
- }
-
- template <typename SizeT>
- __host__ __device__ __forceinline__ self_type operator-(SizeT n)
- {
- TransformIteratorRA retval(ptr - n, conversion_op);
- return retval;
- }
-
- template <typename SizeT>
- __host__ __device__ __forceinline__ reference operator[](SizeT n)
- {
- return conversion_op(ptr[n]);
- }
-
- __host__ __device__ __forceinline__ pointer operator->()
- {
- return &conversion_op(*ptr);
- }
-
- __host__ __device__ __forceinline__ bool operator==(const self_type& rhs)
- {
- return (ptr == rhs.ptr);
- }
-
- __host__ __device__ __forceinline__ bool operator!=(const self_type& rhs)
- {
- return (ptr != rhs.ptr);
- }
-
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
-};
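A small sketch of the wrapping pattern described above; SqrtOp and the buffer names are hypothetical:

// Unary conversion functor: present each int as its square root in float.
struct SqrtOp
{
    __host__ __device__ __forceinline__ float operator()(const int &x) const
    {
        return sqrtf((float) x);
    }
};

// Dereferences and subscripts of the wrapped pointer are pulled through SqrtOp.
__global__ void TransformExample(int *d_in, float *d_out, int n)
{
    cub::TransformIteratorRA<float, SqrtOp, int> itr(d_in, SqrtOp());
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_out[i] = itr[i];    // equivalent to sqrtf((float) d_in[i])
}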
-
-
-
-/**
- * \brief A simple random-access iterator for loading primitive values through texture cache.
- *
- * \par Overview
- * TexIteratorRA is a random-access iterator that wraps a native
- * device pointer of type <tt>T*</tt>. References made through TexIteratorRA
- * cause values to be pulled through texture cache.
- *
- * \par Usage Considerations
- * - Can only be used with primitive types (e.g., \p char, \p int, \p float), with the exception of \p double
- * - Only one TexIteratorRA or TexTransformIteratorRA of a certain value type can be bound at any given time (per host thread)
- *
- * \tparam T The value type of the pointer being wrapped
- */
-template <typename T>
-class TexIteratorRA
-{
-public:
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
- typedef TexIteratorRA self_type;
- typedef T value_type;
- typedef T reference;
- typedef T* pointer;
- typedef std::random_access_iterator_tag iterator_category;
- typedef int difference_type;
-
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
- /// Tag identifying iterator type as being texture-bindable
- typedef void TexBindingTag;
-
-private:
-
- T* ptr;
- size_t tex_align_offset;
- cudaTextureObject_t tex_obj;
-
-public:
-
- /**
- * \brief Constructor
- */
- __host__ __device__ __forceinline__ TexIteratorRA()
- :
- ptr(NULL),
- tex_align_offset(0),
- tex_obj(0)
- {}
-
- /// \brief Bind iterator to texture reference
- cudaError_t BindTexture(
- T *ptr, ///< Native pointer to wrap that is aligned to cudaDeviceProp::textureAlignment
- size_t bytes, ///< Number of items
- size_t tex_align_offset = 0) ///< Offset (in items) from ptr denoting the position of the iterator
- {
- this->ptr = ptr;
- this->tex_align_offset = tex_align_offset;
-
- int ptx_version;
- cudaError_t error = cudaSuccess;
- if (CubDebug(error = PtxVersion(ptx_version))) return error;
- if (ptx_version >= 300)
- {
- // Use texture object
- cudaChannelFormatDesc channel_desc = cudaCreateChannelDesc<T>();
- cudaResourceDesc res_desc;
- cudaTextureDesc tex_desc;
- memset(&res_desc, 0, sizeof(cudaResourceDesc));
- memset(&tex_desc, 0, sizeof(cudaTextureDesc));
- res_desc.resType = cudaResourceTypeLinear;
- res_desc.res.linear.devPtr = ptr;
- res_desc.res.linear.desc = channel_desc;
- res_desc.res.linear.sizeInBytes = bytes;
- tex_desc.readMode = cudaReadModeElementType;
- return cudaCreateTextureObject(&tex_obj, &res_desc, &tex_desc, NULL);
- }
- else
- {
- // Use texture reference
- return TexIteratorRef<T>::BindTexture(ptr);
- }
- }
-
- /// \brief Unbind iterator to texture reference
- cudaError_t UnbindTexture()
- {
- int ptx_version;
- cudaError_t error = cudaSuccess;
- if (CubDebug(error = PtxVersion(ptx_version))) return error;
- if (ptx_version < 300)
- {
- // Use texture reference
- return TexIteratorRef<T>::UnbindTexture();
- }
- else
- {
- // Use texture object
- return cudaDestroyTextureObject(tex_obj);
- }
- }
-
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
- __host__ __device__ __forceinline__ self_type operator++()
- {
- self_type i = *this;
- ptr++;
- tex_align_offset++;
- return i;
- }
-
- __host__ __device__ __forceinline__ self_type operator++(int junk)
- {
- ptr++;
- tex_align_offset++;
- return *this;
- }
-
- __host__ __device__ __forceinline__ reference operator*()
- {
-#if (CUB_PTX_ARCH == 0)
- // Simply dereference the pointer on the host
- return *ptr;
-#elif (CUB_PTX_ARCH < 300)
- // Use the texture reference
- return tex1Dfetch(TexIteratorRef<T>::ref, tex_align_offset);
-#else
- // Use the texture object
- return tex1Dfetch<T>(tex_obj, tex_align_offset);
-#endif
- }
-
- template <typename SizeT>
- __host__ __device__ __forceinline__ self_type operator+(SizeT n)
- {
- TexIteratorRA retval;
- retval.ptr = ptr + n;
- retval.tex_align_offset = tex_align_offset + n;
- return retval;
- }
-
- template <typename SizeT>
- __host__ __device__ __forceinline__ self_type operator-(SizeT n)
- {
- TexIteratorRA retval;
- retval.ptr = ptr - n;
- retval.tex_align_offset = tex_align_offset - n;
- return retval;
- }
-
- template <typename SizeT>
- __host__ __device__ __forceinline__ reference operator[](SizeT n)
- {
-#if (CUB_PTX_ARCH == 0)
- // Simply dereference the pointer on the host
- return ptr[n];
-#elif (CUB_PTX_ARCH < 300)
- // Use the texture reference
- return tex1Dfetch(TexIteratorRef<T>::ref, tex_align_offset + n);
-#else
- // Use the texture object
- return tex1Dfetch<T>(tex_obj, tex_align_offset + n);
-#endif
- }
-
- __host__ __device__ __forceinline__ pointer operator->()
- {
-#if (CUB_PTX_ARCH == 0)
- // Simply dereference the pointer on the host
- return &(*ptr);
-#elif (CUB_PTX_ARCH < 300)
- // Use the texture reference
- return &(tex1Dfetch(TexIteratorRef<T>::ref, tex_align_offset));
-#else
- // Use the texture object
- return &(tex1Dfetch<T>(tex_obj, tex_align_offset));
-#endif
- }
-
- __host__ __device__ __forceinline__ bool operator==(const self_type& rhs)
- {
- return (ptr == rhs.ptr);
- }
-
- __host__ __device__ __forceinline__ bool operator!=(const self_type& rhs)
- {
- return (ptr != rhs.ptr);
- }
-
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
-};
-
-
-/**
- * \brief A simple random-access transform iterator for loading primitive values through texture cache and subsequently applying a transformation operator.
- *
- * \par Overview
- * TexTransformIteratorRA is a random-access iterator that wraps both a native
- * device pointer of type <tt>InputType*</tt> and a unary conversion functor of
- * type \p ConversionOp. \p OutputType references are made by pulling \p InputType
- * values through the texture cache and then transforming them using the
- * \p ConversionOp instance.
- *
- * \par Usage Considerations
- * - Can only be used with primitive types (e.g., \p char, \p int, \p float), with the exception of \p double
- * - Only one TexIteratorRA or TexTransformIteratorRA of a certain \p InputType can be bound at any given time (per host thread)
- *
- * \tparam InputType The value type of the pointer being wrapped
- * \tparam ConversionOp Unary functor type for mapping objects of type \p InputType to type \p OutputType. Must have member <tt>OutputType operator()(const InputType &datum)</tt>.
- * \tparam OutputType The value type of this iterator
- */
-template <typename OutputType, typename ConversionOp, typename InputType>
-class TexTransformIteratorRA
-{
-public:
-
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
- typedef TexTransformIteratorRA self_type;
- typedef OutputType value_type;
- typedef OutputType reference;
- typedef OutputType* pointer;
- typedef std::random_access_iterator_tag iterator_category;
- typedef int difference_type;
-
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
- /// Tag identifying iterator type as being texture-bindable
- typedef void TexBindingTag;
-
-private:
-
- ConversionOp conversion_op;
- InputType* ptr;
- size_t tex_align_offset;
- cudaTextureObject_t tex_obj;
-
-public:
-
- /**
- * \brief Constructor
- */
- TexTransformIteratorRA(
- ConversionOp conversion_op) ///< Unary transformation functor
- :
- conversion_op(conversion_op),
- ptr(NULL),
- tex_align_offset(0),
- tex_obj(0)
- {}
-
- /// \brief Bind iterator to texture reference
- cudaError_t BindTexture(
- InputType* ptr, ///< Native pointer to wrap that is aligned to cudaDeviceProp::textureAlignment
- size_t bytes, ///< Number of items
- size_t tex_align_offset = 0) ///< Offset (in items) from ptr denoting the position of the iterator
- {
- this->ptr = ptr;
- this->tex_align_offset = tex_align_offset;
-
- int ptx_version;
- cudaError_t error = cudaSuccess;
- if (CubDebug(error = PtxVersion(ptx_version))) return error;
- if (ptx_version >= 300)
- {
- // Use texture object
- cudaChannelFormatDesc channel_desc = cudaCreateChannelDesc<InputType>();
- cudaResourceDesc res_desc;
- cudaTextureDesc tex_desc;
- memset(&res_desc, 0, sizeof(cudaResourceDesc));
- memset(&tex_desc, 0, sizeof(cudaTextureDesc));
- res_desc.resType = cudaResourceTypeLinear;
- res_desc.res.linear.devPtr = ptr;
- res_desc.res.linear.desc = channel_desc;
- res_desc.res.linear.sizeInBytes = bytes;
- tex_desc.readMode = cudaReadModeElementType;
- return cudaCreateTextureObject(&tex_obj, &res_desc, &tex_desc, NULL);
- }
- else
- {
- // Use texture reference
- return TexIteratorRef<InputType>::BindTexture(ptr);
- }
- }
-
- /// \brief Unbind iterator to texture reference
- cudaError_t UnbindTexture()
- {
- int ptx_version;
- cudaError_t error = cudaSuccess;
- if (CubDebug(error = PtxVersion(ptx_version))) return error;
- if (ptx_version >= 300)
- {
- // Use texture object
- return cudaDestroyTextureObject(tex_obj);
- }
- else
- {
- // Use texture reference
- return TexIteratorRef<InputType>::UnbindTexture();
- }
- }
-
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
- __host__ __device__ __forceinline__ self_type operator++()
- {
- self_type i = *this;
- ptr++;
- tex_align_offset++;
- return i;
- }
-
- __host__ __device__ __forceinline__ self_type operator++(int junk)
- {
- ptr++;
- tex_align_offset++;
- return *this;
- }
-
- __host__ __device__ __forceinline__ reference operator*()
- {
-#if (CUB_PTX_ARCH == 0)
- // Simply dereference the pointer on the host
- return conversion_op(*ptr);
-#elif (CUB_PTX_ARCH < 300)
- // Use the texture reference
- return conversion_op(tex1Dfetch(TexIteratorRef<InputType>::ref, tex_align_offset));
-#else
- // Use the texture object
- return conversion_op(tex1Dfetch<InputType>(tex_obj, tex_align_offset));
-#endif
- }
-
- template <typename SizeT>
- __host__ __device__ __forceinline__ self_type operator+(SizeT n)
- {
- TexTransformIteratorRA retval(conversion_op);
- retval.ptr = ptr + n;
- retval.tex_align_offset = tex_align_offset + n;
- return retval;
- }
-
- template <typename SizeT>
- __host__ __device__ __forceinline__ self_type operator-(SizeT n)
- {
- TexTransformIteratorRA retval(conversion_op);
- retval.ptr = ptr - n;
- retval.tex_align_offset = tex_align_offset - n;
- return retval;
- }
-
- template <typename SizeT>
- __host__ __device__ __forceinline__ reference operator[](SizeT n)
- {
-#if (CUB_PTX_ARCH == 0)
- // Simply dereference the pointer on the host
- return conversion_op(ptr[n]);
-#elif (CUB_PTX_ARCH < 300)
- // Use the texture reference
- return conversion_op(tex1Dfetch(TexIteratorRef<InputType>::ref, tex_align_offset + n));
-#else
- // Use the texture object
- return conversion_op(tex1Dfetch<InputType>(tex_obj, tex_align_offset + n));
-#endif
- }
-
- __host__ __device__ __forceinline__ pointer operator->()
- {
-#if (CUB_PTX_ARCH == 0)
- // Simply dereference the pointer on the host
- return &conversion_op(*ptr);
-#elif (CUB_PTX_ARCH < 300)
- // Use the texture reference
- return &conversion_op(tex1Dfetch(TexIteratorRef<InputType>::ref, tex_align_offset));
-#else
- // Use the texture object
- return &conversion_op(tex1Dfetch<InputType>(tex_obj, tex_align_offset));
-#endif
- }
-
- __host__ __device__ __forceinline__ bool operator==(const self_type& rhs)
- {
- return (ptr == rhs.ptr);
- }
-
- __host__ __device__ __forceinline__ bool operator!=(const self_type& rhs)
- {
- return (ptr != rhs.ptr);
- }
-
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
-};
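A host-side sketch of the bind/unbind lifecycle shared by the two texture-backed iterators above; d_in and the element count are hypothetical:

// Bind a device buffer of n ints to a TexIteratorRA, use it, then unbind.
// TexTransformIteratorRA follows the same pattern, with a conversion functor
// supplied at construction.
cudaError_t TexIteratorLifecycle(int *d_in, size_t n)
{
    cub::TexIteratorRA<int> itr;

    cudaError_t error = itr.BindTexture(d_in, n * sizeof(int));
    if (error != cudaSuccess) return error;

    // ... launch kernels that dereference itr; on SM30+ reads go through a
    // texture object, on older devices through the shared texture reference ...

    return itr.UnbindTexture();
}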
-
-
-
-
-/** @} */ // end group UtilModule
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
diff --git a/lib/kokkos/TPL/cub/util_macro.cuh b/lib/kokkos/TPL/cub/util_macro.cuh
deleted file mode 100755
index 091fd93c5..000000000
--- a/lib/kokkos/TPL/cub/util_macro.cuh
+++ /dev/null
@@ -1,107 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/******************************************************************************
- * Common C/C++ macro utilities
- ******************************************************************************/
-
-#pragma once
-
-#include "util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/**
- * \addtogroup UtilModule
- * @{
- */
-
-/**
- * Align struct
- */
-#if defined(_WIN32) || defined(_WIN64)
- #define CUB_ALIGN(bytes) __declspec(align(32))
-#else
- #define CUB_ALIGN(bytes) __attribute__((aligned(bytes)))
-#endif
-
-/**
- * Select maximum(a, b)
- */
-#define CUB_MAX(a, b) (((a) > (b)) ? (a) : (b))
-
-/**
- * Select minimum(a, b)
- */
-#define CUB_MIN(a, b) (((a) < (b)) ? (a) : (b))
-
-/**
- * Quotient of x/y rounded down to nearest integer
- */
-#define CUB_QUOTIENT_FLOOR(x, y) ((x) / (y))
-
-/**
- * Quotient of x/y rounded up to nearest integer
- */
-#define CUB_QUOTIENT_CEILING(x, y) (((x) + (y) - 1) / (y))
-
-/**
- * x rounded up to the nearest multiple of y
- */
-#define CUB_ROUND_UP_NEAREST(x, y) ((((x) + (y) - 1) / (y)) * (y))
-
-/**
- * x rounded down to the nearest multiple of y
- */
-#define CUB_ROUND_DOWN_NEAREST(x, y) (((x) / (y)) * (y))
-
-/**
- * Return character string for given type
- */
-#define CUB_TYPE_STRING(type) ""#type
-
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
- #define CUB_CAT_(a, b) a ## b
- #define CUB_CAT(a, b) CUB_CAT_(a, b)
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
-/**
- * Static assert
- */
-#define CUB_STATIC_ASSERT(cond, msg) typedef int CUB_CAT(cub_static_assert, __LINE__)[(cond) ? 1 : -1]
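Worked values for the rounding macros above, with x = 10 and y = 4:

// CUB_QUOTIENT_CEILING(10, 4)   = (10 + 4 - 1) / 4   = 3
// CUB_ROUND_UP_NEAREST(10, 4)   = ((10 + 3) / 4) * 4 = 12
// CUB_ROUND_DOWN_NEAREST(10, 4) = (10 / 4) * 4       = 8
int num_tiles    = CUB_QUOTIENT_CEILING(10, 4);   // 3
int padded_size  = CUB_ROUND_UP_NEAREST(10, 4);   // 12
int trimmed_size = CUB_ROUND_DOWN_NEAREST(10, 4); // 8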
-
-
-/** @} */ // end group UtilModule
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
diff --git a/lib/kokkos/TPL/cub/util_namespace.cuh b/lib/kokkos/TPL/cub/util_namespace.cuh
deleted file mode 100755
index 869ecc613..000000000
--- a/lib/kokkos/TPL/cub/util_namespace.cuh
+++ /dev/null
@@ -1,41 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * Place-holder for prefixing the cub namespace
- */
-
-#pragma once
-
-// For example:
-//#define CUB_NS_PREFIX namespace thrust{ namespace detail {
-//#define CUB_NS_POSTFIX } }
-
-#define CUB_NS_PREFIX
-#define CUB_NS_POSTFIX
diff --git a/lib/kokkos/TPL/cub/util_ptx.cuh b/lib/kokkos/TPL/cub/util_ptx.cuh
deleted file mode 100755
index ad80b0401..000000000
--- a/lib/kokkos/TPL/cub/util_ptx.cuh
+++ /dev/null
@@ -1,380 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * PTX intrinsics
- */
-
-
-#pragma once
-
-#include "util_type.cuh"
-#include "util_arch.cuh"
-#include "util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/**
- * \addtogroup UtilModule
- * @{
- */
-
-
-/******************************************************************************
- * PTX helper macros
- ******************************************************************************/
-
-/**
- * Register modifier for pointer-types (for inlining PTX assembly)
- */
-#if defined(_WIN64) || defined(__LP64__)
- #define __CUB_LP64__ 1
- // 64-bit register modifier for inlined asm
- #define _CUB_ASM_PTR_ "l"
- #define _CUB_ASM_PTR_SIZE_ "u64"
-#else
- #define __CUB_LP64__ 0
- // 32-bit register modifier for inlined asm
- #define _CUB_ASM_PTR_ "r"
- #define _CUB_ASM_PTR_SIZE_ "u32"
-#endif
-
-
-/******************************************************************************
- * Inlined PTX intrinsics
- ******************************************************************************/
-
-/**
- * Shift-right then add. Returns (x >> shift) + addend.
- */
-__device__ __forceinline__ unsigned int SHR_ADD(
- unsigned int x,
- unsigned int shift,
- unsigned int addend)
-{
- unsigned int ret;
-#if __CUDA_ARCH__ >= 200
- asm("vshr.u32.u32.u32.clamp.add %0, %1, %2, %3;" :
- "=r"(ret) : "r"(x), "r"(shift), "r"(addend));
-#else
- ret = (x >> shift) + addend;
-#endif
- return ret;
-}
-
-
-/**
- * Shift-left then add. Returns (x << shift) + addend.
- */
-__device__ __forceinline__ unsigned int SHL_ADD(
- unsigned int x,
- unsigned int shift,
- unsigned int addend)
-{
- unsigned int ret;
-#if __CUDA_ARCH__ >= 200
- asm("vshl.u32.u32.u32.clamp.add %0, %1, %2, %3;" :
- "=r"(ret) : "r"(x), "r"(shift), "r"(addend));
-#else
- ret = (x << shift) + addend;
-#endif
- return ret;
-}
-
-
-/**
- * Bitfield-extract.
- */
-template <typename UnsignedBits>
-__device__ __forceinline__ unsigned int BFE(
- UnsignedBits source,
- unsigned int bit_start,
- unsigned int num_bits)
-{
- unsigned int bits;
-#if __CUDA_ARCH__ >= 200
- asm("bfe.u32 %0, %1, %2, %3;" : "=r"(bits) : "r"((unsigned int) source), "r"(bit_start), "r"(num_bits));
-#else
- const unsigned int MASK = (1 << num_bits) - 1;
- bits = (source >> bit_start) & MASK;
-#endif
- return bits;
-}
-
-
-/**
- * Bitfield-extract for 64-bit types.
- */
-__device__ __forceinline__ unsigned int BFE(
- unsigned long long source,
- unsigned int bit_start,
- unsigned int num_bits)
-{
- const unsigned long long MASK = (1ull << num_bits) - 1;
- return (source >> bit_start) & MASK;
-}
-
-
-/**
- * Bitfield insert. Inserts the first num_bits of y into x starting at bit_start
- */
-__device__ __forceinline__ void BFI(
- unsigned int &ret,
- unsigned int x,
- unsigned int y,
- unsigned int bit_start,
- unsigned int num_bits)
-{
-#if __CUDA_ARCH__ >= 200
- asm("bfi.b32 %0, %1, %2, %3, %4;" :
- "=r"(ret) : "r"(y), "r"(x), "r"(bit_start), "r"(num_bits));
-#else
- // TODO
-#endif
-}
-
-
-/**
- * Three-operand add
- */
-__device__ __forceinline__ unsigned int IADD3(unsigned int x, unsigned int y, unsigned int z)
-{
-#if __CUDA_ARCH__ >= 200
- asm("vadd.u32.u32.u32.add %0, %1, %2, %3;" : "=r"(x) : "r"(x), "r"(y), "r"(z));
-#else
- x = x + y + z;
-#endif
- return x;
-}
-
-
-/**
- * Byte-permute. Pick four arbitrary bytes from two 32-bit registers, and
- * reassemble them into a 32-bit destination register
- */
-__device__ __forceinline__ int PRMT(unsigned int a, unsigned int b, unsigned int index)
-{
- int ret;
- asm("prmt.b32 %0, %1, %2, %3;" : "=r"(ret) : "r"(a), "r"(b), "r"(index));
- return ret;
-}
-
-
-/**
- * Sync-threads barrier.
- */
-__device__ __forceinline__ void BAR(int count)
-{
- asm volatile("bar.sync 1, %0;" : : "r"(count));
-}
-
-
-/**
- * Floating point multiply. (Mantissa LSB rounds towards zero.)
- */
-__device__ __forceinline__ float FMUL_RZ(float a, float b)
-{
- float d;
- asm("mul.rz.f32 %0, %1, %2;" : "=f"(d) : "f"(a), "f"(b));
- return d;
-}
-
-
-/**
- * Floating point multiply-add. (Mantissa LSB rounds towards zero.)
- */
-__device__ __forceinline__ float FFMA_RZ(float a, float b, float c)
-{
- float d;
- asm("fma.rz.f32 %0, %1, %2, %3;" : "=f"(d) : "f"(a), "f"(b), "f"(c));
- return d;
-}
-
-
-/**
- * Terminates the calling thread
- */
-__device__ __forceinline__ void ThreadExit() {
- asm("exit;");
-}
-
-
-/**
- * Returns the warp lane ID of the calling thread
- */
-__device__ __forceinline__ unsigned int LaneId()
-{
- unsigned int ret;
- asm("mov.u32 %0, %laneid;" : "=r"(ret) );
- return ret;
-}
-
-
-/**
- * Returns the warp ID of the calling thread
- */
-__device__ __forceinline__ unsigned int WarpId()
-{
- unsigned int ret;
- asm("mov.u32 %0, %warpid;" : "=r"(ret) );
- return ret;
-}
-
-/**
- * Returns the warp lane mask of all lanes less than the calling thread
- */
-__device__ __forceinline__ unsigned int LaneMaskLt()
-{
- unsigned int ret;
- asm("mov.u32 %0, %lanemask_lt;" : "=r"(ret) );
- return ret;
-}
-
-/**
- * Returns the warp lane mask of all lanes less than or equal to the calling thread
- */
-__device__ __forceinline__ unsigned int LaneMaskLe()
-{
- unsigned int ret;
- asm("mov.u32 %0, %lanemask_le;" : "=r"(ret) );
- return ret;
-}
-
-/**
- * Returns the warp lane mask of all lanes greater than the calling thread
- */
-__device__ __forceinline__ unsigned int LaneMaskGt()
-{
- unsigned int ret;
- asm("mov.u32 %0, %lanemask_gt;" : "=r"(ret) );
- return ret;
-}
-
-/**
- * Returns the warp lane mask of all lanes greater than or equal to the calling thread
- */
-__device__ __forceinline__ unsigned int LaneMaskGe()
-{
- unsigned int ret;
- asm("mov.u32 %0, %lanemask_ge;" : "=r"(ret) );
- return ret;
-}
-
-/**
- * Portable implementation of __all
- */
-__device__ __forceinline__ int WarpAll(int cond)
-{
-#if CUB_PTX_ARCH < 120
-
- __shared__ volatile int warp_signals[PtxArchProps::MAX_SM_THREADS / PtxArchProps::WARP_THREADS];
-
- if (LaneId() == 0)
- warp_signals[WarpId()] = 1;
-
- if (cond == 0)
- warp_signals[WarpId()] = 0;
-
- return warp_signals[WarpId()];
-
-#else
-
- return __all(cond);
-
-#endif
-}
-
-
-/**
- * Portable implementation of __any
- */
-__device__ __forceinline__ int WarpAny(int cond)
-{
-#if CUB_PTX_ARCH < 120
-
- __shared__ volatile int warp_signals[PtxArchProps::MAX_SM_THREADS / PtxArchProps::WARP_THREADS];
-
- if (LaneId() == 0)
- warp_signals[WarpId()] = 0;
-
- if (cond)
- warp_signals[WarpId()] = 1;
-
- return warp_signals[WarpId()];
-
-#else
-
- return __any(cond);
-
-#endif
-}
-
-
-/// Generic shuffle-up
-template <typename T>
-__device__ __forceinline__ T ShuffleUp(
- T input, ///< [in] The value to broadcast
- int src_offset) ///< [in] The up-offset of the peer to read from
-{
- enum
- {
- SHFL_C = 0,
- };
-
- typedef typename WordAlignment<T>::ShuffleWord ShuffleWord;
-
- const int WORDS = (sizeof(T) + sizeof(ShuffleWord) - 1) / sizeof(ShuffleWord);
- T output;
- ShuffleWord *output_alias = reinterpret_cast<ShuffleWord *>(&output);
- ShuffleWord *input_alias = reinterpret_cast<ShuffleWord *>(&input);
-
- #pragma unroll
- for (int WORD = 0; WORD < WORDS; ++WORD)
- {
- unsigned int shuffle_word = input_alias[WORD];
- asm(
- " shfl.up.b32 %0, %1, %2, %3;"
- : "=r"(shuffle_word) : "r"(shuffle_word), "r"(src_offset), "r"(SHFL_C));
- output_alias[WORD] = (ShuffleWord) shuffle_word;
- }
-
- return output;
-}
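A device-side sketch of ShuffleUp as defined above; it assumes an SM30+ device (the shfl instruction) and a one-warp launch:

// Each thread of a single warp reads the value held one lane below it. The
// word-by-word decomposition above makes this work for arbitrary value types,
// not just 32-bit words.
__global__ void ShuffleUpExample(int *d_out)
{
    int lane_value = threadIdx.x;                    // one-warp launch assumed
    int from_below = cub::ShuffleUp(lane_value, 1);  // value of lane (threadIdx.x - 1);
                                                     // lane 0 follows shfl.up clamping
    d_out[threadIdx.x] = from_below;
}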
-
-
-
-/** @} */ // end group UtilModule
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
diff --git a/lib/kokkos/TPL/cub/util_type.cuh b/lib/kokkos/TPL/cub/util_type.cuh
deleted file mode 100755
index 836aa0f04..000000000
--- a/lib/kokkos/TPL/cub/util_type.cuh
+++ /dev/null
@@ -1,685 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * Common type manipulation (metaprogramming) utilities
- */
-
-#pragma once
-
-#include <iostream>
-#include <limits>
-
-#include "util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/**
- * \addtogroup UtilModule
- * @{
- */
-
-
-
-/******************************************************************************
- * Type equality
- ******************************************************************************/
-
-/**
- * \brief Type selection (<tt>IF ? ThenType : ElseType</tt>)
- */
-template <bool IF, typename ThenType, typename ElseType>
-struct If
-{
- /// Conditional type result
- typedef ThenType Type; // true
-};
-
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
-template <typename ThenType, typename ElseType>
-struct If<false, ThenType, ElseType>
-{
- typedef ElseType Type; // false
-};
-
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
-
-/******************************************************************************
- * Conditional types
- ******************************************************************************/
-
-
-/**
- * \brief Type equality test
- */
-template <typename A, typename B>
-struct Equals
-{
- enum {
- VALUE = 0,
- NEGATE = 1
- };
-};
-
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
-template <typename A>
-struct Equals <A, A>
-{
- enum {
- VALUE = 1,
- NEGATE = 0
- };
-};
-
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
-
-/******************************************************************************
- * Marker types
- ******************************************************************************/
-
-/**
- * \brief A simple "NULL" marker type
- */
-struct NullType
-{
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
- template <typename T>
- __host__ __device__ __forceinline__ NullType& operator =(const T& b) { return *this; }
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-};
-
-
-/**
- * \brief Allows for the treatment of an integral constant as a type at compile-time (e.g., to achieve static call dispatch based on constant integral values)
- */
-template <int A>
-struct Int2Type
-{
- enum {VALUE = A};
-};
-
-
-/******************************************************************************
- * Size and alignment
- ******************************************************************************/
-
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
-template <typename T>
-struct WordAlignment
-{
- struct Pad
- {
- T val;
- char byte;
- };
-
- enum
- {
- /// The alignment of T in bytes
- ALIGN_BYTES = sizeof(Pad) - sizeof(T)
- };
-
- /// Biggest shuffle word that T is a whole multiple of and is not larger than the alignment of T
- typedef typename If<(ALIGN_BYTES % 4 == 0),
- int,
- typename If<(ALIGN_BYTES % 2 == 0),
- short,
- char>::Type>::Type ShuffleWord;
-
- /// Biggest volatile word that T is a whole multiple of and is not larger than the alignment of T
- typedef typename If<(ALIGN_BYTES % 8 == 0),
- long long,
- ShuffleWord>::Type VolatileWord;
-
- /// Biggest memory-access word that T is a whole multiple of and is not larger than the alignment of T
- typedef typename If<(ALIGN_BYTES % 16 == 0),
- longlong2,
- typename If<(ALIGN_BYTES % 8 == 0),
- long long, // needed to get heterogeneous PODs to work on all platforms
- ShuffleWord>::Type>::Type DeviceWord;
-
- enum
- {
- DEVICE_MULTIPLE = sizeof(DeviceWord) / sizeof(T)
- };
-
- struct UninitializedBytes
- {
- char buf[sizeof(T)];
- };
-
- struct UninitializedShuffleWords
- {
- ShuffleWord buf[sizeof(T) / sizeof(ShuffleWord)];
- };
-
- struct UninitializedVolatileWords
- {
- VolatileWord buf[sizeof(T) / sizeof(VolatileWord)];
- };
-
- struct UninitializedDeviceWords
- {
- DeviceWord buf[sizeof(T) / sizeof(DeviceWord)];
- };
-
-
-};
-
-
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
-
-/******************************************************************************
- * Wrapper types
- ******************************************************************************/
-
-/**
- * \brief A storage-backing wrapper that allows types with non-trivial constructors to be aliased in unions
- */
-template <typename T>
-struct Uninitialized
-{
- /// Biggest memory-access word that T is a whole multiple of and is not larger than the alignment of T
- typedef typename WordAlignment<T>::DeviceWord DeviceWord;
-
- enum
- {
- WORDS = sizeof(T) / sizeof(DeviceWord)
- };
-
- /// Backing storage
- DeviceWord storage[WORDS];
-
- /// Alias
- __host__ __device__ __forceinline__ T& Alias()
- {
- return reinterpret_cast<T&>(*this);
- }
-};
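A minimal sketch of the aliasing pattern Uninitialized enables; NonTrivial is a hypothetical type with a default constructor:

// A type with a non-trivial constructor cannot normally be placed in
// __shared__ memory; Uninitialized provides raw backing storage instead.
struct NonTrivial
{
    float x;
    __host__ __device__ NonTrivial() : x(1.0f) {}
};

__global__ void UninitializedExample()
{
    __shared__ cub::Uninitialized<NonTrivial> storage;  // no constructor runs
    NonTrivial &alias = storage.Alias();                // reinterpret the storage
    alias.x = 2.0f;
}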
-
-
-/**
- * \brief A wrapper for passing simple static arrays as kernel parameters
- */
-template <typename T, int COUNT>
-struct ArrayWrapper
-{
- /// Static array of type \p T
- T array[COUNT];
-};
-
-
-/**
- * \brief Double-buffer storage wrapper for multi-pass stream transformations that require more than one storage array for streaming intermediate results back and forth.
- *
- * Many multi-pass computations require a pair of "ping-pong" storage
- * buffers (e.g., one for reading from and the other for writing to, and then
- * vice-versa for the subsequent pass). This structure wraps a set of device
- * buffers and a "selector" member to track which is "current".
- */
-template <typename T>
-struct DoubleBuffer
-{
- /// Pair of device buffer pointers
- T *d_buffers[2];
-
- /// Selector into \p d_buffers (i.e., the active/valid buffer)
- int selector;
-
- /// \brief Constructor
- __host__ __device__ __forceinline__ DoubleBuffer()
- {
- selector = 0;
- d_buffers[0] = NULL;
- d_buffers[1] = NULL;
- }
-
- /// \brief Constructor
- __host__ __device__ __forceinline__ DoubleBuffer(
- T *d_current, ///< The currently valid buffer
- T *d_alternate) ///< Alternate storage buffer of the same size as \p d_current
- {
- selector = 0;
- d_buffers[0] = d_current;
- d_buffers[1] = d_alternate;
- }
-
- /// \brief Return pointer to the currently valid buffer
- __host__ __device__ __forceinline__ T* Current() { return d_buffers[selector]; }
-};
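A small sketch of the ping-pong usage described above; d_buf_a and d_buf_b are hypothetical device arrays of equal size:

// Track which of two buffers holds the valid data across a multi-pass computation.
void DoubleBufferExample(int *d_buf_a, int *d_buf_b)
{
    cub::DoubleBuffer<int> d_keys(d_buf_a, d_buf_b);

    int *d_current = d_keys.Current();   // initially d_buf_a
    // ... a pass reads d_keys.d_buffers[d_keys.selector] and writes the other ...
    d_keys.selector ^= 1;                // flip after the pass
    d_current = d_keys.Current();        // now d_buf_b
    (void) d_current;
}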
-
-
-
-/******************************************************************************
- * Static math
- ******************************************************************************/
-
-/**
- * \brief Statically determine log2(N), rounded up.
- *
- * For example:
- * Log2<8>::VALUE // 3
- * Log2<3>::VALUE // 2
- */
-template <int N, int CURRENT_VAL = N, int COUNT = 0>
-struct Log2
-{
- /// Static logarithm value
- enum { VALUE = Log2<N, (CURRENT_VAL >> 1), COUNT + 1>::VALUE }; // Inductive case
-};
-
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-template <int N, int COUNT>
-struct Log2<N, 0, COUNT>
-{
- enum {VALUE = (1 << (COUNT - 1) < N) ? // Base case
- COUNT :
- COUNT - 1 };
-};
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
-
-/**
- * \brief Statically determine if N is a power-of-two
- */
-template <int N>
-struct PowerOfTwo
-{
- enum { VALUE = ((N & (N - 1)) == 0) };
-};
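Compile-time checks illustrating the two metafunctions above (using CUB_STATIC_ASSERT from util_macro.cuh):

// Log2 rounds up; PowerOfTwo tests the usual (N & (N - 1)) == 0 condition.
CUB_STATIC_ASSERT(cub::Log2<8>::VALUE == 3, "log2(8), rounded up, is 3");
CUB_STATIC_ASSERT(cub::Log2<3>::VALUE == 2, "log2(3), rounded up, is 2");
CUB_STATIC_ASSERT(cub::PowerOfTwo<64>::VALUE == 1, "64 is a power of two");
CUB_STATIC_ASSERT(cub::PowerOfTwo<48>::VALUE == 0, "48 is not a power of two");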
-
-
-
-/******************************************************************************
- * Pointer vs. iterator detection
- ******************************************************************************/
-
-
-/**
- * \brief Pointer vs. iterator
- */
-template <typename Tp>
-struct IsPointer
-{
- enum { VALUE = 0 };
-};
-
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
-template <typename Tp>
-struct IsPointer<Tp*>
-{
- enum { VALUE = 1 };
-};
-
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
-
-
-/******************************************************************************
- * Qualifier detection
- ******************************************************************************/
-
-/**
- * \brief Volatile modifier test
- */
-template <typename Tp>
-struct IsVolatile
-{
- enum { VALUE = 0 };
-};
-
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
-template <typename Tp>
-struct IsVolatile<Tp volatile>
-{
- enum { VALUE = 1 };
-};
-
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
-
-/******************************************************************************
- * Qualifier removal
- ******************************************************************************/
-
-/**
- * \brief Removes \p const and \p volatile qualifiers from type \p Tp.
- *
- * For example:
- * <tt>typename RemoveQualifiers<volatile int>::Type // int;</tt>
- */
-template <typename Tp, typename Up = Tp>
-struct RemoveQualifiers
-{
- /// Type without \p const and \p volatile qualifiers
- typedef Up Type;
-};
-
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
-template <typename Tp, typename Up>
-struct RemoveQualifiers<Tp, volatile Up>
-{
- typedef Up Type;
-};
-
-template <typename Tp, typename Up>
-struct RemoveQualifiers<Tp, const Up>
-{
- typedef Up Type;
-};
-
-template <typename Tp, typename Up>
-struct RemoveQualifiers<Tp, const volatile Up>
-{
- typedef Up Type;
-};
-
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
-
-
-/******************************************************************************
- * Typedef-detection
- ******************************************************************************/
-
-
-/**
- * \brief Defines a structure \p detector_name that is templated on type \p T. The \p detector_name struct exposes a constant member \p VALUE indicating whether or not parameter \p T exposes a nested type \p nested_type_name
- */
-#define CUB_DEFINE_DETECT_NESTED_TYPE(detector_name, nested_type_name) \
- template <typename T> \
- struct detector_name \
- { \
- template <typename C> \
- static char& test(typename C::nested_type_name*); \
- template <typename> \
- static int& test(...); \
- enum \
- { \
- VALUE = sizeof(test<T>(0)) < sizeof(int) \
- }; \
- };
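A minimal sketch of the detector macro above; the nested type name "category" and the two structs are hypothetical:

// Generate a detector for a nested typedef named "category".
CUB_DEFINE_DETECT_NESTED_TYPE(HasCategory, category)

struct WithCategory    { typedef int category; };
struct WithoutCategory { };

// sizeof-based SFINAE: the char& overload is chosen only when C::category exists.
CUB_STATIC_ASSERT(HasCategory<WithCategory>::VALUE == 1, "nested type detected");
CUB_STATIC_ASSERT(HasCategory<WithoutCategory>::VALUE == 0, "nested type absent");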
-
-
-
-/******************************************************************************
- * Simple enable-if (similar to Boost)
- ******************************************************************************/
-
-/**
- * \brief Simple enable-if (similar to Boost)
- */
-template <bool Condition, class T = void>
-struct EnableIf
-{
- /// Enable-if type for SFINAE dummy variables
- typedef T Type;
-};
-
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
-template <class T>
-struct EnableIf<false, T> {};
-
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
-
-/******************************************************************************
- * Typedef-detection
- ******************************************************************************/
-
-/**
- * \brief Determine whether or not BinaryOp's functor is of the form <tt>bool operator()(const T& a, const T&b)</tt> or <tt>bool operator()(const T& a, const T&b, unsigned int idx)</tt>
- */
-template <typename T, typename BinaryOp>
-struct BinaryOpHasIdxParam
-{
-private:
- template <typename BinaryOpT, bool (BinaryOpT::*)(const T &a, const T &b, unsigned int idx) const> struct SFINAE1 {};
- template <typename BinaryOpT, bool (BinaryOpT::*)(const T &a, const T &b, unsigned int idx)> struct SFINAE2 {};
- template <typename BinaryOpT, bool (BinaryOpT::*)(T a, T b, unsigned int idx) const> struct SFINAE3 {};
- template <typename BinaryOpT, bool (BinaryOpT::*)(T a, T b, unsigned int idx)> struct SFINAE4 {};
-
- template <typename BinaryOpT, bool (BinaryOpT::*)(const T &a, const T &b, int idx) const> struct SFINAE5 {};
- template <typename BinaryOpT, bool (BinaryOpT::*)(const T &a, const T &b, int idx)> struct SFINAE6 {};
- template <typename BinaryOpT, bool (BinaryOpT::*)(T a, T b, int idx) const> struct SFINAE7 {};
- template <typename BinaryOpT, bool (BinaryOpT::*)(T a, T b, int idx)> struct SFINAE8 {};
-
- template <typename BinaryOpT> static char Test(SFINAE1<BinaryOpT, &BinaryOpT::operator()> *);
- template <typename BinaryOpT> static char Test(SFINAE2<BinaryOpT, &BinaryOpT::operator()> *);
- template <typename BinaryOpT> static char Test(SFINAE3<BinaryOpT, &BinaryOpT::operator()> *);
- template <typename BinaryOpT> static char Test(SFINAE4<BinaryOpT, &BinaryOpT::operator()> *);
-
- template <typename BinaryOpT> static char Test(SFINAE5<BinaryOpT, &BinaryOpT::operator()> *);
- template <typename BinaryOpT> static char Test(SFINAE6<BinaryOpT, &BinaryOpT::operator()> *);
- template <typename BinaryOpT> static char Test(SFINAE7<BinaryOpT, &BinaryOpT::operator()> *);
- template <typename BinaryOpT> static char Test(SFINAE8<BinaryOpT, &BinaryOpT::operator()> *);
-
- template <typename BinaryOpT> static int Test(...);
-
-public:
-
- /// Whether the functor BinaryOp has a third <tt>unsigned int</tt> index param
- static const bool HAS_PARAM = sizeof(Test<BinaryOp>(NULL)) == sizeof(char);
-};
-
-
-
-/******************************************************************************
- * Simple type traits utilities.
- *
- * For example:
- * Traits<int>::CATEGORY // SIGNED_INTEGER
- * Traits<NullType>::NULL_TYPE // true
- * Traits<uint4>::CATEGORY // NOT_A_NUMBER
- * Traits<uint4>::PRIMITIVE; // false
- *
- ******************************************************************************/
-
-/**
- * \brief Basic type traits categories
- */
-enum Category
-{
- NOT_A_NUMBER,
- SIGNED_INTEGER,
- UNSIGNED_INTEGER,
- FLOATING_POINT
-};
-
-
-/**
- * \brief Basic type traits
- */
-template <Category _CATEGORY, bool _PRIMITIVE, bool _NULL_TYPE, typename _UnsignedBits>
-struct BaseTraits
-{
- /// Category
- static const Category CATEGORY = _CATEGORY;
- enum
- {
- PRIMITIVE = _PRIMITIVE,
- NULL_TYPE = _NULL_TYPE,
- };
-};
-
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
-/**
- * Basic type traits (unsigned primitive specialization)
- */
-template <typename _UnsignedBits>
-struct BaseTraits<UNSIGNED_INTEGER, true, false, _UnsignedBits>
-{
- typedef _UnsignedBits UnsignedBits;
-
- static const Category CATEGORY = UNSIGNED_INTEGER;
- static const UnsignedBits MIN_KEY = UnsignedBits(0);
- static const UnsignedBits MAX_KEY = UnsignedBits(-1);
-
- enum
- {
- PRIMITIVE = true,
- NULL_TYPE = false,
- };
-
-
- static __device__ __forceinline__ UnsignedBits TwiddleIn(UnsignedBits key)
- {
- return key;
- }
-
- static __device__ __forceinline__ UnsignedBits TwiddleOut(UnsignedBits key)
- {
- return key;
- }
-};
-
-
-/**
- * Basic type traits (signed primitive specialization)
- */
-template <typename _UnsignedBits>
-struct BaseTraits<SIGNED_INTEGER, true, false, _UnsignedBits>
-{
- typedef _UnsignedBits UnsignedBits;
-
- static const Category CATEGORY = SIGNED_INTEGER;
- static const UnsignedBits HIGH_BIT = UnsignedBits(1) << ((sizeof(UnsignedBits) * 8) - 1);
- static const UnsignedBits MIN_KEY = HIGH_BIT;
- static const UnsignedBits MAX_KEY = UnsignedBits(-1) ^ HIGH_BIT;
-
- enum
- {
- PRIMITIVE = true,
- NULL_TYPE = false,
- };
-
- static __device__ __forceinline__ UnsignedBits TwiddleIn(UnsignedBits key)
- {
- return key ^ HIGH_BIT;
- };
-
- static __device__ __forceinline__ UnsignedBits TwiddleOut(UnsignedBits key)
- {
- return key ^ HIGH_BIT;
- };
-
-};
-
-
-/**
- * Basic type traits (fp primitive specialization)
- */
-template <typename _UnsignedBits>
-struct BaseTraits<FLOATING_POINT, true, false, _UnsignedBits>
-{
- typedef _UnsignedBits UnsignedBits;
-
- static const Category CATEGORY = FLOATING_POINT;
- static const UnsignedBits HIGH_BIT = UnsignedBits(1) << ((sizeof(UnsignedBits) * 8) - 1);
- static const UnsignedBits MIN_KEY = UnsignedBits(-1);
- static const UnsignedBits MAX_KEY = UnsignedBits(-1) ^ HIGH_BIT;
-
- static __device__ __forceinline__ UnsignedBits TwiddleIn(UnsignedBits key)
- {
- UnsignedBits mask = (key & HIGH_BIT) ? UnsignedBits(-1) : HIGH_BIT;
- return key ^ mask;
- };
-
- static __device__ __forceinline__ UnsignedBits TwiddleOut(UnsignedBits key)
- {
- UnsignedBits mask = (key & HIGH_BIT) ? HIGH_BIT : UnsignedBits(-1);
- return key ^ mask;
- };
-
- enum
- {
- PRIMITIVE = true,
- NULL_TYPE = false,
- };
-};
-
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
-
-/**
- * \brief Numeric type traits
- */
-template <typename T> struct NumericTraits : BaseTraits<NOT_A_NUMBER, false, false, T> {};
-
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
-template <> struct NumericTraits<NullType> : BaseTraits<NOT_A_NUMBER, false, true, NullType> {};
-
-template <> struct NumericTraits<char> : BaseTraits<(std::numeric_limits<char>::is_signed) ? SIGNED_INTEGER : UNSIGNED_INTEGER, true, false, unsigned char> {};
-template <> struct NumericTraits<signed char> : BaseTraits<SIGNED_INTEGER, true, false, unsigned char> {};
-template <> struct NumericTraits<short> : BaseTraits<SIGNED_INTEGER, true, false, unsigned short> {};
-template <> struct NumericTraits<int> : BaseTraits<SIGNED_INTEGER, true, false, unsigned int> {};
-template <> struct NumericTraits<long> : BaseTraits<SIGNED_INTEGER, true, false, unsigned long> {};
-template <> struct NumericTraits<long long> : BaseTraits<SIGNED_INTEGER, true, false, unsigned long long> {};
-
-template <> struct NumericTraits<unsigned char> : BaseTraits<UNSIGNED_INTEGER, true, false, unsigned char> {};
-template <> struct NumericTraits<unsigned short> : BaseTraits<UNSIGNED_INTEGER, true, false, unsigned short> {};
-template <> struct NumericTraits<unsigned int> : BaseTraits<UNSIGNED_INTEGER, true, false, unsigned int> {};
-template <> struct NumericTraits<unsigned long> : BaseTraits<UNSIGNED_INTEGER, true, false, unsigned long> {};
-template <> struct NumericTraits<unsigned long long> : BaseTraits<UNSIGNED_INTEGER, true, false, unsigned long long> {};
-
-template <> struct NumericTraits<float> : BaseTraits<FLOATING_POINT, true, false, unsigned int> {};
-template <> struct NumericTraits<double> : BaseTraits<FLOATING_POINT, true, false, unsigned long long> {};
-
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
-
-/**
- * \brief Type traits
- */
-template <typename T>
-struct Traits : NumericTraits<typename RemoveQualifiers<T>::Type> {};
-
-
-
-/** @} */ // end group UtilModule
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
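The floating-point specialization above is what makes radix sorting of float keys work: TwiddleIn flips every bit of a negative value and only the sign bit of a non-negative value, so plain unsigned comparison of the twiddled keys reproduces numeric order. A minimal host-side sketch of that rule, assuming 32-bit IEEE-754 floats (function and variable names are illustrative, not part of CUB):

#include <cstdio>
#include <cstring>

static unsigned int twiddle_in(float f)
{
    unsigned int key;
    std::memcpy(&key, &f, sizeof(key));                     // reinterpret the float's bits
    unsigned int mask = (key & 0x80000000u) ? 0xFFFFFFFFu   // negative: flip every bit
                                            : 0x80000000u;  // non-negative: flip only the sign bit
    return key ^ mask;
}

int main()
{
    float a = -2.5f, b = -0.0f, c = 3.0f;
    // The twiddled keys increase monotonically with the float values.
    std::printf("%u < %u < %u\n", twiddle_in(a), twiddle_in(b), twiddle_in(c));
    return (twiddle_in(a) < twiddle_in(b) && twiddle_in(b) < twiddle_in(c)) ? 0 : 1;
}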
diff --git a/lib/kokkos/TPL/cub/util_vector.cuh b/lib/kokkos/TPL/cub/util_vector.cuh
deleted file mode 100755
index 9a432dc58..000000000
--- a/lib/kokkos/TPL/cub/util_vector.cuh
+++ /dev/null
@@ -1,166 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * Vector type inference utilities
- */
-
-#pragma once
-
-#include <iostream>
-
-#include "util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/**
- * \addtogroup UtilModule
- * @{
- */
-
-
-/******************************************************************************
- * Vector type inference utilities. For example:
- *
- * typename VectorHelper<unsigned int, 2>::Type // Aliases uint2
- *
- ******************************************************************************/
-
-/**
- * \brief Exposes a member typedef \p Type that names the corresponding CUDA vector type if one exists. Otherwise \p Type refers to the VectorHelper structure itself, which will wrap the corresponding \p x, \p y, etc. vector fields.
- */
-template <typename T, int vec_elements> struct VectorHelper;
-
-#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
-enum
-{
- /// The maximum number of elements in CUDA vector types
- MAX_VEC_ELEMENTS = 4,
-};
-
-
-/**
- * Generic vector-1 type
- */
-template <typename T>
-struct VectorHelper<T, 1>
-{
- enum { BUILT_IN = false };
-
- T x;
-
- typedef VectorHelper<T, 1> Type;
-};
-
-/**
- * Generic vector-2 type
- */
-template <typename T>
-struct VectorHelper<T, 2>
-{
- enum { BUILT_IN = false };
-
- T x;
- T y;
-
- typedef VectorHelper<T, 2> Type;
-};
-
-/**
- * Generic vector-3 type
- */
-template <typename T>
-struct VectorHelper<T, 3>
-{
- enum { BUILT_IN = false };
-
- T x;
- T y;
- T z;
-
- typedef VectorHelper<T, 3> Type;
-};
-
-/**
- * Generic vector-4 type
- */
-template <typename T>
-struct VectorHelper<T, 4>
-{
- enum { BUILT_IN = false };
-
- T x;
- T y;
- T z;
- T w;
-
- typedef VectorHelper<T, 4> Type;
-};
-
-/**
- * Macro for expanding partially-specialized built-in vector types
- */
-#define CUB_DEFINE_VECTOR_TYPE(base_type,short_type) \
- template<> struct VectorHelper<base_type, 1> { typedef short_type##1 Type; enum { BUILT_IN = true }; }; \
- template<> struct VectorHelper<base_type, 2> { typedef short_type##2 Type; enum { BUILT_IN = true }; }; \
- template<> struct VectorHelper<base_type, 3> { typedef short_type##3 Type; enum { BUILT_IN = true }; }; \
- template<> struct VectorHelper<base_type, 4> { typedef short_type##4 Type; enum { BUILT_IN = true }; };
-
-// Expand CUDA vector types for built-in primitives
-CUB_DEFINE_VECTOR_TYPE(char, char)
-CUB_DEFINE_VECTOR_TYPE(signed char, char)
-CUB_DEFINE_VECTOR_TYPE(short, short)
-CUB_DEFINE_VECTOR_TYPE(int, int)
-CUB_DEFINE_VECTOR_TYPE(long, long)
-CUB_DEFINE_VECTOR_TYPE(long long, longlong)
-CUB_DEFINE_VECTOR_TYPE(unsigned char, uchar)
-CUB_DEFINE_VECTOR_TYPE(unsigned short, ushort)
-CUB_DEFINE_VECTOR_TYPE(unsigned int, uint)
-CUB_DEFINE_VECTOR_TYPE(unsigned long, ulong)
-CUB_DEFINE_VECTOR_TYPE(unsigned long long, ulonglong)
-CUB_DEFINE_VECTOR_TYPE(float, float)
-CUB_DEFINE_VECTOR_TYPE(double, double)
-CUB_DEFINE_VECTOR_TYPE(bool, uchar)
-
-// Undefine macros
-#undef CUB_DEFINE_VECTOR_TYPE
-
-#endif // DOXYGEN_SHOULD_SKIP_THIS
-
-
-/** @} */ // end group UtilModule
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
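A short sketch of the intended use of the helper above (kernel name and pointer layout are illustrative; assumes the header is included and the buffers are 16-byte aligned): VectorHelper resolves to the built-in CUDA vector type when one exists, so four ints can be moved as a single transaction.

__global__ void copy4(const int *in, int *out)
{
    typedef cub::VectorHelper<int, 4>::Type Vec;        // aliases the built-in ::int4 (BUILT_IN == true)
    const Vec *src = reinterpret_cast<const Vec *>(in) + threadIdx.x;
    Vec       *dst = reinterpret_cast<Vec *>(out)      + threadIdx.x;
    *dst = *src;                                         // one 16-byte load/store per thread
}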
diff --git a/lib/kokkos/TPL/cub/warp/specializations/warp_reduce_shfl.cuh b/lib/kokkos/TPL/cub/warp/specializations/warp_reduce_shfl.cuh
deleted file mode 100755
index 317b62990..000000000
--- a/lib/kokkos/TPL/cub/warp/specializations/warp_reduce_shfl.cuh
+++ /dev/null
@@ -1,358 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * cub::WarpReduceShfl provides SHFL-based variants of parallel reduction across CUDA warps.
- */
-
-#pragma once
-
-#include "../../thread/thread_operators.cuh"
-#include "../../util_ptx.cuh"
-#include "../../util_type.cuh"
-#include "../../util_macro.cuh"
-#include "../../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/**
- * \brief WarpReduceShfl provides SHFL-based variants of parallel reduction across CUDA warps.
- */
-template <
- typename T, ///< Data type being reduced
- int LOGICAL_WARPS, ///< Number of logical warps entrant
- int LOGICAL_WARP_THREADS> ///< Number of threads per logical warp
-struct WarpReduceShfl
-{
- /******************************************************************************
- * Constants and typedefs
- ******************************************************************************/
-
- enum
- {
- /// The number of warp reduction steps
- STEPS = Log2<LOGICAL_WARP_THREADS>::VALUE,
-
- // The 5-bit SHFL mask for logically splitting warps into sub-segments
- SHFL_MASK = (-1 << STEPS) & 31,
-
- // The 5-bit SHFL clamp
- SHFL_CLAMP = LOGICAL_WARP_THREADS - 1,
-
- // The packed C argument (mask starts 8 bits up)
- SHFL_C = (SHFL_MASK << 8) | SHFL_CLAMP,
- };
-
-
- /// Shared memory storage layout type
- typedef NullType TempStorage;
-
-
- /******************************************************************************
- * Thread fields
- ******************************************************************************/
-
- int warp_id;
- int lane_id;
-
-
- /******************************************************************************
- * Construction
- ******************************************************************************/
-
- /// Constructor
- __device__ __forceinline__ WarpReduceShfl(
- TempStorage &temp_storage,
- int warp_id,
- int lane_id)
- :
- warp_id(warp_id),
- lane_id(lane_id)
- {}
-
-
- /******************************************************************************
- * Operation
- ******************************************************************************/
-
- /// Summation (single-SHFL)
- template <
- bool FULL_WARPS, ///< Whether all lanes in each warp are contributing a valid fold of items
- int FOLDED_ITEMS_PER_LANE> ///< Number of items folded into each lane
- __device__ __forceinline__ T Sum(
- T input, ///< [in] Calling thread's input
- int folded_items_per_warp, ///< [in] Total number of valid items folded into each logical warp
- Int2Type<true> single_shfl) ///< [in] Marker type indicating whether only one SHFL instruction is required
- {
- unsigned int output = reinterpret_cast<unsigned int &>(input);
-
- // Iterate reduction steps
- #pragma unroll
- for (int STEP = 0; STEP < STEPS; STEP++)
- {
- const int OFFSET = 1 << STEP;
-
- if (FULL_WARPS)
- {
- // Use predicate set from SHFL to guard against invalid peers
- asm(
- "{"
- " .reg .u32 r0;"
- " .reg .pred p;"
- " shfl.down.b32 r0|p, %1, %2, %3;"
- " @p add.u32 r0, r0, %4;"
- " mov.u32 %0, r0;"
- "}"
- : "=r"(output) : "r"(output), "r"(OFFSET), "r"(SHFL_C), "r"(output));
- }
- else
- {
- // Set range predicate to guard against invalid peers
- asm(
- "{"
- " .reg .u32 r0;"
- " .reg .pred p;"
- " shfl.down.b32 r0, %1, %2, %3;"
- " setp.lt.u32 p, %5, %6;"
- " mov.u32 %0, %1;"
- " @p add.u32 %0, %1, r0;"
- "}"
- : "=r"(output) : "r"(output), "r"(OFFSET), "r"(SHFL_C), "r"(output), "r"((lane_id + OFFSET) * FOLDED_ITEMS_PER_LANE), "r"(folded_items_per_warp));
- }
- }
-
- return output;
- }
-
-
- /// Summation (multi-SHFL)
- template <
- bool FULL_WARPS, ///< Whether all lanes in each warp are contributing a valid fold of items
- int FOLDED_ITEMS_PER_LANE> ///< Number of items folded into each lane
- __device__ __forceinline__ T Sum(
- T input, ///< [in] Calling thread's input
- int folded_items_per_warp, ///< [in] Total number of valid items folded into each logical warp
- Int2Type<false> single_shfl) ///< [in] Marker type indicating whether only one SHFL instruction is required
- {
- // Delegate to generic reduce
- return Reduce<FULL_WARPS, FOLDED_ITEMS_PER_LANE>(input, folded_items_per_warp, cub::Sum());
- }
-
-
- /// Summation (float)
- template <
- bool FULL_WARPS, ///< Whether all lanes in each warp are contributing a valid fold of items
- int FOLDED_ITEMS_PER_LANE> ///< Number of items folded into each lane
- __device__ __forceinline__ float Sum(
- float input, ///< [in] Calling thread's input
- int folded_items_per_warp) ///< [in] Total number of valid items folded into each logical warp
- {
- T output = input;
-
- // Iterate reduction steps
- #pragma unroll
- for (int STEP = 0; STEP < STEPS; STEP++)
- {
- const int OFFSET = 1 << STEP;
-
- if (FULL_WARPS)
- {
- // Use predicate set from SHFL to guard against invalid peers
- asm(
- "{"
- " .reg .f32 r0;"
- " .reg .pred p;"
- " shfl.down.b32 r0|p, %1, %2, %3;"
- " @p add.f32 r0, r0, %4;"
- " mov.f32 %0, r0;"
- "}"
- : "=f"(output) : "f"(output), "r"(OFFSET), "r"(SHFL_C), "f"(output));
- }
- else
- {
- // Set range predicate to guard against invalid peers
- asm(
- "{"
- " .reg .f32 r0;"
- " .reg .pred p;"
- " shfl.down.b32 r0, %1, %2, %3;"
- " setp.lt.u32 p, %5, %6;"
- " mov.f32 %0, %1;"
- " @p add.f32 %0, %0, r0;"
- "}"
- : "=f"(output) : "f"(output), "r"(OFFSET), "r"(SHFL_C), "f"(output), "r"((lane_id + OFFSET) * FOLDED_ITEMS_PER_LANE), "r"(folded_items_per_warp));
- }
- }
-
- return output;
- }
-
- /// Summation (generic)
- template <
- bool FULL_WARPS, ///< Whether all lanes in each warp are contributing a valid fold of items
- int FOLDED_ITEMS_PER_LANE, ///< Number of items folded into each lane
- typename _T>
- __device__ __forceinline__ _T Sum(
- _T input, ///< [in] Calling thread's input
- int folded_items_per_warp) ///< [in] Total number of valid items folded into each logical warp
- {
- // Whether sharing can be done with a single SHFL instruction (vs multiple SHFL instructions)
- Int2Type<(Traits<_T>::PRIMITIVE) && (sizeof(_T) <= sizeof(unsigned int))> single_shfl;
-
- return Sum<FULL_WARPS, FOLDED_ITEMS_PER_LANE>(input, folded_items_per_warp, single_shfl);
- }
-
-
- /// Reduction
- template <
- bool FULL_WARPS, ///< Whether all lanes in each warp are contributing a valid fold of items
- int FOLDED_ITEMS_PER_LANE, ///< Number of items folded into each lane
- typename ReductionOp>
- __device__ __forceinline__ T Reduce(
- T input, ///< [in] Calling thread's input
- int folded_items_per_warp, ///< [in] Total number of valid items folded into each logical warp
- ReductionOp reduction_op) ///< [in] Binary reduction operator
- {
- typedef typename WordAlignment<T>::ShuffleWord ShuffleWord;
-
- const int WORDS = (sizeof(T) + sizeof(ShuffleWord) - 1) / sizeof(ShuffleWord);
- T output = input;
- T temp;
- ShuffleWord *temp_alias = reinterpret_cast<ShuffleWord *>(&temp);
- ShuffleWord *output_alias = reinterpret_cast<ShuffleWord *>(&output);
-
- // Iterate scan steps
- #pragma unroll
- for (int STEP = 0; STEP < STEPS; STEP++)
- {
- // Grab addend from peer
- const int OFFSET = 1 << STEP;
-
- #pragma unroll
- for (int WORD = 0; WORD < WORDS; ++WORD)
- {
- unsigned int shuffle_word = output_alias[WORD];
- asm(
- " shfl.down.b32 %0, %1, %2, %3;"
- : "=r"(shuffle_word) : "r"(shuffle_word), "r"(OFFSET), "r"(SHFL_C));
- temp_alias[WORD] = (ShuffleWord) shuffle_word;
- }
-
- // Perform reduction op if from a valid peer
- if (FULL_WARPS)
- {
- if (lane_id < LOGICAL_WARP_THREADS - OFFSET)
- output = reduction_op(output, temp);
- }
- else
- {
- if (((lane_id + OFFSET) * FOLDED_ITEMS_PER_LANE) < folded_items_per_warp)
- output = reduction_op(output, temp);
- }
- }
-
- return output;
- }
-
-
- /// Segmented reduction
- template <
- bool HEAD_SEGMENTED, ///< Whether flags indicate a segment-head or a segment-tail
- typename Flag,
- typename ReductionOp>
- __device__ __forceinline__ T SegmentedReduce(
- T input, ///< [in] Calling thread's input
- Flag flag, ///< [in] Whether or not the current lane is a segment head/tail
- ReductionOp reduction_op) ///< [in] Binary reduction operator
- {
- typedef typename WordAlignment<T>::ShuffleWord ShuffleWord;
-
- T output = input;
-
- const int WORDS = (sizeof(T) + sizeof(ShuffleWord) - 1) / sizeof(ShuffleWord);
- T temp;
- ShuffleWord *temp_alias = reinterpret_cast<ShuffleWord *>(&temp);
- ShuffleWord *output_alias = reinterpret_cast<ShuffleWord *>(&output);
-
- // Get the start flags for each thread in the warp.
- int warp_flags = __ballot(flag);
-
- if (!HEAD_SEGMENTED)
- warp_flags <<= 1;
-
- // Keep bits above the current thread.
- warp_flags &= LaneMaskGt();
-
- // Accommodate packing of multiple logical warps in a single physical warp
- if ((LOGICAL_WARPS > 1) && (LOGICAL_WARP_THREADS < 32))
- warp_flags >>= (warp_id * LOGICAL_WARP_THREADS);
-
- // Find next flag
- int next_flag = __clz(__brev(warp_flags));
-
- // Clip the next segment at the warp boundary if necessary
- if (LOGICAL_WARP_THREADS != 32)
- next_flag = CUB_MIN(next_flag, LOGICAL_WARP_THREADS);
-
- // Iterate scan steps
- #pragma unroll
- for (int STEP = 0; STEP < STEPS; STEP++)
- {
- const int OFFSET = 1 << STEP;
-
- // Grab addend from peer
- #pragma unroll
- for (int WORD = 0; WORD < WORDS; ++WORD)
- {
- unsigned int shuffle_word = output_alias[WORD];
-
- asm(
- " shfl.down.b32 %0, %1, %2, %3;"
- : "=r"(shuffle_word) : "r"(shuffle_word), "r"(OFFSET), "r"(SHFL_C));
- temp_alias[WORD] = (ShuffleWord) shuffle_word;
-
- }
-
- // Perform reduction op if valid
- if (OFFSET < next_flag - lane_id)
- output = reduction_op(output, temp);
- }
-
- return output;
- }
-};
-
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
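The inline PTX above is, in essence, the classic shuffle-down tree reduction. A simplified equivalent for one full 32-thread warp, written with the sm_30+ __shfl_down intrinsic (a sketch only; the sub-warp segmentation encoded in SHFL_C and the partial-warp guards are omitted, and the helper name is illustrative):

__device__ __forceinline__ int WarpSumSketch(int value)
{
    #pragma unroll
    for (int offset = 16; offset > 0; offset >>= 1)   // 5 steps for a 32-lane warp
        value += __shfl_down(value, offset);          // pull the addend from lane_id + offset
    return value;                                     // lane 0 ends up with the warp total
}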
diff --git a/lib/kokkos/TPL/cub/warp/specializations/warp_reduce_smem.cuh b/lib/kokkos/TPL/cub/warp/specializations/warp_reduce_smem.cuh
deleted file mode 100755
index a32d5fdd7..000000000
--- a/lib/kokkos/TPL/cub/warp/specializations/warp_reduce_smem.cuh
+++ /dev/null
@@ -1,291 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * cub::WarpReduceSmem provides smem-based variants of parallel reduction across CUDA warps.
- */
-
-#pragma once
-
-#include "../../thread/thread_operators.cuh"
-#include "../../thread/thread_load.cuh"
-#include "../../thread/thread_store.cuh"
-#include "../../util_type.cuh"
-#include "../../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-/**
- * \brief WarpReduceSmem provides smem-based variants of parallel reduction across CUDA warps.
- */
-template <
- typename T, ///< Data type being reduced
- int LOGICAL_WARPS, ///< Number of logical warps entrant
- int LOGICAL_WARP_THREADS> ///< Number of threads per logical warp
-struct WarpReduceSmem
-{
- /******************************************************************************
- * Constants and typedefs
- ******************************************************************************/
-
- enum
- {
- /// Whether the logical warp size is a power-of-two
- POW_OF_TWO = ((LOGICAL_WARP_THREADS & (LOGICAL_WARP_THREADS - 1)) == 0),
-
- /// The number of warp scan steps
- STEPS = Log2<LOGICAL_WARP_THREADS>::VALUE,
-
- /// The number of threads in half a warp
- HALF_WARP_THREADS = 1 << (STEPS - 1),
-
- /// The number of shared memory elements per warp
- WARP_SMEM_ELEMENTS = LOGICAL_WARP_THREADS + HALF_WARP_THREADS,
- };
-
- /// Shared memory flag type
- typedef unsigned char SmemFlag;
-
- /// Shared memory storage layout type (1.5 warps-worth of elements for each warp)
- typedef T _TempStorage[LOGICAL_WARPS][WARP_SMEM_ELEMENTS];
-
- // Alias wrapper allowing storage to be unioned
- struct TempStorage : Uninitialized<_TempStorage> {};
-
-
- /******************************************************************************
- * Thread fields
- ******************************************************************************/
-
- _TempStorage &temp_storage;
- int warp_id;
- int lane_id;
-
-
- /******************************************************************************
- * Construction
- ******************************************************************************/
-
- /// Constructor
- __device__ __forceinline__ WarpReduceSmem(
- TempStorage &temp_storage,
- int warp_id,
- int lane_id)
- :
- temp_storage(temp_storage.Alias()),
- warp_id(warp_id),
- lane_id(lane_id)
- {}
-
-
- /******************************************************************************
- * Operation
- ******************************************************************************/
-
- /**
- * Reduction
- */
- template <
- bool FULL_WARPS, ///< Whether all lanes in each warp are contributing a valid fold of items
- int FOLDED_ITEMS_PER_LANE, ///< Number of items folded into each lane
- typename ReductionOp>
- __device__ __forceinline__ T Reduce(
- T input, ///< [in] Calling thread's input
- int folded_items_per_warp, ///< [in] Total number of valid items folded into each logical warp
- ReductionOp reduction_op) ///< [in] Reduction operator
- {
- for (int STEP = 0; STEP < STEPS; STEP++)
- {
- const int OFFSET = 1 << STEP;
-
- // Share input through buffer
- ThreadStore<STORE_VOLATILE>(&temp_storage[warp_id][lane_id], input);
-
- // Update input if peer_addend is in range
- if ((FULL_WARPS && POW_OF_TWO) || ((lane_id + OFFSET) * FOLDED_ITEMS_PER_LANE < folded_items_per_warp))
- {
- T peer_addend = ThreadLoad<LOAD_VOLATILE>(&temp_storage[warp_id][lane_id + OFFSET]);
- input = reduction_op(input, peer_addend);
- }
- }
-
- return input;
- }
-
-
- /**
- * Segmented reduction
- */
- template <
- bool HEAD_SEGMENTED, ///< Whether flags indicate a segment-head or a segment-tail
- typename Flag,
- typename ReductionOp>
- __device__ __forceinline__ T SegmentedReduce(
- T input, ///< [in] Calling thread's input
- Flag flag, ///< [in] Whether or not the current lane is a segment head/tail
- ReductionOp reduction_op) ///< [in] Reduction operator
- {
- #if CUB_PTX_ARCH >= 200
-
- // Ballot-based segmented reduce
-
- // Get the start flags for each thread in the warp.
- int warp_flags = __ballot(flag);
-
- if (!HEAD_SEGMENTED)
- warp_flags <<= 1;
-
- // Keep bits above the current thread.
- warp_flags &= LaneMaskGt();
-
- // Accommodate packing of multiple logical warps in a single physical warp
- if ((LOGICAL_WARPS > 1) && (LOGICAL_WARP_THREADS < 32))
- warp_flags >>= (warp_id * LOGICAL_WARP_THREADS);
-
- // Find next flag
- int next_flag = __clz(__brev(warp_flags));
-
- // Clip the next segment at the warp boundary if necessary
- if (LOGICAL_WARP_THREADS != 32)
- next_flag = CUB_MIN(next_flag, LOGICAL_WARP_THREADS);
-
- for (int STEP = 0; STEP < STEPS; STEP++)
- {
- const int OFFSET = 1 << STEP;
-
- // Share input into buffer
- ThreadStore<STORE_VOLATILE>(&temp_storage[warp_id][lane_id], input);
-
- // Update input if peer_addend is in range
- if (OFFSET < next_flag - lane_id)
- {
- T peer_addend = ThreadLoad<LOAD_VOLATILE>(&temp_storage[warp_id][lane_id + OFFSET]);
- input = reduction_op(input, peer_addend);
- }
- }
-
- return input;
-
- #else
-
- // Smem-based segmented reduce
-
- enum
- {
- UNSET = 0x0, // Is initially unset
- SET = 0x1, // Is initially set
- SEEN = 0x2, // Has seen another head flag from a successor peer
- };
-
- // Alias flags onto shared data storage
- volatile SmemFlag *flag_storage = reinterpret_cast<SmemFlag*>(temp_storage[warp_id]);
-
- SmemFlag flag_status = (flag) ? SET : UNSET;
-
- for (int STEP = 0; STEP < STEPS; STEP++)
- {
- const int OFFSET = 1 << STEP;
-
- // Share input through buffer
- ThreadStore<STORE_VOLATILE>(&temp_storage[warp_id][lane_id], input);
-
- // Get peer from buffer
- T peer_addend = ThreadLoad<LOAD_VOLATILE>(&temp_storage[warp_id][lane_id + OFFSET]);
-
- // Share flag through buffer
- flag_storage[lane_id] = flag_status;
-
- // Get peer flag from buffer
- SmemFlag peer_flag_status = flag_storage[lane_id + OFFSET];
-
- // Update input if peer was in range
- if (lane_id < LOGICAL_WARP_THREADS - OFFSET)
- {
- if (HEAD_SEGMENTED)
- {
- // Head-segmented
- if ((flag_status & SEEN) == 0)
- {
- // Has not seen a more distant head flag
- if (peer_flag_status & SET)
- {
- // Has now seen a head flag
- flag_status |= SEEN;
- }
- else
- {
- // Peer is not a head flag: grab its count
- input = reduction_op(input, peer_addend);
- }
-
- // Update seen status to include that of peer
- flag_status |= (peer_flag_status & SEEN);
- }
- }
- else
- {
- // Tail-segmented. Simply propagate flag status
- if (!flag_status)
- {
- input = reduction_op(input, peer_addend);
- flag_status |= peer_flag_status;
- }
-
- }
- }
- }
-
- return input;
-
- #endif
- }
-
-
- /**
- * Summation
- */
- template <
- bool FULL_WARPS, ///< Whether all lanes in each warp are contributing a valid fold of items
- int FOLDED_ITEMS_PER_LANE> ///< Number of items folded into each lane
- __device__ __forceinline__ T Sum(
- T input, ///< [in] Calling thread's input
- int folded_items_per_warp) ///< [in] Total number of valid items folded into each logical warp
- {
- return Reduce<FULL_WARPS, FOLDED_ITEMS_PER_LANE>(input, folded_items_per_warp, cub::Sum());
- }
-
-};
-
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
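The shared-memory strategy above boils down to this for one full 32-thread warp: each lane publishes its running partial into a volatile buffer, then adds the value OFFSET lanes away whenever that peer is in range. A sketch assuming a per-warp buffer of at least 32 ints and the pre-Volta warp-synchronous execution this code relies on (helper name is illustrative):

__device__ __forceinline__ int WarpSumSmemSketch(volatile int *buf, int lane, int value)
{
    for (int offset = 1; offset < 32; offset <<= 1)
    {
        buf[lane] = value;                 // share the running partial
        if (lane + offset < 32)
            value += buf[lane + offset];   // fold in the peer's partial if it exists
    }
    return value;                          // lane 0 ends up with the warp total
}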
diff --git a/lib/kokkos/TPL/cub/warp/specializations/warp_scan_shfl.cuh b/lib/kokkos/TPL/cub/warp/specializations/warp_scan_shfl.cuh
deleted file mode 100755
index 5585396ce..000000000
--- a/lib/kokkos/TPL/cub/warp/specializations/warp_scan_shfl.cuh
+++ /dev/null
@@ -1,371 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * cub::WarpScanShfl provides SHFL-based variants of parallel prefix scan across CUDA warps.
- */
-
-#pragma once
-
-#include "../../thread/thread_operators.cuh"
-#include "../../util_type.cuh"
-#include "../../util_ptx.cuh"
-#include "../../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-/**
- * \brief WarpScanShfl provides SHFL-based variants of parallel prefix scan across CUDA warps.
- */
-template <
- typename T, ///< Data type being scanned
- int LOGICAL_WARPS, ///< Number of logical warps entrant
- int LOGICAL_WARP_THREADS> ///< Number of threads per logical warp
-struct WarpScanShfl
-{
-
- /******************************************************************************
- * Constants and typedefs
- ******************************************************************************/
-
- enum
- {
- /// The number of warp scan steps
- STEPS = Log2<LOGICAL_WARP_THREADS>::VALUE,
-
- // The 5-bit SHFL mask for logically splitting warps into sub-segments starts 8 bits up
- SHFL_C = ((-1 << STEPS) & 31) << 8,
- };
-
- /// Shared memory storage layout type
- typedef NullType TempStorage;
-
-
- /******************************************************************************
- * Thread fields
- ******************************************************************************/
-
- int warp_id;
- int lane_id;
-
- /******************************************************************************
- * Construction
- ******************************************************************************/
-
- /// Constructor
- __device__ __forceinline__ WarpScanShfl(
- TempStorage &temp_storage,
- int warp_id,
- int lane_id)
- :
- warp_id(warp_id),
- lane_id(lane_id)
- {}
-
-
- /******************************************************************************
- * Operation
- ******************************************************************************/
-
- /// Broadcast
- __device__ __forceinline__ T Broadcast(
- T input, ///< [in] The value to broadcast
- int src_lane) ///< [in] Which warp lane is to do the broadcasting
- {
- typedef typename WordAlignment<T>::ShuffleWord ShuffleWord;
-
- const int WORDS = (sizeof(T) + sizeof(ShuffleWord) - 1) / sizeof(ShuffleWord);
- T output;
- ShuffleWord *output_alias = reinterpret_cast<ShuffleWord *>(&output);
- ShuffleWord *input_alias = reinterpret_cast<ShuffleWord *>(&input);
-
- #pragma unroll
- for (int WORD = 0; WORD < WORDS; ++WORD)
- {
- unsigned int shuffle_word = input_alias[WORD];
- asm("shfl.idx.b32 %0, %1, %2, %3;"
- : "=r"(shuffle_word) : "r"(shuffle_word), "r"(src_lane), "r"(LOGICAL_WARP_THREADS - 1));
- output_alias[WORD] = (ShuffleWord) shuffle_word;
- }
-
- return output;
- }
-
-
- //---------------------------------------------------------------------
- // Inclusive operations
- //---------------------------------------------------------------------
-
- /// Inclusive prefix sum with aggregate (single-SHFL)
- __device__ __forceinline__ void InclusiveSum(
- T input, ///< [in] Calling thread's input item.
- T &output, ///< [out] Calling thread's output item. May be aliased with \p input.
- T &warp_aggregate, ///< [out] Warp-wide aggregate reduction of input items.
- Int2Type<true> single_shfl)
- {
- unsigned int temp = reinterpret_cast<unsigned int &>(input);
-
- // Iterate scan steps
- #pragma unroll
- for (int STEP = 0; STEP < STEPS; STEP++)
- {
- // Use predicate set from SHFL to guard against invalid peers
- asm(
- "{"
- " .reg .u32 r0;"
- " .reg .pred p;"
- " shfl.up.b32 r0|p, %1, %2, %3;"
- " @p add.u32 r0, r0, %4;"
- " mov.u32 %0, r0;"
- "}"
- : "=r"(temp) : "r"(temp), "r"(1 << STEP), "r"(SHFL_C), "r"(temp));
- }
-
- output = temp;
-
- // Grab aggregate from last warp lane
- warp_aggregate = Broadcast(output, LOGICAL_WARP_THREADS - 1);
- }
-
-
- /// Inclusive prefix sum with aggregate (multi-SHFL)
- __device__ __forceinline__ void InclusiveSum(
- T input, ///< [in] Calling thread's input item.
- T &output, ///< [out] Calling thread's output item. May be aliased with \p input.
- T &warp_aggregate, ///< [out] Warp-wide aggregate reduction of input items.
- Int2Type<false> single_shfl) ///< [in] Marker type indicating whether only one SHFL instruction is required
- {
- // Delegate to generic scan
- InclusiveScan(input, output, Sum(), warp_aggregate);
- }
-
-
- /// Inclusive prefix sum with aggregate (specialized for float)
- __device__ __forceinline__ void InclusiveSum(
- float input, ///< [in] Calling thread's input item.
- float &output, ///< [out] Calling thread's output item. May be aliased with \p input.
- float &warp_aggregate) ///< [out] Warp-wide aggregate reduction of input items.
- {
- output = input;
-
- // Iterate scan steps
- #pragma unroll
- for (int STEP = 0; STEP < STEPS; STEP++)
- {
- // Use predicate set from SHFL to guard against invalid peers
- asm(
- "{"
- " .reg .f32 r0;"
- " .reg .pred p;"
- " shfl.up.b32 r0|p, %1, %2, %3;"
- " @p add.f32 r0, r0, %4;"
- " mov.f32 %0, r0;"
- "}"
- : "=f"(output) : "f"(output), "r"(1 << STEP), "r"(SHFL_C), "f"(output));
- }
-
- // Grab aggregate from last warp lane
- warp_aggregate = Broadcast(output, LOGICAL_WARP_THREADS - 1);
- }
-
-
- /// Inclusive prefix sum with aggregate (specialized for unsigned long long)
- __device__ __forceinline__ void InclusiveSum(
- unsigned long long input, ///< [in] Calling thread's input item.
- unsigned long long &output, ///< [out] Calling thread's output item. May be aliased with \p input.
- unsigned long long &warp_aggregate) ///< [out] Warp-wide aggregate reduction of input items.
- {
- output = input;
-
- // Iterate scan steps
- #pragma unroll
- for (int STEP = 0; STEP < STEPS; STEP++)
- {
- // Use predicate set from SHFL to guard against invalid peers
- asm(
- "{"
- " .reg .u32 r0;"
- " .reg .u32 r1;"
- " .reg .u32 lo;"
- " .reg .u32 hi;"
- " .reg .pred p;"
- " mov.b64 {lo, hi}, %1;"
- " shfl.up.b32 r0|p, lo, %2, %3;"
- " shfl.up.b32 r1|p, hi, %2, %3;"
- " @p add.cc.u32 r0, r0, lo;"
- " @p addc.u32 r1, r1, hi;"
- " mov.b64 %0, {r0, r1};"
- "}"
- : "=l"(output) : "l"(output), "r"(1 << STEP), "r"(SHFL_C));
- }
-
- // Grab aggregate from last warp lane
- warp_aggregate = Broadcast(output, LOGICAL_WARP_THREADS - 1);
- }
-
-
- /// Inclusive prefix sum with aggregate (generic)
- template <typename _T>
- __device__ __forceinline__ void InclusiveSum(
- _T input, ///< [in] Calling thread's input item.
- _T &output, ///< [out] Calling thread's output item. May be aliased with \p input.
- _T &warp_aggregate) ///< [out] Warp-wide aggregate reduction of input items.
- {
- // Whether sharing can be done with a single SHFL instruction (vs multiple SHFL instructions)
- Int2Type<(Traits<_T>::PRIMITIVE) && (sizeof(_T) <= sizeof(unsigned int))> single_shfl;
-
- InclusiveSum(input, output, warp_aggregate, single_shfl);
- }
-
-
- /// Inclusive prefix sum
- __device__ __forceinline__ void InclusiveSum(
- T input, ///< [in] Calling thread's input item.
- T &output) ///< [out] Calling thread's output item. May be aliased with \p input.
- {
- T warp_aggregate;
- InclusiveSum(input, output, warp_aggregate);
- }
-
-
- /// Inclusive scan with aggregate
- template <typename ScanOp>
- __device__ __forceinline__ void InclusiveScan(
- T input, ///< [in] Calling thread's input item.
- T &output, ///< [out] Calling thread's output item. May be aliased with \p input.
- ScanOp scan_op, ///< [in] Binary scan operator
- T &warp_aggregate) ///< [out] Warp-wide aggregate reduction of input items.
- {
- output = input;
-
- // Iterate scan steps
- #pragma unroll
- for (int STEP = 0; STEP < STEPS; STEP++)
- {
- // Grab addend from peer
- const int OFFSET = 1 << STEP;
- T temp = ShuffleUp(output, OFFSET);
-
- // Perform scan op if from a valid peer
- if (lane_id >= OFFSET)
- output = scan_op(temp, output);
- }
-
- // Grab aggregate from last warp lane
- warp_aggregate = Broadcast(output, LOGICAL_WARP_THREADS - 1);
- }
-
-
- /// Inclusive scan
- template <typename ScanOp>
- __device__ __forceinline__ void InclusiveScan(
- T input, ///< [in] Calling thread's input item.
- T &output, ///< [out] Calling thread's output item. May be aliased with \p input.
- ScanOp scan_op) ///< [in] Binary scan operator
- {
- T warp_aggregate;
- InclusiveScan(input, output, scan_op, warp_aggregate);
- }
-
-
- //---------------------------------------------------------------------
- // Exclusive operations
- //---------------------------------------------------------------------
-
- /// Exclusive scan with aggregate
- template <typename ScanOp>
- __device__ __forceinline__ void ExclusiveScan(
- T input, ///< [in] Calling thread's input item.
- T &output, ///< [out] Calling thread's output item. May be aliased with \p input.
- T identity, ///< [in] Identity value
- ScanOp scan_op, ///< [in] Binary scan operator
- T &warp_aggregate) ///< [out] Warp-wide aggregate reduction of input items.
- {
- // Compute inclusive scan
- T inclusive;
- InclusiveScan(input, inclusive, scan_op, warp_aggregate);
-
- // Grab result from predecessor
- T exclusive = ShuffleUp(inclusive, 1);
-
- output = (lane_id == 0) ?
- identity :
- exclusive;
- }
-
-
- /// Exclusive scan
- template <typename ScanOp>
- __device__ __forceinline__ void ExclusiveScan(
- T input, ///< [in] Calling thread's input item.
- T &output, ///< [out] Calling thread's output item. May be aliased with \p input.
- T identity, ///< [in] Identity value
- ScanOp scan_op) ///< [in] Binary scan operator
- {
- T warp_aggregate;
- ExclusiveScan(input, output, identity, scan_op, warp_aggregate);
- }
-
-
- /// Exclusive scan with aggregate, without identity
- template <typename ScanOp>
- __device__ __forceinline__ void ExclusiveScan(
- T input, ///< [in] Calling thread's input item.
- T &output, ///< [out] Calling thread's output item. May be aliased with \p input.
- ScanOp scan_op, ///< [in] Binary scan operator
- T &warp_aggregate) ///< [out] Warp-wide aggregate reduction of input items.
- {
- // Compute inclusive scan
- T inclusive;
- InclusiveScan(input, inclusive, scan_op, warp_aggregate);
-
- // Grab result from predecessor
- output = ShuffleUp(inclusive, 1);
- }
-
-
- /// Exclusive scan without identity
- template <typename ScanOp>
- __device__ __forceinline__ void ExclusiveScan(
- T input, ///< [in] Calling thread's input item.
- T &output, ///< [out] Calling thread's output item. May be aliased with \p input.
- ScanOp scan_op) ///< [in] Binary scan operator
- {
- T warp_aggregate;
- ExclusiveScan(input, output, scan_op, warp_aggregate);
- }
-};
-
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
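The single-SHFL InclusiveSum above is the shuffle-up form of a Hillis-Steele scan. A simplified equivalent for one full 32-thread warp, using the sm_30+ __shfl_up intrinsic (a sketch only; the logical-warp sub-segmentation carried in SHFL_C is omitted, and the helper name is illustrative):

__device__ __forceinline__ int InclusiveWarpSumSketch(int value)
{
    int lane = threadIdx.x & 31;
    #pragma unroll
    for (int offset = 1; offset < 32; offset <<= 1)
    {
        int addend = __shfl_up(value, offset);   // partial from lane_id - offset
        if (lane >= offset)                      // only lanes with a valid predecessor accumulate
            value += addend;
    }
    return value;                                // lane i holds x_0 + ... + x_i
}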
diff --git a/lib/kokkos/TPL/cub/warp/specializations/warp_scan_smem.cuh b/lib/kokkos/TPL/cub/warp/specializations/warp_scan_smem.cuh
deleted file mode 100755
index 513b35cef..000000000
--- a/lib/kokkos/TPL/cub/warp/specializations/warp_scan_smem.cuh
+++ /dev/null
@@ -1,327 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * cub::WarpScanSmem provides smem-based variants of parallel prefix scan across CUDA warps.
- */
-
-#pragma once
-
-#include "../../thread/thread_operators.cuh"
-#include "../../thread/thread_load.cuh"
-#include "../../thread/thread_store.cuh"
-#include "../../util_type.cuh"
-#include "../../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-/**
- * \brief WarpScanSmem provides smem-based variants of parallel prefix scan across CUDA warps.
- */
-template <
- typename T, ///< Data type being scanned
- int LOGICAL_WARPS, ///< Number of logical warps entrant
- int LOGICAL_WARP_THREADS> ///< Number of threads per logical warp
-struct WarpScanSmem
-{
- /******************************************************************************
- * Constants and typedefs
- ******************************************************************************/
-
- enum
- {
- /// The number of warp scan steps
- STEPS = Log2<LOGICAL_WARP_THREADS>::VALUE,
-
- /// The number of threads in half a warp
- HALF_WARP_THREADS = 1 << (STEPS - 1),
-
- /// The number of shared memory elements per warp
- WARP_SMEM_ELEMENTS = LOGICAL_WARP_THREADS + HALF_WARP_THREADS,
- };
-
-
- /// Shared memory storage layout type (1.5 warps-worth of elements for each warp)
- typedef T _TempStorage[LOGICAL_WARPS][WARP_SMEM_ELEMENTS];
-
- // Alias wrapper allowing storage to be unioned
- struct TempStorage : Uninitialized<_TempStorage> {};
-
-
- /******************************************************************************
- * Thread fields
- ******************************************************************************/
-
- _TempStorage &temp_storage;
- unsigned int warp_id;
- unsigned int lane_id;
-
-
- /******************************************************************************
- * Construction
- ******************************************************************************/
-
- /// Constructor
- __device__ __forceinline__ WarpScanSmem(
- TempStorage &temp_storage,
- int warp_id,
- int lane_id)
- :
- temp_storage(temp_storage.Alias()),
- warp_id(warp_id),
- lane_id(lane_id)
- {}
-
-
- /******************************************************************************
- * Operation
- ******************************************************************************/
-
- /// Initialize identity padding (specialized for operations that have identity)
- __device__ __forceinline__ void InitIdentity(Int2Type<true> has_identity)
- {
- T identity = T();
- ThreadStore<STORE_VOLATILE>(&temp_storage[warp_id][lane_id], identity);
- }
-
-
- /// Initialize identity padding (specialized for operations without identity)
- __device__ __forceinline__ void InitIdentity(Int2Type<false> has_identity)
- {}
-
-
- /// Basic inclusive scan iteration (template unrolled, base-case specialization)
- template <
- bool HAS_IDENTITY,
- typename ScanOp>
- __device__ __forceinline__ void ScanStep(
- T &partial,
- ScanOp scan_op,
- Int2Type<STEPS> step)
- {}
-
-
- /// Basic inclusive scan iteration (template unrolled, inductive-case specialization)
- template <
- bool HAS_IDENTITY,
- int STEP,
- typename ScanOp>
- __device__ __forceinline__ void ScanStep(
- T &partial,
- ScanOp scan_op,
- Int2Type<STEP> step)
- {
- const int OFFSET = 1 << STEP;
-
- // Share partial into buffer
- ThreadStore<STORE_VOLATILE>(&temp_storage[warp_id][HALF_WARP_THREADS + lane_id], partial);
-
- // Update partial if addend is in range
- if (HAS_IDENTITY || (lane_id >= OFFSET))
- {
- T addend = ThreadLoad<LOAD_VOLATILE>(&temp_storage[warp_id][HALF_WARP_THREADS + lane_id - OFFSET]);
- partial = scan_op(addend, partial);
- }
-
- ScanStep<HAS_IDENTITY>(partial, scan_op, Int2Type<STEP + 1>());
- }
-
-
- /// Broadcast
- __device__ __forceinline__ T Broadcast(
- T input, ///< [in] The value to broadcast
- unsigned int src_lane) ///< [in] Which warp lane is to do the broadcasting
- {
- if (lane_id == src_lane)
- {
- ThreadStore<STORE_VOLATILE>(temp_storage[warp_id], input);
- }
-
- return ThreadLoad<LOAD_VOLATILE>(temp_storage[warp_id]);
- }
-
-
- /// Basic inclusive scan
- template <
- bool HAS_IDENTITY,
- bool SHARE_FINAL,
- typename ScanOp>
- __device__ __forceinline__ T BasicScan(
- T partial, ///< Calling thread's input partial reduction
- ScanOp scan_op) ///< Binary associative scan functor
- {
- // Iterate scan steps
- ScanStep<HAS_IDENTITY>(partial, scan_op, Int2Type<0>());
-
- if (SHARE_FINAL)
- {
- // Share partial into buffer
- ThreadStore<STORE_VOLATILE>(&temp_storage[warp_id][HALF_WARP_THREADS + lane_id], partial);
- }
-
- return partial;
- }
-
-
- /// Inclusive prefix sum
- __device__ __forceinline__ void InclusiveSum(
- T input, ///< [in] Calling thread's input item.
- T &output) ///< [out] Calling thread's output item. May be aliased with \p input.
- {
- const bool HAS_IDENTITY = Traits<T>::PRIMITIVE;
-
- // Initialize identity region
- InitIdentity(Int2Type<HAS_IDENTITY>());
-
- // Compute inclusive warp scan (has identity, don't share final)
- output = BasicScan<HAS_IDENTITY, false>(input, Sum());
- }
-
-
- /// Inclusive prefix sum with aggregate
- __device__ __forceinline__ void InclusiveSum(
- T input, ///< [in] Calling thread's input item.
- T &output, ///< [out] Calling thread's output item. May be aliased with \p input.
- T &warp_aggregate) ///< [out] Warp-wide aggregate reduction of input items.
- {
- const bool HAS_IDENTITY = Traits<T>::PRIMITIVE;
-
- // Initialize identity region
- InitIdentity(Int2Type<HAS_IDENTITY>());
-
- // Compute inclusive warp scan (has identity, share final)
- output = BasicScan<HAS_IDENTITY, true>(input, Sum());
-
- // Retrieve aggregate in <em>warp-lane</em><sub>0</sub>
- warp_aggregate = ThreadLoad<LOAD_VOLATILE>(&temp_storage[warp_id][WARP_SMEM_ELEMENTS - 1]);
- }
-
-
- /// Inclusive scan
- template <typename ScanOp>
- __device__ __forceinline__ void InclusiveScan(
- T input, ///< [in] Calling thread's input item.
- T &output, ///< [out] Calling thread's output item. May be aliased with \p input.
- ScanOp scan_op) ///< [in] Binary scan operator
- {
- // Compute inclusive warp scan (no identity, don't share final)
- output = BasicScan<false, false>(input, scan_op);
- }
-
-
- /// Inclusive scan with aggregate
- template <typename ScanOp>
- __device__ __forceinline__ void InclusiveScan(
- T input, ///< [in] Calling thread's input item.
- T &output, ///< [out] Calling thread's output item. May be aliased with \p input.
- ScanOp scan_op, ///< [in] Binary scan operator
- T &warp_aggregate) ///< [out] Warp-wide aggregate reduction of input items.
- {
- // Compute inclusive warp scan (no identity, share final)
- output = BasicScan<false, true>(input, scan_op);
-
- // Retrieve aggregate
- warp_aggregate = ThreadLoad<LOAD_VOLATILE>(&temp_storage[warp_id][WARP_SMEM_ELEMENTS - 1]);
- }
-
- /// Exclusive scan
- template <typename ScanOp>
- __device__ __forceinline__ void ExclusiveScan(
- T input, ///< [in] Calling thread's input item.
- T &output, ///< [out] Calling thread's output item. May be aliased with \p input.
- T identity, ///< [in] Identity value
- ScanOp scan_op) ///< [in] Binary scan operator
- {
- // Initialize identity region
- ThreadStore<STORE_VOLATILE>(&temp_storage[warp_id][lane_id], identity);
-
- // Compute inclusive warp scan (identity, share final)
- T inclusive = BasicScan<true, true>(input, scan_op);
-
- // Retrieve exclusive scan
- output = ThreadLoad<LOAD_VOLATILE>(&temp_storage[warp_id][HALF_WARP_THREADS + lane_id - 1]);
- }
-
-
- /// Exclusive scan with aggregate
- template <typename ScanOp>
- __device__ __forceinline__ void ExclusiveScan(
- T input, ///< [in] Calling thread's input item.
- T &output, ///< [out] Calling thread's output item. May be aliased with \p input.
- T identity, ///< [in] Identity value
- ScanOp scan_op, ///< [in] Binary scan operator
- T &warp_aggregate) ///< [out] Warp-wide aggregate reduction of input items.
- {
- // Exclusive warp scan (which does share final)
- ExclusiveScan(input, output, identity, scan_op);
-
- // Retrieve aggregate
- warp_aggregate = ThreadLoad<LOAD_VOLATILE>(&temp_storage[warp_id][WARP_SMEM_ELEMENTS - 1]);
- }
-
-
- /// Exclusive scan without identity
- template <typename ScanOp>
- __device__ __forceinline__ void ExclusiveScan(
- T input, ///< [in] Calling thread's input item.
- T &output, ///< [out] Calling thread's output item. May be aliased with \p input.
- ScanOp scan_op) ///< [in] Binary scan operator
- {
- // Compute inclusive warp scan (no identity, share final)
- T inclusive = BasicScan<false, true>(input, scan_op);
-
- // Retrieve exclusive scan
- output = ThreadLoad<LOAD_VOLATILE>(&temp_storage[warp_id][HALF_WARP_THREADS + lane_id - 1]);
- }
-
-
- /// Exclusive scan with aggregate, without identity
- template <typename ScanOp>
- __device__ __forceinline__ void ExclusiveScan(
- T input, ///< [in] Calling thread's input item.
- T &output, ///< [out] Calling thread's output item. May be aliased with \p input.
- ScanOp scan_op, ///< [in] Binary scan operator
- T &warp_aggregate) ///< [out] Warp-wide aggregate reduction of input items.
- {
- // Exclusive warp scan (which does share final)
- ExclusiveScan(input, output, scan_op);
-
- // Retrieve aggregate
- warp_aggregate = ThreadLoad<LOAD_VOLATILE>(&temp_storage[warp_id][WARP_SMEM_ELEMENTS - 1]);
- }
-
-};
-
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
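The "identity padding" above exists so that the unguarded ThreadLoad at (lane - OFFSET) always lands on a valid slot: the first HALF_WARP_THREADS entries hold the identity and the partials live in the upper entries. A sketch of that layout for a 32-thread warp sum, assuming a 48-slot volatile buffer per warp and the pre-Volta warp-synchronous execution this code relies on (helper name is illustrative):

__device__ __forceinline__ int InclusiveWarpSumSmemSketch(volatile int *buf, int lane, int value)
{
    buf[lane] = 0;                            // write identity; slots 0..15 stay zero as padding
    for (int offset = 1; offset < 32; offset <<= 1)
    {
        buf[16 + lane] = value;               // partials occupy slots 16..47
        value += buf[16 + lane - offset];     // padded read never goes out of range
    }
    return value;                             // inclusive prefix sum at every lane
}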
diff --git a/lib/kokkos/TPL/cub/warp/warp_reduce.cuh b/lib/kokkos/TPL/cub/warp/warp_reduce.cuh
deleted file mode 100755
index 548369da1..000000000
--- a/lib/kokkos/TPL/cub/warp/warp_reduce.cuh
+++ /dev/null
@@ -1,677 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * The cub::WarpReduce class provides [<em>collective</em>](index.html#sec0) methods for computing a parallel reduction of items partitioned across CUDA warp threads.
- */
-
-#pragma once
-
-#include "specializations/warp_reduce_shfl.cuh"
-#include "specializations/warp_reduce_smem.cuh"
-#include "../thread/thread_operators.cuh"
-#include "../util_arch.cuh"
-#include "../util_type.cuh"
-#include "../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-
-/**
- * \addtogroup WarpModule
- * @{
- */
-
-/**
- * \brief The WarpReduce class provides [<em>collective</em>](index.html#sec0) methods for computing a parallel reduction of items partitioned across CUDA warp threads. ![](warp_reduce_logo.png)
- *
- * \par Overview
- * A <a href="http://en.wikipedia.org/wiki/Reduce_(higher-order_function)"><em>reduction</em></a> (or <em>fold</em>)
- * uses a binary combining operator to compute a single aggregate from a list of input elements.
- *
- * \tparam T The reduction input/output element type
- * \tparam LOGICAL_WARPS <b>[optional]</b> The number of entrant "logical" warps performing concurrent warp reductions. Default is 1.
- * \tparam LOGICAL_WARP_THREADS <b>[optional]</b> The number of threads per "logical" warp (may be less than the number of hardware warp threads). Default is the warp size of the targeted CUDA compute-capability (e.g., 32 threads for SM20).
- *
- * \par Simple Examples
- * \warpcollective{WarpReduce}
- * \par
- * The code snippet below illustrates four concurrent warp sum reductions within a block of
- * 128 threads (one per each of the 32-thread warps).
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize WarpReduce for 4 warps on type int
- * typedef cub::WarpReduce<int, 4> WarpReduce;
- *
- * // Allocate shared memory for WarpReduce
- * __shared__ typename WarpReduce::TempStorage temp_storage;
- *
- * // Obtain one input item per thread
- * int thread_data = ...
- *
- * // Return the warp-wide sums to each lane0 (threads 0, 32, 64, and 96)
- * int aggregate = WarpReduce(temp_storage).Sum(thread_data);
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is <tt>0, 1, 2, 3, ..., 127</tt>.
- * The corresponding output \p aggregate in threads 0, 32, 64, and 96 will be \p 496, \p 1520,
- * \p 2544, and \p 3568, respectively (and is undefined in other threads).
- *
- * \par
- * The code snippet below illustrates a single warp sum reduction within a block of
- * 128 threads.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize WarpReduce for one warp on type int
- * typedef cub::WarpReduce<int, 1> WarpReduce;
- *
- * // Allocate shared memory for WarpReduce
- * __shared__ typename WarpReduce::TempStorage temp_storage;
- * ...
- *
- * // Only the first warp performs a reduction
- * if (threadIdx.x < 32)
- * {
- * // Obtain one input item per thread
- * int thread_data = ...
- *
- * // Return the warp-wide sum to lane0
- * int aggregate = WarpReduce(temp_storage).Sum(thread_data);
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the warp of threads is <tt>0, 1, 2, 3, ..., 31</tt>.
- * The corresponding output \p aggregate in thread0 will be \p 496 (and is undefined in other threads).
- *
- * \par Usage and Performance Considerations
- * - Supports "logical" warps smaller than the physical warp size (e.g., logical warps of 8 threads; see the editorial sketch following this comment block)
- * - The number of entrant threads must be a multiple of \p LOGICAL_WARP_THREADS
- * - Warp reductions are concurrent if more than one logical warp is participating
- * - Uses special instructions when applicable (e.g., warp \p SHFL instructions)
- * - Uses synchronization-free communication between warp lanes when applicable
- * - Zero bank conflicts for most types
- * - Computation is slightly more efficient (i.e., having lower instruction overhead) for:
- * - Summation (<b><em>vs.</em></b> generic reduction)
- * - The architecture's warp size is a whole multiple of \p LOGICAL_WARP_THREADS
- *
- */
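-// Editor's sketch (not part of the original CUB source): a minimal, hedged illustration of the
-// sub-physical "logical warp" parameters mentioned in the usage notes above. Assuming a
-// 128-thread block, sixteen 8-thread logical warps could be specialized roughly as follows:
-//
-//     typedef cub::WarpReduce<int, 16, 8> SmallWarpReduce;
-//     __shared__ typename SmallWarpReduce::TempStorage small_temp_storage;
-//     int thread_data = ...;   // one item per thread (elided, as in the examples above)
-//     int sum = SmallWarpReduce(small_temp_storage).Sum(thread_data);
-//     // sum is valid only in lane 0 of each logical warp (threads 0, 8, 16, ...)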
-template <
- typename T,
- int LOGICAL_WARPS = 1,
- int LOGICAL_WARP_THREADS = PtxArchProps::WARP_THREADS>
-class WarpReduce
-{
-private:
-
- /******************************************************************************
- * Constants and typedefs
- ******************************************************************************/
-
- enum
- {
- POW_OF_TWO = ((LOGICAL_WARP_THREADS & (LOGICAL_WARP_THREADS - 1)) == 0),
- };
-
-public:
-
- #ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document
-
- /// Internal specialization. Use SHFL-based reduction if (architecture is >= SM30) and ((only one logical warp) or (LOGICAL_WARP_THREADS is a power-of-two))
- typedef typename If<(CUB_PTX_ARCH >= 300) && ((LOGICAL_WARPS == 1) || POW_OF_TWO),
- WarpReduceShfl<T, LOGICAL_WARPS, LOGICAL_WARP_THREADS>,
- WarpReduceSmem<T, LOGICAL_WARPS, LOGICAL_WARP_THREADS> >::Type InternalWarpReduce;
-
- #endif // DOXYGEN_SHOULD_SKIP_THIS
-
-
-private:
-
- /// Shared memory storage layout type for WarpReduce
- typedef typename InternalWarpReduce::TempStorage _TempStorage;
-
-
- /******************************************************************************
- * Thread fields
- ******************************************************************************/
-
- /// Shared storage reference
- _TempStorage &temp_storage;
-
- /// Warp ID
- int warp_id;
-
- /// Lane ID
- int lane_id;
-
-
- /******************************************************************************
- * Utility methods
- ******************************************************************************/
-
- /// Internal storage allocator
- __device__ __forceinline__ _TempStorage& PrivateStorage()
- {
- __shared__ TempStorage private_storage;
- return private_storage;
- }
-
-
-public:
-
- /// \smemstorage{WarpReduce}
- struct TempStorage : Uninitialized<_TempStorage> {};
-
-
- /******************************************************************//**
- * \name Collective constructors
- *********************************************************************/
- //@{
-
-
- /**
- * \brief Collective constructor for 1D thread blocks using a private static allocation of shared memory as temporary storage. Logical warp and lane identifiers are constructed from <tt>threadIdx.x</tt>.
- *
- */
- __device__ __forceinline__ WarpReduce()
- :
- temp_storage(PrivateStorage()),
- warp_id((LOGICAL_WARPS == 1) ?
- 0 :
- threadIdx.x / LOGICAL_WARP_THREADS),
- lane_id(((LOGICAL_WARPS == 1) || (LOGICAL_WARP_THREADS == PtxArchProps::WARP_THREADS)) ?
- LaneId() :
- threadIdx.x % LOGICAL_WARP_THREADS)
- {}
-
-
- /**
- * \brief Collective constructor for 1D thread blocks using the specified memory allocation as temporary storage. Logical warp and lane identifiers are constructed from <tt>threadIdx.x</tt>.
- */
- __device__ __forceinline__ WarpReduce(
- TempStorage &temp_storage) ///< [in] Reference to memory allocation having layout type TempStorage
- :
- temp_storage(temp_storage.Alias()),
- warp_id((LOGICAL_WARPS == 1) ?
- 0 :
- threadIdx.x / LOGICAL_WARP_THREADS),
- lane_id(((LOGICAL_WARPS == 1) || (LOGICAL_WARP_THREADS == PtxArchProps::WARP_THREADS)) ?
- LaneId() :
- threadIdx.x % LOGICAL_WARP_THREADS)
- {}
-
-
- /**
- * \brief Collective constructor using a private static allocation of shared memory as temporary storage. Threads are identified using the given warp and lane identifiers.
- */
- __device__ __forceinline__ WarpReduce(
- int warp_id, ///< [in] A suitable warp membership identifier
- int lane_id) ///< [in] A lane identifier within the warp
- :
- temp_storage(PrivateStorage()),
- warp_id(warp_id),
- lane_id(lane_id)
- {}
-
-
- /**
- * \brief Collective constructor using the specified memory allocation as temporary storage. Threads are identified using the given warp and lane identifiers.
- */
- __device__ __forceinline__ WarpReduce(
- TempStorage &temp_storage, ///< [in] Reference to memory allocation having layout type TempStorage
- int warp_id, ///< [in] A suitable warp membership identifier
- int lane_id) ///< [in] A lane identifier within the warp
- :
- temp_storage(temp_storage.Alias()),
- warp_id(warp_id),
- lane_id(lane_id)
- {}
-
-
-
- //@} end member group
- /******************************************************************//**
- * \name Summation reductions
- *********************************************************************/
- //@{
-
-
- /**
- * \brief Computes a warp-wide sum in each active warp. The output is valid in warp <em>lane</em><sub>0</sub>.
- *
- * \smemreuse
- *
- * The code snippet below illustrates four concurrent warp sum reductions within a block of
- * 128 threads (one per each of the 32-thread warps).
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize WarpReduce for 4 warps on type int
- * typedef cub::WarpReduce<int, 4> WarpReduce;
- *
- * // Allocate shared memory for WarpReduce
- * __shared__ typename WarpReduce::TempStorage temp_storage;
- *
- * // Obtain one input item per thread
- * int thread_data = ...
- *
- * // Return the warp-wide sums to each lane0
- * int aggregate = WarpReduce(temp_storage).Sum(thread_data);
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is <tt>0, 1, 2, 3, ..., 127</tt>.
- * The corresponding output \p aggregate in threads 0, 32, 64, and 96 will be \p 496, \p 1520,
- * \p 2544, and \p 3568, respectively (and is undefined in other threads).
- *
- */
- __device__ __forceinline__ T Sum(
- T input) ///< [in] Calling thread's input
- {
- return InternalWarpReduce(temp_storage, warp_id, lane_id).Sum<true, 1>(input, LOGICAL_WARP_THREADS);
- }
-
- /**
- * \brief Computes a partially-full warp-wide sum in each active warp. The output is valid in warp <em>lane</em><sub>0</sub>.
- *
- * All threads in each logical warp must agree on the same value for \p valid_items. Otherwise the result is undefined.
- *
- * \smemreuse
- *
- * The code snippet below illustrates a sum reduction within a single, partially-full
- * block of 32 threads (one warp).
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(int *d_data, int valid_items)
- * {
- * // Specialize WarpReduce for a single warp on type int
- * typedef cub::WarpReduce<int, 1> WarpReduce;
- *
- * // Allocate shared memory for WarpReduce
- * __shared__ typename WarpReduce::TempStorage temp_storage;
- *
- * // Obtain one input item per thread if in range
- * int thread_data;
- * if (threadIdx.x < valid_items)
- * thread_data = d_data[threadIdx.x];
- *
- * // Return the warp-wide sums to each lane0
- * int aggregate = WarpReduce(temp_storage).Sum(
- * thread_data, valid_items);
- *
- * \endcode
- * \par
- * Suppose the input \p d_data is <tt>0, 1, 2, 3, 4, ...</tt> and \p valid_items
- * is \p 4. The corresponding output \p aggregate in thread0 is \p 6 (and is
- * undefined in other threads).
- *
- */
- __device__ __forceinline__ T Sum(
- T input, ///< [in] Calling thread's input
- int valid_items) ///< [in] Total number of valid items in the calling thread's logical warp (may be less than \p LOGICAL_WARP_THREADS)
- {
- // Determine if we don't need bounds checking
- if (valid_items >= LOGICAL_WARP_THREADS)
- {
- return InternalWarpReduce(temp_storage, warp_id, lane_id).Sum<true, 1>(input, valid_items);
- }
- else
- {
- return InternalWarpReduce(temp_storage, warp_id, lane_id).Sum<false, 1>(input, valid_items);
- }
- }
-
-
- /**
- * \brief Computes a segmented sum in each active warp where segments are defined by head-flags. The sum of each segment is returned to the first lane in that segment (which always includes <em>lane</em><sub>0</sub>).
- *
- * \smemreuse
- *
- * The code snippet below illustrates a head-segmented warp sum
- * reduction within a block of 32 threads (one warp).
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize WarpReduce for a single warp on type int
- * typedef cub::WarpReduce<int, 1> WarpReduce;
- *
- * // Allocate shared memory for WarpReduce
- * __shared__ typename WarpReduce::TempStorage temp_storage;
- *
- * // Obtain one input item and flag per thread
- * int thread_data = ...
- * int head_flag = ...
- *
- * // Return the warp-wide sums to each lane0
- * int aggregate = WarpReduce(temp_storage).HeadSegmentedSum(
- * thread_data, head_flag);
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data and \p head_flag across the block of threads
- * is <tt>0, 1, 2, 3, ..., 31</tt> and is <tt>1, 0, 0, 0, 1, 0, 0, 0, ..., 1, 0, 0, 0</tt>,
- * respectively. The corresponding output \p aggregate in threads 0, 4, 8, etc. will be
- * \p 6, \p 22, \p 38, etc. (and is undefined in other threads).
- *
- * \tparam Flag <b>[inferred]</b> An integer type used to flag the head of each segment
- *
- */
- template <
- typename Flag>
- __device__ __forceinline__ T HeadSegmentedSum(
- T input, ///< [in] Calling thread's input
- Flag head_flag) ///< [in] Head flag denoting whether or not \p input is the start of a new segment
- {
- return HeadSegmentedReduce(input, head_flag, cub::Sum());
- }
-
-
- /**
- * \brief Computes a segmented sum in each active warp where segments are defined by tail-flags. The sum of each segment is returned to the first lane in that segment (which always includes <em>lane</em><sub>0</sub>).
- *
- * \smemreuse
- *
- * The code snippet below illustrates a tail-segmented warp sum
- * reduction within a block of 32 threads (one warp).
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize WarpReduce for a single warp on type int
- * typedef cub::WarpReduce<int, 1> WarpReduce;
- *
- * // Allocate shared memory for WarpReduce
- * __shared__ typename WarpReduce::TempStorage temp_storage;
- *
- * // Obtain one input item and flag per thread
- * int thread_data = ...
- * int tail_flag = ...
- *
- * // Return the warp-wide sums to each lane0
- * int aggregate = WarpReduce(temp_storage).TailSegmentedSum(
- * thread_data, tail_flag);
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data and \p tail_flag across the block of threads
- * is <tt>0, 1, 2, 3, ..., 31</tt> and is <tt>0, 0, 0, 1, 0, 0, 0, 1, ..., 0, 0, 0, 1</tt>,
- * respectively. The corresponding output \p aggregate in threads 0, 4, 8, etc. will be
- * \p 6, \p 22, \p 38, etc. (and is undefined in other threads).
- *
- * \tparam Flag <b>[inferred]</b> An integer type used to flag the tail of each segment
- */
- template <
- typename Flag>
- __device__ __forceinline__ T TailSegmentedSum(
- T input, ///< [in] Calling thread's input
-        Flag            tail_flag)          ///< [in] Tail flag denoting whether or not \p input is the end of the current segment
- {
- return TailSegmentedReduce(input, tail_flag, cub::Sum());
- }
-
-
-
- //@} end member group
- /******************************************************************//**
- * \name Generic reductions
- *********************************************************************/
- //@{
-
- /**
- * \brief Computes a warp-wide reduction in each active warp using the specified binary reduction functor. The output is valid in warp <em>lane</em><sub>0</sub>.
- *
- * Supports non-commutative reduction operators
- *
- * \smemreuse
- *
- * The code snippet below illustrates four concurrent warp max reductions within a block of
- * 128 threads (one per each of the 32-thread warps).
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize WarpReduce for 4 warps on type int
- * typedef cub::WarpReduce<int, 4> WarpReduce;
- *
- * // Allocate shared memory for WarpReduce
- * __shared__ typename WarpReduce::TempStorage temp_storage;
- *
- * // Obtain one input item per thread
- * int thread_data = ...
- *
- * // Return the warp-wide reductions to each lane0
- * int aggregate = WarpReduce(temp_storage).Reduce(
- * thread_data, cub::Max());
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is <tt>0, 1, 2, 3, ..., 127</tt>.
- * The corresponding output \p aggregate in threads 0, 32, 64, and 96 will be \p 31, \p 63,
- * \p 95, and \p 127, respectively (and is undefined in other threads).
- *
- * \tparam ReductionOp <b>[inferred]</b> Binary reduction operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
- template <typename ReductionOp>
- __device__ __forceinline__ T Reduce(
- T input, ///< [in] Calling thread's input
- ReductionOp reduction_op) ///< [in] Binary reduction operator
- {
- return InternalWarpReduce(temp_storage, warp_id, lane_id).Reduce<true, 1>(input, LOGICAL_WARP_THREADS, reduction_op);
- }
-
- /**
- * \brief Computes a partially-full warp-wide reduction in each active warp using the specified binary reduction functor. The output is valid in warp <em>lane</em><sub>0</sub>.
- *
- * All threads in each logical warp must agree on the same value for \p valid_items. Otherwise the result is undefined.
- *
- * Supports non-commutative reduction operators
- *
- * \smemreuse
- *
- * The code snippet below illustrates a max reduction within a single, partially-full
- * block of 32 threads (one warp).
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(int *d_data, int valid_items)
- * {
- * // Specialize WarpReduce for a single warp on type int
- * typedef cub::WarpReduce<int, 1> WarpReduce;
- *
- * // Allocate shared memory for WarpReduce
- * __shared__ typename WarpReduce::TempStorage temp_storage;
- *
- * // Obtain one input item per thread if in range
- * int thread_data;
- * if (threadIdx.x < valid_items)
- * thread_data = d_data[threadIdx.x];
- *
- * // Return the warp-wide reductions to each lane0
- * int aggregate = WarpReduce(temp_storage).Reduce(
- * thread_data, cub::Max(), valid_items);
- *
- * \endcode
- * \par
- * Suppose the input \p d_data is <tt>0, 1, 2, 3, 4, ...</tt> and \p valid_items
- * is \p 4. The corresponding output \p aggregate in thread0 is \p 3 (and is
- * undefined in other threads).
- *
- * \tparam ReductionOp <b>[inferred]</b> Binary reduction operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
- template <typename ReductionOp>
- __device__ __forceinline__ T Reduce(
- T input, ///< [in] Calling thread's input
- ReductionOp reduction_op, ///< [in] Binary reduction operator
- int valid_items) ///< [in] Total number of valid items in the calling thread's logical warp (may be less than \p LOGICAL_WARP_THREADS)
- {
- // Determine if we don't need bounds checking
- if (valid_items >= LOGICAL_WARP_THREADS)
- {
- return InternalWarpReduce(temp_storage, warp_id, lane_id).Reduce<true, 1>(input, valid_items, reduction_op);
- }
- else
- {
- return InternalWarpReduce(temp_storage, warp_id, lane_id).Reduce<false, 1>(input, valid_items, reduction_op);
- }
- }
-
-
- /**
- * \brief Computes a segmented reduction in each active warp where segments are defined by head-flags. The reduction of each segment is returned to the first lane in that segment (which always includes <em>lane</em><sub>0</sub>).
- *
- * Supports non-commutative reduction operators
- *
- * \smemreuse
- *
- * The code snippet below illustrates a head-segmented warp max
- * reduction within a block of 32 threads (one warp).
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize WarpReduce for a single warp on type int
- * typedef cub::WarpReduce<int, 1> WarpReduce;
- *
- * // Allocate shared memory for WarpReduce
- * __shared__ typename WarpReduce::TempStorage temp_storage;
- *
- * // Obtain one input item and flag per thread
- * int thread_data = ...
- * int head_flag = ...
- *
- * // Return the warp-wide reductions to each lane0
- * int aggregate = WarpReduce(temp_storage).HeadSegmentedReduce(
- * thread_data, head_flag, cub::Max());
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data and \p head_flag across the block of threads
- * is <tt>0, 1, 2, 3, ..., 31</tt> and is <tt>1, 0, 0, 0, 1, 0, 0, 0, ..., 1, 0, 0, 0</tt>,
- * respectively. The corresponding output \p aggregate in threads 0, 4, 8, etc. will be
- * \p 3, \p 7, \p 11, etc. (and is undefined in other threads).
- *
- * \tparam ReductionOp <b>[inferred]</b> Binary reduction operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
- template <
- typename ReductionOp,
- typename Flag>
- __device__ __forceinline__ T HeadSegmentedReduce(
- T input, ///< [in] Calling thread's input
- Flag head_flag, ///< [in] Head flag denoting whether or not \p input is the start of a new segment
- ReductionOp reduction_op) ///< [in] Reduction operator
- {
- return InternalWarpReduce(temp_storage, warp_id, lane_id).template SegmentedReduce<true>(input, head_flag, reduction_op);
- }
-
-
- /**
- * \brief Computes a segmented reduction in each active warp where segments are defined by tail-flags. The reduction of each segment is returned to the first lane in that segment (which always includes <em>lane</em><sub>0</sub>).
- *
- * Supports non-commutative reduction operators
- *
- * \smemreuse
- *
- * The code snippet below illustrates a tail-segmented warp max
- * reduction within a block of 32 threads (one warp).
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize WarpReduce for a single warp on type int
- * typedef cub::WarpReduce<int, 1> WarpReduce;
- *
- * // Allocate shared memory for WarpReduce
- * __shared__ typename WarpReduce::TempStorage temp_storage;
- *
- * // Obtain one input item and flag per thread
- * int thread_data = ...
- * int tail_flag = ...
- *
- * // Return the warp-wide reductions to each lane0
- * int aggregate = WarpReduce(temp_storage).TailSegmentedReduce(
- * thread_data, tail_flag, cub::Max());
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data and \p tail_flag across the block of threads
- * is <tt>0, 1, 2, 3, ..., 31</tt> and is <tt>0, 0, 0, 1, 0, 0, 0, 1, ..., 0, 0, 0, 1</tt>,
- * respectively. The corresponding output \p aggregate in threads 0, 4, 8, etc. will be
- * \p 3, \p 7, \p 11, etc. (and is undefined in other threads).
- *
- * \tparam ReductionOp <b>[inferred]</b> Binary reduction operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
- template <
- typename ReductionOp,
- typename Flag>
- __device__ __forceinline__ T TailSegmentedReduce(
- T input, ///< [in] Calling thread's input
- Flag tail_flag, ///< [in] Tail flag denoting whether or not \p input is the end of the current segment
- ReductionOp reduction_op) ///< [in] Reduction operator
- {
- return InternalWarpReduce(temp_storage, warp_id, lane_id).template SegmentedReduce<false>(input, tail_flag, reduction_op);
- }
-
-
-
- //@} end member group
-};
-
-/** @} */ // end group WarpModule
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
diff --git a/lib/kokkos/TPL/cub/warp/warp_scan.cuh b/lib/kokkos/TPL/cub/warp/warp_scan.cuh
deleted file mode 100755
index a588b52bd..000000000
--- a/lib/kokkos/TPL/cub/warp/warp_scan.cuh
+++ /dev/null
@@ -1,1297 +0,0 @@
-/******************************************************************************
- * Copyright (c) 2011, Duane Merrill. All rights reserved.
- * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- * * Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * * Neither the name of the NVIDIA CORPORATION nor the
- * names of its contributors may be used to endorse or promote products
- * derived from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
- * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- ******************************************************************************/
-
-/**
- * \file
- * The cub::WarpScan class provides [<em>collective</em>](index.html#sec0) methods for computing a parallel prefix scan of items partitioned across CUDA warp threads.
- */
-
-#pragma once
-
-#include "specializations/warp_scan_shfl.cuh"
-#include "specializations/warp_scan_smem.cuh"
-#include "../thread/thread_operators.cuh"
-#include "../util_arch.cuh"
-#include "../util_type.cuh"
-#include "../util_namespace.cuh"
-
-/// Optional outer namespace(s)
-CUB_NS_PREFIX
-
-/// CUB namespace
-namespace cub {
-
-/**
- * \addtogroup WarpModule
- * @{
- */
-
-/**
- * \brief The WarpScan class provides [<em>collective</em>](index.html#sec0) methods for computing a parallel prefix scan of items partitioned across CUDA warp threads. ![](warp_scan_logo.png)
- *
- * \par Overview
- * Given a list of input elements and a binary reduction operator, a [<em>prefix scan</em>](http://en.wikipedia.org/wiki/Prefix_sum)
- * produces an output list where each element is computed to be the reduction
- * of the elements occurring earlier in the input list. <em>Prefix sum</em>
- * connotes a prefix scan with the addition operator. The term \em inclusive indicates
- * that the <em>i</em><sup>th</sup> output reduction incorporates the <em>i</em><sup>th</sup> input.
- * The term \em exclusive indicates the <em>i</em><sup>th</sup> input is not incorporated into
- * the <em>i</em><sup>th</sup> output reduction.
- *
- * \tparam T The scan input/output element type
- * \tparam LOGICAL_WARPS <b>[optional]</b> The number of "logical" warps performing concurrent warp scans. Default is 1.
- * \tparam LOGICAL_WARP_THREADS <b>[optional]</b> The number of threads per "logical" warp (may be less than the number of hardware warp threads). Default is the warp size associated with the CUDA Compute Capability targeted by the compiler (e.g., 32 threads for SM20).
- *
- * \par Simple Examples
- * \warpcollective{WarpScan}
- * \par
- * The code snippet below illustrates four concurrent warp prefix sums within a block of
- * 128 threads (one per each of the 32-thread warps).
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize WarpScan for 4 warps on type int
- * typedef cub::WarpScan<int, 4> WarpScan;
- *
- * // Allocate shared memory for WarpScan
- * __shared__ typename WarpScan::TempStorage temp_storage;
- *
- * // Obtain one input item per thread
- * int thread_data = ...
- *
- * // Compute warp-wide prefix sums
- * WarpScan(temp_storage).ExclusiveSum(thread_data, thread_data);
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is <tt>1, 1, 1, 1, ...</tt>.
- * The corresponding output \p thread_data in each of the four warps of threads will be
- * <tt>0, 1, 2, 3, ..., 31</tt>.
- *
- * \par
- * The code snippet below illustrates a single warp prefix sum within a block of
- * 128 threads.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize WarpScan for one warp on type int
- * typedef cub::WarpScan<int, 1> WarpScan;
- *
- * // Allocate shared memory for WarpScan
- * __shared__ typename WarpScan::TempStorage temp_storage;
- * ...
- *
- * // Only the first warp performs a prefix sum
- * if (threadIdx.x < 32)
- * {
- * // Obtain one input item per thread
- * int thread_data = ...
- *
- * // Compute warp-wide prefix sums
- * WarpScan(temp_storage).ExclusiveSum(thread_data, thread_data);
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the warp of threads is <tt>1, 1, 1, 1, ...</tt>.
- * The corresponding output \p thread_data will be <tt>0, 1, 2, 3, ..., 31</tt>.
- *
- * \par Usage and Performance Considerations
- * - Supports "logical" warps smaller than the physical warp size (e.g., a logical warp of 8 threads)
- * - The number of entrant threads must be a multiple of \p LOGICAL_WARP_THREADS
- * - Warp scans are concurrent if more than one warp is participating
- * - Uses special instructions when applicable (e.g., warp \p SHFL)
- * - Uses synchronization-free communication between warp lanes when applicable
- * - Zero bank conflicts for most types.
- * - Computation is slightly more efficient (i.e., having lower instruction overhead) for:
- * - Summation (<b><em>vs.</em></b> generic scan)
- * - The architecture's warp size is a whole multiple of \p LOGICAL_WARP_THREADS
- *
- */
-template <
- typename T,
- int LOGICAL_WARPS = 1,
- int LOGICAL_WARP_THREADS = PtxArchProps::WARP_THREADS>
-class WarpScan
-{
-private:
-
- /******************************************************************************
- * Constants and typedefs
- ******************************************************************************/
-
- enum
- {
- POW_OF_TWO = ((LOGICAL_WARP_THREADS & (LOGICAL_WARP_THREADS - 1)) == 0),
- };
-
-    /// Internal specialization. Use SHFL-based scan if (architecture is >= SM30) and ((only one logical warp) or (LOGICAL_WARP_THREADS is a power-of-two))
- typedef typename If<(CUB_PTX_ARCH >= 300) && ((LOGICAL_WARPS == 1) || POW_OF_TWO),
- WarpScanShfl<T, LOGICAL_WARPS, LOGICAL_WARP_THREADS>,
- WarpScanSmem<T, LOGICAL_WARPS, LOGICAL_WARP_THREADS> >::Type InternalWarpScan;
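-    // Editor's note (illustration, not part of the original source): the If<> selection above
-    // only picks the shuffle-based specialization when compiling for SM30 or newer *and* the
-    // logical-warp geometry is shuffle-friendly, e.g.
-    //     WarpScan<int, 1, 32> built for sm_35 -> WarpScanShfl (one full warp)
-    //     WarpScan<int, 2, 20> built for sm_35 -> WarpScanSmem (20 is not a power of two)
-    //     any WarpScan         built for sm_20 -> WarpScanSmem (__shfl requires SM30+)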
-
- /// Shared memory storage layout type for WarpScan
- typedef typename InternalWarpScan::TempStorage _TempStorage;
-
-
- /******************************************************************************
- * Thread fields
- ******************************************************************************/
-
- /// Shared storage reference
- _TempStorage &temp_storage;
-
- /// Warp ID
- int warp_id;
-
- /// Lane ID
- int lane_id;
-
-
- /******************************************************************************
- * Utility methods
- ******************************************************************************/
-
- /// Internal storage allocator
- __device__ __forceinline__ _TempStorage& PrivateStorage()
- {
- __shared__ TempStorage private_storage;
- return private_storage;
- }
-
-
-public:
-
- /// \smemstorage{WarpScan}
- struct TempStorage : Uninitialized<_TempStorage> {};
-
-
- /******************************************************************//**
- * \name Collective constructors
- *********************************************************************/
- //@{
-
- /**
- * \brief Collective constructor for 1D thread blocks using a private static allocation of shared memory as temporary storage. Logical warp and lane identifiers are constructed from <tt>threadIdx.x</tt>.
- */
- __device__ __forceinline__ WarpScan()
- :
- temp_storage(PrivateStorage()),
- warp_id((LOGICAL_WARPS == 1) ?
- 0 :
- threadIdx.x / LOGICAL_WARP_THREADS),
- lane_id(((LOGICAL_WARPS == 1) || (LOGICAL_WARP_THREADS == PtxArchProps::WARP_THREADS)) ?
- LaneId() :
- threadIdx.x % LOGICAL_WARP_THREADS)
- {}
-
-
- /**
- * \brief Collective constructor for 1D thread blocks using the specified memory allocation as temporary storage. Logical warp and lane identifiers are constructed from <tt>threadIdx.x</tt>.
- */
- __device__ __forceinline__ WarpScan(
- TempStorage &temp_storage) ///< [in] Reference to memory allocation having layout type TempStorage
- :
- temp_storage(temp_storage.Alias()),
- warp_id((LOGICAL_WARPS == 1) ?
- 0 :
- threadIdx.x / LOGICAL_WARP_THREADS),
- lane_id(((LOGICAL_WARPS == 1) || (LOGICAL_WARP_THREADS == PtxArchProps::WARP_THREADS)) ?
- LaneId() :
- threadIdx.x % LOGICAL_WARP_THREADS)
- {}
-
-
- /**
- * \brief Collective constructor using a private static allocation of shared memory as temporary storage. Threads are identified using the given warp and lane identifiers.
- */
- __device__ __forceinline__ WarpScan(
- int warp_id, ///< [in] A suitable warp membership identifier
- int lane_id) ///< [in] A lane identifier within the warp
- :
- temp_storage(PrivateStorage()),
- warp_id(warp_id),
- lane_id(lane_id)
- {}
-
-
- /**
- * \brief Collective constructor using the specified memory allocation as temporary storage. Threads are identified using the given warp and lane identifiers.
- */
- __device__ __forceinline__ WarpScan(
- TempStorage &temp_storage, ///< [in] Reference to memory allocation having layout type TempStorage
- int warp_id, ///< [in] A suitable warp membership identifier
- int lane_id) ///< [in] A lane identifier within the warp
- :
- temp_storage(temp_storage.Alias()),
- warp_id(warp_id),
- lane_id(lane_id)
- {}
-
-
- //@} end member group
- /******************************************************************//**
- * \name Inclusive prefix sums
- *********************************************************************/
- //@{
-
-
- /**
- * \brief Computes an inclusive prefix sum in each logical warp.
- *
- * \smemreuse
- *
- * The code snippet below illustrates four concurrent warp-wide inclusive prefix sums within a block of
- * 128 threads (one per each of the 32-thread warps).
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize WarpScan for 4 warps on type int
- * typedef cub::WarpScan<int, 4> WarpScan;
- *
- * // Allocate shared memory for WarpScan
- * __shared__ typename WarpScan::TempStorage temp_storage;
- *
- * // Obtain one input item per thread
- * int thread_data = ...
- *
- * // Compute inclusive warp-wide prefix sums
- * WarpScan(temp_storage).InclusiveSum(thread_data, thread_data);
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is <tt>1, 1, 1, 1, ...</tt>.
- * The corresponding output \p thread_data in each of the four warps of threads will be
- * <tt>1, 2, 3, ..., 32</tt>.
- */
- __device__ __forceinline__ void InclusiveSum(
- T input, ///< [in] Calling thread's input item.
- T &output) ///< [out] Calling thread's output item. May be aliased with \p input.
- {
- InternalWarpScan(temp_storage, warp_id, lane_id).InclusiveSum(input, output);
- }
-
-
- /**
- * \brief Computes an inclusive prefix sum in each logical warp. Also provides every thread with the warp-wide \p warp_aggregate of all inputs.
- *
- * The \p warp_aggregate is undefined in threads other than <em>warp-lane</em><sub>0</sub>.
- *
- * \smemreuse
- *
- * The code snippet below illustrates four concurrent warp-wide inclusive prefix sums within a block of
- * 128 threads (one per each of the 32-thread warps).
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize WarpScan for 4 warps on type int
- * typedef cub::WarpScan<int, 4> WarpScan;
- *
- * // Allocate shared memory for WarpScan
- * __shared__ typename WarpScan::TempStorage temp_storage;
- *
- * // Obtain one input item per thread
- * int thread_data = ...
- *
- * // Compute inclusive warp-wide prefix sums
- * int warp_aggregate;
- * WarpScan(temp_storage).InclusiveSum(thread_data, thread_data, warp_aggregate);
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is <tt>1, 1, 1, 1, ...</tt>.
- * The corresponding output \p thread_data in each of the four warps of threads will be
- * <tt>1, 2, 3, ..., 32</tt>. Furthermore, \p warp_aggregate for all threads in all warps will be \p 32.
- */
- __device__ __forceinline__ void InclusiveSum(
- T input, ///< [in] Calling thread's input item.
- T &output, ///< [out] Calling thread's output item. May be aliased with \p input.
- T &warp_aggregate) ///< [out] Warp-wide aggregate reduction of input items.
- {
- InternalWarpScan(temp_storage, warp_id, lane_id).InclusiveSum(input, output, warp_aggregate);
- }
-
-
- /**
- * \brief Computes an inclusive prefix sum in each logical warp. Instead of using 0 as the warp-wide prefix, the call-back functor \p warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also provides every thread with the warp-wide \p warp_aggregate of all inputs.
- *
- * The \p warp_aggregate is undefined in threads other than <em>warp-lane</em><sub>0</sub>.
- *
- * The \p warp_prefix_op functor must implement a member function <tt>T operator()(T warp_aggregate)</tt>.
- * The functor's input parameter \p warp_aggregate is the same value also returned by the scan operation.
- * The functor will be invoked by the entire warp of threads, however only the return value from
- * <em>lane</em><sub>0</sub> is applied as the warp-wide prefix. Can be stateful.
- *
- * \smemreuse
- *
- * The code snippet below illustrates a single thread block of 32 threads (one warp) that progressively
- * computes an inclusive prefix sum over multiple "tiles" of input using a
- * prefix functor to maintain a running total between block-wide scans. Each tile consists
- * of 32 integer items that are partitioned across the warp.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * // A stateful callback functor that maintains a running prefix to be applied
- * // during consecutive scan operations.
- * struct WarpPrefixOp
- * {
- * // Running prefix
- * int running_total;
- *
- * // Constructor
- * __device__ WarpPrefixOp(int running_total) : running_total(running_total) {}
- *
- * // Callback operator to be entered by the entire warp. Lane-0 is responsible
- * // for returning a value for seeding the warp-wide scan.
- * __device__ int operator()(int warp_aggregate)
- * {
- * int old_prefix = running_total;
- * running_total += warp_aggregate;
- * return old_prefix;
- * }
- * };
- *
- * __global__ void ExampleKernel(int *d_data, int num_items, ...)
- * {
- * // Specialize WarpScan for one warp
- * typedef cub::WarpScan<int, 1> WarpScan;
- *
- * // Allocate shared memory for WarpScan
- * __shared__ typename WarpScan::TempStorage temp_storage;
- *
- * // Initialize running total
- * WarpPrefixOp prefix_op(0);
- *
- * // Have the warp iterate over segments of items
- * for (int block_offset = 0; block_offset < num_items; block_offset += 32)
- * {
- * // Load a segment of consecutive items
- * int thread_data = d_data[block_offset];
- *
- * // Collectively compute the warp-wide inclusive prefix sum
- * int warp_aggregate;
- * WarpScan(temp_storage).InclusiveSum(
- * thread_data, thread_data, warp_aggregate, prefix_op);
- *
- * // Store scanned items to output segment
- * d_data[block_offset] = thread_data;
- * }
- * \endcode
- * \par
- * Suppose the input \p d_data is <tt>1, 1, 1, 1, 1, 1, 1, 1, ...</tt>.
- * The corresponding output for the first segment will be <tt>1, 2, 3, ..., 32</tt>.
- * The output for the second segment will be <tt>33, 34, 35, ..., 64</tt>. Furthermore,
- * the value \p 32 will be stored in \p warp_aggregate for all threads after each scan.
- *
- * \tparam WarpPrefixOp <b>[inferred]</b> Call-back functor type having member <tt>T operator()(T warp_aggregate)</tt>
- */
- template <typename WarpPrefixOp>
- __device__ __forceinline__ void InclusiveSum(
- T input, ///< [in] Calling thread's input item.
- T &output, ///< [out] Calling thread's output item. May be aliased with \p input.
- T &warp_aggregate, ///< [out] <b>[<em>warp-lane</em><sub>0</sub> only]</b> Warp-wide aggregate reduction of input items, exclusive of the \p warp_prefix_op value
- WarpPrefixOp &warp_prefix_op) ///< [in-out] <b>[<em>warp-lane</em><sub>0</sub> only]</b> Call-back functor for specifying a warp-wide prefix to be applied to all inputs.
- {
- // Compute inclusive warp scan
- InclusiveSum(input, output, warp_aggregate);
-
- // Compute warp-wide prefix from aggregate, then broadcast to other lanes
- T prefix;
- prefix = warp_prefix_op(warp_aggregate);
- prefix = InternalWarpScan(temp_storage, warp_id, lane_id).Broadcast(prefix, 0);
-
- // Update output
- output = prefix + output;
- }
-
- //@} end member group
-
-private:
-
- /// Computes an exclusive prefix sum in each logical warp.
- __device__ __forceinline__ void ExclusiveSum(T input, T &output, Int2Type<true> is_primitive)
- {
- // Compute exclusive warp scan from inclusive warp scan
- T inclusive;
- InclusiveSum(input, inclusive);
- output = inclusive - input;
- }
-
- /// Computes an exclusive prefix sum in each logical warp. Specialized for non-primitive types.
- __device__ __forceinline__ void ExclusiveSum(T input, T &output, Int2Type<false> is_primitive)
- {
- // Delegate to regular scan for non-primitive types (because we won't be able to use subtraction)
- T identity = T();
- ExclusiveScan(input, output, identity, Sum());
- }
-
- /// Computes an exclusive prefix sum in each logical warp. Also provides every thread with the warp-wide \p warp_aggregate of all inputs.
- __device__ __forceinline__ void ExclusiveSum(T input, T &output, T &warp_aggregate, Int2Type<true> is_primitive)
- {
- // Compute exclusive warp scan from inclusive warp scan
- T inclusive;
- InclusiveSum(input, inclusive, warp_aggregate);
- output = inclusive - input;
- }
-
- /// Computes an exclusive prefix sum in each logical warp. Also provides every thread with the warp-wide \p warp_aggregate of all inputs. Specialized for non-primitive types.
- __device__ __forceinline__ void ExclusiveSum(T input, T &output, T &warp_aggregate, Int2Type<false> is_primitive)
- {
- // Delegate to regular scan for non-primitive types (because we won't be able to use subtraction)
- T identity = T();
- ExclusiveScan(input, output, identity, Sum(), warp_aggregate);
- }
-
- /// Computes an exclusive prefix sum in each logical warp. Instead of using 0 as the warp-wide prefix, the call-back functor \p warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also provides every thread with the warp-wide \p warp_aggregate of all inputs.
- template <typename WarpPrefixOp>
- __device__ __forceinline__ void ExclusiveSum(T input, T &output, T &warp_aggregate, WarpPrefixOp &warp_prefix_op, Int2Type<true> is_primitive)
- {
- // Compute exclusive warp scan from inclusive warp scan
- T inclusive;
- InclusiveSum(input, inclusive, warp_aggregate, warp_prefix_op);
- output = inclusive - input;
- }
-
- /// Computes an exclusive prefix sum in each logical warp. Instead of using 0 as the warp-wide prefix, the call-back functor \p warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also provides every thread with the warp-wide \p warp_aggregate of all inputs. Specialized for non-primitive types.
- template <typename WarpPrefixOp>
- __device__ __forceinline__ void ExclusiveSum(T input, T &output, T &warp_aggregate, WarpPrefixOp &warp_prefix_op, Int2Type<false> is_primitive)
- {
- // Delegate to regular scan for non-primitive types (because we won't be able to use subtraction)
- T identity = T();
- ExclusiveScan(input, output, identity, Sum(), warp_aggregate, warp_prefix_op);
- }
-
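-    // Editor's note (illustration, not part of the original source): the primitive
-    // specializations above recover the exclusive result by subtraction because, for ordinary
-    // addition, exclusive[i] == inclusive[i] - input[i]. For example, inputs 3, 1, 4, 1 give
-    // inclusive sums 3, 4, 8, 9 and therefore exclusive sums 0, 3, 4, 8. Non-primitive types
-    // instead delegate to ExclusiveScan() with a default-constructed identity, since operator-
-    // may be unavailable or may not invert the scan operator.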
-public:
-
-
- /******************************************************************//**
- * \name Exclusive prefix sums
- *********************************************************************/
- //@{
-
-
- /**
- * \brief Computes an exclusive prefix sum in each logical warp.
- *
- * This operation assumes the value obtained by <tt>T</tt>'s default
- * constructor (or by zero-initialization if no user-defined default
- * constructor exists) is suitable as the identity value "zero" for
- * addition.
- *
- * \smemreuse
- *
- * The code snippet below illustrates four concurrent warp-wide exclusive prefix sums within a block of
- * 128 threads (one per each of the 32-thread warps).
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize WarpScan for 4 warps on type int
- * typedef cub::WarpScan<int, 4> WarpScan;
- *
- * // Allocate shared memory for WarpScan
- * __shared__ typename WarpScan::TempStorage temp_storage;
- *
- * // Obtain one input item per thread
- * int thread_data = ...
- *
- * // Compute exclusive warp-wide prefix sums
- * WarpScan(temp_storage).ExclusiveSum(thread_data, thread_data);
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is <tt>1, 1, 1, 1, ...</tt>.
- * The corresponding output \p thread_data in each of the four warps of threads will be
- * <tt>0, 1, 2, ..., 31</tt>.
- *
- */
- __device__ __forceinline__ void ExclusiveSum(
- T input, ///< [in] Calling thread's input item.
- T &output) ///< [out] Calling thread's output item. May be aliased with \p input.
- {
- ExclusiveSum(input, output, Int2Type<Traits<T>::PRIMITIVE>());
- }
-
-
- /**
- * \brief Computes an exclusive prefix sum in each logical warp. Also provides every thread with the warp-wide \p warp_aggregate of all inputs.
- *
- * This operation assumes the value obtained by <tt>T</tt>'s default
- * constructor (or by zero-initialization if no user-defined default
- * constructor exists) is suitable as the identity value "zero" for
- * addition.
- *
- * \smemreuse
- *
- * The code snippet below illustrates four concurrent warp-wide exclusive prefix sums within a block of
- * 128 threads (one per each of the 32-thread warps).
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize WarpScan for 4 warps on type int
- * typedef cub::WarpScan<int, 4> WarpScan;
- *
- * // Allocate shared memory for WarpScan
- * __shared__ typename WarpScan::TempStorage temp_storage;
- *
- * // Obtain one input item per thread
- * int thread_data = ...
- *
- * // Compute exclusive warp-wide prefix sums
- * int warp_aggregate;
- * WarpScan(temp_storage).ExclusiveSum(thread_data, thread_data, warp_aggregate);
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is <tt>1, 1, 1, 1, ...</tt>.
- * The corresponding output \p thread_data in each of the four warps of threads will be
- * <tt>0, 1, 2, ..., 31</tt>. Furthermore, \p warp_aggregate for all threads in all warps will be \p 32.
- */
- __device__ __forceinline__ void ExclusiveSum(
- T input, ///< [in] Calling thread's input item.
- T &output, ///< [out] Calling thread's output item. May be aliased with \p input.
- T &warp_aggregate) ///< [out] Warp-wide aggregate reduction of input items.
- {
- ExclusiveSum(input, output, warp_aggregate, Int2Type<Traits<T>::PRIMITIVE>());
- }
-
-
- /**
- * \brief Computes an exclusive prefix sum in each logical warp. Instead of using 0 as the warp-wide prefix, the call-back functor \p warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also provides every thread with the warp-wide \p warp_aggregate of all inputs.
- *
- * This operation assumes the value obtained by <tt>T</tt>'s default
- * constructor (or by zero-initialization if no user-defined default
- * constructor exists) is suitable as the identity value "zero" for
- * addition.
- *
- * The \p warp_prefix_op functor must implement a member function <tt>T operator()(T warp_aggregate)</tt>.
- * The functor's input parameter \p warp_aggregate is the same value also returned by the scan operation.
- * The functor will be invoked by the entire warp of threads, however only the return value from
- * <em>lane</em><sub>0</sub> is applied as the warp-wide prefix. Can be stateful.
- *
- * \smemreuse
- *
- * The code snippet below illustrates a single thread block of 32 threads (one warp) that progressively
- * computes an exclusive prefix sum over multiple "tiles" of input using a
- * prefix functor to maintain a running total between block-wide scans. Each tile consists
- * of 32 integer items that are partitioned across the warp.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * // A stateful callback functor that maintains a running prefix to be applied
- * // during consecutive scan operations.
- * struct WarpPrefixOp
- * {
- * // Running prefix
- * int running_total;
- *
- * // Constructor
- * __device__ WarpPrefixOp(int running_total) : running_total(running_total) {}
- *
- * // Callback operator to be entered by the entire warp. Lane-0 is responsible
- * // for returning a value for seeding the warp-wide scan.
- * __device__ int operator()(int warp_aggregate)
- * {
- * int old_prefix = running_total;
- * running_total += warp_aggregate;
- * return old_prefix;
- * }
- * };
- *
- * __global__ void ExampleKernel(int *d_data, int num_items, ...)
- * {
- * // Specialize WarpScan for one warp
- * typedef cub::WarpScan<int, 1> WarpScan;
- *
- * // Allocate shared memory for WarpScan
- * __shared__ typename WarpScan::TempStorage temp_storage;
- *
- * // Initialize running total
- * WarpPrefixOp prefix_op(0);
- *
- * // Have the warp iterate over segments of items
- * for (int block_offset = 0; block_offset < num_items; block_offset += 32)
- * {
- * // Load a segment of consecutive items
- * int thread_data = d_data[block_offset];
- *
- * // Collectively compute the warp-wide exclusive prefix sum
- * int warp_aggregate;
- * WarpScan(temp_storage).ExclusiveSum(
- * thread_data, thread_data, warp_aggregate, prefix_op);
- *
- * // Store scanned items to output segment
- * d_data[block_offset] = thread_data;
- * }
- * \endcode
- * \par
- * Suppose the input \p d_data is <tt>1, 1, 1, 1, 1, 1, 1, 1, ...</tt>.
- * The corresponding output for the first segment will be <tt>0, 1, 2, ..., 31</tt>.
- * The output for the second segment will be <tt>32, 33, 34, ..., 63</tt>. Furthermore,
- * the value \p 32 will be stored in \p warp_aggregate for all threads after each scan.
- *
- * \tparam WarpPrefixOp <b>[inferred]</b> Call-back functor type having member <tt>T operator()(T warp_aggregate)</tt>
- */
- template <typename WarpPrefixOp>
- __device__ __forceinline__ void ExclusiveSum(
- T input, ///< [in] Calling thread's input item.
- T &output, ///< [out] Calling thread's output item. May be aliased with \p input.
- T &warp_aggregate, ///< [out] <b>[<em>warp-lane</em><sub>0</sub> only]</b> Warp-wide aggregate reduction of input items (exclusive of the \p warp_prefix_op value).
- WarpPrefixOp &warp_prefix_op) ///< [in-out] <b>[<em>warp-lane</em><sub>0</sub> only]</b> Call-back functor for specifying a warp-wide prefix to be applied to all inputs.
- {
- ExclusiveSum(input, output, warp_aggregate, warp_prefix_op, Int2Type<Traits<T>::PRIMITIVE>());
- }
-
-
- //@} end member group
- /******************************************************************//**
- * \name Inclusive prefix scans
- *********************************************************************/
- //@{
-
- /**
- * \brief Computes an inclusive prefix scan using the specified binary scan functor in each logical warp.
- *
- * Supports non-commutative scan operators.
- *
- * \smemreuse
- *
- * The code snippet below illustrates four concurrent warp-wide inclusive prefix max scans within a block of
- * 128 threads (one per each of the 32-thread warps).
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize WarpScan for 4 warps on type int
- * typedef cub::WarpScan<int, 4> WarpScan;
- *
- * // Allocate shared memory for WarpScan
- * __shared__ typename WarpScan::TempStorage temp_storage;
- *
- * // Obtain one input item per thread
- * int thread_data = ...
- *
- * // Compute inclusive warp-wide prefix max scans
- * WarpScan(temp_storage).InclusiveScan(thread_data, thread_data, cub::Max());
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is <tt>0, -1, 2, -3, ..., 126, -127</tt>.
- * The corresponding output \p thread_data in the first warp would be
- * <tt>0, 0, 2, 2, ..., 30, 30</tt>, the output for the second warp would be <tt>32, 32, 34, 34, ..., 62, 62</tt>, etc.
- *
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
- template <typename ScanOp>
- __device__ __forceinline__ void InclusiveScan(
- T input, ///< [in] Calling thread's input item.
- T &output, ///< [out] Calling thread's output item. May be aliased with \p input.
- ScanOp scan_op) ///< [in] Binary scan operator
- {
- InternalWarpScan(temp_storage, warp_id, lane_id).InclusiveScan(input, output, scan_op);
- }
-
-
- /**
- * \brief Computes an inclusive prefix scan using the specified binary scan functor in each logical warp. Also provides every thread with the warp-wide \p warp_aggregate of all inputs.
- *
- * Supports non-commutative scan operators.
- *
- * \smemreuse
- *
- * The code snippet below illustrates four concurrent warp-wide inclusive prefix max scans within a block of
- * 128 threads (one per each of the 32-thread warps).
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize WarpScan for 4 warps on type int
- * typedef cub::WarpScan<int, 4> WarpScan;
- *
- * // Allocate shared memory for WarpScan
- * __shared__ typename WarpScan::TempStorage temp_storage;
- *
- * // Obtain one input item per thread
- * int thread_data = ...
- *
- * // Compute inclusive warp-wide prefix max scans
- * int warp_aggregate;
- * WarpScan(temp_storage).InclusiveScan(
- * thread_data, thread_data, cub::Max(), warp_aggregate);
- *
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is <tt>0, -1, 2, -3, ..., 126, -127</tt>.
- * The corresponding output \p thread_data in the first warp would be
- * <tt>0, 0, 2, 2, ..., 30, 30</tt>, the output for the second warp would be <tt>32, 32, 34, 34, ..., 62, 62</tt>, etc.
- * Furthermore, \p warp_aggregate would be assigned \p 30 for threads in the first warp, \p 62 for threads
- * in the second warp, etc.
- *
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
- template <typename ScanOp>
- __device__ __forceinline__ void InclusiveScan(
- T input, ///< [in] Calling thread's input item.
- T &output, ///< [out] Calling thread's output item. May be aliased with \p input.
- ScanOp scan_op, ///< [in] Binary scan operator
- T &warp_aggregate) ///< [out] Warp-wide aggregate reduction of input items.
- {
- InternalWarpScan(temp_storage, warp_id, lane_id).InclusiveScan(input, output, scan_op, warp_aggregate);
- }
-
-
- /**
- * \brief Computes an inclusive prefix scan using the specified binary scan functor in each logical warp. The call-back functor \p warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also provides every thread with the warp-wide \p warp_aggregate of all inputs.
- *
- * The \p warp_prefix_op functor must implement a member function <tt>T operator()(T warp_aggregate)</tt>.
- * The functor's input parameter \p warp_aggregate is the same value also returned by the scan operation.
- * The functor will be invoked by the entire warp of threads, however only the return value from
- * <em>lane</em><sub>0</sub> is applied as the warp-wide prefix. Can be stateful.
- *
- * Supports non-commutative scan operators.
- *
- * \smemreuse
- *
- * The code snippet below illustrates a single thread block of 32 threads (one warp) that progressively
- * computes an inclusive prefix max scan over multiple "tiles" of input using a
- * prefix functor to maintain a running total between block-wide scans. Each tile consists
- * of 32 integer items that are partitioned across the warp.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * // A stateful callback functor that maintains a running prefix to be applied
- * // during consecutive scan operations.
- * struct WarpPrefixOp
- * {
- * // Running prefix
- * int running_total;
- *
- * // Constructor
- * __device__ WarpPrefixOp(int running_total) : running_total(running_total) {}
- *
- * // Callback operator to be entered by the entire warp. Lane-0 is responsible
- * // for returning a value for seeding the warp-wide scan.
- * __device__ int operator()(int warp_aggregate)
- * {
- * int old_prefix = running_total;
- * running_total = (warp_aggregate > old_prefix) ? warp_aggregate : old_prefix;
- * return old_prefix;
- * }
- * };
- *
- * __global__ void ExampleKernel(int *d_data, int num_items, ...)
- * {
- * // Specialize WarpScan for one warp
- * typedef cub::WarpScan<int, 1> WarpScan;
- *
- * // Allocate shared memory for WarpScan
- * __shared__ typename WarpScan::TempStorage temp_storage;
- *
- * // Initialize running total
- * WarpPrefixOp prefix_op(0);
- *
- * // Have the warp iterate over segments of items
- * for (int block_offset = 0; block_offset < num_items; block_offset += 32)
- * {
- * // Load a segment of consecutive items
- * int thread_data = d_data[block_offset];
- *
- * // Collectively compute the warp-wide inclusive prefix max scan
- * int warp_aggregate;
- * WarpScan(temp_storage).InclusiveScan(
- * thread_data, thread_data, cub::Max(), warp_aggregate, prefix_op);
- *
- * // Store scanned items to output segment
- * d_data[block_offset + threadIdx.x] = thread_data;
- * }
- * }
- * \endcode
- * \par
- * Suppose the input \p d_data is <tt>0, -1, 2, -3, 4, -5, ...</tt>.
- * The corresponding output for the first segment will be <tt>0, 0, 2, 2, ..., 30, 30</tt>.
- * The output for the second segment will be <tt>32, 32, 34, 34, ..., 62, 62</tt>. Furthermore,
- * \p warp_aggregate will be assigned \p 30 in all threads after the first scan, assigned \p 62 after the second
- * scan, etc.
- *
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- * \tparam WarpPrefixOp <b>[inferred]</b> Call-back functor type having member <tt>T operator()(T warp_aggregate)</tt>
- */
- template <
- typename ScanOp,
- typename WarpPrefixOp>
- __device__ __forceinline__ void InclusiveScan(
- T input, ///< [in] Calling thread's input item.
- T &output, ///< [out] Calling thread's output item. May be aliased with \p input.
- ScanOp scan_op, ///< [in] Binary scan operator
- T &warp_aggregate, ///< [out] <b>[<em>warp-lane</em><sub>0</sub> only]</b> Warp-wide aggregate reduction of input items (exclusive of the \p warp_prefix_op value).
- WarpPrefixOp &warp_prefix_op) ///< [in-out] <b>[<em>warp-lane</em><sub>0</sub> only]</b> Call-back functor for specifying a warp-wide prefix to be applied to all inputs.
- {
- // Compute inclusive warp scan
- InclusiveScan(input, output, scan_op, warp_aggregate);
-
- // Compute warp-wide prefix from aggregate, then broadcast to other lanes
- T prefix;
- prefix = warp_prefix_op(warp_aggregate);
- prefix = InternalWarpScan(temp_storage, warp_id, lane_id).Broadcast(prefix, 0);
-
- // Update output
- output = scan_op(prefix, output);
- }
-
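A brief follow-up sketch (not part of the original header): the stateful call-back pattern documented above is not specific to max; for a running prefix sum, only the combine step inside the functor changes. Names below are illustrative.

    // Callback functor maintaining a running prefix *sum* across tiles (illustrative sketch)
    struct WarpPrefixSumOp
    {
        int running_total;

        __device__ WarpPrefixSumOp(int initial_total) : running_total(initial_total) {}

        // Invoked by the whole warp; only lane-0's return value seeds the next tile's scan
        __device__ int operator()(int warp_aggregate)
        {
            int old_prefix = running_total;
            running_total += warp_aggregate;   // accumulate instead of taking the max
            return old_prefix;
        }
    };

It is used exactly like WarpPrefixOp in the snippet above, with cub::Sum() passed as the scan operator.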
-
- //@} end member group
- /******************************************************************//**
- * \name Exclusive prefix scans
- *********************************************************************/
- //@{
-
- /**
- * \brief Computes an exclusive prefix scan using the specified binary scan functor in each logical warp.
- *
- * Supports non-commutative scan operators.
- *
- * \smemreuse
- *
- * The code snippet below illustrates four concurrent warp-wide exclusive prefix max scans within a block of
- * 128 threads (one per 32-thread warp).
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize WarpScan for 4 warps on type int
- * typedef cub::WarpScan<int, 4> WarpScan;
- *
- * // Allocate shared memory for WarpScan
- * __shared__ typename WarpScan::TempStorage temp_storage;
- *
- * // Obtain one input item per thread
- * int thread_data = ...
- *
- * // Compute exclusive warp-wide prefix max scans
- * WarpScan(temp_storage).ExclusiveScan(thread_data, thread_data, INT_MIN, cub::Max());
- * }
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is <tt>0, -1, 2, -3, ..., 126, -127</tt>.
- * The corresponding output \p thread_data in the first warp would be
- * <tt>INT_MIN, 0, 0, 2, ..., 28, 30</tt>, the output for the second warp would be <tt>30, 32, 32, 34, ..., 60, 62</tt>, etc.
- *
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
- template <typename ScanOp>
- __device__ __forceinline__ void ExclusiveScan(
- T input, ///< [in] Calling thread's input item.
- T &output, ///< [out] Calling thread's output item. May be aliased with \p input.
- T identity, ///< [in] Identity value
- ScanOp scan_op) ///< [in] Binary scan operator
- {
- InternalWarpScan(temp_storage, warp_id, lane_id).ExclusiveScan(input, output, identity, scan_op);
- }
-
-
- /**
- * \brief Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. Also provides every thread with the warp-wide \p warp_aggregate of all inputs.
- *
- * Supports non-commutative scan operators.
- *
- * \smemreuse
- *
- * The code snippet below illustrates four concurrent warp-wide exclusive prefix max scans within a block of
- * 128 threads (one per 32-thread warp).
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize WarpScan for 4 warps on type int
- * typedef cub::WarpScan<int, 4> WarpScan;
- *
- * // Allocate shared memory for WarpScan
- * __shared__ typename WarpScan::TempStorage temp_storage;
- *
- * // Obtain one input item per thread
- * int thread_data = ...
- *
- * // Compute exclusive warp-wide prefix max scans
- * WarpScan(temp_storage).ExclusiveScan(thread_data, thread_data, INT_MIN, cub::Max());
- * }
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is <tt>0, -1, 2, -3, ..., 126, -127</tt>.
- * The corresponding output \p thread_data in the first warp would be
- * <tt>INT_MIN, 0, 0, 2, ..., 28, 30</tt>, the output for the second warp would be <tt>30, 32, 32, 34, ..., 60, 62</tt>, etc.
- * Furthermore, \p warp_aggregate would be assigned \p 30 for threads in the first warp, \p 62 for threads
- * in the second warp, etc.
- *
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
- template <typename ScanOp>
- __device__ __forceinline__ void ExclusiveScan(
- T input, ///< [in] Calling thread's input item.
- T &output, ///< [out] Calling thread's output item. May be aliased with \p input.
- T identity, ///< [in] Identity value
- ScanOp scan_op, ///< [in] Binary scan operator
- T &warp_aggregate) ///< [out] Warp-wide aggregate reduction of input items.
- {
- InternalWarpScan(temp_storage, warp_id, lane_id).ExclusiveScan(input, output, identity, scan_op, warp_aggregate);
- }
-
-
- /**
- * \brief Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. The call-back functor \p warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also provides every thread with the warp-wide \p warp_aggregate of all inputs.
- *
- * The \p warp_prefix_op functor must implement a member function <tt>T operator()(T warp_aggregate)</tt>.
- * The functor's input parameter \p warp_aggregate is the same value also returned by the scan operation.
- * The functor will be invoked by the entire warp of threads; however, only the return value from
- * <em>lane</em><sub>0</sub> is applied as the warp-wide prefix. Can be stateful.
- *
- * Supports non-commutative scan operators.
- *
- * \smemreuse
- *
- * The code snippet below illustrates a single thread block of 32 threads (one warp) that progressively
- * computes an exclusive prefix max scan over multiple "tiles" of input using a
- * prefix functor to maintain a running total between warp-wide scans. Each tile consists
- * of 32 integer items that are partitioned across the warp.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * // A stateful callback functor that maintains a running prefix to be applied
- * // during consecutive scan operations.
- * struct WarpPrefixOp
- * {
- * // Running prefix
- * int running_total;
- *
- * // Constructor
- * __device__ WarpPrefixOp(int running_total) : running_total(running_total) {}
- *
- * // Callback operator to be entered by the entire warp. Lane-0 is responsible
- * // for returning a value for seeding the warp-wide scan.
- * __device__ int operator()(int warp_aggregate)
- * {
- * int old_prefix = running_total;
- * running_total = (warp_aggregate > old_prefix) ? warp_aggregate : old_prefix;
- * return old_prefix;
- * }
- * };
- *
- * __global__ void ExampleKernel(int *d_data, int num_items, ...)
- * {
- * // Specialize WarpScan for one warp
- * typedef cub::WarpScan<int, 1> WarpScan;
- *
- * // Allocate shared memory for WarpScan
- * __shared__ typename WarpScan::TempStorage temp_storage;
- *
- * // Initialize running total
- * WarpPrefixOp prefix_op(INT_MIN);
- *
- * // Have the warp iterate over segments of items
- * for (int block_offset = 0; block_offset < num_items; block_offset += 32)
- * {
- * // Load a segment of consecutive items
- * int thread_data = d_data[block_offset + threadIdx.x];
- *
- * // Collectively compute the warp-wide exclusive prefix max scan
- * int warp_aggregate;
- * WarpScan(temp_storage).ExclusiveScan(
- * thread_data, thread_data, INT_MIN, cub::Max(), warp_aggregate, prefix_op);
- *
- * // Store scanned items to output segment
- * d_data[block_offset + threadIdx.x] = thread_data;
- * }
- * }
- * \endcode
- * \par
- * Suppose the input \p d_data is <tt>0, -1, 2, -3, 4, -5, ...</tt>.
- * The corresponding output for the first segment will be <tt>INT_MIN, 0, 0, 2, ..., 28, 30</tt>.
- * The output for the second segment will be <tt>30, 32, 32, 34, ..., 60, 62</tt>. Furthermore,
- * \p warp_aggregate will be assigned \p 30 in all threads after the first scan, assigned \p 62 after the second
- * scan, etc.
- *
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- * \tparam WarpPrefixOp <b>[inferred]</b> Call-back functor type having member <tt>T operator()(T warp_aggregate)</tt>
- */
- template <
- typename ScanOp,
- typename WarpPrefixOp>
- __device__ __forceinline__ void ExclusiveScan(
- T input, ///< [in] Calling thread's input item.
- T &output, ///< [out] Calling thread's output item. May be aliased with \p input.
- T identity, ///< [in] Identity value
- ScanOp scan_op, ///< [in] Binary scan operator
- T &warp_aggregate, ///< [out] <b>[<em>warp-lane</em><sub>0</sub> only]</b> Warp-wide aggregate reduction of input items (exclusive of the \p warp_prefix_op value).
- WarpPrefixOp &warp_prefix_op) ///< [in-out] <b>[<em>warp-lane</em><sub>0</sub> only]</b> Call-back functor for specifying a warp-wide prefix to be applied to all inputs.
- {
- // Exclusive warp scan
- ExclusiveScan(input, output, identity, scan_op, warp_aggregate);
-
- // Compute warp-wide prefix from aggregate, then broadcast to other lanes
- T prefix = warp_prefix_op(warp_aggregate);
- prefix = InternalWarpScan(temp_storage, warp_id, lane_id).Broadcast(prefix, 0);
-
- // Update output
- output = (lane_id == 0) ?
- prefix :
- scan_op(prefix, output);
- }
-
-
- //@} end member group
- /******************************************************************//**
- * \name Identityless exclusive prefix scans
- *********************************************************************/
- //@{
-
-
- /**
- * \brief Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. Because no identity value is supplied, the \p output computed for <em>warp-lane</em><sub>0</sub> is undefined.
- *
- * Supports non-commutative scan operators.
- *
- * \smemreuse
- *
- * The code snippet below illustrates four concurrent warp-wide exclusive prefix max scans within a block of
- * 128 threads (one per 32-thread warp).
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize WarpScan for 4 warps on type int
- * typedef cub::WarpScan<int, 4> WarpScan;
- *
- * // Allocate shared memory for WarpScan
- * __shared__ typename WarpScan::TempStorage temp_storage;
- *
- * // Obtain one input item per thread
- * int thread_data = ...
- *
- * // Compute exclusive warp-wide prefix max scans
- * WarpScan(temp_storage).ExclusiveScan(thread_data, thread_data, cub::Max());
- * }
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is <tt>0, -1, 2, -3, ..., 126, -127</tt>.
- * The corresponding output \p thread_data in the first warp would be
- * <tt>?, 0, 0, 2, ..., 28, 30</tt>, the output for the second warp would be <tt>?, 32, 32, 34, ..., 60, 62</tt>, etc.
- * (The output \p thread_data in each warp lane0 is undefined.)
- *
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
- template <typename ScanOp>
- __device__ __forceinline__ void ExclusiveScan(
- T input, ///< [in] Calling thread's input item.
- T &output, ///< [out] Calling thread's output item. May be aliased with \p input.
- ScanOp scan_op) ///< [in] Binary scan operator
- {
- InternalWarpScan(temp_storage, warp_id, lane_id).ExclusiveScan(input, output, scan_op);
- }
-
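A short follow-up (illustrative, not from the header): because the overload above receives no identity value, the result in lane 0 is undefined, and callers usually overwrite it themselves after the scan. A minimal sketch, assuming a 1-D block partitioned into 32-thread warps:

    int lane_id = threadIdx.x & 31;   // lane within the warp (assumed thread layout)
    WarpScan(temp_storage).ExclusiveScan(thread_data, thread_data, cub::Max());
    if (lane_id == 0)
        thread_data = INT_MIN;        // supply whatever "identity" the caller actually needs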
-
- /**
- * \brief Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. Because no identity value is supplied, the \p output computed for <em>warp-lane</em><sub>0</sub> is undefined. Also provides every thread with the warp-wide \p warp_aggregate of all inputs.
- *
- * Supports non-commutative scan operators.
- *
- * \smemreuse
- *
- * The code snippet below illustrates four concurrent warp-wide exclusive prefix max scans within a block of
- * 128 threads (one per 32-thread warp).
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * __global__ void ExampleKernel(...)
- * {
- * // Specialize WarpScan for 4 warps on type int
- * typedef cub::WarpScan<int, 4> WarpScan;
- *
- * // Allocate shared memory for WarpScan
- * __shared__ typename WarpScan::TempStorage temp_storage;
- *
- * // Obtain one input item per thread
- * int thread_data = ...
- *
- * // Compute exclusive warp-wide prefix max scans
- * WarpScan(temp_storage).ExclusiveScan(thread_data, thread_data, cub::Max());
- * }
- * \endcode
- * \par
- * Suppose the set of input \p thread_data across the block of threads is <tt>0, -1, 2, -3, ..., 126, -127</tt>.
- * The corresponding output \p thread_data in the first warp would be
- * <tt>?, 0, 0, 2, ..., 28, 30</tt>, the output for the second warp would be <tt>?, 32, 32, 34, ..., 60, 62</tt>, etc.
- * (The output \p thread_data in each warp lane0 is undefined.) Furthermore, \p warp_aggregate would be assigned \p 30 for threads in the first warp, \p 62 for threads
- * in the second warp, etc.
- *
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- */
- template <typename ScanOp>
- __device__ __forceinline__ void ExclusiveScan(
- T input, ///< [in] Calling thread's input item.
- T &output, ///< [out] Calling thread's output item. May be aliased with \p input.
- ScanOp scan_op, ///< [in] Binary scan operator
- T &warp_aggregate) ///< [out] Warp-wide aggregate reduction of input items.
- {
- InternalWarpScan(temp_storage, warp_id, lane_id).ExclusiveScan(input, output, scan_op, warp_aggregate);
- }
-
-
- /**
- * \brief Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. The \p warp_prefix_op value from <em>warp-lane</em><sub>0</sub> is applied to all scan outputs. Also computes the warp-wide \p warp_aggregate of all inputs for <em>warp-lane</em><sub>0</sub>.
- *
- * The \p warp_prefix_op functor must implement a member function <tt>T operator()(T warp_aggregate)</tt>.
- * The functor's input parameter \p warp_aggregate is the same value also returned by the scan operation.
- * The functor will be invoked by the entire warp of threads; however, only the return value from
- * <em>lane</em><sub>0</sub> is applied as the warp-wide prefix. Can be stateful.
- *
- * Supports non-commutative scan operators.
- *
- * \smemreuse
- *
- * The code snippet below illustrates a single thread block of 32 threads (one warp) that progressively
- * computes an exclusive prefix max scan over multiple "tiles" of input using a
- * prefix functor to maintain a running total between warp-wide scans. Each tile consists
- * of 32 integer items that are partitioned across the warp.
- * \par
- * \code
- * #include <cub/cub.cuh>
- *
- * // A stateful callback functor that maintains a running prefix to be applied
- * // during consecutive scan operations.
- * struct WarpPrefixOp
- * {
- * // Running prefix
- * int running_total;
- *
- * // Constructor
- * __device__ WarpPrefixOp(int running_total) : running_total(running_total) {}
- *
- * // Callback operator to be entered by the entire warp. Lane-0 is responsible
- * // for returning a value for seeding the warp-wide scan.
- * __device__ int operator()(int warp_aggregate)
- * {
- * int old_prefix = running_total;
- * running_total = (warp_aggregate > old_prefix) ? warp_aggregate : old_prefix;
- * return old_prefix;
- * }
- * };
- *
- * __global__ void ExampleKernel(int *d_data, int num_items, ...)
- * {
- * // Specialize WarpScan for one warp
- * typedef cub::WarpScan<int, 1> WarpScan;
- *
- * // Allocate shared memory for WarpScan
- * __shared__ typename WarpScan::TempStorage temp_storage;
- *
- * // Initialize running total
- * WarpPrefixOp prefix_op(INT_MIN);
- *
- * // Have the warp iterate over segments of items
- * for (int block_offset = 0; block_offset < num_items; block_offset += 32)
- * {
- * // Load a segment of consecutive items
- * int thread_data = d_data[block_offset + threadIdx.x];
- *
- * // Collectively compute the warp-wide exclusive prefix max scan
- * int warp_aggregate;
- * WarpScan(temp_storage).ExclusiveScan(
- * thread_data, thread_data, INT_MIN, cub::Max(), warp_aggregate, prefix_op);
- *
- * // Store scanned items to output segment
- * d_data[block_offset + threadIdx.x] = thread_data;
- * }
- * }
- * \endcode
- * \par
- * Suppose the input \p d_data is <tt>0, -1, 2, -3, 4, -5, ...</tt>.
- * The corresponding output for the first segment will be <tt>INT_MIN, 0, 0, 2, ..., 28, 30</tt>.
- * The output for the second segment will be <tt>30, 32, 32, 34, ..., 60, 62</tt>. Furthermore,
- * \p warp_aggregate will be assigned \p 30 in all threads after the first scan, assigned \p 62 after the second
- * scan, etc.
- *
- * \tparam ScanOp <b>[inferred]</b> Binary scan operator type having member <tt>T operator()(const T &a, const T &b)</tt>
- * \tparam WarpPrefixOp <b>[inferred]</b> Call-back functor type having member <tt>T operator()(T warp_aggregate)</tt>
- */
- template <
- typename ScanOp,
- typename WarpPrefixOp>
- __device__ __forceinline__ void ExclusiveScan(
- T input, ///< [in] Calling thread's input item.
- T &output, ///< [out] Calling thread's output item. May be aliased with \p input.
- ScanOp scan_op, ///< [in] Binary scan operator
- T &warp_aggregate, ///< [out] <b>[<em>warp-lane</em><sub>0</sub> only]</b> Warp-wide aggregate reduction of input items (exclusive of the \p warp_prefix_op value).
- WarpPrefixOp &warp_prefix_op) ///< [in-out] <b>[<em>warp-lane</em><sub>0</sub> only]</b> Call-back functor for specifying a warp-wide prefix to be applied to all inputs.
- {
- // Exclusive warp scan
- ExclusiveScan(input, output, scan_op, warp_aggregate);
-
- // Compute warp-wide prefix from aggregate, then broadcast to other lanes
- T prefix = warp_prefix_op(warp_aggregate);
- prefix = InternalWarpScan(temp_storage, warp_id, lane_id).Broadcast(prefix, 0);
-
- // Update output with prefix
- output = (lane_id == 0) ?
- prefix :
- scan_op(prefix, output);
- }
-
- //@} end member group
-};
-
-/** @} */ // end group WarpModule
-
-} // CUB namespace
-CUB_NS_POSTFIX // Optional outer namespace(s)
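For completeness, a minimal host-side driver for the single-warp prefix-op examples above (a sketch only: it assumes the kernel's trailing "..." parameters are dropped and omits error checking):

    int h_data[128];
    for (int i = 0; i < 128; ++i)
        h_data[i] = (i % 2) ? -i : i;              // 0, -1, 2, -3, ...

    int *d_data;
    cudaMalloc(&d_data, sizeof(h_data));
    cudaMemcpy(d_data, h_data, sizeof(h_data), cudaMemcpyHostToDevice);

    ExampleKernel<<<1, 32>>>(d_data, 128);         // one block holding exactly one warp

    cudaMemcpy(h_data, d_data, sizeof(h_data), cudaMemcpyDeviceToHost);
    cudaFree(d_data);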
diff --git a/lib/kokkos/TPL/KokkosTPL_dummy.cpp b/lib/kokkos/algorithms/src/KokkosAlgorithms_dummy.cpp
similarity index 100%
rename from lib/kokkos/TPL/KokkosTPL_dummy.cpp
rename to lib/kokkos/algorithms/src/KokkosAlgorithms_dummy.cpp
diff --git a/lib/kokkos/algorithms/src/Kokkos_Random.hpp b/lib/kokkos/algorithms/src/Kokkos_Random.hpp
index 903bc4eb0..11763c2f1 100755
--- a/lib/kokkos/algorithms/src/Kokkos_Random.hpp
+++ b/lib/kokkos/algorithms/src/Kokkos_Random.hpp
@@ -1,1691 +1,1691 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos
-// Manycore Performance-Portable Multidimensional Arrays
-//
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
-
+#ifndef KOKKOS_RANDOM_HPP
+#define KOKKOS_RANDOM_HPP
#include <Kokkos_Core.hpp>
#include <cstdio>
#include <cstdlib>
#include <cmath>
-#ifndef KOKKOS_RANDOM_HPP
-#define KOKKOS_RANDOM_HPP
-
-// These generators are based on Vigna, Sebastiano (2014). "An experimental exploration of Marsaglia's xorshift generators, scrambled"
-// See: http://arxiv.org/abs/1402.6246
+/// \file Kokkos_Random.hpp
+/// \brief Pseudorandom number generators
+///
+/// These generators are based on Vigna, Sebastiano (2014). "An
+/// experimental exploration of Marsaglia's xorshift generators,
+/// scrambled." See: http://arxiv.org/abs/1402.6246
namespace Kokkos {
/*Template functions to get equidistributed random numbers from a generator for a specific Scalar type
template<class Generator,Scalar>
struct rand{
//Max value returned by draw(Generator& gen)
KOKKOS_INLINE_FUNCTION
static Scalar max();
//Returns a value between zero and max()
KOKKOS_INLINE_FUNCTION
static Scalar draw(Generator& gen);
//Returns a value between zero and range()
//Note: for floating point values range can be larger than max()
KOKKOS_INLINE_FUNCTION
static Scalar draw(Generator& gen, const Scalar& range){}
//Return value between start and end
KOKKOS_INLINE_FUNCTION
static Scalar draw(Generator& gen, const Scalar& start, const Scalar& end);
};
The Random number generators themselves have two components: a state-pool and the actual generator.
A state-pool manages a number of generators, so that each active thread is able to grab its own.
This allows the generation of random numbers which are independent between threads. Note that,
in contrast to CuRand, none of the functions of the pool (or the generator) are collectives,
i.e. all functions can be called inside conditionals.
template<class Device>
class Pool {
public:
//The Kokkos device type
typedef Device device_type;
//The actual generator type
typedef Generator<Device> generator_type;
//Default constructor: does not initialize a pool
Pool();
//Initializing constructor: calls init(seed,Device_Specific_Number);
Pool(unsigned int seed);
//Initialize the Pool with seed as a starting seed and a pool_size of num_states
//The Random_XorShift64 generator is used in serial to initialize all states,
//thus the initialization process is platform independent and deterministic.
void init(unsigned int seed, int num_states);
//Get a generator. This will lock one of the states, guaranteeing that each thread
//will have its private generator. Note: on Cuda getting a state involves atomics,
//and is thus not deterministic!
generator_type get_state();
//Give a state back to the pool. This unlocks the state, and writes the modified
//state of the generator back to the pool.
void free_state(generator_type gen);
}
template<class Device>
class Generator {
public:
//The Kokkos device type
typedef DeviceType device_type;
//Max return values of respective [X]rand[S]() functions
enum {MAX_URAND = 0xffffffffU};
enum {MAX_URAND64 = 0xffffffffffffffffULL-1};
enum {MAX_RAND = static_cast<int>(0xffffffffU/2)};
enum {MAX_RAND64 = static_cast<int64_t>(0xffffffffffffffffULL/2-1)};
//Init with a state and the idx with respect to pool. Note: in serial the
//Generator can be used by just giving it the necessary state arguments
KOKKOS_INLINE_FUNCTION
Generator (STATE_ARGUMENTS, int state_idx = 0);
//Draw an equidistributed uint32_t in the range (0,MAX_URAND]
KOKKOS_INLINE_FUNCTION
uint32_t urand();
//Draw an equidistributed uint64_t in the range (0,MAX_URAND64]
KOKKOS_INLINE_FUNCTION
uint64_t urand64();
//Draw an equidistributed uint32_t in the range (0,range]
KOKKOS_INLINE_FUNCTION
uint32_t urand(const uint32_t& range);
//Draw an equidistributed uint32_t in the range (start,end]
KOKKOS_INLINE_FUNCTION
uint32_t urand(const uint32_t& start, const uint32_t& end );
//Draw an equidistributed uint64_t in the range (0,range]
KOKKOS_INLINE_FUNCTION
uint64_t urand64(const uint64_t& range);
//Draw an equidistributed uint64_t in the range (start,end]
KOKKOS_INLINE_FUNCTION
uint64_t urand64(const uint64_t& start, const uint64_t& end );
//Draw an equidistributed int in the range (0,MAX_RAND]
KOKKOS_INLINE_FUNCTION
int rand();
//Draw an equidistributed int in the range (0,range]
KOKKOS_INLINE_FUNCTION
int rand(const int& range);
//Draw an equidistributed int in the range (start,end]
KOKKOS_INLINE_FUNCTION
int rand(const int& start, const int& end );
//Draw an equidistributed int64_t in the range (0,MAX_RAND64]
KOKKOS_INLINE_FUNCTION
int64_t rand64();
//Draw an equidistributed int64_t in the range (0,range]
KOKKOS_INLINE_FUNCTION
int64_t rand64(const int64_t& range);
//Draw an equidistributed int64_t in the range (start,end]
KOKKOS_INLINE_FUNCTION
int64_t rand64(const int64_t& start, const int64_t& end );
//Draw an equidistributed float in the range (0,1.0]
KOKKOS_INLINE_FUNCTION
float frand();
//Draw an equidistributed float in the range (0,range]
KOKKOS_INLINE_FUNCTION
float frand(const float& range);
//Draw an equidistributed float in the range (start,end]
KOKKOS_INLINE_FUNCTION
float frand(const float& start, const float& end );
//Draw an equidistributed double in the range (0,1.0]
KOKKOS_INLINE_FUNCTION
double drand();
//Draw an equidistributed double in the range (0,range]
KOKKOS_INLINE_FUNCTION
double drand(const double& range);
//Draw an equidistributed double in the range (start,end]
KOKKOS_INLINE_FUNCTION
double drand(const double& start, const double& end );
//Draw a standard normal distributed double
KOKKOS_INLINE_FUNCTION
double normal() ;
//Draw a normal distributed double with given mean and standard deviation
KOKKOS_INLINE_FUNCTION
double normal(const double& mean, const double& std_dev=1.0);
}
//Additional Functions:
//Fills view with random numbers in the range (0,range]
template<class ViewType, class PoolType>
void fill_random(ViewType view, PoolType pool, ViewType::value_type range);
//Fills view with random numbers in the range (start,end]
template<class ViewType, class PoolType>
void fill_random(ViewType view, PoolType pool,
ViewType::value_type start, ViewType::value_type end);
*/
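A minimal usage sketch of the interface described above (illustrative only; it assumes Kokkos::initialize() has already been called and a lambda-capable build, otherwise a functor such as the fill_random_functor_* helpers further down plays the same role):

    Kokkos::Random_XorShift64_Pool<> pool(12345);             // seeded pool of per-thread states

    Kokkos::View<double*> results("results", 1000);
    Kokkos::fill_random(results, pool, 1.0);                  // bulk fill with values in (0,1.0]

    Kokkos::parallel_for(1000, KOKKOS_LAMBDA(const int i) {
      Kokkos::Random_XorShift64_Pool<>::generator_type gen = pool.get_state();
      results(i) = gen.drand(-1.0, 1.0);                      // double in (-1.0,1.0]
      pool.free_state(gen);
    });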
template<class Generator, class Scalar>
struct rand;
template<class Generator>
struct rand<Generator,char> {
KOKKOS_INLINE_FUNCTION
static short max(){return 127;}
KOKKOS_INLINE_FUNCTION
static short draw(Generator& gen)
{return short((gen.rand()&0xff+256)%256);}
KOKKOS_INLINE_FUNCTION
static short draw(Generator& gen, const char& range)
{return char(gen.rand(range));}
KOKKOS_INLINE_FUNCTION
static short draw(Generator& gen, const char& start, const char& end)
{return char(gen.rand(start,end));}
};
template<class Generator>
struct rand<Generator,short> {
KOKKOS_INLINE_FUNCTION
static short max(){return 32767;}
KOKKOS_INLINE_FUNCTION
static short draw(Generator& gen)
{return short((gen.rand()&0xffff+65536)%32768);}
KOKKOS_INLINE_FUNCTION
static short draw(Generator& gen, const short& range)
{return short(gen.rand(range));}
KOKKOS_INLINE_FUNCTION
static short draw(Generator& gen, const short& start, const short& end)
{return short(gen.rand(start,end));}
};
template<class Generator>
struct rand<Generator,int> {
KOKKOS_INLINE_FUNCTION
static int max(){return Generator::MAX_RAND;}
KOKKOS_INLINE_FUNCTION
static int draw(Generator& gen)
{return gen.rand();}
KOKKOS_INLINE_FUNCTION
static int draw(Generator& gen, const int& range)
{return gen.rand(range);}
KOKKOS_INLINE_FUNCTION
static int draw(Generator& gen, const int& start, const int& end)
{return gen.rand(start,end);}
};
template<class Generator>
struct rand<Generator,unsigned int> {
KOKKOS_INLINE_FUNCTION
static unsigned int max () {
return Generator::MAX_URAND;
}
KOKKOS_INLINE_FUNCTION
static unsigned int draw (Generator& gen) {
return gen.urand ();
}
KOKKOS_INLINE_FUNCTION
static unsigned int draw(Generator& gen, const unsigned int& range) {
return gen.urand (range);
}
KOKKOS_INLINE_FUNCTION
static unsigned int
draw (Generator& gen, const unsigned int& start, const unsigned int& end) {
return gen.urand (start, end);
}
};
template<class Generator>
struct rand<Generator,long> {
KOKKOS_INLINE_FUNCTION
static long max () {
// FIXME (mfh 26 Oct 2014) It would be better to select the
// return value at compile time, using something like enable_if.
return sizeof (long) == 4 ?
static_cast<long> (Generator::MAX_RAND) :
static_cast<long> (Generator::MAX_RAND64);
}
KOKKOS_INLINE_FUNCTION
static long draw (Generator& gen) {
// FIXME (mfh 26 Oct 2014) It would be better to select the
// return value at compile time, using something like enable_if.
return sizeof (long) == 4 ?
static_cast<long> (gen.rand ()) :
static_cast<long> (gen.rand64 ());
}
KOKKOS_INLINE_FUNCTION
static long draw (Generator& gen, const long& range) {
// FIXME (mfh 26 Oct 2014) It would be better to select the
// return value at compile time, using something like enable_if.
return sizeof (long) == 4 ?
static_cast<long> (gen.rand (static_cast<int> (range))) :
static_cast<long> (gen.rand64 (range));
}
KOKKOS_INLINE_FUNCTION
static long draw (Generator& gen, const long& start, const long& end) {
// FIXME (mfh 26 Oct 2014) It would be better to select the
// return value at compile time, using something like enable_if.
return sizeof (long) == 4 ?
static_cast<long> (gen.rand (static_cast<int> (start),
static_cast<int> (end))) :
static_cast<long> (gen.rand64 (start, end));
}
};
template<class Generator>
struct rand<Generator,unsigned long> {
KOKKOS_INLINE_FUNCTION
static unsigned long max () {
// FIXME (mfh 26 Oct 2014) It would be better to select the
// return value at compile time, using something like enable_if.
return sizeof (unsigned long) == 4 ?
static_cast<unsigned long> (Generator::MAX_URAND) :
static_cast<unsigned long> (Generator::MAX_URAND64);
}
KOKKOS_INLINE_FUNCTION
static unsigned long draw (Generator& gen) {
// FIXME (mfh 26 Oct 2014) It would be better to select the
// return value at compile time, using something like enable_if.
return sizeof (unsigned long) == 4 ?
static_cast<unsigned long> (gen.urand ()) :
static_cast<unsigned long> (gen.urand64 ());
}
KOKKOS_INLINE_FUNCTION
static unsigned long draw(Generator& gen, const unsigned long& range) {
// FIXME (mfh 26 Oct 2014) It would be better to select the
// return value at compile time, using something like enable_if.
return sizeof (unsigned long) == 4 ?
static_cast<unsigned long> (gen.urand (static_cast<unsigned int> (range))) :
static_cast<unsigned long> (gen.urand64 (range));
}
KOKKOS_INLINE_FUNCTION
static unsigned long
draw (Generator& gen, const unsigned long& start, const unsigned long& end) {
// FIXME (mfh 26 Oct 2014) It would be better to select the
// return value at compile time, using something like enable_if.
return sizeof (unsigned long) == 4 ?
static_cast<unsigned long> (gen.urand (static_cast<unsigned int> (start),
static_cast<unsigned int> (end))) :
static_cast<unsigned long> (gen.urand64 (start, end));
}
};
// NOTE (mfh 26 oct 2014) This is a partial specialization for long
// long, a C99 / C++11 signed type which is guaranteed to be at
// least 64 bits. Do NOT write a partial specialization for
// int64_t!!! This is just a typedef! It could be either long or
// long long. We don't know which a priori, and I've seen both.
// The types long and long long are guaranteed to differ, so it's
// always safe to specialize for both.
template<class Generator>
struct rand<Generator, long long> {
KOKKOS_INLINE_FUNCTION
static long long max () {
// FIXME (mfh 26 Oct 2014) It's legal for long long to be > 64 bits.
return Generator::MAX_RAND64;
}
KOKKOS_INLINE_FUNCTION
static long long draw (Generator& gen) {
// FIXME (mfh 26 Oct 2014) It's legal for long long to be > 64 bits.
return gen.rand64 ();
}
KOKKOS_INLINE_FUNCTION
static long long draw (Generator& gen, const long long& range) {
// FIXME (mfh 26 Oct 2014) It's legal for long long to be > 64 bits.
return gen.rand64 (range);
}
KOKKOS_INLINE_FUNCTION
static long long draw (Generator& gen, const long long& start, const long long& end) {
// FIXME (mfh 26 Oct 2014) It's legal for long long to be > 64 bits.
return gen.rand64 (start, end);
}
};
// NOTE (mfh 26 oct 2014) This is a partial specialization for
// unsigned long long, a C99 / C++11 unsigned type which is
// guaranteed to be at least 64 bits. Do NOT write a partial
// specialization for uint64_t!!! This is just a typedef! It could
// be either unsigned long or unsigned long long. We don't know
// which a priori, and I've seen both. The types unsigned long and
// unsigned long long are guaranteed to differ, so it's always safe
// to specialize for both.
template<class Generator>
struct rand<Generator,unsigned long long> {
KOKKOS_INLINE_FUNCTION
static unsigned long long max () {
// FIXME (mfh 26 Oct 2014) It's legal for unsigned long long to be > 64 bits.
return Generator::MAX_URAND64;
}
KOKKOS_INLINE_FUNCTION
static unsigned long long draw (Generator& gen) {
// FIXME (mfh 26 Oct 2014) It's legal for unsigned long long to be > 64 bits.
return gen.urand64 ();
}
KOKKOS_INLINE_FUNCTION
static unsigned long long draw (Generator& gen, const unsigned long long& range) {
// FIXME (mfh 26 Oct 2014) It's legal for long long to be > 64 bits.
return gen.urand64 (range);
}
KOKKOS_INLINE_FUNCTION
static unsigned long long
draw (Generator& gen, const unsigned long long& start, const unsigned long long& end) {
// FIXME (mfh 26 Oct 2014) It's legal for long long to be > 64 bits.
return gen.urand64 (start, end);
}
};
template<class Generator>
struct rand<Generator,float> {
KOKKOS_INLINE_FUNCTION
static float max(){return 1.0f;}
KOKKOS_INLINE_FUNCTION
static float draw(Generator& gen)
{return gen.frand();}
KOKKOS_INLINE_FUNCTION
static float draw(Generator& gen, const float& range)
{return gen.frand(range);}
KOKKOS_INLINE_FUNCTION
static float draw(Generator& gen, const float& start, const float& end)
{return gen.frand(start,end);}
};
template<class Generator>
struct rand<Generator,double> {
KOKKOS_INLINE_FUNCTION
static double max(){return 1.0;}
KOKKOS_INLINE_FUNCTION
static double draw(Generator& gen)
{return gen.drand();}
KOKKOS_INLINE_FUNCTION
static double draw(Generator& gen, const double& range)
{return gen.drand(range);}
KOKKOS_INLINE_FUNCTION
static double draw(Generator& gen, const double& start, const double& end)
{return gen.drand(start,end);}
};
template<class DeviceType>
class Random_XorShift64_Pool;
template<class DeviceType>
class Random_XorShift64 {
private:
uint64_t state_;
const int state_idx_;
friend class Random_XorShift64_Pool<DeviceType>;
public:
typedef DeviceType device_type;
enum {MAX_URAND = 0xffffffffU};
enum {MAX_URAND64 = 0xffffffffffffffffULL-1};
enum {MAX_RAND = static_cast<int>(0xffffffff/2)};
enum {MAX_RAND64 = static_cast<int64_t>(0xffffffffffffffffLL/2-1)};
KOKKOS_INLINE_FUNCTION
Random_XorShift64 (uint64_t state, int state_idx = 0)
: state_(state),state_idx_(state_idx){}
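// The three xor-shifts (12, 25, 27) and the multiply by 2685821657736338717ULL below
// implement the xorshift64* scheme from the Vigna (2014) reference cited at the top of
// this file; urand() returns a 32-bit slice (bits 16..47) of the scrambled product.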
KOKKOS_INLINE_FUNCTION
uint32_t urand() {
state_ ^= state_ >> 12;
state_ ^= state_ << 25;
state_ ^= state_ >> 27;
uint64_t tmp = state_ * 2685821657736338717ULL;
tmp = tmp>>16;
return static_cast<uint32_t>(tmp&MAX_URAND);
}
KOKKOS_INLINE_FUNCTION
uint64_t urand64() {
state_ ^= state_ >> 12;
state_ ^= state_ << 25;
state_ ^= state_ >> 27;
return (state_ * 2685821657736338717ULL) - 1;
}
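// The bounded draws below use rejection sampling: candidates at or above the largest
// multiple of 'range' are re-drawn, so the final modulo does not bias small values.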
KOKKOS_INLINE_FUNCTION
uint32_t urand(const uint32_t& range) {
const uint32_t max_val = (MAX_URAND/range)*range;
uint32_t tmp = urand();
while(tmp>=max_val)
tmp = urand();
return tmp%range;
}
KOKKOS_INLINE_FUNCTION
uint32_t urand(const uint32_t& start, const uint32_t& end ) {
return urand(end-start)+start;
}
KOKKOS_INLINE_FUNCTION
uint64_t urand64(const uint64_t& range) {
const uint64_t max_val = (MAX_URAND64/range)*range;
uint64_t tmp = urand64();
while(tmp>=max_val)
tmp = urand64();
return tmp%range;
}
KOKKOS_INLINE_FUNCTION
uint64_t urand64(const uint64_t& start, const uint64_t& end ) {
return urand64(end-start)+start;
}
KOKKOS_INLINE_FUNCTION
int rand() {
return static_cast<int>(urand()/2);
}
KOKKOS_INLINE_FUNCTION
int rand(const int& range) {
const int max_val = (MAX_RAND/range)*range;
int tmp = rand();
while(tmp>=max_val)
tmp = rand();
return tmp%range;
}
KOKKOS_INLINE_FUNCTION
int rand(const int& start, const int& end ) {
return rand(end-start)+start;
}
KOKKOS_INLINE_FUNCTION
int64_t rand64() {
return static_cast<int64_t>(urand64()/2);
}
KOKKOS_INLINE_FUNCTION
int64_t rand64(const int64_t& range) {
const int64_t max_val = (MAX_RAND64/range)*range;
int64_t tmp = rand64();
while(tmp>=max_val)
tmp = rand64();
return tmp%range;
}
KOKKOS_INLINE_FUNCTION
int64_t rand64(const int64_t& start, const int64_t& end ) {
return rand64(end-start)+start;
}
KOKKOS_INLINE_FUNCTION
float frand() {
return 1.0f * urand64()/MAX_URAND64;
}
KOKKOS_INLINE_FUNCTION
float frand(const float& range) {
return range * urand64()/MAX_URAND64;
}
KOKKOS_INLINE_FUNCTION
float frand(const float& start, const float& end ) {
return frand(end-start)+start;
}
KOKKOS_INLINE_FUNCTION
double drand() {
return 1.0 * urand64()/MAX_URAND64;
}
KOKKOS_INLINE_FUNCTION
double drand(const double& range) {
return range * urand64()/MAX_URAND64;
}
KOKKOS_INLINE_FUNCTION
double drand(const double& start, const double& end ) {
return drand(end-start)+start;
}
//Marsaglia polar method for drawing a standard normal distributed random number
KOKKOS_INLINE_FUNCTION
double normal() {
double S = 2.0;
double U;
while(S>=1.0) {
U = drand();
const double V = drand();
S = U*U+V*V;
}
return U*sqrt(-2.0*log(S)/S);
}
KOKKOS_INLINE_FUNCTION
double normal(const double& mean, const double& std_dev=1.0) {
return mean + normal()*std_dev;
}
};
template<class DeviceType = Kokkos::DefaultExecutionSpace>
class Random_XorShift64_Pool {
private:
typedef View<int*,DeviceType> lock_type;
typedef View<uint64_t*,DeviceType> state_data_type;
lock_type locks_;
state_data_type state_;
int num_states_;
public:
typedef Random_XorShift64<DeviceType> generator_type;
typedef DeviceType device_type;
Random_XorShift64_Pool() {
num_states_ = 0;
}
- Random_XorShift64_Pool(unsigned int seed) {
+ Random_XorShift64_Pool(uint64_t seed) {
num_states_ = 0;
init(seed,DeviceType::max_hardware_threads());
}
Random_XorShift64_Pool(const Random_XorShift64_Pool& src):
locks_(src.locks_),
state_(src.state_),
num_states_(src.num_states_)
{}
Random_XorShift64_Pool operator = (const Random_XorShift64_Pool& src) {
locks_ = src.locks_;
state_ = src.state_;
num_states_ = src.num_states_;
return *this;
}
- void init(unsigned int seed, int num_states) {
+ void init(uint64_t seed, int num_states) {
num_states_ = num_states;
locks_ = lock_type("Kokkos::Random_XorShift64::locks",num_states_);
state_ = state_data_type("Kokkos::Random_XorShift64::state",num_states_);
typename state_data_type::HostMirror h_state = create_mirror_view(state_);
typename lock_type::HostMirror h_lock = create_mirror_view(locks_);
// Execute on the HostMirror's default execution space.
Random_XorShift64<typename state_data_type::HostMirror::execution_space> gen(seed,0);
for(int i = 0; i < 17; i++)
gen.rand();
for(int i = 0; i < num_states_; i++) {
int n1 = gen.rand();
int n2 = gen.rand();
int n3 = gen.rand();
int n4 = gen.rand();
h_state(i) = (((static_cast<uint64_t>(n1)) & 0xffff)<<00) |
(((static_cast<uint64_t>(n2)) & 0xffff)<<16) |
(((static_cast<uint64_t>(n3)) & 0xffff)<<32) |
(((static_cast<uint64_t>(n4)) & 0xffff)<<48);
h_lock(i) = 0;
}
deep_copy(state_,h_state);
deep_copy(locks_,h_lock);
}
KOKKOS_INLINE_FUNCTION
Random_XorShift64<DeviceType> get_state() const {
const int i = DeviceType::hardware_thread_id();
return Random_XorShift64<DeviceType>(state_(i),i);
}
KOKKOS_INLINE_FUNCTION
void free_state(const Random_XorShift64<DeviceType>& state) const {
state_(state.state_idx_) = state.state_;
}
};
template<class DeviceType>
class Random_XorShift1024_Pool;
template<class DeviceType>
class Random_XorShift1024 {
private:
int p_;
const int state_idx_;
uint64_t state_[16];
friend class Random_XorShift1024_Pool<DeviceType>;
public:
typedef DeviceType device_type;
enum {MAX_URAND = 0xffffffffU};
enum {MAX_URAND64 = 0xffffffffffffffffULL-1};
enum {MAX_RAND = static_cast<int>(0xffffffffU/2)};
enum {MAX_RAND64 = static_cast<int64_t>(0xffffffffffffffffULL/2-1)};
KOKKOS_INLINE_FUNCTION
Random_XorShift1024 (uint64_t* state, int p, int state_idx = 0):
p_(p),state_idx_(state_idx){
for(int i=0 ; i<16; i++)
state_[i] = state[i];
}
KOKKOS_INLINE_FUNCTION
uint32_t urand() {
uint64_t state_0 = state_[ p_ ];
uint64_t state_1 = state_[ p_ = ( p_ + 1 ) & 15 ];
state_1 ^= state_1 << 31;
state_1 ^= state_1 >> 11;
state_0 ^= state_0 >> 30;
uint64_t tmp = ( state_[ p_ ] = state_0 ^ state_1 ) * 1181783497276652981ULL;
tmp = tmp>>16;
return static_cast<uint32_t>(tmp&MAX_URAND);
}
KOKKOS_INLINE_FUNCTION
uint64_t urand64() {
uint64_t state_0 = state_[ p_ ];
uint64_t state_1 = state_[ p_ = ( p_ + 1 ) & 15 ];
state_1 ^= state_1 << 31;
state_1 ^= state_1 >> 11;
state_0 ^= state_0 >> 30;
return (( state_[ p_ ] = state_0 ^ state_1 ) * 1181783497276652981LL) - 1;
}
KOKKOS_INLINE_FUNCTION
uint32_t urand(const uint32_t& range) {
const uint32_t max_val = (MAX_URAND/range)*range;
uint32_t tmp = urand();
while(tmp>=max_val)
tmp = urand();
return tmp%range;
}
KOKKOS_INLINE_FUNCTION
uint32_t urand(const uint32_t& start, const uint32_t& end ) {
return urand(end-start)+start;
}
KOKKOS_INLINE_FUNCTION
uint64_t urand64(const uint64_t& range) {
const uint64_t max_val = (MAX_URAND64/range)*range;
uint64_t tmp = urand64();
while(tmp>=max_val)
tmp = urand64();
return tmp%range;
}
KOKKOS_INLINE_FUNCTION
uint64_t urand64(const uint64_t& start, const uint64_t& end ) {
return urand64(end-start)+start;
}
KOKKOS_INLINE_FUNCTION
int rand() {
return static_cast<int>(urand()/2);
}
KOKKOS_INLINE_FUNCTION
int rand(const int& range) {
const int max_val = (MAX_RAND/range)*range;
int tmp = rand();
while(tmp>=max_val)
tmp = rand();
return tmp%range;
}
KOKKOS_INLINE_FUNCTION
int rand(const int& start, const int& end ) {
return rand(end-start)+start;
}
KOKKOS_INLINE_FUNCTION
int64_t rand64() {
return static_cast<int64_t>(urand64()/2);
}
KOKKOS_INLINE_FUNCTION
int64_t rand64(const int64_t& range) {
const int64_t max_val = (MAX_RAND64/range)*range;
int64_t tmp = rand64();
while(tmp>=max_val)
tmp = rand64();
return tmp%range;
}
KOKKOS_INLINE_FUNCTION
int64_t rand64(const int64_t& start, const int64_t& end ) {
return rand64(end-start)+start;
}
KOKKOS_INLINE_FUNCTION
float frand() {
return 1.0f * urand64()/MAX_URAND64;
}
KOKKOS_INLINE_FUNCTION
float frand(const float& range) {
return range * urand64()/MAX_URAND64;
}
KOKKOS_INLINE_FUNCTION
float frand(const float& start, const float& end ) {
return frand(end-start)+start;
}
KOKKOS_INLINE_FUNCTION
double drand() {
return 1.0 * urand64()/MAX_URAND64;
}
KOKKOS_INLINE_FUNCTION
double drand(const double& range) {
return range * urand64()/MAX_URAND64;
}
KOKKOS_INLINE_FUNCTION
double drand(const double& start, const double& end ) {
return drand(end-start)+start;
}
//Marsaglia polar method for drawing a standard normal distributed random number
KOKKOS_INLINE_FUNCTION
double normal() {
double S = 2.0;
double U;
while(S>=1.0) {
U = drand();
const double V = drand();
S = U*U+V*V;
}
return U*sqrt(-2.0*log(S)/S);
}
KOKKOS_INLINE_FUNCTION
double normal(const double& mean, const double& std_dev=1.0) {
return mean + normal()*std_dev;
}
};
template<class DeviceType = Kokkos::DefaultExecutionSpace>
class Random_XorShift1024_Pool {
private:
typedef View<int*,DeviceType> int_view_type;
typedef View<uint64_t*[16],DeviceType> state_data_type;
int_view_type locks_;
state_data_type state_;
int_view_type p_;
int num_states_;
public:
typedef Random_XorShift1024<DeviceType> generator_type;
typedef DeviceType device_type;
Random_XorShift1024_Pool() {
num_states_ = 0;
}
inline
- Random_XorShift1024_Pool(unsigned int seed){
+ Random_XorShift1024_Pool(uint64_t seed){
num_states_ = 0;
init(seed,DeviceType::max_hardware_threads());
}
Random_XorShift1024_Pool(const Random_XorShift1024_Pool& src):
locks_(src.locks_),
state_(src.state_),
p_(src.p_),
num_states_(src.num_states_)
{}
Random_XorShift1024_Pool operator = (const Random_XorShift1024_Pool& src) {
locks_ = src.locks_;
state_ = src.state_;
p_ = src.p_;
num_states_ = src.num_states_;
return *this;
}
inline
- void init(unsigned int seed, int num_states) {
+ void init(uint64_t seed, int num_states) {
num_states_ = num_states;
locks_ = int_view_type("Kokkos::Random_XorShift1024::locks",num_states_);
state_ = state_data_type("Kokkos::Random_XorShift1024::state",num_states_);
p_ = int_view_type("Kokkos::Random_XorShift1024::p",num_states_);
typename state_data_type::HostMirror h_state = create_mirror_view(state_);
typename int_view_type::HostMirror h_lock = create_mirror_view(locks_);
typename int_view_type::HostMirror h_p = create_mirror_view(p_);
// Execute on the HostMirror's default execution space.
Random_XorShift64<typename state_data_type::HostMirror::execution_space> gen(seed,0);
for(int i = 0; i < 17; i++)
gen.rand();
for(int i = 0; i < num_states_; i++) {
for(int j = 0; j < 16 ; j++) {
int n1 = gen.rand();
int n2 = gen.rand();
int n3 = gen.rand();
int n4 = gen.rand();
h_state(i,j) = (((static_cast<uint64_t>(n1)) & 0xffff)<<00) |
(((static_cast<uint64_t>(n2)) & 0xffff)<<16) |
(((static_cast<uint64_t>(n3)) & 0xffff)<<32) |
(((static_cast<uint64_t>(n4)) & 0xffff)<<48);
}
h_p(i) = 0;
h_lock(i) = 0;
}
deep_copy(state_,h_state);
deep_copy(locks_,h_lock);
}
KOKKOS_INLINE_FUNCTION
Random_XorShift1024<DeviceType> get_state() const {
const int i = DeviceType::hardware_thread_id();
return Random_XorShift1024<DeviceType>(&state_(i,0),p_(i),i);
};
KOKKOS_INLINE_FUNCTION
void free_state(const Random_XorShift1024<DeviceType>& state) const {
for(int i = 0; i<16; i++)
state_(state.state_idx_,i) = state.state_[i];
p_(state.state_idx_) = state.p_;
}
};
#if defined(KOKKOS_HAVE_CUDA) && defined(__CUDACC__)
template<>
class Random_XorShift1024<Kokkos::Cuda> {
private:
int p_;
const int state_idx_;
uint64_t* state_;
friend class Random_XorShift1024_Pool<Kokkos::Cuda>;
public:
typedef Kokkos::Cuda device_type;
enum {MAX_URAND = 0xffffffffU};
enum {MAX_URAND64 = 0xffffffffffffffffULL-1};
enum {MAX_RAND = static_cast<int>(0xffffffffU/2)};
enum {MAX_RAND64 = static_cast<int64_t>(0xffffffffffffffffULL/2-1)};
KOKKOS_INLINE_FUNCTION
Random_XorShift1024 (uint64_t* state, int p, int state_idx = 0):
p_(p),state_idx_(state_idx),state_(state){
}
KOKKOS_INLINE_FUNCTION
uint32_t urand() {
uint64_t state_0 = state_[ p_ ];
uint64_t state_1 = state_[ p_ = ( p_ + 1 ) & 15 ];
state_1 ^= state_1 << 31;
state_1 ^= state_1 >> 11;
state_0 ^= state_0 >> 30;
uint64_t tmp = ( state_[ p_ ] = state_0 ^ state_1 ) * 1181783497276652981ULL;
tmp = tmp>>16;
return static_cast<uint32_t>(tmp&MAX_URAND);
}
KOKKOS_INLINE_FUNCTION
uint64_t urand64() {
uint64_t state_0 = state_[ p_ ];
uint64_t state_1 = state_[ p_ = ( p_ + 1 ) & 15 ];
state_1 ^= state_1 << 31;
state_1 ^= state_1 >> 11;
state_0 ^= state_0 >> 30;
return (( state_[ p_ ] = state_0 ^ state_1 ) * 1181783497276652981LL) - 1;
}
KOKKOS_INLINE_FUNCTION
uint32_t urand(const uint32_t& range) {
const uint32_t max_val = (MAX_URAND/range)*range;
uint32_t tmp = urand();
while(tmp>=max_val)
tmp = urand();
return tmp%range;
}
KOKKOS_INLINE_FUNCTION
uint32_t urand(const uint32_t& start, const uint32_t& end ) {
return urand(end-start)+start;
}
KOKKOS_INLINE_FUNCTION
uint64_t urand64(const uint64_t& range) {
const uint64_t max_val = (MAX_URAND64/range)*range;
uint64_t tmp = urand64();
while(tmp>=max_val)
tmp = urand64();
return tmp%range;
}
KOKKOS_INLINE_FUNCTION
uint64_t urand64(const uint64_t& start, const uint64_t& end ) {
return urand64(end-start)+start;
}
KOKKOS_INLINE_FUNCTION
int rand() {
return static_cast<int>(urand()/2);
}
KOKKOS_INLINE_FUNCTION
int rand(const int& range) {
const int max_val = (MAX_RAND/range)*range;
int tmp = rand();
while(tmp>=max_val)
tmp = rand();
return tmp%range;
}
KOKKOS_INLINE_FUNCTION
int rand(const int& start, const int& end ) {
return rand(end-start)+start;
}
KOKKOS_INLINE_FUNCTION
int64_t rand64() {
return static_cast<int64_t>(urand64()/2);
}
KOKKOS_INLINE_FUNCTION
int64_t rand64(const int64_t& range) {
const int64_t max_val = (MAX_RAND64/range)*range;
int64_t tmp = rand64();
while(tmp>=max_val)
tmp = rand64();
return tmp%range;
}
KOKKOS_INLINE_FUNCTION
int64_t rand64(const int64_t& start, const int64_t& end ) {
return rand64(end-start)+start;
}
KOKKOS_INLINE_FUNCTION
float frand() {
return 1.0f * urand64()/MAX_URAND64;
}
KOKKOS_INLINE_FUNCTION
float frand(const float& range) {
return range * urand64()/MAX_URAND64;
}
KOKKOS_INLINE_FUNCTION
float frand(const float& start, const float& end ) {
return frand(end-start)+start;
}
KOKKOS_INLINE_FUNCTION
double drand() {
return 1.0 * urand64()/MAX_URAND64;
}
KOKKOS_INLINE_FUNCTION
double drand(const double& range) {
return range * urand64()/MAX_URAND64;
}
KOKKOS_INLINE_FUNCTION
double drand(const double& start, const double& end ) {
return drand(end-start)+start;
}
//Marsaglia polar method for drawing a standard normal distributed random number
KOKKOS_INLINE_FUNCTION
double normal() {
double S = 2.0;
double U;
while(S>=1.0) {
U = drand();
const double V = drand();
S = U*U+V*V;
}
return U*sqrt(-2.0*log(S)/S);
}
KOKKOS_INLINE_FUNCTION
double normal(const double& mean, const double& std_dev=1.0) {
return mean + normal()*std_dev;
}
};
template<>
inline
-Random_XorShift64_Pool<Kokkos::Cuda>::Random_XorShift64_Pool(unsigned int seed) {
+Random_XorShift64_Pool<Kokkos::Cuda>::Random_XorShift64_Pool(uint64_t seed) {
num_states_ = 0;
init(seed,4*32768);
}
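// On Cuda, get_state() below claims a private generator by atomically flipping the matching
// locks_ entry from 0 to 1; if the slot derived from the thread and block indices is already
// taken, the search walks forward through the pool, so the state a thread ends up with is not
// deterministic from run to run (the caveat noted in the interface comment above).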
template<>
KOKKOS_INLINE_FUNCTION
Random_XorShift64<Kokkos::Cuda> Random_XorShift64_Pool<Kokkos::Cuda>::get_state() const {
#ifdef __CUDA_ARCH__
const int i_offset = (threadIdx.x*blockDim.y + threadIdx.y)*blockDim.z+threadIdx.z;
int i = ((blockIdx.x*gridDim.y+blockIdx.y)*gridDim.z + blockIdx.z) *
blockDim.x*blockDim.y*blockDim.z + i_offset;
while(Kokkos::atomic_compare_exchange(&locks_(i),0,1)) {
i+=blockDim.x*blockDim.y*blockDim.z;
if(i>=num_states_) {i = i_offset;}
}
return Random_XorShift64<Kokkos::Cuda>(state_(i),i);
#else
return Random_XorShift64<Kokkos::Cuda>(state_(0),0);
#endif
}
template<>
KOKKOS_INLINE_FUNCTION
void Random_XorShift64_Pool<Kokkos::Cuda>::free_state(const Random_XorShift64<Kokkos::Cuda> &state) const {
#ifdef __CUDA_ARCH__
state_(state.state_idx_) = state.state_;
locks_(state.state_idx_) = 0;
return;
#endif
}
template<>
inline
-Random_XorShift1024_Pool<Kokkos::Cuda>::Random_XorShift1024_Pool(unsigned int seed) {
+Random_XorShift1024_Pool<Kokkos::Cuda>::Random_XorShift1024_Pool(uint64_t seed) {
num_states_ = 0;
init(seed,4*32768);
}
template<>
KOKKOS_INLINE_FUNCTION
Random_XorShift1024<Kokkos::Cuda> Random_XorShift1024_Pool<Kokkos::Cuda>::get_state() const {
#ifdef __CUDA_ARCH__
const int i_offset = (threadIdx.x*blockDim.y + threadIdx.y)*blockDim.z+threadIdx.z;
int i = ((blockIdx.x*gridDim.y+blockIdx.y)*gridDim.z + blockIdx.z) *
blockDim.x*blockDim.y*blockDim.z + i_offset;
while(Kokkos::atomic_compare_exchange(&locks_(i),0,1)) {
i+=blockDim.x*blockDim.y*blockDim.z;
if(i>=num_states_) {i = i_offset;}
}
return Random_XorShift1024<Kokkos::Cuda>(&state_(i,0), p_(i), i);
#else
return Random_XorShift1024<Kokkos::Cuda>(&state_(0,0), p_(0), 0);
#endif
}
template<>
KOKKOS_INLINE_FUNCTION
void Random_XorShift1024_Pool<Kokkos::Cuda>::free_state(const Random_XorShift1024<Kokkos::Cuda> &state) const {
#ifdef __CUDA_ARCH__
for(int i=0; i<16; i++)
state_(state.state_idx_,i) = state.state_[i];
locks_(state.state_idx_) = 0;
return;
#endif
}
#endif
template<class ViewType, class RandomPool, int loops, int rank>
struct fill_random_functor_range;
template<class ViewType, class RandomPool, int loops, int rank>
struct fill_random_functor_begin_end;
template<class ViewType, class RandomPool, int loops>
struct fill_random_functor_range<ViewType,RandomPool,loops,1>{
- typedef typename ViewType::device_type device_type;
+ typedef typename ViewType::execution_space execution_space;
ViewType a;
RandomPool rand_pool;
typename ViewType::const_value_type range;
typedef rand<typename RandomPool::generator_type, typename ViewType::non_const_value_type> Rand;
fill_random_functor_range(ViewType a_, RandomPool rand_pool_,
typename ViewType::const_value_type range_):
a(a_),rand_pool(rand_pool_),range(range_) {}
KOKKOS_INLINE_FUNCTION
void operator() (unsigned int i) const {
typename RandomPool::generator_type gen = rand_pool.get_state();
for(unsigned int j=0;j<loops;j++) {
const uint64_t idx = i*loops+j;
if(idx<a.dimension_0())
a(idx) = Rand::draw(gen,range);
}
rand_pool.free_state(gen);
}
};
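// Each fill_random functor covers 'loops' consecutive leading indices per thread (guarding
// against stepping past dimension_0) and loops over all trailing dimensions of the View; the
// higher-rank specializations below repeat the same pattern with one more inner loop per rank.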
template<class ViewType, class RandomPool, int loops>
struct fill_random_functor_range<ViewType,RandomPool,loops,2>{
- typedef typename ViewType::device_type device_type;
+ typedef typename ViewType::execution_space execution_space;
ViewType a;
RandomPool rand_pool;
typename ViewType::const_value_type range;
typedef rand<typename RandomPool::generator_type, typename ViewType::non_const_value_type> Rand;
fill_random_functor_range(ViewType a_, RandomPool rand_pool_,
typename ViewType::const_value_type range_):
a(a_),rand_pool(rand_pool_),range(range_) {}
KOKKOS_INLINE_FUNCTION
void operator() (unsigned int i) const {
typename RandomPool::generator_type gen = rand_pool.get_state();
for(unsigned int j=0;j<loops;j++) {
const uint64_t idx = i*loops+j;
if(idx<a.dimension_0()) {
for(unsigned int k=0;k<a.dimension_1();k++)
a(idx,k) = Rand::draw(gen,range);
}
}
rand_pool.free_state(gen);
}
};
template<class ViewType, class RandomPool, int loops>
struct fill_random_functor_range<ViewType,RandomPool,loops,3>{
- typedef typename ViewType::device_type device_type;
+ typedef typename ViewType::execution_space execution_space;
ViewType a;
RandomPool rand_pool;
typename ViewType::const_value_type range;
typedef rand<typename RandomPool::generator_type, typename ViewType::non_const_value_type> Rand;
fill_random_functor_range(ViewType a_, RandomPool rand_pool_,
typename ViewType::const_value_type range_):
a(a_),rand_pool(rand_pool_),range(range_) {}
KOKKOS_INLINE_FUNCTION
void operator() (unsigned int i) const {
typename RandomPool::generator_type gen = rand_pool.get_state();
for(unsigned int j=0;j<loops;j++) {
const uint64_t idx = i*loops+j;
if(idx<a.dimension_0()) {
for(unsigned int k=0;k<a.dimension_1();k++)
for(unsigned int l=0;l<a.dimension_2();l++)
a(idx,k,l) = Rand::draw(gen,range);
}
}
rand_pool.free_state(gen);
}
};
template<class ViewType, class RandomPool, int loops>
struct fill_random_functor_range<ViewType,RandomPool,loops,4>{
- typedef typename ViewType::device_type device_type;
+ typedef typename ViewType::execution_space execution_space;
ViewType a;
RandomPool rand_pool;
typename ViewType::const_value_type range;
typedef rand<typename RandomPool::generator_type, typename ViewType::non_const_value_type> Rand;
fill_random_functor_range(ViewType a_, RandomPool rand_pool_,
typename ViewType::const_value_type range_):
a(a_),rand_pool(rand_pool_),range(range_) {}
KOKKOS_INLINE_FUNCTION
void operator() (unsigned int i) const {
typename RandomPool::generator_type gen = rand_pool.get_state();
for(unsigned int j=0;j<loops;j++) {
const uint64_t idx = i*loops+j;
if(idx<a.dimension_0()) {
for(unsigned int k=0;k<a.dimension_1();k++)
for(unsigned int l=0;l<a.dimension_2();l++)
for(unsigned int m=0;m<a.dimension_3();m++)
a(idx,k,l,m) = Rand::draw(gen,range);
}
}
rand_pool.free_state(gen);
}
};
template<class ViewType, class RandomPool, int loops>
struct fill_random_functor_range<ViewType,RandomPool,loops,5>{
- typedef typename ViewType::device_type device_type;
+ typedef typename ViewType::execution_space execution_space;
ViewType a;
RandomPool rand_pool;
typename ViewType::const_value_type range;
typedef rand<typename RandomPool::generator_type, typename ViewType::non_const_value_type> Rand;
fill_random_functor_range(ViewType a_, RandomPool rand_pool_,
typename ViewType::const_value_type range_):
a(a_),rand_pool(rand_pool_),range(range_) {}
KOKKOS_INLINE_FUNCTION
void operator() (unsigned int i) const {
typename RandomPool::generator_type gen = rand_pool.get_state();
for(unsigned int j=0;j<loops;j++) {
const uint64_t idx = i*loops+j;
if(idx<a.dimension_0()) {
for(unsigned int k=0;k<a.dimension_1();k++)
for(unsigned int l=0;l<a.dimension_2();l++)
for(unsigned int m=0;m<a.dimension_3();m++)
for(unsigned int n=0;n<a.dimension_4();n++)
a(idx,k,l,m,n) = Rand::draw(gen,range);
}
}
rand_pool.free_state(gen);
}
};
template<class ViewType, class RandomPool, int loops>
struct fill_random_functor_range<ViewType,RandomPool,loops,6>{
- typedef typename ViewType::device_type device_type;
+ typedef typename ViewType::execution_space execution_space;
ViewType a;
RandomPool rand_pool;
typename ViewType::const_value_type range;
typedef rand<typename RandomPool::generator_type, typename ViewType::non_const_value_type> Rand;
fill_random_functor_range(ViewType a_, RandomPool rand_pool_,
typename ViewType::const_value_type range_):
a(a_),rand_pool(rand_pool_),range(range_) {}
KOKKOS_INLINE_FUNCTION
void operator() (unsigned int i) const {
typename RandomPool::generator_type gen = rand_pool.get_state();
for(unsigned int j=0;j<loops;j++) {
const uint64_t idx = i*loops+j;
if(idx<a.dimension_0()) {
for(unsigned int k=0;k<a.dimension_1();k++)
for(unsigned int l=0;l<a.dimension_2();l++)
for(unsigned int m=0;m<a.dimension_3();m++)
for(unsigned int n=0;n<a.dimension_4();n++)
for(unsigned int o=0;o<a.dimension_5();o++)
a(idx,k,l,m,n,o) = Rand::draw(gen,range);
}
}
rand_pool.free_state(gen);
}
};
template<class ViewType, class RandomPool, int loops>
struct fill_random_functor_range<ViewType,RandomPool,loops,7>{
- typedef typename ViewType::device_type device_type;
+ typedef typename ViewType::execution_space execution_space;
ViewType a;
RandomPool rand_pool;
typename ViewType::const_value_type range;
typedef rand<typename RandomPool::generator_type, typename ViewType::non_const_value_type> Rand;
fill_random_functor_range(ViewType a_, RandomPool rand_pool_,
typename ViewType::const_value_type range_):
a(a_),rand_pool(rand_pool_),range(range_) {}
KOKKOS_INLINE_FUNCTION
void operator() (unsigned int i) const {
typename RandomPool::generator_type gen = rand_pool.get_state();
for(unsigned int j=0;j<loops;j++) {
const uint64_t idx = i*loops+j;
if(idx<a.dimension_0()) {
for(unsigned int k=0;k<a.dimension_1();k++)
for(unsigned int l=0;l<a.dimension_2();l++)
for(unsigned int m=0;m<a.dimension_3();m++)
for(unsigned int n=0;n<a.dimension_4();n++)
for(unsigned int o=0;o<a.dimension_5();o++)
for(unsigned int p=0;p<a.dimension_6();p++)
a(idx,k,l,m,n,o,p) = Rand::draw(gen,range);
}
}
rand_pool.free_state(gen);
}
};
template<class ViewType, class RandomPool, int loops>
struct fill_random_functor_range<ViewType,RandomPool,loops,8>{
- typedef typename ViewType::device_type device_type;
+ typedef typename ViewType::execution_space execution_space;
ViewType a;
RandomPool rand_pool;
typename ViewType::const_value_type range;
typedef rand<typename RandomPool::generator_type, typename ViewType::non_const_value_type> Rand;
fill_random_functor_range(ViewType a_, RandomPool rand_pool_,
typename ViewType::const_value_type range_):
a(a_),rand_pool(rand_pool_),range(range_) {}
KOKKOS_INLINE_FUNCTION
void operator() (unsigned int i) const {
typename RandomPool::generator_type gen = rand_pool.get_state();
for(unsigned int j=0;j<loops;j++) {
const uint64_t idx = i*loops+j;
if(idx<a.dimension_0()) {
for(unsigned int k=0;k<a.dimension_1();k++)
for(unsigned int l=0;l<a.dimension_2();l++)
for(unsigned int m=0;m<a.dimension_3();m++)
for(unsigned int n=0;n<a.dimension_4();n++)
for(unsigned int o=0;o<a.dimension_5();o++)
for(unsigned int p=0;p<a.dimension_6();p++)
for(unsigned int q=0;q<a.dimension_7();q++)
a(idx,k,l,m,n,o,p,q) = Rand::draw(gen,range);
}
}
rand_pool.free_state(gen);
}
};
template<class ViewType, class RandomPool, int loops>
struct fill_random_functor_begin_end<ViewType,RandomPool,loops,1>{
- typedef typename ViewType::device_type device_type;
+ typedef typename ViewType::execution_space execution_space;
ViewType a;
RandomPool rand_pool;
typename ViewType::const_value_type begin,end;
typedef rand<typename RandomPool::generator_type, typename ViewType::non_const_value_type> Rand;
fill_random_functor_begin_end(ViewType a_, RandomPool rand_pool_,
typename ViewType::const_value_type begin_, typename ViewType::const_value_type end_):
a(a_),rand_pool(rand_pool_),begin(begin_),end(end_) {}
KOKKOS_INLINE_FUNCTION
void operator() (unsigned int i) const {
typename RandomPool::generator_type gen = rand_pool.get_state();
for(unsigned int j=0;j<loops;j++) {
const uint64_t idx = i*loops+j;
if(idx<a.dimension_0())
a(idx) = Rand::draw(gen,begin,end);
}
rand_pool.free_state(gen);
}
};
template<class ViewType, class RandomPool, int loops>
struct fill_random_functor_begin_end<ViewType,RandomPool,loops,2>{
- typedef typename ViewType::device_type device_type;
+ typedef typename ViewType::execution_space execution_space;
ViewType a;
RandomPool rand_pool;
typename ViewType::const_value_type begin,end;
typedef rand<typename RandomPool::generator_type, typename ViewType::non_const_value_type> Rand;
fill_random_functor_begin_end(ViewType a_, RandomPool rand_pool_,
typename ViewType::const_value_type begin_, typename ViewType::const_value_type end_):
a(a_),rand_pool(rand_pool_),begin(begin_),end(end_) {}
KOKKOS_INLINE_FUNCTION
void operator() (unsigned int i) const {
typename RandomPool::generator_type gen = rand_pool.get_state();
for(unsigned int j=0;j<loops;j++) {
const uint64_t idx = i*loops+j;
if(idx<a.dimension_0()) {
for(unsigned int k=0;k<a.dimension_1();k++)
a(idx,k) = Rand::draw(gen,begin,end);
}
}
rand_pool.free_state(gen);
}
};
template<class ViewType, class RandomPool, int loops>
struct fill_random_functor_begin_end<ViewType,RandomPool,loops,3>{
- typedef typename ViewType::device_type device_type;
+ typedef typename ViewType::execution_space execution_space;
ViewType a;
RandomPool rand_pool;
typename ViewType::const_value_type begin,end;
typedef rand<typename RandomPool::generator_type, typename ViewType::non_const_value_type> Rand;
fill_random_functor_begin_end(ViewType a_, RandomPool rand_pool_,
typename ViewType::const_value_type begin_, typename ViewType::const_value_type end_):
a(a_),rand_pool(rand_pool_),begin(begin_),end(end_) {}
KOKKOS_INLINE_FUNCTION
void operator() (unsigned int i) const {
typename RandomPool::generator_type gen = rand_pool.get_state();
for(unsigned int j=0;j<loops;j++) {
const uint64_t idx = i*loops+j;
if(idx<a.dimension_0()) {
for(unsigned int k=0;k<a.dimension_1();k++)
for(unsigned int l=0;l<a.dimension_2();l++)
a(idx,k,l) = Rand::draw(gen,begin,end);
}
}
rand_pool.free_state(gen);
}
};
template<class ViewType, class RandomPool, int loops>
struct fill_random_functor_begin_end<ViewType,RandomPool,loops,4>{
- typedef typename ViewType::device_type device_type;
+ typedef typename ViewType::execution_space execution_space;
ViewType a;
RandomPool rand_pool;
typename ViewType::const_value_type begin,end;
typedef rand<typename RandomPool::generator_type, typename ViewType::non_const_value_type> Rand;
fill_random_functor_begin_end(ViewType a_, RandomPool rand_pool_,
typename ViewType::const_value_type begin_, typename ViewType::const_value_type end_):
a(a_),rand_pool(rand_pool_),begin(begin_),end(end_) {}
KOKKOS_INLINE_FUNCTION
void operator() (unsigned int i) const {
typename RandomPool::generator_type gen = rand_pool.get_state();
for(unsigned int j=0;j<loops;j++) {
const uint64_t idx = i*loops+j;
if(idx<a.dimension_0()) {
for(unsigned int k=0;k<a.dimension_1();k++)
for(unsigned int l=0;l<a.dimension_2();l++)
for(unsigned int m=0;m<a.dimension_3();m++)
a(idx,k,l,m) = Rand::draw(gen,begin,end);
}
}
rand_pool.free_state(gen);
}
};
template<class ViewType, class RandomPool, int loops>
struct fill_random_functor_begin_end<ViewType,RandomPool,loops,5>{
- typedef typename ViewType::device_type device_type;
+ typedef typename ViewType::execution_space execution_space;
ViewType a;
RandomPool rand_pool;
typename ViewType::const_value_type begin,end;
typedef rand<typename RandomPool::generator_type, typename ViewType::non_const_value_type> Rand;
fill_random_functor_begin_end(ViewType a_, RandomPool rand_pool_,
typename ViewType::const_value_type begin_, typename ViewType::const_value_type end_):
a(a_),rand_pool(rand_pool_),begin(begin_),end(end_) {}
KOKKOS_INLINE_FUNCTION
void operator() (unsigned int i) const {
typename RandomPool::generator_type gen = rand_pool.get_state();
for(unsigned int j=0;j<loops;j++) {
const uint64_t idx = i*loops+j;
if(idx<a.dimension_0()) {
for(unsigned int k=0;k<a.dimension_1();k++)
for(unsigned int l=0;l<a.dimension_2();l++)
for(unsigned int m=0;m<a.dimension_3();m++)
for(unsigned int n=0;n<a.dimension_4();n++)
a(idx,k,l,m,n) = Rand::draw(gen,begin,end);
}
}
rand_pool.free_state(gen);
}
};
template<class ViewType, class RandomPool, int loops>
struct fill_random_functor_begin_end<ViewType,RandomPool,loops,6>{
- typedef typename ViewType::device_type device_type;
+ typedef typename ViewType::execution_space execution_space;
ViewType a;
RandomPool rand_pool;
typename ViewType::const_value_type begin,end;
typedef rand<typename RandomPool::generator_type, typename ViewType::non_const_value_type> Rand;
fill_random_functor_begin_end(ViewType a_, RandomPool rand_pool_,
typename ViewType::const_value_type begin_, typename ViewType::const_value_type end_):
a(a_),rand_pool(rand_pool_),begin(begin_),end(end_) {}
KOKKOS_INLINE_FUNCTION
void operator() (unsigned int i) const {
typename RandomPool::generator_type gen = rand_pool.get_state();
for(unsigned int j=0;j<loops;j++) {
const uint64_t idx = i*loops+j;
if(idx<a.dimension_0()) {
for(unsigned int k=0;k<a.dimension_1();k++)
for(unsigned int l=0;l<a.dimension_2();l++)
for(unsigned int m=0;m<a.dimension_3();m++)
for(unsigned int n=0;n<a.dimension_4();n++)
for(unsigned int o=0;o<a.dimension_5();o++)
a(idx,k,l,m,n,o) = Rand::draw(gen,begin,end);
}
}
rand_pool.free_state(gen);
}
};
template<class ViewType, class RandomPool, int loops>
struct fill_random_functor_begin_end<ViewType,RandomPool,loops,7>{
- typedef typename ViewType::device_type device_type;
+ typedef typename ViewType::execution_space execution_space;
ViewType a;
RandomPool rand_pool;
typename ViewType::const_value_type begin,end;
typedef rand<typename RandomPool::generator_type, typename ViewType::non_const_value_type> Rand;
fill_random_functor_begin_end(ViewType a_, RandomPool rand_pool_,
typename ViewType::const_value_type begin_, typename ViewType::const_value_type end_):
a(a_),rand_pool(rand_pool_),begin(begin_),end(end_) {}
KOKKOS_INLINE_FUNCTION
void operator() (unsigned int i) const {
typename RandomPool::generator_type gen = rand_pool.get_state();
for(unsigned int j=0;j<loops;j++) {
const uint64_t idx = i*loops+j;
if(idx<a.dimension_0()) {
for(unsigned int k=0;k<a.dimension_1();k++)
for(unsigned int l=0;l<a.dimension_2();l++)
for(unsigned int m=0;m<a.dimension_3();m++)
for(unsigned int n=0;n<a.dimension_4();n++)
for(unsigned int o=0;o<a.dimension_5();o++)
for(unsigned int p=0;p<a.dimension_6();p++)
a(idx,k,l,m,n,o,p) = Rand::draw(gen,begin,end);
}
}
rand_pool.free_state(gen);
}
};
template<class ViewType, class RandomPool, int loops>
struct fill_random_functor_begin_end<ViewType,RandomPool,loops,8>{
- typedef typename ViewType::device_type device_type;
+ typedef typename ViewType::execution_space execution_space;
ViewType a;
RandomPool rand_pool;
typename ViewType::const_value_type begin,end;
typedef rand<typename RandomPool::generator_type, typename ViewType::non_const_value_type> Rand;
fill_random_functor_begin_end(ViewType a_, RandomPool rand_pool_,
typename ViewType::const_value_type begin_, typename ViewType::const_value_type end_):
a(a_),rand_pool(rand_pool_),begin(begin_),end(end_) {}
KOKKOS_INLINE_FUNCTION
void operator() (unsigned int i) const {
typename RandomPool::generator_type gen = rand_pool.get_state();
for(unsigned int j=0;j<loops;j++) {
const uint64_t idx = i*loops+j;
if(idx<a.dimension_0()) {
for(unsigned int k=0;k<a.dimension_1();k++)
for(unsigned int l=0;l<a.dimension_2();l++)
for(unsigned int m=0;m<a.dimension_3();m++)
for(unsigned int n=0;n<a.dimension_4();n++)
for(unsigned int o=0;o<a.dimension_5();o++)
for(unsigned int p=0;p<a.dimension_6();p++)
for(unsigned int q=0;q<a.dimension_7();q++)
a(idx,k,l,m,n,o,p,q) = Rand::draw(gen,begin,end);
}
}
rand_pool.free_state(gen);
}
};
template<class ViewType, class RandomPool>
void fill_random(ViewType a, RandomPool g, typename ViewType::const_value_type range) {
int64_t LDA = a.dimension_0();
if(LDA>0)
parallel_for((LDA+127)/128,fill_random_functor_range<ViewType,RandomPool,128,ViewType::Rank>(a,g,range));
}
template<class ViewType, class RandomPool>
void fill_random(ViewType a, RandomPool g, typename ViewType::const_value_type begin,typename ViewType::const_value_type end ) {
int64_t LDA = a.dimension_0();
if(LDA>0)
parallel_for((LDA+127)/128,fill_random_functor_begin_end<ViewType,RandomPool,128,ViewType::Rank>(a,g,begin,end));
}
}
#endif
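For reference, a minimal sketch of how the fill_random() overloads defined above are typically called, assuming they live in namespace Kokkos as in the upstream header and that a Random_XorShift64_Pool is used as the generator pool; the view shapes, seed, and value ranges below are illustrative assumptions, not part of this patch.

#include <Kokkos_Core.hpp>
#include <Kokkos_Random.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // One generator state per concurrent thread, seeded with an arbitrary value (assumed).
    Kokkos::Random_XorShift64_Pool<Kokkos::DefaultExecutionSpace> pool(12345);

    // Rank-1 and rank-2 views; the rank-dispatched functors above cover ranks 1-8.
    Kokkos::View<double*>  a("a", 1000);
    Kokkos::View<double**> b("b", 1000, 3);

    Kokkos::fill_random(a, pool, 10.0);        // values drawn in [0,10)
    Kokkos::fill_random(b, pool, -1.0, 1.0);   // values drawn in [-1,1)
  }
  Kokkos::finalize();
  return 0;
}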
diff --git a/lib/kokkos/algorithms/src/Kokkos_Sort.hpp b/lib/kokkos/algorithms/src/Kokkos_Sort.hpp
index 99bd2ff12..8d97472aa 100755
--- a/lib/kokkos/algorithms/src/Kokkos_Sort.hpp
+++ b/lib/kokkos/algorithms/src/Kokkos_Sort.hpp
@@ -1,486 +1,496 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos
-// Manycore Performance-Portable Multidimensional Arrays
-//
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_SORT_HPP_
#define KOKKOS_SORT_HPP_
#include <Kokkos_Core.hpp>
#include <algorithm>
namespace Kokkos {
namespace SortImpl {
template<class ValuesViewType, int Rank=ValuesViewType::Rank>
struct CopyOp;
template<class ValuesViewType>
struct CopyOp<ValuesViewType,1> {
template<class DstType, class SrcType>
KOKKOS_INLINE_FUNCTION
static void copy(DstType& dst, size_t i_dst,
SrcType& src, size_t i_src ) {
dst(i_dst) = src(i_src);
}
};
template<class ValuesViewType>
struct CopyOp<ValuesViewType,2> {
template<class DstType, class SrcType>
KOKKOS_INLINE_FUNCTION
static void copy(DstType& dst, size_t i_dst,
SrcType& src, size_t i_src ) {
for(int j = 0;j< (int) dst.dimension_1(); j++)
dst(i_dst,j) = src(i_src,j);
}
};
template<class ValuesViewType>
struct CopyOp<ValuesViewType,3> {
template<class DstType, class SrcType>
KOKKOS_INLINE_FUNCTION
static void copy(DstType& dst, size_t i_dst,
SrcType& src, size_t i_src ) {
for(int j = 0; j<dst.dimension_1(); j++)
for(int k = 0; k<dst.dimension_2(); k++)
dst(i_dst,j,k) = src(i_src,j,k);
}
};
}
template<class KeyViewType, class BinSortOp, class ExecutionSpace = typename KeyViewType::execution_space,
class SizeType = typename KeyViewType::memory_space::size_type>
class BinSort {
public:
template<class ValuesViewType, class PermuteViewType, class CopyOp>
struct bin_sort_sort_functor {
typedef ExecutionSpace execution_space;
typedef typename ValuesViewType::non_const_type values_view_type;
typedef typename ValuesViewType::const_type const_values_view_type;
Kokkos::View<typename values_view_type::const_data_type,typename values_view_type::array_layout,
typename values_view_type::memory_space,Kokkos::MemoryTraits<Kokkos::RandomAccess> > values;
values_view_type sorted_values;
typename PermuteViewType::const_type sort_order;
bin_sort_sort_functor(const_values_view_type values_, values_view_type sorted_values_, PermuteViewType sort_order_):
values(values_),sorted_values(sorted_values_),sort_order(sort_order_) {}
KOKKOS_INLINE_FUNCTION
void operator() (const int& i) const {
//printf("Sort: %i %i\n",i,sort_order(i));
CopyOp::copy(sorted_values,i,values,sort_order(i));
}
};
typedef ExecutionSpace execution_space;
typedef BinSortOp bin_op_type;
struct bin_count_tag {};
struct bin_offset_tag {};
struct bin_binning_tag {};
struct bin_sort_bins_tag {};
public:
typedef SizeType size_type;
typedef size_type value_type;
typedef Kokkos::View<size_type*, execution_space> offset_type;
typedef Kokkos::View<const int*, execution_space> bin_count_type;
typedef Kokkos::View<typename KeyViewType::const_data_type,
typename KeyViewType::array_layout,
typename KeyViewType::memory_space> const_key_view_type;
typedef Kokkos::View<typename KeyViewType::const_data_type,
typename KeyViewType::array_layout,
typename KeyViewType::memory_space,
Kokkos::MemoryTraits<Kokkos::RandomAccess> > const_rnd_key_view_type;
typedef typename KeyViewType::non_const_value_type non_const_key_scalar;
typedef typename KeyViewType::const_value_type const_key_scalar;
private:
const_key_view_type keys;
const_rnd_key_view_type keys_rnd;
public:
BinSortOp bin_op;
offset_type bin_offsets;
Kokkos::View<int*, ExecutionSpace, Kokkos::MemoryTraits<Kokkos::Atomic> > bin_count_atomic;
bin_count_type bin_count_const;
offset_type sort_order;
bool sort_within_bins;
public:
// Constructor: takes the keys, the binning_operator and optionally whether to sort within bins (default false)
BinSort(const_key_view_type keys_, BinSortOp bin_op_,
bool sort_within_bins_ = false)
:keys(keys_),keys_rnd(keys_), bin_op(bin_op_) {
bin_count_atomic = Kokkos::View<int*, ExecutionSpace >("Kokkos::SortImpl::BinSortFunctor::bin_count",bin_op.max_bins());
bin_count_const = bin_count_atomic;
bin_offsets = offset_type("Kokkos::SortImpl::BinSortFunctor::bin_offsets",bin_op.max_bins());
sort_order = offset_type("PermutationVector",keys.dimension_0());
sort_within_bins = sort_within_bins_;
}
// Create the permutation vector, the bin_offset array and the bin_count array. Can be called again if keys changed
void create_permute_vector() {
Kokkos::parallel_for (Kokkos::RangePolicy<ExecutionSpace,bin_count_tag> (0,keys.dimension_0()),*this);
Kokkos::parallel_scan(Kokkos::RangePolicy<ExecutionSpace,bin_offset_tag> (0,bin_op.max_bins()) ,*this);
Kokkos::deep_copy(bin_count_atomic,0);
Kokkos::parallel_for (Kokkos::RangePolicy<ExecutionSpace,bin_binning_tag> (0,keys.dimension_0()),*this);
if(sort_within_bins)
Kokkos::parallel_for (Kokkos::RangePolicy<ExecutionSpace,bin_sort_bins_tag>(0,bin_op.max_bins()) ,*this);
}
// Sort a view with respect to the first dimension using the permutation array
template<class ValuesViewType>
void sort(ValuesViewType values) {
ValuesViewType sorted_values = ValuesViewType("Copy",
values.dimension_0(),
values.dimension_1(),
values.dimension_2(),
values.dimension_3(),
values.dimension_4(),
values.dimension_5(),
values.dimension_6(),
values.dimension_7());
parallel_for(values.dimension_0(),
bin_sort_sort_functor<ValuesViewType, offset_type,
SortImpl::CopyOp<ValuesViewType> >(values,sorted_values,sort_order));
deep_copy(values,sorted_values);
}
// Get the permutation vector
KOKKOS_INLINE_FUNCTION
offset_type get_permute_vector() const { return sort_order;}
// Get the start offsets for each bin
KOKKOS_INLINE_FUNCTION
offset_type get_bin_offsets() const { return bin_offsets;}
// Get the count for each bin
KOKKOS_INLINE_FUNCTION
bin_count_type get_bin_count() const {return bin_count_const;}
public:
KOKKOS_INLINE_FUNCTION
void operator() (const bin_count_tag& tag, const int& i) const {
bin_count_atomic(bin_op.bin(keys,i))++;
}
KOKKOS_INLINE_FUNCTION
void operator() (const bin_offset_tag& tag, const int& i, value_type& offset, const bool& final) const {
if(final) {
bin_offsets(i) = offset;
}
offset+=bin_count_const(i);
}
KOKKOS_INLINE_FUNCTION
void operator() (const bin_binning_tag& tag, const int& i) const {
const int bin = bin_op.bin(keys,i);
const int count = bin_count_atomic(bin)++;
sort_order(bin_offsets(bin) + count) = i;
}
KOKKOS_INLINE_FUNCTION
void operator() (const bin_sort_bins_tag& tag, const int&i ) const {
bool sorted = false;
int upper_bound = bin_offsets(i)+bin_count_const(i);
while(!sorted) {
sorted = true;
int old_idx = sort_order(bin_offsets(i));
int new_idx;
for(int k=bin_offsets(i)+1; k<upper_bound; k++) {
new_idx = sort_order(k);
if(!bin_op(keys_rnd,old_idx,new_idx)) {
sort_order(k-1) = new_idx;
sort_order(k) = old_idx;
sorted = false;
} else {
old_idx = new_idx;
}
}
upper_bound--;
}
}
};
namespace SortImpl {
template<class KeyViewType>
struct DefaultBinOp1D {
const int max_bins_;
const double mul_;
typename KeyViewType::const_value_type range_;
typename KeyViewType::const_value_type min_;
//Construct BinOp with number of bins, minimum value and maximum value
- DefaultBinOp1D(int max_bins, typename KeyViewType::const_value_type min,
+ DefaultBinOp1D(int max_bins__, typename KeyViewType::const_value_type min,
typename KeyViewType::const_value_type max )
- :max_bins_(max_bins+1),mul_(1.0*max_bins/(max-min)),range_(max-min),min_(min) {}
+ :max_bins_(max_bins__+1),mul_(1.0*max_bins__/(max-min)),range_(max-min),min_(min) {}
//Determine bin index from key value
template<class ViewType>
KOKKOS_INLINE_FUNCTION
int bin(ViewType& keys, const int& i) const {
return int(mul_*(keys(i)-min_));
}
//Return maximum bin index + 1
KOKKOS_INLINE_FUNCTION
int max_bins() const {
return max_bins_;
}
//Compare two keys within a bin; if true, new_val will be put before old_val
template<class ViewType, typename iType1, typename iType2>
KOKKOS_INLINE_FUNCTION
bool operator()(ViewType& keys, iType1& i1, iType2& i2) const {
return keys(i1)<keys(i2);
}
};
template<class KeyViewType>
struct DefaultBinOp3D {
int max_bins_[3];
double mul_[3];
typename KeyViewType::non_const_value_type range_[3];
typename KeyViewType::non_const_value_type min_[3];
- DefaultBinOp3D(int max_bins[], typename KeyViewType::const_value_type min[],
+ DefaultBinOp3D(int max_bins__[], typename KeyViewType::const_value_type min[],
typename KeyViewType::const_value_type max[] )
{
- max_bins_[0] = max_bins[0]+1;
- max_bins_[1] = max_bins[1]+1;
- max_bins_[2] = max_bins[2]+1;
- mul_[0] = 1.0*max_bins[0]/(max[0]-min[0]);
- mul_[1] = 1.0*max_bins[1]/(max[1]-min[1]);
- mul_[2] = 1.0*max_bins[2]/(max[2]-min[2]);
+ max_bins_[0] = max_bins__[0]+1;
+ max_bins_[1] = max_bins__[1]+1;
+ max_bins_[2] = max_bins__[2]+1;
+ mul_[0] = 1.0*max_bins__[0]/(max[0]-min[0]);
+ mul_[1] = 1.0*max_bins__[1]/(max[1]-min[1]);
+ mul_[2] = 1.0*max_bins__[2]/(max[2]-min[2]);
range_[0] = max[0]-min[0];
range_[1] = max[1]-min[1];
range_[2] = max[2]-min[2];
min_[0] = min[0];
min_[1] = min[1];
min_[2] = min[2];
}
template<class ViewType>
KOKKOS_INLINE_FUNCTION
int bin(ViewType& keys, const int& i) const {
return int( (((int(mul_[0]*(keys(i,0)-min_[0]))*max_bins_[1]) +
int(mul_[1]*(keys(i,1)-min_[1])))*max_bins_[2]) +
int(mul_[2]*(keys(i,2)-min_[2])));
}
KOKKOS_INLINE_FUNCTION
int max_bins() const {
return max_bins_[0]*max_bins_[1]*max_bins_[2];
}
template<class ViewType, typename iType1, typename iType2>
KOKKOS_INLINE_FUNCTION
bool operator()(ViewType& keys, iType1& i1 , iType2& i2) const {
if (keys(i1,0)>keys(i2,0)) return true;
else if (keys(i1,0)==keys(i2,0)) {
if (keys(i1,1)>keys(i2,1)) return true;
else if (keys(i1,1)==keys(i2,1)) {
if (keys(i1,2)>keys(i2,2)) return true;
}
}
return false;
}
};
template<typename Scalar>
struct min_max {
Scalar min;
Scalar max;
bool init;
KOKKOS_INLINE_FUNCTION
min_max() {
min = 0;
max = 0;
init = 0;
}
KOKKOS_INLINE_FUNCTION
min_max (const min_max& val) {
min = val.min;
max = val.max;
init = val.init;
}
KOKKOS_INLINE_FUNCTION
min_max operator = (const min_max& val) {
min = val.min;
max = val.max;
init = val.init;
return *this;
}
KOKKOS_INLINE_FUNCTION
void operator+= (const Scalar& val) {
if(init) {
min = min<val?min:val;
max = max>val?max:val;
} else {
min = val;
max = val;
init = 1;
}
}
KOKKOS_INLINE_FUNCTION
void operator+= (const min_max& val) {
if(init && val.init) {
min = min<val.min?min:val.min;
max = max>val.max?max:val.max;
} else {
if(val.init) {
min = val.min;
max = val.max;
init = 1;
}
}
}
KOKKOS_INLINE_FUNCTION
void operator+= (volatile const Scalar& val) volatile {
if(init) {
min = min<val?min:val;
max = max>val?max:val;
} else {
min = val;
max = val;
init = 1;
}
}
KOKKOS_INLINE_FUNCTION
void operator+= (volatile const min_max& val) volatile {
if(init && val.init) {
min = min<val.min?min:val.min;
max = max>val.max?max:val.max;
} else {
if(val.init) {
min = val.min;
max = val.max;
init = 1;
}
}
}
};
template<class ViewType>
struct min_max_functor {
typedef typename ViewType::execution_space execution_space;
ViewType view;
typedef min_max<typename ViewType::non_const_value_type> value_type;
min_max_functor (const ViewType view_):view(view_) {
}
KOKKOS_INLINE_FUNCTION
void operator()(const size_t& i, value_type& val) const {
val += view(i);
}
};
template<class ViewType>
bool try_std_sort(ViewType view) {
bool possible = true;
+#if ! defined( KOKKOS_USING_EXPERIMENTAL_VIEW )
size_t stride[8];
view.stride(stride);
+#else
+ size_t stride[8] = { view.stride_0()
+ , view.stride_1()
+ , view.stride_2()
+ , view.stride_3()
+ , view.stride_4()
+ , view.stride_5()
+ , view.stride_6()
+ , view.stride_7()
+ };
+#endif
possible = possible && Impl::is_same<typename ViewType::memory_space, HostSpace>::value;
possible = possible && (ViewType::Rank == 1);
possible = possible && (stride[0] == 1);
if(possible) {
std::sort(view.ptr_on_device(),view.ptr_on_device()+view.dimension_0());
}
return possible;
}
}
template<class ViewType>
void sort(ViewType view, bool always_use_kokkos_sort = false) {
if(!always_use_kokkos_sort) {
if(SortImpl::try_std_sort(view)) return;
}
typedef SortImpl::DefaultBinOp1D<ViewType> CompType;
SortImpl::min_max<typename ViewType::non_const_value_type> val;
parallel_reduce(view.dimension_0(),SortImpl::min_max_functor<ViewType>(view),val);
BinSort<ViewType, CompType> bin_sort(view,CompType(view.dimension_0()/2,val.min,val.max),true);
bin_sort.create_permute_vector();
bin_sort.sort(view);
}
/*template<class ViewType, class Comparator>
void sort(ViewType view, Comparator comp, bool always_use_kokkos_sort = false) {
}*/
}
#endif
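A usage sketch for the two sorting entry points in this header: the convenience Kokkos::sort() defined above, and the lower-level BinSort combined with SortImpl::DefaultBinOp1D to permute an associated values view in the same order as the keys. The view sizes, bin count, and key bounds are assumptions for illustration only.

#include <Kokkos_Core.hpp>
#include <Kokkos_Sort.hpp>

void sort_example() {
  Kokkos::View<unsigned*> keys("keys", 1000);
  Kokkos::View<double*>   values("values", 1000);
  // ... fill keys and values ...

  // 1) Convenience path: sorts 'keys' in place; for a contiguous rank-1 host
  //    view this falls back to std::sort (see try_std_sort above).
  Kokkos::sort(keys);

  // 2) Lower-level path (alternative): bin the keys, build the permutation
  //    vector once, then apply that permutation to keys and to an associated view.
  typedef Kokkos::View<unsigned*> KeyView;
  typedef Kokkos::SortImpl::DefaultBinOp1D<KeyView> BinOp;
  const unsigned min_key = 0, max_key = 1000;            // assumed key bounds
  BinOp binner(keys.dimension_0()/2, min_key, max_key);
  Kokkos::BinSort<KeyView, BinOp> bin_sort(keys, binner, true);
  bin_sort.create_permute_vector();
  bin_sort.sort(keys);
  bin_sort.sort(values);
}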
diff --git a/lib/kokkos/algorithms/unit_tests/Makefile b/lib/kokkos/algorithms/unit_tests/Makefile
new file mode 100755
index 000000000..5fc94ac0f
--- /dev/null
+++ b/lib/kokkos/algorithms/unit_tests/Makefile
@@ -0,0 +1,92 @@
+KOKKOS_PATH = ../..
+
+GTEST_PATH = ../../TPL/gtest
+
+vpath %.cpp ${KOKKOS_PATH}/algorithms/unit_tests
+
+default: build_all
+ echo "End Build"
+
+
+include $(KOKKOS_PATH)/Makefile.kokkos
+
+ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
+ CXX = nvcc_wrapper
+ CXXFLAGS ?= -O3
+ LINK = $(CXX)
+ LDFLAGS ?= -lpthread
+else
+ CXX ?= g++
+ CXXFLAGS ?= -O3
+ LINK ?= $(CXX)
+ LDFLAGS ?= -lpthread
+endif
+
+KOKKOS_CXXFLAGS += -I$(GTEST_PATH) -I${KOKKOS_PATH}/algorithms/unit_tests
+
+TEST_TARGETS =
+TARGETS =
+
+ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
+ OBJ_CUDA = TestCuda.o UnitTestMain.o gtest-all.o
+ TARGETS += KokkosAlgorithms_UnitTest_Cuda
+ TEST_TARGETS += test-cuda
+endif
+
+ifeq ($(KOKKOS_INTERNAL_USE_PTHREADS), 1)
+ OBJ_THREADS = TestThreads.o UnitTestMain.o gtest-all.o
+ TARGETS += KokkosAlgorithms_UnitTest_Threads
+ TEST_TARGETS += test-threads
+endif
+
+ifeq ($(KOKKOS_INTERNAL_USE_OPENMP), 1)
+ OBJ_OPENMP = TestOpenMP.o UnitTestMain.o gtest-all.o
+ TARGETS += KokkosAlgorithms_UnitTest_OpenMP
+ TEST_TARGETS += test-openmp
+endif
+
+ifeq ($(KOKKOS_INTERNAL_USE_SERIAL), 1)
+ OBJ_SERIAL = TestSerial.o UnitTestMain.o gtest-all.o
+ TARGETS += KokkosAlgorithms_UnitTest_Serial
+ TEST_TARGETS += test-serial
+endif
+
+KokkosAlgorithms_UnitTest_Cuda: $(OBJ_CUDA) $(KOKKOS_LINK_DEPENDS)
+ $(LINK) $(KOKKOS_LDFLAGS) $(LDFLAGS) $(EXTRA_PATH) $(OBJ_CUDA) $(KOKKOS_LIBS) $(LIB) -o KokkosAlgorithms_UnitTest_Cuda
+
+KokkosAlgorithms_UnitTest_Threads: $(OBJ_THREADS) $(KOKKOS_LINK_DEPENDS)
+ $(LINK) $(KOKKOS_LDFLAGS) $(LDFLAGS) $(EXTRA_PATH) $(OBJ_THREADS) $(KOKKOS_LIBS) $(LIB) -o KokkosAlgorithms_UnitTest_Threads
+
+KokkosAlgorithms_UnitTest_OpenMP: $(OBJ_OPENMP) $(KOKKOS_LINK_DEPENDS)
+ $(LINK) $(KOKKOS_LDFLAGS) $(LDFLAGS) $(EXTRA_PATH) $(OBJ_OPENMP) $(KOKKOS_LIBS) $(LIB) -o KokkosAlgorithms_UnitTest_OpenMP
+
+KokkosAlgorithms_UnitTest_Serial: $(OBJ_SERIAL) $(KOKKOS_LINK_DEPENDS)
+ $(LINK) $(KOKKOS_LDFLAGS) $(LDFLAGS) $(EXTRA_PATH) $(OBJ_SERIAL) $(KOKKOS_LIBS) $(LIB) -o KokkosAlgorithms_UnitTest_Serial
+
+test-cuda: KokkosAlgorithms_UnitTest_Cuda
+ ./KokkosAlgorithms_UnitTest_Cuda
+
+test-threads: KokkosAlgorithms_UnitTest_Threads
+ ./KokkosAlgorithms_UnitTest_Threads
+
+test-openmp: KokkosAlgorithms_UnitTest_OpenMP
+ ./KokkosAlgorithms_UnitTest_OpenMP
+
+test-serial: KokkosAlgorithms_UnitTest_Serial
+ ./KokkosAlgorithms_UnitTest_Serial
+
+build_all: $(TARGETS)
+
+test: $(TEST_TARGETS)
+
+clean: kokkos-clean
+ rm -f *.o $(TARGETS)
+
+# Compilation rules
+
+%.o:%.cpp $(KOKKOS_CPP_DEPENDS)
+ $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) $(EXTRA_INC) -c $<
+
+gtest-all.o:$(GTEST_PATH)/gtest/gtest-all.cc
+ $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) $(EXTRA_INC) -c $(GTEST_PATH)/gtest/gtest-all.cc
+
diff --git a/lib/kokkos/core/src/impl/Kokkos_PhysicalLayout.hpp b/lib/kokkos/algorithms/unit_tests/TestCuda.cpp
similarity index 50%
copy from lib/kokkos/core/src/impl/Kokkos_PhysicalLayout.hpp
copy to lib/kokkos/algorithms/unit_tests/TestCuda.cpp
index 0dcb3977a..d19c778c4 100755
--- a/lib/kokkos/core/src/impl/Kokkos_PhysicalLayout.hpp
+++ b/lib/kokkos/algorithms/unit_tests/TestCuda.cpp
@@ -1,84 +1,110 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
-#ifndef KOKKOS_PHYSICAL_LAYOUT_HPP
-#define KOKKOS_PHYSICAL_LAYOUT_HPP
-
-
-#include <Kokkos_View.hpp>
-namespace Kokkos {
-namespace Impl {
-
-
-
-struct PhysicalLayout {
- enum LayoutType {Left,Right,Scalar,Error};
- LayoutType layout_type;
- int rank;
- long long int stride[8]; //distance between two neighboring elements in a given dimension
-
- template< class T , class L , class D , class M >
- PhysicalLayout( const View<T,L,D,M,ViewDefault> & view )
- : layout_type( is_same< typename View<T,L,D,M>::array_layout , LayoutLeft >::value ? Left : (
- is_same< typename View<T,L,D,M>::array_layout , LayoutRight >::value ? Right : Error ))
- , rank( view.Rank )
- {
- for(int i=0;i<8;i++) stride[i] = 0;
- view.stride( stride );
- }
- #ifdef KOKKOS_HAVE_CUDA
- template< class T , class L , class D , class M >
- PhysicalLayout( const View<T,L,D,M,ViewCudaTexture> & view )
- : layout_type( is_same< typename View<T,L,D,M>::array_layout , LayoutLeft >::value ? Left : (
- is_same< typename View<T,L,D,M>::array_layout , LayoutRight >::value ? Right : Error ))
- , rank( view.Rank )
- {
- for(int i=0;i<8;i++) stride[i] = 0;
- view.stride( stride );
- }
- #endif
+#include <stdint.h>
+#include <iostream>
+#include <iomanip>
+
+#include <gtest/gtest.h>
+
+#include <Kokkos_Core.hpp>
+
+#ifdef KOKKOS_HAVE_CUDA
+
+#include <TestRandom.hpp>
+#include <TestSort.hpp>
+
+namespace Test {
+
+class cuda : public ::testing::Test {
+protected:
+ static void SetUpTestCase()
+ {
+ std::cout << std::setprecision(5) << std::scientific;
+ Kokkos::HostSpace::execution_space::initialize();
+ Kokkos::Cuda::initialize( Kokkos::Cuda::SelectDevice(0) );
+ }
+ static void TearDownTestCase()
+ {
+ Kokkos::Cuda::finalize();
+ Kokkos::HostSpace::execution_space::finalize();
+ }
};
+void cuda_test_random_xorshift64( int num_draws )
+{
+ Impl::test_random<Kokkos::Random_XorShift64_Pool<Kokkos::Cuda> >(num_draws);
}
+
+void cuda_test_random_xorshift1024( int num_draws )
+{
+ Impl::test_random<Kokkos::Random_XorShift1024_Pool<Kokkos::Cuda> >(num_draws);
+}
+
+
+#define CUDA_RANDOM_XORSHIFT64( num_draws ) \
+ TEST_F( cuda, Random_XorShift64 ) { \
+ cuda_test_random_xorshift64(num_draws); \
+ }
+
+#define CUDA_RANDOM_XORSHIFT1024( num_draws ) \
+ TEST_F( cuda, Random_XorShift1024 ) { \
+ cuda_test_random_xorshift1024(num_draws); \
+ }
+
+#define CUDA_SORT_UNSIGNED( size ) \
+ TEST_F( cuda, SortUnsigned ) { \
+ Impl::test_sort< Kokkos::Cuda, unsigned >(size); \
+ }
+
+CUDA_RANDOM_XORSHIFT64( 132141141 )
+CUDA_RANDOM_XORSHIFT1024( 52428813 )
+CUDA_SORT_UNSIGNED(171)
+
+#undef CUDA_RANDOM_XORSHIFT64
+#undef CUDA_RANDOM_XORSHIFT1024
+#undef CUDA_SORT_UNSIGNED
}
-#endif
+
+#endif /* #ifdef KOKKOS_HAVE_CUDA */
+
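The test targets above link TestCuda.o (and the other backend objects) against UnitTestMain.o, which is not part of this patch. A minimal Google Test driver of the following shape is assumed; the file name and contents are hypothetical, shown only to make the build layout clear.

// Hypothetical UnitTestMain.cpp (assumed; not shown in this patch): a standard
// Google Test driver that runs the TEST_F cases registered by TestCuda.cpp,
// TestOpenMP.cpp, TestSerial.cpp, and TestThreads.cpp.
#include <gtest/gtest.h>

int main(int argc, char* argv[]) {
  ::testing::InitGoogleTest(&argc, argv);
  return RUN_ALL_TESTS();
}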
diff --git a/lib/kokkos/core/src/Kokkos_CudaTypes.hpp b/lib/kokkos/algorithms/unit_tests/TestOpenMP.cpp
similarity index 52%
copy from lib/kokkos/core/src/Kokkos_CudaTypes.hpp
copy to lib/kokkos/algorithms/unit_tests/TestOpenMP.cpp
index 899e7e1fa..4b06dffcb 100755
--- a/lib/kokkos/core/src/Kokkos_CudaTypes.hpp
+++ b/lib/kokkos/algorithms/unit_tests/TestOpenMP.cpp
@@ -1,139 +1,102 @@
/*
//@HEADER
// ************************************************************************
//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
//
// ************************************************************************
//@HEADER
*/
-#ifndef KOKKOS_CUDATYPES_HPP
-#define KOKKOS_CUDATYPES_HPP
+#include <gtest/gtest.h>
-#include <Kokkos_Macros.hpp>
+#include <Kokkos_Core.hpp>
//----------------------------------------------------------------------------
-//----------------------------------------------------------------------------
-
-#if defined( __CUDACC__ )
-
-namespace Kokkos {
-
-typedef ::int2 int2 ;
-typedef ::int3 int3 ;
-typedef ::int4 int4 ;
-
-typedef ::float2 float2 ;
-typedef ::float3 float3 ;
-typedef ::float4 float4 ;
-
-typedef ::double2 double2 ;
-typedef ::double3 double3 ;
-typedef ::double4 double4 ;
-
-} // namespace Kokkos
-
-//----------------------------------------------------------------------------
-//----------------------------------------------------------------------------
-
-#else /* NOT #if defined( __CUDACC__ ) */
-
-namespace Kokkos {
+#include <TestRandom.hpp>
+#include <TestSort.hpp>
+#include <iomanip>
-struct int2 {
- int x;
- int y;
-};
+namespace Test {
-struct int3 {
- int x;
- int y;
- int z;
-};
+#ifdef KOKKOS_HAVE_OPENMP
+class openmp : public ::testing::Test {
+protected:
+ static void SetUpTestCase()
+ {
+ std::cout << std::setprecision(5) << std::scientific;
-struct int4 {
- int x;
- int y;
- int z;
- int w;
-};
+ unsigned threads_count = omp_get_max_threads();
-struct float2 {
- float x;
- float y;
-};
+ if ( Kokkos::hwloc::available() ) {
+ threads_count = Kokkos::hwloc::get_available_numa_count() *
+ Kokkos::hwloc::get_available_cores_per_numa();
+ }
-struct float3 {
- float x;
- float y;
- float z;
-};
+ Kokkos::OpenMP::initialize( threads_count );
+ }
-struct float4 {
- float x;
- float y;
- float z;
- float w;
+ static void TearDownTestCase()
+ {
+ Kokkos::OpenMP::finalize();
+ }
};
-struct double2 {
- double x;
- double y;
-};
+#define OPENMP_RANDOM_XORSHIFT64( num_draws ) \
+ TEST_F( openmp, Random_XorShift64 ) { \
+ Impl::test_random<Kokkos::Random_XorShift64_Pool<Kokkos::OpenMP> >(num_draws); \
+ }
-struct double3 {
- double x;
- double y;
- double z;
-};
+#define OPENMP_RANDOM_XORSHIFT1024( num_draws ) \
+ TEST_F( openmp, Random_XorShift1024 ) { \
+ Impl::test_random<Kokkos::Random_XorShift1024_Pool<Kokkos::OpenMP> >(num_draws); \
+ }
-struct double4 {
- double x;
- double y;
- double z;
- double w;
-};
+#define OPENMP_SORT_UNSIGNED( size ) \
+ TEST_F( openmp, SortUnsigned ) { \
+ Impl::test_sort< Kokkos::OpenMP, unsigned >(size); \
+ }
-} // namespace Kokkos
+OPENMP_RANDOM_XORSHIFT64( 10240000 )
+OPENMP_RANDOM_XORSHIFT1024( 10130144 )
+OPENMP_SORT_UNSIGNED(171)
+#undef OPENMP_RANDOM_XORSHIFT64
+#undef OPENMP_RANDOM_XORSHIFT1024
+#undef OPENMP_SORT_UNSIGNED
#endif
-
-//----------------------------------------------------------------------------
-//----------------------------------------------------------------------------
-
-#endif /* #define KOKKOS_CUDATYPES_HPP */
+} // namespace Test
diff --git a/lib/kokkos/algorithms/unit_tests/TestRandom.hpp b/lib/kokkos/algorithms/unit_tests/TestRandom.hpp
new file mode 100755
index 000000000..eade74ed9
--- /dev/null
+++ b/lib/kokkos/algorithms/unit_tests/TestRandom.hpp
@@ -0,0 +1,476 @@
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+
+#ifndef KOKKOS_TEST_DUALVIEW_HPP
+#define KOKKOS_TEST_DUALVIEW_HPP
+
+#include <gtest/gtest.h>
+#include <iostream>
+#include <cstdlib>
+#include <cstdio>
+#include <impl/Kokkos_Timer.hpp>
+#include <Kokkos_Core.hpp>
+#include <Kokkos_Random.hpp>
+#include <cmath>
+
+namespace Test {
+
+namespace Impl{
+
+// This test runs the random number generators and uses some statistical tests to
+// check the 'goodness' of the random numbers:
+// (i) mean: the mean is expected to be 0.5*RAND_MAX
+// (ii) variance: the variance is 1/3*mean*mean
+// (iii) covariance: the covariance is 0
+// (iv) 1-tupledistr: the mean, variance and covariance of a 1D histogram of random numbers
+// (v) 3-tupledistr: the mean, variance and covariance of a 3D histogram of random numbers
+
+#define HIST_DIM3D 24
+#define HIST_DIM1D (HIST_DIM3D*HIST_DIM3D*HIST_DIM3D)
+
+struct RandomProperties {
+ uint64_t count;
+ double mean;
+ double variance;
+ double covariance;
+ double min;
+ double max;
+
+ KOKKOS_INLINE_FUNCTION
+ RandomProperties() {
+ count = 0;
+ mean = 0.0;
+ variance = 0.0;
+ covariance = 0.0;
+ min = 1e64;
+ max = -1e64;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ RandomProperties& operator+=(const RandomProperties& add) {
+ count += add.count;
+ mean += add.mean;
+ variance += add.variance;
+ covariance += add.covariance;
+ min = add.min<min?add.min:min;
+ max = add.max>max?add.max:max;
+ return *this;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator+=(const volatile RandomProperties& add) volatile {
+ count += add.count;
+ mean += add.mean;
+ variance += add.variance;
+ covariance += add.covariance;
+ min = add.min<min?add.min:min;
+ max = add.max>max?add.max:max;
+ }
+};
+
+template<class GeneratorPool, class Scalar>
+struct test_random_functor {
+ typedef typename GeneratorPool::generator_type rnd_type;
+
+ typedef RandomProperties value_type;
+ typedef typename GeneratorPool::device_type device_type;
+
+ GeneratorPool rand_pool;
+ const double mean;
+
+ // NOTE (mfh 03 Nov 2014): Kokkos::rand::max() is supposed to define
+ // an exclusive upper bound on the range of random numbers that
+ // draw() can generate. However, for the float specialization, some
+ // implementations might violate this upper bound, due to rounding
+ // error. Just in case, we leave an extra space at the end of each
+ // dimension, in the View types below.
+ typedef Kokkos::View<int[HIST_DIM1D+1],typename GeneratorPool::device_type> type_1d;
+ type_1d density_1d;
+ typedef Kokkos::View<int[HIST_DIM3D+1][HIST_DIM3D+1][HIST_DIM3D+1],typename GeneratorPool::device_type> type_3d;
+ type_3d density_3d;
+
+ test_random_functor (GeneratorPool rand_pool_, type_1d d1d, type_3d d3d) :
+ rand_pool (rand_pool_),
+ mean (0.5*Kokkos::rand<rnd_type,Scalar>::max ()),
+ density_1d (d1d),
+ density_3d (d3d)
+ {}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (int i, RandomProperties& prop) const {
+ using Kokkos::atomic_fetch_add;
+
+ rnd_type rand_gen = rand_pool.get_state();
+ for (int k = 0; k < 1024; ++k) {
+ const Scalar tmp = Kokkos::rand<rnd_type,Scalar>::draw(rand_gen);
+ prop.count++;
+ prop.mean += tmp;
+ prop.variance += (tmp-mean)*(tmp-mean);
+ const Scalar tmp2 = Kokkos::rand<rnd_type,Scalar>::draw(rand_gen);
+ prop.count++;
+ prop.mean += tmp2;
+ prop.variance += (tmp2-mean)*(tmp2-mean);
+ prop.covariance += (tmp-mean)*(tmp2-mean);
+ const Scalar tmp3 = Kokkos::rand<rnd_type,Scalar>::draw(rand_gen);
+ prop.count++;
+ prop.mean += tmp3;
+ prop.variance += (tmp3-mean)*(tmp3-mean);
+ prop.covariance += (tmp2-mean)*(tmp3-mean);
+
+ // NOTE (mfh 03 Nov 2014): Kokkos::rand::max() is supposed to
+ // define an exclusive upper bound on the range of random
+ // numbers that draw() can generate. However, for the float
+ // specialization, some implementations might violate this upper
+ // bound, due to rounding error. Just in case, we have left an
+ // extra space at the end of each dimension of density_1d and
+ // density_3d.
+ //
+ // Please note that those extra entries might not get counted in
+ // the histograms. However, if Kokkos::rand is broken and only
+ // returns values of max(), the histograms will still catch this
+ // indirectly, since none of the other values will be filled in.
+
+ const Scalar theMax = Kokkos::rand<rnd_type, Scalar>::max ();
+
+ const uint64_t ind1_1d = static_cast<uint64_t> (1.0 * HIST_DIM1D * tmp / theMax);
+ const uint64_t ind2_1d = static_cast<uint64_t> (1.0 * HIST_DIM1D * tmp2 / theMax);
+ const uint64_t ind3_1d = static_cast<uint64_t> (1.0 * HIST_DIM1D * tmp3 / theMax);
+
+ const uint64_t ind1_3d = static_cast<uint64_t> (1.0 * HIST_DIM3D * tmp / theMax);
+ const uint64_t ind2_3d = static_cast<uint64_t> (1.0 * HIST_DIM3D * tmp2 / theMax);
+ const uint64_t ind3_3d = static_cast<uint64_t> (1.0 * HIST_DIM3D * tmp3 / theMax);
+
+ atomic_fetch_add (&density_1d(ind1_1d), 1);
+ atomic_fetch_add (&density_1d(ind2_1d), 1);
+ atomic_fetch_add (&density_1d(ind3_1d), 1);
+ atomic_fetch_add (&density_3d(ind1_3d, ind2_3d, ind3_3d), 1);
+ }
+ rand_pool.free_state(rand_gen);
+ }
+};
+
+template<class DeviceType>
+struct test_histogram1d_functor {
+ typedef RandomProperties value_type;
+ typedef typename DeviceType::execution_space execution_space;
+ typedef typename DeviceType::memory_space memory_space;
+
+ // NOTE (mfh 03 Nov 2014): Kokkos::rand::max() is supposed to define
+ // an exclusive upper bound on the range of random numbers that
+ // draw() can generate. However, for the float specialization, some
+ // implementations might violate this upper bound, due to rounding
+ // error. Just in case, we leave an extra space at the end of each
+ // dimension, in the View type below.
+ typedef Kokkos::View<int[HIST_DIM1D+1], memory_space> type_1d;
+ type_1d density_1d;
+ double mean;
+
+ test_histogram1d_functor (type_1d d1d, int num_draws) :
+ density_1d (d1d),
+ mean (1.0*num_draws/HIST_DIM1D*3)
+ {
+ printf ("Mean: %e\n", mean);
+ }
+
+ KOKKOS_INLINE_FUNCTION void
+ operator() (const typename memory_space::size_type i,
+ RandomProperties& prop) const
+ {
+ typedef typename memory_space::size_type size_type;
+ const double count = density_1d(i);
+ prop.mean += count;
+ prop.variance += 1.0 * (count - mean) * (count - mean);
+ //prop.covariance += 1.0*count*count;
+ prop.min = count < prop.min ? count : prop.min;
+ prop.max = count > prop.max ? count : prop.max;
+ if (i < static_cast<size_type> (HIST_DIM1D-1)) {
+ prop.covariance += (count - mean) * (density_1d(i+1) - mean);
+ }
+ }
+};
+
+template<class DeviceType>
+struct test_histogram3d_functor {
+ typedef RandomProperties value_type;
+ typedef typename DeviceType::execution_space execution_space;
+ typedef typename DeviceType::memory_space memory_space;
+
+ // NOTE (mfh 03 Nov 2014): Kokkos::rand::max() is supposed to define
+ // an exclusive upper bound on the range of random numbers that
+ // draw() can generate. However, for the float specialization, some
+ // implementations might violate this upper bound, due to rounding
+ // error. Just in case, we leave an extra space at the end of each
+ // dimension, in the View type below.
+ typedef Kokkos::View<int[HIST_DIM3D+1][HIST_DIM3D+1][HIST_DIM3D+1], memory_space> type_3d;
+ type_3d density_3d;
+ double mean;
+
+ test_histogram3d_functor (type_3d d3d, int num_draws) :
+ density_3d (d3d),
+ mean (1.0*num_draws/HIST_DIM1D)
+ {}
+
+ KOKKOS_INLINE_FUNCTION void
+ operator() (const typename memory_space::size_type i,
+ RandomProperties& prop) const
+ {
+ typedef typename memory_space::size_type size_type;
+ const double count = density_3d(i/(HIST_DIM3D*HIST_DIM3D),
+ (i % (HIST_DIM3D*HIST_DIM3D))/HIST_DIM3D,
+ i % HIST_DIM3D);
+ prop.mean += count;
+ prop.variance += (count - mean) * (count - mean);
+ if (i < static_cast<size_type> (HIST_DIM1D-1)) {
+ const double count_next = density_3d((i+1)/(HIST_DIM3D*HIST_DIM3D),
+ ((i+1)%(HIST_DIM3D*HIST_DIM3D))/HIST_DIM3D,
+ (i+1)%HIST_DIM3D);
+ prop.covariance += (count - mean) * (count_next - mean);
+ }
+ }
+};
+
+//
+// Templated test that uses the above functors.
+//
+template <class RandomGenerator,class Scalar>
+struct test_random_scalar {
+ typedef typename RandomGenerator::generator_type rnd_type;
+
+ int pass_mean,pass_var,pass_covar;
+ int pass_hist1d_mean,pass_hist1d_var,pass_hist1d_covar;
+ int pass_hist3d_mean,pass_hist3d_var,pass_hist3d_covar;
+
+ test_random_scalar (typename test_random_functor<RandomGenerator,int>::type_1d& density_1d,
+ typename test_random_functor<RandomGenerator,int>::type_3d& density_3d,
+ RandomGenerator& pool,
+ unsigned int num_draws)
+ {
+ using std::cerr;
+ using std::endl;
+ using Kokkos::parallel_reduce;
+
+ {
+ cerr << " -- Testing randomness properties" << endl;
+
+ RandomProperties result;
+ typedef test_random_functor<RandomGenerator, Scalar> functor_type;
+ parallel_reduce (num_draws/1024, functor_type (pool, density_1d, density_3d), result);
+
+ //printf("Result: %lf %lf %lf\n",result.mean/num_draws/3,result.variance/num_draws/3,result.covariance/num_draws/2);
+ double tolerance = 2.0*sqrt(1.0/num_draws);
+ double mean_expect = 0.5*Kokkos::rand<rnd_type,Scalar>::max();
+ double variance_expect = 1.0/3.0*mean_expect*mean_expect;
+ double mean_eps = mean_expect/(result.mean/num_draws/3)-1.0;
+ double variance_eps = variance_expect/(result.variance/num_draws/3)-1.0;
+ double covariance_eps = result.covariance/num_draws/2/variance_expect;
+ pass_mean = ((-tolerance < mean_eps) &&
+ ( tolerance > mean_eps)) ? 1:0;
+ pass_var = ((-tolerance < variance_eps) &&
+ ( tolerance > variance_eps)) ? 1:0;
+ pass_covar = ((-1.4*tolerance < covariance_eps) &&
+ ( 1.4*tolerance > covariance_eps)) ? 1:0;
+ cerr << "Pass: " << pass_mean
+ << " " << pass_var
+ << " " << mean_eps
+ << " " << variance_eps
+ << " " << covariance_eps
+ << " || " << tolerance << endl;
+ }
+ {
+ cerr << " -- Testing 1-D histogram" << endl;
+
+ RandomProperties result;
+ typedef test_histogram1d_functor<typename RandomGenerator::device_type> functor_type;
+ parallel_reduce (HIST_DIM1D, functor_type (density_1d, num_draws), result);
+
+ double tolerance = 6*sqrt(1.0/HIST_DIM1D);
+ double mean_expect = 1.0*num_draws*3/HIST_DIM1D;
+ double variance_expect = 1.0*num_draws*3/HIST_DIM1D*(1.0-1.0/HIST_DIM1D);
+ double covariance_expect = -1.0*num_draws*3/HIST_DIM1D/HIST_DIM1D;
+ double mean_eps = mean_expect/(result.mean/HIST_DIM1D)-1.0;
+ double variance_eps = variance_expect/(result.variance/HIST_DIM1D)-1.0;
+ double covariance_eps = (result.covariance/HIST_DIM1D - covariance_expect)/mean_expect;
+ pass_hist1d_mean = ((-tolerance < mean_eps) &&
+ ( tolerance > mean_eps)) ? 1:0;
+ pass_hist1d_var = ((-tolerance < variance_eps) &&
+ ( tolerance > variance_eps)) ? 1:0;
+ pass_hist1d_covar = ((-tolerance < covariance_eps) &&
+ ( tolerance > covariance_eps)) ? 1:0;
+
+ cerr << "Density 1D: " << mean_eps
+ << " " << variance_eps
+ << " " << (result.covariance/HIST_DIM1D/HIST_DIM1D)
+ << " || " << tolerance
+ << " " << result.min
+ << " " << result.max
+ << " || " << result.variance/HIST_DIM1D
+ << " " << 1.0*num_draws*3/HIST_DIM1D*(1.0-1.0/HIST_DIM1D)
+ << " || " << result.covariance/HIST_DIM1D
+ << " " << -1.0*num_draws*3/HIST_DIM1D/HIST_DIM1D
+ << endl;
+ }
+ {
+ cerr << " -- Testing 3-D histogram" << endl;
+
+ RandomProperties result;
+ typedef test_histogram3d_functor<typename RandomGenerator::device_type> functor_type;
+ parallel_reduce (HIST_DIM1D, functor_type (density_3d, num_draws), result);
+
+ double tolerance = 6*sqrt(1.0/HIST_DIM1D);
+ double mean_expect = 1.0*num_draws/HIST_DIM1D;
+ double variance_expect = 1.0*num_draws/HIST_DIM1D*(1.0-1.0/HIST_DIM1D);
+ double covariance_expect = -1.0*num_draws/HIST_DIM1D/HIST_DIM1D;
+ double mean_eps = mean_expect/(result.mean/HIST_DIM1D)-1.0;
+ double variance_eps = variance_expect/(result.variance/HIST_DIM1D)-1.0;
+ double covariance_eps = (result.covariance/HIST_DIM1D - covariance_expect)/mean_expect;
+ pass_hist3d_mean = ((-tolerance < mean_eps) &&
+ ( tolerance > mean_eps)) ? 1:0;
+ pass_hist3d_var = ((-tolerance < variance_eps) &&
+ ( tolerance > variance_eps)) ? 1:0;
+ pass_hist3d_covar = ((-tolerance < covariance_eps) &&
+ ( tolerance > covariance_eps)) ? 1:0;
+
+ cerr << "Density 3D: " << mean_eps
+ << " " << variance_eps
+ << " " << result.covariance/HIST_DIM1D/HIST_DIM1D
+ << " || " << tolerance
+ << " " << result.min
+ << " " << result.max << endl;
+ }
+ }
+};
+
+template <class RandomGenerator>
+void test_random(unsigned int num_draws)
+{
+ using std::cerr;
+ using std::endl;
+ typename test_random_functor<RandomGenerator,int>::type_1d density_1d("D1d");
+ typename test_random_functor<RandomGenerator,int>::type_3d density_3d("D3d");
+
+ cerr << "Test Scalar=int" << endl;
+ RandomGenerator pool(31891);
+ test_random_scalar<RandomGenerator,int> test_int(density_1d,density_3d,pool,num_draws);
+ ASSERT_EQ( test_int.pass_mean,1);
+ ASSERT_EQ( test_int.pass_var,1);
+ ASSERT_EQ( test_int.pass_covar,1);
+ ASSERT_EQ( test_int.pass_hist1d_mean,1);
+ ASSERT_EQ( test_int.pass_hist1d_var,1);
+ ASSERT_EQ( test_int.pass_hist1d_covar,1);
+ ASSERT_EQ( test_int.pass_hist3d_mean,1);
+ ASSERT_EQ( test_int.pass_hist3d_var,1);
+ ASSERT_EQ( test_int.pass_hist3d_covar,1);
+ deep_copy(density_1d,0);
+ deep_copy(density_3d,0);
+
+ cerr << "Test Scalar=unsigned int" << endl;
+ test_random_scalar<RandomGenerator,unsigned int> test_uint(density_1d,density_3d,pool,num_draws);
+ ASSERT_EQ( test_uint.pass_mean,1);
+ ASSERT_EQ( test_uint.pass_var,1);
+ ASSERT_EQ( test_uint.pass_covar,1);
+ ASSERT_EQ( test_uint.pass_hist1d_mean,1);
+ ASSERT_EQ( test_uint.pass_hist1d_var,1);
+ ASSERT_EQ( test_uint.pass_hist1d_covar,1);
+ ASSERT_EQ( test_uint.pass_hist3d_mean,1);
+ ASSERT_EQ( test_uint.pass_hist3d_var,1);
+ ASSERT_EQ( test_uint.pass_hist3d_covar,1);
+ deep_copy(density_1d,0);
+ deep_copy(density_3d,0);
+
+ cerr << "Test Scalar=int64_t" << endl;
+ test_random_scalar<RandomGenerator,int64_t> test_int64(density_1d,density_3d,pool,num_draws);
+ ASSERT_EQ( test_int64.pass_mean,1);
+ ASSERT_EQ( test_int64.pass_var,1);
+ ASSERT_EQ( test_int64.pass_covar,1);
+ ASSERT_EQ( test_int64.pass_hist1d_mean,1);
+ ASSERT_EQ( test_int64.pass_hist1d_var,1);
+ ASSERT_EQ( test_int64.pass_hist1d_covar,1);
+ ASSERT_EQ( test_int64.pass_hist3d_mean,1);
+ ASSERT_EQ( test_int64.pass_hist3d_var,1);
+ ASSERT_EQ( test_int64.pass_hist3d_covar,1);
+ deep_copy(density_1d,0);
+ deep_copy(density_3d,0);
+
+ cerr << "Test Scalar=uint64_t" << endl;
+ test_random_scalar<RandomGenerator,uint64_t> test_uint64(density_1d,density_3d,pool,num_draws);
+ ASSERT_EQ( test_uint64.pass_mean,1);
+ ASSERT_EQ( test_uint64.pass_var,1);
+ ASSERT_EQ( test_uint64.pass_covar,1);
+ ASSERT_EQ( test_uint64.pass_hist1d_mean,1);
+ ASSERT_EQ( test_uint64.pass_hist1d_var,1);
+ ASSERT_EQ( test_uint64.pass_hist1d_covar,1);
+ ASSERT_EQ( test_uint64.pass_hist3d_mean,1);
+ ASSERT_EQ( test_uint64.pass_hist3d_var,1);
+ ASSERT_EQ( test_uint64.pass_hist3d_covar,1);
+ deep_copy(density_1d,0);
+ deep_copy(density_3d,0);
+
+ cerr << "Test Scalar=float" << endl;
+ test_random_scalar<RandomGenerator,float> test_float(density_1d,density_3d,pool,num_draws);
+ ASSERT_EQ( test_float.pass_mean,1);
+ ASSERT_EQ( test_float.pass_var,1);
+ ASSERT_EQ( test_float.pass_covar,1);
+ ASSERT_EQ( test_float.pass_hist1d_mean,1);
+ ASSERT_EQ( test_float.pass_hist1d_var,1);
+ ASSERT_EQ( test_float.pass_hist1d_covar,1);
+ ASSERT_EQ( test_float.pass_hist3d_mean,1);
+ ASSERT_EQ( test_float.pass_hist3d_var,1);
+ ASSERT_EQ( test_float.pass_hist3d_covar,1);
+ deep_copy(density_1d,0);
+ deep_copy(density_3d,0);
+
+ cerr << "Test Scalar=double" << endl;
+ test_random_scalar<RandomGenerator,double> test_double(density_1d,density_3d,pool,num_draws);
+ ASSERT_EQ( test_double.pass_mean,1);
+ ASSERT_EQ( test_double.pass_var,1);
+ ASSERT_EQ( test_double.pass_covar,1);
+ ASSERT_EQ( test_double.pass_hist1d_mean,1);
+ ASSERT_EQ( test_double.pass_hist1d_var,1);
+ ASSERT_EQ( test_double.pass_hist1d_covar,1);
+ ASSERT_EQ( test_double.pass_hist3d_mean,1);
+ ASSERT_EQ( test_double.pass_hist3d_var,1);
+ ASSERT_EQ( test_double.pass_hist3d_covar,1);
+}
+}
+
+} // namespace Test
+
+#endif //KOKKOS_TEST_UNORDERED_MAP_HPP
diff --git a/lib/kokkos/core/src/Kokkos_Core.hpp b/lib/kokkos/algorithms/unit_tests/TestSerial.cpp
similarity index 56%
copy from lib/kokkos/core/src/Kokkos_Core.hpp
copy to lib/kokkos/algorithms/unit_tests/TestSerial.cpp
index 8f5f34bfd..741cf97ae 100755
--- a/lib/kokkos/core/src/Kokkos_Core.hpp
+++ b/lib/kokkos/algorithms/unit_tests/TestSerial.cpp
@@ -1,106 +1,99 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos
-// Manycore Performance-Portable Multidimensional Arrays
-//
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
-#ifndef KOKKOS_CORE_HPP
-#define KOKKOS_CORE_HPP
-
-//----------------------------------------------------------------------------
-// Include the execution space header files for the enabled execution spaces.
-
-#include <Kokkos_Core_fwd.hpp>
+#include <gtest/gtest.h>
-#if defined( KOKKOS_HAVE_CUDA )
-#include <Kokkos_Cuda.hpp>
-#endif
+#include <Kokkos_Core.hpp>
-#if defined( KOKKOS_HAVE_OPENMP )
-#include <Kokkos_OpenMP.hpp>
-#endif
+#include <TestRandom.hpp>
+#include <TestSort.hpp>
+#include <iomanip>
-#if defined( KOKKOS_HAVE_SERIAL )
-#include <Kokkos_Serial.hpp>
-#endif
-
-#if defined( KOKKOS_HAVE_PTHREAD )
-#include <Kokkos_Threads.hpp>
-#endif
-
-#include <Kokkos_Pair.hpp>
-#include <Kokkos_View.hpp>
-#include <Kokkos_Vectorization.hpp>
-#include <Kokkos_Atomic.hpp>
-#include <Kokkos_hwloc.hpp>
//----------------------------------------------------------------------------
-namespace Kokkos {
-struct InitArguments {
- int num_threads;
- int num_numa;
- int device_id;
+namespace Test {
+
+#ifdef KOKKOS_HAVE_SERIAL
+class serial : public ::testing::Test {
+protected:
+ static void SetUpTestCase()
+ {
+ std::cout << std::setprecision (5) << std::scientific;
+ Kokkos::Serial::initialize ();
+ }
- InitArguments() {
- num_threads = -1;
- num_numa = -1;
- device_id = -1;
+ static void TearDownTestCase ()
+ {
+ Kokkos::Serial::finalize ();
}
};
-void initialize(int& narg, char* arg[]);
+#define SERIAL_RANDOM_XORSHIFT64( num_draws ) \
+ TEST_F( serial, Random_XorShift64 ) { \
+ Impl::test_random<Kokkos::Random_XorShift64_Pool<Kokkos::Serial> >(num_draws); \
+ }
+
+#define SERIAL_RANDOM_XORSHIFT1024( num_draws ) \
+ TEST_F( serial, Random_XorShift1024 ) { \
+ Impl::test_random<Kokkos::Random_XorShift1024_Pool<Kokkos::Serial> >(num_draws); \
+ }
-void initialize(const InitArguments& args = InitArguments());
+#define SERIAL_SORT_UNSIGNED( size ) \
+ TEST_F( serial, SortUnsigned ) { \
+ Impl::test_sort< Kokkos::Serial, unsigned >(size); \
+ }
-/** \brief Finalize the spaces that were initialized via Kokkos::initialize */
-void finalize();
+SERIAL_RANDOM_XORSHIFT64( 10240000 )
+SERIAL_RANDOM_XORSHIFT1024( 10130144 )
+SERIAL_SORT_UNSIGNED(171)
-/** \brief Finalize all known execution spaces */
-void finalize_all();
+#undef SERIAL_RANDOM_XORSHIFT64
+#undef SERIAL_RANDOM_XORSHIFT1024
+#undef SERIAL_SORT_UNSIGNED
-void fence();
+#endif // KOKKOS_HAVE_SERIAL
+} // namespace Test
-}
-#endif
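For reference, each of the macros above expands to an ordinary gtest fixture test for the backend named in the fixture; nothing backend-specific happens beyond choosing the generator pool type. As a minimal illustration (assuming only the definitions shown in this file), SERIAL_RANDOM_XORSHIFT64( 10240000 ) yields after preprocessing:

TEST_F( serial, Random_XorShift64 ) {
  Impl::test_random<Kokkos::Random_XorShift64_Pool<Kokkos::Serial> >(10240000);
}

The Threads driver below follows the same expansion pattern.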
diff --git a/lib/kokkos/algorithms/unit_tests/TestSort.hpp b/lib/kokkos/algorithms/unit_tests/TestSort.hpp
new file mode 100755
index 000000000..ccbcbdd00
--- /dev/null
+++ b/lib/kokkos/algorithms/unit_tests/TestSort.hpp
@@ -0,0 +1,206 @@
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+
+#ifndef TESTSORT_HPP_
+#define TESTSORT_HPP_
+
+#include <gtest/gtest.h>
+#include<Kokkos_Core.hpp>
+#include<Kokkos_Random.hpp>
+#include<Kokkos_Sort.hpp>
+
+namespace Test {
+
+namespace Impl{
+
+template<class ExecutionSpace, class Scalar>
+struct is_sorted_struct {
+ typedef unsigned int value_type;
+ typedef ExecutionSpace execution_space;
+
+ Kokkos::View<Scalar*,ExecutionSpace> keys;
+
+ is_sorted_struct(Kokkos::View<Scalar*,ExecutionSpace> keys_):keys(keys_) {}
+ KOKKOS_INLINE_FUNCTION
+ void operator() (int i, unsigned int& count) const {
+ if(keys(i)>keys(i+1)) count++;
+ }
+};
+
+template<class ExecutionSpace, class Scalar>
+struct sum {
+ typedef double value_type;
+ typedef ExecutionSpace execution_space;
+
+ Kokkos::View<Scalar*,ExecutionSpace> keys;
+
+ sum(Kokkos::View<Scalar*,ExecutionSpace> keys_):keys(keys_) {}
+ KOKKOS_INLINE_FUNCTION
+ void operator() (int i, double& count) const {
+ count+=keys(i);
+ }
+};
+
+template<class ExecutionSpace, class Scalar>
+struct bin3d_is_sorted_struct {
+ typedef unsigned int value_type;
+ typedef ExecutionSpace execution_space;
+
+ Kokkos::View<Scalar*[3],ExecutionSpace> keys;
+
+ int max_bins;
+ Scalar min;
+ Scalar max;
+
+ bin3d_is_sorted_struct(Kokkos::View<Scalar*[3],ExecutionSpace> keys_,int max_bins_,Scalar min_,Scalar max_):
+ keys(keys_),max_bins(max_bins_),min(min_),max(max_) {
+ }
+ KOKKOS_INLINE_FUNCTION
+ void operator() (int i, unsigned int& count) const {
+ int ix1 = int ((keys(i,0)-min)/max * max_bins);
+ int iy1 = int ((keys(i,1)-min)/max * max_bins);
+ int iz1 = int ((keys(i,2)-min)/max * max_bins);
+ int ix2 = int ((keys(i+1,0)-min)/max * max_bins);
+ int iy2 = int ((keys(i+1,1)-min)/max * max_bins);
+ int iz2 = int ((keys(i+1,2)-min)/max * max_bins);
+
+ if (ix1>ix2) count++;
+ else if(ix1==ix2) {
+ if (iy1>iy2) count++;
+ else if ((iy1==iy2) && (iz1>iz2)) count++;
+ }
+ }
+};
+
+template<class ExecutionSpace, class Scalar>
+struct sum3D {
+ typedef double value_type;
+ typedef ExecutionSpace execution_space;
+
+ Kokkos::View<Scalar*[3],ExecutionSpace> keys;
+
+ sum3D(Kokkos::View<Scalar*[3],ExecutionSpace> keys_):keys(keys_) {}
+ KOKKOS_INLINE_FUNCTION
+ void operator() (int i, double& count) const {
+ count+=keys(i,0);
+ count+=keys(i,1);
+ count+=keys(i,2);
+ }
+};
+
+template<class ExecutionSpace, typename KeyType>
+void test_1D_sort(unsigned int n,bool force_kokkos) {
+ typedef Kokkos::View<KeyType*,ExecutionSpace> KeyViewType;
+ KeyViewType keys("Keys",n);
+
+ Kokkos::Random_XorShift64_Pool<ExecutionSpace> g(1931);
+ Kokkos::fill_random(keys,g,Kokkos::Random_XorShift64_Pool<ExecutionSpace>::generator_type::MAX_URAND);
+
+ double sum_before = 0.0;
+ double sum_after = 0.0;
+ unsigned int sort_fails = 0;
+
+ Kokkos::parallel_reduce(n,sum<ExecutionSpace, KeyType>(keys),sum_before);
+
+ Kokkos::sort(keys,force_kokkos);
+
+ Kokkos::parallel_reduce(n,sum<ExecutionSpace, KeyType>(keys),sum_after);
+ Kokkos::parallel_reduce(n-1,is_sorted_struct<ExecutionSpace, KeyType>(keys),sort_fails);
+
+ double ratio = sum_before/sum_after;
+ double epsilon = 1e-10;
+ unsigned int equal_sum = (ratio > (1.0-epsilon)) && (ratio < (1.0+epsilon)) ? 1 : 0;
+
+ ASSERT_EQ(sort_fails,0);
+ ASSERT_EQ(equal_sum,1);
+}
+
+template<class ExecutionSpace, typename KeyType>
+void test_3D_sort(unsigned int n) {
+ typedef Kokkos::View<KeyType*[3],ExecutionSpace > KeyViewType;
+
+ KeyViewType keys("Keys",n*n*n);
+
+ Kokkos::Random_XorShift64_Pool<ExecutionSpace> g(1931);
+ Kokkos::fill_random(keys,g,100.0);
+
+ double sum_before = 0.0;
+ double sum_after = 0.0;
+ unsigned int sort_fails = 0;
+
+ Kokkos::parallel_reduce(keys.dimension_0(),sum3D<ExecutionSpace, KeyType>(keys),sum_before);
+
+ int bin_1d = 1;
+ while( bin_1d*bin_1d*bin_1d*4< (int) keys.dimension_0() ) bin_1d*=2;
+ int bin_max[3] = {bin_1d,bin_1d,bin_1d};
+ typename KeyViewType::value_type min[3] = {0,0,0};
+ typename KeyViewType::value_type max[3] = {100,100,100};
+
+ typedef Kokkos::SortImpl::DefaultBinOp3D< KeyViewType > BinOp;
+ BinOp bin_op(bin_max,min,max);
+ Kokkos::BinSort< KeyViewType , BinOp >
+ Sorter(keys,bin_op,false);
+ Sorter.create_permute_vector();
+ Sorter.template sort< KeyViewType >(keys);
+
+ Kokkos::parallel_reduce(keys.dimension_0(),sum3D<ExecutionSpace, KeyType>(keys),sum_after);
+ Kokkos::parallel_reduce(keys.dimension_0()-1,bin3d_is_sorted_struct<ExecutionSpace, KeyType>(keys,bin_1d,min[0],max[0]),sort_fails);
+
+ double ratio = sum_before/sum_after;
+ double epsilon = 1e-10;
+ unsigned int equal_sum = (ratio > (1.0-epsilon)) && (ratio < (1.0+epsilon)) ? 1 : 0;
+
+ printf("3D Sort Sum: %f %f Fails: %u\n",sum_before,sum_after,sort_fails);
+ ASSERT_EQ(sort_fails,0);
+ ASSERT_EQ(equal_sum,1);
+}
+
+template<class ExecutionSpace, typename KeyType>
+void test_sort(unsigned int N)
+{
+ test_1D_sort<ExecutionSpace,KeyType>(N*N*N, true);
+ test_1D_sort<ExecutionSpace,KeyType>(N*N*N, false);
+ test_3D_sort<ExecutionSpace,KeyType>(N);
+}
+
+}
+}
+#endif /* TESTSORT_HPP_ */
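The sort tests above validate results in two independent ways: a checksum of the keys must be unchanged, since sorting may only permute the data, and a parallel reduction over adjacent pairs must find no out-of-order element. A minimal standalone sketch of the same verification idea, with std::sort standing in for Kokkos::sort:

#include <vector>
#include <numeric>
#include <algorithm>
#include <cassert>

void verify_sort_sketch(std::vector<unsigned> keys) {
  const double sum_before = std::accumulate(keys.begin(), keys.end(), 0.0);
  std::sort(keys.begin(), keys.end());            // stand-in for Kokkos::sort(keys, force_kokkos)
  const double sum_after  = std::accumulate(keys.begin(), keys.end(), 0.0);
  unsigned sort_fails = 0;                        // count adjacent out-of-order pairs
  for (std::size_t i = 0; i + 1 < keys.size(); ++i)
    if (keys[i] > keys[i + 1]) ++sort_fails;
  assert(sort_fails == 0);                        // keys are nondecreasing
  assert(sum_before == sum_after);                // sort only permuted the keys
}

The device tests compare the two sums with a relative tolerance of 1e-10 rather than exact equality because the parallel reductions may accumulate in a different order.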
diff --git a/lib/kokkos/core/src/Kokkos_CudaTypes.hpp b/lib/kokkos/algorithms/unit_tests/TestThreads.cpp
similarity index 50%
rename from lib/kokkos/core/src/Kokkos_CudaTypes.hpp
rename to lib/kokkos/algorithms/unit_tests/TestThreads.cpp
index 899e7e1fa..a61d6c8bd 100755
--- a/lib/kokkos/core/src/Kokkos_CudaTypes.hpp
+++ b/lib/kokkos/algorithms/unit_tests/TestThreads.cpp
@@ -1,139 +1,113 @@
/*
//@HEADER
// ************************************************************************
//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
//
// ************************************************************************
//@HEADER
*/
-#ifndef KOKKOS_CUDATYPES_HPP
-#define KOKKOS_CUDATYPES_HPP
+#include <gtest/gtest.h>
-#include <Kokkos_Macros.hpp>
+#include <Kokkos_Core.hpp>
-//----------------------------------------------------------------------------
-//----------------------------------------------------------------------------
+#include <TestRandom.hpp>
+#include <TestSort.hpp>
+#include <iomanip>
-#if defined( __CUDACC__ )
-namespace Kokkos {
+//----------------------------------------------------------------------------
-typedef ::int2 int2 ;
-typedef ::int3 int3 ;
-typedef ::int4 int4 ;
-typedef ::float2 float2 ;
-typedef ::float3 float3 ;
-typedef ::float4 float4 ;
+namespace Test {
-typedef ::double2 double2 ;
-typedef ::double3 double3 ;
-typedef ::double4 double4 ;
+#ifdef KOKKOS_HAVE_PTHREAD
+class threads : public ::testing::Test {
+protected:
+ static void SetUpTestCase()
+ {
+ std::cout << std::setprecision(5) << std::scientific;
-} // namespace Kokkos
+ unsigned num_threads = 4;
-//----------------------------------------------------------------------------
-//----------------------------------------------------------------------------
+ if (Kokkos::hwloc::available()) {
+ num_threads = Kokkos::hwloc::get_available_numa_count()
+ * Kokkos::hwloc::get_available_cores_per_numa()
+ // * Kokkos::hwloc::get_available_threads_per_core()
+ ;
-#else /* NOT #if defined( __CUDACC__ ) */
+ }
-namespace Kokkos {
-
-struct int2 {
- int x;
- int y;
-};
-
-struct int3 {
- int x;
- int y;
- int z;
-};
+ std::cout << "Threads: " << num_threads << std::endl;
-struct int4 {
- int x;
- int y;
- int z;
- int w;
-};
+ Kokkos::Threads::initialize( num_threads );
+ }
-struct float2 {
- float x;
- float y;
+ static void TearDownTestCase()
+ {
+ Kokkos::Threads::finalize();
+ }
};
-struct float3 {
- float x;
- float y;
- float z;
-};
+#define THREADS_RANDOM_XORSHIFT64( num_draws ) \
+ TEST_F( threads, Random_XorShift64 ) { \
+ Impl::test_random<Kokkos::Random_XorShift64_Pool<Kokkos::Threads> >(num_draws); \
+ }
-struct float4 {
- float x;
- float y;
- float z;
- float w;
-};
+#define THREADS_RANDOM_XORSHIFT1024( num_draws ) \
+ TEST_F( threads, Random_XorShift1024 ) { \
+ Impl::test_random<Kokkos::Random_XorShift1024_Pool<Kokkos::Threads> >(num_draws); \
+ }
-struct double2 {
- double x;
- double y;
-};
+#define THREADS_SORT_UNSIGNED( size ) \
+ TEST_F( threads, SortUnsigned ) { \
+ Impl::test_sort< Kokkos::Threads, double >(size); \
+ }
-struct double3 {
- double x;
- double y;
- double z;
-};
-struct double4 {
- double x;
- double y;
- double z;
- double w;
-};
+THREADS_RANDOM_XORSHIFT64( 10240000 )
+THREADS_RANDOM_XORSHIFT1024( 10130144 )
+THREADS_SORT_UNSIGNED(171)
-} // namespace Kokkos
+#undef THREADS_RANDOM_XORSHIFT64
+#undef THREADS_RANDOM_XORSHIFT1024
+#undef THREADS_SORT_UNSIGNED
#endif
+} // namespace Test
-//----------------------------------------------------------------------------
-//----------------------------------------------------------------------------
-
-#endif /* #define KOKKOS_CUDATYPES_HPP */
diff --git a/lib/kokkos/core/src/impl/Kokkos_spinwait.hpp b/lib/kokkos/algorithms/unit_tests/UnitTestMain.cpp
similarity index 74%
copy from lib/kokkos/core/src/impl/Kokkos_spinwait.hpp
copy to lib/kokkos/algorithms/unit_tests/UnitTestMain.cpp
index 966291abd..f952ab3db 100755
--- a/lib/kokkos/core/src/impl/Kokkos_spinwait.hpp
+++ b/lib/kokkos/algorithms/unit_tests/UnitTestMain.cpp
@@ -1,64 +1,50 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
+#include <gtest/gtest.h>
-#ifndef KOKKOS_SPINWAIT_HPP
-#define KOKKOS_SPINWAIT_HPP
-
-#include <Kokkos_Macros.hpp>
-
-namespace Kokkos {
-namespace Impl {
-
-#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
-void spinwait( volatile int & flag , const int value );
-#else
-KOKKOS_INLINE_FUNCTION
-void spinwait( volatile int & , const int ) {}
-#endif
-
-} /* namespace Impl */
-} /* namespace Kokkos */
-
-#endif /* #ifndef KOKKOS_SPINWAIT_HPP */
+int main(int argc, char *argv[]) {
+ ::testing::InitGoogleTest(&argc,argv);
+ return RUN_ALL_TESTS();
+}
diff --git a/lib/kokkos/containers/performance_tests/Makefile b/lib/kokkos/containers/performance_tests/Makefile
new file mode 100755
index 000000000..7ced94528
--- /dev/null
+++ b/lib/kokkos/containers/performance_tests/Makefile
@@ -0,0 +1,81 @@
+KOKKOS_PATH = ../..
+
+GTEST_PATH = ../../TPL/gtest
+
+vpath %.cpp ${KOKKOS_PATH}/containers/performance_tests
+
+default: build_all
+ echo "End Build"
+
+
+include $(KOKKOS_PATH)/Makefile.kokkos
+
+ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
+ CXX = nvcc_wrapper
+ CXXFLAGS ?= -O3
+ LINK = $(CXX)
+ LDFLAGS ?= -lpthread
+else
+ CXX ?= g++
+ CXXFLAGS ?= -O3
+ LINK ?= $(CXX)
+ LDFLAGS ?= -lpthread
+endif
+
+KOKKOS_CXXFLAGS += -I$(GTEST_PATH) -I${KOKKOS_PATH}/containers/performance_tests
+
+TEST_TARGETS =
+TARGETS =
+
+ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
+ OBJ_CUDA = TestCuda.o TestMain.o gtest-all.o
+ TARGETS += KokkosContainers_PerformanceTest_Cuda
+ TEST_TARGETS += test-cuda
+endif
+
+ifeq ($(KOKKOS_INTERNAL_USE_PTHREADS), 1)
+ OBJ_THREADS = TestThreads.o TestMain.o gtest-all.o
+ TARGETS += KokkosContainers_PerformanceTest_Threads
+ TEST_TARGETS += test-threads
+endif
+
+ifeq ($(KOKKOS_INTERNAL_USE_OPENMP), 1)
+ OBJ_OPENMP = TestOpenMP.o TestMain.o gtest-all.o
+ TARGETS += KokkosContainers_PerformanceTest_OpenMP
+ TEST_TARGETS += test-openmp
+endif
+
+KokkosContainers_PerformanceTest_Cuda: $(OBJ_CUDA) $(KOKKOS_LINK_DEPENDS)
+ $(LINK) $(KOKKOS_LDFLAGS) $(LDFLAGS) $(EXTRA_PATH) $(OBJ_CUDA) $(KOKKOS_LIBS) $(LIB) -o KokkosContainers_PerformanceTest_Cuda
+
+KokkosContainers_PerformanceTest_Threads: $(OBJ_THREADS) $(KOKKOS_LINK_DEPENDS)
+ $(LINK) $(KOKKOS_LDFLAGS) $(LDFLAGS) $(EXTRA_PATH) $(OBJ_THREADS) $(KOKKOS_LIBS) $(LIB) -o KokkosContainers_PerformanceTest_Threads
+
+KokkosContainers_PerformanceTest_OpenMP: $(OBJ_OPENMP) $(KOKKOS_LINK_DEPENDS)
+ $(LINK) $(KOKKOS_LDFLAGS) $(LDFLAGS) $(EXTRA_PATH) $(OBJ_OPENMP) $(KOKKOS_LIBS) $(LIB) -o KokkosContainers_PerformanceTest_OpenMP
+
+test-cuda: KokkosContainers_PerformanceTest_Cuda
+ ./KokkosContainers_PerformanceTest_Cuda
+
+test-threads: KokkosContainers_PerformanceTest_Threads
+ ./KokkosContainers_PerformanceTest_Threads
+
+test-openmp: KokkosContainers_PerformanceTest_OpenMP
+ ./KokkosContainers_PerformanceTest_OpenMP
+
+
+build_all: $(TARGETS)
+
+test: $(TEST_TARGETS)
+
+clean: kokkos-clean
+ rm -f *.o $(TARGETS)
+
+# Compilation rules
+
+%.o:%.cpp $(KOKKOS_CPP_DEPENDS)
+ $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) $(EXTRA_INC) -c $<
+
+gtest-all.o:$(GTEST_PATH)/gtest/gtest-all.cc
+ $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) $(EXTRA_INC) -c $(GTEST_PATH)/gtest/gtest-all.cc
+
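Assuming Kokkos has been configured so that Makefile.kokkos sets the corresponding KOKKOS_INTERNAL_USE_* flags, running make (or make build_all) in this directory builds the enabled KokkosContainers_PerformanceTest_* executables, make test runs every enabled test-<backend> target, and make clean removes the objects and binaries; each test-<backend> target simply executes the matching binary.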
diff --git a/lib/kokkos/core/src/impl/Kokkos_PhysicalLayout.hpp b/lib/kokkos/containers/performance_tests/TestCuda.cpp
similarity index 57%
copy from lib/kokkos/core/src/impl/Kokkos_PhysicalLayout.hpp
copy to lib/kokkos/containers/performance_tests/TestCuda.cpp
index 0dcb3977a..aee262de9 100755
--- a/lib/kokkos/core/src/impl/Kokkos_PhysicalLayout.hpp
+++ b/lib/kokkos/containers/performance_tests/TestCuda.cpp
@@ -1,84 +1,100 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
-#ifndef KOKKOS_PHYSICAL_LAYOUT_HPP
-#define KOKKOS_PHYSICAL_LAYOUT_HPP
+#include <stdint.h>
+#include <string>
+#include <iostream>
+#include <iomanip>
+#include <sstream>
+#include <fstream>
+
+#include <gtest/gtest.h>
+
+#include <Kokkos_Core.hpp>
+#if defined( KOKKOS_HAVE_CUDA )
-#include <Kokkos_View.hpp>
-namespace Kokkos {
-namespace Impl {
+#include <Kokkos_UnorderedMap.hpp>
+#include <TestGlobal2LocalIds.hpp>
+#include <TestUnorderedMapPerformance.hpp>
-struct PhysicalLayout {
- enum LayoutType {Left,Right,Scalar,Error};
- LayoutType layout_type;
- int rank;
- long long int stride[8]; //distance between two neighboring elements in a given dimension
+namespace Performance {
- template< class T , class L , class D , class M >
- PhysicalLayout( const View<T,L,D,M,ViewDefault> & view )
- : layout_type( is_same< typename View<T,L,D,M>::array_layout , LayoutLeft >::value ? Left : (
- is_same< typename View<T,L,D,M>::array_layout , LayoutRight >::value ? Right : Error ))
- , rank( view.Rank )
- {
- for(int i=0;i<8;i++) stride[i] = 0;
- view.stride( stride );
- }
- #ifdef KOKKOS_HAVE_CUDA
- template< class T , class L , class D , class M >
- PhysicalLayout( const View<T,L,D,M,ViewCudaTexture> & view )
- : layout_type( is_same< typename View<T,L,D,M>::array_layout , LayoutLeft >::value ? Left : (
- is_same< typename View<T,L,D,M>::array_layout , LayoutRight >::value ? Right : Error ))
- , rank( view.Rank )
- {
- for(int i=0;i<8;i++) stride[i] = 0;
- view.stride( stride );
- }
- #endif
+class cuda : public ::testing::Test {
+protected:
+ static void SetUpTestCase()
+ {
+ std::cout << std::setprecision(5) << std::scientific;
+ Kokkos::HostSpace::execution_space::initialize();
+ Kokkos::Cuda::initialize( Kokkos::Cuda::SelectDevice(0) );
+ }
+ static void TearDownTestCase()
+ {
+ Kokkos::Cuda::finalize();
+ Kokkos::HostSpace::execution_space::finalize();
+ }
};
+TEST_F( cuda, global_2_local)
+{
+ std::cout << "Cuda" << std::endl;
+ std::cout << "size, create, generate, fill, find" << std::endl;
+ for (unsigned i=Performance::begin_id_size; i<=Performance::end_id_size; i *= Performance::id_step)
+ test_global_to_local_ids<Kokkos::Cuda>(i);
}
+
+TEST_F( cuda, unordered_map_performance_near)
+{
+ Perf::run_performance_tests<Kokkos::Cuda,true>("cuda-near");
+}
+
+TEST_F( cuda, unordered_map_performance_far)
+{
+ Perf::run_performance_tests<Kokkos::Cuda,false>("cuda-far");
}
-#endif
+
+}
+
+#endif /* #if defined( KOKKOS_HAVE_CUDA ) */
diff --git a/lib/kokkos/containers/performance_tests/TestGlobal2LocalIds.hpp b/lib/kokkos/containers/performance_tests/TestGlobal2LocalIds.hpp
new file mode 100755
index 000000000..fb70b8fe2
--- /dev/null
+++ b/lib/kokkos/containers/performance_tests/TestGlobal2LocalIds.hpp
@@ -0,0 +1,231 @@
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+
+#ifndef KOKKOS_TEST_GLOBAL_TO_LOCAL_IDS_HPP
+#define KOKKOS_TEST_GLOBAL_TO_LOCAL_IDS_HPP
+
+#include <Kokkos_Core.hpp>
+#include <Kokkos_UnorderedMap.hpp>
+#include <vector>
+#include <algorithm>
+
+#include <impl/Kokkos_Timer.hpp>
+
+// This test will simulate global ids
+
+namespace Performance {
+
+static const unsigned begin_id_size = 256u;
+static const unsigned end_id_size = 1u << 22;
+static const unsigned id_step = 2u;
+
+union helper
+{
+ uint32_t word;
+ uint8_t byte[4];
+};
+
+
+template <typename Device>
+struct generate_ids
+{
+ typedef Device execution_space;
+ typedef typename execution_space::size_type size_type;
+ typedef Kokkos::View<uint32_t*,execution_space> local_id_view;
+
+ local_id_view local_2_global;
+
+ generate_ids( local_id_view & ids)
+ : local_2_global(ids)
+ {
+ Kokkos::parallel_for(local_2_global.dimension_0(), *this);
+ }
+
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()(size_type i) const
+ {
+
+ helper x = {static_cast<uint32_t>(i)};
+
+ // shuffle the bytes of i to create a unique, semi-random global_id
+ x.word = ~x.word;
+
+ uint8_t tmp = x.byte[3];
+ x.byte[3] = x.byte[1];
+ x.byte[1] = tmp;
+
+ tmp = x.byte[2];
+ x.byte[2] = x.byte[0];
+ x.byte[0] = tmp;
+
+ local_2_global[i] = x.word;
+ }
+
+};
+
+template <typename Device>
+struct fill_map
+{
+ typedef Device execution_space;
+ typedef typename execution_space::size_type size_type;
+ typedef Kokkos::View<const uint32_t*,execution_space, Kokkos::MemoryRandomAccess> local_id_view;
+ typedef Kokkos::UnorderedMap<uint32_t,size_type,execution_space> global_id_view;
+
+ global_id_view global_2_local;
+ local_id_view local_2_global;
+
+ fill_map( global_id_view gIds, local_id_view lIds)
+ : global_2_local(gIds) , local_2_global(lIds)
+ {
+ Kokkos::parallel_for(local_2_global.dimension_0(), *this);
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()(size_type i) const
+ {
+ global_2_local.insert( local_2_global[i], i);
+ }
+
+};
+
+template <typename Device>
+struct find_test
+{
+ typedef Device execution_space;
+ typedef typename execution_space::size_type size_type;
+ typedef Kokkos::View<const uint32_t*,execution_space, Kokkos::MemoryRandomAccess> local_id_view;
+ typedef Kokkos::UnorderedMap<const uint32_t, const size_type,execution_space> global_id_view;
+
+ global_id_view global_2_local;
+ local_id_view local_2_global;
+
+ typedef size_t value_type;
+
+ find_test( global_id_view gIds, local_id_view lIds, value_type & num_errors)
+ : global_2_local(gIds) , local_2_global(lIds)
+ {
+ Kokkos::parallel_reduce(local_2_global.dimension_0(), *this, num_errors);
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void init(value_type & v) const
+ { v = 0; }
+
+ KOKKOS_INLINE_FUNCTION
+ void join(volatile value_type & dst, volatile value_type const & src) const
+ { dst += src; }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()(size_type i, value_type & num_errors) const
+ {
+ uint32_t index = global_2_local.find( local_2_global[i] );
+
+ if ( global_2_local.value_at(index) != i) ++num_errors;
+ }
+
+};
+
+template <typename Device>
+void test_global_to_local_ids(unsigned num_ids)
+{
+
+ typedef Device execution_space;
+ typedef typename execution_space::size_type size_type;
+
+ typedef Kokkos::View<uint32_t*,execution_space> local_id_view;
+ typedef Kokkos::UnorderedMap<uint32_t,size_type,execution_space> global_id_view;
+
+ //size
+ std::cout << num_ids << ", ";
+
+ double elapsed_time = 0;
+ Kokkos::Impl::Timer timer;
+
+ local_id_view local_2_global("local_ids", num_ids);
+ global_id_view global_2_local((3u*num_ids)/2u);
+
+ //create
+ elapsed_time = timer.seconds();
+ std::cout << elapsed_time << ", ";
+ timer.reset();
+
+ // generate unique ids
+ {
+ generate_ids<Device> gen(local_2_global);
+ }
+ Device::fence();
+ // generate
+ elapsed_time = timer.seconds();
+ std::cout << elapsed_time << ", ";
+ timer.reset();
+
+ {
+ fill_map<Device> fill(global_2_local, local_2_global);
+ }
+ Device::fence();
+
+ // fill
+ elapsed_time = timer.seconds();
+ std::cout << elapsed_time << ", ";
+ timer.reset();
+
+
+ size_t num_errors = 0;
+ for (int i=0; i<100; ++i)
+ {
+ find_test<Device> find(global_2_local, local_2_global,num_errors);
+ }
+ Device::fence();
+
+ // find
+ elapsed_time = timer.seconds();
+ std::cout << elapsed_time << std::endl;
+
+ ASSERT_EQ( num_errors, 0u);
+}
+
+
+} // namespace Performance
+
+
+#endif //KOKKOS_TEST_GLOBAL_TO_LOCAL_IDS_HPP
+
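The timing loop above follows the usual global-to-local id pattern: a View maps each local index to a unique, pseudo-random global id (the byte shuffle in generate_ids is a bijection, so the ids are unique), an UnorderedMap holding the inverse mapping is filled in parallel, and every id is looked up again to confirm a correct round trip. A minimal serial analogue of that round-trip check, sketched with std::unordered_map:

#include <unordered_map>
#include <vector>
#include <cstdint>
#include <cassert>

void global_to_local_sketch(const std::vector<uint32_t>& local_2_global) {
  std::unordered_map<uint32_t, std::size_t> global_2_local;
  for (std::size_t i = 0; i < local_2_global.size(); ++i)
    global_2_local[local_2_global[i]] = i;               // fill: global id -> local index
  for (std::size_t i = 0; i < local_2_global.size(); ++i)
    assert(global_2_local.at(local_2_global[i]) == i);   // find: round trip must return i
}

The device test additionally times the create, generate, fill, and find phases separately and prints them as one CSV row per problem size.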
diff --git a/lib/kokkos/core/src/impl/Kokkos_spinwait.hpp b/lib/kokkos/containers/performance_tests/TestMain.cpp
similarity index 74%
copy from lib/kokkos/core/src/impl/Kokkos_spinwait.hpp
copy to lib/kokkos/containers/performance_tests/TestMain.cpp
index 966291abd..f952ab3db 100755
--- a/lib/kokkos/core/src/impl/Kokkos_spinwait.hpp
+++ b/lib/kokkos/containers/performance_tests/TestMain.cpp
@@ -1,64 +1,50 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
+#include <gtest/gtest.h>
-#ifndef KOKKOS_SPINWAIT_HPP
-#define KOKKOS_SPINWAIT_HPP
-
-#include <Kokkos_Macros.hpp>
-
-namespace Kokkos {
-namespace Impl {
-
-#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
-void spinwait( volatile int & flag , const int value );
-#else
-KOKKOS_INLINE_FUNCTION
-void spinwait( volatile int & , const int ) {}
-#endif
-
-} /* namespace Impl */
-} /* namespace Kokkos */
-
-#endif /* #ifndef KOKKOS_SPINWAIT_HPP */
+int main(int argc, char *argv[]) {
+ ::testing::InitGoogleTest(&argc,argv);
+ return RUN_ALL_TESTS();
+}
diff --git a/lib/kokkos/containers/performance_tests/TestOpenMP.cpp b/lib/kokkos/containers/performance_tests/TestOpenMP.cpp
new file mode 100755
index 000000000..82a9311df
--- /dev/null
+++ b/lib/kokkos/containers/performance_tests/TestOpenMP.cpp
@@ -0,0 +1,131 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <gtest/gtest.h>
+
+#include <Kokkos_Core.hpp>
+
+#include <Kokkos_UnorderedMap.hpp>
+
+#include <TestGlobal2LocalIds.hpp>
+#include <TestUnorderedMapPerformance.hpp>
+
+#include <iomanip>
+#include <sstream>
+#include <string>
+#include <fstream>
+
+
+namespace Performance {
+
+class openmp : public ::testing::Test {
+protected:
+ static void SetUpTestCase()
+ {
+ std::cout << std::setprecision(5) << std::scientific;
+
+ unsigned num_threads = 4;
+
+ if (Kokkos::hwloc::available()) {
+ num_threads = Kokkos::hwloc::get_available_numa_count()
+ * Kokkos::hwloc::get_available_cores_per_numa()
+ * Kokkos::hwloc::get_available_threads_per_core()
+ ;
+
+ }
+
+ std::cout << "OpenMP: " << num_threads << std::endl;
+
+ Kokkos::OpenMP::initialize( num_threads );
+
+ std::cout << "available threads: " << omp_get_max_threads() << std::endl;
+ }
+
+ static void TearDownTestCase()
+ {
+ Kokkos::OpenMP::finalize();
+
+ omp_set_num_threads(1);
+
+ ASSERT_EQ( 1 , omp_get_max_threads() );
+ }
+};
+
+TEST_F( openmp, global_2_local)
+{
+ std::cout << "OpenMP" << std::endl;
+ std::cout << "size, create, generate, fill, find" << std::endl;
+ for (unsigned i=Performance::begin_id_size; i<=Performance::end_id_size; i *= Performance::id_step)
+ test_global_to_local_ids<Kokkos::OpenMP>(i);
+}
+
+TEST_F( openmp, unordered_map_performance_near)
+{
+ unsigned num_openmp = 4;
+ if (Kokkos::hwloc::available()) {
+ num_openmp = Kokkos::hwloc::get_available_numa_count() *
+ Kokkos::hwloc::get_available_cores_per_numa() *
+ Kokkos::hwloc::get_available_threads_per_core();
+
+ }
+ std::ostringstream base_file_name;
+ base_file_name << "openmp-" << num_openmp << "-near";
+ Perf::run_performance_tests<Kokkos::OpenMP,true>(base_file_name.str());
+}
+
+TEST_F( openmp, unordered_map_performance_far)
+{
+ unsigned num_openmp = 4;
+ if (Kokkos::hwloc::available()) {
+ num_openmp = Kokkos::hwloc::get_available_numa_count() *
+ Kokkos::hwloc::get_available_cores_per_numa() *
+ Kokkos::hwloc::get_available_threads_per_core();
+
+ }
+ std::ostringstream base_file_name;
+ base_file_name << "openmp-" << num_openmp << "-far";
+ Perf::run_performance_tests<Kokkos::OpenMP,false>(base_file_name.str());
+}
+
+} // namespace Performance
+
diff --git a/lib/kokkos/containers/performance_tests/TestThreads.cpp b/lib/kokkos/containers/performance_tests/TestThreads.cpp
new file mode 100755
index 000000000..04d9dc0c1
--- /dev/null
+++ b/lib/kokkos/containers/performance_tests/TestThreads.cpp
@@ -0,0 +1,126 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <gtest/gtest.h>
+
+#include <Kokkos_Core.hpp>
+
+#include <Kokkos_UnorderedMap.hpp>
+
+#include <iomanip>
+
+#include <TestGlobal2LocalIds.hpp>
+#include <TestUnorderedMapPerformance.hpp>
+
+#include <iomanip>
+#include <sstream>
+#include <string>
+#include <fstream>
+
+namespace Performance {
+
+class threads : public ::testing::Test {
+protected:
+ static void SetUpTestCase()
+ {
+ std::cout << std::setprecision(5) << std::scientific;
+
+ unsigned num_threads = 4;
+
+ if (Kokkos::hwloc::available()) {
+ num_threads = Kokkos::hwloc::get_available_numa_count() *
+ Kokkos::hwloc::get_available_cores_per_numa() *
+ Kokkos::hwloc::get_available_threads_per_core();
+
+ }
+
+ std::cout << "Threads: " << num_threads << std::endl;
+
+ Kokkos::Threads::initialize( num_threads );
+ }
+
+ static void TearDownTestCase()
+ {
+ Kokkos::Threads::finalize();
+ }
+};
+
+TEST_F( threads, global_2_local)
+{
+ std::cout << "Threads" << std::endl;
+ std::cout << "size, create, generate, fill, find" << std::endl;
+ for (unsigned i=Performance::begin_id_size; i<=Performance::end_id_size; i *= Performance::id_step)
+ test_global_to_local_ids<Kokkos::Threads>(i);
+}
+
+TEST_F( threads, unordered_map_performance_near)
+{
+ unsigned num_threads = 4;
+ if (Kokkos::hwloc::available()) {
+ num_threads = Kokkos::hwloc::get_available_numa_count() *
+ Kokkos::hwloc::get_available_cores_per_numa() *
+ Kokkos::hwloc::get_available_threads_per_core();
+
+ }
+ std::ostringstream base_file_name;
+ base_file_name << "threads-" << num_threads << "-near";
+ Perf::run_performance_tests<Kokkos::Threads,true>(base_file_name.str());
+}
+
+TEST_F( threads, unordered_map_performance_far)
+{
+ unsigned num_threads = 4;
+ if (Kokkos::hwloc::available()) {
+ num_threads = Kokkos::hwloc::get_available_numa_count() *
+ Kokkos::hwloc::get_available_cores_per_numa() *
+ Kokkos::hwloc::get_available_threads_per_core();
+
+ }
+ std::ostringstream base_file_name;
+ base_file_name << "threads-" << num_threads << "-far";
+ Perf::run_performance_tests<Kokkos::Threads,false>(base_file_name.str());
+}
+
+} // namespace Performance
+
+
diff --git a/lib/kokkos/containers/performance_tests/TestUnorderedMapPerformance.hpp b/lib/kokkos/containers/performance_tests/TestUnorderedMapPerformance.hpp
new file mode 100755
index 000000000..975800229
--- /dev/null
+++ b/lib/kokkos/containers/performance_tests/TestUnorderedMapPerformance.hpp
@@ -0,0 +1,262 @@
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+
+#ifndef KOKKOS_TEST_UNORDERED_MAP_PERFORMANCE_HPP
+#define KOKKOS_TEST_UNORDERED_MAP_PERFORMANCE_HPP
+
+#include <impl/Kokkos_Timer.hpp>
+
+#include <iostream>
+#include <iomanip>
+#include <fstream>
+#include <string>
+#include <sstream>
+
+
+namespace Perf {
+
+template <typename Device, bool Near>
+struct UnorderedMapTest
+{
+ typedef Device execution_space;
+ typedef Kokkos::UnorderedMap<uint32_t, uint32_t, execution_space> map_type;
+ typedef typename map_type::histogram_type histogram_type;
+
+ struct value_type {
+ uint32_t failed_count;
+ uint32_t max_list;
+ };
+
+ uint32_t capacity;
+ uint32_t inserts;
+ uint32_t collisions;
+ double seconds;
+ map_type map;
+ histogram_type histogram;
+
+ UnorderedMapTest( uint32_t arg_capacity, uint32_t arg_inserts, uint32_t arg_collisions)
+ : capacity(arg_capacity)
+ , inserts(arg_inserts)
+ , collisions(arg_collisions)
+ , seconds(0)
+ , map(capacity)
+ , histogram(map.get_histogram())
+ {
+ Kokkos::Impl::Timer wall_clock ;
+ wall_clock.reset();
+
+ value_type v = {};
+ int loop_count = 0;
+ do {
+ ++loop_count;
+
+ v = value_type();
+ Kokkos::parallel_reduce(inserts, *this, v);
+
+ if (v.failed_count > 0u) {
+ const uint32_t new_capacity = map.capacity() + ((map.capacity()*3ull)/20u) + v.failed_count/collisions ;
+ map.rehash( new_capacity );
+ }
+ } while (v.failed_count > 0u);
+
+ seconds = wall_clock.seconds();
+
+ switch (loop_count)
+ {
+ case 1u: std::cout << " \033[0;32m" << loop_count << "\033[0m "; break;
+ case 2u: std::cout << " \033[1;31m" << loop_count << "\033[0m "; break;
+ default: std::cout << " \033[0;31m" << loop_count << "\033[0m "; break;
+ }
+ std::cout << std::setprecision(2) << std::fixed << std::setw(5) << (1e9*(seconds/(inserts))) << "; " << std::flush;
+
+ histogram.calculate();
+ Device::fence();
+ }
+
+ void print(std::ostream & metrics_out, std::ostream & length_out, std::ostream & distance_out, std::ostream & block_distance_out)
+ {
+ metrics_out << map.capacity() << " , ";
+ metrics_out << inserts/collisions << " , ";
+ metrics_out << (100.0 * inserts/collisions) / map.capacity() << " , ";
+ metrics_out << inserts << " , ";
+ metrics_out << (map.failed_insert() ? "true" : "false") << " , ";
+ metrics_out << collisions << " , ";
+ metrics_out << 1e9*(seconds/inserts) << " , ";
+ metrics_out << seconds << std::endl;
+
+ length_out << map.capacity() << " , ";
+ length_out << ((100.0 *inserts/collisions) / map.capacity()) << " , ";
+ length_out << collisions << " , ";
+ histogram.print_length(length_out);
+
+ distance_out << map.capacity() << " , ";
+ distance_out << ((100.0 *inserts/collisions) / map.capacity()) << " , ";
+ distance_out << collisions << " , ";
+ histogram.print_distance(distance_out);
+
+ block_distance_out << map.capacity() << " , ";
+ block_distance_out << ((100.0 *inserts/collisions) / map.capacity()) << " , ";
+ block_distance_out << collisions << " , ";
+ histogram.print_block_distance(block_distance_out);
+ }
+
+
+ KOKKOS_INLINE_FUNCTION
+ void init( value_type & v ) const
+ {
+ v.failed_count = 0;
+ v.max_list = 0;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void join( volatile value_type & dst, const volatile value_type & src ) const
+ {
+ dst.failed_count += src.failed_count;
+ dst.max_list = src.max_list < dst.max_list ? dst.max_list : src.max_list;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()(uint32_t i, value_type & v) const
+ {
+ const uint32_t key = Near ? i/collisions : i%(inserts/collisions);
+ typename map_type::insert_result result = map.insert(key,i);
+ v.failed_count += !result.failed() ? 0 : 1;
+ v.max_list = result.list_position() < v.max_list ? v.max_list : result.list_position();
+ }
+
+};
+
+//#define KOKKOS_COLLECT_UNORDERED_MAP_METRICS
+
+template <typename Device, bool Near>
+void run_performance_tests(std::string const & base_file_name)
+{
+#if defined(KOKKOS_COLLECT_UNORDERED_MAP_METRICS)
+ std::string metrics_file_name = base_file_name + std::string("-metrics.csv");
+ std::string length_file_name = base_file_name + std::string("-length.csv");
+ std::string distance_file_name = base_file_name + std::string("-distance.csv");
+ std::string block_distance_file_name = base_file_name + std::string("-block_distance.csv");
+
+ std::ofstream metrics_out( metrics_file_name.c_str(), std::ofstream::out );
+ std::ofstream length_out( length_file_name.c_str(), std::ofstream::out );
+ std::ofstream distance_out( distance_file_name.c_str(), std::ofstream::out );
+ std::ofstream block_distance_out( block_distance_file_name.c_str(), std::ofstream::out );
+
+
+ /*
+ const double test_ratios[] = {
+ 0.50
+ , 0.75
+ , 0.80
+ , 0.85
+ , 0.90
+ , 0.95
+ , 1.00
+ , 1.25
+ , 2.00
+ };
+ */
+
+ const double test_ratios[] = { 1.00 };
+
+ const int num_ratios = sizeof(test_ratios) / sizeof(double);
+
+ /*
+ const uint32_t collisions[] {
+ 1
+ , 4
+ , 16
+ , 64
+ };
+ */
+
+ const uint32_t collisions[] = { 16 };
+
+ const int num_collisions = sizeof(collisions) / sizeof(uint32_t);
+
+ // set up file headers
+ metrics_out << "Capacity , Unique , Percent Full , Attempted Inserts , Failed Inserts , Collision Ratio , Nanoseconds/Inserts, Seconds" << std::endl;
+ length_out << "Capacity , Percent Full , ";
+ distance_out << "Capacity , Percent Full , ";
+ block_distance_out << "Capacity , Percent Full , ";
+
+ for (int i=0; i<100; ++i) {
+ length_out << i << " , ";
+ distance_out << i << " , ";
+ block_distance_out << i << " , ";
+ }
+
+ length_out << "\b\b\b " << std::endl;
+ distance_out << "\b\b\b " << std::endl;
+ block_distance_out << "\b\b\b " << std::endl;
+
+ Kokkos::Impl::Timer wall_clock ;
+ for (int i=0; i < num_collisions ; ++i) {
+ wall_clock.reset();
+ std::cout << "Collisions: " << collisions[i] << std::endl;
+ for (int j = 0; j < num_ratios; ++j) {
+ std::cout << std::setprecision(1) << std::fixed << std::setw(5) << (100.0*test_ratios[j]) << "% " << std::flush;
+ for (uint32_t capacity = 1<<14; capacity < 1<<25; capacity = capacity << 1) {
+ uint32_t inserts = static_cast<uint32_t>(test_ratios[j]*(capacity));
+ std::cout << capacity << std::flush;
+ UnorderedMapTest<Device, Near> test(capacity, inserts*collisions[i], collisions[i]);
+ Device::fence();
+ test.print(metrics_out, length_out, distance_out, block_distance_out);
+ }
+ std::cout << "\b\b " << std::endl;
+
+ }
+ std::cout << " " << wall_clock.seconds() << " secs" << std::endl;
+ }
+ metrics_out.close();
+ length_out.close();
+ distance_out.close();
+ block_distance_out.close();
+#else
+ (void)base_file_name;
+ std::cout << "skipping test" << std::endl;
+#endif
+}
+
+
+} // namespace Perf
+
+#endif //KOKKOS_TEST_UNORDERED_MAP_PERFORMANCE_HPP
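The test driver above uses a grow-and-retry strategy: it attempts all inserts with a parallel_reduce, counts the failures, and if any insert failed it rehashes the map to roughly 115% of its current capacity plus the shortfall before trying again. The following is a minimal sketch of the same pattern, assuming a Kokkos build with lambda support; the function name insert_all and the key scheme are illustrative, not part of the patch.

    #include <Kokkos_Core.hpp>
    #include <Kokkos_UnorderedMap.hpp>

    template <typename Device>
    int insert_all(Kokkos::UnorderedMap<uint32_t, uint32_t, Device> & map,
                   uint32_t num_keys)
    {
      uint32_t failed = 0;
      int passes = 0;
      do {
        ++passes;
        failed = 0;
        // Count inserts that failed because the map ran out of capacity.
        Kokkos::parallel_reduce(num_keys,
          KOKKOS_LAMBDA(const uint32_t i, uint32_t & f) {
            if (map.insert(i, i).failed()) ++f;
          }, failed);
        if (failed > 0u) {
          // Grow by ~15% plus the observed shortfall, then retry.
          map.rehash(map.capacity() + (map.capacity()*3ull)/20u + failed);
        }
      } while (failed > 0u);
      return passes;   // number of passes needed, like loop_count above
    }
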
diff --git a/lib/kokkos/containers/src/Kokkos_Bitset.hpp b/lib/kokkos/containers/src/Kokkos_Bitset.hpp
index b53daab80..b51b1c2b2 100755
--- a/lib/kokkos/containers/src/Kokkos_Bitset.hpp
+++ b/lib/kokkos/containers/src/Kokkos_Bitset.hpp
@@ -1,437 +1,437 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_BITSET_HPP
#define KOKKOS_BITSET_HPP
#include <Kokkos_Core.hpp>
#include <Kokkos_Functional.hpp>
#include <impl/Kokkos_Bitset_impl.hpp>
#include <stdexcept>
namespace Kokkos {
template <typename Device = Kokkos::DefaultExecutionSpace >
class Bitset;
template <typename Device = Kokkos::DefaultExecutionSpace >
class ConstBitset;
template <typename DstDevice, typename SrcDevice>
void deep_copy( Bitset<DstDevice> & dst, Bitset<SrcDevice> const& src);
template <typename DstDevice, typename SrcDevice>
void deep_copy( Bitset<DstDevice> & dst, ConstBitset<SrcDevice> const& src);
template <typename DstDevice, typename SrcDevice>
void deep_copy( ConstBitset<DstDevice> & dst, ConstBitset<SrcDevice> const& src);
/// A thread-safe view to a bitset
template <typename Device>
class Bitset
{
public:
- typedef Device device_type;
+ typedef Device execution_space;
typedef unsigned size_type;
enum { BIT_SCAN_REVERSE = 1u };
enum { MOVE_HINT_BACKWARD = 2u };
enum {
BIT_SCAN_FORWARD_MOVE_HINT_FORWARD = 0u
, BIT_SCAN_REVERSE_MOVE_HINT_FORWARD = BIT_SCAN_REVERSE
, BIT_SCAN_FORWARD_MOVE_HINT_BACKWARD = MOVE_HINT_BACKWARD
, BIT_SCAN_REVERSE_MOVE_HINT_BACKWARD = BIT_SCAN_REVERSE | MOVE_HINT_BACKWARD
};
private:
enum { block_size = static_cast<unsigned>(sizeof(unsigned)*CHAR_BIT) };
enum { block_mask = block_size-1u };
enum { block_shift = static_cast<int>(Impl::power_of_two<block_size>::value) };
public:
/// constructor
/// arg_size := number of bits in the set
Bitset(unsigned arg_size = 0u)
: m_size(arg_size)
, m_last_block_mask(0u)
, m_blocks("Bitset", ((m_size + block_mask) >> block_shift) )
{
for (int i=0, end = static_cast<int>(m_size & block_mask); i < end; ++i) {
m_last_block_mask |= 1u << i;
}
}
/// assignment
Bitset<Device> & operator = (Bitset<Device> const & rhs)
{
this->m_size = rhs.m_size;
this->m_last_block_mask = rhs.m_last_block_mask;
this->m_blocks = rhs.m_blocks;
return *this;
}
/// copy constructor
Bitset( Bitset<Device> const & rhs)
: m_size( rhs.m_size )
, m_last_block_mask( rhs.m_last_block_mask )
, m_blocks( rhs.m_blocks )
{}
/// number of bits in the set
/// can be called from the host or the device
KOKKOS_FORCEINLINE_FUNCTION
unsigned size() const
{ return m_size; }
/// number of bits which are set to 1
/// can only be called from the host
unsigned count() const
{
Impl::BitsetCount< Bitset<Device> > f(*this);
return f.apply();
}
/// set all bits to 1
/// can only be called from the host
void set()
{
Kokkos::deep_copy(m_blocks, ~0u );
if (m_last_block_mask) {
//clear the unused bits in the last block
- typedef Kokkos::Impl::DeepCopy< typename device_type::memory_space, Kokkos::HostSpace > raw_deep_copy;
- raw_deep_copy( m_blocks.ptr_on_device() + (m_blocks.size() -1u), &m_last_block_mask, sizeof(unsigned));
+ typedef Kokkos::Impl::DeepCopy< typename execution_space::memory_space, Kokkos::HostSpace > raw_deep_copy;
+ raw_deep_copy( m_blocks.ptr_on_device() + (m_blocks.dimension_0() -1u), &m_last_block_mask, sizeof(unsigned));
}
}
/// set all bits to 0
/// can only be called from the host
void reset()
{
Kokkos::deep_copy(m_blocks, 0u );
}
/// set all bits to 0
/// can only be called from the host
void clear()
{
Kokkos::deep_copy(m_blocks, 0u );
}
/// set i'th bit to 1
/// can only be called from the device
KOKKOS_FORCEINLINE_FUNCTION
bool set( unsigned i ) const
{
if ( i < m_size ) {
unsigned * block_ptr = &m_blocks[ i >> block_shift ];
const unsigned mask = 1u << static_cast<int>( i & block_mask );
return !( atomic_fetch_or( block_ptr, mask ) & mask );
}
return false;
}
/// set i'th bit to 0
/// can only be called from the device
KOKKOS_FORCEINLINE_FUNCTION
bool reset( unsigned i ) const
{
if ( i < m_size ) {
unsigned * block_ptr = &m_blocks[ i >> block_shift ];
const unsigned mask = 1u << static_cast<int>( i & block_mask );
return atomic_fetch_and( block_ptr, ~mask ) & mask;
}
return false;
}
/// return true if the i'th bit is set to 1
/// can only be called from the device
KOKKOS_FORCEINLINE_FUNCTION
bool test( unsigned i ) const
{
if ( i < m_size ) {
const unsigned block = volatile_load(&m_blocks[ i >> block_shift ]);
const unsigned mask = 1u << static_cast<int>( i & block_mask );
return block & mask;
}
return false;
}
/// used with find_any_set_near or find_any_unset_near functions
/// returns the max number of times those functions should be called
/// when searching for an available bit
KOKKOS_FORCEINLINE_FUNCTION
unsigned max_hint() const
{
- return m_blocks.size();
+ return m_blocks.dimension_0();
}
/// find a bit set to 1 near the hint
/// returns a pair< bool, unsigned> where if result.first is true then result.second is the bit found
/// and if result.first is false then result.second is a new hint
KOKKOS_INLINE_FUNCTION
Kokkos::pair<bool, unsigned> find_any_set_near( unsigned hint , unsigned scan_direction = BIT_SCAN_FORWARD_MOVE_HINT_FORWARD ) const
{
- const unsigned block_idx = (hint >> block_shift) < m_blocks.size() ? (hint >> block_shift) : 0;
+ const unsigned block_idx = (hint >> block_shift) < m_blocks.dimension_0() ? (hint >> block_shift) : 0;
const unsigned offset = hint & block_mask;
unsigned block = volatile_load(&m_blocks[ block_idx ]);
- block = !m_last_block_mask || (block_idx < (m_blocks.size()-1)) ? block : block & m_last_block_mask ;
+ block = !m_last_block_mask || (block_idx < (m_blocks.dimension_0()-1)) ? block : block & m_last_block_mask ;
return find_any_helper(block_idx, offset, block, scan_direction);
}
/// find a bit set to 0 near the hint
/// returns a pair< bool, unsigned> where if result.first is true then result.second is the bit found
/// and if result.first is false then result.second is a new hint
KOKKOS_INLINE_FUNCTION
Kokkos::pair<bool, unsigned> find_any_unset_near( unsigned hint , unsigned scan_direction = BIT_SCAN_FORWARD_MOVE_HINT_FORWARD ) const
{
const unsigned block_idx = hint >> block_shift;
const unsigned offset = hint & block_mask;
unsigned block = volatile_load(&m_blocks[ block_idx ]);
- block = !m_last_block_mask || (block_idx < (m_blocks.size()-1) ) ? ~block : ~block & m_last_block_mask ;
+ block = !m_last_block_mask || (block_idx < (m_blocks.dimension_0()-1) ) ? ~block : ~block & m_last_block_mask ;
return find_any_helper(block_idx, offset, block, scan_direction);
}
private:
KOKKOS_FORCEINLINE_FUNCTION
Kokkos::pair<bool, unsigned> find_any_helper(unsigned block_idx, unsigned offset, unsigned block, unsigned scan_direction) const
{
Kokkos::pair<bool, unsigned> result( block > 0u, 0);
if (!result.first) {
result.second = update_hint( block_idx, offset, scan_direction );
}
else {
result.second = scan_block( (block_idx << block_shift)
, offset
, block
, scan_direction
);
}
return result;
}
KOKKOS_FORCEINLINE_FUNCTION
unsigned scan_block(unsigned block_start, int offset, unsigned block, unsigned scan_direction ) const
{
offset = !(scan_direction & BIT_SCAN_REVERSE) ? offset : (offset + block_mask) & block_mask;
block = Impl::rotate_right(block, offset);
return ((( !(scan_direction & BIT_SCAN_REVERSE) ?
Impl::bit_scan_forward(block) :
Impl::bit_scan_reverse(block)
) + offset
) & block_mask
) + block_start;
}
KOKKOS_FORCEINLINE_FUNCTION
unsigned update_hint( long long block_idx, unsigned offset, unsigned scan_direction ) const
{
block_idx += scan_direction & MOVE_HINT_BACKWARD ? -1 : 1;
- block_idx = block_idx >= 0 ? block_idx : m_blocks.size() - 1;
- block_idx = block_idx < static_cast<long long>(m_blocks.size()) ? block_idx : 0;
+ block_idx = block_idx >= 0 ? block_idx : m_blocks.dimension_0() - 1;
+ block_idx = block_idx < static_cast<long long>(m_blocks.dimension_0()) ? block_idx : 0;
return static_cast<unsigned>(block_idx)*block_size + offset;
}
private:
unsigned m_size;
unsigned m_last_block_mask;
- View< unsigned *, device_type, MemoryTraits<RandomAccess> > m_blocks;
+ View< unsigned *, execution_space, MemoryTraits<RandomAccess> > m_blocks;
private:
template <typename DDevice>
friend class Bitset;
template <typename DDevice>
friend class ConstBitset;
template <typename Bitset>
friend struct Impl::BitsetCount;
template <typename DstDevice, typename SrcDevice>
friend void deep_copy( Bitset<DstDevice> & dst, Bitset<SrcDevice> const& src);
template <typename DstDevice, typename SrcDevice>
friend void deep_copy( Bitset<DstDevice> & dst, ConstBitset<SrcDevice> const& src);
};
/// a thread-safe view to a const bitset
/// i.e. can only test bits
template <typename Device>
class ConstBitset
{
public:
- typedef Device device_type;
+ typedef Device execution_space;
typedef unsigned size_type;
private:
enum { block_size = static_cast<unsigned>(sizeof(unsigned)*CHAR_BIT) };
enum { block_mask = block_size -1u };
enum { block_shift = static_cast<int>(Impl::power_of_two<block_size>::value) };
public:
ConstBitset()
: m_size (0)
{}
ConstBitset(Bitset<Device> const& rhs)
: m_size(rhs.m_size)
, m_blocks(rhs.m_blocks)
{}
ConstBitset(ConstBitset<Device> const& rhs)
: m_size( rhs.m_size )
, m_blocks( rhs.m_blocks )
{}
ConstBitset<Device> & operator = (Bitset<Device> const & rhs)
{
this->m_size = rhs.m_size;
this->m_blocks = rhs.m_blocks;
return *this;
}
ConstBitset<Device> & operator = (ConstBitset<Device> const & rhs)
{
this->m_size = rhs.m_size;
this->m_blocks = rhs.m_blocks;
return *this;
}
KOKKOS_FORCEINLINE_FUNCTION
unsigned size() const
{
return m_size;
}
unsigned count() const
{
Impl::BitsetCount< ConstBitset<Device> > f(*this);
return f.apply();
}
KOKKOS_FORCEINLINE_FUNCTION
bool test( unsigned i ) const
{
if ( i < m_size ) {
const unsigned block = m_blocks[ i >> block_shift ];
const unsigned mask = 1u << static_cast<int>( i & block_mask );
return block & mask;
}
return false;
}
private:
unsigned m_size;
- View< const unsigned *, device_type, MemoryTraits<RandomAccess> > m_blocks;
+ View< const unsigned *, execution_space, MemoryTraits<RandomAccess> > m_blocks;
private:
template <typename DDevice>
friend class ConstBitset;
template <typename Bitset>
friend struct Impl::BitsetCount;
template <typename DstDevice, typename SrcDevice>
friend void deep_copy( Bitset<DstDevice> & dst, ConstBitset<SrcDevice> const& src);
template <typename DstDevice, typename SrcDevice>
friend void deep_copy( ConstBitset<DstDevice> & dst, ConstBitset<SrcDevice> const& src);
};
template <typename DstDevice, typename SrcDevice>
void deep_copy( Bitset<DstDevice> & dst, Bitset<SrcDevice> const& src)
{
if (dst.size() != src.size()) {
throw std::runtime_error("Error: Cannot deep_copy bitsets of different sizes!");
}
typedef Kokkos::Impl::DeepCopy< typename DstDevice::memory_space, typename SrcDevice::memory_space > raw_deep_copy;
- raw_deep_copy(dst.m_blocks.ptr_on_device(), src.m_blocks.ptr_on_device(), sizeof(unsigned)*src.m_blocks.size());
+ raw_deep_copy(dst.m_blocks.ptr_on_device(), src.m_blocks.ptr_on_device(), sizeof(unsigned)*src.m_blocks.dimension_0());
}
template <typename DstDevice, typename SrcDevice>
void deep_copy( Bitset<DstDevice> & dst, ConstBitset<SrcDevice> const& src)
{
if (dst.size() != src.size()) {
throw std::runtime_error("Error: Cannot deep_copy bitsets of different sizes!");
}
typedef Kokkos::Impl::DeepCopy< typename DstDevice::memory_space, typename SrcDevice::memory_space > raw_deep_copy;
- raw_deep_copy(dst.m_blocks.ptr_on_device(), src.m_blocks.ptr_on_device(), sizeof(unsigned)*src.m_blocks.size());
+ raw_deep_copy(dst.m_blocks.ptr_on_device(), src.m_blocks.ptr_on_device(), sizeof(unsigned)*src.m_blocks.dimension_0());
}
template <typename DstDevice, typename SrcDevice>
void deep_copy( ConstBitset<DstDevice> & dst, ConstBitset<SrcDevice> const& src)
{
if (dst.size() != src.size()) {
throw std::runtime_error("Error: Cannot deep_copy bitsets of different sizes!");
}
typedef Kokkos::Impl::DeepCopy< typename DstDevice::memory_space, typename SrcDevice::memory_space > raw_deep_copy;
- raw_deep_copy(dst.m_blocks.ptr_on_device(), src.m_blocks.ptr_on_device(), sizeof(unsigned)*src.m_blocks.size());
+ raw_deep_copy(dst.m_blocks.ptr_on_device(), src.m_blocks.ptr_on_device(), sizeof(unsigned)*src.m_blocks.dimension_0());
}
} // namespace Kokkos
#endif //KOKKOS_BITSET_HPP
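Bitset's set(), reset() and test() are device-callable and atomic, while find_any_set_near()/find_any_unset_near() return a pair<bool,unsigned> that either names a bit or supplies a fresh hint. Below is a small usage sketch of that API (illustrative only, not part of the patch), in which each thread claims one free slot.

    #include <Kokkos_Core.hpp>
    #include <Kokkos_Bitset.hpp>

    template <typename Device>
    struct ClaimSlots {
      Kokkos::Bitset<Device> used;   // one bit per slot, 1 == taken

      KOKKOS_INLINE_FUNCTION
      void operator()(const unsigned i) const {
        unsigned hint = i % used.size();
        // max_hint() bounds how many probes are worth attempting.
        for (unsigned attempt = 0; attempt < used.max_hint(); ++attempt) {
          Kokkos::pair<bool, unsigned> r = used.find_any_unset_near(hint);
          if (r.first && used.set(r.second))
            return;          // set() returns true only for the thread that flips the bit
          hint = r.second;   // otherwise r.second serves as the next hint
        }
      }
    };
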
diff --git a/lib/kokkos/containers/src/Kokkos_DualView.hpp b/lib/kokkos/containers/src/Kokkos_DualView.hpp
index 94d3e9ff1..95eea57e9 100755
--- a/lib/kokkos/containers/src/Kokkos_DualView.hpp
+++ b/lib/kokkos/containers/src/Kokkos_DualView.hpp
@@ -1,678 +1,840 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos
-// Manycore Performance-Portable Multidimensional Arrays
-//
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
/// \file Kokkos_DualView.hpp
/// \brief Declaration and definition of Kokkos::DualView.
///
/// This header file declares and defines Kokkos::DualView and its
/// related nonmember functions.
#ifndef KOKKOS_DUALVIEW_HPP
#define KOKKOS_DUALVIEW_HPP
#include <Kokkos_Core.hpp>
#include <impl/Kokkos_Error.hpp>
namespace Kokkos {
/* \class DualView
* \brief Container to manage mirroring a Kokkos::View that lives
* in device memory with a Kokkos::View that lives in host memory.
*
* This class provides capabilities to manage data which exists in two
* memory spaces at the same time. It keeps views of the same layout
* on two memory spaces as well as modified flags for both
 * allocations. Users are responsible for setting the modified flags
 * manually if they change the data in either memory space, by calling
 * the modify() method templated on the device where they modified the
 * data. Users may synchronize data by calling the sync() function,
 * templated on the device towards which they want to synchronize
 * (i.e., the target of the one-way copy operation).
*
* The DualView class also provides convenience methods such as
* realloc, resize and capacity which call the appropriate methods of
* the underlying Kokkos::View objects.
*
* The four template arguments are the same as those of Kokkos::View.
* (Please refer to that class' documentation for a detailed
* description.)
*
* \tparam DataType The type of the entries stored in the container.
*
* \tparam Layout The array's layout in memory.
*
* \tparam Device The Kokkos Device type. If its memory space is
* not the same as the host's memory space, then DualView will
* contain two separate Views: one in device memory, and one in
* host memory. Otherwise, DualView will only store one View.
*
* \tparam MemoryTraits (optional) The user's intended memory access
* behavior. Please see the documentation of Kokkos::View for
* examples. The default suffices for most users.
*/
template< class DataType ,
class Arg1Type = void ,
class Arg2Type = void ,
class Arg3Type = void>
class DualView : public ViewTraits< DataType , Arg1Type , Arg2Type, Arg3Type >
{
public:
//! \name Typedefs for device types and various Kokkos::View specializations.
//@{
typedef ViewTraits< DataType , Arg1Type , Arg2Type, Arg3Type > traits ;
//! The Kokkos Host Device type.
typedef typename traits::host_mirror_space host_mirror_space ;
//! The type of a Kokkos::View on the device.
typedef View< typename traits::data_type ,
typename traits::array_layout ,
typename traits::device_type ,
typename traits::memory_traits > t_dev ;
/// \typedef t_host
/// \brief The type of a Kokkos::View host mirror of \c t_dev.
-#if defined( CUDA_VERSION ) && ( 6000 <= CUDA_VERSION ) && defined(KOKKOS_USE_CUDA_UVM)
- typedef t_dev t_host;
-#else
typedef typename t_dev::HostMirror t_host ;
-#endif
//! The type of a const View on the device.
//! The type of a Kokkos::View on the device.
typedef View< typename traits::const_data_type ,
typename traits::array_layout ,
typename traits::device_type ,
typename traits::memory_traits > t_dev_const ;
/// \typedef t_host_const
/// \brief The type of a const View host mirror of \c t_dev_const.
-#if defined( CUDA_VERSION ) && ( 6000 <= CUDA_VERSION ) && defined(KOKKOS_USE_CUDA_UVM)
- typedef t_dev_const t_host_const;
-#else
typedef typename t_dev_const::HostMirror t_host_const;
-#endif
//! The type of a const, random-access View on the device.
typedef View< typename traits::const_data_type ,
typename traits::array_layout ,
typename traits::device_type ,
- MemoryTraits<RandomAccess> > t_dev_const_randomread ;
+ MemoryRandomAccess > t_dev_const_randomread ;
/// \typedef t_host_const_randomread
/// \brief The type of a const, random-access View host mirror of
/// \c t_dev_const_randomread.
-#if defined( CUDA_VERSION ) && ( 6000 <= CUDA_VERSION ) && defined(KOKKOS_USE_CUDA_UVM)
- typedef t_dev_const_randomread t_host_const_randomread;
-#else
typedef typename t_dev_const_randomread::HostMirror t_host_const_randomread;
-#endif
//! The type of an unmanaged View on the device.
typedef View< typename traits::data_type ,
typename traits::array_layout ,
typename traits::device_type ,
MemoryUnmanaged> t_dev_um;
//! The type of an unmanaged View host mirror of \c t_dev_um.
typedef View< typename t_host::data_type ,
typename t_host::array_layout ,
typename t_host::device_type ,
MemoryUnmanaged> t_host_um;
//! The type of a const unmanaged View on the device.
typedef View< typename traits::const_data_type ,
typename traits::array_layout ,
typename traits::device_type ,
MemoryUnmanaged> t_dev_const_um;
//! The type of a const unmanaged View host mirror of \c t_dev_const_um.
typedef View<typename t_host::const_data_type,
typename t_host::array_layout,
typename t_host::device_type,
MemoryUnmanaged> t_host_const_um;
//@}
//! \name The two View instances.
//@{
t_dev d_view;
t_host h_view;
//@}
//! \name Counters to keep track of changes ("modified" flags)
//@{
- View<unsigned int,LayoutLeft,host_mirror_space> modified_device;
- View<unsigned int,LayoutLeft,host_mirror_space> modified_host;
+ View<unsigned int,LayoutLeft,typename t_host::execution_space> modified_device;
+ View<unsigned int,LayoutLeft,typename t_host::execution_space> modified_host;
//@}
//! \name Constructors
//@{
/// \brief Empty constructor.
///
/// Both device and host View objects are constructed using their
/// default constructors. The "modified" flags are both initialized
/// to "unmodified."
DualView () :
- modified_device (View<unsigned int,LayoutLeft,host_mirror_space> ("DualView::modified_device")),
- modified_host (View<unsigned int,LayoutLeft,host_mirror_space> ("DualView::modified_host"))
+ modified_device (View<unsigned int,LayoutLeft,typename t_host::execution_space> ("DualView::modified_device")),
+ modified_host (View<unsigned int,LayoutLeft,typename t_host::execution_space> ("DualView::modified_host"))
{}
/// \brief Constructor that allocates View objects on both host and device.
///
/// This constructor works like the analogous constructor of View.
/// The first argument is a string label, which is entirely for your
/// benefit. (Different DualView objects may have the same label if
/// you like.) The arguments that follow are the dimensions of the
/// View objects. For example, if the View has three dimensions,
/// the first three integer arguments will be nonzero, and you may
/// omit the integer arguments that follow.
DualView (const std::string& label,
const size_t n0 = 0,
const size_t n1 = 0,
const size_t n2 = 0,
const size_t n3 = 0,
const size_t n4 = 0,
const size_t n5 = 0,
const size_t n6 = 0,
const size_t n7 = 0)
: d_view (label, n0, n1, n2, n3, n4, n5, n6, n7)
-#if defined( CUDA_VERSION ) && ( 6000 <= CUDA_VERSION ) && defined(KOKKOS_USE_CUDA_UVM)
- , h_view (d_view) // with UVM, host View is _always_ a shallow copy
-#else
, h_view (create_mirror_view (d_view)) // without UVM, host View mirrors
-#endif
- , modified_device (View<unsigned int,LayoutLeft,host_mirror_space> ("DualView::modified_device"))
- , modified_host (View<unsigned int,LayoutLeft,host_mirror_space> ("DualView::modified_host"))
+ , modified_device (View<unsigned int,LayoutLeft,typename t_host::execution_space> ("DualView::modified_device"))
+ , modified_host (View<unsigned int,LayoutLeft,typename t_host::execution_space> ("DualView::modified_host"))
{}
//! Copy constructor (shallow copy)
template<class SS, class LS, class DS, class MS>
DualView (const DualView<SS,LS,DS,MS>& src) :
d_view (src.d_view),
h_view (src.h_view),
modified_device (src.modified_device),
modified_host (src.modified_host)
{}
/// \brief Create DualView from existing device and host View objects.
///
/// This constructor assumes that the device and host View objects
/// are synchronized. You, the caller, are responsible for making
/// sure this is the case before calling this constructor. After
/// this constructor returns, you may use DualView's sync() and
/// modify() methods to ensure synchronization of the View objects.
///
/// \param d_view_ Device View
/// \param h_view_ Host View (must have type t_host = t_dev::HostMirror)
DualView (const t_dev& d_view_, const t_host& h_view_) :
d_view (d_view_),
h_view (h_view_),
- modified_device (View<unsigned int,LayoutLeft,host_mirror_space> ("DualView::modified_device")),
- modified_host (View<unsigned int,LayoutLeft,host_mirror_space> ("DualView::modified_host"))
+ modified_device (View<unsigned int,LayoutLeft,typename t_host::execution_space> ("DualView::modified_device")),
+ modified_host (View<unsigned int,LayoutLeft,typename t_host::execution_space> ("DualView::modified_host"))
{
Impl::assert_shapes_are_equal (d_view.shape (), h_view.shape ());
}
//@}
//! \name Methods for synchronizing, marking as modified, and getting Views.
//@{
/// \brief Return a View on a specific device \c Device.
///
/// Please don't be afraid of the if_c expression in the return
/// value's type. That just tells the method what the return type
/// should be: t_dev if the \c Device template parameter matches
/// this DualView's device type, else t_host.
///
/// For example, suppose you create a DualView on Cuda, like this:
/// \code
/// typedef Kokkos::DualView<float, Kokkos::LayoutRight, Kokkos::Cuda> dual_view_type;
/// dual_view_type DV ("my dual view", 100);
/// \endcode
/// If you want to get the CUDA device View, do this:
/// \code
/// typename dual_view_type::t_dev cudaView = DV.view<Kokkos::Cuda> ();
/// \endcode
/// and if you want to get the host mirror of that View, do this:
/// \code
/// typedef typename Kokkos::HostSpace::execution_space host_device_type;
/// typename dual_view_type::t_host hostView = DV.view<host_device_type> ();
/// \endcode
template< class Device >
KOKKOS_INLINE_FUNCTION
const typename Impl::if_c<
Impl::is_same<typename t_dev::memory_space,
typename Device::memory_space>::value,
t_dev,
t_host>::type& view () const
{
return Impl::if_c<
Impl::is_same<
typename t_dev::memory_space,
typename Device::memory_space>::value,
t_dev,
t_host >::select (d_view , h_view);
}
/// \brief Update data on device or host only if data in the other
/// space has been marked as modified.
///
/// If \c Device is the same as this DualView's device type, then
/// copy data from host to device. Otherwise, copy data from device
/// to host. In either case, only copy if the source of the copy
/// has been modified.
///
/// This is a one-way synchronization only. If the target of the
/// copy has been modified, this operation will discard those
/// modifications. It will also reset both device and host modified
/// flags.
///
/// \note This method doesn't know on its own whether you modified
/// the data in either View. You must manually mark modified data
/// as modified, by calling the modify() method with the
/// appropriate template parameter.
template<class Device>
void sync( const typename Impl::enable_if<
( Impl::is_same< typename traits::data_type , typename traits::non_const_data_type>::value) ||
( Impl::is_same< Device , int>::value)
, int >::type& = 0)
{
const unsigned int dev =
Impl::if_c<
Impl::is_same<
typename t_dev::memory_space,
typename Device::memory_space>::value ,
unsigned int,
unsigned int>::select (1, 0);
if (dev) { // if Device is the same as DualView's device type
if ((modified_host () > 0) && (modified_host () >= modified_device ())) {
deep_copy (d_view, h_view);
modified_host() = modified_device() = 0;
}
} else { // hopefully Device is the same as DualView's host type
if ((modified_device () > 0) && (modified_device () >= modified_host ())) {
deep_copy (h_view, d_view);
modified_host() = modified_device() = 0;
}
}
+ if(Impl::is_same<typename t_host::memory_space,typename t_dev::memory_space>::value) {
+ t_dev::execution_space::fence();
+ t_host::execution_space::fence();
+ }
}
template<class Device>
void sync ( const typename Impl::enable_if<
( ! Impl::is_same< typename traits::data_type , typename traits::non_const_data_type>::value ) ||
( Impl::is_same< Device , int>::value)
, int >::type& = 0 )
{
const unsigned int dev =
Impl::if_c<
Impl::is_same<
typename t_dev::memory_space,
typename Device::memory_space>::value,
unsigned int,
unsigned int>::select (1, 0);
if (dev) { // if Device is the same as DualView's device type
if ((modified_host () > 0) && (modified_host () >= modified_device ())) {
Impl::throw_runtime_exception("Calling sync on a DualView with a const datatype.");
}
} else { // hopefully Device is the same as DualView's host type
if ((modified_device () > 0) && (modified_device () >= modified_host ())) {
Impl::throw_runtime_exception("Calling sync on a DualView with a const datatype.");
}
}
}
/// \brief Mark data as modified on the given device \c Device.
///
/// If \c Device is the same as this DualView's device type, then
/// mark the device's data as modified. Otherwise, mark the host's
/// data as modified.
template<class Device>
void modify () {
const unsigned int dev =
Impl::if_c<
Impl::is_same<
typename t_dev::memory_space,
typename Device::memory_space>::value,
unsigned int,
unsigned int>::select (1, 0);
if (dev) { // if Device is the same as DualView's device type
// Increment the device's modified count.
modified_device () = (modified_device () > modified_host () ?
modified_device () : modified_host ()) + 1;
} else { // hopefully Device is the same as DualView's host type
// Increment the host's modified count.
modified_host () = (modified_device () > modified_host () ?
modified_device () : modified_host ()) + 1;
}
}
//@}
//! \name Methods for reallocating or resizing the View objects.
//@{
/// \brief Reallocate both View objects.
///
/// This discards any existing contents of the objects, and resets
/// their modified flags. It does <i>not</i> copy the old contents
/// of either View into the new View objects.
void realloc( const size_t n0 = 0 ,
const size_t n1 = 0 ,
const size_t n2 = 0 ,
const size_t n3 = 0 ,
const size_t n4 = 0 ,
const size_t n5 = 0 ,
const size_t n6 = 0 ,
const size_t n7 = 0 ) {
::Kokkos::realloc(d_view,n0,n1,n2,n3,n4,n5,n6,n7);
-#if defined( CUDA_VERSION ) && ( 6000 <= CUDA_VERSION ) && defined(KOKKOS_USE_CUDA_UVM)
- h_view = d_view ;
-#else
h_view = create_mirror_view( d_view );
-#endif
+
/* Reset dirty flags */
modified_device() = modified_host() = 0;
}
/// \brief Resize both views, copying old contents into new if necessary.
///
/// This method only copies the old contents into the new View
/// objects for the device which was last marked as modified.
void resize( const size_t n0 = 0 ,
const size_t n1 = 0 ,
const size_t n2 = 0 ,
const size_t n3 = 0 ,
const size_t n4 = 0 ,
const size_t n5 = 0 ,
const size_t n6 = 0 ,
const size_t n7 = 0 ) {
if(modified_device() >= modified_host()) {
/* Resize on Device */
::Kokkos::resize(d_view,n0,n1,n2,n3,n4,n5,n6,n7);
-#if defined( CUDA_VERSION ) && ( 6000 <= CUDA_VERSION ) && defined(KOKKOS_USE_CUDA_UVM)
- h_view = d_view ;
-#else
h_view = create_mirror_view( d_view );
-#endif
/* Mark Device copy as modified */
modified_device() = modified_device()+1;
} else {
/* Realloc on Device */
::Kokkos::realloc(d_view,n0,n1,n2,n3,n4,n5,n6,n7);
-#if defined( CUDA_VERSION ) && ( 6000 <= CUDA_VERSION ) && defined(KOKKOS_USE_CUDA_UVM)
- t_host temp_view = d_view ;
-#else
t_host temp_view = create_mirror_view( d_view );
-#endif
/* Remap on Host */
- Impl::ViewRemap< t_host , t_host >( temp_view , h_view );
+ Kokkos::deep_copy( temp_view , h_view );
+
h_view = temp_view;
/* Mark Host copy as modified */
modified_host() = modified_host()+1;
}
}
//@}
//! \name Methods for getting capacity, stride, or dimension(s).
//@{
//! The allocation size (same as Kokkos::View::capacity).
size_t capacity() const {
+#if defined( KOKKOS_USING_EXPERIMENTAL_VIEW )
+ return d_view.span();
+#else
return d_view.capacity();
+#endif
}
//! Get stride(s) for each dimension.
template< typename iType>
void stride(iType* stride_) const {
d_view.stride(stride_);
}
/* \brief return size of dimension 0 */
size_t dimension_0() const {return d_view.dimension_0();}
/* \brief return size of dimension 1 */
size_t dimension_1() const {return d_view.dimension_1();}
/* \brief return size of dimension 2 */
size_t dimension_2() const {return d_view.dimension_2();}
/* \brief return size of dimension 3 */
size_t dimension_3() const {return d_view.dimension_3();}
/* \brief return size of dimension 4 */
size_t dimension_4() const {return d_view.dimension_4();}
/* \brief return size of dimension 5 */
size_t dimension_5() const {return d_view.dimension_5();}
/* \brief return size of dimension 6 */
size_t dimension_6() const {return d_view.dimension_6();}
/* \brief return size of dimension 7 */
size_t dimension_7() const {return d_view.dimension_7();}
//@}
};
+} // namespace Kokkos
//
// Partial specializations of Kokkos::subview() for DualView objects.
//
-template< class DstViewType ,
- class T , class L , class D , class M ,
+namespace Kokkos {
+namespace Impl {
+
+template< class SrcDataType , class SrcArg1Type , class SrcArg2Type , class SrcArg3Type
+ , class SubArg0_type , class SubArg1_type , class SubArg2_type , class SubArg3_type
+ , class SubArg4_type , class SubArg5_type , class SubArg6_type , class SubArg7_type
+ >
+struct ViewSubview< DualView< SrcDataType , SrcArg1Type , SrcArg2Type , SrcArg3Type >
+ , SubArg0_type , SubArg1_type , SubArg2_type , SubArg3_type
+ , SubArg4_type , SubArg5_type , SubArg6_type , SubArg7_type >
+{
+private:
+
+ typedef DualView< SrcDataType , SrcArg1Type , SrcArg2Type , SrcArg3Type > SrcViewType ;
+
+ enum { V0 = Impl::is_same< SubArg0_type , void >::value ? 1 : 0 };
+ enum { V1 = Impl::is_same< SubArg1_type , void >::value ? 1 : 0 };
+ enum { V2 = Impl::is_same< SubArg2_type , void >::value ? 1 : 0 };
+ enum { V3 = Impl::is_same< SubArg3_type , void >::value ? 1 : 0 };
+ enum { V4 = Impl::is_same< SubArg4_type , void >::value ? 1 : 0 };
+ enum { V5 = Impl::is_same< SubArg5_type , void >::value ? 1 : 0 };
+ enum { V6 = Impl::is_same< SubArg6_type , void >::value ? 1 : 0 };
+ enum { V7 = Impl::is_same< SubArg7_type , void >::value ? 1 : 0 };
+
+ // The source view rank must be equal to the input argument rank
+ // Once a void argument is encountered all subsequent arguments must be void.
+ enum { InputRank =
+ Impl::StaticAssert<( SrcViewType::rank ==
+ ( V0 ? 0 : (
+ V1 ? 1 : (
+ V2 ? 2 : (
+ V3 ? 3 : (
+ V4 ? 4 : (
+ V5 ? 5 : (
+ V6 ? 6 : (
+ V7 ? 7 : 8 ))))))) ))
+ &&
+ ( SrcViewType::rank ==
+ ( 8 - ( V0 + V1 + V2 + V3 + V4 + V5 + V6 + V7 ) ) )
+ >::value ? SrcViewType::rank : 0 };
+
+ enum { R0 = Impl::ViewOffsetRange< SubArg0_type >::is_range ? 1 : 0 };
+ enum { R1 = Impl::ViewOffsetRange< SubArg1_type >::is_range ? 1 : 0 };
+ enum { R2 = Impl::ViewOffsetRange< SubArg2_type >::is_range ? 1 : 0 };
+ enum { R3 = Impl::ViewOffsetRange< SubArg3_type >::is_range ? 1 : 0 };
+ enum { R4 = Impl::ViewOffsetRange< SubArg4_type >::is_range ? 1 : 0 };
+ enum { R5 = Impl::ViewOffsetRange< SubArg5_type >::is_range ? 1 : 0 };
+ enum { R6 = Impl::ViewOffsetRange< SubArg6_type >::is_range ? 1 : 0 };
+ enum { R7 = Impl::ViewOffsetRange< SubArg7_type >::is_range ? 1 : 0 };
+
+ enum { OutputRank = unsigned(R0) + unsigned(R1) + unsigned(R2) + unsigned(R3)
+ + unsigned(R4) + unsigned(R5) + unsigned(R6) + unsigned(R7) };
+
+ // Reverse
+ enum { R0_rev = 0 == InputRank ? 0u : (
+ 1 == InputRank ? unsigned(R0) : (
+ 2 == InputRank ? unsigned(R1) : (
+ 3 == InputRank ? unsigned(R2) : (
+ 4 == InputRank ? unsigned(R3) : (
+ 5 == InputRank ? unsigned(R4) : (
+ 6 == InputRank ? unsigned(R5) : (
+ 7 == InputRank ? unsigned(R6) : unsigned(R7) ))))))) };
+
+ typedef typename SrcViewType::array_layout SrcViewLayout ;
+
+ // Choose array layout, attempting to preserve original layout if at all possible.
+ typedef typename Impl::if_c<
+ ( // Same Layout IF
+ // OutputRank 0
+ ( OutputRank == 0 )
+ ||
+ // OutputRank 1 or 2, InputLayout Left, Interval 0
+ // because single stride one or second index has a stride.
+ ( OutputRank <= 2 && R0 && Impl::is_same<SrcViewLayout,LayoutLeft>::value )
+ ||
+ // OutputRank 1 or 2, InputLayout Right, Interval [InputRank-1]
+ // because single stride one or second index has a stride.
+ ( OutputRank <= 2 && R0_rev && Impl::is_same<SrcViewLayout,LayoutRight>::value )
+ ), SrcViewLayout , Kokkos::LayoutStride >::type OutputViewLayout ;
+
+ // Choose data type as a purely dynamic rank array to accommodate a runtime range.
+ typedef typename Impl::if_c< OutputRank == 0 , typename SrcViewType::value_type ,
+ typename Impl::if_c< OutputRank == 1 , typename SrcViewType::value_type *,
+ typename Impl::if_c< OutputRank == 2 , typename SrcViewType::value_type **,
+ typename Impl::if_c< OutputRank == 3 , typename SrcViewType::value_type ***,
+ typename Impl::if_c< OutputRank == 4 , typename SrcViewType::value_type ****,
+ typename Impl::if_c< OutputRank == 5 , typename SrcViewType::value_type *****,
+ typename Impl::if_c< OutputRank == 6 , typename SrcViewType::value_type ******,
+ typename Impl::if_c< OutputRank == 7 , typename SrcViewType::value_type *******,
+ typename SrcViewType::value_type ********
+ >::type >::type >::type >::type >::type >::type >::type >::type OutputData ;
+
+ // Choose space.
+ // If the source view's template arg1 or arg2 is a space then use it,
+ // otherwise use the source view's execution space.
+
+ typedef typename Impl::if_c< Impl::is_space< SrcArg1Type >::value , SrcArg1Type ,
+ typename Impl::if_c< Impl::is_space< SrcArg2Type >::value , SrcArg2Type , typename SrcViewType::execution_space
+ >::type >::type OutputSpace ;
+
+public:
+
+ // If keeping the layout then match non-data type arguments
+ // else keep execution space and memory traits.
+ typedef typename
+ Impl::if_c< Impl::is_same< SrcViewLayout , OutputViewLayout >::value
+ , Kokkos::DualView< OutputData , SrcArg1Type , SrcArg2Type , SrcArg3Type >
+ , Kokkos::DualView< OutputData , OutputViewLayout , OutputSpace
+ , typename SrcViewType::memory_traits >
+ >::type type ;
+};
+
+} /* namespace Impl */
+} /* namespace Kokkos */
+
+namespace Kokkos {
+
+template< class D , class A1 , class A2 , class A3 ,
class ArgType0 >
-DstViewType
-subview( const DualView<T,L,D,M> & src ,
+typename Impl::ViewSubview< DualView<D,A1,A2,A3>
+ , ArgType0 , void , void , void
+ , void , void , void , void
+ >::type
+subview( const DualView<D,A1,A2,A3> & src ,
const ArgType0 & arg0 )
{
+ typedef typename
+ Impl::ViewSubview< DualView<D,A1,A2,A3>
+ , ArgType0 , void , void , void
+ , void , void , void , void
+ >::type
+ DstViewType ;
DstViewType sub_view;
- sub_view.d_view = subview<typename DstViewType::t_dev>(src.d_view,arg0);
- sub_view.h_view = subview<typename DstViewType::t_host>(src.h_view,arg0);
+ sub_view.d_view = subview(src.d_view,arg0);
+ sub_view.h_view = subview(src.h_view,arg0);
sub_view.modified_device = src.modified_device;
sub_view.modified_host = src.modified_host;
return sub_view;
}
-template< class DstViewType ,
- class T , class L , class D , class M ,
+template< class D , class A1 , class A2 , class A3 ,
class ArgType0 , class ArgType1 >
-DstViewType
-subview( const DualView<T,L,D,M> & src ,
+typename Impl::ViewSubview< DualView<D,A1,A2,A3>
+ , ArgType0 , ArgType1 , void , void
+ , void , void , void , void
+ >::type
+subview( const DualView<D,A1,A2,A3> & src ,
const ArgType0 & arg0 ,
const ArgType1 & arg1 )
{
+ typedef typename
+ Impl::ViewSubview< DualView<D,A1,A2,A3>
+ , ArgType0 , ArgType1 , void , void
+ , void , void , void , void
+ >::type
+ DstViewType ;
DstViewType sub_view;
- sub_view.d_view = subview<typename DstViewType::t_dev>(src.d_view,arg0,arg1);
- sub_view.h_view = subview<typename DstViewType::t_host>(src.h_view,arg0,arg1);
+ sub_view.d_view = subview(src.d_view,arg0,arg1);
+ sub_view.h_view = subview(src.h_view,arg0,arg1);
sub_view.modified_device = src.modified_device;
sub_view.modified_host = src.modified_host;
return sub_view;
}
-template< class DstViewType ,
- class T , class L , class D , class M ,
+template< class D , class A1 , class A2 , class A3 ,
class ArgType0 , class ArgType1 , class ArgType2 >
-DstViewType
-subview( const DualView<T,L,D,M> & src ,
+typename Impl::ViewSubview< DualView<D,A1,A2,A3>
+ , ArgType0 , ArgType1 , ArgType2 , void
+ , void , void , void , void
+ >::type
+subview( const DualView<D,A1,A2,A3> & src ,
const ArgType0 & arg0 ,
const ArgType1 & arg1 ,
const ArgType2 & arg2 )
{
+ typedef typename
+ Impl::ViewSubview< DualView<D,A1,A2,A3>
+ , ArgType0 , ArgType1 , ArgType2 , void
+ , void , void , void , void
+ >::type
+ DstViewType ;
DstViewType sub_view;
- sub_view.d_view = subview<typename DstViewType::t_dev>(src.d_view,arg0,arg1,arg2);
- sub_view.h_view = subview<typename DstViewType::t_host>(src.h_view,arg0,arg1,arg2);
+ sub_view.d_view = subview(src.d_view,arg0,arg1,arg2);
+ sub_view.h_view = subview(src.h_view,arg0,arg1,arg2);
sub_view.modified_device = src.modified_device;
sub_view.modified_host = src.modified_host;
return sub_view;
}
-template< class DstViewType ,
- class T , class L , class D , class M ,
+template< class D , class A1 , class A2 , class A3 ,
class ArgType0 , class ArgType1 , class ArgType2 , class ArgType3 >
-DstViewType
-subview( const DualView<T,L,D,M> & src ,
+typename Impl::ViewSubview< DualView<D,A1,A2,A3>
+ , ArgType0 , ArgType1 , ArgType2 , ArgType3
+ , void , void , void , void
+ >::type
+subview( const DualView<D,A1,A2,A3> & src ,
const ArgType0 & arg0 ,
const ArgType1 & arg1 ,
const ArgType2 & arg2 ,
const ArgType3 & arg3 )
{
+ typedef typename
+ Impl::ViewSubview< DualView<D,A1,A2,A3>
+ , ArgType0 , ArgType1 , ArgType2 , ArgType3
+ , void , void , void , void
+ >::type
+ DstViewType ;
DstViewType sub_view;
- sub_view.d_view = subview<typename DstViewType::t_dev>(src.d_view,arg0,arg1,arg2,arg3);
- sub_view.h_view = subview<typename DstViewType::t_host>(src.h_view,arg0,arg1,arg2,arg3);
+ sub_view.d_view = subview(src.d_view,arg0,arg1,arg2,arg3);
+ sub_view.h_view = subview(src.h_view,arg0,arg1,arg2,arg3);
sub_view.modified_device = src.modified_device;
sub_view.modified_host = src.modified_host;
return sub_view;
}
-template< class DstViewType ,
- class T , class L , class D , class M ,
+template< class D , class A1 , class A2 , class A3 ,
class ArgType0 , class ArgType1 , class ArgType2 , class ArgType3 ,
class ArgType4 >
-DstViewType
-subview( const DualView<T,L,D,M> & src ,
+typename Impl::ViewSubview< DualView<D,A1,A2,A3>
+ , ArgType0 , ArgType1 , ArgType2 , ArgType3
+ , ArgType4 , void , void , void
+ >::type
+subview( const DualView<D,A1,A2,A3> & src ,
const ArgType0 & arg0 ,
const ArgType1 & arg1 ,
const ArgType2 & arg2 ,
const ArgType3 & arg3 ,
const ArgType4 & arg4 )
{
+ typedef typename
+ Impl::ViewSubview< DualView<D,A1,A2,A3>
+ , ArgType0 , ArgType1 , ArgType2 , ArgType3
+ , ArgType4 , void , void ,void
+ >::type
+ DstViewType ;
DstViewType sub_view;
- sub_view.d_view = subview<typename DstViewType::t_dev>(src.d_view,arg0,arg1,arg2,arg3,arg4);
- sub_view.h_view = subview<typename DstViewType::t_host>(src.h_view,arg0,arg1,arg2,arg3,arg4);
+ sub_view.d_view = subview(src.d_view,arg0,arg1,arg2,arg3,arg4);
+ sub_view.h_view = subview(src.h_view,arg0,arg1,arg2,arg3,arg4);
sub_view.modified_device = src.modified_device;
sub_view.modified_host = src.modified_host;
return sub_view;
}
-template< class DstViewType ,
- class T , class L , class D , class M ,
+template< class D , class A1 , class A2 , class A3 ,
class ArgType0 , class ArgType1 , class ArgType2 , class ArgType3 ,
class ArgType4 , class ArgType5 >
-DstViewType
-subview( const DualView<T,L,D,M> & src ,
+typename Impl::ViewSubview< DualView<D,A1,A2,A3>
+ , ArgType0 , ArgType1 , ArgType2 , ArgType3
+ , ArgType4 , ArgType5 , void , void
+ >::type
+subview( const DualView<D,A1,A2,A3> & src ,
const ArgType0 & arg0 ,
const ArgType1 & arg1 ,
const ArgType2 & arg2 ,
const ArgType3 & arg3 ,
const ArgType4 & arg4 ,
const ArgType5 & arg5 )
{
+ typedef typename
+ Impl::ViewSubview< DualView<D,A1,A2,A3>
+ , ArgType0 , ArgType1 , ArgType2 , ArgType3
+ , ArgType4 , ArgType5 , void , void
+ >::type
+ DstViewType ;
DstViewType sub_view;
- sub_view.d_view = subview<typename DstViewType::t_dev>(src.d_view,arg0,arg1,arg2,arg3,arg4,arg5);
- sub_view.h_view = subview<typename DstViewType::t_host>(src.h_view,arg0,arg1,arg2,arg3,arg4,arg5);
+ sub_view.d_view = subview(src.d_view,arg0,arg1,arg2,arg3,arg4,arg5);
+ sub_view.h_view = subview(src.h_view,arg0,arg1,arg2,arg3,arg4,arg5);
sub_view.modified_device = src.modified_device;
sub_view.modified_host = src.modified_host;
return sub_view;
}
-template< class DstViewType ,
- class T , class L , class D , class M ,
+template< class D , class A1 , class A2 , class A3 ,
class ArgType0 , class ArgType1 , class ArgType2 , class ArgType3 ,
class ArgType4 , class ArgType5 , class ArgType6 >
-DstViewType
-subview( const DualView<T,L,D,M> & src ,
+typename Impl::ViewSubview< DualView<D,A1,A2,A3>
+ , ArgType0 , ArgType1 , ArgType2 , ArgType3
+ , ArgType4 , ArgType5 , ArgType6 , void
+ >::type
+subview( const DualView<D,A1,A2,A3> & src ,
const ArgType0 & arg0 ,
const ArgType1 & arg1 ,
const ArgType2 & arg2 ,
const ArgType3 & arg3 ,
const ArgType4 & arg4 ,
const ArgType5 & arg5 ,
const ArgType6 & arg6 )
{
+ typedef typename
+ Impl::ViewSubview< DualView<D,A1,A2,A3>
+ , ArgType0 , ArgType1 , ArgType2 , ArgType3
+ , ArgType4 , ArgType5 , ArgType6 , void
+ >::type
+ DstViewType ;
DstViewType sub_view;
- sub_view.d_view = subview<typename DstViewType::t_dev>(src.d_view,arg0,arg1,arg2,arg3,arg4,arg5,arg6);
- sub_view.h_view = subview<typename DstViewType::t_host>(src.h_view,arg0,arg1,arg2,arg3,arg4,arg5,arg6);
+ sub_view.d_view = subview(src.d_view,arg0,arg1,arg2,arg3,arg4,arg5,arg6);
+ sub_view.h_view = subview(src.h_view,arg0,arg1,arg2,arg3,arg4,arg5,arg6);
sub_view.modified_device = src.modified_device;
sub_view.modified_host = src.modified_host;
return sub_view;
}
-template< class DstViewType ,
- class T , class L , class D , class M ,
+template< class D , class A1 , class A2 , class A3 ,
class ArgType0 , class ArgType1 , class ArgType2 , class ArgType3 ,
class ArgType4 , class ArgType5 , class ArgType6 , class ArgType7 >
-DstViewType
-subview( const DualView<T,L,D,M> & src ,
+typename Impl::ViewSubview< DualView<D,A1,A2,A3>
+ , ArgType0 , ArgType1 , ArgType2 , ArgType3
+ , ArgType4 , ArgType5 , ArgType6 , ArgType7
+ >::type
+subview( const DualView<D,A1,A2,A3> & src ,
const ArgType0 & arg0 ,
const ArgType1 & arg1 ,
const ArgType2 & arg2 ,
const ArgType3 & arg3 ,
const ArgType4 & arg4 ,
const ArgType5 & arg5 ,
const ArgType6 & arg6 ,
const ArgType7 & arg7 )
{
+ typedef typename
+ Impl::ViewSubview< DualView<D,A1,A2,A3>
+ , ArgType0 , ArgType1 , ArgType2 , ArgType3
+ , ArgType4 , ArgType5 , ArgType6 , ArgType7
+ >::type
+ DstViewType ;
DstViewType sub_view;
- sub_view.d_view = subview<typename DstViewType::t_dev>(src.d_view,arg0,arg1,arg2,arg3,arg4,arg5,arg6,arg7);
- sub_view.h_view = subview<typename DstViewType::t_host>(src.h_view,arg0,arg1,arg2,arg3,arg4,arg5,arg6,arg7);
+ sub_view.d_view = subview(src.d_view,arg0,arg1,arg2,arg3,arg4,arg5,arg6,arg7);
+ sub_view.h_view = subview(src.h_view,arg0,arg1,arg2,arg3,arg4,arg5,arg6,arg7);
sub_view.modified_device = src.modified_device;
sub_view.modified_host = src.modified_host;
return sub_view;
}
//
// Partial specialization of Kokkos::deep_copy() for DualView objects.
//
template< class DT , class DL , class DD , class DM ,
class ST , class SL , class SD , class SM >
void
deep_copy (DualView<DT,DL,DD,DM> dst, // trust me, this must not be a reference
const DualView<ST,SL,SD,SM>& src )
{
if (src.modified_device () >= src.modified_host ()) {
deep_copy (dst.d_view, src.d_view);
dst.template modify<typename DualView<DT,DL,DD,DM>::device_type> ();
} else {
deep_copy (dst.h_view, src.h_view);
dst.template modify<typename DualView<DT,DL,DD,DM>::host_mirror_space> ();
}
}
} // namespace Kokkos
#endif
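The modify()/sync() protocol documented above boils down to: mark the side you touched, then sync towards the side you want brought up to date. A minimal sketch follows, assuming a DualView of doubles; the sizes and names are illustrative only.

    #include <Kokkos_Core.hpp>
    #include <Kokkos_DualView.hpp>

    void fill_and_sync()
    {
      typedef Kokkos::DualView<double*> dual_type;
      dual_type dv("values", 100);

      // Fill on the host, then mark the host copy as modified.
      for (size_t i = 0; i < dv.dimension_0(); ++i) dv.h_view(i) = double(i);
      dv.modify<dual_type::t_host::execution_space>();

      // Copies host -> device because only the host side is marked modified.
      dv.sync<dual_type::t_dev::execution_space>();

      // ... run device kernels on dv.d_view, then mirror the calls:
      // dv.modify<dual_type::t_dev::execution_space>();
      // dv.sync<dual_type::t_host::execution_space>();   // device -> host
    }
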
diff --git a/lib/kokkos/containers/src/Kokkos_Functional.hpp b/lib/kokkos/containers/src/Kokkos_Functional.hpp
index 74c3f7093..5c7350ef1 100755
--- a/lib/kokkos/containers/src/Kokkos_Functional.hpp
+++ b/lib/kokkos/containers/src/Kokkos_Functional.hpp
@@ -1,132 +1,173 @@
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+
#ifndef KOKKOS_FUNCTIONAL_HPP
#define KOKKOS_FUNCTIONAL_HPP
#include <Kokkos_Macros.hpp>
#include <impl/Kokkos_Functional_impl.hpp>
namespace Kokkos {
// These should work for most types
template <typename T>
struct pod_hash
{
typedef T argument_type;
typedef T first_argument_type;
typedef uint32_t second_argument_type;
typedef uint32_t result_type;
KOKKOS_FORCEINLINE_FUNCTION
uint32_t operator()(T const & t) const
{ return Impl::MurmurHash3_x86_32( &t, sizeof(T), 0); }
KOKKOS_FORCEINLINE_FUNCTION
uint32_t operator()(T const & t, uint32_t seed) const
{ return Impl::MurmurHash3_x86_32( &t, sizeof(T), seed); }
};
template <typename T>
struct pod_equal_to
{
typedef T first_argument_type;
typedef T second_argument_type;
typedef bool result_type;
KOKKOS_FORCEINLINE_FUNCTION
bool operator()(T const & a, T const & b) const
{ return Impl::bitwise_equal(&a,&b); }
};
template <typename T>
struct pod_not_equal_to
{
typedef T first_argument_type;
typedef T second_argument_type;
typedef bool result_type;
KOKKOS_FORCEINLINE_FUNCTION
bool operator()(T const & a, T const & b) const
{ return !Impl::bitwise_equal(&a,&b); }
};
template <typename T>
struct equal_to
{
typedef T first_argument_type;
typedef T second_argument_type;
typedef bool result_type;
KOKKOS_FORCEINLINE_FUNCTION
bool operator()(T const & a, T const & b) const
{ return a == b; }
};
template <typename T>
struct not_equal_to
{
typedef T first_argument_type;
typedef T second_argument_type;
typedef bool result_type;
KOKKOS_FORCEINLINE_FUNCTION
bool operator()(T const & a, T const & b) const
{ return a != b; }
};
template <typename T>
struct greater
{
typedef T first_argument_type;
typedef T second_argument_type;
typedef bool result_type;
KOKKOS_FORCEINLINE_FUNCTION
bool operator()(T const & a, T const & b) const
{ return a > b; }
};
template <typename T>
struct less
{
typedef T first_argument_type;
typedef T second_argument_type;
typedef bool result_type;
KOKKOS_FORCEINLINE_FUNCTION
bool operator()(T const & a, T const & b) const
{ return a < b; }
};
template <typename T>
struct greater_equal
{
typedef T first_argument_type;
typedef T second_argument_type;
typedef bool result_type;
KOKKOS_FORCEINLINE_FUNCTION
bool operator()(T const & a, T const & b) const
{ return a >= b; }
};
template <typename T>
struct less_equal
{
typedef T first_argument_type;
typedef T second_argument_type;
typedef bool result_type;
KOKKOS_FORCEINLINE_FUNCTION
bool operator()(T const & a, T const & b) const
{ return a <= b; }
};
} // namespace Kokkos
#endif //KOKKOS_FUNCTIONAL_HPP
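These functors mirror their std:: counterparts but are marked KOKKOS_FORCEINLINE_FUNCTION, so they can serve as hash and comparison policies in device code. The sketch below shows the typical use as UnorderedMap policy parameters; the key type and capacity are assumptions for illustration.

    #include <Kokkos_Core.hpp>
    #include <Kokkos_UnorderedMap.hpp>
    #include <Kokkos_Functional.hpp>

    struct CellKey { int i, j, k; };              // plain-old-data key

    typedef Kokkos::UnorderedMap<
        CellKey, double,
        Kokkos::DefaultExecutionSpace,
        Kokkos::pod_hash<CellKey>,                // MurmurHash3 over the raw key bytes
        Kokkos::pod_equal_to<CellKey>             // bitwise comparison of the keys
      > cell_map_type;

    // cell_map_type cells(1 << 20);              // capacity hint; insert/find from kernels
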
diff --git a/lib/kokkos/containers/src/Kokkos_SegmentedView.hpp b/lib/kokkos/containers/src/Kokkos_SegmentedView.hpp
index 6730757b3..3f328ba95 100755
--- a/lib/kokkos/containers/src/Kokkos_SegmentedView.hpp
+++ b/lib/kokkos/containers/src/Kokkos_SegmentedView.hpp
@@ -1,478 +1,531 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_SEGMENTED_VIEW_HPP_
#define KOKKOS_SEGMENTED_VIEW_HPP_
#include <Kokkos_Core.hpp>
#include <impl/Kokkos_Error.hpp>
#include <cstdio>
+#if ! defined( KOKKOS_USING_EXPERIMENTAL_VIEW )
+
namespace Kokkos {
+namespace Experimental {
namespace Impl {
template<class DataType, class Arg1Type, class Arg2Type, class Arg3Type>
struct delete_segmented_view;
template<class MemorySpace>
inline
void DeviceSetAllocatableMemorySize(size_t) {}
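// (editor's note) The CUDA specializations below query the device malloc heap limit and,
// if it is smaller than the requested size, raise it to twice that size so that in-kernel
// 'new' performed by SegmentedView::grow() has room to succeed.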
#if defined( KOKKOS_HAVE_CUDA )
template<>
inline
void DeviceSetAllocatableMemorySize<Kokkos::CudaSpace>(size_t size) {
#ifdef __CUDACC__
size_t size_limit;
cudaDeviceGetLimit(&size_limit,cudaLimitMallocHeapSize);
if(size_limit<size)
cudaDeviceSetLimit(cudaLimitMallocHeapSize,2*size);
cudaDeviceGetLimit(&size_limit,cudaLimitMallocHeapSize);
#endif
}
template<>
inline
void DeviceSetAllocatableMemorySize<Kokkos::CudaUVMSpace>(size_t size) {
#ifdef __CUDACC__
size_t size_limit;
cudaDeviceGetLimit(&size_limit,cudaLimitMallocHeapSize);
if(size_limit<size)
cudaDeviceSetLimit(cudaLimitMallocHeapSize,2*size);
cudaDeviceGetLimit(&size_limit,cudaLimitMallocHeapSize);
#endif
}
#endif /* #if defined( KOKKOS_HAVE_CUDA ) */
}
template< class DataType ,
class Arg1Type = void ,
class Arg2Type = void ,
class Arg3Type = void>
-class SegmentedView : public ViewTraits< DataType , Arg1Type , Arg2Type, Arg3Type >
+class SegmentedView : public Kokkos::ViewTraits< DataType , Arg1Type , Arg2Type, Arg3Type >
{
public:
//! \name Typedefs for device types and various Kokkos::View specializations.
//@{
- typedef ViewTraits< DataType , Arg1Type , Arg2Type, Arg3Type > traits ;
+ typedef Kokkos::ViewTraits< DataType , Arg1Type , Arg2Type, Arg3Type > traits ;
//! The type of a Kokkos::View on the device.
- typedef View< typename traits::data_type ,
+ typedef Kokkos::View< typename traits::data_type ,
typename traits::array_layout ,
typename traits::memory_space ,
Kokkos::MemoryUnmanaged > t_dev ;
private:
Kokkos::View<t_dev*,typename traits::memory_space> segments_;
Kokkos::View<int,typename traits::memory_space> realloc_lock;
Kokkos::View<int,typename traits::memory_space> nsegments_;
size_t segment_length_;
size_t segment_length_m1_;
int max_segments_;
int segment_length_log2;
// Dimensions, cardinality, capacity, and offset computation for
// multidimensional array view of contiguous memory.
// Inherits from Impl::Shape
- typedef Impl::ViewOffset< typename traits::shape_type
+ typedef Kokkos::Impl::ViewOffset< typename traits::shape_type
, typename traits::array_layout
> offset_map_type ;
offset_map_type m_offset_map ;
- typedef View< typename traits::array_intrinsic_type ,
+ typedef Kokkos::View< typename traits::array_intrinsic_type ,
typename traits::array_layout ,
typename traits::memory_space ,
typename traits::memory_traits > array_type ;
- typedef View< typename traits::const_data_type ,
+ typedef Kokkos::View< typename traits::const_data_type ,
typename traits::array_layout ,
typename traits::memory_space ,
typename traits::memory_traits > const_type ;
- typedef View< typename traits::non_const_data_type ,
+ typedef Kokkos::View< typename traits::non_const_data_type ,
typename traits::array_layout ,
typename traits::memory_space ,
typename traits::memory_traits > non_const_type ;
- typedef View< typename traits::non_const_data_type ,
+ typedef Kokkos::View< typename traits::non_const_data_type ,
typename traits::array_layout ,
HostSpace ,
void > HostMirror ;
template< bool Accessible >
KOKKOS_INLINE_FUNCTION
- typename Impl::enable_if< Accessible , typename traits::size_type >::type
+ typename Kokkos::Impl::enable_if< Accessible , typename traits::size_type >::type
dimension_0_intern() const { return nsegments_() * segment_length_ ; }
template< bool Accessible >
KOKKOS_INLINE_FUNCTION
- typename Impl::enable_if< ! Accessible , typename traits::size_type >::type
+ typename Kokkos::Impl::enable_if< ! Accessible , typename traits::size_type >::type
dimension_0_intern() const
{
// In Host space
int n = 0 ;
#if ! defined( __CUDA_ARCH__ )
- Impl::DeepCopy< HostSpace , typename traits::memory_space >( & n , nsegments_.ptr_on_device() , sizeof(int) );
+ Kokkos::Impl::DeepCopy< HostSpace , typename traits::memory_space >( & n , nsegments_.ptr_on_device() , sizeof(int) );
#endif
return n * segment_length_ ;
}
public:
enum { Rank = traits::rank };
KOKKOS_INLINE_FUNCTION offset_map_type shape() const { return m_offset_map ; }
/* \brief return (current) size of dimension 0 */
KOKKOS_INLINE_FUNCTION typename traits::size_type dimension_0() const {
- enum { Accessible = Impl::VerifyExecutionCanAccessMemorySpace<
- Impl::ActiveExecutionMemorySpace, typename traits::memory_space >::value };
+ enum { Accessible = Kokkos::Impl::VerifyExecutionCanAccessMemorySpace<
+ Kokkos::Impl::ActiveExecutionMemorySpace, typename traits::memory_space >::value };
int n = SegmentedView::dimension_0_intern< Accessible >();
return n ;
}
/* \brief return size of dimension 1 */
KOKKOS_INLINE_FUNCTION typename traits::size_type dimension_1() const { return m_offset_map.N1 ; }
/* \brief return size of dimension 2 */
KOKKOS_INLINE_FUNCTION typename traits::size_type dimension_2() const { return m_offset_map.N2 ; }
/* \brief return size of dimension 3 */
KOKKOS_INLINE_FUNCTION typename traits::size_type dimension_3() const { return m_offset_map.N3 ; }
/* \brief return size of dimension 4 */
KOKKOS_INLINE_FUNCTION typename traits::size_type dimension_4() const { return m_offset_map.N4 ; }
/* \brief return size of dimension 5 */
KOKKOS_INLINE_FUNCTION typename traits::size_type dimension_5() const { return m_offset_map.N5 ; }
/* \brief return size of dimension 6 */
KOKKOS_INLINE_FUNCTION typename traits::size_type dimension_6() const { return m_offset_map.N6 ; }
/* \brief return size of dimension 7 */
KOKKOS_INLINE_FUNCTION typename traits::size_type dimension_7() const { return m_offset_map.N7 ; }
/* \brief return total size (product of all dimensions) */
KOKKOS_INLINE_FUNCTION typename traits::size_type size() const {
return dimension_0() *
m_offset_map.N1 * m_offset_map.N2 * m_offset_map.N3 * m_offset_map.N4 *
m_offset_map.N5 * m_offset_map.N6 * m_offset_map.N7 ;
}
template< typename iType >
KOKKOS_INLINE_FUNCTION
typename traits::size_type dimension( const iType & i ) const {
if(i==0)
return dimension_0();
else
- return Impl::dimension( m_offset_map , i );
+ return Kokkos::Impl::dimension( m_offset_map , i );
}
KOKKOS_INLINE_FUNCTION
typename traits::size_type capacity() {
return segments_.dimension_0() *
m_offset_map.N1 * m_offset_map.N2 * m_offset_map.N3 * m_offset_map.N4 *
m_offset_map.N5 * m_offset_map.N6 * m_offset_map.N7;
}
KOKKOS_INLINE_FUNCTION
typename traits::size_type get_num_segments() {
- enum { Accessible = Impl::VerifyExecutionCanAccessMemorySpace<
- Impl::ActiveExecutionMemorySpace, typename traits::memory_space >::value };
+ enum { Accessible = Kokkos::Impl::VerifyExecutionCanAccessMemorySpace<
+ Kokkos::Impl::ActiveExecutionMemorySpace, typename traits::memory_space >::value };
int n = SegmentedView::dimension_0_intern< Accessible >();
return n/segment_length_ ;
}
KOKKOS_INLINE_FUNCTION
typename traits::size_type get_max_segments() {
return max_segments_;
}
/// \brief Constructor that allocates View objects with an initial length of 0.
///
/// This constructor works mostly like the analogous constructor of View.
/// The first argument is a string label, which is entirely for your
/// benefit. (Different SegmentedView objects may have the same label if
/// you like.) The second argument 'view_length' is the size of the segments.
/// This number must be a power of two. The third argument n0 is the maximum
/// value for the first dimension of the segmented view. The maximum allocatable
/// number of segments is thus: (n0+view_length-1)/view_length.
/// The arguments that follow are the other dimensions (1-7) of the
/// View objects. For example, for a View with 3 runtime dimensions,
/// the first 4 integer arguments will be nonzero:
/// SegmentedView("Name",32768,10000000,8,4). This allocates a SegmentedView
/// with a maximum of 306 segments of dimension (32768,8,4). The logical size of
/// the segmented view is (n,8,4) with n between 0 and 10000000.
/// You may omit the integer arguments that follow.
template< class LabelType >
SegmentedView(const LabelType & label ,
const size_t view_length ,
const size_t n0 ,
const size_t n1 = 0 ,
const size_t n2 = 0 ,
const size_t n3 = 0 ,
const size_t n4 = 0 ,
const size_t n5 = 0 ,
const size_t n6 = 0 ,
const size_t n7 = 0
): segment_length_(view_length),segment_length_m1_(view_length-1)
{
segment_length_log2 = -1;
size_t l = segment_length_;
while(l>0) {
l>>=1;
segment_length_log2++;
}
l = 1<<segment_length_log2;
if(l!=segment_length_)
- Impl::throw_runtime_exception("Kokkos::SegmentedView requires a 'power of 2' segment length");
+ Kokkos::Impl::throw_runtime_exception("Kokkos::SegmentedView requires a 'power of 2' segment length");
max_segments_ = (n0+segment_length_m1_)/segment_length_;
Impl::DeviceSetAllocatableMemorySize<typename traits::memory_space>(segment_length_*max_segments_*sizeof(typename traits::value_type));
segments_ = Kokkos::View<t_dev*,typename traits::execution_space>(label , max_segments_);
realloc_lock = Kokkos::View<int,typename traits::execution_space>("Lock");
nsegments_ = Kokkos::View<int,typename traits::execution_space>("nviews");
m_offset_map.assign( n0, n1, n2, n3, n4, n5, n6, n7, n0*n1*n2*n3*n4*n5*n6*n7 );
}
KOKKOS_INLINE_FUNCTION
SegmentedView(const SegmentedView& src):
segments_(src.segments_),
realloc_lock (src.realloc_lock),
nsegments_ (src.nsegments_),
segment_length_(src.segment_length_),
segment_length_m1_(src.segment_length_m1_),
max_segments_ (src.max_segments_),
segment_length_log2(src.segment_length_log2),
m_offset_map (src.m_offset_map)
{}
KOKKOS_INLINE_FUNCTION
SegmentedView& operator= (const SegmentedView& src) {
segments_ = src.segments_;
realloc_lock = src.realloc_lock;
nsegments_ = src.nsegments_;
segment_length_= src.segment_length_;
segment_length_m1_= src.segment_length_m1_;
max_segments_ = src.max_segments_;
segment_length_log2= src.segment_length_log2;
m_offset_map = src.m_offset_map;
return *this;
}
~SegmentedView() {
- if (traits::execution_space::in_parallel()) return;
- int count = traits::memory_space::count(segments_.ptr_on_device());
- if(count == 1) {
+ if ( !segments_.tracker().ref_counting()) { return; }
+ size_t ref_count = segments_.tracker().ref_count();
+ if(ref_count == 1u) {
Kokkos::fence();
typename Kokkos::View<int,typename traits::execution_space>::HostMirror h_nviews("h_nviews");
Kokkos::deep_copy(h_nviews,nsegments_);
Kokkos::parallel_for(h_nviews(),Impl::delete_segmented_view<DataType , Arg1Type , Arg2Type, Arg3Type>(*this));
}
}
KOKKOS_INLINE_FUNCTION
t_dev get_segment(const int& i) const {
return segments_[i];
}
template< class MemberType>
KOKKOS_INLINE_FUNCTION
void grow (MemberType& team_member, const size_t& growSize) const {
if (growSize>max_segments_*segment_length_) {
printf ("Exceeding maxSize: %lu %lu\n", growSize, max_segments_*segment_length_);
return;
}
+
if(team_member.team_rank()==0) {
bool too_small = growSize > segment_length_ * nsegments_();
- while(too_small && Kokkos::atomic_compare_exchange(&realloc_lock(),0,1) ) {
- too_small = growSize > segment_length_ * nsegments_();
- }
- if(too_small) {
- while(too_small) {
- const size_t alloc_size = segment_length_*m_offset_map.N1*m_offset_map.N2*m_offset_map.N3*
- m_offset_map.N4*m_offset_map.N5*m_offset_map.N6*m_offset_map.N7;
- typename traits::non_const_value_type* const ptr = new typename traits::non_const_value_type[alloc_size];
-
- segments_(nsegments_()) =
- t_dev(ptr,segment_length_,m_offset_map.N1,m_offset_map.N2,m_offset_map.N3,m_offset_map.N4,m_offset_map.N5,m_offset_map.N6,m_offset_map.N7);
- nsegments_()++;
- too_small = growSize > segment_length_ * nsegments_();
+ if (too_small) {
+ while(Kokkos::atomic_compare_exchange(&realloc_lock(),0,1) )
+ ; // get the lock
+ too_small = growSize > segment_length_ * nsegments_(); // Recheck once we have the lock
+ if(too_small) {
+ while(too_small) {
+ const size_t alloc_size = segment_length_*m_offset_map.N1*m_offset_map.N2*m_offset_map.N3*
+ m_offset_map.N4*m_offset_map.N5*m_offset_map.N6*m_offset_map.N7;
+ typename traits::non_const_value_type* const ptr = new typename traits::non_const_value_type[alloc_size];
+
+ segments_(nsegments_()) =
+ t_dev(ptr,segment_length_,m_offset_map.N1,m_offset_map.N2,m_offset_map.N3,m_offset_map.N4,m_offset_map.N5,m_offset_map.N6,m_offset_map.N7);
+ nsegments_()++;
+ too_small = growSize > segment_length_ * nsegments_();
+ }
}
- realloc_lock() = 0;
+ realloc_lock() = 0; //release the lock
}
}
team_member.team_barrier();
}
KOKKOS_INLINE_FUNCTION
void grow_non_thread_safe (const size_t& growSize) const {
if (growSize>max_segments_*segment_length_) {
printf ("Exceeding maxSize: %lu %lu\n", growSize, max_segments_*segment_length_);
return;
}
bool too_small = growSize > segment_length_ * nsegments_();
if(too_small) {
while(too_small) {
const size_t alloc_size = segment_length_*m_offset_map.N1*m_offset_map.N2*m_offset_map.N3*
m_offset_map.N4*m_offset_map.N5*m_offset_map.N6*m_offset_map.N7;
typename traits::non_const_value_type* const ptr =
new typename traits::non_const_value_type[alloc_size];
segments_(nsegments_()) =
t_dev (ptr, segment_length_, m_offset_map.N1, m_offset_map.N2,
m_offset_map.N3, m_offset_map.N4, m_offset_map.N5,
m_offset_map.N6, m_offset_map.N7);
nsegments_()++;
too_small = growSize > segment_length_ * nsegments_();
}
}
}
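// (editor's note) The element access operators below split the leading index into a
// segment id (i0 >> segment_length_log2) and an offset within that segment
// (i0 & segment_length_m1_), which is why the segment length must be a power of two.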
template< typename iType0 >
KOKKOS_FORCEINLINE_FUNCTION
- typename Impl::ViewEnableArrayOper< typename traits::value_type & , traits, typename traits::array_layout, 1, iType0 >::type
+ typename std::enable_if<( std::is_integral<iType0>::value && traits::rank == 1 )
+ , typename traits::value_type &
+ >::type
operator() ( const iType0 & i0 ) const
{
return segments_[i0>>segment_length_log2](i0&(segment_length_m1_));
}
template< typename iType0 , typename iType1 >
KOKKOS_FORCEINLINE_FUNCTION
- typename Impl::ViewEnableArrayOper< typename traits::value_type & , traits, typename traits::array_layout, 2,
- iType0 , iType1>::type
+ typename std::enable_if<( std::is_integral<iType0>::value &&
+ std::is_integral<iType1>::value &&
+ traits::rank == 2 )
+ , typename traits::value_type &
+ >::type
operator() ( const iType0 & i0 , const iType1 & i1 ) const
{
return segments_[i0>>segment_length_log2](i0&(segment_length_m1_),i1);
}
template< typename iType0 , typename iType1 , typename iType2 >
KOKKOS_FORCEINLINE_FUNCTION
- typename Impl::ViewEnableArrayOper< typename traits::value_type & , traits, typename traits::array_layout, 3,
- iType0 , iType1 , iType2 >::type
+ typename std::enable_if<( std::is_integral<iType0>::value &&
+ std::is_integral<iType1>::value &&
+ std::is_integral<iType2>::value &&
+ traits::rank == 3 )
+ , typename traits::value_type &
+ >::type
operator() ( const iType0 & i0 , const iType1 & i1 , const iType2 & i2 ) const
{
return segments_[i0>>segment_length_log2](i0&(segment_length_m1_),i1,i2);
}
template< typename iType0 , typename iType1 , typename iType2 , typename iType3 >
KOKKOS_FORCEINLINE_FUNCTION
- typename Impl::ViewEnableArrayOper< typename traits::value_type & , traits, typename traits::array_layout, 4,
- iType0 , iType1 , iType2 , iType3 >::type
+ typename std::enable_if<( std::is_integral<iType0>::value &&
+ std::is_integral<iType1>::value &&
+ std::is_integral<iType2>::value &&
+ std::is_integral<iType3>::value &&
+ traits::rank == 4 )
+ , typename traits::value_type &
+ >::type
operator() ( const iType0 & i0 , const iType1 & i1 , const iType2 & i2 , const iType3 & i3 ) const
{
return segments_[i0>>segment_length_log2](i0&(segment_length_m1_),i1,i2,i3);
}
template< typename iType0 , typename iType1 , typename iType2 , typename iType3 ,
typename iType4 >
KOKKOS_FORCEINLINE_FUNCTION
- typename Impl::ViewEnableArrayOper< typename traits::value_type & , traits, typename traits::array_layout, 5,
- iType0 , iType1 , iType2 , iType3 , iType4 >::type
+ typename std::enable_if<( std::is_integral<iType0>::value &&
+ std::is_integral<iType1>::value &&
+ std::is_integral<iType2>::value &&
+ std::is_integral<iType3>::value &&
+ std::is_integral<iType4>::value &&
+ traits::rank == 5 )
+ , typename traits::value_type &
+ >::type
operator() ( const iType0 & i0 , const iType1 & i1 , const iType2 & i2 , const iType3 & i3 ,
const iType4 & i4 ) const
{
return segments_[i0>>segment_length_log2](i0&(segment_length_m1_),i1,i2,i3,i4);
}
template< typename iType0 , typename iType1 , typename iType2 , typename iType3 ,
typename iType4 , typename iType5 >
KOKKOS_FORCEINLINE_FUNCTION
- typename Impl::ViewEnableArrayOper< typename traits::value_type & , traits, typename traits::array_layout, 6,
- iType0 , iType1 , iType2 , iType3 , iType4 , iType5>::type
+ typename std::enable_if<( std::is_integral<iType0>::value &&
+ std::is_integral<iType1>::value &&
+ std::is_integral<iType2>::value &&
+ std::is_integral<iType3>::value &&
+ std::is_integral<iType4>::value &&
+ std::is_integral<iType5>::value &&
+ traits::rank == 6 )
+ , typename traits::value_type &
+ >::type
operator() ( const iType0 & i0 , const iType1 & i1 , const iType2 & i2 , const iType3 & i3 ,
const iType4 & i4 , const iType5 & i5 ) const
{
return segments_[i0>>segment_length_log2](i0&(segment_length_m1_),i1,i2,i3,i4,i5);
}
template< typename iType0 , typename iType1 , typename iType2 , typename iType3 ,
typename iType4 , typename iType5 , typename iType6 >
KOKKOS_FORCEINLINE_FUNCTION
- typename Impl::ViewEnableArrayOper< typename traits::value_type & , traits, typename traits::array_layout, 7,
- iType0 , iType1 , iType2 , iType3 , iType4 , iType5 , iType6>::type
+ typename std::enable_if<( std::is_integral<iType0>::value &&
+ std::is_integral<iType1>::value &&
+ std::is_integral<iType2>::value &&
+ std::is_integral<iType3>::value &&
+ std::is_integral<iType4>::value &&
+ std::is_integral<iType5>::value &&
+ std::is_integral<iType6>::value &&
+ traits::rank == 7 )
+ , typename traits::value_type &
+ >::type
operator() ( const iType0 & i0 , const iType1 & i1 , const iType2 & i2 , const iType3 & i3 ,
const iType4 & i4 , const iType5 & i5 , const iType6 & i6 ) const
{
return segments_[i0>>segment_length_log2](i0&(segment_length_m1_),i1,i2,i3,i4,i5,i6);
}
template< typename iType0 , typename iType1 , typename iType2 , typename iType3 ,
typename iType4 , typename iType5 , typename iType6 , typename iType7 >
KOKKOS_FORCEINLINE_FUNCTION
- typename Impl::ViewEnableArrayOper< typename traits::value_type & , traits, typename traits::array_layout, 8,
- iType0 , iType1 , iType2 , iType3 , iType4 , iType5 , iType6 , iType7>::type
+ typename std::enable_if<( std::is_integral<iType0>::value &&
+ std::is_integral<iType1>::value &&
+ std::is_integral<iType2>::value &&
+ std::is_integral<iType3>::value &&
+ std::is_integral<iType4>::value &&
+ std::is_integral<iType5>::value &&
+ std::is_integral<iType6>::value &&
+ std::is_integral<iType7>::value &&
+ traits::rank == 8 )
+ , typename traits::value_type &
+ >::type
operator() ( const iType0 & i0 , const iType1 & i1 , const iType2 & i2 , const iType3 & i3 ,
const iType4 & i4 , const iType5 & i5 , const iType6 & i6 , const iType7 & i7 ) const
{
return segments_[i0>>segment_length_log2](i0&(segment_length_m1_),i1,i2,i3,i4,i5,i6,i7);
}
};
namespace Impl {
template<class DataType, class Arg1Type, class Arg2Type, class Arg3Type>
struct delete_segmented_view {
typedef SegmentedView<DataType , Arg1Type , Arg2Type, Arg3Type> view_type;
- typedef typename view_type::execution_space device_type;
+ typedef typename view_type::execution_space execution_space;
view_type view_;
delete_segmented_view(view_type view):view_(view) {
}
KOKKOS_INLINE_FUNCTION
void operator() (int i) const {
delete [] view_.get_segment(i).ptr_on_device();
}
};
}
}
+}
+
+#endif
#endif
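// (editor's note) A minimal usage sketch, not part of the patch, mirroring the example in
// the constructor doc comment above; the label "Name" and element type double are arbitrary.
//
//   // at most (10000000+32767)/32768 = 306 segments, each of shape (32768, 8, 4)
//   Kokkos::Experimental::SegmentedView<double***, Kokkos::DefaultExecutionSpace>
//     sv("Name", 32768, 10000000, 8, 4);
//   // inside a team kernel, rank 0 grows the storage and the whole team then synchronizes:
//   //   sv.grow(team_member, needed_size);   // or sv.grow_non_thread_safe(needed_size)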
diff --git a/lib/kokkos/containers/src/Kokkos_StaticCrsGraph.hpp b/lib/kokkos/containers/src/Kokkos_StaticCrsGraph.hpp
index 44bd1da32..1ce38638a 100755
--- a/lib/kokkos/containers/src/Kokkos_StaticCrsGraph.hpp
+++ b/lib/kokkos/containers/src/Kokkos_StaticCrsGraph.hpp
@@ -1,225 +1,226 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos
-// Manycore Performance-Portable Multidimensional Arrays
-//
-// Copyright (2012) Sandia Corporation
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_STATICCRSGRAPH_HPP
#define KOKKOS_STATICCRSGRAPH_HPP
#include <string>
#include <vector>
#include <Kokkos_Core.hpp>
namespace Kokkos {
/// \class StaticCrsGraph
/// \brief Compressed row storage array.
///
/// \tparam DataType The type of stored entries. If a StaticCrsGraph is
/// used as the graph of a sparse matrix, then this is usually an
/// integer type, the type of the column indices in the sparse
/// matrix.
///
/// \tparam Arg1Type The second template parameter, corresponding
/// either to the Device type (if there are no more template
/// parameters) or to the Layout type (if there is at least one more
/// template parameter).
///
/// \tparam Arg2Type The third template parameter, which if provided
/// corresponds to the Device type.
///
/// \tparam SizeType The type of row offsets. Usually the default
/// parameter suffices. However, setting a nondefault value is
/// necessary in some cases, for example, if you want to have a
/// sparse matrices with dimensions (and therefore column indices)
/// that fit in \c int, but want to store more than <tt>INT_MAX</tt>
/// entries in the sparse matrix.
///
/// A row has a range of entries:
/// <ul>
/// <li> <tt> row_map[i0] <= entry < row_map[i0+1] </tt> </li>
/// <li> <tt> 0 <= i1 < row_map[i0+1] - row_map[i0] </tt> </li>
/// <li> <tt> entries( entry , i2 , i3 , ... ); </tt> </li>
/// <li> <tt> entries( row_map[i0] + i1 , i2 , i3 , ... ); </tt> </li>
/// </ul>
template< class DataType,
class Arg1Type,
class Arg2Type = void,
typename SizeType = typename ViewTraits<DataType*, Arg1Type, Arg2Type, void >::size_type>
class StaticCrsGraph {
private:
typedef ViewTraits<DataType*, Arg1Type, Arg2Type, void> traits;
public:
typedef DataType data_type;
typedef typename traits::array_layout array_layout;
+ typedef typename traits::execution_space execution_space;
typedef typename traits::device_type device_type;
typedef SizeType size_type;
typedef StaticCrsGraph< DataType , Arg1Type , Arg2Type , SizeType > staticcrsgraph_type;
typedef StaticCrsGraph< DataType , array_layout , typename traits::host_mirror_space , SizeType > HostMirror;
typedef View< const size_type* , array_layout, device_type > row_map_type;
typedef View< DataType* , array_layout, device_type > entries_type;
entries_type entries;
row_map_type row_map;
//! Construct an empty view.
StaticCrsGraph () : entries(), row_map() {}
//! Copy constructor (shallow copy).
StaticCrsGraph (const StaticCrsGraph& rhs) : entries (rhs.entries), row_map (rhs.row_map)
{}
template<class EntriesType, class RowMapType>
StaticCrsGraph (const EntriesType& entries_,const RowMapType& row_map_) : entries (entries_), row_map (row_map_)
{}
/** \brief Assign to a view of the rhs array.
* If the old view is the last view
* then allocated memory is deallocated.
*/
StaticCrsGraph& operator= (const StaticCrsGraph& rhs) {
entries = rhs.entries;
row_map = rhs.row_map;
return *this;
}
/** \brief Destroy this view of the array.
* If the last view then allocated memory is deallocated.
*/
~StaticCrsGraph() {}
- size_t numRows() const {
- return row_map.dimension_0()>0?row_map.dimension_0()-1:0;
+ KOKKOS_INLINE_FUNCTION
+ size_type numRows() const {
+ return (row_map.dimension_0 () != 0) ?
+ row_map.dimension_0 () - static_cast<size_type> (1) :
+ static_cast<size_type> (0);
}
-
};
//----------------------------------------------------------------------------
template< class StaticCrsGraphType , class InputSizeType >
typename StaticCrsGraphType::staticcrsgraph_type
create_staticcrsgraph( const std::string & label ,
const std::vector< InputSizeType > & input );
template< class StaticCrsGraphType , class InputSizeType >
typename StaticCrsGraphType::staticcrsgraph_type
create_staticcrsgraph( const std::string & label ,
const std::vector< std::vector< InputSizeType > > & input );
//----------------------------------------------------------------------------
template< class DataType ,
class Arg1Type ,
class Arg2Type ,
typename SizeType >
typename StaticCrsGraph< DataType , Arg1Type , Arg2Type , SizeType >::HostMirror
create_mirror_view( const StaticCrsGraph<DataType,Arg1Type,Arg2Type,SizeType > & input );
template< class DataType ,
class Arg1Type ,
class Arg2Type ,
typename SizeType >
typename StaticCrsGraph< DataType , Arg1Type , Arg2Type , SizeType >::HostMirror
create_mirror( const StaticCrsGraph<DataType,Arg1Type,Arg2Type,SizeType > & input );
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
#include <impl/Kokkos_StaticCrsGraph_factory.hpp>
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
template< class GraphType >
struct StaticCrsGraphMaximumEntry {
- typedef typename GraphType::device_type device_type ;
+ typedef typename GraphType::execution_space execution_space ;
typedef typename GraphType::data_type value_type ;
const typename GraphType::entries_type entries ;
StaticCrsGraphMaximumEntry( const GraphType & graph ) : entries( graph.entries ) {}
KOKKOS_INLINE_FUNCTION
void operator()( const unsigned i , value_type & update ) const
{ if ( update < entries(i) ) update = entries(i); }
KOKKOS_INLINE_FUNCTION
void init( value_type & update ) const
{ update = 0 ; }
KOKKOS_INLINE_FUNCTION
void join( volatile value_type & update ,
volatile const value_type & input ) const
{ if ( update < input ) update = input ; }
};
}
template< class DataType, class Arg1Type, class Arg2Type, typename SizeType >
DataType maximum_entry( const StaticCrsGraph< DataType , Arg1Type , Arg2Type , SizeType > & graph )
{
typedef StaticCrsGraph<DataType,Arg1Type,Arg2Type,SizeType> GraphType ;
typedef Impl::StaticCrsGraphMaximumEntry< GraphType > FunctorType ;
DataType result = 0 ;
Kokkos::parallel_reduce( graph.entries.dimension_0(),
FunctorType(graph), result );
return result ;
}
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
#endif /* #ifndef KOKKOS_CRSARRAY_HPP */
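// (editor's note) A small sketch, not part of the patch, of the row/entry indexing spelled
// out in the StaticCrsGraph doc comment above; the graph instance 'g' and loop body are hypothetical.
//
//   typedef Kokkos::StaticCrsGraph<int, Kokkos::DefaultExecutionSpace> graph_type;
//   // entries of row i0 occupy the half-open range [ row_map(i0), row_map(i0+1) ):
//   //   for (graph_type::size_type e = g.row_map(i0); e < g.row_map(i0+1); ++e)
//   //     visit( g.entries(e) );
//   // equivalently, entry i1 within row i0 is g.entries( g.row_map(i0) + i1 ).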
diff --git a/lib/kokkos/containers/src/Kokkos_UnorderedMap.hpp b/lib/kokkos/containers/src/Kokkos_UnorderedMap.hpp
index ccf25f53d..7a916c6ef 100755
--- a/lib/kokkos/containers/src/Kokkos_UnorderedMap.hpp
+++ b/lib/kokkos/containers/src/Kokkos_UnorderedMap.hpp
@@ -1,848 +1,848 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
/// \file Kokkos_UnorderedMap.hpp
/// \brief Declaration and definition of Kokkos::UnorderedMap.
///
/// This header file declares and defines Kokkos::UnorderedMap and its
/// related nonmember functions.
#ifndef KOKKOS_UNORDERED_MAP_HPP
#define KOKKOS_UNORDERED_MAP_HPP
#include <Kokkos_Core.hpp>
#include <Kokkos_Functional.hpp>
#include <Kokkos_Bitset.hpp>
#include <impl/Kokkos_Traits.hpp>
#include <impl/Kokkos_UnorderedMap_impl.hpp>
#include <iostream>
#include <stdint.h>
#include <stdexcept>
namespace Kokkos {
enum { UnorderedMapInvalidIndex = ~0u };
/// \brief First element of the return value of UnorderedMap::insert().
///
/// Inserting an element into an UnorderedMap is not guaranteed to
/// succeed. There are three possible conditions:
/// <ol>
/// <li> <tt>INSERT_FAILED</tt>: The insert failed. This usually
/// means that the UnorderedMap ran out of space. </li>
/// <li> <tt>INSERT_SUCCESS</tt>: The insert succeeded, and the key
/// did <i>not</i> exist in the table before. </li>
/// <li> <tt>INSERT_EXISTING</tt>: The insert succeeded, and the key
/// <i>did</i> exist in the table before. The new value was
/// ignored and the old value was left in place. </li>
/// </ol>
class UnorderedMapInsertResult
{
private:
enum Status{
SUCCESS = 1u << 31
, EXISTING = 1u << 30
, FREED_EXISTING = 1u << 29
, LIST_LENGTH_MASK = ~(SUCCESS | EXISTING | FREED_EXISTING)
};
public:
/// Did the map successfully insert the key/value pair
KOKKOS_FORCEINLINE_FUNCTION
bool success() const { return (m_status & SUCCESS); }
/// Was the key already present in the map
KOKKOS_FORCEINLINE_FUNCTION
bool existing() const { return (m_status & EXISTING); }
/// Did the map fail to insert the key due to insufficient capacity
KOKKOS_FORCEINLINE_FUNCTION
bool failed() const { return m_index == UnorderedMapInvalidIndex; }
/// Did the map lose a race to insert a duplicate key/value pair
/// where an index was claimed that needed to be released
KOKKOS_FORCEINLINE_FUNCTION
bool freed_existing() const { return (m_status & FREED_EXISTING); }
/// How many iterations through the insert loop did it take before the
/// map returned
KOKKOS_FORCEINLINE_FUNCTION
uint32_t list_position() const { return (m_status & LIST_LENGTH_MASK); }
/// Index where the key can be found as long as the insert did not fail
KOKKOS_FORCEINLINE_FUNCTION
uint32_t index() const { return m_index; }
KOKKOS_FORCEINLINE_FUNCTION
UnorderedMapInsertResult()
: m_index(UnorderedMapInvalidIndex)
, m_status(0)
{}
KOKKOS_FORCEINLINE_FUNCTION
void increment_list_position()
{
m_status += (list_position() < LIST_LENGTH_MASK) ? 1u : 0u;
}
KOKKOS_FORCEINLINE_FUNCTION
void set_existing(uint32_t i, bool arg_freed_existing)
{
m_index = i;
m_status = EXISTING | (arg_freed_existing ? FREED_EXISTING : 0u) | list_position();
}
KOKKOS_FORCEINLINE_FUNCTION
void set_success(uint32_t i)
{
m_index = i;
m_status = SUCCESS | list_position();
}
private:
uint32_t m_index;
uint32_t m_status;
};
/// \class UnorderedMap
/// \brief Thread-safe, performance-portable lookup table.
///
/// This class provides a lookup table. In terms of functionality,
/// this class compares to std::unordered_map (new in C++11).
/// "Unordered" means that keys are not stored in any particular
/// order, unlike (for example) std::map. "Thread-safe" means that
/// lookups, insertion, and deletion are safe to call by multiple
/// threads in parallel. "Performance-portable" means that parallel
/// performance of these operations is reasonable, on multiple
/// hardware platforms. Platforms on which performance has been
/// tested include conventional Intel x86 multicore processors, Intel
/// Xeon Phi ("MIC"), and NVIDIA GPUs.
///
/// Parallel performance portability entails design decisions that
/// might differ from one's expectation for a sequential interface.
/// This particularly affects insertion of single elements. In an
/// interface intended for sequential use, insertion might reallocate
/// memory if the original allocation did not suffice to hold the new
/// element. In this class, insertion does <i>not</i> reallocate
/// memory. This means that it might fail. insert() returns an enum
/// which indicates whether the insert failed. There are three
/// possible conditions:
/// <ol>
/// <li> <tt>INSERT_FAILED</tt>: The insert failed. This usually
/// means that the UnorderedMap ran out of space. </li>
/// <li> <tt>INSERT_SUCCESS</tt>: The insert succeeded, and the key
/// did <i>not</i> exist in the table before. </li>
/// <li> <tt>INSERT_EXISTING</tt>: The insert succeeded, and the key
/// <i>did</i> exist in the table before. The new value was
/// ignored and the old value was left in place. </li>
/// </ol>
///
/// \tparam Key Type of keys of the lookup table. If \c const, users
/// are not allowed to add or remove keys, though they are allowed
/// to change values. In that case, the implementation may make
/// optimizations specific to the <tt>Device</tt>. For example, if
/// <tt>Device</tt> is \c Cuda, it may use texture fetches to access
/// keys.
///
/// \tparam Value Type of values stored in the lookup table. You may use
/// \c void here, in which case the table will be a set of keys. If
/// \c const, users are not allowed to change entries.
/// In that case, the implementation may make
/// optimizations specific to the \c Device, such as using texture
/// fetches to access values.
///
/// \tparam Device The Kokkos Device type.
///
/// \tparam Hasher Definition of the hash function for instances of
/// <tt>Key</tt>. The default will calculate a bitwise hash.
///
/// \tparam EqualTo Definition of the equality function for instances of
/// <tt>Key</tt>. The default will do a bitwise equality comparison.
///
template < typename Key
, typename Value
, typename Device = Kokkos::DefaultExecutionSpace
, typename Hasher = pod_hash<typename Impl::remove_const<Key>::type>
, typename EqualTo = pod_equal_to<typename Impl::remove_const<Key>::type>
>
class UnorderedMap
{
private:
typedef typename ViewTraits<Key,Device,void,void>::host_mirror_space host_mirror_space ;
public:
//! \name Public types and constants
//@{
//key_types
typedef Key declared_key_type;
typedef typename Impl::remove_const<declared_key_type>::type key_type;
typedef typename Impl::add_const<key_type>::type const_key_type;
//value_types
typedef Value declared_value_type;
typedef typename Impl::remove_const<declared_value_type>::type value_type;
typedef typename Impl::add_const<value_type>::type const_value_type;
- typedef Device device_type;
+ typedef Device execution_space;
typedef Hasher hasher_type;
typedef EqualTo equal_to_type;
typedef uint32_t size_type;
//map_types
- typedef UnorderedMap<declared_key_type,declared_value_type,device_type,hasher_type,equal_to_type> declared_map_type;
- typedef UnorderedMap<key_type,value_type,device_type,hasher_type,equal_to_type> insertable_map_type;
- typedef UnorderedMap<const_key_type,value_type,device_type,hasher_type,equal_to_type> modifiable_map_type;
- typedef UnorderedMap<const_key_type,const_value_type,device_type,hasher_type,equal_to_type> const_map_type;
+ typedef UnorderedMap<declared_key_type,declared_value_type,execution_space,hasher_type,equal_to_type> declared_map_type;
+ typedef UnorderedMap<key_type,value_type,execution_space,hasher_type,equal_to_type> insertable_map_type;
+ typedef UnorderedMap<const_key_type,value_type,execution_space,hasher_type,equal_to_type> modifiable_map_type;
+ typedef UnorderedMap<const_key_type,const_value_type,execution_space,hasher_type,equal_to_type> const_map_type;
static const bool is_set = Impl::is_same<void,value_type>::value;
static const bool has_const_key = Impl::is_same<const_key_type,declared_key_type>::value;
static const bool has_const_value = is_set || Impl::is_same<const_value_type,declared_value_type>::value;
static const bool is_insertable_map = !has_const_key && (is_set || !has_const_value);
static const bool is_modifiable_map = has_const_key && !has_const_value;
static const bool is_const_map = has_const_key && has_const_value;
typedef UnorderedMapInsertResult insert_result;
typedef UnorderedMap<Key,Value,host_mirror_space,Hasher,EqualTo> HostMirror;
typedef Impl::UnorderedMapHistogram<const_map_type> histogram_type;
//@}
private:
enum { invalid_index = ~static_cast<size_type>(0) };
typedef typename Impl::if_c< is_set, int, declared_value_type>::type impl_value_type;
typedef typename Impl::if_c< is_insertable_map
- , View< key_type *, device_type>
- , View< const key_type *, device_type, MemoryTraits<RandomAccess> >
+ , View< key_type *, execution_space>
+ , View< const key_type *, execution_space, MemoryTraits<RandomAccess> >
>::type key_type_view;
typedef typename Impl::if_c< is_insertable_map || is_modifiable_map
- , View< impl_value_type *, device_type>
- , View< const impl_value_type *, device_type, MemoryTraits<RandomAccess> >
+ , View< impl_value_type *, execution_space>
+ , View< const impl_value_type *, execution_space, MemoryTraits<RandomAccess> >
>::type value_type_view;
typedef typename Impl::if_c< is_insertable_map
- , View< size_type *, device_type>
- , View< const size_type *, device_type, MemoryTraits<RandomAccess> >
+ , View< size_type *, execution_space>
+ , View< const size_type *, execution_space, MemoryTraits<RandomAccess> >
>::type size_type_view;
typedef typename Impl::if_c< is_insertable_map
- , Bitset< device_type >
- , ConstBitset< device_type>
+ , Bitset< execution_space >
+ , ConstBitset< execution_space>
>::type bitset_type;
enum { modified_idx = 0, erasable_idx = 1, failed_insert_idx = 2 };
enum { num_scalars = 3 };
- typedef View< int[num_scalars], LayoutLeft, device_type> scalars_view;
+ typedef View< int[num_scalars], LayoutLeft, execution_space> scalars_view;
public:
//! \name Public member functions
//@{
UnorderedMap()
: m_bounded_insert()
, m_hasher()
, m_equal_to()
, m_size()
, m_available_indexes()
, m_hash_lists()
, m_next_index()
, m_keys()
, m_values()
, m_scalars()
{}
/// \brief Constructor
///
/// \param capacity_hint [in] Initial guess of how many unique keys will be inserted into the map
/// \param hash [in] Hasher function for \c Key instances. The
/// default value usually suffices.
UnorderedMap( size_type capacity_hint, hasher_type hasher = hasher_type(), equal_to_type equal_to = equal_to_type() )
: m_bounded_insert(true)
, m_hasher(hasher)
, m_equal_to(equal_to)
, m_size()
, m_available_indexes(calculate_capacity(capacity_hint))
, m_hash_lists(ViewAllocateWithoutInitializing("UnorderedMap hash list"), Impl::find_hash_size(capacity()))
, m_next_index(ViewAllocateWithoutInitializing("UnorderedMap next index"), capacity()+1) // +1 so that the *_at functions can always return a valid reference
, m_keys("UnorderedMap keys",capacity()+1)
, m_values("UnorderedMap values",(is_set? 1 : capacity()+1))
, m_scalars("UnorderedMap scalars")
{
if (!is_insertable_map) {
throw std::runtime_error("Cannot construct a non-insertable (i.e. const key_type) unordered_map");
}
Kokkos::deep_copy(m_hash_lists, invalid_index);
Kokkos::deep_copy(m_next_index, invalid_index);
}
void reset_failed_insert_flag()
{
reset_flag(failed_insert_idx);
}
histogram_type get_histogram()
{
return histogram_type(*this);
}
//! Clear all entries in the table.
void clear()
{
m_bounded_insert = true;
if (capacity() == 0) return;
m_available_indexes.clear();
Kokkos::deep_copy(m_hash_lists, invalid_index);
Kokkos::deep_copy(m_next_index, invalid_index);
{
const key_type tmp = key_type();
Kokkos::deep_copy(m_keys,tmp);
}
if (is_set){
const impl_value_type tmp = impl_value_type();
Kokkos::deep_copy(m_values,tmp);
}
{
Kokkos::deep_copy(m_scalars, 0);
}
}
/// \brief Change the capacity of the map
///
/// If there are no failed inserts the current size of the map will
/// be used as a lower bound for the input capacity.
/// If the map is not empty and does not have failed inserts
/// and the capacity changes then the current data is copied
/// into the resized / rehashed map.
///
/// This is <i>not</i> a device function; it may <i>not</i> be
/// called in a parallel kernel.
bool rehash(size_type requested_capacity = 0)
{
const bool bounded_insert = (capacity() == 0) || (size() == 0u);
return rehash(requested_capacity, bounded_insert );
}
bool rehash(size_type requested_capacity, bool bounded_insert)
{
if(!is_insertable_map) return false;
const size_type curr_size = size();
requested_capacity = (requested_capacity < curr_size) ? curr_size : requested_capacity;
insertable_map_type tmp(requested_capacity, m_hasher, m_equal_to);
if (curr_size) {
tmp.m_bounded_insert = false;
Impl::UnorderedMapRehash<insertable_map_type> f(tmp,*this);
f.apply();
}
tmp.m_bounded_insert = bounded_insert;
*this = tmp;
return true;
}
/// \brief The number of entries in the table.
///
/// This method has undefined behavior when erasable() is true.
///
/// Note that this is not a device function; it cannot be called in
/// a parallel kernel. The value is not stored as a variable; it
/// must be computed.
size_type size() const
{
if( capacity() == 0u ) return 0u;
if (modified()) {
m_size = m_available_indexes.count();
reset_flag(modified_idx);
}
return m_size;
}
/// \brief The current number of failed insert() calls.
///
/// This is <i>not</i> a device function; it may <i>not</i> be
/// called in a parallel kernel. The value is not stored as a
/// variable; it must be computed.
bool failed_insert() const
{
return get_flag(failed_insert_idx);
}
bool erasable() const
{
return is_insertable_map ? get_flag(erasable_idx) : false;
}
bool begin_erase()
{
bool result = !erasable();
if (is_insertable_map && result) {
- device_type::fence();
+ execution_space::fence();
set_flag(erasable_idx);
- device_type::fence();
+ execution_space::fence();
}
return result;
}
bool end_erase()
{
bool result = erasable();
if (is_insertable_map && result) {
- device_type::fence();
+ execution_space::fence();
Impl::UnorderedMapErase<declared_map_type> f(*this);
f.apply();
- device_type::fence();
+ execution_space::fence();
reset_flag(erasable_idx);
}
return result;
}
/// \brief The maximum number of entries that the table can hold.
///
/// This <i>is</i> a device function; it may be called in a parallel
/// kernel.
KOKKOS_FORCEINLINE_FUNCTION
size_type capacity() const
{ return m_available_indexes.size(); }
/// \brief The number of hash table "buckets."
///
/// This is different than the number of entries that the table can
/// hold. Each key hashes to an index in [0, hash_capacity() - 1].
/// That index can hold zero or more entries. This class decides
/// what hash_capacity() should be, given the user's upper bound on
/// the number of entries the table must be able to hold.
///
/// This <i>is</i> a device function; it may be called in a parallel
/// kernel.
KOKKOS_INLINE_FUNCTION
size_type hash_capacity() const
- { return m_hash_lists.size(); }
+ { return m_hash_lists.dimension_0(); }
//---------------------------------------------------------------------------
//---------------------------------------------------------------------------
/// This <i>is</i> a device function; it may be called in a parallel
/// kernel. As discussed in the class documentation, it need not
/// succeed. The return value tells you if it did.
///
/// \param k [in] The key to attempt to insert.
/// \param v [in] The corresponding value to attempt to insert. If
/// using this class as a set (with Value = void), then you need not
/// provide this value.
KOKKOS_INLINE_FUNCTION
insert_result insert(key_type const& k, impl_value_type const&v = impl_value_type()) const
{
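// (editor's note) Overview of the algorithm below: each hash bucket in m_hash_lists heads
// a singly linked list threaded through m_next_index; a free slot is claimed from the
// m_available_indexes bitset, the key/value are written, and the slot is appended to the
// bucket's list with an atomic compare-exchange, retrying until it succeeds or fails.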
insert_result result;
if ( !is_insertable_map || capacity() == 0u || m_scalars((int)erasable_idx) ) {
return result;
}
if ( !m_scalars((int)modified_idx) ) {
m_scalars((int)modified_idx) = true;
}
int volatile & failed_insert_ref = m_scalars((int)failed_insert_idx) ;
const size_type hash_value = m_hasher(k);
- const size_type hash_list = hash_value % m_hash_lists.size();
+ const size_type hash_list = hash_value % m_hash_lists.dimension_0();
size_type * curr_ptr = & m_hash_lists[ hash_list ];
size_type new_index = invalid_index ;
// Force integer multiply to long
- size_type index_hint = static_cast<size_type>( (static_cast<double>(hash_list) * capacity()) / m_hash_lists.size());
+ size_type index_hint = static_cast<size_type>( (static_cast<double>(hash_list) * capacity()) / m_hash_lists.dimension_0());
size_type find_attempts = 0;
enum { bounded_find_attempts = 32u };
const size_type max_attempts = (m_bounded_insert && (bounded_find_attempts < m_available_indexes.max_hint()) ) ?
bounded_find_attempts :
m_available_indexes.max_hint();
bool not_done = true ;
#if defined( __MIC__ )
#pragma noprefetch
#endif
while ( not_done ) {
// Continue searching the unordered list for this key,
// list will only be appended during insert phase.
// Need volatile_load as other threads may be appending.
size_type curr = volatile_load(curr_ptr);
KOKKOS_NONTEMPORAL_PREFETCH_LOAD(&m_keys[curr != invalid_index ? curr : 0]);
#if defined( __MIC__ )
#pragma noprefetch
#endif
while ( curr != invalid_index && ! m_equal_to( volatile_load(&m_keys[curr]), k) ) {
result.increment_list_position();
index_hint = curr;
curr_ptr = &m_next_index[curr];
curr = volatile_load(curr_ptr);
KOKKOS_NONTEMPORAL_PREFETCH_LOAD(&m_keys[curr != invalid_index ? curr : 0]);
}
//------------------------------------------------------------
// If key already present then return that index.
if ( curr != invalid_index ) {
const bool free_existing = new_index != invalid_index;
if ( free_existing ) {
// Previously claimed an unused entry that was not inserted.
// Release this unused entry immediately.
if (!m_available_indexes.reset(new_index) ) {
printf("Unable to free existing\n");
}
}
result.set_existing(curr, free_existing);
not_done = false ;
}
//------------------------------------------------------------
// Key is not currently in the map.
// If the thread has claimed an entry try to insert now.
else {
//------------------------------------------------------------
// If have not already claimed an unused entry then do so now.
if (new_index == invalid_index) {
bool found = false;
// use the hash_list as the flag for the search direction
Kokkos::tie(found, index_hint) = m_available_indexes.find_any_unset_near( index_hint, hash_list );
// found an index and this thread set it
if ( !found && ++find_attempts >= max_attempts ) {
failed_insert_ref = true;
not_done = false ;
}
else if (m_available_indexes.set(index_hint) ) {
new_index = index_hint;
// Set key and value
KOKKOS_NONTEMPORAL_PREFETCH_STORE(&m_keys[new_index]);
m_keys[new_index] = k ;
if (!is_set) {
KOKKOS_NONTEMPORAL_PREFETCH_STORE(&m_values[new_index]);
m_values[new_index] = v ;
}
// Do not proceed until key and value are updated in global memory
memory_fence();
}
}
else if (failed_insert_ref) {
not_done = false;
}
// Attempt to append claimed entry into the list.
// Another thread may also be trying to append the same list so protect with atomic.
if ( new_index != invalid_index &&
curr == atomic_compare_exchange(curr_ptr, static_cast<size_type>(invalid_index), new_index) ) {
// Succeeded in appending
result.set_success(new_index);
not_done = false ;
}
}
} // while ( not_done )
return result ;
}
KOKKOS_INLINE_FUNCTION
bool erase(key_type const& k) const
{
bool result = false;
if(is_insertable_map && 0u < capacity() && m_scalars((int)erasable_idx)) {
if ( ! m_scalars((int)modified_idx) ) {
m_scalars((int)modified_idx) = true;
}
size_type index = find(k);
if (valid_at(index)) {
m_available_indexes.reset(index);
result = true;
}
}
return result;
}
/// \brief Find the given key \c k, if it exists in the table.
///
/// \return If the key exists in the table, the index of the
/// value corresponding to that key; otherwise, an invalid index.
///
/// This <i>is</i> a device function; it may be called in a parallel
/// kernel.
KOKKOS_INLINE_FUNCTION
size_type find( const key_type & k) const
{
- size_type curr = 0u < capacity() ? m_hash_lists( m_hasher(k) % m_hash_lists.size() ) : invalid_index ;
+ size_type curr = 0u < capacity() ? m_hash_lists( m_hasher(k) % m_hash_lists.dimension_0() ) : invalid_index ;
KOKKOS_NONTEMPORAL_PREFETCH_LOAD(&m_keys[curr != invalid_index ? curr : 0]);
while (curr != invalid_index && !m_equal_to( m_keys[curr], k) ) {
KOKKOS_NONTEMPORAL_PREFETCH_LOAD(&m_keys[curr != invalid_index ? curr : 0]);
curr = m_next_index[curr];
}
return curr;
}
/// \brief Does the key exist in the map
///
/// This <i>is</i> a device function; it may be called in a parallel
/// kernel.
KOKKOS_INLINE_FUNCTION
bool exists( const key_type & k) const
{
return valid_at(find(k));
}
/// \brief Get the value with \c i as its direct index.
///
/// \param i [in] Index directly into the array of entries.
///
/// This <i>is</i> a device function; it may be called in a parallel
/// kernel.
///
/// 'const value_type' via Cuda texture fetch must return by value.
KOKKOS_FORCEINLINE_FUNCTION
typename Impl::if_c< (is_set || has_const_value), impl_value_type, impl_value_type &>::type
value_at(size_type i) const
{
return m_values[ is_set ? 0 : (i < capacity() ? i : capacity()) ];
}
/// \brief Get the key with \c i as its direct index.
///
/// \param i [in] Index directly into the array of entries.
///
/// This <i>is</i> a device function; it may be called in a parallel
/// kernel.
KOKKOS_FORCEINLINE_FUNCTION
key_type key_at(size_type i) const
{
return m_keys[ i < capacity() ? i : capacity() ];
}
KOKKOS_FORCEINLINE_FUNCTION
bool valid_at(size_type i) const
{
return m_available_indexes.test(i);
}
template <typename SKey, typename SValue>
UnorderedMap( UnorderedMap<SKey,SValue,Device,Hasher,EqualTo> const& src,
typename Impl::enable_if< Impl::UnorderedMapCanAssign<declared_key_type,declared_value_type,SKey,SValue>::value,int>::type = 0
)
: m_bounded_insert(src.m_bounded_insert)
, m_hasher(src.m_hasher)
, m_equal_to(src.m_equal_to)
, m_size(src.m_size)
, m_available_indexes(src.m_available_indexes)
, m_hash_lists(src.m_hash_lists)
, m_next_index(src.m_next_index)
, m_keys(src.m_keys)
, m_values(src.m_values)
, m_scalars(src.m_scalars)
{}
template <typename SKey, typename SValue>
typename Impl::enable_if< Impl::UnorderedMapCanAssign<declared_key_type,declared_value_type,SKey,SValue>::value
,declared_map_type & >::type
operator=( UnorderedMap<SKey,SValue,Device,Hasher,EqualTo> const& src)
{
m_bounded_insert = src.m_bounded_insert;
m_hasher = src.m_hasher;
m_equal_to = src.m_equal_to;
m_size = src.m_size;
m_available_indexes = src.m_available_indexes;
m_hash_lists = src.m_hash_lists;
m_next_index = src.m_next_index;
m_keys = src.m_keys;
m_values = src.m_values;
m_scalars = src.m_scalars;
return *this;
}
template <typename SKey, typename SValue, typename SDevice>
typename Impl::enable_if< Impl::is_same< typename Impl::remove_const<SKey>::type, key_type>::value &&
Impl::is_same< typename Impl::remove_const<SValue>::type, value_type>::value
>::type
create_copy_view( UnorderedMap<SKey, SValue, SDevice, Hasher,EqualTo> const& src)
{
if (m_hash_lists.ptr_on_device() != src.m_hash_lists.ptr_on_device()) {
insertable_map_type tmp;
tmp.m_bounded_insert = src.m_bounded_insert;
tmp.m_hasher = src.m_hasher;
tmp.m_equal_to = src.m_equal_to;
tmp.m_size = src.size();
tmp.m_available_indexes = bitset_type( src.capacity() );
- tmp.m_hash_lists = size_type_view( ViewAllocateWithoutInitializing("UnorderedMap hash list"), src.m_hash_lists.size() );
- tmp.m_next_index = size_type_view( ViewAllocateWithoutInitializing("UnorderedMap next index"), src.m_next_index.size() );
- tmp.m_keys = key_type_view( ViewAllocateWithoutInitializing("UnorderedMap keys"), src.m_keys.size() );
- tmp.m_values = value_type_view( ViewAllocateWithoutInitializing("UnorderedMap values"), src.m_values.size() );
+ tmp.m_hash_lists = size_type_view( ViewAllocateWithoutInitializing("UnorderedMap hash list"), src.m_hash_lists.dimension_0() );
+ tmp.m_next_index = size_type_view( ViewAllocateWithoutInitializing("UnorderedMap next index"), src.m_next_index.dimension_0() );
+ tmp.m_keys = key_type_view( ViewAllocateWithoutInitializing("UnorderedMap keys"), src.m_keys.dimension_0() );
+ tmp.m_values = value_type_view( ViewAllocateWithoutInitializing("UnorderedMap values"), src.m_values.dimension_0() );
tmp.m_scalars = scalars_view("UnorderedMap scalars");
Kokkos::deep_copy(tmp.m_available_indexes, src.m_available_indexes);
- typedef Kokkos::Impl::DeepCopy< typename device_type::memory_space, typename SDevice::memory_space > raw_deep_copy;
+ typedef Kokkos::Impl::DeepCopy< typename execution_space::memory_space, typename SDevice::memory_space > raw_deep_copy;
- raw_deep_copy(tmp.m_hash_lists.ptr_on_device(), src.m_hash_lists.ptr_on_device(), sizeof(size_type)*src.m_hash_lists.size());
- raw_deep_copy(tmp.m_next_index.ptr_on_device(), src.m_next_index.ptr_on_device(), sizeof(size_type)*src.m_next_index.size());
- raw_deep_copy(tmp.m_keys.ptr_on_device(), src.m_keys.ptr_on_device(), sizeof(key_type)*src.m_keys.size());
+ raw_deep_copy(tmp.m_hash_lists.ptr_on_device(), src.m_hash_lists.ptr_on_device(), sizeof(size_type)*src.m_hash_lists.dimension_0());
+ raw_deep_copy(tmp.m_next_index.ptr_on_device(), src.m_next_index.ptr_on_device(), sizeof(size_type)*src.m_next_index.dimension_0());
+ raw_deep_copy(tmp.m_keys.ptr_on_device(), src.m_keys.ptr_on_device(), sizeof(key_type)*src.m_keys.dimension_0());
if (!is_set) {
- raw_deep_copy(tmp.m_values.ptr_on_device(), src.m_values.ptr_on_device(), sizeof(impl_value_type)*src.m_values.size());
+ raw_deep_copy(tmp.m_values.ptr_on_device(), src.m_values.ptr_on_device(), sizeof(impl_value_type)*src.m_values.dimension_0());
}
raw_deep_copy(tmp.m_scalars.ptr_on_device(), src.m_scalars.ptr_on_device(), sizeof(int)*num_scalars );
*this = tmp;
}
}
//@}
private: // private member functions
bool modified() const
{
return get_flag(modified_idx);
}
void set_flag(int flag) const
{
- typedef Kokkos::Impl::DeepCopy< typename device_type::memory_space, Kokkos::HostSpace > raw_deep_copy;
+ typedef Kokkos::Impl::DeepCopy< typename execution_space::memory_space, Kokkos::HostSpace > raw_deep_copy;
const int true_ = true;
raw_deep_copy(m_scalars.ptr_on_device() + flag, &true_, sizeof(int));
}
void reset_flag(int flag) const
{
- typedef Kokkos::Impl::DeepCopy< typename device_type::memory_space, Kokkos::HostSpace > raw_deep_copy;
+ typedef Kokkos::Impl::DeepCopy< typename execution_space::memory_space, Kokkos::HostSpace > raw_deep_copy;
const int false_ = false;
raw_deep_copy(m_scalars.ptr_on_device() + flag, &false_, sizeof(int));
}
bool get_flag(int flag) const
{
- typedef Kokkos::Impl::DeepCopy< Kokkos::HostSpace, typename device_type::memory_space > raw_deep_copy;
+ typedef Kokkos::Impl::DeepCopy< Kokkos::HostSpace, typename execution_space::memory_space > raw_deep_copy;
int result = false;
raw_deep_copy(&result, m_scalars.ptr_on_device() + flag, sizeof(int));
return result;
}
static uint32_t calculate_capacity(uint32_t capacity_hint)
{
// increase by 16% and round to nearest multiple of 128
return capacity_hint ? ((static_cast<uint32_t>(7ull*capacity_hint/6u) + 127u)/128u)*128u : 128u;
}
private: // private members
bool m_bounded_insert;
hasher_type m_hasher;
equal_to_type m_equal_to;
mutable size_type m_size;
bitset_type m_available_indexes;
size_type_view m_hash_lists;
size_type_view m_next_index;
key_type_view m_keys;
value_type_view m_values;
scalars_view m_scalars;
template <typename KKey, typename VValue, typename DDevice, typename HHash, typename EEqualTo>
friend class UnorderedMap;
template <typename UMap>
friend struct Impl::UnorderedMapErase;
template <typename UMap>
friend struct Impl::UnorderedMapHistogram;
template <typename UMap>
friend struct Impl::UnorderedMapPrint;
};
// Specialization of deep_copy for two UnorderedMap objects.
template < typename DKey, typename DT, typename DDevice
, typename SKey, typename ST, typename SDevice
, typename Hasher, typename EqualTo >
inline void deep_copy( UnorderedMap<DKey, DT, DDevice, Hasher, EqualTo> & dst
, const UnorderedMap<SKey, ST, SDevice, Hasher, EqualTo> & src )
{
dst.create_copy_view(src);
}
} // namespace Kokkos
#endif //KOKKOS_UNORDERED_MAP_HPP
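The deep_copy specialization above simply forwards to create_copy_view, so copying an UnorderedMap between memory spaces follows the same pattern as copying Views. A minimal sketch, assuming a host map and a device map with identical key/value types (the int/double types, the sizes, and the serial host fill are hypothetical, for illustration only):

#include <Kokkos_Core.hpp>
#include <Kokkos_UnorderedMap.hpp>

void copy_map_to_device()
{
  // hypothetical key/value types and sizes
  typedef Kokkos::UnorderedMap<int, double, Kokkos::DefaultHostExecutionSpace> host_map_type;
  typedef Kokkos::UnorderedMap<int, double, Kokkos::DefaultExecutionSpace>     device_map_type;

  host_map_type hmap(1000);            // capacity hint, rounded up internally
  for (int i = 0; i < 100; ++i)
    hmap.insert(i, 2.0 * i);           // serial fill on the host

  device_map_type dmap(hmap.capacity());
  Kokkos::deep_copy(dmap, hmap);       // dispatches to dmap.create_copy_view(hmap)
}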
diff --git a/lib/kokkos/containers/src/Kokkos_Vector.hpp b/lib/kokkos/containers/src/Kokkos_Vector.hpp
index ded6512e6..db54b0c35 100755
--- a/lib/kokkos/containers/src/Kokkos_Vector.hpp
+++ b/lib/kokkos/containers/src/Kokkos_Vector.hpp
@@ -1,282 +1,287 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos
-// Manycore Performance-Portable Multidimensional Arrays
-//
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_VECTOR_HPP
#define KOKKOS_VECTOR_HPP
#include <Kokkos_Core_fwd.hpp>
#include <Kokkos_DualView.hpp>
/* Drop-in replacement for std::vector based on Kokkos::DualView
 * Most functions only work on the host (they will not compile if called from a device kernel)
*
*/
namespace Kokkos {
-template <typename Scalar, class Device = Kokkos::DefaultExecutionSpace >
-class vector : public DualView<Scalar*,LayoutLeft,Device> {
+template <typename Scalar, class Space = Kokkos::DefaultExecutionSpace >
+class vector : public DualView<Scalar*,LayoutLeft,Space> {
public:
- typedef Device device_type;
+ typedef typename Space::memory_space memory_space;
+ typedef typename Space::execution_space execution_space;
+ typedef typename Kokkos::Device<execution_space,memory_space> device_type;
+
typedef Scalar value_type;
typedef Scalar* pointer;
typedef const Scalar* const_pointer;
typedef Scalar* reference;
typedef const Scalar* const_reference;
typedef Scalar* iterator;
typedef const Scalar* const_iterator;
private:
size_t _size;
typedef size_t size_type;
float _extra_storage;
- typedef DualView<Scalar*,LayoutLeft,Device> DV;
+ typedef DualView<Scalar*,LayoutLeft,Space> DV;
public:
+#ifdef KOKKOS_CUDA_USE_UVM
+ KOKKOS_INLINE_FUNCTION Scalar& operator() (int i) const {return DV::h_view(i);};
+ KOKKOS_INLINE_FUNCTION Scalar& operator[] (int i) const {return DV::h_view(i);};
+#else
inline Scalar& operator() (int i) const {return DV::h_view(i);};
inline Scalar& operator[] (int i) const {return DV::h_view(i);};
-
+#endif
/* Member functions which behave like std::vector functions */
vector():DV() {
_size = 0;
_extra_storage = 1.1;
- DV::modified_host = 1;
+ DV::modified_host() = 1;
};
- vector(int n, Scalar val=Scalar()):DualView<Scalar*,LayoutLeft,Device>("Vector",size_t(n*(1.1))) {
+ vector(int n, Scalar val=Scalar()):DualView<Scalar*,LayoutLeft,Space>("Vector",size_t(n*(1.1))) {
_size = n;
_extra_storage = 1.1;
- DV::modified_host = 1;
+ DV::modified_host() = 1;
assign(n,val);
}
void resize(size_t n) {
if(n>=capacity())
DV::resize(size_t (n*_extra_storage));
_size = n;
}
void resize(size_t n, const Scalar& val) {
assign(n,val);
}
void assign (size_t n, const Scalar& val) {
/* Resize if necessary (behaviour of std::vector) */
if(n>capacity())
DV::resize(size_t (n*_extra_storage));
_size = n;
/* Assign value either on host or on device */
- if( DV::modified_host >= DV::modified_device ) {
+ if( DV::modified_host() >= DV::modified_device() ) {
set_functor_host f(DV::h_view,val);
parallel_for(n,f);
- DV::t_host::device_type::fence();
- DV::modified_host++;
+ DV::t_host::execution_space::fence();
+ DV::modified_host()++;
} else {
set_functor f(DV::d_view,val);
parallel_for(n,f);
- DV::t_dev::device_type::fence();
- DV::modified_device++;
+ DV::t_dev::execution_space::fence();
+ DV::modified_device()++;
}
}
void reserve(size_t n) {
DV::resize(size_t (n*_extra_storage));
}
void push_back(Scalar val) {
- DV::modified_host++;
+ DV::modified_host()++;
if(_size == capacity()) {
size_t new_size = _size*_extra_storage;
if(new_size == _size) new_size++;
DV::resize(new_size);
}
DV::h_view(_size) = val;
_size++;
};
void pop_back() {
_size--;
};
void clear() {
_size = 0;
}
size_type size() const {return _size;};
size_type max_size() const {return 2000000000;}
size_type capacity() const {return DV::capacity();};
bool empty() const {return _size==0;};
iterator begin() const {return &DV::h_view(0);};
iterator end() const {return &DV::h_view(_size);};
/* std::algorithms which originally work with iterators; here they are implemented as member functions */
size_t
lower_bound (const size_t& start,
const size_t& theEnd,
const Scalar& comp_val) const
{
int lower = start; // FIXME (mfh 24 Apr 2014) narrowing conversion
int upper = _size > theEnd? theEnd : _size-1; // FIXME (mfh 24 Apr 2014) narrowing conversion
if (upper <= lower) {
return theEnd;
}
Scalar lower_val = DV::h_view(lower);
Scalar upper_val = DV::h_view(upper);
size_t idx = (upper+lower)/2;
Scalar val = DV::h_view(idx);
if(val>upper_val) return upper;
if(val<lower_val) return start;
while(upper>lower) {
if(comp_val>val) {
lower = ++idx;
} else {
upper = idx;
}
idx = (upper+lower)/2;
val = DV::h_view(idx);
}
return idx;
}
bool is_sorted() {
for(int i=0;i<_size-1;i++) {
if(DV::h_view(i)>DV::h_view(i+1)) return false;
}
return true;
}
iterator find(Scalar val) const {
if(_size == 0) return end();
int upper,lower,current;
current = _size/2;
upper = _size-1;
lower = 0;
if((val<DV::h_view(0)) || (val>DV::h_view(_size-1)) ) return end();
while(upper>lower)
{
if(val>DV::h_view(current)) lower = current+1;
else upper = current;
current = (upper+lower)/2;
}
if(val==DV::h_view(current)) return &DV::h_view(current);
else return end();
}
/* Additional functions for data management */
void device_to_host(){
deep_copy(DV::h_view,DV::d_view);
}
void host_to_device() const {
deep_copy(DV::d_view,DV::h_view);
}
void on_host() {
- DV::modified_host = DV::modified_device + 1;
+ DV::modified_host() = DV::modified_device() + 1;
}
void on_device() {
- DV::modified_device = DV::modified_host + 1;
+ DV::modified_device() = DV::modified_host() + 1;
}
void set_overallocation(float extra) {
_extra_storage = 1.0 + extra;
}
public:
struct set_functor {
- typedef typename DV::t_dev::device_type device_type;
+ typedef typename DV::t_dev::execution_space execution_space;
typename DV::t_dev _data;
Scalar _val;
set_functor(typename DV::t_dev data, Scalar val) :
_data(data),_val(val) {}
KOKKOS_INLINE_FUNCTION
void operator() (const int &i) const {
_data(i) = _val;
}
};
struct set_functor_host {
- typedef typename DV::t_host::device_type device_type;
+ typedef typename DV::t_host::execution_space execution_space;
typename DV::t_host _data;
Scalar _val;
set_functor_host(typename DV::t_host data, Scalar val) :
_data(data),_val(val) {}
KOKKOS_INLINE_FUNCTION
void operator() (const int &i) const {
_data(i) = _val;
}
};
};
}
#endif
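A minimal host-side sketch of the std::vector-like interface above, together with the explicit host/device synchronization helpers (assuming a host-accessible default execution space; the element count and values are illustrative):

#include <Kokkos_Core.hpp>
#include <Kokkos_Vector.hpp>

void vector_example()
{
  Kokkos::vector<double> v(10, 1.5);   // 10 elements, all initialized to 1.5
  v.push_back(2.5);                    // grows by the over-allocation factor if needed

  double sum = 0.0;
  for (size_t i = 0; i < v.size(); ++i)
    sum += v[i];                       // operator[] reads the host view

  v.host_to_device();                  // deep_copy(d_view, h_view)
  v.on_device();                       // mark the device copy as the most recent
  (void)sum;
}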
diff --git a/lib/kokkos/containers/src/impl/Kokkos_Bitset_impl.hpp b/lib/kokkos/containers/src/impl/Kokkos_Bitset_impl.hpp
index dde5bffdf..7de290e71 100755
--- a/lib/kokkos/containers/src/impl/Kokkos_Bitset_impl.hpp
+++ b/lib/kokkos/containers/src/impl/Kokkos_Bitset_impl.hpp
@@ -1,173 +1,173 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_BITSET_IMPL_HPP
#define KOKKOS_BITSET_IMPL_HPP
#include <Kokkos_Macros.hpp>
#include <stdint.h>
#include <cstdio>
#include <climits>
#include <iostream>
#include <iomanip>
namespace Kokkos { namespace Impl {
KOKKOS_FORCEINLINE_FUNCTION
unsigned rotate_right(unsigned i, int r)
{
enum { size = static_cast<int>(sizeof(unsigned)*CHAR_BIT) };
return r ? ((i >> r) | (i << (size-r))) : i ;
}
KOKKOS_FORCEINLINE_FUNCTION
int bit_scan_forward(unsigned i)
{
#if defined( __CUDA_ARCH__ )
return __ffs(i) - 1;
#elif defined( __GNUC__ ) || defined( __GNUG__ )
return __builtin_ffs(i) - 1;
#elif defined( __INTEL_COMPILER )
return _bit_scan_forward(i);
#else
unsigned t = 1u;
int r = 0;
while (i && (i & t == 0))
{
t = t << 1;
++r;
}
return r;
#endif
}
KOKKOS_FORCEINLINE_FUNCTION
int bit_scan_reverse(unsigned i)
{
enum { shift = static_cast<int>(sizeof(unsigned)*CHAR_BIT - 1) };
#if defined( __CUDA_ARCH__ )
return shift - __clz(i);
#elif defined( __GNUC__ ) || defined( __GNUG__ )
return shift - __builtin_clz(i);
#elif defined( __INTEL_COMPILER )
return _bit_scan_reverse(i);
#else
unsigned t = 1u << shift;
int r = 0;
while (i && (i & t == 0))
{
t = t >> 1;
++r;
}
return r;
#endif
}
// count the bits set
KOKKOS_FORCEINLINE_FUNCTION
int popcount(unsigned i)
{
#if defined( __CUDA_ARCH__ )
return __popc(i);
#elif defined( __GNUC__ ) || defined( __GNUG__ )
return __builtin_popcount(i);
#elif defined ( __INTEL_COMPILER )
return _popcnt32(i);
#else
// http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive
i = i - ((i >> 1) & ~0u/3u); // temp
i = (i & ~0u/15u*3u) + ((i >> 2) & ~0u/15u*3u); // temp
i = (i + (i >> 4)) & ~0u/255u*15u; // temp
return (int)((i * (~0u/255u)) >> (sizeof(unsigned) - 1) * CHAR_BIT); // count
#endif
}
template <typename Bitset>
struct BitsetCount
{
typedef Bitset bitset_type;
- typedef typename bitset_type::device_type::execution_space device_type;
+ typedef typename bitset_type::execution_space::execution_space execution_space;
typedef typename bitset_type::size_type size_type;
typedef size_type value_type;
bitset_type m_bitset;
BitsetCount( bitset_type const& bitset)
: m_bitset(bitset)
{}
size_type apply() const
{
size_type count = 0u;
- parallel_reduce(m_bitset.m_blocks.size(), *this, count);
+ parallel_reduce(m_bitset.m_blocks.dimension_0(), *this, count);
return count;
}
KOKKOS_INLINE_FUNCTION
static void init( value_type & count)
{
count = 0u;
}
KOKKOS_INLINE_FUNCTION
static void join( volatile value_type & count, const volatile size_type & incr )
{
count += incr;
}
KOKKOS_INLINE_FUNCTION
void operator()( size_type i, value_type & count) const
{
count += popcount(m_bitset.m_blocks[i]);
}
};
}} //Kokkos::Impl
#endif // KOKKOS_BITSET_IMPL_HPP
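These Impl-level helpers back Kokkos::Bitset, but their semantics are easy to verify on the host. A minimal sketch of what they compute (ordinarily the header is pulled in indirectly through Kokkos_Bitset.hpp rather than included directly):

#include <cassert>
#include <impl/Kokkos_Bitset_impl.hpp>

void bit_helper_examples()
{
  using Kokkos::Impl::bit_scan_forward;
  using Kokkos::Impl::bit_scan_reverse;
  using Kokkos::Impl::popcount;
  using Kokkos::Impl::rotate_right;

  assert( bit_scan_forward(0x8u) == 3 );          // index of the lowest set bit of 0b1000
  assert( bit_scan_reverse(0x8u) == 3 );          // index of the highest set bit of 0b1000
  assert( popcount(0xF0F0u) == 8 );               // eight bits set
  assert( rotate_right(1u, 1) == 0x80000000u );   // rotation wraps around the 32-bit word
}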
diff --git a/lib/kokkos/containers/src/impl/Kokkos_Functional_impl.hpp b/lib/kokkos/containers/src/impl/Kokkos_Functional_impl.hpp
index 647024f48..c87bb8a3a 100755
--- a/lib/kokkos/containers/src/impl/Kokkos_Functional_impl.hpp
+++ b/lib/kokkos/containers/src/impl/Kokkos_Functional_impl.hpp
@@ -1,154 +1,195 @@
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+
#ifndef KOKKOS_FUNCTIONAL_IMPL_HPP
#define KOKKOS_FUNCTIONAL_IMPL_HPP
#include <Kokkos_Macros.hpp>
#include <stdint.h>
namespace Kokkos { namespace Impl {
// MurmurHash3 was written by Austin Appleby, and is placed in the public
// domain. The author hereby disclaims copyright to this source code.
KOKKOS_FORCEINLINE_FUNCTION
uint32_t getblock32 ( const uint8_t * p, int i )
{
// used to avoid aliasing error which could cause errors with
// forced inlining
return ((uint32_t)p[i*4+0])
| ((uint32_t)p[i*4+1] << 8)
| ((uint32_t)p[i*4+2] << 16)
| ((uint32_t)p[i*4+3] << 24);
}
KOKKOS_FORCEINLINE_FUNCTION
uint32_t rotl32 ( uint32_t x, int8_t r )
{ return (x << r) | (x >> (32 - r)); }
KOKKOS_FORCEINLINE_FUNCTION
uint32_t fmix32 ( uint32_t h )
{
h ^= h >> 16;
h *= 0x85ebca6b;
h ^= h >> 13;
h *= 0xc2b2ae35;
h ^= h >> 16;
return h;
}
KOKKOS_INLINE_FUNCTION
uint32_t MurmurHash3_x86_32 ( const void * key, int len, uint32_t seed )
{
const uint8_t * data = (const uint8_t*)key;
const int nblocks = len / 4;
uint32_t h1 = seed;
const uint32_t c1 = 0xcc9e2d51;
const uint32_t c2 = 0x1b873593;
//----------
// body
for(int i=0; i<nblocks; ++i)
{
uint32_t k1 = getblock32(data,i);
k1 *= c1;
k1 = rotl32(k1,15);
k1 *= c2;
h1 ^= k1;
h1 = rotl32(h1,13);
h1 = h1*5+0xe6546b64;
}
//----------
// tail
const uint8_t * tail = (const uint8_t*)(data + nblocks*4);
uint32_t k1 = 0;
switch(len & 3)
{
case 3: k1 ^= tail[2] << 16;
case 2: k1 ^= tail[1] << 8;
case 1: k1 ^= tail[0];
k1 *= c1; k1 = rotl32(k1,15); k1 *= c2; h1 ^= k1;
};
//----------
// finalization
h1 ^= len;
h1 = fmix32(h1);
return h1;
}
#if defined( __GNUC__ ) /* GNU C */ || \
defined( __GNUG__ ) /* GNU C++ */ || \
defined( __clang__ )
#define KOKKOS_MAY_ALIAS __attribute__((__may_alias__))
#else
#define KOKKOS_MAY_ALIAS
#endif
template <typename T>
KOKKOS_FORCEINLINE_FUNCTION
bool bitwise_equal(T const * const a_ptr, T const * const b_ptr)
{
typedef uint64_t KOKKOS_MAY_ALIAS T64;
typedef uint32_t KOKKOS_MAY_ALIAS T32;
typedef uint16_t KOKKOS_MAY_ALIAS T16;
typedef uint8_t KOKKOS_MAY_ALIAS T8;
enum {
NUM_8 = sizeof(T),
NUM_16 = NUM_8 / 2,
NUM_32 = NUM_8 / 4,
NUM_64 = NUM_8 / 8
};
union {
T const * const ptr;
T64 const * const ptr64;
T32 const * const ptr32;
T16 const * const ptr16;
T8 const * const ptr8;
} a = {a_ptr}, b = {b_ptr};
bool result = true;
for (int i=0; i < NUM_64; ++i) {
result = result && a.ptr64[i] == b.ptr64[i];
}
if ( NUM_64*2 < NUM_32 ) {
result = result && a.ptr32[NUM_64*2] == b.ptr32[NUM_64*2];
}
if ( NUM_32*2 < NUM_16 ) {
result = result && a.ptr16[NUM_32*2] == b.ptr16[NUM_32*2];
}
if ( NUM_16*2 < NUM_8 ) {
result = result && a.ptr8[NUM_16*2] == b.ptr8[NUM_16*2];
}
return result;
}
#undef KOKKOS_MAY_ALIAS
}} // namespace Kokkos::Impl
#endif //KOKKOS_FUNCTIONAL_IMPL_HPP
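A minimal sketch of how MurmurHash3_x86_32 and bitwise_equal are typically consumed for plain-old-data keys (the public pod_hash/pod_equal_to wrappers follow the same pattern; the CellKey struct below is a hypothetical example type):

#include <stdint.h>
#include <impl/Kokkos_Functional_impl.hpp>

struct CellKey { int i, j, k; };   // hypothetical POD key, safe to hash and compare byte-wise

inline uint32_t hash_cell(const CellKey& key, uint32_t seed = 0u)
{
  // Hash the raw bytes of the key.
  return Kokkos::Impl::MurmurHash3_x86_32(&key, sizeof(CellKey), seed);
}

inline bool same_cell(const CellKey& a, const CellKey& b)
{
  // Byte-wise equality, chunked into 64/32/16/8-bit pieces as above.
  return Kokkos::Impl::bitwise_equal(&a, &b);
}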
diff --git a/lib/kokkos/containers/src/impl/Kokkos_StaticCrsGraph_factory.hpp b/lib/kokkos/containers/src/impl/Kokkos_StaticCrsGraph_factory.hpp
index ddd091a45..c52fc2435 100755
--- a/lib/kokkos/containers/src/impl/Kokkos_StaticCrsGraph_factory.hpp
+++ b/lib/kokkos/containers/src/impl/Kokkos_StaticCrsGraph_factory.hpp
@@ -1,223 +1,208 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_IMPL_STATICCRSGRAPH_FACTORY_HPP
#define KOKKOS_IMPL_STATICCRSGRAPH_FACTORY_HPP
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
template< class DataType , class Arg1Type , class Arg2Type , typename SizeType >
inline
typename StaticCrsGraph< DataType , Arg1Type , Arg2Type , SizeType >::HostMirror
create_mirror_view( const StaticCrsGraph<DataType,Arg1Type,Arg2Type,SizeType > & view ,
typename Impl::enable_if< ViewTraits<DataType,Arg1Type,Arg2Type,void>::is_hostspace >::type * = 0 )
{
return view ;
}
template< class DataType , class Arg1Type , class Arg2Type , typename SizeType >
inline
typename StaticCrsGraph< DataType , Arg1Type , Arg2Type , SizeType >::HostMirror
create_mirror( const StaticCrsGraph<DataType,Arg1Type,Arg2Type,SizeType > & view )
{
// Force copy:
//typedef Impl::ViewAssignment< Impl::ViewDefault > alloc ; // unused
typedef StaticCrsGraph< DataType , Arg1Type , Arg2Type , SizeType > staticcrsgraph_type ;
typename staticcrsgraph_type::HostMirror tmp ;
typename staticcrsgraph_type::row_map_type::HostMirror tmp_row_map = create_mirror( view.row_map);
// Allocation to match:
tmp.row_map = tmp_row_map ; // Assignment of 'const' from 'non-const'
tmp.entries = create_mirror( view.entries );
// Deep copy:
deep_copy( tmp_row_map , view.row_map );
deep_copy( tmp.entries , view.entries );
return tmp ;
}
template< class DataType , class Arg1Type , class Arg2Type , typename SizeType >
inline
typename StaticCrsGraph< DataType , Arg1Type , Arg2Type , SizeType >::HostMirror
create_mirror_view( const StaticCrsGraph<DataType,Arg1Type,Arg2Type,SizeType > & view ,
typename Impl::enable_if< ! ViewTraits<DataType,Arg1Type,Arg2Type,void>::is_hostspace >::type * = 0 )
{
return create_mirror( view );
}
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
template< class StaticCrsGraphType , class InputSizeType >
inline
typename StaticCrsGraphType::staticcrsgraph_type
create_staticcrsgraph( const std::string & label ,
const std::vector< InputSizeType > & input )
{
typedef StaticCrsGraphType output_type ;
//typedef std::vector< InputSizeType > input_type ; // unused
typedef typename output_type::entries_type entries_type ;
typedef View< typename output_type::size_type [] ,
typename output_type::array_layout ,
- typename output_type::device_type > work_type ;
+ typename output_type::execution_space > work_type ;
output_type output ;
// Create the row map:
const size_t length = input.size();
{
work_type row_work( "tmp" , length + 1 );
typename work_type::HostMirror row_work_host =
create_mirror_view( row_work );
size_t sum = 0 ;
row_work_host[0] = 0 ;
for ( size_t i = 0 ; i < length ; ++i ) {
row_work_host[i+1] = sum += input[i];
}
deep_copy( row_work , row_work_host );
output.entries = entries_type( label , sum );
output.row_map = row_work ;
}
return output ;
}
//----------------------------------------------------------------------------
template< class StaticCrsGraphType , class InputSizeType >
inline
typename StaticCrsGraphType::staticcrsgraph_type
create_staticcrsgraph( const std::string & label ,
const std::vector< std::vector< InputSizeType > > & input )
{
- typedef StaticCrsGraphType output_type ;
- //typedef std::vector< std::vector< InputSizeType > > input_type ; // unused
- typedef typename output_type::entries_type entries_type ;
- //typedef typename output_type::size_type size_type ; // unused
-
- // mfh 14 Feb 2014: This function doesn't actually create instances
- // of ok_rank, but it needs to declare the typedef in order to do
- // the static "assert" (a compile-time check that the given shape
- // has rank 1). In order to avoid a "declared but unused typedef"
- // warning, we declare an empty instance of this type, with the
- // usual "(void)" marker to avoid a compiler warning for the unused
- // variable.
-
- typedef typename
- Impl::assert_shape_is_rank_one< typename entries_type::shape_type >::type
- ok_rank ;
- {
- ok_rank thing;
- (void) thing;
- }
+ typedef StaticCrsGraphType output_type ;
+ typedef typename output_type::entries_type entries_type ;
+
+ static_assert( entries_type::rank == 1
+ , "Graph entries view must be rank one" );
typedef View< typename output_type::size_type [] ,
typename output_type::array_layout ,
- typename output_type::device_type > work_type ;
+ typename output_type::execution_space > work_type ;
output_type output ;
// Create the row map:
const size_t length = input.size();
{
work_type row_work( "tmp" , length + 1 );
typename work_type::HostMirror row_work_host =
create_mirror_view( row_work );
size_t sum = 0 ;
row_work_host[0] = 0 ;
for ( size_t i = 0 ; i < length ; ++i ) {
row_work_host[i+1] = sum += input[i].size();
}
deep_copy( row_work , row_work_host );
output.entries = entries_type( label , sum );
output.row_map = row_work ;
}
// Fill in the entries:
{
typename entries_type::HostMirror host_entries =
create_mirror_view( output.entries );
size_t sum = 0 ;
for ( size_t i = 0 ; i < length ; ++i ) {
for ( size_t j = 0 ; j < input[i].size() ; ++j , ++sum ) {
host_entries( sum ) = input[i][j] ;
}
}
deep_copy( output.entries , host_entries );
}
return output ;
}
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
#endif /* #ifndef KOKKOS_IMPL_CRSARRAY_FACTORY_HPP */
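A minimal sketch of the ragged-input overload above, building a small four-row graph (the row contents are hypothetical; the template arguments assume the default-execution-space form of StaticCrsGraph):

#include <vector>
#include <Kokkos_Core.hpp>
#include <Kokkos_StaticCrsGraph.hpp>

void build_graph_example()
{
  typedef Kokkos::StaticCrsGraph<int, Kokkos::DefaultExecutionSpace> graph_type;

  // hypothetical connectivity: four rows with 1, 2, 2, 1 entries
  std::vector< std::vector<int> > rows(4);
  rows[0].push_back(1);                          // row 0 -> {1}
  rows[1].push_back(0); rows[1].push_back(2);    // row 1 -> {0, 2}
  rows[2].push_back(1); rows[2].push_back(3);    // row 2 -> {1, 3}
  rows[3].push_back(2);                          // row 3 -> {2}

  graph_type graph =
    Kokkos::create_staticcrsgraph<graph_type>("example_graph", rows);

  // graph.row_map has length rows.size()+1; graph.entries holds the flattened columns.
  (void)graph;
}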
diff --git a/lib/kokkos/containers/src/impl/Kokkos_UnorderedMap_impl.cpp b/lib/kokkos/containers/src/impl/Kokkos_UnorderedMap_impl.cpp
index 150d3d893..843fd3a80 100755
--- a/lib/kokkos/containers/src/impl/Kokkos_UnorderedMap_impl.cpp
+++ b/lib/kokkos/containers/src/impl/Kokkos_UnorderedMap_impl.cpp
@@ -1,101 +1,101 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
#include <Kokkos_UnorderedMap.hpp>
namespace Kokkos { namespace Impl {
uint32_t find_hash_size(uint32_t size)
{
if (size == 0u) return 0u;
// these primes try to preserve randomness of hash
static const uint32_t primes [] = {
3, 7, 13, 23, 53, 97, 193, 389, 769, 1543
, 2237, 2423, 2617, 2797, 2999, 3167, 3359, 3539
, 3727, 3911, 4441 , 4787 , 5119 , 5471 , 5801 , 6143 , 6521 , 6827
, 7177 , 7517 , 7853 , 8887 , 9587 , 10243 , 10937 , 11617 , 12289
, 12967 , 13649 , 14341 , 15013 , 15727
, 17749 , 19121 , 20479 , 21859 , 23209 , 24593 , 25939 , 27329
, 28669 , 30047 , 31469 , 35507 , 38231 , 40961 , 43711 , 46439
, 49157 , 51893 , 54617 , 57347 , 60077 , 62801 , 70583 , 75619
, 80669 , 85703 , 90749 , 95783 , 100823 , 105871 , 110909 , 115963
, 120997 , 126031 , 141157 , 151237 , 161323 , 171401 , 181499 , 191579
, 201653 , 211741 , 221813 , 231893 , 241979 , 252079
, 282311 , 302483 , 322649 , 342803 , 362969 , 383143 , 403301 , 423457
, 443629 , 463787 , 483953 , 504121 , 564617 , 604949 , 645313 , 685609
, 725939 , 766273 , 806609 , 846931 , 887261 , 927587 , 967919 , 1008239
, 1123477 , 1198397 , 1273289 , 1348177 , 1423067 , 1497983 , 1572869
, 1647761 , 1722667 , 1797581 , 1872461 , 1947359 , 2022253
, 2246953 , 2396759 , 2546543 , 2696363 , 2846161 , 2995973 , 3145739
, 3295541 , 3445357 , 3595117 , 3744941 , 3894707 , 4044503
, 4493921 , 4793501 , 5093089 , 5392679 , 5692279 , 5991883 , 6291469
, 6591059 , 6890641 , 7190243 , 7489829 , 7789447 , 8089033
, 8987807 , 9586981 , 10186177 , 10785371 , 11384539 , 11983729
, 12582917 , 13182109 , 13781291 , 14380469 , 14979667 , 15578861
, 16178053 , 17895707 , 19014187 , 20132683 , 21251141 , 22369661
, 23488103 , 24606583 , 25725083 , 26843549 , 27962027 , 29080529
, 30198989 , 31317469 , 32435981 , 35791397 , 38028379 , 40265327
, 42502283 , 44739259 , 46976221 , 49213237 , 51450131 , 53687099
, 55924061 , 58161041 , 60397993 , 62634959 , 64871921
, 71582857 , 76056727 , 80530643 , 85004567 , 89478503 , 93952427
, 98426347 , 102900263 , 107374217 , 111848111 , 116322053 , 120795971
, 125269877 , 129743807 , 143165587 , 152113427 , 161061283 , 170009141
, 178956983 , 187904819 , 196852693 , 205800547 , 214748383 , 223696237
, 232644089 , 241591943 , 250539763 , 259487603 , 268435399
};
const uint32_t num_primes = sizeof(primes)/sizeof(uint32_t);
uint32_t hsize = primes[num_primes-1] ;
for (uint32_t i = 0; i < num_primes; ++i) {
if (size <= primes[i]) {
hsize = primes[i];
break;
}
}
return hsize;
}
}} // namespace Kokkos::Impl
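The table above rounds a requested hash-list count up to the next tabulated prime (capped at the largest entry), which keeps the modulo-based bucket indices well distributed. A small illustration of the resulting values:

#include <stdint.h>
#include <impl/Kokkos_UnorderedMap_impl.hpp>

void hash_size_examples()
{
  const uint32_t a = Kokkos::Impl::find_hash_size(100u);    // -> 193, first tabulated prime >= 100
  const uint32_t b = Kokkos::Impl::find_hash_size(1000u);   // -> 1543, first tabulated prime >= 1000
  (void)a; (void)b;
}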
diff --git a/lib/kokkos/containers/src/impl/Kokkos_UnorderedMap_impl.hpp b/lib/kokkos/containers/src/impl/Kokkos_UnorderedMap_impl.hpp
index b5c3304fb..b788c966e 100755
--- a/lib/kokkos/containers/src/impl/Kokkos_UnorderedMap_impl.hpp
+++ b/lib/kokkos/containers/src/impl/Kokkos_UnorderedMap_impl.hpp
@@ -1,297 +1,297 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_UNORDERED_MAP_IMPL_HPP
#define KOKKOS_UNORDERED_MAP_IMPL_HPP
#include <Kokkos_Core_fwd.hpp>
#include <stdint.h>
#include <cstdio>
#include <climits>
#include <iostream>
#include <iomanip>
namespace Kokkos { namespace Impl {
uint32_t find_hash_size( uint32_t size );
template <typename Map>
struct UnorderedMapRehash
{
typedef Map map_type;
typedef typename map_type::const_map_type const_map_type;
- typedef typename map_type::device_type device_type;
+ typedef typename map_type::execution_space execution_space;
typedef typename map_type::size_type size_type;
map_type m_dst;
const_map_type m_src;
UnorderedMapRehash( map_type const& dst, const_map_type const& src)
: m_dst(dst), m_src(src)
{}
void apply() const
{
parallel_for(m_src.capacity(), *this);
}
KOKKOS_INLINE_FUNCTION
void operator()(size_type i) const
{
if ( m_src.valid_at(i) )
m_dst.insert(m_src.key_at(i), m_src.value_at(i));
}
};
template <typename UMap>
struct UnorderedMapErase
{
typedef UMap map_type;
- typedef typename map_type::device_type device_type;
+ typedef typename map_type::execution_space execution_space;
typedef typename map_type::size_type size_type;
typedef typename map_type::key_type key_type;
typedef typename map_type::impl_value_type value_type;
map_type m_map;
UnorderedMapErase( map_type const& map)
: m_map(map)
{}
void apply() const
{
- parallel_for(m_map.m_hash_lists.size(), *this);
+ parallel_for(m_map.m_hash_lists.dimension_0(), *this);
}
KOKKOS_INLINE_FUNCTION
void operator()( size_type i ) const
{
const size_type invalid_index = map_type::invalid_index;
size_type curr = m_map.m_hash_lists(i);
size_type next = invalid_index;
// remove erased head of the linked-list
while (curr != invalid_index && !m_map.valid_at(curr)) {
next = m_map.m_next_index[curr];
m_map.m_next_index[curr] = invalid_index;
m_map.m_keys[curr] = key_type();
if (m_map.is_set) m_map.m_values[curr] = value_type();
curr = next;
m_map.m_hash_lists(i) = next;
}
// if the list is non-empty and the head is valid
if (curr != invalid_index && m_map.valid_at(curr) ) {
size_type prev = curr;
curr = m_map.m_next_index[prev];
while (curr != invalid_index) {
next = m_map.m_next_index[curr];
if (m_map.valid_at(curr)) {
prev = curr;
}
else {
// remove curr from list
m_map.m_next_index[prev] = next;
m_map.m_next_index[curr] = invalid_index;
m_map.m_keys[curr] = key_type();
if (map_type::is_set) m_map.m_values[curr] = value_type();
}
curr = next;
}
}
}
};
template <typename UMap>
struct UnorderedMapHistogram
{
typedef UMap map_type;
- typedef typename map_type::device_type device_type;
+ typedef typename map_type::execution_space execution_space;
typedef typename map_type::size_type size_type;
- typedef View<int[100], device_type> histogram_view;
+ typedef View<int[100], execution_space> histogram_view;
typedef typename histogram_view::HostMirror host_histogram_view;
map_type m_map;
histogram_view m_length;
histogram_view m_distance;
histogram_view m_block_distance;
UnorderedMapHistogram( map_type const& map)
: m_map(map)
, m_length("UnorderedMap Histogram")
, m_distance("UnorderedMap Histogram")
, m_block_distance("UnorderedMap Histogram")
{}
void calculate()
{
- parallel_for(m_map.m_hash_lists.size(), *this);
+ parallel_for(m_map.m_hash_lists.dimension_0(), *this);
}
void clear()
{
Kokkos::deep_copy(m_length, 0);
Kokkos::deep_copy(m_distance, 0);
Kokkos::deep_copy(m_block_distance, 0);
}
void print_length(std::ostream &out)
{
host_histogram_view host_copy = create_mirror_view(m_length);
Kokkos::deep_copy(host_copy, m_length);
- for (int i=0, size = host_copy.size(); i<size; ++i)
+ for (int i=0, size = host_copy.dimension_0(); i<size; ++i)
{
out << host_copy[i] << " , ";
}
out << "\b\b\b " << std::endl;
}
void print_distance(std::ostream &out)
{
host_histogram_view host_copy = create_mirror_view(m_distance);
Kokkos::deep_copy(host_copy, m_distance);
- for (int i=0, size = host_copy.size(); i<size; ++i)
+ for (int i=0, size = host_copy.dimension_0(); i<size; ++i)
{
out << host_copy[i] << " , ";
}
out << "\b\b\b " << std::endl;
}
void print_block_distance(std::ostream &out)
{
host_histogram_view host_copy = create_mirror_view(m_block_distance);
Kokkos::deep_copy(host_copy, m_block_distance);
- for (int i=0, size = host_copy.size(); i<size; ++i)
+ for (int i=0, size = host_copy.dimension_0(); i<size; ++i)
{
out << host_copy[i] << " , ";
}
out << "\b\b\b " << std::endl;
}
KOKKOS_INLINE_FUNCTION
void operator()( size_type i ) const
{
const size_type invalid_index = map_type::invalid_index;
uint32_t length = 0;
size_type min_index = ~0u, max_index = 0;
for (size_type curr = m_map.m_hash_lists(i); curr != invalid_index; curr = m_map.m_next_index[curr]) {
++length;
min_index = (curr < min_index) ? curr : min_index;
max_index = (max_index < curr) ? curr : max_index;
}
size_type distance = (0u < length) ? max_index - min_index : 0u;
size_type blocks = (0u < length) ? max_index/32u - min_index/32u : 0u;
// normalize data
length = length < 100u ? length : 99u;
distance = distance < 100u ? distance : 99u;
blocks = blocks < 100u ? blocks : 99u;
if (0u < length)
{
atomic_fetch_add( &m_length(length), 1);
atomic_fetch_add( &m_distance(distance), 1);
atomic_fetch_add( &m_block_distance(blocks), 1);
}
}
};
template <typename UMap>
struct UnorderedMapPrint
{
typedef UMap map_type;
- typedef typename map_type::device_type device_type;
+ typedef typename map_type::execution_space execution_space;
typedef typename map_type::size_type size_type;
map_type m_map;
UnorderedMapPrint( map_type const& map)
: m_map(map)
{}
void apply()
{
- parallel_for(m_map.m_hash_lists.size(), *this);
+ parallel_for(m_map.m_hash_lists.dimension_0(), *this);
}
KOKKOS_INLINE_FUNCTION
void operator()( size_type i ) const
{
const size_type invalid_index = map_type::invalid_index;
uint32_t list = m_map.m_hash_lists(i);
for (size_type curr = list, ii=0; curr != invalid_index; curr = m_map.m_next_index[curr], ++ii) {
printf("%d[%d]: %d->%d\n", list, ii, m_map.key_at(curr), m_map.value_at(curr));
}
}
};
template <typename DKey, typename DValue, typename SKey, typename SValue>
struct UnorderedMapCanAssign : public false_ {};
template <typename Key, typename Value>
struct UnorderedMapCanAssign<Key,Value,Key,Value> : public true_ {};
template <typename Key, typename Value>
struct UnorderedMapCanAssign<const Key,Value,Key,Value> : public true_ {};
template <typename Key, typename Value>
struct UnorderedMapCanAssign<const Key,const Value,Key,Value> : public true_ {};
template <typename Key, typename Value>
struct UnorderedMapCanAssign<const Key,const Value,const Key,Value> : public true_ {};
}} //Kokkos::Impl
#endif // KOKKOS_UNORDERED_MAP_IMPL_HPP
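The UnorderedMapCanAssign specializations above gate UnorderedMap's assignment operator: const may be added to the destination's key and value types but never stripped from the source's. A minimal user-level sketch of the effect (assuming the map's const_map_type typedef, which is declared elsewhere in Kokkos_UnorderedMap.hpp and not shown in this hunk):

#include <Kokkos_UnorderedMap.hpp>

void can_assign_example()
{
  typedef Kokkos::UnorderedMap<int, double> map_type;        // hypothetical key/value types
  typedef map_type::const_map_type          const_map_type;

  map_type       m(128);
  const_map_type c;

  c = m;     // allowed: adds const to the key and value types
  // m = c;  // would not compile: const cannot be stripped from the source map
}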
diff --git a/lib/kokkos/containers/unit_tests/Makefile b/lib/kokkos/containers/unit_tests/Makefile
new file mode 100755
index 000000000..176bfa906
--- /dev/null
+++ b/lib/kokkos/containers/unit_tests/Makefile
@@ -0,0 +1,92 @@
+KOKKOS_PATH = ../..
+
+GTEST_PATH = ../../TPL/gtest
+
+vpath %.cpp ${KOKKOS_PATH}/containers/unit_tests
+
+default: build_all
+ echo "End Build"
+
+
+include $(KOKKOS_PATH)/Makefile.kokkos
+
+ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
+ CXX = nvcc_wrapper
+ CXXFLAGS ?= -O3
+ LINK = $(CXX)
+ LDFLAGS ?= -lpthread
+else
+ CXX ?= g++
+ CXXFLAGS ?= -O3
+ LINK ?= $(CXX)
+ LDFLAGS ?= -lpthread
+endif
+
+KOKKOS_CXXFLAGS += -I$(GTEST_PATH) -I${KOKKOS_PATH}/containers/unit_tests
+
+TEST_TARGETS =
+TARGETS =
+
+ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
+ OBJ_CUDA = TestCuda.o UnitTestMain.o gtest-all.o
+ TARGETS += KokkosContainers_UnitTest_Cuda
+ TEST_TARGETS += test-cuda
+endif
+
+ifeq ($(KOKKOS_INTERNAL_USE_PTHREADS), 1)
+ OBJ_THREADS = TestThreads.o UnitTestMain.o gtest-all.o
+ TARGETS += KokkosContainers_UnitTest_Threads
+ TEST_TARGETS += test-threads
+endif
+
+ifeq ($(KOKKOS_INTERNAL_USE_OPENMP), 1)
+ OBJ_OPENMP = TestOpenMP.o UnitTestMain.o gtest-all.o
+ TARGETS += KokkosContainers_UnitTest_OpenMP
+ TEST_TARGETS += test-openmp
+endif
+
+ifeq ($(KOKKOS_INTERNAL_USE_SERIAL), 1)
+ OBJ_SERIAL = TestSerial.o UnitTestMain.o gtest-all.o
+ TARGETS += KokkosContainers_UnitTest_Serial
+ TEST_TARGETS += test-serial
+endif
+
+KokkosContainers_UnitTest_Cuda: $(OBJ_CUDA) $(KOKKOS_LINK_DEPENDS)
+ $(LINK) $(KOKKOS_LDFLAGS) $(LDFLAGS) $(EXTRA_PATH) $(OBJ_CUDA) $(KOKKOS_LIBS) $(LIB) -o KokkosContainers_UnitTest_Cuda
+
+KokkosContainers_UnitTest_Threads: $(OBJ_THREADS) $(KOKKOS_LINK_DEPENDS)
+ $(LINK) $(KOKKOS_LDFLAGS) $(LDFLAGS) $(EXTRA_PATH) $(OBJ_THREADS) $(KOKKOS_LIBS) $(LIB) -o KokkosContainers_UnitTest_Threads
+
+KokkosContainers_UnitTest_OpenMP: $(OBJ_OPENMP) $(KOKKOS_LINK_DEPENDS)
+ $(LINK) $(KOKKOS_LDFLAGS) $(LDFLAGS) $(EXTRA_PATH) $(OBJ_OPENMP) $(KOKKOS_LIBS) $(LIB) -o KokkosContainers_UnitTest_OpenMP
+
+KokkosContainers_UnitTest_Serial: $(OBJ_SERIAL) $(KOKKOS_LINK_DEPENDS)
+ $(LINK) $(KOKKOS_LDFLAGS) $(LDFLAGS) $(EXTRA_PATH) $(OBJ_SERIAL) $(KOKKOS_LIBS) $(LIB) -o KokkosContainers_UnitTest_Serial
+
+test-cuda: KokkosContainers_UnitTest_Cuda
+ ./KokkosContainers_UnitTest_Cuda
+
+test-threads: KokkosContainers_UnitTest_Threads
+ ./KokkosContainers_UnitTest_Threads
+
+test-openmp: KokkosContainers_UnitTest_OpenMP
+ ./KokkosContainers_UnitTest_OpenMP
+
+test-serial: KokkosContainers_UnitTest_Serial
+ ./KokkosContainers_UnitTest_Serial
+
+build_all: $(TARGETS)
+
+test: $(TEST_TARGETS)
+
+clean: kokkos-clean
+ rm -f *.o $(TARGETS)
+
+# Compilation rules
+
+%.o:%.cpp $(KOKKOS_CPP_DEPENDS)
+ $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) $(EXTRA_INC) -c $<
+
+gtest-all.o:$(GTEST_PATH)/gtest/gtest-all.cc
+ $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) $(EXTRA_INC) -c $(GTEST_PATH)/gtest/gtest-all.cc
+
diff --git a/lib/kokkos/containers/unit_tests/TestBitset.hpp b/lib/kokkos/containers/unit_tests/TestBitset.hpp
new file mode 100755
index 000000000..76fb30edc
--- /dev/null
+++ b/lib/kokkos/containers/unit_tests/TestBitset.hpp
@@ -0,0 +1,285 @@
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+
+#ifndef KOKKOS_TEST_BITSET_HPP
+#define KOKKOS_TEST_BITSET_HPP
+
+#include <gtest/gtest.h>
+#include <iostream>
+
+
+namespace Test {
+
+namespace Impl {
+
+template <typename Bitset, bool Set>
+struct TestBitset
+{
+ typedef Bitset bitset_type;
+ typedef typename bitset_type::execution_space execution_space;
+ typedef uint32_t value_type;
+
+ bitset_type m_bitset;
+
+ TestBitset( bitset_type const& bitset)
+ : m_bitset(bitset)
+ {}
+
+ unsigned testit(unsigned collisions)
+ {
+ execution_space::fence();
+
+ unsigned count = 0;
+ Kokkos::parallel_reduce( m_bitset.size()*collisions, *this, count);
+ return count;
+ }
+
+
+ KOKKOS_INLINE_FUNCTION
+ void init( value_type & v ) const { v = 0; }
+
+ KOKKOS_INLINE_FUNCTION
+ void join( volatile value_type & dst, const volatile value_type & src ) const
+ { dst += src; }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()(uint32_t i, value_type & v) const
+ {
+ i = i % m_bitset.size();
+ if (Set) {
+ if (m_bitset.set(i)) {
+ if (m_bitset.test(i)) ++v;
+ }
+ }
+ else {
+ if (m_bitset.reset(i)) {
+ if (!m_bitset.test(i)) ++v;
+ }
+ }
+ }
+
+};
+
+template <typename Bitset>
+struct TestBitsetTest
+{
+ typedef Bitset bitset_type;
+ typedef typename bitset_type::execution_space execution_space;
+ typedef uint32_t value_type;
+
+ bitset_type m_bitset;
+
+ TestBitsetTest( bitset_type const& bitset)
+ : m_bitset(bitset)
+ {}
+
+ unsigned testit()
+ {
+ execution_space::fence();
+
+ unsigned count = 0;
+ Kokkos::parallel_reduce( m_bitset.size(), *this, count);
+ return count;
+ }
+
+
+ KOKKOS_INLINE_FUNCTION
+ void init( value_type & v ) const { v = 0; }
+
+ KOKKOS_INLINE_FUNCTION
+ void join( volatile value_type & dst, const volatile value_type & src ) const
+ { dst += src; }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()(uint32_t i, value_type & v) const
+ {
+ if (m_bitset.test( i )) ++v;
+ }
+};
+
+template <typename Bitset, bool Set>
+struct TestBitsetAny
+{
+ typedef Bitset bitset_type;
+ typedef typename bitset_type::execution_space execution_space;
+ typedef uint32_t value_type;
+
+ bitset_type m_bitset;
+
+ TestBitsetAny( bitset_type const& bitset)
+ : m_bitset(bitset)
+ {}
+
+ unsigned testit()
+ {
+ execution_space::fence();
+
+ unsigned count = 0;
+ Kokkos::parallel_reduce( m_bitset.size(), *this, count);
+ return count;
+ }
+
+
+ KOKKOS_INLINE_FUNCTION
+ void init( value_type & v ) const { v = 0; }
+
+ KOKKOS_INLINE_FUNCTION
+ void join( volatile value_type & dst, const volatile value_type & src ) const
+ { dst += src; }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()(uint32_t i, value_type & v) const
+ {
+ bool result = false;
+ unsigned attempts = 0;
+ uint32_t hint = (i >> 4) << 4;
+ while (attempts < m_bitset.max_hint()) {
+ if (Set) {
+ Kokkos::tie(result, hint) = m_bitset.find_any_unset_near(hint, i);
+ if (result && m_bitset.set(hint)) {
+ ++v;
+ break;
+ }
+ else if (!result) {
+ ++attempts;
+ }
+ }
+ else {
+ Kokkos::tie(result, hint) = m_bitset.find_any_set_near(hint, i);
+ if (result && m_bitset.reset(hint)) {
+ ++v;
+ break;
+ }
+ else if (!result) {
+ ++attempts;
+ }
+ }
+ }
+ }
+
+};
+} // namespace Impl
+
+
+
+template <typename Device>
+void test_bitset()
+{
+ typedef Kokkos::Bitset< Device > bitset_type;
+ typedef Kokkos::ConstBitset< Device > const_bitset_type;
+
+ //unsigned test_sizes[] = { 0u, 1000u, 1u<<14, 1u<<16, 10000001 };
+ unsigned test_sizes[] = { 1000u, 1u<<14, 1u<<16, 10000001 };
+
+ for (int i=0, end = sizeof(test_sizes)/sizeof(unsigned); i<end; ++i) {
+
+ //std::cout << "Bitset " << test_sizes[i] << std::endl;
+
+ bitset_type bitset(test_sizes[i]);
+
+    //std::cout << " Check initial count " << std::endl;
+ // nothing should be set
+ {
+ Impl::TestBitsetTest< bitset_type > f(bitset);
+ uint32_t count = f.testit();
+ EXPECT_EQ(0u, count);
+ EXPECT_EQ(count, bitset.count());
+ }
+
+ //std::cout << " Check set() " << std::endl;
+ bitset.set();
+ // everything should be set
+ {
+ Impl::TestBitsetTest< const_bitset_type > f(bitset);
+ uint32_t count = f.testit();
+ EXPECT_EQ(bitset.size(), count);
+ EXPECT_EQ(count, bitset.count());
+ }
+
+ //std::cout << " Check reset() " << std::endl;
+ bitset.reset();
+ EXPECT_EQ(0u, bitset.count());
+
+ //std::cout << " Check set(i) " << std::endl;
+ // test setting bits
+ {
+ Impl::TestBitset< bitset_type, true > f(bitset);
+ uint32_t count = f.testit(10u);
+ EXPECT_EQ( bitset.size(), bitset.count());
+ EXPECT_EQ( bitset.size(), count );
+ }
+
+ //std::cout << " Check reset(i) " << std::endl;
+ // test resetting bits
+ {
+ Impl::TestBitset< bitset_type, false > f(bitset);
+ uint32_t count = f.testit(10u);
+ EXPECT_EQ( bitset.size(), count);
+ EXPECT_EQ( 0u, bitset.count() );
+ }
+
+
+ //std::cout << " Check find_any_set(i) " << std::endl;
+ // test setting any bits
+ {
+ Impl::TestBitsetAny< bitset_type, true > f(bitset);
+ uint32_t count = f.testit();
+ EXPECT_EQ( bitset.size(), bitset.count());
+ EXPECT_EQ( bitset.size(), count );
+ }
+
+ //std::cout << " Check find_any_unset(i) " << std::endl;
+ // test resetting any bits
+ {
+ Impl::TestBitsetAny< bitset_type, false > f(bitset);
+ uint32_t count = f.testit();
+ EXPECT_EQ( bitset.size(), count);
+ EXPECT_EQ( 0u, bitset.count() );
+ }
+
+ }
+
+}
+
+} // namespace Test
+
+#endif //KOKKOS_TEST_BITSET_HPP
+
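The test above exercises Kokkos::Bitset through parallel functors; for reference, a minimal host-side sketch of the same API (assuming a host-accessible execution space so that set(i)/test(i) can be called serially; the size and bit index are illustrative):

#include <Kokkos_Core.hpp>
#include <Kokkos_Bitset.hpp>

void bitset_example()
{
  Kokkos::Bitset<Kokkos::DefaultHostExecutionSpace> marks(1024);

  marks.set();                      // set every bit
  marks.reset(7);                   // clear bit 7
  const bool hit = marks.test(7);   // false after the reset
  // marks.count() == 1023 at this point

  // ConstBitset is a read-only view of the same data.
  Kokkos::ConstBitset<Kokkos::DefaultHostExecutionSpace> frozen(marks);
  (void)hit; (void)frozen;
}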
diff --git a/lib/kokkos/containers/unit_tests/TestComplex.hpp b/lib/kokkos/containers/unit_tests/TestComplex.hpp
new file mode 100755
index 000000000..a2769fd11
--- /dev/null
+++ b/lib/kokkos/containers/unit_tests/TestComplex.hpp
@@ -0,0 +1,264 @@
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+
+
+#ifndef KOKKOS_TEST_COMPLEX_HPP
+#define KOKKOS_TEST_COMPLEX_HPP
+
+#include <Kokkos_Complex.hpp>
+#include <gtest/gtest.h>
+#include <iostream>
+
+namespace Test {
+
+namespace Impl {
+ template <typename RealType>
+ void testComplexConstructors () {
+ typedef Kokkos::complex<RealType> complex_type;
+
+ complex_type z1;
+ complex_type z2 (0.0, 0.0);
+ complex_type z3 (1.0, 0.0);
+ complex_type z4 (0.0, 1.0);
+ complex_type z5 (-1.0, -2.0);
+
+ ASSERT_TRUE( z1 == z2 );
+ ASSERT_TRUE( z1 != z3 );
+ ASSERT_TRUE( z1 != z4 );
+ ASSERT_TRUE( z1 != z5 );
+
+ ASSERT_TRUE( z2 != z3 );
+ ASSERT_TRUE( z2 != z4 );
+ ASSERT_TRUE( z2 != z5 );
+
+ ASSERT_TRUE( z3 != z4 );
+ ASSERT_TRUE( z3 != z5 );
+
+ complex_type z6 (-1.0, -2.0);
+ ASSERT_TRUE( z5 == z6 );
+
+ // Make sure that complex has value semantics, in particular, that
+ // equality tests use values and not pointers, so that
+ // reassignment actually changes the value.
+ z1 = complex_type (-3.0, -4.0);
+ ASSERT_TRUE( z1.real () == -3.0 );
+ ASSERT_TRUE( z1.imag () == -4.0 );
+ ASSERT_TRUE( z1 != z2 );
+
+ complex_type z7 (1.0);
+ ASSERT_TRUE( z3 == z7 );
+ ASSERT_TRUE( z7 == 1.0 );
+ ASSERT_TRUE( z7 != -1.0 );
+
+ z7 = complex_type (5.0);
+ ASSERT_TRUE( z7.real () == 5.0 );
+ ASSERT_TRUE( z7.imag () == 0.0 );
+ }
+
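+  // The arithmetic tests below use small values that are exactly
+  // representable in binary floating point, so direct == comparisons
+  // of the results are safe.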
+ template <typename RealType>
+ void testPlus () {
+ typedef Kokkos::complex<RealType> complex_type;
+
+ complex_type z1 (1.0, -1.0);
+ complex_type z2 (-1.0, 1.0);
+ complex_type z3 = z1 + z2;
+ ASSERT_TRUE( z3 == complex_type (0.0, 0.0) );
+ }
+
+ template <typename RealType>
+ void testMinus () {
+ typedef Kokkos::complex<RealType> complex_type;
+
+ // Test binary minus.
+ complex_type z1 (1.0, -1.0);
+ complex_type z2 (-1.0, 1.0);
+ complex_type z3 = z1 - z2;
+ ASSERT_TRUE( z3 == complex_type (2.0, -2.0) );
+
+    // Test unary minus.
+    complex_type z4 (3.0, -4.0);
+    ASSERT_TRUE( -z4 == complex_type (-3.0, 4.0) );
+ }
+
+  template <typename RealType>
+  void testTimes () {
+    typedef Kokkos::complex<RealType> complex_type;
+
+    // (1 - i) * (-1 + i) == (-1 + 1) + (1 + 1)i == 2i.
+    complex_type z1 (1.0, -1.0);
+    complex_type z2 (-1.0, 1.0);
+    complex_type z3 = z1 * z2;
+    ASSERT_TRUE( z3 == complex_type (0.0, 2.0) );
+
+    // Multiplying by the complex unit (1, 0) must leave the value unchanged.
+    complex_type z4 (3.0, -4.0);
+    ASSERT_TRUE( z4 * complex_type (1.0, 0.0) == z4 );
+    ASSERT_TRUE( complex_type (1.0, 0.0) * z4 == z4 );
+  }
+
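+  // The expected quotient below is built from the identity
+  // z3/z4 = z3*conj(z4) / |z4|^2; with z4 = 1 - i, |z4|^2 = 2, so z6 is
+  // just z5 with both components halved.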
+ template <typename RealType>
+ void testDivide () {
+ typedef Kokkos::complex<RealType> complex_type;
+
+ // Test division of a complex number by a real number.
+ complex_type z1 (1.0, -1.0);
+ complex_type z2 (1.0 / 2.0, -1.0 / 2.0);
+ ASSERT_TRUE( z1 / 2.0 == z2 );
+
+ // (-1+2i)/(1-i) == ((-1+2i)(1+i)) / ((1-i)(1+i))
+ // (-1+2i)(1+i) == -3 + i
+ complex_type z3 (-1.0, 2.0);
+ complex_type z4 (1.0, -1.0);
+ complex_type z5 (-3.0, 1.0);
+ ASSERT_TRUE(z3 * Kokkos::conj (z4) == z5 );
+
+ // Test division of a complex number by a complex number.
+ // This assumes that RealType is a floating-point type.
+ complex_type z6 (Kokkos::real (z5) / 2.0,
+ Kokkos::imag (z5) / 2.0);
+
+ complex_type z7 = z3 / z4;
+ ASSERT_TRUE( z7 == z6 );
+ }
+
+ template <typename RealType>
+ void testOutsideKernel () {
+    testComplexConstructors<RealType> ();
+    testPlus<RealType> ();
+    testMinus<RealType> ();
+    testTimes<RealType> ();
+    testDivide<RealType> ();
+ }
+
+
+ template<typename RealType, typename Device>
+ void testCreateView () {
+ typedef Kokkos::complex<RealType> complex_type;
+ Kokkos::View<complex_type*, Device> x ("x", 10);
+ ASSERT_TRUE( x.dimension_0 () == 10 );
+
+ // Test that View assignment works.
+ Kokkos::View<complex_type*, Device> x_nonconst = x;
+ Kokkos::View<const complex_type*, Device> x_const = x;
+ }
+
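+  // Functors used by testInsideKernel below: Fill writes a fixed complex
+  // value into every entry of a View, and Sum reduces all entries into a
+  // single Kokkos::complex result.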
+ template<typename RealType, typename Device>
+ class Fill {
+ public:
+ typedef typename Device::execution_space execution_space;
+
+ typedef Kokkos::View<Kokkos::complex<RealType>*, Device> view_type;
+ typedef typename view_type::size_type size_type;
+
+ KOKKOS_INLINE_FUNCTION
+ void operator () (const size_type i) const {
+ x_(i) = val_;
+ }
+
+ Fill (const view_type& x, const Kokkos::complex<RealType>& val) :
+ x_ (x), val_ (val)
+ {}
+
+ private:
+ view_type x_;
+ const Kokkos::complex<RealType> val_;
+ };
+
+ template<typename RealType, typename Device>
+ class Sum {
+ public:
+ typedef typename Device::execution_space execution_space;
+
+ typedef Kokkos::View<const Kokkos::complex<RealType>*, Device> view_type;
+ typedef typename view_type::size_type size_type;
+ typedef Kokkos::complex<RealType> value_type;
+
+ KOKKOS_INLINE_FUNCTION
+ void operator () (const size_type i, Kokkos::complex<RealType>& sum) const {
+ sum += x_(i);
+ }
+
+ Sum (const view_type& x) : x_ (x) {}
+
+ private:
+ view_type x_;
+ };
+
+ template<typename RealType, typename Device>
+ void testInsideKernel () {
+ typedef Kokkos::complex<RealType> complex_type;
+ typedef Kokkos::View<complex_type*, Device> view_type;
+ typedef typename view_type::size_type size_type;
+
+ const size_type N = 1000;
+ view_type x ("x", N);
+ ASSERT_TRUE( x.dimension_0 () == N );
+
+ // Kokkos::parallel_reduce (N, [=] (const size_type i, complex_type& result) {
+ // result += x[i];
+ // });
+
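+    // Fill every entry with (1, -1) on the device and reduce; the sum of
+    // N = 1000 such entries must come out as (1000, -1000).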
+ Kokkos::parallel_for (N, Fill<RealType, Device> (x, complex_type (1.0, -1.0)));
+
+ complex_type sum;
+ Kokkos::parallel_reduce (N, Sum<RealType, Device> (x), sum);
+
+ ASSERT_TRUE( sum.real () == 1000.0 && sum.imag () == -1000.0 );
+ }
+} // namespace Impl
+
+
+template <typename Device>
+void testComplex ()
+{
+ Impl::testOutsideKernel<float> ();
+ Impl::testOutsideKernel<double> ();
+
+ Impl::testCreateView<float, Device> ();
+ Impl::testCreateView<double, Device> ();
+
+ Impl::testInsideKernel<float, Device> ();
+ Impl::testInsideKernel<double, Device> ();
+}
+
+
+} // namespace Test
+
+#endif // KOKKOS_TEST_COMPLEX_HPP
diff --git a/lib/kokkos/containers/unit_tests/TestCuda.cpp b/lib/kokkos/containers/unit_tests/TestCuda.cpp
new file mode 100755
index 000000000..2f79205c4
--- /dev/null
+++ b/lib/kokkos/containers/unit_tests/TestCuda.cpp
@@ -0,0 +1,206 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <iostream>
+#include <iomanip>
+#include <stdint.h>
+
+#include <gtest/gtest.h>
+
+#include <Kokkos_Core.hpp>
+
+#include <Kokkos_Bitset.hpp>
+#include <Kokkos_UnorderedMap.hpp>
+#include <Kokkos_Vector.hpp>
+
+#include <TestBitset.hpp>
+#include <TestUnorderedMap.hpp>
+#include <TestStaticCrsGraph.hpp>
+#include <TestVector.hpp>
+#include <TestDualView.hpp>
+#include <TestSegmentedView.hpp>
+
+//----------------------------------------------------------------------------
+
+
+#ifdef KOKKOS_HAVE_CUDA
+
+namespace Test {
+
+class cuda : public ::testing::Test {
+protected:
+ static void SetUpTestCase()
+ {
+ std::cout << std::setprecision(5) << std::scientific;
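+    // The host execution space is brought up before the Cuda back end and,
+    // in TearDownTestCase below, finalized after it.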
+ Kokkos::HostSpace::execution_space::initialize();
+ Kokkos::Cuda::initialize( Kokkos::Cuda::SelectDevice(0) );
+ }
+ static void TearDownTestCase()
+ {
+ Kokkos::Cuda::finalize();
+ Kokkos::HostSpace::execution_space::finalize();
+ }
+};
+
+TEST_F( cuda , staticcrsgraph )
+{
+ TestStaticCrsGraph::run_test_graph< Kokkos::Cuda >();
+ TestStaticCrsGraph::run_test_graph2< Kokkos::Cuda >();
+}
+
+
+void cuda_test_insert_close( uint32_t num_nodes
+ , uint32_t num_inserts
+ , uint32_t num_duplicates
+ )
+{
+ test_insert< Kokkos::Cuda >( num_nodes, num_inserts, num_duplicates, true);
+}
+
+void cuda_test_insert_far( uint32_t num_nodes
+ , uint32_t num_inserts
+ , uint32_t num_duplicates
+ )
+{
+ test_insert< Kokkos::Cuda >( num_nodes, num_inserts, num_duplicates, false);
+}
+
+void cuda_test_failed_insert( uint32_t num_nodes )
+{
+ test_failed_insert< Kokkos::Cuda >( num_nodes );
+}
+
+void cuda_test_deep_copy( uint32_t num_nodes )
+{
+ test_deep_copy< Kokkos::Cuda >( num_nodes );
+}
+
+void cuda_test_vector_combinations(unsigned int size)
+{
+ test_vector_combinations<int,Kokkos::Cuda>(size);
+}
+
+void cuda_test_dualview_combinations(unsigned int size)
+{
+ test_dualview_combinations<int,Kokkos::Cuda>(size);
+}
+
+void cuda_test_segmented_view(unsigned int size)
+{
+ test_segmented_view<double,Kokkos::Cuda>(size);
+}
+
+void cuda_test_bitset()
+{
+ test_bitset<Kokkos::Cuda>();
+}
+
+
+
+/*TEST_F( cuda, bitset )
+{
+ cuda_test_bitset();
+}*/
+
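+// The macros below stamp out parameterized gtest cases.  For example,
+// CUDA_INSERT_TEST(close, 100000, 90000, 100, 500) expands to a test named
+// UnorderedMap_insert_close_100000_90000_100_500x whose body calls
+// cuda_test_insert_close(100000, 90000, 100) in a loop of 500 iterations.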
+#define CUDA_INSERT_TEST( name, num_nodes, num_inserts, num_duplicates, repeat ) \
+ TEST_F( cuda, UnorderedMap_insert_##name##_##num_nodes##_##num_inserts##_##num_duplicates##_##repeat##x) { \
+ for (int i=0; i<repeat; ++i) \
+ cuda_test_insert_##name(num_nodes,num_inserts,num_duplicates); \
+ }
+
+#define CUDA_FAILED_INSERT_TEST( num_nodes, repeat ) \
+ TEST_F( cuda, UnorderedMap_failed_insert_##num_nodes##_##repeat##x) { \
+ for (int i=0; i<repeat; ++i) \
+ cuda_test_failed_insert(num_nodes); \
+ }
+
+#define CUDA_ASSIGNEMENT_TEST( num_nodes, repeat ) \
+ TEST_F( cuda, UnorderedMap_assignment_operators_##num_nodes##_##repeat##x) { \
+ for (int i=0; i<repeat; ++i) \
+ cuda_test_assignment_operators(num_nodes); \
+ }
+
+#define CUDA_DEEP_COPY( num_nodes, repeat ) \
+ TEST_F( cuda, UnorderedMap_deep_copy##num_nodes##_##repeat##x) { \
+ for (int i=0; i<repeat; ++i) \
+ cuda_test_deep_copy(num_nodes); \
+ }
+
+#define CUDA_VECTOR_COMBINE_TEST( size ) \
+ TEST_F( cuda, vector_combination##size##x) { \
+ cuda_test_vector_combinations(size); \
+ }
+
+#define CUDA_DUALVIEW_COMBINE_TEST( size ) \
+ TEST_F( cuda, dualview_combination##size##x) { \
+ cuda_test_dualview_combinations(size); \
+ }
+
+#define CUDA_SEGMENTEDVIEW_TEST( size ) \
+ TEST_F( cuda, segmentedview_##size##x) { \
+ cuda_test_segmented_view(size); \
+ }
+
+CUDA_DUALVIEW_COMBINE_TEST( 10 )
+CUDA_VECTOR_COMBINE_TEST( 10 )
+CUDA_VECTOR_COMBINE_TEST( 3057 )
+
+
+CUDA_INSERT_TEST(close, 100000, 90000, 100, 500)
+CUDA_INSERT_TEST(far, 100000, 90000, 100, 500)
+CUDA_DEEP_COPY( 10000, 1 )
+CUDA_FAILED_INSERT_TEST( 10000, 1000 )
+CUDA_SEGMENTEDVIEW_TEST( 200 )
+
+
+#undef CUDA_INSERT_TEST
+#undef CUDA_FAILED_INSERT_TEST
+#undef CUDA_ASSIGNEMENT_TEST
+#undef CUDA_DEEP_COPY
+#undef CUDA_VECTOR_COMBINE_TEST
+#undef CUDA_DUALVIEW_COMBINE_TEST
+#undef CUDA_SEGMENTEDVIEW_TEST
+}
+
+#endif /* #ifdef KOKKOS_HAVE_CUDA */
+
diff --git a/lib/kokkos/containers/unit_tests/TestDualView.hpp b/lib/kokkos/containers/unit_tests/TestDualView.hpp
new file mode 100755
index 000000000..e72c69f7d
--- /dev/null
+++ b/lib/kokkos/containers/unit_tests/TestDualView.hpp
@@ -0,0 +1,121 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#ifndef KOKKOS_TEST_DUALVIEW_HPP
+#define KOKKOS_TEST_DUALVIEW_HPP
+
+#include <gtest/gtest.h>
+#include <iostream>
+#include <cstdlib>
+#include <cstdio>
+#include <impl/Kokkos_Timer.hpp>
+
+namespace Test {
+
+namespace Impl {
+
+ template <typename Scalar, class Device>
+ struct test_dualview_combinations
+ {
+ typedef test_dualview_combinations<Scalar,Device> self_type;
+
+ typedef Scalar scalar_type;
+ typedef Device execution_space;
+
+ Scalar reference;
+ Scalar result;
+
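+    // run_me exercises the DualView modify/sync protocol: fill the device
+    // view with ones, sync to the host, change a few host entries, push the
+    // host data back, overwrite a device subview, and sync to the host
+    // again.  It returns the summed host data minus what those operations
+    // should have produced, so a consistent DualView yields zero.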
+ template <typename ViewType>
+ Scalar run_me(unsigned int n,unsigned int m){
+ if(n<10) n = 10;
+ if(m<3) m = 3;
+ ViewType a("A",n,m);
+
+ Kokkos::deep_copy( a.d_view , 1 );
+
+ a.template modify<typename ViewType::execution_space>();
+ a.template sync<typename ViewType::host_mirror_space>();
+
+ a.h_view(5,1) = 3;
+ a.h_view(6,1) = 4;
+ a.h_view(7,2) = 5;
+ a.template modify<typename ViewType::host_mirror_space>();
+ ViewType b = Kokkos::subview(a,std::pair<unsigned int, unsigned int>(6,9),std::pair<unsigned int, unsigned int>(0,1));
+ a.template sync<typename ViewType::execution_space>();
+ b.template modify<typename ViewType::execution_space>();
+
+ Kokkos::deep_copy( b.d_view , 2 );
+
+ a.template sync<typename ViewType::host_mirror_space>();
+ Scalar count = 0;
+ for(unsigned int i = 0; i<a.d_view.dimension_0(); i++)
+ for(unsigned int j = 0; j<a.d_view.dimension_1(); j++)
+ count += a.h_view(i,j);
+ return count - a.d_view.dimension_0()*a.d_view.dimension_1()-2-4-3*2;
+ }
+
+
+ test_dualview_combinations(unsigned int size)
+ {
+ result = run_me< Kokkos::DualView<Scalar**,Kokkos::LayoutLeft,Device> >(size,3);
+ }
+
+ };
+
+} // namespace Impl
+
+
+
+
+template <typename Scalar, typename Device>
+void test_dualview_combinations(unsigned int size)
+{
+ Impl::test_dualview_combinations<Scalar,Device> test(size);
+ ASSERT_EQ( test.result,0);
+
+}
+
+
+} // namespace Test
+
+#endif //KOKKOS_TEST_DUALVIEW_HPP
diff --git a/lib/kokkos/containers/unit_tests/TestOpenMP.cpp b/lib/kokkos/containers/unit_tests/TestOpenMP.cpp
new file mode 100755
index 000000000..0ff9b4f66
--- /dev/null
+++ b/lib/kokkos/containers/unit_tests/TestOpenMP.cpp
@@ -0,0 +1,162 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <gtest/gtest.h>
+
+#include <Kokkos_Core.hpp>
+
+#include <Kokkos_Bitset.hpp>
+#include <Kokkos_UnorderedMap.hpp>
+#include <Kokkos_Vector.hpp>
+
+//----------------------------------------------------------------------------
+#include <TestBitset.hpp>
+#include <TestUnorderedMap.hpp>
+#include <TestStaticCrsGraph.hpp>
+#include <TestVector.hpp>
+#include <TestDualView.hpp>
+#include <TestSegmentedView.hpp>
+#include <TestComplex.hpp>
+
+#include <iomanip>
+
+namespace Test {
+
+#ifdef KOKKOS_HAVE_OPENMP
+class openmp : public ::testing::Test {
+protected:
+ static void SetUpTestCase()
+ {
+ std::cout << std::setprecision(5) << std::scientific;
+
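+    // Default to 4 threads, but use one thread per physical core
+    // (NUMA regions x cores per NUMA region) when hwloc can report the
+    // hardware topology.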
+ unsigned threads_count = 4 ;
+
+ if ( Kokkos::hwloc::available() ) {
+ threads_count = Kokkos::hwloc::get_available_numa_count() *
+ Kokkos::hwloc::get_available_cores_per_numa();
+ }
+
+ Kokkos::OpenMP::initialize( threads_count );
+ }
+
+ static void TearDownTestCase()
+ {
+ Kokkos::OpenMP::finalize();
+ }
+};
+
+TEST_F( openmp, complex )
+{
+ testComplex<Kokkos::OpenMP> ();
+}
+
+TEST_F( openmp, bitset )
+{
+ test_bitset<Kokkos::OpenMP>();
+}
+
+TEST_F( openmp , staticcrsgraph )
+{
+ TestStaticCrsGraph::run_test_graph< Kokkos::OpenMP >();
+ TestStaticCrsGraph::run_test_graph2< Kokkos::OpenMP >();
+}
+
+#define OPENMP_INSERT_TEST( name, num_nodes, num_inserts, num_duplicates, repeat, near ) \
+ TEST_F( openmp, UnorderedMap_insert_##name##_##num_nodes##_##num_inserts##_##num_duplicates##_##repeat##x) { \
+ for (int i=0; i<repeat; ++i) \
+ test_insert<Kokkos::OpenMP>(num_nodes,num_inserts,num_duplicates, near); \
+ }
+
+#define OPENMP_FAILED_INSERT_TEST( num_nodes, repeat ) \
+ TEST_F( openmp, UnorderedMap_failed_insert_##num_nodes##_##repeat##x) { \
+ for (int i=0; i<repeat; ++i) \
+ test_failed_insert<Kokkos::OpenMP>(num_nodes); \
+ }
+
+#define OPENMP_ASSIGNEMENT_TEST( num_nodes, repeat ) \
+ TEST_F( openmp, UnorderedMap_assignment_operators_##num_nodes##_##repeat##x) { \
+ for (int i=0; i<repeat; ++i) \
+ test_assignement_operators<Kokkos::OpenMP>(num_nodes); \
+ }
+
+#define OPENMP_DEEP_COPY( num_nodes, repeat ) \
+ TEST_F( openmp, UnorderedMap_deep_copy##num_nodes##_##repeat##x) { \
+ for (int i=0; i<repeat; ++i) \
+ test_deep_copy<Kokkos::OpenMP>(num_nodes); \
+ }
+
+#define OPENMP_VECTOR_COMBINE_TEST( size ) \
+ TEST_F( openmp, vector_combination##size##x) { \
+ test_vector_combinations<int,Kokkos::OpenMP>(size); \
+ }
+
+#define OPENMP_DUALVIEW_COMBINE_TEST( size ) \
+ TEST_F( openmp, dualview_combination##size##x) { \
+ test_dualview_combinations<int,Kokkos::OpenMP>(size); \
+ }
+
+#define OPENMP_SEGMENTEDVIEW_TEST( size ) \
+ TEST_F( openmp, segmentedview_##size##x) { \
+ test_segmented_view<double,Kokkos::OpenMP>(size); \
+ }
+
+OPENMP_INSERT_TEST(close, 100000, 90000, 100, 500, true)
+OPENMP_INSERT_TEST(far, 100000, 90000, 100, 500, false)
+OPENMP_FAILED_INSERT_TEST( 10000, 1000 )
+OPENMP_DEEP_COPY( 10000, 1 )
+
+OPENMP_VECTOR_COMBINE_TEST( 10 )
+OPENMP_VECTOR_COMBINE_TEST( 3057 )
+OPENMP_DUALVIEW_COMBINE_TEST( 10 )
+OPENMP_SEGMENTEDVIEW_TEST( 10000 )
+
+#undef OPENMP_INSERT_TEST
+#undef OPENMP_FAILED_INSERT_TEST
+#undef OPENMP_ASSIGNEMENT_TEST
+#undef OPENMP_DEEP_COPY
+#undef OPENMP_VECTOR_COMBINE_TEST
+#undef OPENMP_DUALVIEW_COMBINE_TEST
+#undef OPENMP_SEGMENTEDVIEW_TEST
+#endif // KOKKOS_HAVE_OPENMP
+} // namespace Test
+
diff --git a/lib/kokkos/containers/unit_tests/TestSegmentedView.hpp b/lib/kokkos/containers/unit_tests/TestSegmentedView.hpp
new file mode 100755
index 000000000..3da4bc781
--- /dev/null
+++ b/lib/kokkos/containers/unit_tests/TestSegmentedView.hpp
@@ -0,0 +1,708 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#ifndef KOKKOS_TEST_SEGMENTEDVIEW_HPP
+#define KOKKOS_TEST_SEGMENTEDVIEW_HPP
+
+#include <gtest/gtest.h>
+#include <iostream>
+#include <cstdlib>
+#include <cstdio>
+#include <Kokkos_Core.hpp>
+
+#if ! defined( KOKKOS_USING_EXPERIMENTAL_VIEW )
+
+#include <Kokkos_SegmentedView.hpp>
+#include <impl/Kokkos_Timer.hpp>
+
+namespace Test {
+
+namespace Impl {
+
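+  // GrowTest<Rank> grows the segmented view from inside a team kernel and
+  // writes a deterministic index-based pattern into every entry it can
+  // address, accumulating the same pattern into the team reduction.
+  // VerifyTest<Rank> re-reads the view and accumulates what it finds, so
+  // the two reduction results are expected to match.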
+ template<class ViewType , class ExecutionSpace, int Rank = ViewType::Rank>
+ struct GrowTest;
+
+ template<class ViewType , class ExecutionSpace>
+ struct GrowTest<ViewType , ExecutionSpace , 1> {
+ typedef ExecutionSpace execution_space;
+ typedef Kokkos::TeamPolicy<execution_space> Policy;
+ typedef typename Policy::member_type team_type;
+ typedef double value_type;
+
+ ViewType a;
+
+ GrowTest(ViewType in):a(in) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (team_type team_member, double& value) const {
+ unsigned int team_idx = team_member.league_rank() * team_member.team_size();
+
+ a.grow(team_member , team_idx+team_member.team_size());
+ value += team_idx + team_member.team_rank();
+
+ if((a.dimension_0()>team_idx+team_member.team_rank()) &&
+ (a.dimension(0)>team_idx+team_member.team_rank()))
+ a(team_idx+team_member.team_rank()) = team_idx+team_member.team_rank();
+
+ }
+ };
+
+ template<class ViewType , class ExecutionSpace>
+ struct GrowTest<ViewType , ExecutionSpace , 2> {
+ typedef ExecutionSpace execution_space;
+ typedef Kokkos::TeamPolicy<execution_space> Policy;
+ typedef typename Policy::member_type team_type;
+ typedef double value_type;
+
+ ViewType a;
+
+ GrowTest(ViewType in):a(in) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (team_type team_member, double& value) const {
+ unsigned int team_idx = team_member.league_rank() * team_member.team_size();
+
+ a.grow(team_member , team_idx+ team_member.team_size());
+
+ for( typename ExecutionSpace::size_type k=0;k<7;k++)
+ value += team_idx + team_member.team_rank() + 13*k;
+
+ if((a.dimension_0()>team_idx+ team_member.team_rank()) &&
+ (a.dimension(0)>team_idx+ team_member.team_rank())) {
+ for( typename ExecutionSpace::size_type k=0;k<a.dimension_1();k++) {
+ a(team_idx+ team_member.team_rank(),k) =
+ team_idx+ team_member.team_rank() + 13*k;
+ }
+ }
+ }
+ };
+
+ template<class ViewType , class ExecutionSpace>
+ struct GrowTest<ViewType , ExecutionSpace , 3> {
+ typedef ExecutionSpace execution_space;
+ typedef Kokkos::TeamPolicy<execution_space> Policy;
+ typedef typename Policy::member_type team_type;
+ typedef double value_type;
+
+ ViewType a;
+
+ GrowTest(ViewType in):a(in) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (team_type team_member, double& value) const {
+ unsigned int team_idx = team_member.league_rank() * team_member.team_size();
+
+ a.grow(team_member , team_idx+ team_member.team_size());
+
+ for( typename ExecutionSpace::size_type k=0;k<7;k++)
+ for( typename ExecutionSpace::size_type l=0;l<3;l++)
+ value += team_idx + team_member.team_rank() + 13*k + 3*l;
+
+ if((a.dimension_0()>team_idx+ team_member.team_rank()) &&
+ (a.dimension(0)>team_idx+ team_member.team_rank())) {
+ for( typename ExecutionSpace::size_type k=0;k<a.dimension_1();k++)
+ for( typename ExecutionSpace::size_type l=0;l<a.dimension_2();l++)
+ a(team_idx+ team_member.team_rank(),k,l) =
+ team_idx+ team_member.team_rank() + 13*k + 3*l;
+ }
+ }
+ };
+
+ template<class ViewType , class ExecutionSpace>
+ struct GrowTest<ViewType , ExecutionSpace , 4> {
+ typedef ExecutionSpace execution_space;
+ typedef Kokkos::TeamPolicy<execution_space> Policy;
+ typedef typename Policy::member_type team_type;
+ typedef double value_type;
+
+ ViewType a;
+
+ GrowTest(ViewType in):a(in) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (team_type team_member, double& value) const {
+ unsigned int team_idx = team_member.league_rank() * team_member.team_size();
+
+ a.grow(team_member , team_idx+ team_member.team_size());
+
+ for( typename ExecutionSpace::size_type k=0;k<7;k++)
+ for( typename ExecutionSpace::size_type l=0;l<3;l++)
+ for( typename ExecutionSpace::size_type m=0;m<2;m++)
+ value += team_idx + team_member.team_rank() + 13*k + 3*l + 7*m;
+
+ if((a.dimension_0()>team_idx+ team_member.team_rank()) &&
+ (a.dimension(0)>team_idx+ team_member.team_rank())) {
+ for( typename ExecutionSpace::size_type k=0;k<a.dimension_1();k++)
+ for( typename ExecutionSpace::size_type l=0;l<a.dimension_2();l++)
+ for( typename ExecutionSpace::size_type m=0;m<a.dimension_3();m++)
+ a(team_idx+ team_member.team_rank(),k,l,m) =
+ team_idx+ team_member.team_rank() + 13*k + 3*l + 7*m;
+ }
+ }
+ };
+
+ template<class ViewType , class ExecutionSpace>
+ struct GrowTest<ViewType , ExecutionSpace , 5> {
+ typedef ExecutionSpace execution_space;
+ typedef Kokkos::TeamPolicy<execution_space> Policy;
+ typedef typename Policy::member_type team_type;
+ typedef double value_type;
+
+ ViewType a;
+
+ GrowTest(ViewType in):a(in) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (team_type team_member, double& value) const {
+ unsigned int team_idx = team_member.league_rank() * team_member.team_size();
+
+ a.grow(team_member , team_idx+ team_member.team_size());
+
+ for( typename ExecutionSpace::size_type k=0;k<7;k++)
+ for( typename ExecutionSpace::size_type l=0;l<3;l++)
+ for( typename ExecutionSpace::size_type m=0;m<2;m++)
+ for( typename ExecutionSpace::size_type n=0;n<3;n++)
+ value +=
+ team_idx + team_member.team_rank() + 13*k + 3*l + 7*m + 5*n;
+
+ if((a.dimension_0()>team_idx+ team_member.team_rank()) &&
+ (a.dimension(0)>team_idx+ team_member.team_rank())) {
+ for( typename ExecutionSpace::size_type k=0;k<a.dimension_1();k++)
+ for( typename ExecutionSpace::size_type l=0;l<a.dimension_2();l++)
+ for( typename ExecutionSpace::size_type m=0;m<a.dimension_3();m++)
+ for( typename ExecutionSpace::size_type n=0;n<a.dimension_4();n++)
+ a(team_idx+ team_member.team_rank(),k,l,m,n) =
+ team_idx+ team_member.team_rank() + 13*k + 3*l + 7*m + 5*n;
+ }
+ }
+ };
+
+ template<class ViewType , class ExecutionSpace>
+ struct GrowTest<ViewType , ExecutionSpace , 6> {
+ typedef ExecutionSpace execution_space;
+ typedef Kokkos::TeamPolicy<execution_space> Policy;
+ typedef typename Policy::member_type team_type;
+ typedef double value_type;
+
+ ViewType a;
+
+ GrowTest(ViewType in):a(in) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (team_type team_member, double& value) const {
+ unsigned int team_idx = team_member.league_rank() * team_member.team_size();
+
+ a.grow(team_member , team_idx+ team_member.team_size());
+
+ for( typename ExecutionSpace::size_type k=0;k<7;k++)
+ for( typename ExecutionSpace::size_type l=0;l<3;l++)
+ for( typename ExecutionSpace::size_type m=0;m<2;m++)
+ for( typename ExecutionSpace::size_type n=0;n<3;n++)
+ for( typename ExecutionSpace::size_type o=0;o<2;o++)
+ value +=
+ team_idx + team_member.team_rank() + 13*k + 3*l + 7*m + 5*n + 2*o ;
+
+ if((a.dimension_0()>team_idx+ team_member.team_rank()) &&
+ (a.dimension(0)>team_idx+ team_member.team_rank())) {
+ for( typename ExecutionSpace::size_type k=0;k<a.dimension_1();k++)
+ for( typename ExecutionSpace::size_type l=0;l<a.dimension_2();l++)
+ for( typename ExecutionSpace::size_type m=0;m<a.dimension_3();m++)
+ for( typename ExecutionSpace::size_type n=0;n<a.dimension_4();n++)
+ for( typename ExecutionSpace::size_type o=0;o<a.dimension_5();o++)
+ a(team_idx+ team_member.team_rank(),k,l,m,n,o) =
+ team_idx + team_member.team_rank() + 13*k + 3*l + 7*m + 5*n + 2*o ;
+ }
+ }
+ };
+
+ template<class ViewType , class ExecutionSpace>
+ struct GrowTest<ViewType , ExecutionSpace , 7> {
+ typedef ExecutionSpace execution_space;
+ typedef Kokkos::TeamPolicy<execution_space> Policy;
+ typedef typename Policy::member_type team_type;
+ typedef double value_type;
+
+ ViewType a;
+
+ GrowTest(ViewType in):a(in) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (team_type team_member, double& value) const {
+ unsigned int team_idx = team_member.league_rank() * team_member.team_size();
+
+ a.grow(team_member , team_idx+ team_member.team_size());
+
+ for( typename ExecutionSpace::size_type k=0;k<7;k++)
+ for( typename ExecutionSpace::size_type l=0;l<3;l++)
+ for( typename ExecutionSpace::size_type m=0;m<2;m++)
+ for( typename ExecutionSpace::size_type n=0;n<3;n++)
+ for( typename ExecutionSpace::size_type o=0;o<2;o++)
+ for( typename ExecutionSpace::size_type p=0;p<4;p++)
+ value +=
+ team_idx + team_member.team_rank() + 13*k + 3*l + 7*m + 5*n + 2*o + 15*p ;
+
+ if((a.dimension_0()>team_idx+ team_member.team_rank()) &&
+ (a.dimension(0)>team_idx+ team_member.team_rank())) {
+ for( typename ExecutionSpace::size_type k=0;k<a.dimension_1();k++)
+ for( typename ExecutionSpace::size_type l=0;l<a.dimension_2();l++)
+ for( typename ExecutionSpace::size_type m=0;m<a.dimension_3();m++)
+ for( typename ExecutionSpace::size_type n=0;n<a.dimension_4();n++)
+ for( typename ExecutionSpace::size_type o=0;o<a.dimension_5();o++)
+ for( typename ExecutionSpace::size_type p=0;p<a.dimension_6();p++)
+ a(team_idx+ team_member.team_rank(),k,l,m,n,o,p) =
+ team_idx + team_member.team_rank() + 13*k + 3*l + 7*m + 5*n + 2*o + 15*p ;
+ }
+ }
+ };
+
+ template<class ViewType , class ExecutionSpace>
+ struct GrowTest<ViewType , ExecutionSpace , 8> {
+ typedef ExecutionSpace execution_space;
+ typedef Kokkos::TeamPolicy<execution_space> Policy;
+ typedef typename Policy::member_type team_type;
+ typedef double value_type;
+
+ ViewType a;
+
+ GrowTest(ViewType in):a(in) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (team_type team_member, double& value) const {
+ unsigned int team_idx = team_member.league_rank() * team_member.team_size();
+ a.grow(team_member , team_idx + team_member.team_size());
+
+ for( typename ExecutionSpace::size_type k=0;k<7;k++)
+ for( typename ExecutionSpace::size_type l=0;l<3;l++)
+ for( typename ExecutionSpace::size_type m=0;m<2;m++)
+ for( typename ExecutionSpace::size_type n=0;n<3;n++)
+ for( typename ExecutionSpace::size_type o=0;o<2;o++)
+ for( typename ExecutionSpace::size_type p=0;p<4;p++)
+ for( typename ExecutionSpace::size_type q=0;q<3;q++)
+ value +=
+ team_idx + team_member.team_rank() + 13*k + 3*l + 7*m + 5*n + 2*o + 15*p + 17*q;
+
+ if((a.dimension_0()>team_idx+ team_member.team_rank()) &&
+ (a.dimension(0)>team_idx+ team_member.team_rank())) {
+ for( typename ExecutionSpace::size_type k=0;k<a.dimension_1();k++)
+ for( typename ExecutionSpace::size_type l=0;l<a.dimension_2();l++)
+ for( typename ExecutionSpace::size_type m=0;m<a.dimension_3();m++)
+ for( typename ExecutionSpace::size_type n=0;n<a.dimension_4();n++)
+ for( typename ExecutionSpace::size_type o=0;o<a.dimension_5();o++)
+ for( typename ExecutionSpace::size_type p=0;p<a.dimension_6();p++)
+ for( typename ExecutionSpace::size_type q=0;q<a.dimension_7();q++)
+ a(team_idx+ team_member.team_rank(),k,l,m,n,o,p,q) =
+ team_idx + team_member.team_rank() + 13*k + 3*l + 7*m + 5*n + 2*o + 15*p + 17*q;
+ }
+ }
+ };
+
+ template<class ViewType , class ExecutionSpace, int Rank = ViewType::Rank>
+ struct VerifyTest;
+
+ template<class ViewType , class ExecutionSpace>
+ struct VerifyTest<ViewType , ExecutionSpace , 1> {
+ typedef ExecutionSpace execution_space;
+ typedef Kokkos::TeamPolicy<execution_space> Policy;
+ typedef typename Policy::member_type team_type;
+ typedef double value_type;
+
+ ViewType a;
+
+ VerifyTest(ViewType in):a(in) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (team_type team_member, double& value) const {
+ unsigned int team_idx = team_member.league_rank() * team_member.team_size();
+
+ if((a.dimension_0()>team_idx+ team_member.team_rank()) &&
+ (a.dimension(0)>team_idx+ team_member.team_rank())) {
+ value += a(team_idx+ team_member.team_rank());
+ }
+ }
+ };
+
+ template<class ViewType , class ExecutionSpace>
+ struct VerifyTest<ViewType , ExecutionSpace , 2> {
+ typedef ExecutionSpace execution_space;
+ typedef Kokkos::TeamPolicy<execution_space> Policy;
+ typedef typename Policy::member_type team_type;
+ typedef double value_type;
+
+ ViewType a;
+
+ VerifyTest(ViewType in):a(in) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (team_type team_member, double& value) const {
+ unsigned int team_idx = team_member.league_rank() * team_member.team_size();
+
+ if((a.dimension_0()>team_idx+ team_member.team_rank()) &&
+ (a.dimension(0)>team_idx+ team_member.team_rank())) {
+ for( typename ExecutionSpace::size_type k=0;k<a.dimension_1();k++)
+ value += a(team_idx+ team_member.team_rank(),k);
+ }
+ }
+ };
+
+ template<class ViewType , class ExecutionSpace>
+ struct VerifyTest<ViewType , ExecutionSpace , 3> {
+ typedef ExecutionSpace execution_space;
+ typedef Kokkos::TeamPolicy<execution_space> Policy;
+ typedef typename Policy::member_type team_type;
+ typedef double value_type;
+
+ ViewType a;
+
+ VerifyTest(ViewType in):a(in) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (team_type team_member, double& value) const {
+ unsigned int team_idx = team_member.league_rank() * team_member.team_size();
+
+ if((a.dimension_0()>team_idx+ team_member.team_rank()) &&
+ (a.dimension(0)>team_idx+ team_member.team_rank())) {
+ for( typename ExecutionSpace::size_type k=0;k<a.dimension_1();k++)
+ for( typename ExecutionSpace::size_type l=0;l<a.dimension_2();l++)
+ value += a(team_idx+ team_member.team_rank(),k,l);
+ }
+ }
+ };
+
+ template<class ViewType , class ExecutionSpace>
+ struct VerifyTest<ViewType , ExecutionSpace , 4> {
+ typedef ExecutionSpace execution_space;
+ typedef Kokkos::TeamPolicy<execution_space> Policy;
+ typedef typename Policy::member_type team_type;
+ typedef double value_type;
+
+ ViewType a;
+
+ VerifyTest(ViewType in):a(in) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (team_type team_member, double& value) const {
+ unsigned int team_idx = team_member.league_rank() * team_member.team_size();
+
+ if((a.dimension_0()>team_idx+ team_member.team_rank()) &&
+ (a.dimension(0)>team_idx+ team_member.team_rank())) {
+ for( typename ExecutionSpace::size_type k=0;k<a.dimension_1();k++)
+ for( typename ExecutionSpace::size_type l=0;l<a.dimension_2();l++)
+ for( typename ExecutionSpace::size_type m=0;m<a.dimension_3();m++)
+ value += a(team_idx+ team_member.team_rank(),k,l,m);
+ }
+ }
+ };
+
+ template<class ViewType , class ExecutionSpace>
+ struct VerifyTest<ViewType , ExecutionSpace , 5> {
+ typedef ExecutionSpace execution_space;
+ typedef Kokkos::TeamPolicy<execution_space> Policy;
+ typedef typename Policy::member_type team_type;
+ typedef double value_type;
+
+ ViewType a;
+
+ VerifyTest(ViewType in):a(in) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (team_type team_member, double& value) const {
+ unsigned int team_idx = team_member.league_rank() * team_member.team_size();
+
+ if((a.dimension_0()>team_idx+ team_member.team_rank()) &&
+ (a.dimension(0)>team_idx+ team_member.team_rank())) {
+ for( typename ExecutionSpace::size_type k=0;k<a.dimension_1();k++)
+ for( typename ExecutionSpace::size_type l=0;l<a.dimension_2();l++)
+ for( typename ExecutionSpace::size_type m=0;m<a.dimension_3();m++)
+ for( typename ExecutionSpace::size_type n=0;n<a.dimension_4();n++)
+ value += a(team_idx+ team_member.team_rank(),k,l,m,n);
+ }
+ }
+ };
+
+ template<class ViewType , class ExecutionSpace>
+ struct VerifyTest<ViewType , ExecutionSpace , 6> {
+ typedef ExecutionSpace execution_space;
+ typedef Kokkos::TeamPolicy<execution_space> Policy;
+ typedef typename Policy::member_type team_type;
+ typedef double value_type;
+
+ ViewType a;
+
+ VerifyTest(ViewType in):a(in) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (team_type team_member, double& value) const {
+ unsigned int team_idx = team_member.league_rank() * team_member.team_size();
+
+ if((a.dimension_0()>team_idx+ team_member.team_rank()) &&
+ (a.dimension(0)>team_idx+ team_member.team_rank())) {
+ for( typename ExecutionSpace::size_type k=0;k<a.dimension_1();k++)
+ for( typename ExecutionSpace::size_type l=0;l<a.dimension_2();l++)
+ for( typename ExecutionSpace::size_type m=0;m<a.dimension_3();m++)
+ for( typename ExecutionSpace::size_type n=0;n<a.dimension_4();n++)
+ for( typename ExecutionSpace::size_type o=0;o<a.dimension_5();o++)
+ value += a(team_idx+ team_member.team_rank(),k,l,m,n,o);
+ }
+ }
+ };
+
+ template<class ViewType , class ExecutionSpace>
+ struct VerifyTest<ViewType , ExecutionSpace , 7> {
+ typedef ExecutionSpace execution_space;
+ typedef Kokkos::TeamPolicy<execution_space> Policy;
+ typedef typename Policy::member_type team_type;
+ typedef double value_type;
+
+ ViewType a;
+
+ VerifyTest(ViewType in):a(in) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (team_type team_member, double& value) const {
+ unsigned int team_idx = team_member.league_rank() * team_member.team_size();
+
+ if((a.dimension_0()>team_idx+ team_member.team_rank()) &&
+ (a.dimension(0)>team_idx+ team_member.team_rank())) {
+ for( typename ExecutionSpace::size_type k=0;k<a.dimension_1();k++)
+ for( typename ExecutionSpace::size_type l=0;l<a.dimension_2();l++)
+ for( typename ExecutionSpace::size_type m=0;m<a.dimension_3();m++)
+ for( typename ExecutionSpace::size_type n=0;n<a.dimension_4();n++)
+ for( typename ExecutionSpace::size_type o=0;o<a.dimension_5();o++)
+ for( typename ExecutionSpace::size_type p=0;p<a.dimension_6();p++)
+ value += a(team_idx+ team_member.team_rank(),k,l,m,n,o,p);
+ }
+ }
+ };
+
+ template<class ViewType , class ExecutionSpace>
+ struct VerifyTest<ViewType , ExecutionSpace , 8> {
+ typedef ExecutionSpace execution_space;
+ typedef Kokkos::TeamPolicy<execution_space> Policy;
+ typedef typename Policy::member_type team_type;
+ typedef double value_type;
+
+ ViewType a;
+
+ VerifyTest(ViewType in):a(in) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (team_type team_member, double& value) const {
+ unsigned int team_idx = team_member.league_rank() * team_member.team_size();
+
+ if((a.dimension_0()>team_idx+ team_member.team_rank()) &&
+ (a.dimension(0)>team_idx+ team_member.team_rank())) {
+ for( typename ExecutionSpace::size_type k=0;k<a.dimension_1();k++)
+ for( typename ExecutionSpace::size_type l=0;l<a.dimension_2();l++)
+ for( typename ExecutionSpace::size_type m=0;m<a.dimension_3();m++)
+ for( typename ExecutionSpace::size_type n=0;n<a.dimension_4();n++)
+ for( typename ExecutionSpace::size_type o=0;o<a.dimension_5();o++)
+ for( typename ExecutionSpace::size_type p=0;p<a.dimension_6();p++)
+ for( typename ExecutionSpace::size_type q=0;q<a.dimension_7();q++)
+ value += a(team_idx+ team_member.team_rank(),k,l,m,n,o,p,q);
+ }
+ }
+ };
+
+ template <typename Scalar, class ExecutionSpace>
+ struct test_segmented_view
+ {
+ typedef test_segmented_view<Scalar,ExecutionSpace> self_type;
+
+ typedef Scalar scalar_type;
+ typedef ExecutionSpace execution_space;
+ typedef Kokkos::TeamPolicy<execution_space> Policy;
+
+ double result;
+ double reference;
+
+ template <class ViewType>
+ void run_me(ViewType a, int max_length){
+ const int team_size = Policy::team_size_max( GrowTest<ViewType,execution_space>(a) );
+ const int nteams = max_length/team_size;
+
+ reference = 0;
+ result = 0;
+
+ Kokkos::parallel_reduce(Policy(nteams,team_size),GrowTest<ViewType,execution_space>(a),reference);
+ Kokkos::fence();
+ Kokkos::parallel_reduce(Policy(nteams,team_size),VerifyTest<ViewType,execution_space>(a),result);
+ Kokkos::fence();
+ }
+
+
+ test_segmented_view(unsigned int size,int rank)
+ {
+ reference = 0;
+ result = 0;
+
+ const int dim_1 = 7;
+ const int dim_2 = 3;
+ const int dim_3 = 2;
+ const int dim_4 = 3;
+ const int dim_5 = 2;
+ const int dim_6 = 4;
+ //const int dim_7 = 3;
+
+ if(rank==1) {
+ typedef Kokkos::Experimental::SegmentedView<Scalar*,Kokkos::LayoutLeft,ExecutionSpace> rank1_view;
+ run_me< rank1_view >(rank1_view("Rank1",128,size), size);
+ }
+ if(rank==2) {
+ typedef Kokkos::Experimental::SegmentedView<Scalar**,Kokkos::LayoutLeft,ExecutionSpace> rank2_view;
+ run_me< rank2_view >(rank2_view("Rank2",128,size,dim_1), size);
+ }
+ if(rank==3) {
+ typedef Kokkos::Experimental::SegmentedView<Scalar*[7][3][2],Kokkos::LayoutRight,ExecutionSpace> rank3_view;
+ run_me< rank3_view >(rank3_view("Rank3",128,size), size);
+ }
+ if(rank==4) {
+ typedef Kokkos::Experimental::SegmentedView<Scalar****,Kokkos::LayoutRight,ExecutionSpace> rank4_view;
+ run_me< rank4_view >(rank4_view("Rank4",128,size,dim_1,dim_2,dim_3), size);
+ }
+ if(rank==5) {
+ typedef Kokkos::Experimental::SegmentedView<Scalar*[7][3][2][3],Kokkos::LayoutLeft,ExecutionSpace> rank5_view;
+ run_me< rank5_view >(rank5_view("Rank5",128,size), size);
+ }
+ if(rank==6) {
+ typedef Kokkos::Experimental::SegmentedView<Scalar*****[2],Kokkos::LayoutRight,ExecutionSpace> rank6_view;
+ run_me< rank6_view >(rank6_view("Rank6",128,size,dim_1,dim_2,dim_3,dim_4), size);
+ }
+ if(rank==7) {
+ typedef Kokkos::Experimental::SegmentedView<Scalar*******,Kokkos::LayoutLeft,ExecutionSpace> rank7_view;
+ run_me< rank7_view >(rank7_view("Rank7",128,size,dim_1,dim_2,dim_3,dim_4,dim_5,dim_6), size);
+ }
+ if(rank==8) {
+ typedef Kokkos::Experimental::SegmentedView<Scalar*****[2][4][3],Kokkos::LayoutLeft,ExecutionSpace> rank8_view;
+ run_me< rank8_view >(rank8_view("Rank8",128,size,dim_1,dim_2,dim_3,dim_4), size);
+ }
+ }
+
+ };
+
+} // namespace Impl
+
+
+
+
+template <typename Scalar, class ExecutionSpace>
+void test_segmented_view(unsigned int size)
+{
+ {
+ typedef Kokkos::Experimental::SegmentedView<Scalar*****[2][4][3],Kokkos::LayoutLeft,ExecutionSpace> view_type;
+ view_type a("A",128,size,7,3,2,3);
+ double reference;
+
+ Impl::GrowTest<view_type,ExecutionSpace> f(a);
+
+ const int team_size = Kokkos::TeamPolicy<ExecutionSpace>::team_size_max( f );
+ const int nteams = (size+team_size-1)/team_size;
+
+ Kokkos::parallel_reduce(Kokkos::TeamPolicy<ExecutionSpace>(nteams,team_size),f,reference);
+
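+  // The view above was created with 128-entry segments, so after growing
+  // inside the kernel its leading extent is size rounded up to a whole
+  // number of segments.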
+ size_t real_size = ((size+127)/128)*128;
+
+ ASSERT_EQ(real_size,a.dimension_0());
+ ASSERT_EQ(7,a.dimension_1());
+ ASSERT_EQ(3,a.dimension_2());
+ ASSERT_EQ(2,a.dimension_3());
+ ASSERT_EQ(3,a.dimension_4());
+ ASSERT_EQ(2,a.dimension_5());
+ ASSERT_EQ(4,a.dimension_6());
+ ASSERT_EQ(3,a.dimension_7());
+ ASSERT_EQ(real_size,a.dimension(0));
+ ASSERT_EQ(7,a.dimension(1));
+ ASSERT_EQ(3,a.dimension(2));
+ ASSERT_EQ(2,a.dimension(3));
+ ASSERT_EQ(3,a.dimension(4));
+ ASSERT_EQ(2,a.dimension(5));
+ ASSERT_EQ(4,a.dimension(6));
+ ASSERT_EQ(3,a.dimension(7));
+ ASSERT_EQ(8,a.Rank);
+ }
+ {
+ Impl::test_segmented_view<Scalar,ExecutionSpace> test(size,1);
+ ASSERT_EQ(test.reference,test.result);
+ }
+ {
+ Impl::test_segmented_view<Scalar,ExecutionSpace> test(size,2);
+ ASSERT_EQ(test.reference,test.result);
+ }
+ {
+ Impl::test_segmented_view<Scalar,ExecutionSpace> test(size,3);
+ ASSERT_EQ(test.reference,test.result);
+ }
+ {
+ Impl::test_segmented_view<Scalar,ExecutionSpace> test(size,4);
+ ASSERT_EQ(test.reference,test.result);
+ }
+ {
+ Impl::test_segmented_view<Scalar,ExecutionSpace> test(size,5);
+ ASSERT_EQ(test.reference,test.result);
+ }
+ {
+ Impl::test_segmented_view<Scalar,ExecutionSpace> test(size,6);
+ ASSERT_EQ(test.reference,test.result);
+ }
+ {
+ Impl::test_segmented_view<Scalar,ExecutionSpace> test(size,7);
+ ASSERT_EQ(test.reference,test.result);
+ }
+ {
+ Impl::test_segmented_view<Scalar,ExecutionSpace> test(size,8);
+ ASSERT_EQ(test.reference,test.result);
+ }
+
+}
+
+
+} // namespace Test
+
+#else
+
+template <typename Scalar, class ExecutionSpace>
+void test_segmented_view(unsigned int ) {}
+
+#endif
+
+#endif /* #ifndef KOKKOS_TEST_SEGMENTEDVIEW_HPP */
+
diff --git a/lib/kokkos/containers/unit_tests/TestSerial.cpp b/lib/kokkos/containers/unit_tests/TestSerial.cpp
new file mode 100755
index 000000000..6f00b113f
--- /dev/null
+++ b/lib/kokkos/containers/unit_tests/TestSerial.cpp
@@ -0,0 +1,158 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <gtest/gtest.h>
+
+#include <Kokkos_Core.hpp>
+
+#if ! defined(KOKKOS_HAVE_SERIAL)
+# error "It doesn't make sense to build this file unless the Kokkos::Serial device is enabled. If you see this message, it probably means that there is an error in Kokkos' CMake build infrastructure."
+#else
+
+#include <Kokkos_Bitset.hpp>
+#include <Kokkos_UnorderedMap.hpp>
+#include <Kokkos_Vector.hpp>
+
+#include <TestBitset.hpp>
+#include <TestUnorderedMap.hpp>
+#include <TestStaticCrsGraph.hpp>
+#include <TestVector.hpp>
+#include <TestDualView.hpp>
+#include <TestSegmentedView.hpp>
+#include <TestComplex.hpp>
+
+#include <iomanip>
+
+namespace Test {
+
+class serial : public ::testing::Test {
+protected:
+ static void SetUpTestCase () {
+ std::cout << std::setprecision(5) << std::scientific;
+ Kokkos::Serial::initialize ();
+ }
+
+ static void TearDownTestCase () {
+ Kokkos::Serial::finalize ();
+ }
+};
+
+
+TEST_F( serial , staticcrsgraph )
+{
+ TestStaticCrsGraph::run_test_graph< Kokkos::Serial >();
+ TestStaticCrsGraph::run_test_graph2< Kokkos::Serial >();
+}
+
+TEST_F( serial, complex )
+{
+ testComplex<Kokkos::Serial> ();
+}
+
+TEST_F( serial, bitset )
+{
+ test_bitset<Kokkos::Serial> ();
+}
+
+#define SERIAL_INSERT_TEST( name, num_nodes, num_inserts, num_duplicates, repeat, near ) \
+ TEST_F( serial, UnorderedMap_insert_##name##_##num_nodes##_##num_inserts##_##num_duplicates##_##repeat##x) { \
+ for (int i=0; i<repeat; ++i) \
+ test_insert<Kokkos::Serial> (num_nodes, num_inserts, num_duplicates, near); \
+ }
+
+#define SERIAL_FAILED_INSERT_TEST( num_nodes, repeat ) \
+ TEST_F( serial, UnorderedMap_failed_insert_##num_nodes##_##repeat##x) { \
+ for (int i=0; i<repeat; ++i) \
+ test_failed_insert<Kokkos::Serial> (num_nodes); \
+ }
+
+#define SERIAL_ASSIGNEMENT_TEST( num_nodes, repeat ) \
+ TEST_F( serial, UnorderedMap_assignment_operators_##num_nodes##_##repeat##x) { \
+ for (int i=0; i<repeat; ++i) \
+ test_assignement_operators<Kokkos::Serial> (num_nodes); \
+ }
+
+#define SERIAL_DEEP_COPY( num_nodes, repeat ) \
+ TEST_F( serial, UnorderedMap_deep_copy##num_nodes##_##repeat##x) { \
+ for (int i=0; i<repeat; ++i) \
+ test_deep_copy<Kokkos::Serial> (num_nodes); \
+ }
+
+#define SERIAL_VECTOR_COMBINE_TEST( size ) \
+ TEST_F( serial, vector_combination##size##x) { \
+ test_vector_combinations<int,Kokkos::Serial>(size); \
+ }
+
+#define SERIAL_DUALVIEW_COMBINE_TEST( size ) \
+ TEST_F( serial, dualview_combination##size##x) { \
+ test_dualview_combinations<int,Kokkos::Serial>(size); \
+ }
+
+#define SERIAL_SEGMENTEDVIEW_TEST( size ) \
+ TEST_F( serial, segmentedview_##size##x) { \
+ test_segmented_view<double,Kokkos::Serial>(size); \
+ }
+
+SERIAL_INSERT_TEST(close, 100000, 90000, 100, 500, true)
+SERIAL_INSERT_TEST(far, 100000, 90000, 100, 500, false)
+SERIAL_FAILED_INSERT_TEST( 10000, 1000 )
+SERIAL_DEEP_COPY( 10000, 1 )
+
+SERIAL_VECTOR_COMBINE_TEST( 10 )
+SERIAL_VECTOR_COMBINE_TEST( 3057 )
+SERIAL_DUALVIEW_COMBINE_TEST( 10 )
+SERIAL_SEGMENTEDVIEW_TEST( 10000 )
+
+#undef SERIAL_INSERT_TEST
+#undef SERIAL_FAILED_INSERT_TEST
+#undef SERIAL_ASSIGNEMENT_TEST
+#undef SERIAL_DEEP_COPY
+#undef SERIAL_VECTOR_COMBINE_TEST
+#undef SERIAL_DUALVIEW_COMBINE_TEST
+#undef SERIAL_SEGMENTEDVIEW_TEST
+
+} // namespace Test
+
+#endif // KOKKOS_HAVE_SERIAL
+
+
diff --git a/lib/kokkos/containers/unit_tests/TestStaticCrsGraph.hpp b/lib/kokkos/containers/unit_tests/TestStaticCrsGraph.hpp
new file mode 100755
index 000000000..52b45b786
--- /dev/null
+++ b/lib/kokkos/containers/unit_tests/TestStaticCrsGraph.hpp
@@ -0,0 +1,149 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <gtest/gtest.h>
+
+#include <vector>
+
+#include <Kokkos_StaticCrsGraph.hpp>
+
+/*--------------------------------------------------------------------------*/
+
+namespace TestStaticCrsGraph {
+
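+// Build a StaticCrsGraph from a host std::vector-of-vectors adjacency list,
+// create a host mirror, and check that the row map and entries match the
+// original input.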
+template< class Space >
+void run_test_graph()
+{
+ typedef Kokkos::StaticCrsGraph< unsigned , Space > dView ;
+ typedef typename dView::HostMirror hView ;
+
+ const unsigned LENGTH = 1000 ;
+ dView dx ;
+ hView hx ;
+
+ std::vector< std::vector< int > > graph( LENGTH );
+
+ for ( size_t i = 0 ; i < LENGTH ; ++i ) {
+ graph[i].reserve(8);
+ for ( size_t j = 0 ; j < 8 ; ++j ) {
+ graph[i].push_back( i + j * 3 );
+ }
+ }
+
+ dx = Kokkos::create_staticcrsgraph<dView>( "dx" , graph );
+ hx = Kokkos::create_mirror( dx );
+
+ ASSERT_EQ( hx.row_map.dimension_0() - 1 , LENGTH );
+
+ for ( size_t i = 0 ; i < LENGTH ; ++i ) {
+ const size_t begin = hx.row_map[i];
+ const size_t n = hx.row_map[i+1] - begin ;
+ ASSERT_EQ( n , graph[i].size() );
+ for ( size_t j = 0 ; j < n ; ++j ) {
+ ASSERT_EQ( (int) hx.entries( j + begin ) , graph[i][j] );
+ }
+ }
+}
+
+template< class Space >
+void run_test_graph2()
+{
+ typedef Kokkos::StaticCrsGraph< unsigned[3] , Space > dView ;
+ typedef typename dView::HostMirror hView ;
+
+ const unsigned LENGTH = 10 ;
+
+ std::vector< size_t > sizes( LENGTH );
+
+ size_t total_length = 0 ;
+
+ for ( size_t i = 0 ; i < LENGTH ; ++i ) {
+ total_length += ( sizes[i] = 6 + i % 4 );
+ }
+
+ dView dx = Kokkos::create_staticcrsgraph<dView>( "test" , sizes );
+ hView hx = Kokkos::create_mirror( dx );
+ hView mx = Kokkos::create_mirror( dx );
+
+ ASSERT_EQ( (size_t) dx.row_map.dimension_0() , (size_t) LENGTH + 1 );
+ ASSERT_EQ( (size_t) hx.row_map.dimension_0() , (size_t) LENGTH + 1 );
+ ASSERT_EQ( (size_t) mx.row_map.dimension_0() , (size_t) LENGTH + 1 );
+
+ ASSERT_EQ( (size_t) dx.entries.dimension_0() , (size_t) total_length );
+ ASSERT_EQ( (size_t) hx.entries.dimension_0() , (size_t) total_length );
+ ASSERT_EQ( (size_t) mx.entries.dimension_0() , (size_t) total_length );
+
+ ASSERT_EQ( (size_t) dx.entries.dimension_1() , (size_t) 3 );
+ ASSERT_EQ( (size_t) hx.entries.dimension_1() , (size_t) 3 );
+ ASSERT_EQ( (size_t) mx.entries.dimension_1() , (size_t) 3 );
+
+ for ( size_t i = 0 ; i < LENGTH ; ++i ) {
+ const size_t entry_begin = hx.row_map[i];
+ const size_t entry_end = hx.row_map[i+1];
+ for ( size_t j = entry_begin ; j < entry_end ; ++j ) {
+ hx.entries(j,0) = j + 1 ;
+ hx.entries(j,1) = j + 2 ;
+ hx.entries(j,2) = j + 3 ;
+ }
+ }
+
+ Kokkos::deep_copy( dx.entries , hx.entries );
+ Kokkos::deep_copy( mx.entries , dx.entries );
+
+ ASSERT_EQ( mx.row_map.dimension_0() , (size_t) LENGTH + 1 );
+
+ for ( size_t i = 0 ; i < LENGTH ; ++i ) {
+ const size_t entry_begin = mx.row_map[i];
+ const size_t entry_end = mx.row_map[i+1];
+ ASSERT_EQ( ( entry_end - entry_begin ) , sizes[i] );
+ for ( size_t j = entry_begin ; j < entry_end ; ++j ) {
+ ASSERT_EQ( (size_t) mx.entries( j , 0 ) , ( j + 1 ) );
+ ASSERT_EQ( (size_t) mx.entries( j , 1 ) , ( j + 2 ) );
+ ASSERT_EQ( (size_t) mx.entries( j , 2 ) , ( j + 3 ) );
+ }
+ }
+}
+
+} /* namespace TestStaticCrsGraph */
+
+
diff --git a/lib/kokkos/containers/unit_tests/TestThreads.cpp b/lib/kokkos/containers/unit_tests/TestThreads.cpp
new file mode 100755
index 000000000..9320a114f
--- /dev/null
+++ b/lib/kokkos/containers/unit_tests/TestThreads.cpp
@@ -0,0 +1,168 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <gtest/gtest.h>
+
+#include <Kokkos_Core.hpp>
+
+#if defined( KOKKOS_HAVE_PTHREAD )
+
+#include <Kokkos_Bitset.hpp>
+#include <Kokkos_UnorderedMap.hpp>
+
+#include <Kokkos_Vector.hpp>
+#include <iomanip>
+
+
+//----------------------------------------------------------------------------
+#include <TestBitset.hpp>
+#include <TestUnorderedMap.hpp>
+#include <TestStaticCrsGraph.hpp>
+
+#include <TestVector.hpp>
+#include <TestDualView.hpp>
+#include <TestSegmentedView.hpp>
+
+namespace Test {
+
+class threads : public ::testing::Test {
+protected:
+ static void SetUpTestCase()
+ {
+ std::cout << std::setprecision(5) << std::scientific;
+
+ unsigned num_threads = 4;
+
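+    // When hwloc is available, use one thread per physical core (NUMA count x cores per NUMA); the threads-per-core factor is left commented out.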
+ if (Kokkos::hwloc::available()) {
+ num_threads = Kokkos::hwloc::get_available_numa_count()
+ * Kokkos::hwloc::get_available_cores_per_numa()
+ // * Kokkos::hwloc::get_available_threads_per_core()
+ ;
+
+ }
+
+ std::cout << "Threads: " << num_threads << std::endl;
+
+ Kokkos::Threads::initialize( num_threads );
+ }
+
+ static void TearDownTestCase()
+ {
+ Kokkos::Threads::finalize();
+ }
+};
+
+TEST_F( threads , staticcrsgraph )
+{
+ TestStaticCrsGraph::run_test_graph< Kokkos::Threads >();
+ TestStaticCrsGraph::run_test_graph2< Kokkos::Threads >();
+}
+
+/*TEST_F( threads, bitset )
+{
+ test_bitset<Kokkos::Threads>();
+}*/
+
+#define THREADS_INSERT_TEST( name, num_nodes, num_inserts, num_duplicates, repeat, near ) \
+ TEST_F( threads, UnorderedMap_insert_##name##_##num_nodes##_##num_inserts##_##num_duplicates##_##repeat##x) { \
+ for (int i=0; i<repeat; ++i) \
+ test_insert<Kokkos::Threads>(num_nodes,num_inserts,num_duplicates, near); \
+ }
+
+#define THREADS_FAILED_INSERT_TEST( num_nodes, repeat ) \
+ TEST_F( threads, UnorderedMap_failed_insert_##num_nodes##_##repeat##x) { \
+ for (int i=0; i<repeat; ++i) \
+ test_failed_insert<Kokkos::Threads>(num_nodes); \
+ }
+
+#define THREADS_ASSIGNEMENT_TEST( num_nodes, repeat ) \
+ TEST_F( threads, UnorderedMap_assignment_operators_##num_nodes##_##repeat##x) { \
+ for (int i=0; i<repeat; ++i) \
+ test_assignement_operators<Kokkos::Threads>(num_nodes); \
+ }
+
+#define THREADS_DEEP_COPY( num_nodes, repeat ) \
+ TEST_F( threads, UnorderedMap_deep_copy##num_nodes##_##repeat##x) { \
+ for (int i=0; i<repeat; ++i) \
+ test_deep_copy<Kokkos::Threads>(num_nodes); \
+ }
+
+#define THREADS_VECTOR_COMBINE_TEST( size ) \
+ TEST_F( threads, vector_combination##size##x) { \
+ test_vector_combinations<int,Kokkos::Threads>(size); \
+ }
+
+#define THREADS_DUALVIEW_COMBINE_TEST( size ) \
+ TEST_F( threads, dualview_combination##size##x) { \
+ test_dualview_combinations<int,Kokkos::Threads>(size); \
+ }
+
+#define THREADS_SEGMENTEDVIEW_TEST( size ) \
+ TEST_F( threads, segmentedview_##size##x) { \
+ test_segmented_view<double,Kokkos::Threads>(size); \
+ }
+
+
+THREADS_INSERT_TEST(far, 100000, 90000, 100, 500, false)
+THREADS_FAILED_INSERT_TEST( 10000, 1000 )
+THREADS_DEEP_COPY( 10000, 1 )
+
+THREADS_VECTOR_COMBINE_TEST( 10 )
+THREADS_VECTOR_COMBINE_TEST( 3057 )
+THREADS_DUALVIEW_COMBINE_TEST( 10 )
+THREADS_SEGMENTEDVIEW_TEST( 10000 )
+
+
+#undef THREADS_INSERT_TEST
+#undef THREADS_FAILED_INSERT_TEST
+#undef THREADS_ASSIGNEMENT_TEST
+#undef THREADS_DEEP_COPY
+#undef THREADS_VECTOR_COMBINE_TEST
+#undef THREADS_DUALVIEW_COMBINE_TEST
+#undef THREADS_SEGMENTEDVIEW_TEST
+
+} // namespace Test
+
+
+#endif /* #if defined( KOKKOS_HAVE_PTHREAD ) */
+
diff --git a/lib/kokkos/containers/unit_tests/TestUnorderedMap.hpp b/lib/kokkos/containers/unit_tests/TestUnorderedMap.hpp
new file mode 100755
index 000000000..ff0328548
--- /dev/null
+++ b/lib/kokkos/containers/unit_tests/TestUnorderedMap.hpp
@@ -0,0 +1,313 @@
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+
+#ifndef KOKKOS_TEST_UNORDERED_MAP_HPP
+#define KOKKOS_TEST_UNORDERED_MAP_HPP
+
+#include <gtest/gtest.h>
+#include <iostream>
+
+
+namespace Test {
+
+namespace Impl {
+
+template <typename MapType, bool Near = false>
+struct TestInsert
+{
+ typedef MapType map_type;
+ typedef typename map_type::execution_space execution_space;
+ typedef uint32_t value_type;
+
+ map_type map;
+ uint32_t inserts;
+ uint32_t collisions;
+
+ TestInsert( map_type arg_map, uint32_t arg_inserts, uint32_t arg_collisions)
+ : map(arg_map)
+ , inserts(arg_inserts)
+ , collisions(arg_collisions)
+ {}
+
+ void testit( bool rehash_on_fail = true )
+ {
+ execution_space::fence();
+
+ uint32_t failed_count = 0;
+ do {
+ failed_count = 0;
+ Kokkos::parallel_reduce(inserts, *this, failed_count);
+
+ if (rehash_on_fail && failed_count > 0u) {
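+        // Grow capacity by roughly 15% plus an estimate of the distinct keys that failed, then retry the inserts.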
+ const uint32_t new_capacity = map.capacity() + ((map.capacity()*3ull)/20u) + failed_count/collisions ;
+ map.rehash( new_capacity );
+ }
+ } while (rehash_on_fail && failed_count > 0u);
+
+ execution_space::fence();
+ }
+
+
+ KOKKOS_INLINE_FUNCTION
+ void init( value_type & failed_count ) const { failed_count = 0; }
+
+ KOKKOS_INLINE_FUNCTION
+ void join( volatile value_type & failed_count, const volatile value_type & count ) const
+ { failed_count += count; }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()(uint32_t i, value_type & failed_count) const
+ {
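+    // "Near" assigns consecutive indices the same key (i/collisions); otherwise repeats of a key are strided across the iteration range.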
+ const uint32_t key = Near ? i/collisions : i%(inserts/collisions);
+ if (map.insert(key,i).failed()) ++failed_count;
+ }
+
+};
+
+ template <typename MapType, bool Near>
+ struct TestErase
+ {
+ typedef TestErase<MapType, Near> self_type;
+
+ typedef MapType map_type;
+ typedef typename MapType::execution_space execution_space;
+
+ map_type m_map;
+ uint32_t m_num_erase;
+ uint32_t m_num_duplicates;
+
+ TestErase(map_type map, uint32_t num_erases, uint32_t num_duplicates)
+ : m_map(map)
+ , m_num_erase(num_erases)
+ , m_num_duplicates(num_duplicates)
+ {}
+
+ void testit()
+ {
+ execution_space::fence();
+ Kokkos::parallel_for(m_num_erase, *this);
+ execution_space::fence();
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()(typename execution_space::size_type i) const
+ {
+ if (Near) {
+ m_map.erase(i/m_num_duplicates);
+ }
+ else {
+ m_map.erase(i%(m_num_erase/m_num_duplicates));
+ }
+
+ }
+ };
+
+ template <typename MapType>
+ struct TestFind
+ {
+ typedef MapType map_type;
+ typedef typename MapType::execution_space::execution_space execution_space;
+ typedef uint32_t value_type;
+
+ map_type m_map;
+ uint32_t m_num_insert;
+ uint32_t m_num_duplicates;
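+    // Number of distinct keys expected to be present: ceil(num_inserts / num_duplicates).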
+ uint32_t m_max_key;
+
+ TestFind(map_type map, uint32_t num_inserts, uint32_t num_duplicates)
+ : m_map(map)
+ , m_num_insert(num_inserts)
+ , m_num_duplicates(num_duplicates)
+ , m_max_key( ((num_inserts + num_duplicates) - 1)/num_duplicates )
+ {}
+
+ void testit(value_type &errors)
+ {
+ execution_space::execution_space::fence();
+ Kokkos::parallel_reduce(m_map.capacity(), *this, errors);
+ execution_space::execution_space::fence();
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ static void init( value_type & dst)
+ {
+ dst = 0;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ static void join( volatile value_type & dst, const volatile value_type & src)
+ { dst += src; }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()(typename execution_space::size_type i, value_type & errors) const
+ {
+ const bool expect_to_find_i = (i < m_max_key);
+
+ const bool exists = m_map.exists(i);
+
+ if (expect_to_find_i && !exists) ++errors;
+ if (!expect_to_find_i && exists) ++errors;
+ }
+ };
+
+} // namespace Impl
+
+
+
+template <typename Device>
+void test_insert( uint32_t num_nodes , uint32_t num_inserts , uint32_t num_duplicates , bool near )
+{
+ typedef Kokkos::UnorderedMap<uint32_t,uint32_t, Device> map_type;
+ typedef Kokkos::UnorderedMap<const uint32_t,const uint32_t, Device> const_map_type;
+
+ const uint32_t expected_inserts = (num_inserts + num_duplicates -1u) / num_duplicates;
+
+ map_type map;
+ map.rehash(num_nodes,false);
+
+ if (near) {
+ Impl::TestInsert<map_type,true> test_insert(map, num_inserts, num_duplicates);
+ test_insert.testit();
+ } else
+ {
+ Impl::TestInsert<map_type,false> test_insert(map, num_inserts, num_duplicates);
+ test_insert.testit();
+ }
+
+ const bool print_list = false;
+ if (print_list) {
+ Kokkos::Impl::UnorderedMapPrint<map_type> f(map);
+ f.apply();
+ }
+
+ const uint32_t map_size = map.size();
+
+ ASSERT_FALSE( map.failed_insert());
+ {
+ EXPECT_EQ(expected_inserts, map_size);
+
+ {
+ uint32_t find_errors = 0;
+ Impl::TestFind<const_map_type> test_find(map, num_inserts, num_duplicates);
+ test_find.testit(find_errors);
+ EXPECT_EQ( 0u, find_errors);
+ }
+
+ map.begin_erase();
+ Impl::TestErase<map_type,false> test_erase(map, num_inserts, num_duplicates);
+ test_erase.testit();
+ map.end_erase();
+ EXPECT_EQ(0u, map.size());
+ }
+}
+
+template <typename Device>
+void test_failed_insert( uint32_t num_nodes)
+{
+ typedef Kokkos::UnorderedMap<uint32_t,uint32_t, Device> map_type;
+
+ map_type map(num_nodes);
+ Impl::TestInsert<map_type> test_insert(map, 2u*num_nodes, 1u);
+ test_insert.testit(false /*don't rehash on fail*/);
+ Device::execution_space::fence();
+
+ EXPECT_TRUE( map.failed_insert() );
+}
+
+
+
+template <typename Device>
+void test_deep_copy( uint32_t num_nodes )
+{
+ typedef Kokkos::UnorderedMap<uint32_t,uint32_t, Device> map_type;
+ typedef Kokkos::UnorderedMap<const uint32_t, const uint32_t, Device> const_map_type;
+
+ typedef typename map_type::HostMirror host_map_type ;
+ // typedef Kokkos::UnorderedMap<uint32_t, uint32_t, typename Device::host_mirror_execution_space > host_map_type;
+
+ map_type map;
+ map.rehash(num_nodes,false);
+
+ {
+ Impl::TestInsert<map_type> test_insert(map, num_nodes, 1);
+ test_insert.testit();
+ ASSERT_EQ( map.size(), num_nodes);
+ ASSERT_FALSE( map.failed_insert() );
+ {
+ uint32_t find_errors = 0;
+ Impl::TestFind<map_type> test_find(map, num_nodes, 1);
+ test_find.testit(find_errors);
+ EXPECT_EQ( find_errors, 0u);
+ }
+
+ }
+
+ host_map_type hmap;
+ Kokkos::deep_copy(hmap, map);
+
+ ASSERT_EQ( map.size(), hmap.size());
+ ASSERT_EQ( map.capacity(), hmap.capacity());
+ {
+ uint32_t find_errors = 0;
+ Impl::TestFind<host_map_type> test_find(hmap, num_nodes, 1);
+ test_find.testit(find_errors);
+ EXPECT_EQ( find_errors, 0u);
+ }
+
+ map_type mmap;
+ Kokkos::deep_copy(mmap, hmap);
+
+ const_map_type cmap = mmap;
+
+ EXPECT_EQ( cmap.size(), num_nodes);
+
+ {
+ uint32_t find_errors = 0;
+ Impl::TestFind<const_map_type> test_find(cmap, num_nodes, 1);
+ test_find.testit(find_errors);
+ EXPECT_EQ( find_errors, 0u);
+ }
+
+}
+
+} // namespace Test
+
+#endif //KOKKOS_TEST_UNORDERED_MAP_HPP
diff --git a/lib/kokkos/core/src/impl/Kokkos_PhysicalLayout.hpp b/lib/kokkos/containers/unit_tests/TestVector.hpp
similarity index 52%
copy from lib/kokkos/core/src/impl/Kokkos_PhysicalLayout.hpp
copy to lib/kokkos/containers/unit_tests/TestVector.hpp
index 0dcb3977a..f9f456489 100755
--- a/lib/kokkos/core/src/impl/Kokkos_PhysicalLayout.hpp
+++ b/lib/kokkos/containers/unit_tests/TestVector.hpp
@@ -1,84 +1,131 @@
-/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
-*/
-#ifndef KOKKOS_PHYSICAL_LAYOUT_HPP
-#define KOKKOS_PHYSICAL_LAYOUT_HPP
+#ifndef KOKKOS_TEST_VECTOR_HPP
+#define KOKKOS_TEST_VECTOR_HPP
+#include <gtest/gtest.h>
+#include <iostream>
+#include <cstdlib>
+#include <cstdio>
+#include <impl/Kokkos_Timer.hpp>
+
+namespace Test {
-#include <Kokkos_View.hpp>
-namespace Kokkos {
namespace Impl {
+ template <typename Scalar, class Device>
+ struct test_vector_combinations
+ {
+ typedef test_vector_combinations<Scalar,Device> self_type;
+ typedef Scalar scalar_type;
+ typedef Device execution_space;
-struct PhysicalLayout {
- enum LayoutType {Left,Right,Scalar,Error};
- LayoutType layout_type;
- int rank;
- long long int stride[8]; //distance between two neighboring elements in a given dimension
+ Scalar reference;
+ Scalar result;
- template< class T , class L , class D , class M >
- PhysicalLayout( const View<T,L,D,M,ViewDefault> & view )
- : layout_type( is_same< typename View<T,L,D,M>::array_layout , LayoutLeft >::value ? Left : (
- is_same< typename View<T,L,D,M>::array_layout , LayoutRight >::value ? Right : Error ))
- , rank( view.Rank )
- {
- for(int i=0;i<8;i++) stride[i] = 0;
- view.stride( stride );
+ template <typename Vector>
+ Scalar run_me(unsigned int n){
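+    // Exercise push_back, resize, assign and reserve, then fold the contents into one checksum so std::vector and Kokkos::vector results can be compared.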
+ Vector a(n,1);
+
+
+ a.push_back(2);
+ a.resize(n+4);
+ a[n+1] = 3;
+ a[n+2] = 4;
+ a[n+3] = 5;
+
+
+ Scalar temp1 = a[2];
+ Scalar temp2 = a[n];
+ Scalar temp3 = a[n+1];
+
+ a.assign(n+2,-1);
+
+ a[2] = temp1;
+ a[n] = temp2;
+ a[n+1] = temp3;
+
+ Scalar test1 = 0;
+ for(unsigned int i=0; i<a.size(); i++)
+ test1+=a[i];
+
+ a.assign(n+1,-2);
+ Scalar test2 = 0;
+ for(unsigned int i=0; i<a.size(); i++)
+ test2+=a[i];
+
+ a.reserve(n+10);
+
+ Scalar test3 = 0;
+ for(unsigned int i=0; i<a.size(); i++)
+ test3+=a[i];
+
+
+ return (test1*test2+test3)*test2+test1*test3;
}
- #ifdef KOKKOS_HAVE_CUDA
- template< class T , class L , class D , class M >
- PhysicalLayout( const View<T,L,D,M,ViewCudaTexture> & view )
- : layout_type( is_same< typename View<T,L,D,M>::array_layout , LayoutLeft >::value ? Left : (
- is_same< typename View<T,L,D,M>::array_layout , LayoutRight >::value ? Right : Error ))
- , rank( view.Rank )
+
+
+ test_vector_combinations(unsigned int size)
{
- for(int i=0;i<8;i++) stride[i] = 0;
- view.stride( stride );
+ reference = run_me<std::vector<Scalar> >(size);
+ result = run_me<Kokkos::vector<Scalar,Device> >(size);
}
- #endif
-};
+ };
+
+} // namespace Impl
+
+
+
+
+template <typename Scalar, typename Device>
+void test_vector_combinations(unsigned int size)
+{
+ Impl::test_vector_combinations<Scalar,Device> test(size);
+ ASSERT_EQ( test.reference, test.result);
}
-}
-#endif
+
+
+} // namespace Test
+
+#endif //KOKKOS_TEST_VECTOR_HPP
diff --git a/lib/kokkos/core/src/impl/Kokkos_spinwait.hpp b/lib/kokkos/containers/unit_tests/UnitTestMain.cpp
similarity index 74%
copy from lib/kokkos/core/src/impl/Kokkos_spinwait.hpp
copy to lib/kokkos/containers/unit_tests/UnitTestMain.cpp
index 966291abd..f952ab3db 100755
--- a/lib/kokkos/core/src/impl/Kokkos_spinwait.hpp
+++ b/lib/kokkos/containers/unit_tests/UnitTestMain.cpp
@@ -1,64 +1,50 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
+#include <gtest/gtest.h>
-#ifndef KOKKOS_SPINWAIT_HPP
-#define KOKKOS_SPINWAIT_HPP
-
-#include <Kokkos_Macros.hpp>
-
-namespace Kokkos {
-namespace Impl {
-
-#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
-void spinwait( volatile int & flag , const int value );
-#else
-KOKKOS_INLINE_FUNCTION
-void spinwait( volatile int & , const int ) {}
-#endif
-
-} /* namespace Impl */
-} /* namespace Kokkos */
-
-#endif /* #ifndef KOKKOS_SPINWAIT_HPP */
+int main(int argc, char *argv[]) {
+ ::testing::InitGoogleTest(&argc,argv);
+ return RUN_ALL_TESTS();
+}
diff --git a/lib/kokkos/core/perf_test/Makefile b/lib/kokkos/core/perf_test/Makefile
new file mode 100755
index 000000000..2bf189a22
--- /dev/null
+++ b/lib/kokkos/core/perf_test/Makefile
@@ -0,0 +1,66 @@
+KOKKOS_PATH = ../..
+
+GTEST_PATH = ../../TPL/gtest
+
+vpath %.cpp ${KOKKOS_PATH}/core/perf_test
+
+default: build_all
+ echo "End Build"
+
+
+include $(KOKKOS_PATH)/Makefile.kokkos
+
+ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
+ CXX = nvcc_wrapper
+ CXXFLAGS ?= -O3
+ LINK = $(CXX)
+ LDFLAGS ?= -lpthread
+else
+ CXX ?= g++
+ CXXFLAGS ?= -O3
+ LINK ?= $(CXX)
+ LDFLAGS ?= -lpthread
+endif
+
+KOKKOS_CXXFLAGS += -I$(GTEST_PATH) -I${KOKKOS_PATH}/core/perf_test
+
+TEST_TARGETS =
+TARGETS =
+
+OBJ_PERF = PerfTestHost.o PerfTestCuda.o PerfTestMain.o gtest-all.o
+TARGETS += KokkosCore_PerformanceTest
+TEST_TARGETS += test-performance
+
+OBJ_ATOMICS = test_atomic.o
+TARGETS += KokkosCore_PerformanceTest_Atomics
+TEST_TARGETS += test-atomic
+
+
+KokkosCore_PerformanceTest: $(OBJ_PERF) $(KOKKOS_LINK_DEPENDS)
+ $(LINK) $(KOKKOS_LDFLAGS) $(LDFLAGS) $(EXTRA_PATH) $(OBJ_PERF) $(KOKKOS_LIBS) $(LIB) -o KokkosCore_PerformanceTest
+
+KokkosCore_PerformanceTest_Atomics: $(OBJ_ATOMICS) $(KOKKOS_LINK_DEPENDS)
+ $(LINK) $(KOKKOS_LDFLAGS) $(LDFLAGS) $(EXTRA_PATH) $(OBJ_ATOMICS) $(KOKKOS_LIBS) $(LIB) -o KokkosCore_PerformanceTest_Atomics
+
+test-performance: KokkosCore_PerformanceTest
+ ./KokkosCore_PerformanceTest
+
+test-atomic: KokkosCore_PerformanceTest_Atomics
+ ./KokkosCore_PerformanceTest_Atomics
+
+
+build_all: $(TARGETS)
+
+test: $(TEST_TARGETS)
+
+clean: kokkos-clean
+ rm -f *.o $(TARGETS)
+
+# Compilation rules
+
+%.o:%.cpp $(KOKKOS_CPP_DEPENDS)
+ $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) $(EXTRA_INC) -c $<
+
+gtest-all.o:$(GTEST_PATH)/gtest/gtest-all.cc
+ $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) $(EXTRA_INC) -c $(GTEST_PATH)/gtest/gtest-all.cc
+
diff --git a/lib/kokkos/core/perf_test/PerfTestBlasKernels.hpp b/lib/kokkos/core/perf_test/PerfTestBlasKernels.hpp
new file mode 100755
index 000000000..aa4046cbf
--- /dev/null
+++ b/lib/kokkos/core/perf_test/PerfTestBlasKernels.hpp
@@ -0,0 +1,309 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#ifndef KOKKOS_BLAS_KERNELS_HPP
+#define KOKKOS_BLAS_KERNELS_HPP
+
+namespace Kokkos {
+
+template< class ConstVectorType ,
+ class Device = typename ConstVectorType::execution_space >
+struct Dot ;
+
+template< class ConstVectorType ,
+ class Device = typename ConstVectorType::execution_space >
+struct DotSingle ;
+
+template< class ConstScalarType ,
+ class VectorType ,
+ class Device = typename VectorType::execution_space >
+struct Scale ;
+
+template< class ConstScalarType ,
+ class ConstVectorType ,
+ class VectorType ,
+ class Device = typename VectorType::execution_space >
+struct AXPBY ;
+
+/** \brief Y = alpha * X + beta * Y */
+template< class ConstScalarType ,
+ class ConstVectorType ,
+ class VectorType >
+void axpby( const ConstScalarType & alpha ,
+ const ConstVectorType & X ,
+ const ConstScalarType & beta ,
+ const VectorType & Y )
+{
+ typedef AXPBY< ConstScalarType , ConstVectorType , VectorType > functor ;
+
+ parallel_for( Y.dimension_0() , functor( alpha , X , beta , Y ) );
+}
+
+/** \brief Y *= alpha */
+template< class ConstScalarType ,
+ class VectorType >
+void scale( const ConstScalarType & alpha , const VectorType & Y )
+{
+ typedef Scale< ConstScalarType , VectorType > functor ;
+
+ parallel_for( Y.dimension_0() , functor( alpha , Y ) );
+}
+
+template< class ConstVectorType ,
+ class Finalize >
+void dot( const ConstVectorType & X ,
+ const ConstVectorType & Y ,
+ const Finalize & finalize )
+{
+ typedef Dot< ConstVectorType > functor ;
+
+ parallel_reduce( X.dimension_0() , functor( X , Y ) , finalize );
+}
+
+template< class ConstVectorType ,
+ class Finalize >
+void dot( const ConstVectorType & X ,
+ const Finalize & finalize )
+{
+ typedef DotSingle< ConstVectorType > functor ;
+
+ parallel_reduce( X.dimension_0() , functor( X ) , finalize );
+}
+
+} /* namespace Kokkos */
+
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+namespace Kokkos {
+
+template< class Type , class Device >
+struct Dot
+{
+ typedef typename Device::execution_space execution_space ;
+
+ typedef typename
+ Impl::StaticAssertSame< Impl::unsigned_< 1 > ,
+ Impl::unsigned_< Type::Rank > >::type ok_rank ;
+
+
+/* typedef typename
+ Impl::StaticAssertSame< execution_space ,
+ typename Type::execution_space >::type ok_device ;*/
+
+ typedef double value_type ;
+
+#if 1
+ typename Type::const_type X ;
+ typename Type::const_type Y ;
+#else
+ Type X ;
+ Type Y ;
+#endif
+
+ Dot( const Type & arg_x , const Type & arg_y )
+ : X(arg_x) , Y(arg_y) { }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( int i , value_type & update ) const
+ { update += X[i] * Y[i]; }
+
+ KOKKOS_INLINE_FUNCTION
+ static void join( volatile value_type & update ,
+ const volatile value_type & source )
+ { update += source; }
+
+ KOKKOS_INLINE_FUNCTION
+ static void init( value_type & update )
+ { update = 0 ; }
+};
+
+template< class Type , class Device >
+struct DotSingle
+{
+ typedef typename Device::execution_space execution_space ;
+
+ typedef typename
+ Impl::StaticAssertSame< Impl::unsigned_< 1 > ,
+ Impl::unsigned_< Type::Rank > >::type ok_rank ;
+
+/* typedef typename
+ Impl::StaticAssertSame< execution_space ,
+ typename Type::execution_space >::type ok_device ;*/
+
+ typedef double value_type ;
+
+#if 1
+ typename Type::const_type X ;
+#else
+ Type X ;
+#endif
+
+ DotSingle( const Type & arg_x ) : X(arg_x) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( int i , value_type & update ) const
+ {
+ const typename Type::value_type & x = X[i]; update += x * x ;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ static void join( volatile value_type & update ,
+ const volatile value_type & source )
+ { update += source; }
+
+ KOKKOS_INLINE_FUNCTION
+ static void init( value_type & update )
+ { update = 0 ; }
+};
+
+
+template< class ScalarType , class VectorType , class Device>
+struct Scale
+{
+ typedef typename Device::execution_space execution_space ;
+
+/* typedef typename
+ Impl::StaticAssertSame< execution_space ,
+ typename ScalarType::execution_space >::type
+ ok_scalar_device ;
+
+ typedef typename
+ Impl::StaticAssertSame< execution_space ,
+ typename VectorType::execution_space >::type
+ ok_vector_device ;*/
+
+ typedef typename
+ Impl::StaticAssertSame< Impl::unsigned_< 0 > ,
+ Impl::unsigned_< ScalarType::Rank > >::type
+ ok_scalar_rank ;
+
+ typedef typename
+ Impl::StaticAssertSame< Impl::unsigned_< 1 > ,
+ Impl::unsigned_< VectorType::Rank > >::type
+ ok_vector_rank ;
+
+#if 1
+ typename ScalarType::const_type alpha ;
+#else
+ ScalarType alpha ;
+#endif
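+  // alpha is a rank-0 View, so the scale factor lives in device-accessible memory and is read via alpha().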
+
+ VectorType Y ;
+
+ Scale( const ScalarType & arg_alpha , const VectorType & arg_Y )
+ : alpha( arg_alpha ), Y( arg_Y ) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( int i ) const
+ {
+ Y[i] *= alpha() ;
+ }
+};
+
+
+template< class ScalarType ,
+ class ConstVectorType ,
+ class VectorType,
+ class Device>
+struct AXPBY
+{
+ typedef typename Device::execution_space execution_space ;
+
+/* typedef typename
+ Impl::StaticAssertSame< execution_space ,
+ typename ScalarType::execution_space >::type
+ ok_scalar_device ;
+
+ typedef typename
+ Impl::StaticAssertSame< execution_space ,
+ typename ConstVectorType::execution_space >::type
+ ok_const_vector_device ;
+
+ typedef typename
+ Impl::StaticAssertSame< execution_space ,
+ typename VectorType::execution_space >::type
+ ok_vector_device ;*/
+
+ typedef typename
+ Impl::StaticAssertSame< Impl::unsigned_< 0 > ,
+ Impl::unsigned_< ScalarType::Rank > >::type
+ ok_scalar_rank ;
+
+ typedef typename
+ Impl::StaticAssertSame< Impl::unsigned_< 1 > ,
+ Impl::unsigned_< ConstVectorType::Rank > >::type
+ ok_const_vector_rank ;
+
+ typedef typename
+ Impl::StaticAssertSame< Impl::unsigned_< 1 > ,
+ Impl::unsigned_< VectorType::Rank > >::type
+ ok_vector_rank ;
+
+#if 1
+ typename ScalarType::const_type alpha , beta ;
+ typename ConstVectorType::const_type X ;
+#else
+ ScalarType alpha , beta ;
+ ConstVectorType X ;
+#endif
+
+ VectorType Y ;
+
+ AXPBY( const ScalarType & arg_alpha ,
+ const ConstVectorType & arg_X ,
+ const ScalarType & arg_beta ,
+ const VectorType & arg_Y )
+ : alpha( arg_alpha ), beta( arg_beta ), X( arg_X ), Y( arg_Y ) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( int i ) const
+ {
+ Y[i] = alpha() * X[i] + beta() * Y[i] ;
+ }
+};
+
+} /* namespace Kokkos */
+
+#endif /* #ifndef KOKKOS_BLAS_KERNELS_HPP */
diff --git a/lib/kokkos/core/perf_test/PerfTestCuda.cpp b/lib/kokkos/core/perf_test/PerfTestCuda.cpp
new file mode 100755
index 000000000..28e654bb7
--- /dev/null
+++ b/lib/kokkos/core/perf_test/PerfTestCuda.cpp
@@ -0,0 +1,189 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <iostream>
+#include <iomanip>
+#include <algorithm>
+#include <gtest/gtest.h>
+
+#include <Kokkos_Core.hpp>
+
+#if defined( KOKKOS_HAVE_CUDA )
+
+#include <impl/Kokkos_Timer.hpp>
+
+#include <PerfTestHexGrad.hpp>
+#include <PerfTestBlasKernels.hpp>
+#include <PerfTestGramSchmidt.hpp>
+#include <PerfTestDriver.hpp>
+
+
+namespace Test {
+
+class cuda : public ::testing::Test {
+ protected:
+ static void SetUpTestCase() {
+ Kokkos::HostSpace::execution_space::initialize();
+ Kokkos::Cuda::initialize( Kokkos::Cuda::SelectDevice(0) );
+ }
+ static void TearDownTestCase() {
+ Kokkos::Cuda::finalize();
+ Kokkos::HostSpace::execution_space::finalize();
+ }
+};
+
+TEST_F( cuda, hexgrad )
+{
+ EXPECT_NO_THROW( run_test_hexgrad< Kokkos::Cuda >( 10 , 20, "Kokkos::Cuda" ) );
+}
+
+TEST_F( cuda, gramschmidt )
+{
+ EXPECT_NO_THROW( run_test_gramschmidt< Kokkos::Cuda >( 10 , 20, "Kokkos::Cuda" ) );
+}
+
+namespace {
+
+template <typename T>
+struct TextureFetch
+{
+ typedef Kokkos::View< T *, Kokkos::CudaSpace> array_type;
+ typedef Kokkos::View< const T *, Kokkos::CudaSpace, Kokkos::MemoryRandomAccess> const_array_type;
+ typedef Kokkos::View< int *, Kokkos::CudaSpace> index_array_type;
+ typedef Kokkos::View< const int *, Kokkos::CudaSpace> const_index_array_type;
+
+ struct FillArray
+ {
+ array_type m_array;
+ FillArray( const array_type & array )
+ : m_array(array)
+ {}
+
+ void apply() const
+ {
+ Kokkos::parallel_for( Kokkos::RangePolicy<Kokkos::Cuda,int>(0,m_array.size()), *this);
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()(int i) const { m_array(i) = i; }
+ };
+
+ struct RandomIndexes
+ {
+ index_array_type m_indexes;
+ typename index_array_type::HostMirror m_host_indexes;
+ RandomIndexes( const index_array_type & indexes)
+ : m_indexes(indexes)
+ , m_host_indexes(Kokkos::create_mirror(m_indexes))
+ {}
+
+ void apply() const
+ {
+ Kokkos::parallel_for( Kokkos::RangePolicy<Kokkos::HostSpace::execution_space,int>(0,m_host_indexes.size()), *this);
+ //random shuffle
+ Kokkos::HostSpace::execution_space::fence();
+ std::random_shuffle(m_host_indexes.ptr_on_device(), m_host_indexes.ptr_on_device() + m_host_indexes.size());
+ Kokkos::deep_copy(m_indexes,m_host_indexes);
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()(int i) const { m_host_indexes(i) = i; }
+ };
+
+ struct RandomReduce
+ {
+ const_array_type m_array;
+ const_index_array_type m_indexes;
+ RandomReduce( const const_array_type & array, const const_index_array_type & indexes)
+ : m_array(array)
+ , m_indexes(indexes)
+ {}
+
+ void apply(T & reduce) const
+ {
+ Kokkos::parallel_reduce( Kokkos::RangePolicy<Kokkos::Cuda,int>(0,m_array.size()), *this, reduce);
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()(int i, T & reduce) const
+ { reduce += m_array(m_indexes(i)); }
+ };
+
+ static void run(int size, double & reduce_time, T &reduce)
+ {
+ array_type array("array",size);
+ index_array_type indexes("indexes",size);
+
+ { FillArray f(array); f.apply(); }
+ { RandomIndexes f(indexes); f.apply(); }
+
+ Kokkos::Cuda::fence();
+
+ Kokkos::Impl::Timer timer;
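+    // Time ten gather-style reductions through the shuffled index array; the const MemoryRandomAccess view is intended to map to texture fetches on the device (hence the test name).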
+ for (int j=0; j<10; ++j) {
+ RandomReduce f(array,indexes);
+ f.apply(reduce);
+ }
+ Kokkos::Cuda::fence();
+ reduce_time = timer.seconds();
+ }
+};
+
+} // unnamed namespace
+
+TEST_F( cuda, texture_double )
+{
+ printf("Random reduce of double through texture fetch\n");
+ for (int i=1; i<=27; ++i) {
+ int size = 1<<i;
+ double time = 0;
+ double reduce = 0;
+ TextureFetch<double>::run(size,time,reduce);
+ printf(" time = %1.3e size = 2^%d\n", time, i);
+ }
+}
+
+} // namespace Test
+
+#endif /* #if defined( KOKKOS_HAVE_CUDA ) */
+
diff --git a/lib/kokkos/core/perf_test/PerfTestDriver.hpp b/lib/kokkos/core/perf_test/PerfTestDriver.hpp
new file mode 100755
index 000000000..e3dd3b412
--- /dev/null
+++ b/lib/kokkos/core/perf_test/PerfTestDriver.hpp
@@ -0,0 +1,152 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <iostream>
+#include <string>
+
+// mfh 06 Jun 2013: This macro doesn't work like one might think it
+// should. It doesn't take the template parameter DeviceType and
+// print its actual type name; it just literally prints out
+// "DeviceType". I've worked around this below without using the
+// macro, so I'm commenting out the macro to avoid compiler complaints
+// about an unused macro.
+
+// #define KOKKOS_MACRO_IMPL_TO_STRING( X ) #X
+// #define KOKKOS_MACRO_TO_STRING( X ) KOKKOS_MACRO_IMPL_TO_STRING( X )
+
+//------------------------------------------------------------------------
+
+namespace Test {
+
+enum { NUMBER_OF_TRIALS = 5 };
+
+
+
+template< class DeviceType >
+void run_test_hexgrad( int exp_beg , int exp_end, const char deviceTypeName[] )
+{
+ std::string label_hexgrad ;
+ label_hexgrad.append( "\"HexGrad< double , " );
+ // mfh 06 Jun 2013: This only appends "DeviceType" (literally) to
+ // the string, not the actual name of the device type. Thus, I've
+ // modified the function to take the name of the device type.
+ //
+ //label_hexgrad.append( KOKKOS_MACRO_TO_STRING( DeviceType ) );
+ label_hexgrad.append( deviceTypeName );
+ label_hexgrad.append( " >\"" );
+
+ for (int i = exp_beg ; i < exp_end ; ++i) {
+ double min_seconds = 0.0 ;
+ double max_seconds = 0.0 ;
+ double avg_seconds = 0.0 ;
+
+ const int parallel_work_length = 1<<i;
+
+ for ( int j = 0 ; j < NUMBER_OF_TRIALS ; ++j ) {
+ const double seconds = HexGrad< DeviceType >::test(parallel_work_length) ;
+
+ if ( 0 == j ) {
+ min_seconds = seconds ;
+ max_seconds = seconds ;
+ }
+ else {
+ if ( seconds < min_seconds ) min_seconds = seconds ;
+ if ( seconds > max_seconds ) max_seconds = seconds ;
+ }
+ avg_seconds += seconds ;
+ }
+ avg_seconds /= NUMBER_OF_TRIALS ;
+
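+    // Report the label, problem size, best (minimum) time, and best time per unit of work.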
+ std::cout << label_hexgrad
+ << " , " << parallel_work_length
+ << " , " << min_seconds
+ << " , " << ( min_seconds / parallel_work_length )
+ << std::endl ;
+ }
+}
+
+template< class DeviceType >
+void run_test_gramschmidt( int exp_beg , int exp_end, const char deviceTypeName[] )
+{
+ std::string label_gramschmidt ;
+ label_gramschmidt.append( "\"GramSchmidt< double , " );
+ // mfh 06 Jun 2013: This only appends "DeviceType" (literally) to
+ // the string, not the actual name of the device type. Thus, I've
+ // modified the function to take the name of the device type.
+ //
+ //label_gramschmidt.append( KOKKOS_MACRO_TO_STRING( DeviceType ) );
+ label_gramschmidt.append( deviceTypeName );
+ label_gramschmidt.append( " >\"" );
+
+ for (int i = exp_beg ; i < exp_end ; ++i) {
+ double min_seconds = 0.0 ;
+ double max_seconds = 0.0 ;
+ double avg_seconds = 0.0 ;
+
+ const int parallel_work_length = 1<<i;
+
+ for ( int j = 0 ; j < NUMBER_OF_TRIALS ; ++j ) {
+ const double seconds = ModifiedGramSchmidt< double , DeviceType >::test(parallel_work_length, 32 ) ;
+
+ if ( 0 == j ) {
+ min_seconds = seconds ;
+ max_seconds = seconds ;
+ }
+ else {
+ if ( seconds < min_seconds ) min_seconds = seconds ;
+ if ( seconds > max_seconds ) max_seconds = seconds ;
+ }
+ avg_seconds += seconds ;
+ }
+ avg_seconds /= NUMBER_OF_TRIALS ;
+
+ std::cout << label_gramschmidt
+ << " , " << parallel_work_length
+ << " , " << min_seconds
+ << " , " << ( min_seconds / parallel_work_length )
+ << std::endl ;
+ }
+}
+
+}
+
diff --git a/lib/kokkos/core/perf_test/PerfTestGramSchmidt.hpp b/lib/kokkos/core/perf_test/PerfTestGramSchmidt.hpp
new file mode 100755
index 000000000..292e09cc4
--- /dev/null
+++ b/lib/kokkos/core/perf_test/PerfTestGramSchmidt.hpp
@@ -0,0 +1,231 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <cmath>
+#include <PerfTestBlasKernels.hpp>
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+namespace Test {
+
+// Reduction : result = dot( Q(:,j) , Q(:,j) );
+// PostProcess : R(j,j) = sqrt( result ) ; inv = 1 / R(j,j) ;
+template< class VectorView , class ValueView >
+struct InvNorm2 : public Kokkos::DotSingle< VectorView > {
+
+ typedef typename Kokkos::DotSingle< VectorView >::value_type value_type ;
+
+ ValueView Rjj ;
+ ValueView inv ;
+
+ InvNorm2( const VectorView & argX ,
+ const ValueView & argR ,
+ const ValueView & argInv )
+ : Kokkos::DotSingle< VectorView >( argX )
+ , Rjj( argR )
+ , inv( argInv )
+ {}
+
+ KOKKOS_INLINE_FUNCTION
+ void final( value_type & result ) const
+ {
+ result = sqrt( result );
+ Rjj() = result ;
+ inv() = ( 0 < result ) ? 1.0 / result : 0 ;
+ }
+};
+
+template< class VectorView , class ValueView >
+inline
+void invnorm2( const VectorView & x ,
+ const ValueView & r ,
+ const ValueView & r_inv )
+{
+ Kokkos::parallel_reduce( x.dimension_0() , InvNorm2< VectorView , ValueView >( x , r , r_inv ) );
+}
+
+// PostProcess : tmp = - ( R(j,k) = result );
+template< class VectorView , class ValueView >
+struct DotM : public Kokkos::Dot< VectorView > {
+
+ typedef typename Kokkos::Dot< VectorView >::value_type value_type ;
+
+ ValueView Rjk ;
+ ValueView tmp ;
+
+ DotM( const VectorView & argX ,
+ const VectorView & argY ,
+ const ValueView & argR ,
+ const ValueView & argTmp )
+ : Kokkos::Dot< VectorView >( argX , argY )
+ , Rjk( argR )
+ , tmp( argTmp )
+ {}
+
+ KOKKOS_INLINE_FUNCTION
+ void final( value_type & result ) const
+ {
+ Rjk() = result ;
+ tmp() = - result ;
+ }
+};
+
+template< class VectorView , class ValueView >
+inline
+void dot_neg( const VectorView & x ,
+ const VectorView & y ,
+ const ValueView & r ,
+ const ValueView & r_neg )
+{
+ Kokkos::parallel_reduce( x.dimension_0() , DotM< VectorView , ValueView >( x , y , r , r_neg ) );
+}
+
+
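+// Column-by-column modified Gram-Schmidt: Q is overwritten with an orthonormal basis and R collects the projection coefficients, so that A = Q * R on exit.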
+template< typename Scalar , class DeviceType >
+struct ModifiedGramSchmidt
+{
+ typedef DeviceType execution_space ;
+ typedef typename execution_space::size_type size_type ;
+
+ typedef Kokkos::View< Scalar** ,
+ Kokkos::LayoutLeft ,
+ execution_space > multivector_type ;
+
+ typedef Kokkos::View< Scalar* ,
+ Kokkos::LayoutLeft ,
+ execution_space > vector_type ;
+
+ typedef Kokkos::View< Scalar ,
+ Kokkos::LayoutLeft ,
+ execution_space > value_view ;
+
+
+ multivector_type Q ;
+ multivector_type R ;
+
+ static double factorization( const multivector_type Q_ ,
+ const multivector_type R_ )
+ {
+#if defined( KOKKOS_USING_EXPERIMENTAL_VIEW )
+ using Kokkos::Experimental::ALL ;
+#else
+ const Kokkos::ALL ALL ;
+#endif
+ const size_type count = Q_.dimension_1();
+ value_view tmp("tmp");
+ value_view one("one");
+
+ Kokkos::deep_copy( one , (Scalar) 1 );
+
+ Kokkos::Impl::Timer timer ;
+
+ for ( size_type j = 0 ; j < count ; ++j ) {
+ // Reduction : tmp = dot( Q(:,j) , Q(:,j) );
+ // PostProcess : tmp = sqrt( tmp ); R(j,j) = tmp ; tmp = 1 / tmp ;
+ const vector_type Qj = Kokkos::subview( Q_ , ALL , j );
+ const value_view Rjj = Kokkos::subview( R_ , j , j );
+
+ invnorm2( Qj , Rjj , tmp );
+
+ // Q(:,j) *= ( 1 / R(j,j) ); => Q(:,j) *= tmp ;
+ Kokkos::scale( tmp , Qj );
+
+ for ( size_t k = j + 1 ; k < count ; ++k ) {
+ const vector_type Qk = Kokkos::subview( Q_ , ALL , k );
+ const value_view Rjk = Kokkos::subview( R_ , j , k );
+
+ // Reduction : R(j,k) = dot( Q(:,j) , Q(:,k) );
+ // PostProcess : tmp = - R(j,k);
+ dot_neg( Qj , Qk , Rjk , tmp );
+
+ // Q(:,k) -= R(j,k) * Q(:,j); => Q(:,k) += tmp * Q(:,j)
+ Kokkos::axpby( tmp , Qj , one , Qk );
+ }
+ }
+
+ execution_space::fence();
+
+ return timer.seconds();
+ }
+
+ //--------------------------------------------------------------------------
+
+ static double test( const size_t length ,
+ const size_t count ,
+ const size_t iter = 1 )
+ {
+ multivector_type Q_( "Q" , length , count );
+ multivector_type R_( "R" , count , count );
+
+ typename multivector_type::HostMirror A =
+ Kokkos::create_mirror( Q_ );
+
+ // Create and fill A on the host
+
+ for ( size_type j = 0 ; j < count ; ++j ) {
+ for ( size_type i = 0 ; i < length ; ++i ) {
+ A(i,j) = ( i + 1 ) * ( j + 1 );
+ }
+ }
+
+ double dt_min = 0 ;
+
+ for ( size_t i = 0 ; i < iter ; ++i ) {
+
+ Kokkos::deep_copy( Q_ , A );
+
+ // A = Q * R
+
+ const double dt = factorization( Q_ , R_ );
+
+ if ( 0 == i ) dt_min = dt ;
+ else dt_min = dt < dt_min ? dt : dt_min ;
+ }
+
+ return dt_min ;
+ }
+};
+
+}
+
diff --git a/lib/kokkos/core/perf_test/PerfTestHexGrad.hpp b/lib/kokkos/core/perf_test/PerfTestHexGrad.hpp
new file mode 100755
index 000000000..d13d9a49e
--- /dev/null
+++ b/lib/kokkos/core/perf_test/PerfTestHexGrad.hpp
@@ -0,0 +1,268 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+namespace Test {
+
+template< class DeviceType ,
+ typename CoordScalarType = double ,
+ typename GradScalarType = float >
+struct HexGrad
+{
+ typedef DeviceType execution_space ;
+ typedef typename execution_space::size_type size_type ;
+
+ typedef HexGrad<DeviceType,CoordScalarType,GradScalarType> self_type;
+
+ // 3D array : ( ParallelWork , Space , Node )
+
+ enum { NSpace = 3 , NNode = 8 };
+
+ typedef Kokkos::View< CoordScalarType*[NSpace][NNode] , execution_space >
+ elem_coord_type ;
+
+ typedef Kokkos::View< GradScalarType*[NSpace][NNode] , execution_space >
+ elem_grad_type ;
+
+ elem_coord_type coords ;
+ elem_grad_type grad_op ;
+
+ enum { FLOPS = 318 }; // = 3 * ( 18 + 8 * 11 ) };
+ enum { READS = 18 };
+ enum { WRITES = 18 };
+
+ HexGrad( const elem_coord_type & arg_coords ,
+ const elem_grad_type & arg_grad_op )
+ : coords( arg_coords )
+ , grad_op( arg_grad_op )
+ {}
+
+ KOKKOS_INLINE_FUNCTION static
+ void grad( const CoordScalarType x[] ,
+ const CoordScalarType z[] ,
+ GradScalarType grad_y[] )
+ {
+ const GradScalarType R42=(x[3] - x[1]);
+ const GradScalarType R52=(x[4] - x[1]);
+ const GradScalarType R54=(x[4] - x[3]);
+
+ const GradScalarType R63=(x[5] - x[2]);
+ const GradScalarType R83=(x[7] - x[2]);
+ const GradScalarType R86=(x[7] - x[5]);
+
+ const GradScalarType R31=(x[2] - x[0]);
+ const GradScalarType R61=(x[5] - x[0]);
+ const GradScalarType R74=(x[6] - x[3]);
+
+ const GradScalarType R72=(x[6] - x[1]);
+ const GradScalarType R75=(x[6] - x[4]);
+ const GradScalarType R81=(x[7] - x[0]);
+
+ const GradScalarType t1=(R63 + R54);
+ const GradScalarType t2=(R61 + R74);
+ const GradScalarType t3=(R72 + R81);
+
+ const GradScalarType t4 =(R86 + R42);
+ const GradScalarType t5 =(R83 + R52);
+ const GradScalarType t6 =(R75 + R31);
+
+ // Calculate Y gradient from X and Z data
+
+ grad_y[0] = (z[1] * t1) - (z[2] * R42) - (z[3] * t5) + (z[4] * t4) + (z[5] * R52) - (z[7] * R54);
+ grad_y[1] = (z[2] * t2) + (z[3] * R31) - (z[0] * t1) - (z[5] * t6) + (z[6] * R63) - (z[4] * R61);
+ grad_y[2] = (z[3] * t3) + (z[0] * R42) - (z[1] * t2) - (z[6] * t4) + (z[7] * R74) - (z[5] * R72);
+ grad_y[3] = (z[0] * t5) - (z[1] * R31) - (z[2] * t3) + (z[7] * t6) + (z[4] * R81) - (z[6] * R83);
+ grad_y[4] = (z[5] * t3) + (z[6] * R86) - (z[7] * t2) - (z[0] * t4) - (z[3] * R81) + (z[1] * R61);
+ grad_y[5] = (z[6] * t5) - (z[4] * t3) - (z[7] * R75) + (z[1] * t6) - (z[0] * R52) + (z[2] * R72);
+ grad_y[6] = (z[7] * t1) - (z[5] * t5) - (z[4] * R86) + (z[2] * t4) - (z[1] * R63) + (z[3] * R83);
+ grad_y[7] = (z[4] * t2) - (z[6] * t1) + (z[5] * R75) - (z[3] * t6) - (z[2] * R74) + (z[0] * R54);
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( size_type ielem ) const
+ {
+ GradScalarType g[NNode] ;
+
+ const CoordScalarType x[NNode] = {
+ coords(ielem,0,0),
+ coords(ielem,0,1),
+ coords(ielem,0,2),
+ coords(ielem,0,3),
+ coords(ielem,0,4),
+ coords(ielem,0,5),
+ coords(ielem,0,6),
+ coords(ielem,0,7)
+ };
+
+ const CoordScalarType y[NNode] = {
+ coords(ielem,1,0),
+ coords(ielem,1,1),
+ coords(ielem,1,2),
+ coords(ielem,1,3),
+ coords(ielem,1,4),
+ coords(ielem,1,5),
+ coords(ielem,1,6),
+ coords(ielem,1,7)
+ };
+
+ const CoordScalarType z[NNode] = {
+ coords(ielem,2,0),
+ coords(ielem,2,1),
+ coords(ielem,2,2),
+ coords(ielem,2,3),
+ coords(ielem,2,4),
+ coords(ielem,2,5),
+ coords(ielem,2,6),
+ coords(ielem,2,7)
+ };
+
+ grad( z , y , g );
+
+ grad_op(ielem,0,0) = g[0];
+ grad_op(ielem,0,1) = g[1];
+ grad_op(ielem,0,2) = g[2];
+ grad_op(ielem,0,3) = g[3];
+ grad_op(ielem,0,4) = g[4];
+ grad_op(ielem,0,5) = g[5];
+ grad_op(ielem,0,6) = g[6];
+ grad_op(ielem,0,7) = g[7];
+
+ grad( x , z , g );
+
+ grad_op(ielem,1,0) = g[0];
+ grad_op(ielem,1,1) = g[1];
+ grad_op(ielem,1,2) = g[2];
+ grad_op(ielem,1,3) = g[3];
+ grad_op(ielem,1,4) = g[4];
+ grad_op(ielem,1,5) = g[5];
+ grad_op(ielem,1,6) = g[6];
+ grad_op(ielem,1,7) = g[7];
+
+ grad( y , x , g );
+
+ grad_op(ielem,2,0) = g[0];
+ grad_op(ielem,2,1) = g[1];
+ grad_op(ielem,2,2) = g[2];
+ grad_op(ielem,2,3) = g[3];
+ grad_op(ielem,2,4) = g[4];
+ grad_op(ielem,2,5) = g[5];
+ grad_op(ielem,2,6) = g[6];
+ grad_op(ielem,2,7) = g[7];
+ }
+
+ //--------------------------------------------------------------------------
+
+ struct Init {
+ typedef typename self_type::execution_space execution_space ;
+
+ elem_coord_type coords ;
+
+ Init( const elem_coord_type & arg_coords )
+ : coords( arg_coords ) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( size_type ielem ) const
+ {
+ coords(ielem,0,0) = 0.;
+ coords(ielem,1,0) = 0.;
+ coords(ielem,2,0) = 0.;
+
+ coords(ielem,0,1) = 1.;
+ coords(ielem,1,1) = 0.;
+ coords(ielem,2,1) = 0.;
+
+ coords(ielem,0,2) = 1.;
+ coords(ielem,1,2) = 1.;
+ coords(ielem,2,2) = 0.;
+
+ coords(ielem,0,3) = 0.;
+ coords(ielem,1,3) = 1.;
+ coords(ielem,2,3) = 0.;
+
+
+ coords(ielem,0,4) = 0.;
+ coords(ielem,1,4) = 0.;
+ coords(ielem,2,4) = 1.;
+
+ coords(ielem,0,5) = 1.;
+ coords(ielem,1,5) = 0.;
+ coords(ielem,2,5) = 1.;
+
+ coords(ielem,0,6) = 1.;
+ coords(ielem,1,6) = 1.;
+ coords(ielem,2,6) = 1.;
+
+ coords(ielem,0,7) = 0.;
+ coords(ielem,1,7) = 1.;
+ coords(ielem,2,7) = 1.;
+ }
+ };
+
+ //--------------------------------------------------------------------------
+
+ static double test( const int count , const int iter = 1 )
+ {
+ elem_coord_type coord( "coord" , count );
+ elem_grad_type grad ( "grad" , count );
+
+ // Execute the parallel kernels on the arrays:
+
+ double dt_min = 0 ;
+
+ Kokkos::parallel_for( count , Init( coord ) );
+ execution_space::fence();
+
+ for ( int i = 0 ; i < iter ; ++i ) {
+ Kokkos::Impl::Timer timer ;
+ Kokkos::parallel_for( count , HexGrad<execution_space>( coord , grad ) );
+ execution_space::fence();
+ const double dt = timer.seconds();
+ if ( 0 == i ) dt_min = dt ;
+ else dt_min = dt < dt_min ? dt : dt_min ;
+ }
+
+ return dt_min ;
+ }
+};
+
+}
+
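The HexGrad functor above encodes the hex-element gradient kernel (the "Calculate Y gradient from X and Z data" step applied three times with rotated coordinate axes), and its static test() helper performs the best-of-N timing. A minimal sketch of driving it directly, assuming Kokkos has already been initialized and the OpenMP backend is enabled; the element and iteration counts are illustrative and the wrapper function name is hypothetical:

// Sketch only: time the HexGrad kernel on the OpenMP execution space.
// Assumes Kokkos::initialize() has been called and KOKKOS_HAVE_OPENMP is defined.
#include <cstdio>
#include <Kokkos_Core.hpp>

void run_hexgrad_sketch()
{
  const int elem_count = 100000 ;   // illustrative problem size
  const int iterations = 10 ;       // best-of-N timing, as in HexGrad::test()
  const double dt = Test::HexGrad< Kokkos::OpenMP >::test( elem_count , iterations );
  std::printf( "HexGrad: %d elements, best time %g s\n" , elem_count , dt );
}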
diff --git a/lib/kokkos/core/src/Kokkos_Core.hpp b/lib/kokkos/core/perf_test/PerfTestHost.cpp
similarity index 56%
copy from lib/kokkos/core/src/Kokkos_Core.hpp
copy to lib/kokkos/core/perf_test/PerfTestHost.cpp
index 8f5f34bfd..6a0f2efad 100755
--- a/lib/kokkos/core/src/Kokkos_Core.hpp
+++ b/lib/kokkos/core/perf_test/PerfTestHost.cpp
@@ -1,106 +1,104 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos
-// Manycore Performance-Portable Multidimensional Arrays
-//
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
-#ifndef KOKKOS_CORE_HPP
-#define KOKKOS_CORE_HPP
+#include <gtest/gtest.h>
-//----------------------------------------------------------------------------
-// Include the execution space header files for the enabled execution spaces.
+#include <Kokkos_Core.hpp>
-#include <Kokkos_Core_fwd.hpp>
+#if defined( KOKKOS_HAVE_OPENMP )
-#if defined( KOKKOS_HAVE_CUDA )
-#include <Kokkos_Cuda.hpp>
-#endif
+typedef Kokkos::OpenMP TestHostDevice ;
+const char TestHostDeviceName[] = "Kokkos::OpenMP" ;
-#if defined( KOKKOS_HAVE_OPENMP )
-#include <Kokkos_OpenMP.hpp>
-#endif
+#elif defined( KOKKOS_HAVE_PTHREAD )
-#if defined( KOKKOS_HAVE_SERIAL )
-#include <Kokkos_Serial.hpp>
-#endif
+typedef Kokkos::Threads TestHostDevice ;
+const char TestHostDeviceName[] = "Kokkos::Threads" ;
-#if defined( KOKKOS_HAVE_PTHREAD )
-#include <Kokkos_Threads.hpp>
-#endif
+#elif defined( KOKKOS_HAVE_SERIAL )
-#include <Kokkos_Pair.hpp>
-#include <Kokkos_View.hpp>
-#include <Kokkos_Vectorization.hpp>
-#include <Kokkos_Atomic.hpp>
-#include <Kokkos_hwloc.hpp>
+typedef Kokkos::Serial TestHostDevice ;
+const char TestHostDeviceName[] = "Kokkos::Serial" ;
-//----------------------------------------------------------------------------
+#else
+# error "You must enable at least one of the following execution spaces in order to build this test: Kokkos::Threads, Kokkos::OpenMP, or Kokkos::Serial."
+#endif
-namespace Kokkos {
+#include <impl/Kokkos_Timer.hpp>
-struct InitArguments {
- int num_threads;
- int num_numa;
- int device_id;
+#include <PerfTestHexGrad.hpp>
+#include <PerfTestBlasKernels.hpp>
+#include <PerfTestGramSchmidt.hpp>
+#include <PerfTestDriver.hpp>
- InitArguments() {
- num_threads = -1;
- num_numa = -1;
- device_id = -1;
- }
-};
+//------------------------------------------------------------------------
-void initialize(int& narg, char* arg[]);
+namespace Test {
-void initialize(const InitArguments& args = InitArguments());
+class host : public ::testing::Test {
+protected:
+ static void SetUpTestCase()
+ {
+ const unsigned team_count = Kokkos::hwloc::get_available_numa_count();
+ const unsigned threads_per_team = 4 ;
-/** \brief Finalize the spaces that were initialized via Kokkos::initialize */
-void finalize();
+ TestHostDevice::initialize( team_count * threads_per_team );
+ }
-/** \brief Finalize all known execution spaces */
-void finalize_all();
+ static void TearDownTestCase()
+ {
+ TestHostDevice::finalize();
+ }
+};
-void fence();
+TEST_F( host, hexgrad ) {
+ EXPECT_NO_THROW(run_test_hexgrad< TestHostDevice>( 10, 20, TestHostDeviceName ));
+}
+TEST_F( host, gramschmidt ) {
+ EXPECT_NO_THROW(run_test_gramschmidt< TestHostDevice>( 10, 20, TestHostDeviceName ));
}
-#endif
+} // namespace Test
+
+
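The fixture above selects the first enabled host backend (OpenMP, then Threads, then Serial) and sizes it from the hwloc topology. A hedged sketch of the same initialization outside of gtest, assuming the OpenMP backend; the threads-per-NUMA count is an assumption, not taken from the source:

// Sketch only: initialize a host execution space from the hwloc topology,
// mirroring SetUpTestCase() above. threads_per_numa is an illustrative value.
#include <Kokkos_Core.hpp>
#include <Kokkos_hwloc.hpp>

void init_host_device_sketch()
{
  const unsigned numa_count       = Kokkos::hwloc::get_available_numa_count();
  const unsigned threads_per_numa = 4 ;
  Kokkos::OpenMP::initialize( numa_count * threads_per_numa );
  // ... run benchmarks ...
  Kokkos::OpenMP::finalize();
}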
diff --git a/lib/kokkos/core/src/impl/Kokkos_spinwait.hpp b/lib/kokkos/core/perf_test/PerfTestMain.cpp
similarity index 74%
copy from lib/kokkos/core/src/impl/Kokkos_spinwait.hpp
copy to lib/kokkos/core/perf_test/PerfTestMain.cpp
index 966291abd..ac9163082 100755
--- a/lib/kokkos/core/src/impl/Kokkos_spinwait.hpp
+++ b/lib/kokkos/core/perf_test/PerfTestMain.cpp
@@ -1,64 +1,49 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
+#include <gtest/gtest.h>
-#ifndef KOKKOS_SPINWAIT_HPP
-#define KOKKOS_SPINWAIT_HPP
-
-#include <Kokkos_Macros.hpp>
-
-namespace Kokkos {
-namespace Impl {
-
-#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
-void spinwait( volatile int & flag , const int value );
-#else
-KOKKOS_INLINE_FUNCTION
-void spinwait( volatile int & , const int ) {}
-#endif
-
-} /* namespace Impl */
-} /* namespace Kokkos */
-
-#endif /* #ifndef KOKKOS_SPINWAIT_HPP */
-
+int main(int argc, char *argv[]) {
+ ::testing::InitGoogleTest(&argc,argv);
+ return RUN_ALL_TESTS();
+}
diff --git a/lib/kokkos/core/perf_test/test_atomic.cpp b/lib/kokkos/core/perf_test/test_atomic.cpp
new file mode 100755
index 000000000..f1e5c1b62
--- /dev/null
+++ b/lib/kokkos/core/perf_test/test_atomic.cpp
@@ -0,0 +1,504 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <cstdio>
+#include <cstring>
+#include <cstdlib>
+
+#include <Kokkos_Core.hpp>
+#include <impl/Kokkos_Timer.hpp>
+
+typedef Kokkos::DefaultExecutionSpace exec_space;
+
+#define RESET 0
+#define BRIGHT 1
+#define DIM 2
+#define UNDERLINE 3
+#define BLINK 4
+#define REVERSE 7
+#define HIDDEN 8
+
+#define BLACK 0
+#define RED 1
+#define GREEN 2
+#define YELLOW 3
+#define BLUE 4
+#define MAGENTA 5
+#define CYAN 6
+#define GREY 7
+#define WHITE 8
+
+void textcolor(int attr, int fg, int bg)
+{ char command[13];
+
+ /* Command is the control command to the terminal */
+ sprintf(command, "%c[%d;%d;%dm", 0x1B, attr, fg + 30, bg + 40);
+ printf("%s", command);
+}
+void textcolor_standard() {textcolor(RESET, BLACK, WHITE);}
+
+
+template<class T,class DEVICE_TYPE>
+struct ZeroFunctor{
+ typedef DEVICE_TYPE execution_space;
+ typedef typename Kokkos::View<T,execution_space> type;
+ typedef typename Kokkos::View<T,execution_space>::HostMirror h_type;
+ type data;
+ KOKKOS_INLINE_FUNCTION
+ void operator()(int i) const {
+ data() = 0;
+ }
+};
+
+//---------------------------------------------------
+//--------------atomic_fetch_add---------------------
+//---------------------------------------------------
+
+template<class T,class DEVICE_TYPE>
+struct AddFunctor{
+ typedef DEVICE_TYPE execution_space;
+ typedef Kokkos::View<T,execution_space> type;
+ type data;
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()(int i) const {
+ Kokkos::atomic_fetch_add(&data(),(T)1);
+ }
+};
+
+template<class T>
+T AddLoop(int loop) {
+ struct ZeroFunctor<T,exec_space> f_zero;
+ typename ZeroFunctor<T,exec_space>::type data("Data");
+ typename ZeroFunctor<T,exec_space>::h_type h_data("HData");
+ f_zero.data = data;
+ Kokkos::parallel_for(1,f_zero);
+ exec_space::fence();
+
+ struct AddFunctor<T,exec_space> f_add;
+ f_add.data = data;
+ Kokkos::parallel_for(loop,f_add);
+ exec_space::fence();
+
+ Kokkos::deep_copy(h_data,data);
+ T val = h_data();
+ return val;
+}
+
+template<class T,class DEVICE_TYPE>
+struct AddNonAtomicFunctor{
+ typedef DEVICE_TYPE execution_space;
+ typedef Kokkos::View<T,execution_space> type;
+ type data;
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()(int i) const {
+ data()+=(T)1;
+ }
+};
+
+template<class T>
+T AddLoopNonAtomic(int loop) {
+ struct ZeroFunctor<T,exec_space> f_zero;
+ typename ZeroFunctor<T,exec_space>::type data("Data");
+ typename ZeroFunctor<T,exec_space>::h_type h_data("HData");
+
+ f_zero.data = data;
+ Kokkos::parallel_for(1,f_zero);
+ exec_space::fence();
+
+ struct AddNonAtomicFunctor<T,exec_space> f_add;
+ f_add.data = data;
+ Kokkos::parallel_for(loop,f_add);
+ exec_space::fence();
+
+ Kokkos::deep_copy(h_data,data);
+ T val = h_data();
+
+ return val;
+}
+
+template<class T>
+T AddLoopSerial(int loop) {
+ T* data = new T[1];
+ data[0] = 0;
+
+ for(int i=0;i<loop;i++)
+ *data+=(T)1;
+
+ T val = *data;
+ delete data;
+ return val;
+}
+
+template<class T,class DEVICE_TYPE>
+struct CASFunctor{
+ typedef DEVICE_TYPE execution_space;
+ typedef Kokkos::View<T,execution_space> type;
+ type data;
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()(int i) const {
+ T old = data();
+ T newval, assumed;
+ do {
+ assumed = old;
+ newval = assumed + (T)1;
+ old = Kokkos::atomic_compare_exchange(&data(), assumed, newval);
+ }
+ while( old != assumed );
+ }
+};
+
+template<class T>
+T CASLoop(int loop) {
+ struct ZeroFunctor<T,exec_space> f_zero;
+ typename ZeroFunctor<T,exec_space>::type data("Data");
+ typename ZeroFunctor<T,exec_space>::h_type h_data("HData");
+ f_zero.data = data;
+ Kokkos::parallel_for(1,f_zero);
+ exec_space::fence();
+
+ struct CASFunctor<T,exec_space> f_cas;
+ f_cas.data = data;
+ Kokkos::parallel_for(loop,f_cas);
+ exec_space::fence();
+
+ Kokkos::deep_copy(h_data,data);
+ T val = h_data();
+
+ return val;
+}
+
+template<class T,class DEVICE_TYPE>
+struct CASNonAtomicFunctor{
+ typedef DEVICE_TYPE execution_space;
+ typedef Kokkos::View<T,execution_space> type;
+ type data;
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()(int i) const {
+ volatile T assumed;
+ volatile T newval;
+ bool fail=1;
+ do {
+ assumed = data();
+ newval = assumed + (T)1;
+ if(data()==assumed) {
+ data() = newval;
+ fail = 0;
+ }
+ }
+ while(fail);
+ }
+};
+
+template<class T>
+T CASLoopNonAtomic(int loop) {
+ struct ZeroFunctor<T,exec_space> f_zero;
+ typename ZeroFunctor<T,exec_space>::type data("Data");
+ typename ZeroFunctor<T,exec_space>::h_type h_data("HData");
+ f_zero.data = data;
+ Kokkos::parallel_for(1,f_zero);
+ exec_space::fence();
+
+ struct CASNonAtomicFunctor<T,exec_space> f_cas;
+ f_cas.data = data;
+ Kokkos::parallel_for(loop,f_cas);
+ exec_space::fence();
+
+ Kokkos::deep_copy(h_data,data);
+ T val = h_data();
+
+ return val;
+}
+
+template<class T>
+T CASLoopSerial(int loop) {
+ T* data = new T[1];
+ data[0] = 0;
+
+ for(int i=0;i<loop;i++) {
+ T assumed;
+ T newval;
+ T old;
+ do {
+ assumed = *data;
+ newval = assumed + (T)1;
+ old = *data;
+ *data = newval;
+ }
+ while(!(assumed==old));
+ }
+
+ T val = *data;
+ delete data;
+ return val;
+}
+
+template<class T,class DEVICE_TYPE>
+struct ExchFunctor{
+ typedef DEVICE_TYPE execution_space;
+ typedef Kokkos::View<T,execution_space> type;
+ type data, data2;
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()(int i) const {
+ T old = Kokkos::atomic_exchange(&data(),(T)i);
+ Kokkos::atomic_fetch_add(&data2(),old);
+ }
+};
+
+template<class T>
+T ExchLoop(int loop) {
+ struct ZeroFunctor<T,exec_space> f_zero;
+ typename ZeroFunctor<T,exec_space>::type data("Data");
+ typename ZeroFunctor<T,exec_space>::h_type h_data("HData");
+ f_zero.data = data;
+ Kokkos::parallel_for(1,f_zero);
+ exec_space::fence();
+
+ typename ZeroFunctor<T,exec_space>::type data2("Data");
+ typename ZeroFunctor<T,exec_space>::h_type h_data2("HData");
+ f_zero.data = data2;
+ Kokkos::parallel_for(1,f_zero);
+ exec_space::fence();
+
+ struct ExchFunctor<T,exec_space> f_exch;
+ f_exch.data = data;
+ f_exch.data2 = data2;
+ Kokkos::parallel_for(loop,f_exch);
+ exec_space::fence();
+
+ Kokkos::deep_copy(h_data,data);
+ Kokkos::deep_copy(h_data2,data2);
+ T val = h_data() + h_data2();
+
+ return val;
+}
+
+template<class T,class DEVICE_TYPE>
+struct ExchNonAtomicFunctor{
+ typedef DEVICE_TYPE execution_space;
+ typedef Kokkos::View<T,execution_space> type;
+ type data, data2;
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()(int i) const {
+ T old = data();
+ data()=(T) i;
+ data2()+=old;
+ }
+};
+
+
+template<class T>
+T ExchLoopNonAtomic(int loop) {
+ struct ZeroFunctor<T,exec_space> f_zero;
+ typename ZeroFunctor<T,exec_space>::type data("Data");
+ typename ZeroFunctor<T,exec_space>::h_type h_data("HData");
+ f_zero.data = data;
+ Kokkos::parallel_for(1,f_zero);
+ exec_space::fence();
+
+ typename ZeroFunctor<T,exec_space>::type data2("Data");
+ typename ZeroFunctor<T,exec_space>::h_type h_data2("HData");
+ f_zero.data = data2;
+ Kokkos::parallel_for(1,f_zero);
+ exec_space::fence();
+
+ struct ExchNonAtomicFunctor<T,exec_space> f_exch;
+ f_exch.data = data;
+ f_exch.data2 = data2;
+ Kokkos::parallel_for(loop,f_exch);
+ exec_space::fence();
+
+ Kokkos::deep_copy(h_data,data);
+ Kokkos::deep_copy(h_data2,data2);
+ T val = h_data() + h_data2();
+
+ return val;
+}
+
+template<class T>
+T ExchLoopSerial(int loop) {
+ T* data = new T[1];
+ T* data2 = new T[1];
+ data[0] = 0;
+ data2[0] = 0;
+ for(int i=0;i<loop;i++) {
+ T old = *data;
+ *data=(T) i;
+ *data2+=old;
+ }
+
+ T val = *data2 + *data;
+ delete data;
+ delete data2;
+ return val;
+}
+
+template<class T>
+T LoopVariant(int loop, int test) {
+ switch (test) {
+ case 1: return AddLoop<T>(loop);
+ case 2: return CASLoop<T>(loop);
+ case 3: return ExchLoop<T>(loop);
+ }
+ return 0;
+}
+
+template<class T>
+T LoopVariantSerial(int loop, int test) {
+ switch (test) {
+ case 1: return AddLoopSerial<T>(loop);
+ case 2: return CASLoopSerial<T>(loop);
+ case 3: return ExchLoopSerial<T>(loop);
+ }
+ return 0;
+}
+
+template<class T>
+T LoopVariantNonAtomic(int loop, int test) {
+ switch (test) {
+ case 1: return AddLoopNonAtomic<T>(loop);
+ case 2: return CASLoopNonAtomic<T>(loop);
+ case 3: return ExchLoopNonAtomic<T>(loop);
+ }
+ return 0;
+}
+
+template<class T>
+void Loop(int loop, int test, const char* type_name) {
+ LoopVariant<T>(loop,test);
+
+ Kokkos::Impl::Timer timer;
+ T res = LoopVariant<T>(loop,test);
+ double time1 = timer.seconds();
+
+ timer.reset();
+ T resNonAtomic = LoopVariantNonAtomic<T>(loop,test);
+ double time2 = timer.seconds();
+
+ timer.reset();
+ T resSerial = LoopVariantSerial<T>(loop,test);
+ double time3 = timer.seconds();
+
+ time1*=1e6/loop;
+ time2*=1e6/loop;
+ time3*=1e6/loop;
+ //textcolor_standard();
+ bool passed = true;
+ if(resSerial!=res) passed = false;
+ //if(!passed) textcolor(RESET,BLACK,YELLOW);
+ printf("%s Test %i %s --- Loop: %i Value (S,A,NA): %e %e %e Time: %7.4e %7.4e %7.4e Size of Type %i)",type_name,test,passed?"PASSED":"FAILED",loop,1.0*resSerial,1.0*res,1.0*resNonAtomic,time1,time2,time3,(int)sizeof(T));
+ //if(!passed) textcolor_standard();
+ printf("\n");
+}
+
+
+template<class T>
+void Test(int loop, int test, const char* type_name) {
+ if(test==-1) {
+ Loop<T>(loop,1,type_name);
+ Loop<T>(loop,2,type_name);
+ Loop<T>(loop,3,type_name);
+
+ }
+ else
+ Loop<T>(loop,test,type_name);
+}
+
+int main(int argc, char* argv[])
+{
+ int type = -1;
+ int loop = 1000000;
+ int test = -1;
+
+ for(int i=0;i<argc;i++)
+ {
+ if((strcmp(argv[i],"--test")==0)) {test=atoi(argv[++i]); continue;}
+ if((strcmp(argv[i],"--type")==0)) {type=atoi(argv[++i]); continue;}
+ if((strcmp(argv[i],"-l")==0)||(strcmp(argv[i],"--loop")==0)) {loop=atoi(argv[++i]); continue;}
+ }
+
+
+ Kokkos::initialize(argc,argv);
+
+
+ printf("Using %s\n",Kokkos::atomic_query_version());
+ bool all_tests = false;
+ if(type==-1) all_tests = true;
+ while(type<100) {
+ if(type==1) {
+ Test<int>(loop,test,"int ");
+ }
+ if(type==2) {
+ Test<long int>(loop,test,"long int ");
+ }
+ if(type==3) {
+ Test<long long int>(loop,test,"long long int ");
+ }
+ if(type==4) {
+ Test<unsigned int>(loop,test,"unsigned int ");
+ }
+ if(type==5) {
+ Test<unsigned long int>(loop,test,"unsigned long int ");
+ }
+ if(type==6) {
+ Test<unsigned long long int>(loop,test,"unsigned long long int ");
+ }
+ if(type==10) {
+ //Test<float>(loop,test,"float ");
+ }
+ if(type==11) {
+ Test<double>(loop,test,"double ");
+ }
+ if(!all_tests) type=100;
+ else type++;
+ }
+
+ Kokkos::finalize();
+
+}
+
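The benchmark above compares three ways of accumulating into a single View element: Kokkos atomics (AddLoop/CASLoop/ExchLoop), plain unguarded updates, and a serial reference. A hedged sketch of exercising one case directly, roughly equivalent to running the built binary with --type 1 --test 2 --loop 1000000 (the flags parsed in main() above); the loop count is illustrative and this stand-alone main() is a sketch, not part of the patch:

// Sketch only: compare the atomic compare-exchange increment against the
// racy and serial variants defined above, for int data.
int main( int argc , char * argv[] )
{
  Kokkos::initialize( argc , argv );
  const int loop = 1000000 ;
  const int sum_atomic = CASLoop<int>( loop );            // atomic compare-exchange loop
  const int sum_plain  = CASLoopNonAtomic<int>( loop );   // unguarded update, may lose increments
  const int sum_serial = CASLoopSerial<int>( loop );      // reference result == loop
  printf( "CAS<int>: atomic=%d plain=%d serial=%d\n" , sum_atomic , sum_plain , sum_serial );
  Kokkos::finalize();
  return 0 ;
}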
diff --git a/lib/kokkos/core/src/Cuda/KokkosExp_Cuda_View.hpp b/lib/kokkos/core/src/Cuda/KokkosExp_Cuda_View.hpp
new file mode 100755
index 000000000..37c5e53e5
--- /dev/null
+++ b/lib/kokkos/core/src/Cuda/KokkosExp_Cuda_View.hpp
@@ -0,0 +1,283 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#ifndef KOKKOS_EXPERIMENTAL_CUDA_VIEW_HPP
+#define KOKKOS_EXPERIMENTAL_CUDA_VIEW_HPP
+
+/* only compile this file if CUDA is enabled for Kokkos */
+#if defined( KOKKOS_HAVE_CUDA )
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+namespace Kokkos {
+namespace Experimental {
+namespace Impl {
+
+//----------------------------------------------------------------------------
+// Cuda Texture fetches can be performed for 4, 8 and 16 byte objects (int, int2, int4).
+// Via reinterpret_cast this can be used to support all scalar types of those sizes.
+// Any other scalar type falls back to either normal reads out of global memory,
+// or using the __ldg intrinsic on Kepler GPUs or newer (Compute Capability >= 3.0)
+
+template< typename ValueType , typename AliasType >
+struct CudaTextureFetch {
+
+ ::cudaTextureObject_t m_obj ;
+ const ValueType * m_ptr ;
+ int m_offset ;
+
+ // Dereference operator pulls through texture object and returns by value
+ template< typename iType >
+ KOKKOS_INLINE_FUNCTION
+ ValueType operator[]( const iType & i ) const
+ {
+#if defined( __CUDA_ARCH__ ) && ( 300 <= __CUDA_ARCH__ )
+ AliasType v = tex1Dfetch<AliasType>( m_obj , i + m_offset );
+ return *(reinterpret_cast<ValueType*> (&v));
+#else
+ return m_ptr[ i ];
+#endif
+ }
+
+ // Pointer to referenced memory
+ KOKKOS_INLINE_FUNCTION
+ operator const ValueType * () const { return m_ptr ; }
+
+
+ KOKKOS_INLINE_FUNCTION
+ CudaTextureFetch() : m_obj() , m_ptr() , m_offset() {}
+
+ KOKKOS_INLINE_FUNCTION
+ ~CudaTextureFetch() {}
+
+ KOKKOS_INLINE_FUNCTION
+ CudaTextureFetch( const CudaTextureFetch & rhs )
+ : m_obj( rhs.m_obj )
+ , m_ptr( rhs.m_ptr )
+ , m_offset( rhs.m_offset )
+ {}
+
+ KOKKOS_INLINE_FUNCTION
+ CudaTextureFetch( CudaTextureFetch && rhs )
+ : m_obj( rhs.m_obj )
+ , m_ptr( rhs.m_ptr )
+ , m_offset( rhs.m_offset )
+ {}
+
+ KOKKOS_INLINE_FUNCTION
+ CudaTextureFetch & operator = ( const CudaTextureFetch & rhs )
+ {
+ m_obj = rhs.m_obj ;
+ m_ptr = rhs.m_ptr ;
+ m_offset = rhs.m_offset ;
+ return *this ;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ CudaTextureFetch & operator = ( CudaTextureFetch && rhs )
+ {
+ m_obj = rhs.m_obj ;
+ m_ptr = rhs.m_ptr ;
+ m_offset = rhs.m_offset ;
+ return *this ;
+ }
+
+ // Texture object spans the entire allocation.
+ // This handle may view a subset of the allocation, so an offset is required.
+ template< class CudaMemorySpace >
+ inline explicit
+ CudaTextureFetch( const ValueType * const arg_ptr
+ , Kokkos::Experimental::Impl::SharedAllocationRecord< CudaMemorySpace , void > & record
+ )
+ // 'attach_texture_object' returns 0 when __CUDA_ARCH__ < 300
+ : m_obj( record.template attach_texture_object< AliasType >() )
+ , m_ptr( arg_ptr )
+ , m_offset( record.attach_texture_object_offset( reinterpret_cast<const AliasType*>( arg_ptr ) ) )
+ {}
+};
+
+#if defined( KOKKOS_CUDA_USE_LDG_INTRINSIC )
+
+template< typename ValueType , typename AliasType >
+struct CudaLDGFetch {
+
+ const ValueType * m_ptr ;
+
+ template< typename iType >
+ KOKKOS_INLINE_FUNCTION
+ ValueType operator[]( const iType & i ) const
+ {
+ AliasType v = __ldg(reinterpret_cast<AliasType*>(&m_ptr[i]));
+ return *(reinterpret_cast<ValueType*> (&v));
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ operator const ValueType * () const { return m_ptr ; }
+
+ KOKKOS_INLINE_FUNCTION
+ CudaLDGFetch() : m_ptr() {}
+
+ KOKKOS_INLINE_FUNCTION
+ ~CudaLDGFetch() {}
+
+ KOKKOS_INLINE_FUNCTION
+ CudaLDGFetch( const CudaLDGFetch & rhs )
+ : m_ptr( rhs.m_ptr )
+ {}
+
+ KOKKOS_INLINE_FUNCTION
+ CudaLDGFetch( CudaLDGFetch && rhs )
+ : m_ptr( rhs.m_ptr )
+ {}
+
+ KOKKOS_INLINE_FUNCTION
+ CudaLDGFetch & operator = ( const CudaLDGFetch & rhs )
+ {
+ m_ptr = rhs.m_ptr ;
+ return *this ;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ CudaLDGFetch & operator = ( CudaLDGFetch && rhs )
+ {
+ m_ptr = rhs.m_ptr ;
+ return *this ;
+ }
+
+ template< class CudaMemorySpace >
+ inline explicit
+ CudaLDGFetch( const ValueType * const arg_ptr
+ , Kokkos::Experimental::Impl::SharedAllocationRecord< CudaMemorySpace , void > const &
+ )
+ : m_ptr( arg_ptr )
+ {}
+};
+
+#endif
+
+} // namespace Impl
+} // namespace Experimental
+} // namespace Kokkos
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+namespace Kokkos {
+namespace Experimental {
+namespace Impl {
+
+/** \brief Replace Default ViewDataHandle with Cuda texture fetch specialization
+ * if 'const' value type, CudaSpace and random access.
+ */
+template< class Traits >
+class ViewDataHandle< Traits ,
+ typename std::enable_if<(
+ // Is Cuda memory space
+ ( std::is_same< typename Traits::memory_space,Kokkos::CudaSpace>::value ||
+ std::is_same< typename Traits::memory_space,Kokkos::CudaUVMSpace>::value )
+ &&
+ // Is a trivial const value of 4, 8, or 16 bytes
+ std::is_trivial<typename Traits::const_value_type>::value
+ &&
+ std::is_same<typename Traits::const_value_type,typename Traits::value_type>::value
+ &&
+ ( sizeof(typename Traits::const_value_type) == 4 ||
+ sizeof(typename Traits::const_value_type) == 8 ||
+ sizeof(typename Traits::const_value_type) == 16 )
+ &&
+ // Random access trait
+ ( Traits::memory_traits::RandomAccess != 0 )
+ )>::type >
+{
+public:
+
+ using track_type = Kokkos::Experimental::Impl::SharedAllocationTracker ;
+
+ using value_type = typename Traits::const_value_type ;
+ using return_type = typename Traits::const_value_type ; // NOT a reference
+
+ using alias_type = typename std::conditional< ( sizeof(value_type) == 4 ) , int ,
+ typename std::conditional< ( sizeof(value_type) == 8 ) , ::int2 ,
+ typename std::conditional< ( sizeof(value_type) == 16 ) , ::int4 , void
+ >::type
+ >::type
+ >::type ;
+
+#if defined( KOKKOS_CUDA_USE_LDG_INTRINSIC )
+ using handle_type = Kokkos::Experimental::Impl::CudaLDGFetch< value_type , alias_type > ;
+#else
+ using handle_type = Kokkos::Experimental::Impl::CudaTextureFetch< value_type , alias_type > ;
+#endif
+
+ KOKKOS_INLINE_FUNCTION
+ static handle_type const & assign( handle_type const & arg_handle , track_type const & /* arg_tracker */ )
+ {
+ return arg_handle ;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ static handle_type assign( value_type * arg_data_ptr, track_type const & arg_tracker )
+ {
+#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
+ // Assignment of texture = non-texture requires creation of a texture object
+ // which can only occur on the host. In addition, 'get_record' is only valid
+ // if called in a host execution space
+ return handle_type( arg_data_ptr , arg_tracker.template get_record< typename Traits::memory_space >() );
+#else
+ Kokkos::Impl::cuda_abort("Cannot create Cuda texture object from within a Cuda kernel");
+ return handle_type();
+#endif
+ }
+};
+
+}
+}
+}
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+#endif /* #if defined( KOKKOS_HAVE_CUDA ) */
+#endif /* #ifndef KOKKOS_CUDA_VIEW_HPP */
+
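The ViewDataHandle specialization above only engages when the view's memory space is CudaSpace or CudaUVMSpace, the value type is const, trivially copyable and 4, 8 or 16 bytes wide, and the RandomAccess memory trait is set; in that case loads go through a texture object, or through __ldg when KOKKOS_CUDA_USE_LDG_INTRINSIC is defined. A hedged sketch of a view declaration that would take this path, written with the classic View spelling; the view names are illustrative:

// Sketch only: a read-only, random-access view of 8-byte values in CudaSpace.
// Assigning from a writable view attaches the texture object (host side only).
typedef Kokkos::View< const double * , Kokkos::CudaSpace ,
                      Kokkos::MemoryTraits< Kokkos::RandomAccess > >  rnd_read_view ;

Kokkos::View< double * , Kokkos::CudaSpace > data( "data" , 1000 );
rnd_read_view data_ro = data ;   // kernel reads go via tex1Dfetch / __ldg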
diff --git a/lib/kokkos/core/src/Cuda/Kokkos_CudaExec.hpp b/lib/kokkos/core/src/Cuda/Kokkos_CudaExec.hpp
index 027155bc5..c1b2d51c4 100755
--- a/lib/kokkos/core/src/Cuda/Kokkos_CudaExec.hpp
+++ b/lib/kokkos/core/src/Cuda/Kokkos_CudaExec.hpp
@@ -1,237 +1,277 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos
-// Manycore Performance-Portable Multidimensional Arrays
-//
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_CUDAEXEC_HPP
#define KOKKOS_CUDAEXEC_HPP
+#include <Kokkos_Macros.hpp>
+
+/* only compile this file if CUDA is enabled for Kokkos */
+#ifdef KOKKOS_HAVE_CUDA
+
#include <string>
#include <Kokkos_Parallel.hpp>
#include <impl/Kokkos_Error.hpp>
#include <Cuda/Kokkos_Cuda_abort.hpp>
+#include <Cuda/Kokkos_Cuda_Error.hpp>
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
struct CudaTraits {
enum { WarpSize = 32 /* 0x0020 */ };
enum { WarpIndexMask = 0x001f /* Mask for warpindex */ };
enum { WarpIndexShift = 5 /* WarpSize == 1 << WarpShift */ };
enum { SharedMemoryBanks = 32 /* Compute device 2.0 */ };
enum { SharedMemoryCapacity = 0x0C000 /* 48k shared / 16k L1 Cache */ };
enum { SharedMemoryUsage = 0x04000 /* 16k shared / 48k L1 Cache */ };
enum { UpperBoundGridCount = 65535 /* Hard upper bound */ };
enum { ConstantMemoryCapacity = 0x010000 /* 64k bytes */ };
enum { ConstantMemoryUsage = 0x008000 /* 32k bytes */ };
enum { ConstantMemoryCache = 0x002000 /* 8k bytes */ };
typedef unsigned long
ConstantGlobalBufferType[ ConstantMemoryUsage / sizeof(unsigned long) ];
enum { ConstantMemoryUseThreshold = 0x000200 /* 512 bytes */ };
KOKKOS_INLINE_FUNCTION static
CudaSpace::size_type warp_count( CudaSpace::size_type i )
{ return ( i + WarpIndexMask ) >> WarpIndexShift ; }
KOKKOS_INLINE_FUNCTION static
CudaSpace::size_type warp_align( CudaSpace::size_type i )
{
enum { Mask = ~CudaSpace::size_type( WarpIndexMask ) };
return ( i + WarpIndexMask ) & Mask ;
}
};
//----------------------------------------------------------------------------
CudaSpace::size_type cuda_internal_maximum_warp_count();
CudaSpace::size_type cuda_internal_maximum_grid_count();
CudaSpace::size_type cuda_internal_maximum_shared_words();
CudaSpace::size_type * cuda_internal_scratch_flags( const CudaSpace::size_type size );
CudaSpace::size_type * cuda_internal_scratch_space( const CudaSpace::size_type size );
CudaSpace::size_type * cuda_internal_scratch_unified( const CudaSpace::size_type size );
} // namespace Impl
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
#if defined( __CUDACC__ )
/** \brief Access to constant memory on the device */
#ifdef KOKKOS_CUDA_USE_RELOCATABLE_DEVICE_CODE
extern
#endif
__device__ __constant__
Kokkos::Impl::CudaTraits::ConstantGlobalBufferType
kokkos_impl_cuda_constant_memory_buffer ;
+__device__ __constant__
+int* kokkos_impl_cuda_atomic_lock_array ;
+#define CUDA_SPACE_ATOMIC_MASK 0x1FFFF
+#define CUDA_SPACE_ATOMIC_XOR_MASK 0x15A39
+
+namespace Kokkos {
+namespace Impl {
+__device__ inline
+bool lock_address_cuda_space(void* ptr) {
+ size_t offset = size_t(ptr);
+ offset = offset >> 2;
+ offset = offset & CUDA_SPACE_ATOMIC_MASK;
+ //offset = offset xor CUDA_SPACE_ATOMIC_XOR_MASK;
+ return (0 == atomicCAS(&kokkos_impl_cuda_atomic_lock_array[offset],0,1));
+}
+
+__device__ inline
+void unlock_address_cuda_space(void* ptr) {
+ size_t offset = size_t(ptr);
+ offset = offset >> 2;
+ offset = offset & CUDA_SPACE_ATOMIC_MASK;
+ //offset = offset xor CUDA_SPACE_ATOMIC_XOR_MASK;
+ atomicExch( &kokkos_impl_cuda_atomic_lock_array[ offset ], 0);
+}
+
+}
+}
+
template< typename T >
inline
__device__
T * kokkos_impl_cuda_shared_memory()
{ extern __shared__ Kokkos::CudaSpace::size_type sh[]; return (T*) sh ; }
namespace Kokkos {
namespace Impl {
//----------------------------------------------------------------------------
// See section B.17 of Cuda C Programming Guide Version 3.2
// for discussion of
// __launch_bounds__(maxThreadsPerBlock,minBlocksPerMultiprocessor)
// function qualifier which could be used to improve performance.
//----------------------------------------------------------------------------
// Maximize L1 cache and minimize shared memory:
// cudaFuncSetCacheConfig(MyKernel, cudaFuncCachePreferL1 );
// For 2.0 capability: 48 KB L1 and 16 KB shared
//----------------------------------------------------------------------------
template< class DriverType >
__global__
static void cuda_parallel_launch_constant_memory()
{
const DriverType & driver =
*((const DriverType *) kokkos_impl_cuda_constant_memory_buffer );
driver();
}
template< class DriverType >
__global__
static void cuda_parallel_launch_local_memory( const DriverType driver )
{
driver();
}
template < class DriverType ,
bool Large = ( CudaTraits::ConstantMemoryUseThreshold < sizeof(DriverType) ) >
struct CudaParallelLaunch ;
template < class DriverType >
struct CudaParallelLaunch< DriverType , true > {
inline
CudaParallelLaunch( const DriverType & driver
, const dim3 & grid
, const dim3 & block
, const int shmem
, const cudaStream_t stream = 0 )
{
if ( grid.x && ( block.x * block.y * block.z ) ) {
if ( sizeof( Kokkos::Impl::CudaTraits::ConstantGlobalBufferType ) <
sizeof( DriverType ) ) {
Kokkos::Impl::throw_runtime_exception( std::string("CudaParallelLaunch FAILED: Functor is too large") );
}
if ( CudaTraits::SharedMemoryCapacity < shmem ) {
Kokkos::Impl::throw_runtime_exception( std::string("CudaParallelLaunch FAILED: shared memory request is too large") );
}
else if ( shmem ) {
cudaFuncSetCacheConfig( cuda_parallel_launch_constant_memory< DriverType > , cudaFuncCachePreferShared );
} else {
cudaFuncSetCacheConfig( cuda_parallel_launch_constant_memory< DriverType > , cudaFuncCachePreferL1 );
}
// Copy functor to constant memory on the device
cudaMemcpyToSymbol( kokkos_impl_cuda_constant_memory_buffer , & driver , sizeof(DriverType) );
+ int* lock_array_ptr = lock_array_cuda_space_ptr();
+ cudaMemcpyToSymbol( kokkos_impl_cuda_atomic_lock_array , & lock_array_ptr , sizeof(int*) );
+
// Invoke the driver function on the device
cuda_parallel_launch_constant_memory< DriverType ><<< grid , block , shmem , stream >>>();
-#if defined( KOKKOS_EXPRESSION_CHECK )
+#if defined( KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK )
Kokkos::Cuda::fence();
+ CUDA_SAFE_CALL( cudaGetLastError() );
#endif
}
}
};
template < class DriverType >
struct CudaParallelLaunch< DriverType , false > {
inline
CudaParallelLaunch( const DriverType & driver
, const dim3 & grid
, const dim3 & block
, const int shmem
, const cudaStream_t stream = 0 )
{
if ( grid.x && ( block.x * block.y * block.z ) ) {
if ( CudaTraits::SharedMemoryCapacity < shmem ) {
Kokkos::Impl::throw_runtime_exception( std::string("CudaParallelLaunch FAILED: shared memory request is too large") );
}
else if ( shmem ) {
- cudaFuncSetCacheConfig( cuda_parallel_launch_constant_memory< DriverType > , cudaFuncCachePreferShared );
+ cudaFuncSetCacheConfig( cuda_parallel_launch_local_memory< DriverType > , cudaFuncCachePreferShared );
} else {
- cudaFuncSetCacheConfig( cuda_parallel_launch_constant_memory< DriverType > , cudaFuncCachePreferL1 );
+ cudaFuncSetCacheConfig( cuda_parallel_launch_local_memory< DriverType > , cudaFuncCachePreferL1 );
}
+ int* lock_array_ptr = lock_array_cuda_space_ptr();
+ cudaMemcpyToSymbol( kokkos_impl_cuda_atomic_lock_array , & lock_array_ptr , sizeof(int*) );
+
cuda_parallel_launch_local_memory< DriverType ><<< grid , block , shmem , stream >>>( driver );
-#if defined( KOKKOS_EXPRESSION_CHECK )
+#if defined( KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK )
Kokkos::Cuda::fence();
+ CUDA_SAFE_CALL( cudaGetLastError() );
#endif
}
}
};
//----------------------------------------------------------------------------
} // namespace Impl
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
#endif /* defined( __CUDACC__ ) */
-
+#endif /* defined( KOKKOS_HAVE_CUDA ) */
#endif /* #ifndef KOKKOS_CUDAEXEC_HPP */
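The kokkos_impl_cuda_atomic_lock_array added above, together with lock_address_cuda_space() / unlock_address_cuda_space(), hashes an address (shifted right by two bits and masked with CUDA_SPACE_ATOMIC_MASK) onto a slot in a device-resident lock array, and CudaParallelLaunch copies the array's pointer into constant memory before each kernel launch. A hedged sketch of the spin-lock pattern a software atomic built on these helpers might follow (device code; the function name is hypothetical and the real Kokkos atomics differ in detail, e.g. memory fencing and backoff):

// Sketch only: serialize an update to *dest via the per-address lock array.
template< class T , class UpdateOp >
__device__ void locked_update_sketch( T * const dest , UpdateOp const & op )
{
  bool done = false ;
  while ( ! done ) {
    if ( Kokkos::Impl::lock_address_cuda_space( (void *) dest ) ) {
      *dest = op( *dest );   // critical section: one thread per lock slot at a time
      Kokkos::Impl::unlock_address_cuda_space( (void *) dest );
      done = true ;
    }
  }
}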
diff --git a/lib/kokkos/core/src/Cuda/Kokkos_CudaSpace.cpp b/lib/kokkos/core/src/Cuda/Kokkos_CudaSpace.cpp
index 46fbf1083..5b397845c 100755
--- a/lib/kokkos/core/src/Cuda/Kokkos_CudaSpace.cpp
+++ b/lib/kokkos/core/src/Cuda/Kokkos_CudaSpace.cpp
@@ -1,591 +1,670 @@
/*
//@HEADER
// ************************************************************************
//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
//
// ************************************************************************
//@HEADER
*/
#include <stdlib.h>
#include <iostream>
#include <sstream>
#include <stdexcept>
#include <Kokkos_Macros.hpp>
/* only compile this file if CUDA is enabled for Kokkos */
#ifdef KOKKOS_HAVE_CUDA
#include <Kokkos_Cuda.hpp>
#include <Kokkos_CudaSpace.hpp>
+#include <Cuda/Kokkos_Cuda_BasicAllocators.hpp>
#include <Cuda/Kokkos_Cuda_Internal.hpp>
-#include <impl/Kokkos_MemoryTracking.hpp>
#include <impl/Kokkos_Error.hpp>
/*--------------------------------------------------------------------------*/
/*--------------------------------------------------------------------------*/
namespace Kokkos {
namespace Impl {
DeepCopy<CudaSpace,CudaSpace>::DeepCopy( void * dst , const void * src , size_t n )
{ CUDA_SAFE_CALL( cudaMemcpy( dst , src , n , cudaMemcpyDefault ) ); }
DeepCopy<CudaSpace,CudaSpace>::DeepCopy( const Cuda & instance , void * dst , const void * src , size_t n )
-{ CUDA_SAFE_CALL( cudaMemcpyAsync( dst , src , n , cudaMemcpyDefault , instance.m_stream ) ); }
+{ CUDA_SAFE_CALL( cudaMemcpyAsync( dst , src , n , cudaMemcpyDefault , instance.cuda_stream() ) ); }
DeepCopy<HostSpace,CudaSpace>::DeepCopy( void * dst , const void * src , size_t n )
{ CUDA_SAFE_CALL( cudaMemcpy( dst , src , n , cudaMemcpyDefault ) ); }
DeepCopy<HostSpace,CudaSpace>::DeepCopy( const Cuda & instance , void * dst , const void * src , size_t n )
-{ CUDA_SAFE_CALL( cudaMemcpyAsync( dst , src , n , cudaMemcpyDefault , instance.m_stream ) ); }
+{ CUDA_SAFE_CALL( cudaMemcpyAsync( dst , src , n , cudaMemcpyDefault , instance.cuda_stream() ) ); }
DeepCopy<CudaSpace,HostSpace>::DeepCopy( void * dst , const void * src , size_t n )
{ CUDA_SAFE_CALL( cudaMemcpy( dst , src , n , cudaMemcpyDefault ) ); }
DeepCopy<CudaSpace,HostSpace>::DeepCopy( const Cuda & instance , void * dst , const void * src , size_t n )
-{ CUDA_SAFE_CALL( cudaMemcpyAsync( dst , src , n , cudaMemcpyDefault , instance.m_stream ) ); }
+{ CUDA_SAFE_CALL( cudaMemcpyAsync( dst , src , n , cudaMemcpyDefault , instance.cuda_stream() ) ); }
} // namespace Impl
} // namespace Kokkos
/*--------------------------------------------------------------------------*/
/*--------------------------------------------------------------------------*/
-namespace Kokkos {
-namespace Impl {
-namespace {
-class CudaMemoryTracking {
-public:
+namespace Kokkos {
- enum SpaceTag { CudaSpaceTag , CudaUVMSpaceTag , CudaHostPinnedSpaceTag };
+namespace {
- struct Attribute {
+void texture_object_attach_impl( Impl::AllocationTracker const & tracker
+ , unsigned type_size
+ , ::cudaChannelFormatDesc const & desc
+ )
+{
+ enum { TEXTURE_BOUND_1D = 2u << 27 };
- Kokkos::Impl::cuda_texture_object_type m_tex_obj ;
- int m_tex_flag ;
+ if ( tracker.attribute() == NULL ) {
+ // check for correct allocator
+ const bool ok_alloc = tracker.allocator()->support_texture_binding();
- Attribute() : m_tex_obj(0), m_tex_flag(0) {}
+ const bool ok_count = (tracker.alloc_size() / type_size) < TEXTURE_BOUND_1D;
- ~Attribute()
- {
- if ( m_tex_flag ) {
- cudaDestroyTextureObject( m_tex_obj );
- m_tex_obj = 0 ;
- m_tex_flag = 0 ;
- }
+ if (ok_alloc && ok_count) {
+ Impl::TextureAttribute * attr = new Impl::TextureAttribute( tracker.alloc_ptr(), tracker.alloc_size(), desc );
+ tracker.set_attribute( attr );
+ }
+ else {
+ std::ostringstream oss;
+ oss << "Error: Cannot attach texture object";
+ if (!ok_alloc) {
+ oss << ", incompatabile allocator " << tracker.allocator()->name();
+ }
+ if (!ok_count) {
+ oss << ", array " << tracker.label() << " too large";
}
+ oss << ".";
+ Kokkos::Impl::throw_runtime_exception( oss.str() );
+ }
+ }
- cudaError create( void * const arg_alloc_ptr
- , size_t const arg_byte_size
- , cudaChannelFormatDesc const & arg_desc
- )
- {
- cudaError cuda_status = cudaSuccess ;
+ if ( NULL == dynamic_cast<Impl::TextureAttribute *>(tracker.attribute()) ) {
+ std::ostringstream oss;
+ oss << "Error: Allocation " << tracker.label() << " already has an attribute attached.";
+ Kokkos::Impl::throw_runtime_exception( oss.str() );
+ }
- if ( 0 == m_tex_flag ) {
-
- cuda_status = cudaDeviceSynchronize();
+}
- if ( cudaSuccess == cuda_status ) {
- struct cudaResourceDesc resDesc ;
- struct cudaTextureDesc texDesc ;
+} // unnamed namespace
- memset( & resDesc , 0 , sizeof(resDesc) );
- memset( & texDesc , 0 , sizeof(texDesc) );
+/*--------------------------------------------------------------------------*/
- resDesc.resType = cudaResourceTypeLinear ;
- resDesc.res.linear.desc = arg_desc ;
- resDesc.res.linear.sizeInBytes = arg_byte_size ;
- resDesc.res.linear.devPtr = arg_alloc_ptr ;
+Impl::AllocationTracker CudaSpace::allocate_and_track( const std::string & label, const size_t size )
+{
+ return Impl::AllocationTracker( allocator(), size, label);
+}
- cuda_status = cudaCreateTextureObject( & m_tex_obj , & resDesc, & texDesc, NULL);
- }
+void CudaSpace::texture_object_attach( Impl::AllocationTracker const & tracker
+ , unsigned type_size
+ , ::cudaChannelFormatDesc const & desc
+ )
+{
+ texture_object_attach_impl( tracker, type_size, desc );
+}
- if ( cudaSuccess == cuda_status ) { cuda_status = cudaDeviceSynchronize(); }
+void CudaSpace::access_error()
+{
+ const std::string msg("Kokkos::CudaSpace::access_error attempt to execute Cuda function from non-Cuda space" );
+ Kokkos::Impl::throw_runtime_exception( msg );
+}
- if ( cudaSuccess == cuda_status ) { m_tex_flag = 1 ; }
- }
+void CudaSpace::access_error( const void * const )
+{
+ const std::string msg("Kokkos::CudaSpace::access_error attempt to execute Cuda function from non-Cuda space" );
+ Kokkos::Impl::throw_runtime_exception( msg );
+}
- return cuda_status ;
- }
- };
+/*--------------------------------------------------------------------------*/
- typedef Kokkos::Impl::MemoryTracking< Attribute > tracking_type ;
- typedef typename Kokkos::Impl::MemoryTracking< Attribute >::Entry entry_type ;
+Impl::AllocationTracker CudaUVMSpace::allocate_and_track( const std::string & label, const size_t size )
+{
+ return Impl::AllocationTracker( allocator(), size, label);
+}
+
+void CudaUVMSpace::texture_object_attach( Impl::AllocationTracker const & tracker
+ , unsigned type_size
+ , ::cudaChannelFormatDesc const & desc
+ )
+{
+ texture_object_attach_impl( tracker, type_size, desc );
+}
- bool available() const
- {
-#if defined( CUDA_VERSION ) && ( 6000 <= CUDA_VERSION )
- enum { UVM_available = true };
+bool CudaUVMSpace::available()
+{
+#if defined( CUDA_VERSION ) && ( 6000 <= CUDA_VERSION ) && !defined(__APPLE__)
+ enum { UVM_available = true };
#else
- enum { UVM_available = false };
+ enum { UVM_available = false };
#endif
+ return UVM_available;
+}
- return ( m_space_tag != CudaUVMSpaceTag ) || UVM_available ;
- }
-
-private:
+/*--------------------------------------------------------------------------*/
- tracking_type m_tracking ;
- SpaceTag const m_space_tag ;
+Impl::AllocationTracker CudaHostPinnedSpace::allocate_and_track( const std::string & label, const size_t size )
+{
+ return Impl::AllocationTracker( allocator(), size, label);
+}
+} // namespace Kokkos
- cudaError cuda_malloc( void ** ptr , size_t byte_size ) const
- {
- cudaError result = cudaSuccess ;
+/*--------------------------------------------------------------------------*/
+/*--------------------------------------------------------------------------*/
- switch( m_space_tag ) {
- case CudaSpaceTag :
- result = cudaMalloc( ptr , byte_size );
- break ;
- case CudaUVMSpaceTag :
-#if defined( CUDA_VERSION ) && ( 6000 <= CUDA_VERSION )
- result = cudaMallocManaged( ptr, byte_size, cudaMemAttachGlobal );
-#else
- Kokkos::Impl::throw_runtime_exception( std::string("CUDA VERSION does not support UVM") );
-#endif
- break ;
- case CudaHostPinnedSpaceTag :
- result = cudaHostAlloc( ptr , byte_size , cudaHostAllocDefault );
- break ;
- }
+namespace Kokkos {
- return result ;
- }
+CudaSpace::CudaSpace()
+ : m_device( Kokkos::Cuda().cuda_device() )
+{
+}
- cudaError cuda_free( void * ptr ) const
- {
- cudaError result = cudaSuccess ;
-
- switch( m_space_tag ) {
- case CudaSpaceTag :
- case CudaUVMSpaceTag :
- result = cudaFree( ptr );
- break ;
- case CudaHostPinnedSpaceTag :
- result = cudaFreeHost( ptr );
- break ;
- }
- return result ;
- }
+CudaUVMSpace::CudaUVMSpace()
+ : m_device( Kokkos::Cuda().cuda_device() )
+{
+}
-public :
+CudaHostPinnedSpace::CudaHostPinnedSpace()
+{
+}
- CudaMemoryTracking( const SpaceTag arg_tag , const char * const arg_label )
- : m_tracking( arg_label )
- , m_space_tag( arg_tag )
- {}
+void * CudaSpace::allocate( const size_t arg_alloc_size ) const
+{
+ void * ptr = NULL;
- void print( std::ostream & oss , const std::string & lead ) const
- { m_tracking.print( oss , lead ); }
+ CUDA_SAFE_CALL( cudaMalloc( &ptr, arg_alloc_size ) );
- const char * query_label( const void * ptr ) const
- {
- static const char error[] = "<NOT FOUND>" ;
- entry_type * const entry = m_tracking.query( ptr );
- return entry ? entry->label() : error ;
- }
+ return ptr ;
+}
- int count(const void * ptr) const {
- entry_type * const entry = m_tracking.query( ptr );
- return entry ? entry->count() : 0 ;
- }
+void * CudaUVMSpace::allocate( const size_t arg_alloc_size ) const
+{
+ void * ptr = NULL;
- void * allocate( const std::string & label , const size_t byte_size )
- {
- void * ptr = 0 ;
+ CUDA_SAFE_CALL( cudaMallocManaged( &ptr, arg_alloc_size , cudaMemAttachGlobal ) );
- if ( byte_size ) {
+ return ptr ;
+}
- const bool ok_parallel = ! HostSpace::in_parallel();
+void * CudaHostPinnedSpace::allocate( const size_t arg_alloc_size ) const
+{
+ void * ptr = NULL;
- cudaError cuda_status = cudaSuccess ;
+ CUDA_SAFE_CALL( cudaHostAlloc( &ptr, arg_alloc_size , cudaHostAllocDefault ) );
- if ( ok_parallel ) {
+ return ptr ;
+}
- cuda_status = cudaDeviceSynchronize();
+void CudaSpace::deallocate( void * const arg_alloc_ptr , const size_t /* arg_alloc_size */ ) const
+{
+ try {
+ CUDA_SAFE_CALL( cudaFree( arg_alloc_ptr ) );
+ } catch(...) {}
+}
- if ( cudaSuccess == cuda_status ) { cuda_status = CudaMemoryTracking::cuda_malloc( & ptr , byte_size ); }
- if ( cudaSuccess == cuda_status ) { cuda_status = cudaDeviceSynchronize(); }
- }
+void CudaUVMSpace::deallocate( void * const arg_alloc_ptr , const size_t /* arg_alloc_size */ ) const
+{
+ try {
+ CUDA_SAFE_CALL( cudaFree( arg_alloc_ptr ) );
+ } catch(...) {}
+}
- if ( ok_parallel && ( cudaSuccess == cuda_status ) ) {
- m_tracking.insert( label , ptr , byte_size );
- }
- else {
- std::ostringstream msg ;
- msg << m_tracking.label()
- << "::allocate( "
- << label
- << " , " << byte_size
- << " ) FAILURE : " ;
- if ( ! ok_parallel ) {
- msg << "called within a parallel functor" ;
- }
- else {
- msg << " CUDA ERROR \"" << cudaGetErrorString(cuda_status) << "\"" ;
- }
- Kokkos::Impl::throw_runtime_exception( msg.str() );
- }
- }
+void CudaHostPinnedSpace::deallocate( void * const arg_alloc_ptr , const size_t /* arg_alloc_size */ ) const
+{
+ try {
+ CUDA_SAFE_CALL( cudaFreeHost( arg_alloc_ptr ) );
+ } catch(...) {}
+}
- return ptr ;
- }
+} // namespace Kokkos
- void decrement( const void * ptr )
- {
- const bool ok_parallel = ! HostSpace::in_parallel();
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
- cudaError cuda_status = cudaSuccess ;
+namespace Kokkos {
+namespace Experimental {
+namespace Impl {
- if ( ok_parallel ) {
+SharedAllocationRecord< void , void >
+SharedAllocationRecord< Kokkos::CudaSpace , void >::s_root_record ;
- cuda_status = cudaDeviceSynchronize();
+SharedAllocationRecord< void , void >
+SharedAllocationRecord< Kokkos::CudaUVMSpace , void >::s_root_record ;
- void * const alloc_ptr = ( cudaSuccess == cuda_status ) ? m_tracking.decrement( ptr ) : (void *) 0 ;
+SharedAllocationRecord< void , void >
+SharedAllocationRecord< Kokkos::CudaHostPinnedSpace , void >::s_root_record ;
- if ( alloc_ptr ) {
- if ( cudaSuccess == cuda_status ) { cuda_status = CudaMemoryTracking::cuda_free( alloc_ptr ); }
- if ( cudaSuccess == cuda_status ) { cuda_status = cudaDeviceSynchronize(); }
- }
- }
+::cudaTextureObject_t
+SharedAllocationRecord< Kokkos::CudaSpace , void >::
+attach_texture_object( const unsigned sizeof_alias
+ , void * const alloc_ptr
+ , size_t const alloc_size )
+{
+ // Only valid for 300 <= __CUDA_ARCH__
+ // otherwise return zero.
- if ( ( ! ok_parallel ) || ( cudaSuccess != cuda_status ) ) {
- std::ostringstream msg ;
- msg << m_tracking.label() << "::decrement( " << ptr << " ) FAILURE : " ;
- if ( ! ok_parallel ) {
- msg << "called within a parallel functor" ;
- }
- else {
- msg << " CUDA ERROR \"" << cudaGetErrorString(cuda_status) << "\"" ;
- }
- std::cerr << msg.str() << std::endl ;
- }
- }
+ ::cudaTextureObject_t tex_obj ;
- void increment( const void * ptr )
- {
- const bool ok_parallel = ! HostSpace::in_parallel();
+ struct cudaResourceDesc resDesc ;
+ struct cudaTextureDesc texDesc ;
- if ( ok_parallel ) {
- m_tracking.increment( ptr );
- }
- else {
- std::ostringstream msg ;
- msg << m_tracking.label() << "::increment(" << ptr
- << ") FAILURE :called within a parallel functor" ;
- Kokkos::Impl::throw_runtime_exception( msg.str() );
- }
- }
+ memset( & resDesc , 0 , sizeof(resDesc) );
+ memset( & texDesc , 0 , sizeof(texDesc) );
+ resDesc.resType = cudaResourceTypeLinear ;
+ resDesc.res.linear.desc = ( sizeof_alias == 4 ? cudaCreateChannelDesc< int >() :
+ ( sizeof_alias == 8 ? cudaCreateChannelDesc< ::int2 >() :
+ /* sizeof_alias == 16 */ cudaCreateChannelDesc< ::int4 >() ) );
+ resDesc.res.linear.sizeInBytes = alloc_size ;
+ resDesc.res.linear.devPtr = alloc_ptr ;
- inline
- void texture_object_attach( const void * const arg_ptr
- , const unsigned arg_type_size
- , const cudaChannelFormatDesc & arg_desc
- , ::cudaTextureObject_t * const arg_tex_obj
- , void const ** const arg_alloc_ptr
- , int * const arg_offset
- )
- {
- static const size_t max_array_len = 1 << 28 ;
-
- *arg_tex_obj = 0 ;
- *arg_alloc_ptr = 0 ;
- *arg_offset = 0 ;
-
- if ( arg_ptr ) {
-
-      // Can only create texture object on device architecture 3.0 or better
- const bool ok_dev_arch = 300 <= Cuda::device_arch();
- const bool ok_parallel = ok_dev_arch && ! HostSpace::in_parallel();
-
- entry_type * const entry = ok_parallel ? m_tracking.query( arg_ptr ) : (entry_type *) 0 ;
-
- const size_t offset = entry ? ( reinterpret_cast<const char*>(arg_ptr) -
- reinterpret_cast<const char*>(entry->m_alloc_ptr) ) : 0 ;
-
- const bool ok_offset = entry && ( 0 == ( offset % arg_type_size ) );
- const bool ok_count = ok_offset && ( entry->m_alloc_size / arg_type_size < max_array_len );
-
- cudaError cuda_status = cudaSuccess ;
-
- if ( ok_count ) {
- cuda_status = entry->m_attribute.create( entry->m_alloc_ptr , entry->m_alloc_size , arg_desc );
- }
-
- if ( cudaSuccess == cuda_status ) {
- *arg_tex_obj = entry->m_attribute.m_tex_obj ;
- *arg_alloc_ptr = entry->m_alloc_ptr ;
- *arg_offset = offset / arg_type_size ;
- }
- else {
- std::ostringstream msg ;
- msg << m_tracking.label()
- << "::texture_object_attach(" << arg_ptr << ") FAILED :" ;
- if ( ! ok_dev_arch ) {
- msg << " cuda architecture " << Cuda::device_arch()
- << " does not support texture objects" ;
- }
- else if ( ! ok_parallel ) {
- msg << " called within a parallel functor" ;
- }
- else if ( 0 == entry ) {
- msg << " pointer not tracked" ;
- }
- else if ( ! ok_offset ) {
- msg << " pointer not properly aligned" ;
- }
- else if ( ! ok_count ) {
- msg << " array too large for texture object" ;
- }
- else {
- msg << " CUDA ERROR \"" << cudaGetErrorString(cuda_status) << "\"" ;
- }
- Kokkos::Impl::throw_runtime_exception( msg.str() );
- }
- }
- }
-};
+ CUDA_SAFE_CALL( cudaCreateTextureObject( & tex_obj , & resDesc, & texDesc, NULL ) );
-//----------------------------------------------------------------------------
+ return tex_obj ;
+}
-CudaMemoryTracking &
-cuda_space_singleton()
+std::string
+SharedAllocationRecord< Kokkos::CudaSpace , void >::get_label() const
{
- static CudaMemoryTracking s( CudaMemoryTracking::CudaSpaceTag , "Kokkos::CudaSpace");
- return s ;
+ SharedAllocationHeader header ;
+
+ Kokkos::Impl::DeepCopy< Kokkos::HostSpace , Kokkos::CudaSpace >( & header , RecordBase::head() , sizeof(SharedAllocationHeader) );
+
+ return std::string( header.m_label );
}
-CudaMemoryTracking &
-cuda_uvm_space_singleton()
+std::string
+SharedAllocationRecord< Kokkos::CudaUVMSpace , void >::get_label() const
{
- static CudaMemoryTracking s( CudaMemoryTracking::CudaUVMSpaceTag , "Kokkos::CudaUVMSpace");
- return s ;
+ return std::string( RecordBase::head()->m_label );
}
-CudaMemoryTracking &
-cuda_host_pinned_space_singleton()
+std::string
+SharedAllocationRecord< Kokkos::CudaHostPinnedSpace , void >::get_label() const
{
- static CudaMemoryTracking s( CudaMemoryTracking::CudaHostPinnedSpaceTag , "Kokkos::CudaHostPinnedSpace");
- return s ;
+ return std::string( RecordBase::head()->m_label );
}
+SharedAllocationRecord< Kokkos::CudaSpace , void > *
+SharedAllocationRecord< Kokkos::CudaSpace , void >::
+allocate( const Kokkos::CudaSpace & arg_space
+ , const std::string & arg_label
+ , const size_t arg_alloc_size
+ )
+{
+ return new SharedAllocationRecord( arg_space , arg_label , arg_alloc_size );
}
-} // namespace Impl
-} // namespace Kokkos
-
-/*--------------------------------------------------------------------------*/
-/*--------------------------------------------------------------------------*/
-
-namespace Kokkos {
-void * CudaSpace::allocate( const std::string & label , const size_t size )
+SharedAllocationRecord< Kokkos::CudaUVMSpace , void > *
+SharedAllocationRecord< Kokkos::CudaUVMSpace , void >::
+allocate( const Kokkos::CudaUVMSpace & arg_space
+ , const std::string & arg_label
+ , const size_t arg_alloc_size
+ )
{
- return Impl::cuda_space_singleton().allocate( label , size );
+ return new SharedAllocationRecord( arg_space , arg_label , arg_alloc_size );
}
-void CudaSpace::decrement( const void * ptr )
+SharedAllocationRecord< Kokkos::CudaHostPinnedSpace , void > *
+SharedAllocationRecord< Kokkos::CudaHostPinnedSpace , void >::
+allocate( const Kokkos::CudaHostPinnedSpace & arg_space
+ , const std::string & arg_label
+ , const size_t arg_alloc_size
+ )
{
- Impl::cuda_space_singleton().decrement( ptr );
+ return new SharedAllocationRecord( arg_space , arg_label , arg_alloc_size );
}
-
-void CudaSpace::increment( const void * ptr )
+void
+SharedAllocationRecord< Kokkos::CudaSpace , void >::
+deallocate( SharedAllocationRecord< void , void > * arg_rec )
{
- Impl::cuda_space_singleton().increment( ptr );
+ delete static_cast<SharedAllocationRecord*>(arg_rec);
}
-void CudaSpace::print_memory_view( std::ostream & oss )
+void
+SharedAllocationRecord< Kokkos::CudaUVMSpace , void >::
+deallocate( SharedAllocationRecord< void , void > * arg_rec )
{
- Impl::cuda_space_singleton().print( oss , std::string(" ") );
+ delete static_cast<SharedAllocationRecord*>(arg_rec);
}
-int CudaSpace::count( const void * ptr ) {
- if ( ! HostSpace::in_parallel() ) {
- return Impl::cuda_space_singleton().count(ptr);
- }
- else {
- Kokkos::Impl::throw_runtime_exception( std::string("Kokkos::CudaSpace::count called within a parallel functor") );
- return -1;
- }
+void
+SharedAllocationRecord< Kokkos::CudaHostPinnedSpace , void >::
+deallocate( SharedAllocationRecord< void , void > * arg_rec )
+{
+ delete static_cast<SharedAllocationRecord*>(arg_rec);
}
-std::string CudaSpace::query_label( const void * p )
+SharedAllocationRecord< Kokkos::CudaSpace , void >::
+~SharedAllocationRecord()
{
- return std::string( Impl::cuda_space_singleton().query_label(p) );
+ m_space.deallocate( SharedAllocationRecord< void , void >::m_alloc_ptr
+ , SharedAllocationRecord< void , void >::m_alloc_size
+ );
}
-void CudaSpace::texture_object_attach( const void * const arg_ptr
- , const unsigned arg_type_size
- , ::cudaChannelFormatDesc const & arg_desc
- , ::cudaTextureObject_t * const arg_tex_obj
- , void const ** const arg_alloc_ptr
- , int * const arg_offset
- )
+SharedAllocationRecord< Kokkos::CudaUVMSpace , void >::
+~SharedAllocationRecord()
{
- Impl::cuda_space_singleton().texture_object_attach( arg_ptr , arg_type_size , arg_desc , arg_tex_obj , arg_alloc_ptr , arg_offset );
+ m_space.deallocate( SharedAllocationRecord< void , void >::m_alloc_ptr
+ , SharedAllocationRecord< void , void >::m_alloc_size
+ );
}
-void CudaSpace::access_error()
+SharedAllocationRecord< Kokkos::CudaHostPinnedSpace , void >::
+~SharedAllocationRecord()
{
- const std::string msg("Kokkos::CudaSpace::access_error attempt to execute Cuda function from non-Cuda space" );
-
- Kokkos::Impl::throw_runtime_exception( msg );
+ m_space.deallocate( SharedAllocationRecord< void , void >::m_alloc_ptr
+ , SharedAllocationRecord< void , void >::m_alloc_size
+ );
}
-void CudaSpace::access_error( const void * const ptr )
+SharedAllocationRecord< Kokkos::CudaSpace , void >::
+SharedAllocationRecord( const Kokkos::CudaSpace & arg_space
+ , const std::string & arg_label
+ , const size_t arg_alloc_size
+ , const SharedAllocationRecord< void , void >::function_type arg_dealloc
+ )
+ // Pass through allocated [ SharedAllocationHeader , user_memory ]
+ // Pass through deallocation function
+ : SharedAllocationRecord< void , void >
+ ( & SharedAllocationRecord< Kokkos::CudaSpace , void >::s_root_record
+ , reinterpret_cast<SharedAllocationHeader*>( arg_space.allocate( sizeof(SharedAllocationHeader) + arg_alloc_size ) )
+ , sizeof(SharedAllocationHeader) + arg_alloc_size
+ , arg_dealloc
+ )
+ , m_tex_obj( 0 )
+ , m_space( arg_space )
{
- std::ostringstream msg ;
- msg << "Kokkos::CudaSpace::access_error:" ;
- msg << " attempt to access Cuda-data labeled(" ;
- msg << query_label( ptr ) ;
- msg << ") from non-Cuda execution" ;
- Kokkos::Impl::throw_runtime_exception( msg.str() );
-}
+ SharedAllocationHeader header ;
-} // namespace Kokkos
+ // Fill in the Header information
+ header.m_record = static_cast< SharedAllocationRecord< void , void > * >( this );
-/*--------------------------------------------------------------------------*/
-/*--------------------------------------------------------------------------*/
+ strncpy( header.m_label
+ , arg_label.c_str()
+ , SharedAllocationHeader::maximum_label_length
+ );
-namespace Kokkos {
+ // Copy to device memory
+ Kokkos::Impl::DeepCopy<CudaSpace,HostSpace>::DeepCopy( RecordBase::m_alloc_ptr , & header , sizeof(SharedAllocationHeader) );
+}
-bool CudaUVMSpace::available()
+SharedAllocationRecord< Kokkos::CudaUVMSpace , void >::
+SharedAllocationRecord( const Kokkos::CudaUVMSpace & arg_space
+ , const std::string & arg_label
+ , const size_t arg_alloc_size
+ , const SharedAllocationRecord< void , void >::function_type arg_dealloc
+ )
+ // Pass through allocated [ SharedAllocationHeader , user_memory ]
+ // Pass through deallocation function
+ : SharedAllocationRecord< void , void >
+ ( & SharedAllocationRecord< Kokkos::CudaUVMSpace , void >::s_root_record
+ , reinterpret_cast<SharedAllocationHeader*>( arg_space.allocate( sizeof(SharedAllocationHeader) + arg_alloc_size ) )
+ , sizeof(SharedAllocationHeader) + arg_alloc_size
+ , arg_dealloc
+ )
+ , m_tex_obj( 0 )
+ , m_space( arg_space )
{
- return Impl::cuda_uvm_space_singleton().available();
+ // Fill in the Header information, directly accessible via UVM
+
+ RecordBase::m_alloc_ptr->m_record = this ;
+
+ strncpy( RecordBase::m_alloc_ptr->m_label
+ , arg_label.c_str()
+ , SharedAllocationHeader::maximum_label_length
+ );
}
-void * CudaUVMSpace::allocate( const std::string & label , const size_t size )
+SharedAllocationRecord< Kokkos::CudaHostPinnedSpace , void >::
+SharedAllocationRecord( const Kokkos::CudaHostPinnedSpace & arg_space
+ , const std::string & arg_label
+ , const size_t arg_alloc_size
+ , const SharedAllocationRecord< void , void >::function_type arg_dealloc
+ )
+ // Pass through allocated [ SharedAllocationHeader , user_memory ]
+ // Pass through deallocation function
+ : SharedAllocationRecord< void , void >
+ ( & SharedAllocationRecord< Kokkos::CudaHostPinnedSpace , void >::s_root_record
+ , reinterpret_cast<SharedAllocationHeader*>( arg_space.allocate( sizeof(SharedAllocationHeader) + arg_alloc_size ) )
+ , sizeof(SharedAllocationHeader) + arg_alloc_size
+ , arg_dealloc
+ )
+ , m_space( arg_space )
{
- return Impl::cuda_uvm_space_singleton().allocate( label , size );
+  // Fill in the Header information, directly accessible on the host (pinned memory)
+
+ RecordBase::m_alloc_ptr->m_record = this ;
+
+ strncpy( RecordBase::m_alloc_ptr->m_label
+ , arg_label.c_str()
+ , SharedAllocationHeader::maximum_label_length
+ );
}
-void CudaUVMSpace::decrement( const void * ptr )
+SharedAllocationRecord< Kokkos::CudaSpace , void > *
+SharedAllocationRecord< Kokkos::CudaSpace , void >::get_record( void * alloc_ptr )
{
- Impl::cuda_uvm_space_singleton().decrement( ptr );
-}
+ using Header = SharedAllocationHeader ;
+ using RecordBase = SharedAllocationRecord< void , void > ;
+ using RecordCuda = SharedAllocationRecord< Kokkos::CudaSpace , void > ;
+#if 0
+ // Copy the header from the allocation
+ SharedAllocationHeader head ;
-void CudaUVMSpace::increment( const void * ptr )
-{
- Impl::cuda_uvm_space_singleton().increment( ptr );
-}
+ SharedAllocationHeader const * const head_cuda = Header::get_header( alloc_ptr );
-int CudaUVMSpace::count( const void * ptr ) {
- if ( ! HostSpace::in_parallel() ) {
- return Impl::cuda_uvm_space_singleton().count(ptr);
+ Kokkos::Impl::DeepCopy<HostSpace,CudaSpace>::DeepCopy( & head , head_cuda , sizeof(SharedAllocationHeader) );
+
+ RecordCuda * const record = static_cast< RecordCuda * >( head.m_record );
+
+ if ( record->m_alloc_ptr != head_cuda ) {
+ Kokkos::Impl::throw_runtime_exception( std::string("Kokkos::Experimental::Impl::SharedAllocationRecord< Kokkos::CudaSpace , void >::get_record ERROR" ) );
}
- else {
- Kokkos::Impl::throw_runtime_exception( std::string("Kokkos::CudaUVMSpace::count called within a parallel functor") );
- return -1;
+
+#else
+
+ // Iterate the list to search for the record among all allocations
+ // requires obtaining the root of the list and then locking the list.
+
+ RecordCuda * const record = static_cast< RecordCuda * >( RecordBase::find( & s_root_record , alloc_ptr ) );
+
+ if ( record == 0 ) {
+ Kokkos::Impl::throw_runtime_exception( std::string("Kokkos::Experimental::Impl::SharedAllocationRecord< Kokkos::CudaSpace , void >::get_record ERROR" ) );
}
-}
-void CudaUVMSpace::print_memory_view( std::ostream & oss )
-{
- Impl::cuda_uvm_space_singleton().print( oss , std::string(" ") );
-}
+#endif
-std::string CudaUVMSpace::query_label( const void * p )
-{
- return std::string( Impl::cuda_uvm_space_singleton().query_label(p) );
+ return record ;
}
-void CudaUVMSpace::texture_object_attach( const void * const arg_ptr
- , const unsigned arg_type_size
- , ::cudaChannelFormatDesc const & arg_desc
- , ::cudaTextureObject_t * const arg_tex_obj
- , void const ** const arg_alloc_ptr
- , int * const arg_offset
- )
+SharedAllocationRecord< Kokkos::CudaUVMSpace , void > *
+SharedAllocationRecord< Kokkos::CudaUVMSpace , void >::get_record( void * alloc_ptr )
{
- Impl::cuda_uvm_space_singleton().texture_object_attach( arg_ptr , arg_type_size , arg_desc , arg_tex_obj , arg_alloc_ptr , arg_offset );
-}
+ using Header = SharedAllocationHeader ;
+ using RecordCuda = SharedAllocationRecord< Kokkos::CudaUVMSpace , void > ;
-} // namespace Kokkos
+ Header * const h = reinterpret_cast< Header * >( alloc_ptr ) - 1 ;
-/*--------------------------------------------------------------------------*/
-/*--------------------------------------------------------------------------*/
+ if ( h->m_record->m_alloc_ptr != h ) {
+ Kokkos::Impl::throw_runtime_exception( std::string("Kokkos::Experimental::Impl::SharedAllocationRecord< Kokkos::CudaUVMSpace , void >::get_record ERROR" ) );
+ }
-namespace Kokkos {
+ return static_cast< RecordCuda * >( h->m_record );
+}
-void * CudaHostPinnedSpace::allocate( const std::string & label , const size_t size )
+SharedAllocationRecord< Kokkos::CudaHostPinnedSpace , void > *
+SharedAllocationRecord< Kokkos::CudaHostPinnedSpace , void >::get_record( void * alloc_ptr )
{
- return Impl::cuda_host_pinned_space_singleton().allocate( label , size );
+ using Header = SharedAllocationHeader ;
+ using RecordCuda = SharedAllocationRecord< Kokkos::CudaHostPinnedSpace , void > ;
+
+ Header * const h = reinterpret_cast< Header * >( alloc_ptr ) - 1 ;
+
+ if ( h->m_record->m_alloc_ptr != h ) {
+ Kokkos::Impl::throw_runtime_exception( std::string("Kokkos::Experimental::Impl::SharedAllocationRecord< Kokkos::CudaHostPinnedSpace , void >::get_record ERROR" ) );
+ }
+
+ return static_cast< RecordCuda * >( h->m_record );
}
-void CudaHostPinnedSpace::decrement( const void * ptr )
+// Iterate records to print orphaned memory ...
+void
+SharedAllocationRecord< Kokkos::CudaSpace , void >::
+print_records( std::ostream & s , const Kokkos::CudaSpace & space , bool detail )
{
- Impl::cuda_host_pinned_space_singleton().decrement( ptr );
-}
+ SharedAllocationRecord< void , void > * r = & s_root_record ;
+ char buffer[256] ;
-void CudaHostPinnedSpace::increment( const void * ptr )
-{
- Impl::cuda_host_pinned_space_singleton().increment( ptr );
-}
+ SharedAllocationHeader head ;
+
+ if ( detail ) {
+ do {
+ if ( r->m_alloc_ptr ) {
+ Kokkos::Impl::DeepCopy<HostSpace,CudaSpace>::DeepCopy( & head , r->m_alloc_ptr , sizeof(SharedAllocationHeader) );
+ }
+ else {
+ head.m_label[0] = 0 ;
+ }
-int CudaHostPinnedSpace::count( const void * ptr ) {
- if ( ! HostSpace::in_parallel() ) {
- return Impl::cuda_uvm_space_singleton().count(ptr);
+ snprintf( buffer , 256 , "Cuda addr( 0x%.12lx ) list( 0x%.12lx 0x%.12lx ) extent[ 0x%.12lx + %.8ld ] count(%d) dealloc(0x%.12lx) %s\n"
+ , reinterpret_cast<unsigned long>( r )
+ , reinterpret_cast<unsigned long>( r->m_prev )
+ , reinterpret_cast<unsigned long>( r->m_next )
+ , reinterpret_cast<unsigned long>( r->m_alloc_ptr )
+ , r->m_alloc_size
+ , r->m_count
+ , reinterpret_cast<unsigned long>( r->m_dealloc )
+ , head.m_label
+ );
+ std::cout << buffer ;
+ r = r->m_next ;
+ } while ( r != & s_root_record );
}
else {
- Kokkos::Impl::throw_runtime_exception( std::string("Kokkos::CudaHostPinnedSpace::count called within a parallel functor") );
- return -1;
+ do {
+ if ( r->m_alloc_ptr ) {
+
+ Kokkos::Impl::DeepCopy<HostSpace,CudaSpace>::DeepCopy( & head , r->m_alloc_ptr , sizeof(SharedAllocationHeader) );
+
+ snprintf( buffer , 256 , "Cuda [ 0x%.12lx + %ld ] %s\n"
+ , reinterpret_cast< unsigned long >( r->data() )
+ , r->size()
+ , head.m_label
+ );
+ }
+ else {
+ snprintf( buffer , 256 , "Cuda [ 0 + 0 ]\n" );
+ }
+ std::cout << buffer ;
+ r = r->m_next ;
+ } while ( r != & s_root_record );
}
}
-void CudaHostPinnedSpace::print_memory_view( std::ostream & oss )
+void
+SharedAllocationRecord< Kokkos::CudaUVMSpace , void >::
+print_records( std::ostream & s , const Kokkos::CudaUVMSpace & space , bool detail )
{
- Impl::cuda_host_pinned_space_singleton().print( oss , std::string(" ") );
+ SharedAllocationRecord< void , void >::print_host_accessible_records( s , "CudaUVM" , & s_root_record , detail );
}
-std::string CudaHostPinnedSpace::query_label( const void * p )
+void
+SharedAllocationRecord< Kokkos::CudaHostPinnedSpace , void >::
+print_records( std::ostream & s , const Kokkos::CudaHostPinnedSpace & space , bool detail )
{
- return std::string( Impl::cuda_host_pinned_space_singleton().query_label(p) );
+ SharedAllocationRecord< void , void >::print_host_accessible_records( s , "CudaHostPinned" , & s_root_record , detail );
}
+} // namespace Impl
+} // namespace Experimental
} // namespace Kokkos
-#endif // KOKKOS_HAVE_CUDA
/*--------------------------------------------------------------------------*/
/*--------------------------------------------------------------------------*/
+namespace Kokkos {
+namespace {
+ __global__ void init_lock_array_kernel() {
+ unsigned i = blockIdx.x*blockDim.x + threadIdx.x;
+
+ if(i<CUDA_SPACE_ATOMIC_MASK+1)
+ kokkos_impl_cuda_atomic_lock_array[i] = 0;
+ }
+}
+
+namespace Impl {
+int* lock_array_cuda_space_ptr(bool deallocate) {
+ static int* ptr = NULL;
+ if(deallocate) {
+ cudaFree(ptr);
+ ptr = NULL;
+ }
+
+ if(ptr==NULL && !deallocate)
+ cudaMalloc(&ptr,sizeof(int)*(CUDA_SPACE_ATOMIC_MASK+1));
+ return ptr;
+}
+
+void init_lock_array_cuda_space() {
+ int is_initialized = 0;
+ if(! is_initialized) {
+ int* lock_array_ptr = lock_array_cuda_space_ptr();
+ cudaMemcpyToSymbol( kokkos_impl_cuda_atomic_lock_array , & lock_array_ptr , sizeof(int*) );
+ init_lock_array_kernel<<<(CUDA_SPACE_ATOMIC_MASK+255)/256,256>>>();
+ }
+}
+
+}
+}
+#endif // KOKKOS_HAVE_CUDA
+
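The block above installs a device-resident lock array that backs arbitrarily sized atomics: the host allocates the array, publishes its pointer through a __device__ symbol with cudaMemcpyToSymbol, and launches a kernel that zeroes every slot. A minimal standalone sketch of that pattern follows; MY_LOCK_MASK, my_lock_array and my_init_lock_array are hypothetical names used only to illustrate the same three steps.

// Sketch (hypothetical names): device-wide lock array for arbitrarily
// sized atomics, mirroring init_lock_array_cuda_space above.
#include <cuda_runtime.h>
#include <cstdio>

#define MY_LOCK_MASK 0x1FFFF                  /* assumed size: 131072 lock slots */

__device__ int * my_lock_array ;              /* device-visible pointer to the locks */

__global__ void my_init_lock_array()
{
  unsigned i = blockIdx.x * blockDim.x + threadIdx.x ;
  if ( i < MY_LOCK_MASK + 1 ) my_lock_array[i] = 0 ;   /* every lock starts unlocked */
}

int main()
{
  int * ptr = NULL ;
  cudaMalloc( & ptr , sizeof(int) * ( MY_LOCK_MASK + 1 ) );        /* host-side allocation */
  cudaMemcpyToSymbol( my_lock_array , & ptr , sizeof(int *) );     /* publish to the device symbol */
  my_init_lock_array<<< ( MY_LOCK_MASK + 256 ) / 256 , 256 >>>();  /* zero every slot */
  cudaDeviceSynchronize();
  printf( "lock array of %d ints initialized\n" , MY_LOCK_MASK + 1 );
  cudaFree( ptr );
  return 0 ;
}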
diff --git a/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Alloc.hpp b/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Alloc.hpp
new file mode 100755
index 000000000..e1314c0e5
--- /dev/null
+++ b/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Alloc.hpp
@@ -0,0 +1,183 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#ifndef KOKKOS_CUDA_ALLOCATION_TRACKING_HPP
+#define KOKKOS_CUDA_ALLOCATION_TRACKING_HPP
+
+#include <Kokkos_Macros.hpp>
+
+/* only compile this file if CUDA is enabled for Kokkos */
+#ifdef KOKKOS_HAVE_CUDA
+
+#include <impl/Kokkos_Traits.hpp>
+#include <impl/Kokkos_AllocationTracker.hpp> // AllocatorAttributeBase
+
+namespace Kokkos {
+namespace Impl {
+
+template< class DestructFunctor >
+SharedAllocationRecord *
+shared_allocation_record( Kokkos::CudaSpace const & arg_space
+ , void * const arg_alloc_ptr
+ , DestructFunctor const & arg_destruct )
+{
+ SharedAllocationRecord * const record = SharedAllocationRecord::get_record( arg_alloc_ptr );
+
+ // assert: record != 0
+
+ // assert: sizeof(DestructFunctor) <= record->m_destruct_size
+
+ // assert: record->m_destruct_function == 0
+
+ DestructFunctor * const functor =
+ reinterpret_cast< DestructFunctor * >(
+ reinterpret_cast< unsigned long >( record ) + sizeof(SharedAllocationRecord) );
+
+ new( functor ) DestructFunctor( arg_destruct );
+
+ record->m_destruct_functor = & shared_allocation_destroy< DestructFunctor > ;
+
+ return record ;
+}
+
+
+/// class CudaUnmanagedAllocator
+/// does nothing when deallocate(ptr,size) is called
+struct CudaUnmanagedAllocator
+{
+ static const char * name()
+ {
+ return "Cuda Unmanaged Allocator";
+ }
+
+ static void deallocate(void * /*ptr*/, size_t /*size*/) {}
+
+ static bool support_texture_binding() { return true; }
+};
+
+/// class CudaUnmanagedUVMAllocator
+/// does nothing when deallocate(ptr,size) is called
+struct CudaUnmanagedUVMAllocator
+{
+ static const char * name()
+ {
+ return "Cuda Unmanaged UVM Allocator";
+ }
+
+ static void deallocate(void * /*ptr*/, size_t /*size*/) {}
+
+ static bool support_texture_binding() { return true; }
+};
+
+/// class CudaUnmanagedHostAllocator
+/// does nothing when deallocate(ptr,size) is called
+class CudaUnmanagedHostAllocator
+{
+public:
+ static const char * name()
+ {
+ return "Cuda Unmanaged Host Allocator";
+ }
+ // Unmanaged deallocate does nothing
+ static void deallocate(void * /*ptr*/, size_t /*size*/) {}
+};
+
+/// class CudaMallocAllocator
+class CudaMallocAllocator
+{
+public:
+ static const char * name()
+ {
+ return "Cuda Malloc Allocator";
+ }
+
+ static void* allocate(size_t size);
+
+ static void deallocate(void * ptr, size_t);
+
+ static void * reallocate(void * old_ptr, size_t old_size, size_t new_size);
+
+ static bool support_texture_binding() { return true; }
+};
+
+/// class CudaUVMAllocator
+class CudaUVMAllocator
+{
+public:
+ static const char * name()
+ {
+ return "Cuda UVM Allocator";
+ }
+
+ static void* allocate(size_t size);
+
+ static void deallocate(void * ptr, size_t);
+
+ static void * reallocate(void * old_ptr, size_t old_size, size_t new_size);
+
+ static bool support_texture_binding() { return true; }
+};
+
+/// class CudaHostAllocator
+class CudaHostAllocator
+{
+public:
+ static const char * name()
+ {
+ return "Cuda Host Allocator";
+ }
+
+ static void* allocate(size_t size);
+
+ static void deallocate(void * ptr, size_t);
+
+ static void * reallocate(void * old_ptr, size_t old_size, size_t new_size);
+};
+
+
+}} // namespace Kokkos::Impl
+
+#endif //KOKKOS_HAVE_CUDA
+
+#endif // #ifndef KOKKOS_CUDA_ALLOCATION_TRACKING_HPP
+
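shared_allocation_record above stores the user-supplied DestructFunctor in the bytes that immediately follow the SharedAllocationRecord, via placement new, and then records a type-erased destroy function. A small self-contained sketch of that layout trick follows; Record, PrintDestroy and call_destroy are hypothetical stand-ins, not types from the patch.

// Sketch (hypothetical types): placement-new of a destroy functor into the
// storage that trails a record header, as shared_allocation_record does above.
#include <cstdio>
#include <cstdlib>
#include <new>

struct Record {                         /* stand-in for SharedAllocationRecord */
  void (*m_destruct)( Record * );       /* set once the functor is in place */
};

struct PrintDestroy {                   /* example destruct functor */
  int m_id ;
  void operator()() const { printf( "destroying payload %d\n" , m_id ); }
};

static void call_destroy( Record * r )
{
  PrintDestroy * f = reinterpret_cast< PrintDestroy * >( r + 1 );  /* functor sits right after the record */
  (*f)();
  f->~PrintDestroy();
}

int main()
{
  /* one allocation holds the record followed by space for the functor */
  Record * r = static_cast< Record * >( malloc( sizeof(Record) + sizeof(PrintDestroy) ) );

  new( r + 1 ) PrintDestroy{ 42 } ;     /* placement-new into the trailing bytes */
  r->m_destruct = & call_destroy ;

  r->m_destruct( r );                   /* later: invoke the stored destroy functor */
  free( r );
  return 0 ;
}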
diff --git a/lib/kokkos/core/src/Cuda/Kokkos_Cuda_BasicAllocators.cpp b/lib/kokkos/core/src/Cuda/Kokkos_Cuda_BasicAllocators.cpp
new file mode 100755
index 000000000..8c8c5e47a
--- /dev/null
+++ b/lib/kokkos/core/src/Cuda/Kokkos_Cuda_BasicAllocators.cpp
@@ -0,0 +1,192 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <Kokkos_Macros.hpp>
+
+/* only compile this file if CUDA is enabled for Kokkos */
+#ifdef KOKKOS_HAVE_CUDA
+
+#include <impl/Kokkos_Error.hpp>
+#include <Cuda/Kokkos_Cuda_BasicAllocators.hpp>
+#include <Cuda/Kokkos_Cuda_Error.hpp>
+
+#include <sstream>
+
+namespace Kokkos { namespace Impl {
+
+
+/*--------------------------------------------------------------------------*/
+TextureAttribute::TextureAttribute( void * const alloc_ptr
+ , size_t alloc_size
+ , cudaChannelFormatDesc const & desc
+ )
+ : m_tex_obj(0)
+{
+ cuda_device_synchronize();
+
+ struct cudaResourceDesc resDesc ;
+ struct cudaTextureDesc texDesc ;
+
+ memset( & resDesc , 0 , sizeof(resDesc) );
+ memset( & texDesc , 0 , sizeof(texDesc) );
+
+ resDesc.resType = cudaResourceTypeLinear ;
+ resDesc.res.linear.desc = desc ;
+ resDesc.res.linear.sizeInBytes = alloc_size ;
+ resDesc.res.linear.devPtr = alloc_ptr ;
+
+ CUDA_SAFE_CALL( cudaCreateTextureObject( & m_tex_obj , & resDesc, & texDesc, NULL) );
+
+ cuda_device_synchronize();
+}
+
+
+TextureAttribute::~TextureAttribute()
+{
+ if (m_tex_obj) {
+ cudaDestroyTextureObject( m_tex_obj );
+ }
+}
+
+/*--------------------------------------------------------------------------*/
+
+void * CudaMallocAllocator::allocate( size_t size )
+{
+ void * ptr = NULL;
+
+ CUDA_SAFE_CALL( cudaMalloc( &ptr, size ) );
+
+ return ptr;
+}
+
+void CudaMallocAllocator::deallocate( void * ptr, size_t /*size*/ )
+{
+ try {
+ CUDA_SAFE_CALL( cudaFree( ptr ) );
+ } catch(...) {}
+}
+
+void * CudaMallocAllocator::reallocate(void * old_ptr, size_t old_size, size_t new_size)
+{
+ void * ptr = old_ptr;
+ if (old_size != new_size) {
+ ptr = allocate( new_size );
+ size_t copy_size = old_size < new_size ? old_size : new_size;
+
+ CUDA_SAFE_CALL( cudaMemcpy( ptr , old_ptr , copy_size , cudaMemcpyDefault ) );
+
+ deallocate( old_ptr, old_size );
+ }
+ return ptr;
+}
+
+/*--------------------------------------------------------------------------*/
+
+void * CudaUVMAllocator::allocate( size_t size )
+{
+#if defined( CUDA_VERSION ) && ( 6000 <= CUDA_VERSION )
+ void * ptr = NULL;
+ CUDA_SAFE_CALL( cudaMallocManaged( &ptr, size, cudaMemAttachGlobal ) );
+ return ptr;
+#else
+ throw_runtime_exception( "CUDA VERSION does not support UVM" );
+ return NULL;
+#endif
+}
+
+void CudaUVMAllocator::deallocate( void * ptr, size_t /*size*/ )
+{
+ try {
+ CUDA_SAFE_CALL( cudaFree( ptr ) );
+ } catch(...) {}
+}
+
+void * CudaUVMAllocator::reallocate(void * old_ptr, size_t old_size, size_t new_size)
+{
+ void * ptr = old_ptr;
+ if (old_size != new_size) {
+ ptr = allocate( new_size );
+ size_t copy_size = old_size < new_size ? old_size : new_size;
+
+ CUDA_SAFE_CALL( cudaMemcpy( ptr , old_ptr , copy_size , cudaMemcpyDefault ) );
+
+ deallocate( old_ptr, old_size );
+ }
+ return ptr;
+}
+
+/*--------------------------------------------------------------------------*/
+
+void * CudaHostAllocator::allocate( size_t size )
+{
+ void * ptr = NULL;
+ CUDA_SAFE_CALL( cudaHostAlloc( &ptr , size , cudaHostAllocDefault ) );
+ return ptr;
+}
+
+void CudaHostAllocator::deallocate( void * ptr, size_t /*size*/ )
+{
+ try {
+ CUDA_SAFE_CALL( cudaFreeHost( ptr ) );
+ } catch(...) {}
+}
+
+void * CudaHostAllocator::reallocate(void * old_ptr, size_t old_size, size_t new_size)
+{
+ void * ptr = old_ptr;
+ if (old_size != new_size) {
+ ptr = allocate( new_size );
+ size_t copy_size = old_size < new_size ? old_size : new_size;
+
+ CUDA_SAFE_CALL( cudaMemcpy( ptr , old_ptr , copy_size , cudaMemcpyHostToHost ) );
+
+ deallocate( old_ptr, old_size );
+ }
+ return ptr;
+}
+
+/*--------------------------------------------------------------------------*/
+
+}} // namespace Kokkos::Impl
+
+#endif //KOKKOS_HAVE_CUDA
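CudaMallocAllocator, CudaUVMAllocator and CudaHostAllocator above all implement reallocate the same way: allocate the new extent, copy min(old, new) bytes, then free the old extent. Below is a minimal sketch of that reallocate-by-copy pattern against the plain CUDA runtime; device_realloc is a hypothetical helper, not part of the patch.

// Sketch (hypothetical helper): reallocate-by-copy with the plain CUDA runtime.
#include <cuda_runtime.h>
#include <cstdio>

static void * device_realloc( void * old_ptr , size_t old_size , size_t new_size )
{
  if ( old_size == new_size ) return old_ptr ;       /* nothing to do */

  void * new_ptr = NULL ;
  cudaMalloc( & new_ptr , new_size );                /* grab the new extent first */

  const size_t copy_size = old_size < new_size ? old_size : new_size ;
  cudaMemcpy( new_ptr , old_ptr , copy_size , cudaMemcpyDefault );  /* preserve old contents */

  cudaFree( old_ptr );                               /* release the old extent last */
  return new_ptr ;
}

int main()
{
  void * p = NULL ;
  cudaMalloc( & p , 64 );
  p = device_realloc( p , 64 , 256 );                /* grow: the first 64 bytes are copied over */
  printf( "grown allocation at %p\n" , p );
  cudaFree( p );
  return 0 ;
}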
diff --git a/lib/kokkos/core/src/Cuda/Kokkos_Cuda_BasicAllocators.hpp b/lib/kokkos/core/src/Cuda/Kokkos_Cuda_BasicAllocators.hpp
new file mode 100755
index 000000000..86fe1c901
--- /dev/null
+++ b/lib/kokkos/core/src/Cuda/Kokkos_Cuda_BasicAllocators.hpp
@@ -0,0 +1,187 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#ifndef KOKKOS_CUDA_BASIC_ALLOCATORS_HPP
+#define KOKKOS_CUDA_BASIC_ALLOCATORS_HPP
+
+#include <Kokkos_Macros.hpp>
+
+/* only compile this file if CUDA is enabled for Kokkos */
+#ifdef KOKKOS_HAVE_CUDA
+
+#include <impl/Kokkos_Traits.hpp>
+#include <impl/Kokkos_AllocationTracker.hpp> // AllocatorAttributeBase
+
+namespace Kokkos { namespace Impl {
+
+
+// Cuda 5.0 <texture_types.h> defines 'cudaTextureObject_t'
+// to be an 'unsigned long long'. This could change with
+// future versions of Cuda and this typedef would have to
+// change accordingly.
+
+#if defined( CUDA_VERSION ) && ( 5000 <= CUDA_VERSION )
+
+typedef enable_if<
+ sizeof(::cudaTextureObject_t) == sizeof(const void *) ,
+ ::cudaTextureObject_t >::type cuda_texture_object_type ;
+
+#else
+
+typedef const void * cuda_texture_object_type ;
+
+#endif
+
+
+struct TextureAttribute : public AllocatorAttributeBase
+{
+ cuda_texture_object_type m_tex_obj ;
+
+ TextureAttribute( void * const alloc_ptr
+ , size_t alloc_size
+ , cudaChannelFormatDesc const & desc
+ );
+
+ ~TextureAttribute();
+};
+
+
+/// class CudaUnmanagedAllocator
+/// does nothing when deallocate(ptr,size) is called
+struct CudaUnmanagedAllocator
+{
+ static const char * name()
+ {
+ return "Cuda Unmanaged Allocator";
+ }
+
+ static void deallocate(void * /*ptr*/, size_t /*size*/) {}
+
+ static bool support_texture_binding() { return true; }
+};
+
+/// class CudaUnmanagedUVMAllocator
+/// does nothing when deallocate(ptr,size) is called
+struct CudaUnmanagedUVMAllocator
+{
+ static const char * name()
+ {
+ return "Cuda Unmanaged UVM Allocator";
+ }
+
+ static void deallocate(void * /*ptr*/, size_t /*size*/) {}
+
+ static bool support_texture_binding() { return true; }
+};
+
+/// class CudaUnmanagedHostAllocator
+/// does nothing when deallocate(ptr,size) is called
+class CudaUnmanagedHostAllocator
+{
+public:
+ static const char * name()
+ {
+ return "Cuda Unmanaged Host Allocator";
+ }
+ // Unmanaged deallocate does nothing
+ static void deallocate(void * /*ptr*/, size_t /*size*/) {}
+};
+
+/// class CudaMallocAllocator
+class CudaMallocAllocator
+{
+public:
+ static const char * name()
+ {
+ return "Cuda Malloc Allocator";
+ }
+
+ static void* allocate(size_t size);
+
+ static void deallocate(void * ptr, size_t);
+
+ static void * reallocate(void * old_ptr, size_t old_size, size_t new_size);
+
+ static bool support_texture_binding() { return true; }
+};
+
+/// class CudaUVMAllocator
+class CudaUVMAllocator
+{
+public:
+ static const char * name()
+ {
+ return "Cuda UVM Allocator";
+ }
+
+ static void* allocate(size_t size);
+
+ static void deallocate(void * ptr, size_t);
+
+ static void * reallocate(void * old_ptr, size_t old_size, size_t new_size);
+
+ static bool support_texture_binding() { return true; }
+};
+
+/// class CudaHostAllocator
+class CudaHostAllocator
+{
+public:
+ static const char * name()
+ {
+ return "Cuda Host Allocator";
+ }
+
+ static void* allocate(size_t size);
+
+ static void deallocate(void * ptr, size_t);
+
+ static void * reallocate(void * old_ptr, size_t old_size, size_t new_size);
+};
+
+
+}} // namespace Kokkos::Impl
+
+#endif //KOKKOS_HAVE_CUDA
+
+#endif //KOKKOS_CUDA_BASIC_ALLOCATORS_HPP
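TextureAttribute above wraps the creation of a cudaTextureObject_t over a linear device allocation: zero a cudaResourceDesc and a cudaTextureDesc, describe the buffer as cudaResourceTypeLinear with a matching channel descriptor, and call cudaCreateTextureObject. A standalone sketch of those steps, assuming a device of compute capability 3.0 or newer and an int buffer, is shown below.

// Sketch: bind a linear int buffer to a cudaTextureObject_t (needs sm_30 or newer).
#include <cuda_runtime.h>
#include <cstring>
#include <cstdio>

int main()
{
  const size_t n = 1 << 20 ;
  int * buf = NULL ;
  cudaMalloc( & buf , n * sizeof(int) );

  struct cudaResourceDesc resDesc ;
  struct cudaTextureDesc  texDesc ;
  memset( & resDesc , 0 , sizeof(resDesc) );
  memset( & texDesc , 0 , sizeof(texDesc) );

  resDesc.resType                = cudaResourceTypeLinear ;
  resDesc.res.linear.devPtr      = buf ;
  resDesc.res.linear.sizeInBytes = n * sizeof(int) ;
  resDesc.res.linear.desc        = cudaCreateChannelDesc<int>() ;  /* 4-byte channel, matches int */

  cudaTextureObject_t tex = 0 ;
  cudaCreateTextureObject( & tex , & resDesc , & texDesc , NULL ); /* texDesc left at defaults */

  printf( "texture object %llu created\n" , (unsigned long long) tex );
  cudaDestroyTextureObject( tex );
  cudaFree( buf );
  return 0 ;
}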
diff --git a/lib/kokkos/core/src/impl/Kokkos_spinwait.hpp b/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Error.hpp
similarity index 67%
copy from lib/kokkos/core/src/impl/Kokkos_spinwait.hpp
copy to lib/kokkos/core/src/Cuda/Kokkos_Cuda_Error.hpp
index 966291abd..a0b29ddc2 100755
--- a/lib/kokkos/core/src/impl/Kokkos_spinwait.hpp
+++ b/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Error.hpp
@@ -1,64 +1,69 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
-
-#ifndef KOKKOS_SPINWAIT_HPP
-#define KOKKOS_SPINWAIT_HPP
+#ifndef KOKKOS_CUDA_ERROR_HPP
+#define KOKKOS_CUDA_ERROR_HPP
#include <Kokkos_Macros.hpp>
-namespace Kokkos {
-namespace Impl {
+/* only compile this file if CUDA is enabled for Kokkos */
+#ifdef KOKKOS_HAVE_CUDA
+
+namespace Kokkos { namespace Impl {
+
+void cuda_device_synchronize();
+
+void cuda_internal_error_throw( cudaError e , const char * name, const char * file = NULL, const int line = 0 );
-#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
-void spinwait( volatile int & flag , const int value );
-#else
-KOKKOS_INLINE_FUNCTION
-void spinwait( volatile int & , const int ) {}
-#endif
+inline void cuda_internal_safe_call( cudaError e , const char * name, const char * file = NULL, const int line = 0)
+{
+ if ( cudaSuccess != e ) { cuda_internal_error_throw( e , name, file, line ); }
+}
-} /* namespace Impl */
-} /* namespace Kokkos */
+#define CUDA_SAFE_CALL( call ) \
+ Kokkos::Impl::cuda_internal_safe_call( call , #call, __FILE__, __LINE__ )
-#endif /* #ifndef KOKKOS_SPINWAIT_HPP */
+}} // namespace Kokkos::Impl
+#endif //KOKKOS_HAVE_CUDA
+#endif //KOKKOS_CUDA_ERROR_HPP
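The header above wraps every runtime call in CUDA_SAFE_CALL, which forwards the stringified call plus file and line to cuda_internal_error_throw when the call fails. A standalone sketch of the same pattern follows; MY_CUDA_CHECK and my_cuda_check are hypothetical stand-ins that throw a std::runtime_error instead of going through Kokkos' exception helper.

// Sketch (hypothetical names): a CUDA_SAFE_CALL-style wrapper that reports
// the failing call together with its source location.
#include <cuda_runtime.h>
#include <sstream>
#include <stdexcept>
#include <cstdio>

inline void my_cuda_check( cudaError_t e , const char * name , const char * file , int line )
{
  if ( e != cudaSuccess ) {
    std::ostringstream out ;
    out << name << " error( " << cudaGetErrorName(e) << " ): "
        << cudaGetErrorString(e) << " at " << file << ":" << line ;
    throw std::runtime_error( out.str() );
  }
}

#define MY_CUDA_CHECK( call ) \
  my_cuda_check( call , #call , __FILE__ , __LINE__ )

int main()
{
  void * p = NULL ;
  MY_CUDA_CHECK( cudaMalloc( & p , 1024 ) );   /* throws with context if this fails */
  MY_CUDA_CHECK( cudaFree( p ) );
  printf( "safe-call checks passed\n" );
  return 0 ;
}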
diff --git a/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Impl.cpp b/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Impl.cpp
index 87a2e95ed..b7c3a62d3 100755
--- a/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Impl.cpp
+++ b/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Impl.cpp
@@ -1,670 +1,678 @@
/*
//@HEADER
// ************************************************************************
//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
//
// ************************************************************************
//@HEADER
*/
/*--------------------------------------------------------------------------*/
/* Kokkos interfaces */
#include <Kokkos_Core.hpp>
/* only compile this file if CUDA is enabled for Kokkos */
#ifdef KOKKOS_HAVE_CUDA
+#include <Cuda/Kokkos_Cuda_Error.hpp>
#include <Cuda/Kokkos_Cuda_Internal.hpp>
+#include <impl/Kokkos_AllocationTracker.hpp>
#include <impl/Kokkos_Error.hpp>
/*--------------------------------------------------------------------------*/
/* Standard 'C' libraries */
#include <stdlib.h>
/* Standard 'C++' libraries */
#include <vector>
#include <iostream>
#include <sstream>
#include <string>
#ifdef KOKKOS_CUDA_USE_RELOCATABLE_DEVICE_CODE
__device__ __constant__
Kokkos::Impl::CudaTraits::ConstantGlobalBufferType
kokkos_impl_cuda_constant_memory_buffer ;
#endif
/*--------------------------------------------------------------------------*/
namespace Kokkos {
namespace Impl {
namespace {
__global__
void query_cuda_kernel_arch( int * d_arch )
{
#if defined( __CUDA_ARCH__ )
*d_arch = __CUDA_ARCH__ ;
#else
*d_arch = 0 ;
#endif
}
/** Query what compute capability is actually launched to the device: */
int cuda_kernel_arch()
{
int * d_arch = 0 ;
cudaMalloc( (void **) & d_arch , sizeof(int) );
query_cuda_kernel_arch<<<1,1>>>( d_arch );
int arch = 0 ;
cudaMemcpy( & arch , d_arch , sizeof(int) , cudaMemcpyDefault );
cudaFree( d_arch );
return arch ;
}
bool cuda_launch_blocking()
{
const char * env = getenv("CUDA_LAUNCH_BLOCKING");
if (env == 0) return false;
return atoi(env);
}
}
void cuda_device_synchronize()
{
- static const bool launch_blocking = cuda_launch_blocking();
+// static const bool launch_blocking = cuda_launch_blocking();
- if (!launch_blocking) {
+// if (!launch_blocking) {
CUDA_SAFE_CALL( cudaDeviceSynchronize() );
- }
+// }
}
void cuda_internal_error_throw( cudaError e , const char * name, const char * file, const int line )
{
std::ostringstream out ;
- out << name << " error: " << cudaGetErrorString(e);
+ out << name << " error( " << cudaGetErrorName(e) << "): " << cudaGetErrorString(e);
if (file) {
out << " " << file << ":" << line;
}
throw_runtime_exception( out.str() );
}
//----------------------------------------------------------------------------
// Some significant cuda device properties:
//
// cudaDeviceProp::name : Text label for device
// cudaDeviceProp::major : Device major number
// cudaDeviceProp::minor : Device minor number
// cudaDeviceProp::warpSize : number of threads per warp
// cudaDeviceProp::multiProcessorCount : number of multiprocessors
// cudaDeviceProp::sharedMemPerBlock : capacity of shared memory per block
// cudaDeviceProp::totalConstMem : capacity of constant memory
// cudaDeviceProp::totalGlobalMem : capacity of global memory
// cudaDeviceProp::maxGridSize[3] : maximum grid size
//
// Section 4.4.2.4 of the CUDA Toolkit Reference Manual
//
// struct cudaDeviceProp {
// char name[256];
// size_t totalGlobalMem;
// size_t sharedMemPerBlock;
// int regsPerBlock;
// int warpSize;
// size_t memPitch;
// int maxThreadsPerBlock;
// int maxThreadsDim[3];
// int maxGridSize[3];
// size_t totalConstMem;
// int major;
// int minor;
// int clockRate;
// size_t textureAlignment;
// int deviceOverlap;
// int multiProcessorCount;
// int kernelExecTimeoutEnabled;
// int integrated;
// int canMapHostMemory;
// int computeMode;
// int concurrentKernels;
// int ECCEnabled;
// int pciBusID;
// int pciDeviceID;
// int tccDriver;
// int asyncEngineCount;
// int unifiedAddressing;
// int memoryClockRate;
// int memoryBusWidth;
// int l2CacheSize;
// int maxThreadsPerMultiProcessor;
// };
namespace {
class CudaInternalDevices {
public:
enum { MAXIMUM_DEVICE_COUNT = 8 };
struct cudaDeviceProp m_cudaProp[ MAXIMUM_DEVICE_COUNT ] ;
int m_cudaDevCount ;
CudaInternalDevices();
static const CudaInternalDevices & singleton();
};
CudaInternalDevices::CudaInternalDevices()
{
// See 'cudaSetDeviceFlags' for host-device thread interaction
// Section 4.4.2.6 of the CUDA Toolkit Reference Manual
CUDA_SAFE_CALL (cudaGetDeviceCount( & m_cudaDevCount ) );
for ( int i = 0 ; i < m_cudaDevCount ; ++i ) {
CUDA_SAFE_CALL( cudaGetDeviceProperties( m_cudaProp + i , i ) );
}
}
const CudaInternalDevices & CudaInternalDevices::singleton()
{
static CudaInternalDevices self ; return self ;
}
}
//----------------------------------------------------------------------------
class CudaInternal {
private:
CudaInternal( const CudaInternal & );
CudaInternal & operator = ( const CudaInternal & );
+ AllocationTracker m_scratchFlagsTracker;
+ AllocationTracker m_scratchSpaceTracker;
+ AllocationTracker m_scratchUnifiedTracker;
+
+
public:
typedef Cuda::size_type size_type ;
int m_cudaDev ;
int m_cudaArch ;
unsigned m_maxWarpCount ;
unsigned m_maxBlock ;
unsigned m_maxSharedWords ;
size_type m_scratchSpaceCount ;
size_type m_scratchFlagsCount ;
size_type m_scratchUnifiedCount ;
size_type m_scratchUnifiedSupported ;
size_type m_streamCount ;
size_type * m_scratchSpace ;
size_type * m_scratchFlags ;
size_type * m_scratchUnified ;
cudaStream_t * m_stream ;
static CudaInternal & singleton();
int verify_is_initialized( const char * const label ) const ;
int is_initialized() const
{ return 0 != m_scratchSpace && 0 != m_scratchFlags ; }
void initialize( int cuda_device_id , int stream_count );
void finalize();
void print_configuration( std::ostream & ) const ;
~CudaInternal();
CudaInternal()
: m_cudaDev( -1 )
, m_cudaArch( -1 )
, m_maxWarpCount( 0 )
- , m_maxBlock( 0 )
+ , m_maxBlock( 0 )
, m_maxSharedWords( 0 )
, m_scratchSpaceCount( 0 )
, m_scratchFlagsCount( 0 )
, m_scratchUnifiedCount( 0 )
, m_scratchUnifiedSupported( 0 )
, m_streamCount( 0 )
, m_scratchSpace( 0 )
, m_scratchFlags( 0 )
, m_scratchUnified( 0 )
, m_stream( 0 )
{}
size_type * scratch_space( const size_type size );
size_type * scratch_flags( const size_type size );
size_type * scratch_unified( const size_type size );
};
//----------------------------------------------------------------------------
void CudaInternal::print_configuration( std::ostream & s ) const
{
const CudaInternalDevices & dev_info = CudaInternalDevices::singleton();
#if defined( KOKKOS_HAVE_CUDA )
s << "macro KOKKOS_HAVE_CUDA : defined" << std::endl ;
#endif
#if defined( CUDA_VERSION )
s << "macro CUDA_VERSION = " << CUDA_VERSION
<< " = version " << CUDA_VERSION / 1000
<< "." << ( CUDA_VERSION % 1000 ) / 10
<< std::endl ;
#endif
for ( int i = 0 ; i < dev_info.m_cudaDevCount ; ++i ) {
s << "Kokkos::Cuda[ " << i << " ] "
<< dev_info.m_cudaProp[i].name
<< " capability " << dev_info.m_cudaProp[i].major << "." << dev_info.m_cudaProp[i].minor
- << ", Total Global Memory: " << human_memory_size(dev_info.m_cudaProp[i].totalGlobalMem)
+ << ", Total Global Memory: " << human_memory_size(dev_info.m_cudaProp[i].totalGlobalMem)
<< ", Shared Memory per Block: " << human_memory_size(dev_info.m_cudaProp[i].sharedMemPerBlock);
if ( m_cudaDev == i ) s << " : Selected" ;
s << std::endl ;
}
}
//----------------------------------------------------------------------------
CudaInternal::~CudaInternal()
{
if ( m_stream ||
m_scratchSpace ||
m_scratchFlags ||
m_scratchUnified ) {
std::cerr << "Kokkos::Cuda ERROR: Failed to call Kokkos::Cuda::finalize()"
<< std::endl ;
std::cerr.flush();
}
m_cudaDev = -1 ;
m_cudaArch = -1 ;
m_maxWarpCount = 0 ;
m_maxBlock = 0 ;
m_maxSharedWords = 0 ;
m_scratchSpaceCount = 0 ;
m_scratchFlagsCount = 0 ;
m_scratchUnifiedCount = 0 ;
m_scratchUnifiedSupported = 0 ;
m_streamCount = 0 ;
m_scratchSpace = 0 ;
m_scratchFlags = 0 ;
m_scratchUnified = 0 ;
m_stream = 0 ;
}
int CudaInternal::verify_is_initialized( const char * const label ) const
{
if ( m_cudaDev < 0 ) {
std::cerr << "Kokkos::Cuda::" << label << " : ERROR device not initialized" << std::endl ;
}
return 0 <= m_cudaDev ;
}
CudaInternal & CudaInternal::singleton()
{
static CudaInternal self ;
return self ;
}
void CudaInternal::initialize( int cuda_device_id , int stream_count )
{
enum { WordSize = sizeof(size_type) };
if ( ! HostSpace::execution_space::is_initialized() ) {
const std::string msg("Cuda::initialize ERROR : HostSpace::execution_space is not initialized");
throw_runtime_exception( msg );
}
const CudaInternalDevices & dev_info = CudaInternalDevices::singleton();
const bool ok_init = 0 == m_scratchSpace || 0 == m_scratchFlags ;
const bool ok_id = 0 <= cuda_device_id &&
cuda_device_id < dev_info.m_cudaDevCount ;
// Need device capability 2.0 or better
const bool ok_dev = ok_id &&
( 2 <= dev_info.m_cudaProp[ cuda_device_id ].major &&
0 <= dev_info.m_cudaProp[ cuda_device_id ].minor );
if ( ok_init && ok_dev ) {
const struct cudaDeviceProp & cudaProp =
dev_info.m_cudaProp[ cuda_device_id ];
m_cudaDev = cuda_device_id ;
CUDA_SAFE_CALL( cudaSetDevice( m_cudaDev ) );
CUDA_SAFE_CALL( cudaDeviceReset() );
Kokkos::Impl::cuda_device_synchronize();
// Query what compute capability architecture a kernel executes:
m_cudaArch = cuda_kernel_arch();
if ( m_cudaArch != cudaProp.major * 100 + cudaProp.minor * 10 ) {
std::cerr << "Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability "
<< ( m_cudaArch / 100 ) << "." << ( ( m_cudaArch % 100 ) / 10 )
<< " on device with compute capability "
<< cudaProp.major << "." << cudaProp.minor
<< " , this will likely reduce potential performance."
<< std::endl ;
}
//----------------------------------
// Maximum number of warps,
// at most one warp per thread in a warp for reduction.
// HCE 2012-February :
// Found bug in CUDA 4.1 that sometimes a kernel launch would fail
// if the thread count == 1024 and a functor is passed to the kernel.
// Copying the kernel to constant memory and then launching with
// thread count == 1024 would work fine.
//
// HCE 2012-October :
// All compute capabilities support at least 16 warps (512 threads).
// However, we have found that 8 warps typically gives better performance.
m_maxWarpCount = 8 ;
// m_maxWarpCount = cudaProp.maxThreadsPerBlock / Impl::CudaTraits::WarpSize ;
if ( Impl::CudaTraits::WarpSize < m_maxWarpCount ) {
m_maxWarpCount = Impl::CudaTraits::WarpSize ;
}
m_maxSharedWords = cudaProp.sharedMemPerBlock / WordSize ;
//----------------------------------
// Maximum number of blocks:
m_maxBlock = m_cudaArch < 300 ? 65535 : cudaProp.maxGridSize[0] ;
//----------------------------------
m_scratchUnifiedSupported = cudaProp.unifiedAddressing ;
if ( ! m_scratchUnifiedSupported ) {
std::cout << "Kokkos::Cuda device "
<< cudaProp.name << " capability "
<< cudaProp.major << "." << cudaProp.minor
<< " does not support unified virtual address space"
<< std::endl ;
}
//----------------------------------
// Multiblock reduction uses scratch flags for counters
// and scratch space for partial reduction values.
// Allocate some initial space. This will grow as needed.
{
const unsigned reduce_block_count = m_maxWarpCount * Impl::CudaTraits::WarpSize ;
(void) scratch_unified( 16 * sizeof(size_type) );
(void) scratch_flags( reduce_block_count * 2 * sizeof(size_type) );
(void) scratch_space( reduce_block_count * 16 * sizeof(size_type) );
}
//----------------------------------
if ( stream_count ) {
- m_stream = (cudaStream_t*) malloc( stream_count * sizeof(cudaStream_t) );
+ m_stream = (cudaStream_t*) ::malloc( stream_count * sizeof(cudaStream_t) );
m_streamCount = stream_count ;
for ( size_type i = 0 ; i < m_streamCount ; ++i ) m_stream[i] = 0 ;
}
}
else {
std::ostringstream msg ;
msg << "Kokkos::Cuda::initialize(" << cuda_device_id << ") FAILED" ;
if ( ! ok_init ) {
msg << " : Already initialized" ;
}
if ( ! ok_id ) {
msg << " : Device identifier out of range "
<< "[0.." << dev_info.m_cudaDevCount << "]" ;
}
else if ( ! ok_dev ) {
msg << " : Device " ;
msg << dev_info.m_cudaProp[ cuda_device_id ].major ;
msg << "." ;
msg << dev_info.m_cudaProp[ cuda_device_id ].minor ;
msg << " has insufficient capability, required 2.0 or better" ;
}
Kokkos::Impl::throw_runtime_exception( msg.str() );
}
+
+  // Init the array used for arbitrarily sized atomics
+ Impl::init_lock_array_cuda_space();
+
}
//----------------------------------------------------------------------------
typedef Cuda::size_type ScratchGrain[ Impl::CudaTraits::WarpSize ] ;
enum { sizeScratchGrain = sizeof(ScratchGrain) };
Cuda::size_type *
CudaInternal::scratch_flags( const Cuda::size_type size )
{
if ( verify_is_initialized("scratch_flags") && m_scratchFlagsCount * sizeScratchGrain < size ) {
- CudaSpace::decrement( m_scratchFlags );
-
+
m_scratchFlagsCount = ( size + sizeScratchGrain - 1 ) / sizeScratchGrain ;
- m_scratchFlags = (size_type *)
- CudaSpace::allocate( std::string("InternalScratchFlags") , sizeof( ScratchGrain ) * m_scratchFlagsCount );
+ m_scratchFlagsTracker = CudaSpace::allocate_and_track( std::string("InternalScratchFlags") , sizeof( ScratchGrain ) * m_scratchFlagsCount );
+ m_scratchFlags = reinterpret_cast<size_type *>(m_scratchFlagsTracker.alloc_ptr());
CUDA_SAFE_CALL( cudaMemset( m_scratchFlags , 0 , m_scratchFlagsCount * sizeScratchGrain ) );
}
return m_scratchFlags ;
}
Cuda::size_type *
CudaInternal::scratch_space( const Cuda::size_type size )
{
if ( verify_is_initialized("scratch_space") && m_scratchSpaceCount * sizeScratchGrain < size ) {
- CudaSpace::decrement( m_scratchSpace );
-
m_scratchSpaceCount = ( size + sizeScratchGrain - 1 ) / sizeScratchGrain ;
- m_scratchSpace = (size_type *)
- CudaSpace::allocate( std::string("InternalScratchSpace") , sizeof( ScratchGrain ) * m_scratchSpaceCount );
+ m_scratchSpaceTracker = CudaSpace::allocate_and_track( std::string("InternalScratchSpace") , sizeof( ScratchGrain ) * m_scratchSpaceCount );
+ m_scratchSpace = reinterpret_cast<size_type *>(m_scratchSpaceTracker.alloc_ptr());
+
}
return m_scratchSpace ;
}
Cuda::size_type *
CudaInternal::scratch_unified( const Cuda::size_type size )
{
if ( verify_is_initialized("scratch_unified") &&
m_scratchUnifiedSupported && m_scratchUnifiedCount * sizeScratchGrain < size ) {
- CudaHostPinnedSpace::decrement( m_scratchUnified );
-
m_scratchUnifiedCount = ( size + sizeScratchGrain - 1 ) / sizeScratchGrain ;
- m_scratchUnified = (size_type *)
- CudaHostPinnedSpace::allocate( std::string("InternalScratchUnified") , sizeof( ScratchGrain ) * m_scratchUnifiedCount );
+ m_scratchUnifiedTracker = CudaHostPinnedSpace::allocate_and_track( std::string("InternalScratchUnified") , sizeof( ScratchGrain ) * m_scratchUnifiedCount );
+ m_scratchUnified = reinterpret_cast<size_type *>( m_scratchUnifiedTracker.alloc_ptr() );
}
return m_scratchUnified ;
}
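// Worked example (illustrative): with WarpSize == 32 and a 4-byte size_type,
// sizeof(ScratchGrain) is 128 bytes.  A request for 200 bytes therefore rounds
// up to ( 200 + 128 - 1 ) / 128 == 2 grains, i.e. 256 bytes, and the scratch
// allocation only grows again when a larger request arrives.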
//----------------------------------------------------------------------------
void CudaInternal::finalize()
{
if ( 0 != m_scratchSpace || 0 != m_scratchFlags ) {
+ lock_array_cuda_space_ptr(true);
if ( m_stream ) {
for ( size_type i = 1 ; i < m_streamCount ; ++i ) {
cudaStreamDestroy( m_stream[i] );
m_stream[i] = 0 ;
}
- free( m_stream );
+ ::free( m_stream );
}
- CudaSpace::decrement( m_scratchSpace );
- CudaSpace::decrement( m_scratchFlags );
- CudaHostPinnedSpace::decrement( m_scratchUnified );
-
+ m_scratchSpaceTracker.clear();
+ m_scratchFlagsTracker.clear();
+ m_scratchUnifiedTracker.clear();
+
m_cudaDev = -1 ;
m_maxWarpCount = 0 ;
- m_maxBlock = 0 ;
+ m_maxBlock = 0 ;
m_maxSharedWords = 0 ;
m_scratchSpaceCount = 0 ;
m_scratchFlagsCount = 0 ;
m_scratchUnifiedCount = 0 ;
m_streamCount = 0 ;
m_scratchSpace = 0 ;
m_scratchFlags = 0 ;
m_scratchUnified = 0 ;
m_stream = 0 ;
}
}
//----------------------------------------------------------------------------
Cuda::size_type cuda_internal_maximum_warp_count()
{ return CudaInternal::singleton().m_maxWarpCount ; }
Cuda::size_type cuda_internal_maximum_grid_count()
{ return CudaInternal::singleton().m_maxBlock ; }
Cuda::size_type cuda_internal_maximum_shared_words()
{ return CudaInternal::singleton().m_maxSharedWords ; }
Cuda::size_type * cuda_internal_scratch_space( const Cuda::size_type size )
{ return CudaInternal::singleton().scratch_space( size ); }
Cuda::size_type * cuda_internal_scratch_flags( const Cuda::size_type size )
{ return CudaInternal::singleton().scratch_flags( size ); }
Cuda::size_type * cuda_internal_scratch_unified( const Cuda::size_type size )
{ return CudaInternal::singleton().scratch_unified( size ); }
} // namespace Impl
} // namespace Kokkos
//----------------------------------------------------------------------------
namespace Kokkos {
Cuda::size_type Cuda::detect_device_count()
{ return Impl::CudaInternalDevices::singleton().m_cudaDevCount ; }
int Cuda::is_initialized()
{ return Impl::CudaInternal::singleton().is_initialized(); }
void Cuda::initialize( const Cuda::SelectDevice config , size_t num_instances )
{ Impl::CudaInternal::singleton().initialize( config.cuda_device_id , num_instances ); }
std::vector<unsigned>
Cuda::detect_device_arch()
{
const Impl::CudaInternalDevices & s = Impl::CudaInternalDevices::singleton();
std::vector<unsigned> output( s.m_cudaDevCount );
for ( int i = 0 ; i < s.m_cudaDevCount ; ++i ) {
output[i] = s.m_cudaProp[i].major * 100 + s.m_cudaProp[i].minor ;
}
return output ;
}
Cuda::size_type Cuda::device_arch()
{
const int dev_id = Impl::CudaInternal::singleton().m_cudaDev ;
int dev_arch = 0 ;
if ( 0 <= dev_id ) {
const struct cudaDeviceProp & cudaProp =
Impl::CudaInternalDevices::singleton().m_cudaProp[ dev_id ] ;
dev_arch = cudaProp.major * 100 + cudaProp.minor ;
}
return dev_arch ;
}
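// Example (illustrative): a device of compute capability 3.5 is reported by
// device_arch() and detect_device_arch() as 3 * 100 + 5 == 305; an uninitialized
// backend reports 0.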
void Cuda::finalize()
{ Impl::CudaInternal::singleton().finalize(); }
Cuda::Cuda()
: m_device( Impl::CudaInternal::singleton().m_cudaDev )
, m_stream( 0 )
{
Impl::CudaInternal::singleton().verify_is_initialized( "Cuda instance constructor" );
}
Cuda::Cuda( const int instance_id )
: m_device( Impl::CudaInternal::singleton().m_cudaDev )
, m_stream(
Impl::CudaInternal::singleton().verify_is_initialized( "Cuda instance constructor" )
? Impl::CudaInternal::singleton().m_stream[ instance_id % Impl::CudaInternal::singleton().m_streamCount ]
: 0 )
{}
void Cuda::print_configuration( std::ostream & s , const bool )
{ Impl::CudaInternal::singleton().print_configuration( s ); }
bool Cuda::sleep() { return false ; }
bool Cuda::wake() { return true ; }
void Cuda::fence()
-{
+{
Kokkos::Impl::cuda_device_synchronize();
}
} // namespace Kokkos
#endif // KOKKOS_HAVE_CUDA
//----------------------------------------------------------------------------
diff --git a/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Internal.hpp b/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Internal.hpp
index 284e71dec..dd8a08729 100755
--- a/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Internal.hpp
+++ b/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Internal.hpp
@@ -1,171 +1,165 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_CUDA_INTERNAL_HPP
#define KOKKOS_CUDA_INTERNAL_HPP
-namespace Kokkos {
-namespace Impl {
+#include <Kokkos_Macros.hpp>
-void cuda_internal_error_throw( cudaError e , const char * name, const char * file = NULL, const int line = 0 );
+/* only compile this file if CUDA is enabled for Kokkos */
+#ifdef KOKKOS_HAVE_CUDA
-void cuda_device_synchronize();
+#include <Cuda/Kokkos_Cuda_Error.hpp>
+
+namespace Kokkos { namespace Impl {
-inline
-void cuda_internal_safe_call( cudaError e , const char * name, const char * file = NULL, const int line = 0)
-{
- if ( cudaSuccess != e ) { cuda_internal_error_throw( e , name, file, line ); }
-}
template<class DriverType>
int cuda_get_max_block_size(const typename DriverType::functor_type & f) {
#if ( CUDA_VERSION < 6050 )
return 256;
#else
bool Large = ( CudaTraits::ConstantMemoryUseThreshold < sizeof(DriverType) );
int numBlocks;
if(Large) {
int blockSize=32;
int sharedmem = FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize );
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_constant_memory<DriverType>,
blockSize,
sharedmem);
while (blockSize<1024 && numBlocks>0) {
blockSize*=2;
sharedmem = FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize );
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_constant_memory<DriverType>,
blockSize,
sharedmem);
}
if(numBlocks>0) return blockSize;
else return blockSize/2;
} else {
int blockSize=32;
int sharedmem = FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize );
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_local_memory<DriverType>,
blockSize,
sharedmem);
while (blockSize<1024 && numBlocks>0) {
blockSize*=2;
sharedmem = FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize );
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_local_memory<DriverType>,
blockSize,
sharedmem);
}
if(numBlocks>0) return blockSize;
else return blockSize/2;
}
#endif
}
template<class DriverType>
int cuda_get_opt_block_size(const typename DriverType::functor_type & f) {
#if ( CUDA_VERSION < 6050 )
return 256;
#else
bool Large = ( CudaTraits::ConstantMemoryUseThreshold < sizeof(DriverType) );
int blockSize=16;
int numBlocks;
int sharedmem;
int maxOccupancy=0;
int bestBlockSize=0;
if(Large) {
while(blockSize<1024) {
blockSize*=2;
//calculate the occupancy for this blockSize and check whether it is larger than the largest one found so far
sharedmem = FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize );
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_constant_memory<DriverType>,
blockSize,
sharedmem);
if(maxOccupancy < numBlocks*blockSize) {
maxOccupancy = numBlocks*blockSize;
bestBlockSize = blockSize;
}
}
} else {
while(blockSize<1024) {
blockSize*=2;
sharedmem = FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize );
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_local_memory<DriverType>,
blockSize,
sharedmem);
if(maxOccupancy < numBlocks*blockSize) {
maxOccupancy = numBlocks*blockSize;
bestBlockSize = blockSize;
}
}
}
return bestBlockSize;
#endif
}
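/* Illustrative sketch (my_kernel is a hypothetical __global__ function; CUDA >= 6.5
 * assumed): both helpers above rank candidate block sizes by the occupancy metric
 * numBlocks * blockSize returned by the runtime API.
 *
 *   int numBlocks = 0;
 *   const int blockSize = 128;
 *   const size_t sharedmem = 0;
 *   cudaOccupancyMaxActiveBlocksPerMultiprocessor( &numBlocks ,
 *                                                  my_kernel ,
 *                                                  blockSize ,
 *                                                  sharedmem );
 *   const int resident_threads_per_sm = numBlocks * blockSize;
 */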
-}
-}
-
-#define CUDA_SAFE_CALL( call ) \
- Kokkos::Impl::cuda_internal_safe_call( call , #call, __FILE__, __LINE__ )
+}} // namespace Kokkos::Impl
+#endif // KOKKOS_HAVE_CUDA
#endif /* #ifndef KOKKOS_CUDA_INTERNAL_HPP */
diff --git a/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Parallel.hpp b/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Parallel.hpp
index 1faf52ba6..ce33c978c 100755
--- a/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Parallel.hpp
+++ b/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Parallel.hpp
@@ -1,1591 +1,1799 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_CUDA_PARALLEL_HPP
#define KOKKOS_CUDA_PARALLEL_HPP
#include <iostream>
#include <stdio.h>
-#if defined( __CUDACC__ )
+#include <Kokkos_Macros.hpp>
+
+/* only compile this file if CUDA is enabled for Kokkos */
+#if defined( __CUDACC__ ) && defined( KOKKOS_HAVE_CUDA )
#include <utility>
#include <Kokkos_Parallel.hpp>
#include <Cuda/Kokkos_CudaExec.hpp>
#include <Cuda/Kokkos_Cuda_ReduceScan.hpp>
#include <Cuda/Kokkos_Cuda_Internal.hpp>
#include <Kokkos_Vectorization.hpp>
+
+#ifdef KOKKOSP_ENABLE_PROFILING
+#include <impl/Kokkos_Profiling_Interface.hpp>
+#include <typeinfo>
+#endif
+
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
template< typename Type >
struct CudaJoinFunctor {
typedef Type value_type ;
KOKKOS_INLINE_FUNCTION
static void join( volatile value_type & update ,
volatile const value_type & input )
{ update += input ; }
};
class CudaTeamMember {
private:
typedef Kokkos::Cuda execution_space ;
typedef execution_space::scratch_memory_space scratch_memory_space ;
void * m_team_reduce ;
scratch_memory_space m_team_shared ;
int m_league_rank ;
int m_league_size ;
public:
#if defined( __CUDA_ARCH__ )
__device__ inline
const execution_space::scratch_memory_space & team_shmem() const
{ return m_team_shared ; }
__device__ inline int league_rank() const { return m_league_rank ; }
__device__ inline int league_size() const { return m_league_size ; }
__device__ inline int team_rank() const { return threadIdx.y ; }
__device__ inline int team_size() const { return blockDim.y ; }
__device__ inline void team_barrier() const { __syncthreads(); }
template<class ValueType>
__device__ inline void team_broadcast(ValueType& value, const int& thread_id) const {
__shared__ ValueType sh_val;
if(threadIdx.x == 0 && threadIdx.y == thread_id) {
sh_val = value;
}
team_barrier();
value = sh_val;
}
#ifdef KOKKOS_HAVE_CXX11
template< class ValueType, class JoinOp >
__device__ inline
typename JoinOp::value_type team_reduce( const ValueType & value
, const JoinOp & op_in ) const
{
typedef JoinLambdaAdapter<ValueType,JoinOp> JoinOpFunctor ;
const JoinOpFunctor op(op_in);
ValueType * const base_data = (ValueType *) m_team_reduce ;
#else
template< class JoinOp >
__device__ inline
typename JoinOp::value_type team_reduce( const typename JoinOp::value_type & value
, const JoinOp & op ) const
{
typedef JoinOp JoinOpFunctor ;
typename JoinOp::value_type * const base_data = (typename JoinOp::value_type *) m_team_reduce ;
#endif
__syncthreads(); // Don't write in to shared data until all threads have entered this function
if ( 0 == threadIdx.y ) { base_data[0] = 0 ; }
base_data[ threadIdx.y ] = value ;
Impl::cuda_intra_block_reduce_scan<false,JoinOpFunctor,void>( op , base_data );
return base_data[ blockDim.y - 1 ];
}
/** \brief Intra-team exclusive prefix sum with team_rank() ordering
* with intra-team non-deterministic ordering accumulation.
*
* The global inter-team accumulation value will, at the end of the
* league's parallel execution, be the scan's total.
* Parallel execution ordering of the league's teams is non-deterministic.
* As such the base value for each team's scan operation is similarly
* non-deterministic.
*/
template< typename Type >
__device__ inline Type team_scan( const Type & value , Type * const global_accum ) const
{
Type * const base_data = (Type *) m_team_reduce ;
__syncthreads(); // Don't write in to shared data until all threads have entered this function
if ( 0 == threadIdx.y ) { base_data[0] = 0 ; }
base_data[ threadIdx.y + 1 ] = value ;
Impl::cuda_intra_block_reduce_scan<true,Impl::CudaJoinFunctor<Type>,void>( Impl::CudaJoinFunctor<Type>() , base_data + 1 );
if ( global_accum ) {
if ( blockDim.y == threadIdx.y + 1 ) {
base_data[ blockDim.y ] = atomic_fetch_add( global_accum , base_data[ blockDim.y ] );
}
__syncthreads(); // Wait for atomic
base_data[ threadIdx.y ] += base_data[ blockDim.y ] ;
}
return base_data[ threadIdx.y ];
}
/** \brief Intra-team exclusive prefix sum with team_rank() ordering.
*
* The highest rank thread can compute the reduction total as
* reduction_total = dev.team_scan( value ) + value ;
*/
template< typename Type >
__device__ inline Type team_scan( const Type & value ) const
{ return this->template team_scan<Type>( value , 0 ); }
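/* Illustrative sketch (hypothetical functor body): each thread contributes one
 * value, receives its exclusive prefix, and the highest-ranked thread recovers
 * the team total as described above.
 *
 *   KOKKOS_INLINE_FUNCTION
 *   void operator()( const member_type & member ) const
 *   {
 *     const int my_value  = 1 ;                             // per-thread contribution
 *     const int my_offset = member.team_scan( my_value );   // exclusive prefix sum
 *     if ( member.team_rank() == member.team_size() - 1 ) {
 *       const int team_total = my_offset + my_value ;       // reduction total
 *       (void) team_total ;
 *     }
 *   }
 */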
-
-#ifdef KOKKOS_HAVE_CXX11
- template< class Operation >
- __device__ inline void vector_single(const Operation & op) const {
- if(threadIdx.x == 0)
- op();
- }
-
- template< class Operation, typename ValueType>
- __device__ inline void vector_single(const Operation & op, ValueType& bcast) const {
- if(threadIdx.x == 0)
- op();
- bcast = shfl(bcast,0,blockDim.x);
- }
-
-#endif
-
//----------------------------------------
// Private for the driver
__device__ inline
CudaTeamMember( void * shared
, const int shared_begin
, const int shared_size
, const int arg_league_rank
, const int arg_league_size )
: m_team_reduce( shared )
, m_team_shared( ((char *)shared) + shared_begin , shared_size )
, m_league_rank( arg_league_rank )
, m_league_size( arg_league_size )
{}
#else
const execution_space::scratch_memory_space & team_shmem() const {return m_team_shared;}
int league_rank() const {return 0;}
int league_size() const {return 1;}
int team_rank() const {return 0;}
int team_size() const {return 1;}
void team_barrier() const {}
template<class ValueType>
void team_broadcast(ValueType& value, const int& thread_id) const {}
template< class JoinOp >
typename JoinOp::value_type team_reduce( const typename JoinOp::value_type & value
, const JoinOp & op ) const {return typename JoinOp::value_type();}
template< typename Type >
Type team_scan( const Type & value , Type * const global_accum ) const {return Type();}
template< typename Type >
Type team_scan( const Type & value ) const {return Type();}
-#ifdef KOKKOS_HAVE_CXX11
- template< class Operation >
- void vector_single(const Operation & op) const {}
-
- template< class Operation , typename ValueType>
- void vector_single(const Operation & op, ValueType& val) const {}
-#endif
//----------------------------------------
// Private for the driver
CudaTeamMember( void * shared
, const int shared_begin
, const int shared_end
, const int arg_league_rank
, const int arg_league_size );
#endif /* #if ! defined( __CUDA_ARCH__ ) */
};
} // namespace Impl
template< class Arg0 , class Arg1 >
class TeamPolicy< Arg0 , Arg1 , Kokkos::Cuda >
{
private:
enum { MAX_WARP = 8 };
const int m_league_size ;
const int m_team_size ;
const int m_vector_length ;
public:
//! Tag this class as a kokkos execution policy
typedef TeamPolicy execution_policy ;
//! Execution space of this execution policy
typedef Kokkos::Cuda execution_space ;
typedef typename
Impl::if_c< ! Impl::is_same< Kokkos::Cuda , Arg0 >::value , Arg0 , Arg1 >::type
work_tag ;
//----------------------------------------
template< class FunctorType >
inline static
int team_size_max( const FunctorType & functor )
{
int n = MAX_WARP * Impl::CudaTraits::WarpSize ;
for ( ; n ; n >>= 1 ) {
const int shmem_size =
/* for global reduce */ Impl::cuda_single_inter_block_reduce_scan_shmem<false,FunctorType,work_tag>( functor , n )
/* for team reduce */ + ( n + 2 ) * sizeof(double)
/* for team shared */ + Impl::FunctorTeamShmemSize< FunctorType >::value( functor , n );
if ( shmem_size < Impl::CudaTraits::SharedMemoryCapacity ) break ;
}
return n ;
}
template< class FunctorType >
static int team_size_recommended( const FunctorType & functor )
{ return team_size_max( functor ); }
+ template< class FunctorType >
+ static int team_size_recommended( const FunctorType & functor , const int vector_length)
+ {
+ int max = team_size_max( functor )/vector_length;
+ if(max<1) max = 1;
+ return max;
+ }
+
inline static
int vector_length_max()
{ return Impl::CudaTraits::WarpSize; }
//----------------------------------------
inline int vector_length() const { return m_vector_length ; }
inline int team_size() const { return m_team_size ; }
inline int league_size() const { return m_league_size ; }
/** \brief Specify league size, request team size */
- TeamPolicy( execution_space & , int league_size , int team_size_request , int vector_length_request = 1 )
- : m_league_size( league_size )
+ TeamPolicy( execution_space & , int league_size_ , int team_size_request , int vector_length_request = 1 )
+ : m_league_size( league_size_ )
, m_team_size( team_size_request )
, m_vector_length ( vector_length_request )
{
// Allow only power-of-two vector_length
int check = 0;
- for(int k = 1; k < vector_length_max(); k*=2)
+ for(int k = 1; k <= vector_length_max(); k*=2)
if(k == vector_length_request)
check = 1;
if(!check)
Impl::throw_runtime_exception( "Requested non-power-of-two vector length for TeamPolicy.");
// Make sure league size is permissible
- if(league_size >= int(Impl::cuda_internal_maximum_grid_count()))
+ if(league_size_ >= int(Impl::cuda_internal_maximum_grid_count()))
Impl::throw_runtime_exception( "Requested too large league_size for TeamPolicy on Cuda execution space.");
}
- TeamPolicy( int league_size , int team_size_request , int vector_length_request = 1 )
- : m_league_size( league_size )
+ TeamPolicy( int league_size_ , int team_size_request , int vector_length_request = 1 )
+ : m_league_size( league_size_ )
, m_team_size( team_size_request )
, m_vector_length ( vector_length_request )
{
// Allow only power-of-two vector_length
int check = 0;
- for(int k = 1; k < vector_length_max(); k*=2)
+ for(int k = 1; k <= vector_length_max(); k*=2)
if(k == vector_length_request)
check = 1;
if(!check)
Impl::throw_runtime_exception( "Requested non-power-of-two vector length for TeamPolicy.");
// Make sure league size is permissible
- if(league_size >= int(Impl::cuda_internal_maximum_grid_count()))
+ if(league_size_ >= int(Impl::cuda_internal_maximum_grid_count()))
Impl::throw_runtime_exception( "Requested too large league_size for TeamPolicy on Cuda execution space.");
}
typedef Kokkos::Impl::CudaTeamMember member_type ;
};
} // namespace Kokkos
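/* Illustrative sketch (hypothetical sizes): constructing a Cuda TeamPolicy.
 * The vector length must be a power of two no larger than the warp size, and the
 * league size must stay below the maximum grid count, or the constructors above throw.
 *
 *   typedef Kokkos::TeamPolicy< Kokkos::Cuda > policy_type ;
 *   const int league_size = 1024 ;  // number of teams (thread blocks)
 *   const int team_size   = 128 ;   // threads per team
 *   const int vector_len  = 4 ;     // power of two, <= WarpSize (32)
 *   policy_type policy( league_size , team_size , vector_len );
 */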
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
template< class FunctorType , class Arg0 , class Arg1 , class Arg2 >
class ParallelFor< FunctorType , Kokkos::RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::Cuda > >
{
private:
typedef Kokkos::RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::Cuda > Policy ;
const FunctorType m_functor ;
const Policy m_policy ;
ParallelFor();
ParallelFor & operator = ( const ParallelFor & );
template< class Tag >
inline static
__device__
void driver( const FunctorType & functor
, typename Impl::enable_if< Impl::is_same< Tag , void >::value
, typename Policy::member_type const & >::type iwork
)
{ functor( iwork ); }
template< class Tag >
inline static
__device__
void driver( const FunctorType & functor
, typename Impl::enable_if< ! Impl::is_same< Tag , void >::value
, typename Policy::member_type const & >::type iwork
)
{ functor( Tag() , iwork ); }
public:
typedef FunctorType functor_type ;
inline
__device__
void operator()(void) const
{
const typename Policy::member_type work_stride = blockDim.y * gridDim.x ;
const typename Policy::member_type work_end = m_policy.end();
for ( typename Policy::member_type
iwork = m_policy.begin() + threadIdx.y + blockDim.y * blockIdx.x ;
iwork < work_end ;
iwork += work_stride ) {
ParallelFor::template driver< typename Policy::work_tag >( m_functor, iwork );
}
}
ParallelFor( const FunctorType & functor ,
const Policy & policy )
: m_functor( functor )
, m_policy( policy )
{
const dim3 block( 1 , CudaTraits::WarpSize * cuda_internal_maximum_warp_count(), 1);
const dim3 grid( std::min( ( int( policy.end() - policy.begin() ) + block.y - 1 ) / block.y
, cuda_internal_maximum_grid_count() )
, 1 , 1);
CudaParallelLaunch< ParallelFor >( *this , grid , block , 0 );
}
};
template< class FunctorType , class Arg0 , class Arg1 >
class ParallelFor< FunctorType , Kokkos::TeamPolicy< Arg0 , Arg1 , Kokkos::Cuda > >
{
private:
typedef Kokkos::TeamPolicy< Arg0 , Arg1 , Kokkos::Cuda > Policy ;
public:
typedef FunctorType functor_type ;
typedef Cuda::size_type size_type ;
private:
// Algorithmic constraints: blockDim.y is a power of two AND blockDim.y == blockDim.z == 1
// shared memory utilization:
//
// [ team reduce space ]
// [ team shared space ]
//
const FunctorType m_functor ;
size_type m_shmem_begin ;
size_type m_shmem_size ;
size_type m_league_size ;
template< class TagType >
- KOKKOS_FORCEINLINE_FUNCTION
+ __device__ inline
void driver( typename Impl::enable_if< Impl::is_same< TagType , void >::value ,
const typename Policy::member_type & >::type member ) const
{ m_functor( member ); }
template< class TagType >
- KOKKOS_FORCEINLINE_FUNCTION
+ __device__ inline
void driver( typename Impl::enable_if< ! Impl::is_same< TagType , void >::value ,
const typename Policy::member_type & >::type member ) const
{ m_functor( TagType() , member ); }
public:
__device__ inline
void operator()(void) const
{
// Iterate this block through the league
for ( int league_rank = blockIdx.x ; league_rank < m_league_size ; league_rank += gridDim.x ) {
ParallelFor::template driver< typename Policy::work_tag >(
typename Policy::member_type( kokkos_impl_cuda_shared_memory<void>()
, m_shmem_begin
, m_shmem_size
, league_rank
, m_league_size ) );
}
}
ParallelFor( const FunctorType & functor
, const Policy & policy
)
: m_functor( functor )
, m_shmem_begin( sizeof(double) * ( policy.team_size() + 2 ) )
, m_shmem_size( FunctorTeamShmemSize< FunctorType >::value( functor , policy.team_size() ) )
, m_league_size( policy.league_size() )
{
// Functor's reduce memory, team scan memory, and team shared memory depend upon team size.
const int shmem_size_total = m_shmem_begin + m_shmem_size ;
if ( CudaTraits::SharedMemoryCapacity < shmem_size_total ) {
Kokkos::Impl::throw_runtime_exception(std::string("Kokkos::Impl::ParallelFor< Cuda > insufficient shared memory"));
}
const dim3 grid( int(policy.league_size()) , 1 , 1 );
const dim3 block( policy.vector_length() , policy.team_size() , 1 );
CudaParallelLaunch< ParallelFor >( *this, grid, block, shmem_size_total ); // copy to device and execute
}
};
} // namespace Impl
} // namespace Kokkos
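/* Illustrative sketch (hypothetical functors and sizes): dispatching the two
 * ParallelFor specializations above through the public interface.
 *
 *   Kokkos::parallel_for( Kokkos::RangePolicy< Kokkos::Cuda >( 0 , N ) , range_functor );
 *   Kokkos::parallel_for( Kokkos::TeamPolicy< Kokkos::Cuda >( league_size , team_size ) , team_functor );
 */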
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
template< class FunctorType , class Arg0 , class Arg1 , class Arg2 >
class ParallelReduce< FunctorType , Kokkos::RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::Cuda > >
{
private:
typedef Kokkos::RangePolicy<Arg0,Arg1,Arg2, Kokkos::Cuda > Policy ;
typedef typename Policy::WorkRange work_range ;
typedef typename Policy::work_tag work_tag ;
typedef Kokkos::Impl::FunctorValueTraits< FunctorType , work_tag > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< FunctorType , work_tag > ValueInit ;
public:
typedef typename ValueTraits::pointer_type pointer_type ;
typedef typename ValueTraits::value_type value_type ;
typedef typename ValueTraits::reference_type reference_type ;
typedef FunctorType functor_type ;
typedef Cuda::size_type size_type ;
// Algorithmic constraints: blockSize is a power of two AND blockDim.y == blockDim.z == 1
const FunctorType m_functor ;
const Policy m_policy ;
size_type * m_scratch_space ;
size_type * m_scratch_flags ;
size_type * m_unified_space ;
// Determine block size constrained by shared memory:
static inline
unsigned local_block_size( const FunctorType & f )
{
unsigned n = CudaTraits::WarpSize * 8 ;
while ( n && CudaTraits::SharedMemoryCapacity < cuda_single_inter_block_reduce_scan_shmem<false,FunctorType,work_tag>( f , n ) ) { n >>= 1 ; }
return n ;
}
template< class Tag >
inline static
__device__
void driver( const FunctorType & functor
, typename Impl::enable_if< Impl::is_same< Tag , void >::value
, typename Policy::member_type const & >::type iwork
, reference_type value )
{ functor( iwork , value ); }
template< class Tag >
inline static
__device__
void driver( const FunctorType & functor
, typename Impl::enable_if< ! Impl::is_same< Tag , void >::value
, typename Policy::member_type const & >::type iwork
, reference_type value )
{ functor( Tag() , iwork , value ); }
#ifndef KOKKOS_EXPERIMENTAL_CUDA_SHFL_REDUCTION
__device__ inline
void operator()(void) const
{
const integral_nonzero_constant< size_type , ValueTraits::StaticValueSize / sizeof(size_type) >
word_count( ValueTraits::value_size( m_functor ) / sizeof(size_type) );
{
reference_type value =
ValueInit::init( m_functor , kokkos_impl_cuda_shared_memory<size_type>() + threadIdx.y * word_count.value );
// Number of blocks is bounded so that the reduction can be limited to two passes.
// Each thread block is given an approximately equal amount of work to perform.
// Accumulate the values for this block.
// The accumulation ordering does not match the final pass, but is arithmetically equivalent.
const work_range range( m_policy , blockIdx.x , gridDim.x );
for ( typename work_range::member_type iwork = range.begin() + threadIdx.y , iwork_end = range.end() ;
iwork < iwork_end ; iwork += blockDim.y ) {
ParallelReduce::template driver< work_tag >( m_functor , iwork , value );
}
}
// Reduce with final value at blockDim.y - 1 location.
if ( cuda_single_inter_block_reduce_scan<false,FunctorType,work_tag>(
m_functor , blockIdx.x , gridDim.x ,
kokkos_impl_cuda_shared_memory<size_type>() , m_scratch_space , m_scratch_flags ) ) {
// This is the final block with the final result at the final threads' location
size_type * const shared = kokkos_impl_cuda_shared_memory<size_type>() + ( blockDim.y - 1 ) * word_count.value ;
size_type * const global = m_unified_space ? m_unified_space : m_scratch_space ;
if ( threadIdx.y == 0 ) {
Kokkos::Impl::FunctorFinal< FunctorType , work_tag >::final( m_functor , shared );
}
if ( CudaTraits::WarpSize < word_count.value ) { __syncthreads(); }
for ( unsigned i = threadIdx.y ; i < word_count.value ; i += blockDim.y ) { global[i] = shared[i]; }
}
}
#else
__device__ inline
void operator()(void) const
{
value_type value = 0;
// Number of blocks is bounded so that the reduction can be limited to two passes.
// Each thread block is given an approximately equal amount of work to perform.
// Accumulate the values for this block.
// The accumulation ordering does not match the final pass, but is arithmetically equivalent.
const Policy range( m_policy , blockIdx.x , gridDim.x );
for ( typename Policy::member_type iwork = range.begin() + threadIdx.y , iwork_end = range.end() ;
iwork < iwork_end ; iwork += blockDim.y ) {
ParallelReduce::template driver< work_tag >( m_functor , iwork , value );
}
pointer_type const result = (pointer_type) (m_unified_space ? m_unified_space : m_scratch_space) ;
int max_active_thread = range.end()-range.begin() < blockDim.y ? range.end() - range.begin():blockDim.y;
max_active_thread = max_active_thread == 0?blockDim.y:max_active_thread;
if(Impl::cuda_inter_block_reduction<FunctorType,Impl::JoinAdd<value_type> >
(value,Impl::JoinAdd<value_type>(),m_scratch_space,result,m_scratch_flags,max_active_thread)) {
const unsigned id = threadIdx.y*blockDim.x + threadIdx.x;
if(id==0) {
Kokkos::Impl::FunctorFinal< FunctorType , work_tag >::final( m_functor , (void*) &value );
*result = value;
}
}
}
#endif
template< class HostViewType >
ParallelReduce( const FunctorType & functor
, const Policy & policy
, const HostViewType & result
)
: m_functor( functor )
, m_policy( policy )
, m_scratch_space( 0 )
, m_scratch_flags( 0 )
, m_unified_space( 0 )
{
const int block_size = local_block_size( functor );
const int block_count = std::min( int(block_size)
, ( int(policy.end() - policy.begin()) + block_size - 1 ) / block_size
);
m_scratch_space = cuda_internal_scratch_space( ValueTraits::value_size( functor ) * block_count );
m_scratch_flags = cuda_internal_scratch_flags( sizeof(size_type) );
m_unified_space = cuda_internal_scratch_unified( ValueTraits::value_size( functor ) );
const dim3 grid( block_count , 1 , 1 );
const dim3 block( 1 , block_size , 1 ); // REQUIRED DIMENSIONS ( 1 , N , 1 )
#ifdef KOKKOS_EXPERIMENTAL_CUDA_SHFL_REDUCTION
const int shmem = 0;
#else
const int shmem = cuda_single_inter_block_reduce_scan_shmem<false,FunctorType,work_tag>( m_functor , block.y );
#endif
CudaParallelLaunch< ParallelReduce >( *this, grid, block, shmem ); // copy to device and execute
Cuda::fence();
if ( result.ptr_on_device() ) {
if ( m_unified_space ) {
const int count = ValueTraits::value_count( m_functor );
for ( int i = 0 ; i < count ; ++i ) { result.ptr_on_device()[i] = pointer_type(m_unified_space)[i] ; }
}
else {
const int size = ValueTraits::value_size( m_functor );
DeepCopy<HostSpace,CudaSpace>( result.ptr_on_device() , m_scratch_space , size );
}
}
}
};
template< class FunctorType , class Arg0 , class Arg1 >
class ParallelReduce< FunctorType , Kokkos::TeamPolicy< Arg0 , Arg1 , Kokkos::Cuda > >
{
private:
typedef Kokkos::TeamPolicy<Arg0,Arg1,Kokkos::Cuda> Policy ;
typedef typename Policy::work_tag work_tag ;
typedef Kokkos::Impl::FunctorValueTraits< FunctorType , work_tag > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< FunctorType , work_tag > ValueInit ;
typedef typename ValueTraits::pointer_type pointer_type ;
typedef typename ValueTraits::reference_type reference_type ;
public:
typedef FunctorType functor_type ;
typedef Cuda::size_type size_type ;
private:
// Algorithmic constraints: blockDim.y is a power of two AND blockDim.y == blockDim.z == 1
// shared memory utilization:
//
// [ global reduce space ]
// [ team reduce space ]
// [ team shared space ]
//
const FunctorType m_functor ;
size_type * m_scratch_space ;
size_type * m_scratch_flags ;
size_type * m_unified_space ;
size_type m_team_begin ;
size_type m_shmem_begin ;
size_type m_shmem_size ;
size_type m_league_size ;
template< class TagType >
- KOKKOS_FORCEINLINE_FUNCTION
+ __device__ inline
void driver( typename Impl::enable_if< Impl::is_same< TagType , void >::value ,
const typename Policy::member_type & >::type member
, reference_type update ) const
{ m_functor( member , update ); }
template< class TagType >
- KOKKOS_FORCEINLINE_FUNCTION
+ __device__ inline
void driver( typename Impl::enable_if< ! Impl::is_same< TagType , void >::value ,
const typename Policy::member_type & >::type member
, reference_type update ) const
{ m_functor( TagType() , member , update ); }
public:
__device__ inline
void operator()(void) const
{
const integral_nonzero_constant< size_type , ValueTraits::StaticValueSize / sizeof(size_type) >
word_count( ValueTraits::value_size( m_functor ) / sizeof(size_type) );
reference_type value =
ValueInit::init( m_functor , kokkos_impl_cuda_shared_memory<size_type>() + threadIdx.y * word_count.value );
// Iterate this block through the league
for ( int league_rank = blockIdx.x ; league_rank < m_league_size ; league_rank += gridDim.x ) {
ParallelReduce::template driver< work_tag >
( typename Policy::member_type( kokkos_impl_cuda_shared_memory<char>() + m_team_begin
, m_shmem_begin
, m_shmem_size
, league_rank
, m_league_size )
, value );
}
// Reduce with final value at blockDim.y - 1 location.
if ( cuda_single_inter_block_reduce_scan<false,FunctorType,work_tag>(
m_functor , blockIdx.x , gridDim.x ,
kokkos_impl_cuda_shared_memory<size_type>() , m_scratch_space , m_scratch_flags ) ) {
// This is the final block with the final result at the final threads' location
size_type * const shared = kokkos_impl_cuda_shared_memory<size_type>() + ( blockDim.y - 1 ) * word_count.value ;
size_type * const global = m_unified_space ? m_unified_space : m_scratch_space ;
if ( threadIdx.y == 0 ) {
Kokkos::Impl::FunctorFinal< FunctorType , work_tag >::final( m_functor , shared );
}
if ( CudaTraits::WarpSize < word_count.value ) { __syncthreads(); }
for ( unsigned i = threadIdx.y ; i < word_count.value ; i += blockDim.y ) { global[i] = shared[i]; }
}
}
template< class HostViewType >
ParallelReduce( const FunctorType & functor
, const Policy & policy
, const HostViewType & result
)
: m_functor( functor )
, m_scratch_space( 0 )
, m_scratch_flags( 0 )
, m_unified_space( 0 )
, m_team_begin( cuda_single_inter_block_reduce_scan_shmem<false,FunctorType,work_tag>( functor , policy.team_size() ) )
, m_shmem_begin( sizeof(double) * ( policy.team_size() + 2 ) )
, m_shmem_size( FunctorTeamShmemSize< FunctorType >::value( functor , policy.team_size() ) )
, m_league_size( policy.league_size() )
{
// The global parallel_reduce does not support vector_length other than 1 at the moment
if(policy.vector_length() > 1)
Impl::throw_runtime_exception( "Kokkos::parallel_reduce with a TeamPolicy using a vector length of greater than 1 is not currently supported for CUDA.");
// Functor's reduce memory, team scan memory, and team shared memory depend upon team size.
const int shmem_size_total = m_team_begin + m_shmem_begin + m_shmem_size ;
const int not_power_of_two = 0 != ( policy.team_size() & ( policy.team_size() - 1 ) );
if ( not_power_of_two || CudaTraits::SharedMemoryCapacity < shmem_size_total ) {
Kokkos::Impl::throw_runtime_exception(std::string("Kokkos::Impl::ParallelReduce< Cuda > bad team size"));
}
const int block_count = std::min( policy.league_size() , policy.team_size() );
m_scratch_space = cuda_internal_scratch_space( ValueTraits::value_size( functor ) * block_count );
m_scratch_flags = cuda_internal_scratch_flags( sizeof(size_type) );
m_unified_space = cuda_internal_scratch_unified( ValueTraits::value_size( functor ) );
const dim3 grid( block_count , 1 , 1 );
const dim3 block( 1 , policy.team_size() , 1 ); // REQUIRED DIMENSIONS ( 1 , N , 1 )
CudaParallelLaunch< ParallelReduce >( *this, grid, block, shmem_size_total ); // copy to device and execute
Cuda::fence();
if ( result.ptr_on_device() ) {
if ( m_unified_space ) {
const int count = ValueTraits::value_count( m_functor );
for ( int i = 0 ; i < count ; ++i ) { result.ptr_on_device()[i] = pointer_type(m_unified_space)[i] ; }
}
else {
const int size = ValueTraits::value_size( m_functor );
DeepCopy<HostSpace,CudaSpace>( result.ptr_on_device() , m_scratch_space , size );
}
}
}
};
} // namespace Impl
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
template< class FunctorType , class Arg0 , class Arg1 , class Arg2 >
class ParallelScan< FunctorType , Kokkos::RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::Cuda > >
{
private:
typedef Kokkos::RangePolicy<Arg0,Arg1,Arg2, Kokkos::Cuda > Policy ;
typedef typename Policy::WorkRange work_range ;
typedef typename Policy::work_tag work_tag ;
typedef Kokkos::Impl::FunctorValueTraits< FunctorType , work_tag > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< FunctorType , work_tag > ValueInit ;
typedef Kokkos::Impl::FunctorValueOps< FunctorType , work_tag > ValueOps ;
public:
typedef typename ValueTraits::pointer_type pointer_type ;
typedef typename ValueTraits::reference_type reference_type ;
typedef FunctorType functor_type ;
typedef Cuda::size_type size_type ;
// Algorithmic constraints:
// (a) blockDim.y is a power of two
// (b) blockDim.y == blockDim.z == 1
// (c) gridDim.x <= blockDim.y * blockDim.y
// (d) gridDim.y == gridDim.z == 1
// Determine block size constrained by shared memory:
static inline
unsigned local_block_size( const FunctorType & f )
{
// blockDim.y must be power of two = 128 (4 warps) or 256 (8 warps) or 512 (16 warps)
// gridDim.x <= blockDim.y * blockDim.y
//
// 4 warps was 10% faster than 8 warps and 20% faster than 16 warps in unit testing
unsigned n = CudaTraits::WarpSize * 4 ;
while ( n && CudaTraits::SharedMemoryCapacity < cuda_single_inter_block_reduce_scan_shmem<false,FunctorType,work_tag>( f , n ) ) { n >>= 1 ; }
return n ;
}
const FunctorType m_functor ;
const Policy m_policy ;
size_type * m_scratch_space ;
size_type * m_scratch_flags ;
size_type m_final ;
template< class Tag >
inline static
__device__
void driver( const FunctorType & functor
, typename Impl::enable_if< Impl::is_same< Tag , void >::value
, typename Policy::member_type const & >::type iwork
, reference_type value
, const bool final )
{ functor( iwork , value , final ); }
template< class Tag >
inline static
__device__
void driver( const FunctorType & functor
, typename Impl::enable_if< ! Impl::is_same< Tag , void >::value
, typename Policy::member_type const & >::type iwork
, reference_type value
, const bool final )
{ functor( Tag() , iwork , value , final ); }
//----------------------------------------
__device__ inline
void initial(void) const
{
const integral_nonzero_constant< size_type , ValueTraits::StaticValueSize / sizeof(size_type) >
word_count( ValueTraits::value_size( m_functor ) / sizeof(size_type) );
size_type * const shared_value = kokkos_impl_cuda_shared_memory<size_type>() + word_count.value * threadIdx.y ;
ValueInit::init( m_functor , shared_value );
// Number of blocks is bounded so that the reduction can be limited to two passes.
// Each thread block is given an approximately equal amount of work to perform.
// Accumulate the values for this block.
// The accumulation ordering does not match the final pass, but is arithmetically equivalent.
const work_range range( m_policy , blockIdx.x , gridDim.x );
for ( typename Policy::member_type iwork = range.begin() + threadIdx.y , iwork_end = range.end() ;
iwork < iwork_end ; iwork += blockDim.y ) {
ParallelScan::template driver< work_tag >
( m_functor , iwork , ValueOps::reference( shared_value ) , false );
}
// Reduce and scan, writing out scan of blocks' totals and block-groups' totals.
// Blocks' scan values are written to 'blockIdx.x' location.
// Block-groups' scan values are at: i = ( j * blockDim.y - 1 ) for i < gridDim.x
cuda_single_inter_block_reduce_scan<true,FunctorType,work_tag>( m_functor , blockIdx.x , gridDim.x , kokkos_impl_cuda_shared_memory<size_type>() , m_scratch_space , m_scratch_flags );
}
//----------------------------------------
__device__ inline
void final(void) const
{
const integral_nonzero_constant< size_type , ValueTraits::StaticValueSize / sizeof(size_type) >
word_count( ValueTraits::value_size( m_functor ) / sizeof(size_type) );
// Use shared memory as an exclusive scan: { 0 , value[0] , value[1] , value[2] , ... }
size_type * const shared_data = kokkos_impl_cuda_shared_memory<size_type>();
size_type * const shared_prefix = shared_data + word_count.value * threadIdx.y ;
size_type * const shared_accum = shared_data + word_count.value * ( blockDim.y + 1 );
// Starting value for this thread block is the previous block's total.
if ( blockIdx.x ) {
size_type * const block_total = m_scratch_space + word_count.value * ( blockIdx.x - 1 );
for ( unsigned i = threadIdx.y ; i < word_count.value ; ++i ) { shared_accum[i] = block_total[i] ; }
}
else if ( 0 == threadIdx.y ) {
ValueInit::init( m_functor , shared_accum );
}
const work_range range( m_policy , blockIdx.x , gridDim.x );
for ( typename Policy::member_type iwork_base = range.begin(); iwork_base < range.end() ; iwork_base += blockDim.y ) {
const typename Policy::member_type iwork = iwork_base + threadIdx.y ;
__syncthreads(); // Don't overwrite previous iteration values until they are used
ValueInit::init( m_functor , shared_prefix + word_count.value );
// Copy previous block's accumulation total into thread[0] prefix and inclusive scan value of this block
for ( unsigned i = threadIdx.y ; i < word_count.value ; ++i ) {
shared_data[i + word_count.value] = shared_data[i] = shared_accum[i] ;
}
if ( CudaTraits::WarpSize < word_count.value ) { __syncthreads(); } // Protect against large scan values.
// Call functor to accumulate inclusive scan value for this work item
if ( iwork < range.end() ) {
ParallelScan::template driver< work_tag >
( m_functor , iwork , ValueOps::reference( shared_prefix + word_count.value ) , false );
}
// Scan block values into locations shared_data[1..blockDim.y]
cuda_intra_block_reduce_scan<true,FunctorType,work_tag>( m_functor , ValueTraits::pointer_type(shared_data+word_count.value) );
{
size_type * const block_total = shared_data + word_count.value * blockDim.y ;
for ( unsigned i = threadIdx.y ; i < word_count.value ; ++i ) { shared_accum[i] = block_total[i]; }
}
// Call functor with exclusive scan value
if ( iwork < range.end() ) {
ParallelScan::template driver< work_tag >
( m_functor , iwork , ValueOps::reference( shared_prefix ) , true );
}
}
}
//----------------------------------------
__device__ inline
void operator()(void) const
{
if ( ! m_final ) {
initial();
}
else {
final();
}
}
ParallelScan( const FunctorType & functor ,
const Policy & policy )
: m_functor( functor )
, m_policy( policy )
, m_scratch_space( 0 )
, m_scratch_flags( 0 )
, m_final( false )
{
enum { GridMaxComputeCapability_2x = 0x0ffff };
const int block_size = local_block_size( functor );
const int grid_max = ( block_size * block_size ) < GridMaxComputeCapability_2x ?
( block_size * block_size ) : GridMaxComputeCapability_2x ;
// At most 'max_grid' blocks:
const int nwork = policy.end() - policy.begin();
const int max_grid = std::min( int(grid_max) , int(( nwork + block_size - 1 ) / block_size ));
// How much work per block:
const int work_per_block = ( nwork + max_grid - 1 ) / max_grid ;
// How many block are really needed for this much work:
const dim3 grid( ( nwork + work_per_block - 1 ) / work_per_block , 1 , 1 );
const dim3 block( 1 , block_size , 1 ); // REQUIRED DIMENSIONS ( 1 , N , 1 )
const int shmem = ValueTraits::value_size( functor ) * ( block_size + 2 );
m_scratch_space = cuda_internal_scratch_space( ValueTraits::value_size( functor ) * grid.x );
m_scratch_flags = cuda_internal_scratch_flags( sizeof(size_type) * 1 );
m_final = false ;
CudaParallelLaunch< ParallelScan >( *this, grid, block, shmem ); // copy to device and execute
m_final = true ;
CudaParallelLaunch< ParallelScan >( *this, grid, block, shmem ); // copy to device and execute
}
void wait() const { Cuda::fence(); }
};
} // namespace Impl
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
-#ifdef KOKKOS_HAVE_CXX11
-
namespace Kokkos {
namespace Impl {
template<typename iType>
- struct TeamThreadLoopBoundariesStruct<iType,CudaTeamMember> {
+ struct TeamThreadRangeBoundariesStruct<iType,CudaTeamMember> {
typedef iType index_type;
const iType start;
const iType end;
const iType increment;
const CudaTeamMember& thread;
#ifdef __CUDA_ARCH__
__device__ inline
- TeamThreadLoopBoundariesStruct (const CudaTeamMember& thread_, const iType& count):
+ TeamThreadRangeBoundariesStruct (const CudaTeamMember& thread_, const iType& count):
start( threadIdx.y ),
end( count ),
increment( blockDim.y ),
thread(thread_)
{}
+ __device__ inline
+ TeamThreadRangeBoundariesStruct (const CudaTeamMember& thread_, const iType& begin_, const iType& end_):
+ start( begin_+threadIdx.y ),
+ end( end_ ),
+ increment( blockDim.y ),
+ thread(thread_)
+ {}
#else
KOKKOS_INLINE_FUNCTION
- TeamThreadLoopBoundariesStruct (const CudaTeamMember& thread_, const iType& count):
+ TeamThreadRangeBoundariesStruct (const CudaTeamMember& thread_, const iType& count):
start( 0 ),
end( count ),
increment( 1 ),
thread(thread_)
{}
+ KOKKOS_INLINE_FUNCTION
+ TeamThreadRangeBoundariesStruct (const CudaTeamMember& thread_, const iType& begin_, const iType& end_):
+ start( begin_ ),
+ end( end_ ),
+ increment( 1 ),
+ thread(thread_)
+ {}
#endif
};
template<typename iType>
- struct ThreadVectorLoopBoundariesStruct<iType,CudaTeamMember> {
+ struct ThreadVectorRangeBoundariesStruct<iType,CudaTeamMember> {
typedef iType index_type;
const iType start;
const iType end;
const iType increment;
#ifdef __CUDA_ARCH__
__device__ inline
- ThreadVectorLoopBoundariesStruct (const CudaTeamMember& thread, const iType& count):
+ ThreadVectorRangeBoundariesStruct (const CudaTeamMember& thread, const iType& count):
start( threadIdx.x ),
end( count ),
increment( blockDim.x )
{}
#else
KOKKOS_INLINE_FUNCTION
- ThreadVectorLoopBoundariesStruct (const CudaTeamMember& thread_, const iType& count):
+ ThreadVectorRangeBoundariesStruct (const CudaTeamMember& thread_, const iType& count):
start( 0 ),
end( count ),
increment( 1 )
{}
#endif
};
} // namespace Impl
template<typename iType>
KOKKOS_INLINE_FUNCTION
-Impl::TeamThreadLoopBoundariesStruct<iType,Impl::CudaTeamMember>
- TeamThreadLoop(const Impl::CudaTeamMember& thread, const iType& count) {
- return Impl::TeamThreadLoopBoundariesStruct<iType,Impl::CudaTeamMember>(thread,count);
+Impl::TeamThreadRangeBoundariesStruct<iType,Impl::CudaTeamMember>
+ TeamThreadRange(const Impl::CudaTeamMember& thread, const iType& count) {
+ return Impl::TeamThreadRangeBoundariesStruct<iType,Impl::CudaTeamMember>(thread,count);
}
template<typename iType>
KOKKOS_INLINE_FUNCTION
-Impl::ThreadVectorLoopBoundariesStruct<iType,Impl::CudaTeamMember >
- ThreadVectorLoop(Impl::CudaTeamMember thread, const iType count) {
- return Impl::ThreadVectorLoopBoundariesStruct<iType,Impl::CudaTeamMember >(thread,count);
+Impl::TeamThreadRangeBoundariesStruct<iType,Impl::CudaTeamMember>
+ TeamThreadRange(const Impl::CudaTeamMember& thread, const iType& begin, const iType& end) {
+ return Impl::TeamThreadRangeBoundariesStruct<iType,Impl::CudaTeamMember>(thread,begin,end);
+}
+
+template<typename iType>
+KOKKOS_INLINE_FUNCTION
+Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::CudaTeamMember >
+ ThreadVectorRange(const Impl::CudaTeamMember& thread, const iType& count) {
+ return Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::CudaTeamMember >(thread,count);
}
KOKKOS_INLINE_FUNCTION
Impl::ThreadSingleStruct<Impl::CudaTeamMember> PerTeam(const Impl::CudaTeamMember& thread) {
return Impl::ThreadSingleStruct<Impl::CudaTeamMember>(thread);
}
KOKKOS_INLINE_FUNCTION
Impl::VectorSingleStruct<Impl::CudaTeamMember> PerThread(const Impl::CudaTeamMember& thread) {
return Impl::VectorSingleStruct<Impl::CudaTeamMember>(thread);
}
} // namespace Kokkos
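/* Illustrative sketch (hypothetical lambda body, C++11 assumed): nesting the range
 * constructors above inside a team functor.  The outer loop is split over the team's
 * threads, the inner loop over the vector lanes of each thread; row_count and
 * col_count are hypothetical.
 *
 *   KOKKOS_INLINE_FUNCTION
 *   void operator()( const Kokkos::Impl::CudaTeamMember & member ) const
 *   {
 *     Kokkos::parallel_for( Kokkos::TeamThreadRange( member , row_count ) ,
 *       [&] ( const int row ) {
 *         Kokkos::parallel_for( Kokkos::ThreadVectorRange( member , col_count ) ,
 *           [&] ( const int col ) {
 *             // ... work on element (row,col) ...
 *           });
 *       });
 *   }
 */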
namespace Kokkos {
/** \brief Inter-thread parallel_for. Executes lambda(iType i) for each i=0..N-1.
*
* The range i=0..N-1 is mapped to all threads of the calling thread team.
* This functionality requires C++11 support.*/
template<typename iType, class Lambda>
KOKKOS_INLINE_FUNCTION
-void parallel_for(const Impl::TeamThreadLoopBoundariesStruct<iType,Impl::CudaTeamMember>& loop_boundaries, const Lambda& lambda) {
+void parallel_for(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::CudaTeamMember>& loop_boundaries, const Lambda& lambda) {
+ #ifdef __CUDA_ARCH__
for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment)
lambda(i);
+ #endif
}
/** \brief Inter-thread vector parallel_reduce. Executes lambda(iType i, ValueType & val) for each i=0..N-1.
*
* The range i=0..N-1 is mapped to all threads of the calling thread team and a summation of
* val is performed and put into result. This functionality requires C++11 support.*/
template< typename iType, class Lambda, typename ValueType >
KOKKOS_INLINE_FUNCTION
-void parallel_reduce(const Impl::TeamThreadLoopBoundariesStruct<iType,Impl::CudaTeamMember>& loop_boundaries,
+void parallel_reduce(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::CudaTeamMember>& loop_boundaries,
const Lambda & lambda, ValueType& result) {
#ifdef __CUDA_ARCH__
result = ValueType();
for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
lambda(i,result);
}
Impl::cuda_intra_warp_reduction(result,[&] (ValueType& dst, const ValueType& src) { dst+=src; });
Impl::cuda_inter_warp_reduction(result,[&] (ValueType& dst, const ValueType& src) { dst+=src; });
#endif
}
/** \brief Intra-thread vector parallel_reduce. Executes lambda(iType i, ValueType & val) for each i=0..N-1.
*
* The range i=0..N-1 is mapped to all vector lanes of the calling thread and a reduction of
* val is performed using JoinType(ValueType& val, const ValueType& update) and put into init_result.
* The input value of init_result is used as initializer for temporary variables of ValueType. Therefore
* the input value should be the neutral element with respect to the join operation (e.g. '0 for +-' or
* '1 for *'). This functionality requires C++11 support.*/
template< typename iType, class Lambda, typename ValueType, class JoinType >
KOKKOS_INLINE_FUNCTION
-void parallel_reduce(const Impl::TeamThreadLoopBoundariesStruct<iType,Impl::CudaTeamMember>& loop_boundaries,
+void parallel_reduce(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::CudaTeamMember>& loop_boundaries,
const Lambda & lambda, const JoinType& join, ValueType& init_result) {
#ifdef __CUDA_ARCH__
ValueType result = init_result;
for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
lambda(i,result);
}
Impl::cuda_intra_warp_reduction(result, join );
Impl::cuda_inter_warp_reduction(result, join );
init_result = result;
#endif
}
} //namespace Kokkos
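/* Illustrative sketch (hypothetical data and size n, C++11 assumed): the join-based
 * overload above used for a max-reduction over the team's threads.  init_result must
 * hold the neutral element of the join, here a very negative value.
 *
 *   double team_max = -1.0e300 ;   // neutral element for max (assumed data range)
 *   Kokkos::parallel_reduce( Kokkos::TeamThreadRange( member , n ) ,
 *     [&] ( const int i , double & val ) {
 *       if ( data[i] > val ) val = data[i] ;
 *     } ,
 *     [] ( double & dst , const double & src ) { if ( src > dst ) dst = src ; } ,
 *     team_max );
 */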
namespace Kokkos {
/** \brief Intra-thread vector parallel_for. Executes lambda(iType i) for each i=0..N-1.
*
* The range i=0..N-1 is mapped to all vector lanes of the calling thread.
* This functionality requires C++11 support.*/
template<typename iType, class Lambda>
KOKKOS_INLINE_FUNCTION
-void parallel_for(const Impl::ThreadVectorLoopBoundariesStruct<iType,Impl::CudaTeamMember >&
+void parallel_for(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::CudaTeamMember >&
loop_boundaries, const Lambda& lambda) {
-
+#ifdef __CUDA_ARCH__
for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment)
lambda(i);
+#endif
}
/** \brief Intra-thread vector parallel_reduce. Executes lambda(iType i, ValueType & val) for each i=0..N-1.
*
* The range i=0..N-1 is mapped to all vector lanes of the calling thread and a summation of
* val is performed and put into result. This functionality requires C++11 support.*/
template< typename iType, class Lambda, typename ValueType >
KOKKOS_INLINE_FUNCTION
-void parallel_reduce(const Impl::ThreadVectorLoopBoundariesStruct<iType,Impl::CudaTeamMember >&
+void parallel_reduce(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::CudaTeamMember >&
loop_boundaries, const Lambda & lambda, ValueType& result) {
#ifdef __CUDA_ARCH__
ValueType val = ValueType();
for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
lambda(i,val);
}
result = val;
if (loop_boundaries.increment > 1)
result += shfl_down(result, 1,loop_boundaries.increment);
if (loop_boundaries.increment > 2)
result += shfl_down(result, 2,loop_boundaries.increment);
if (loop_boundaries.increment > 4)
result += shfl_down(result, 4,loop_boundaries.increment);
if (loop_boundaries.increment > 8)
result += shfl_down(result, 8,loop_boundaries.increment);
if (loop_boundaries.increment > 16)
result += shfl_down(result, 16,loop_boundaries.increment);
result = shfl(result,0,loop_boundaries.increment);
#endif
}
/** \brief Intra-thread vector parallel_reduce. Executes lambda(iType i, ValueType & val) for each i=0..N-1.
*
* The range i=0..N-1 is mapped to all vector lanes of the calling thread and a reduction of
* val is performed using JoinType(ValueType& val, const ValueType& update) and put into init_result.
* The input value of init_result is used as initializer for temporary variables of ValueType. Therefore
* the input value should be the neutral element with respect to the join operation (e.g. '0 for +-' or
* '1 for *'). This functionality requires C++11 support.*/
template< typename iType, class Lambda, typename ValueType, class JoinType >
KOKKOS_INLINE_FUNCTION
-void parallel_reduce(const Impl::ThreadVectorLoopBoundariesStruct<iType,Impl::CudaTeamMember >&
+void parallel_reduce(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::CudaTeamMember >&
loop_boundaries, const Lambda & lambda, const JoinType& join, ValueType& init_result) {
#ifdef __CUDA_ARCH__
ValueType result = init_result;
for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
lambda(i,result);
}
if (loop_boundaries.increment > 1)
join( result, shfl_down(result, 1,loop_boundaries.increment));
if (loop_boundaries.increment > 2)
join( result, shfl_down(result, 2,loop_boundaries.increment));
if (loop_boundaries.increment > 4)
join( result, shfl_down(result, 4,loop_boundaries.increment));
if (loop_boundaries.increment > 8)
join( result, shfl_down(result, 8,loop_boundaries.increment));
if (loop_boundaries.increment > 16)
join( result, shfl_down(result, 16,loop_boundaries.increment));
init_result = shfl(result,0,loop_boundaries.increment);
#endif
}
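// An illustrative usage sketch (editorial, not part of this patch): vector-lane
// reductions inside a TeamPolicy<Cuda> kernel, one team per row of a placeholder
// rank-2 device view `A` with row length `m`.
typedef Kokkos::TeamPolicy<Kokkos::Cuda> row_policy;
Kokkos::parallel_for( row_policy( nrows, 4, 32 ),
  KOKKOS_LAMBDA( const row_policy::member_type & team ) {
    const int row = team.league_rank();
    double row_sum = 0.0;   // summation overload
    Kokkos::parallel_reduce( Kokkos::ThreadVectorRange( team, m ),
      [&]( const int j, double & val ) { val += A( row, j ); }, row_sum );
    double row_prod = 1.0;  // '1' is the neutral element of the '*' join, as noted above
    Kokkos::parallel_reduce( Kokkos::ThreadVectorRange( team, m ),
      [&]( const int j, double & val ) { val *= A( row, j ); },
      []( double & dst, const double & src ) { dst *= src; }, row_prod );
  } );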
/** \brief Intra-thread vector parallel exclusive prefix sum. Executes lambda(iType i, ValueType & val, bool final)
* for each i=0..N-1.
*
* The range i=0..N-1 is mapped to all vector lanes in the thread and a scan operation is performed.
* Depending on the target execution space the operator might be called twice: once with final=false
* and once with final=true. When final==true val contains the prefix sum value. The contribution of this
* "i" needs to be added to val no matter whether final==true or not. In a serial execution
* (i.e. team_size==1) the operator is only called once with final==true. Scan_val will be set
* to the final sum value over all vector lanes.
* This functionality requires C++11 support.*/
template< typename iType, class FunctorType >
KOKKOS_INLINE_FUNCTION
-void parallel_scan(const Impl::ThreadVectorLoopBoundariesStruct<iType,Impl::CudaTeamMember >&
+void parallel_scan(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::CudaTeamMember >&
loop_boundaries, const FunctorType & lambda) {
#ifdef __CUDA_ARCH__
typedef Kokkos::Impl::FunctorValueTraits< FunctorType , void > ValueTraits ;
typedef typename ValueTraits::value_type value_type ;
value_type scan_val = value_type();
const int VectorLength = blockDim.x;
iType loop_bound = ((loop_boundaries.end+VectorLength-1)/VectorLength) * VectorLength;
for(int _i = threadIdx.x; _i < loop_bound; _i += VectorLength) {
value_type val = value_type();
if(_i<loop_boundaries.end)
lambda(_i , val , false);
value_type tmp = val;
value_type result_i;
if(threadIdx.x%VectorLength == 0)
result_i = tmp;
if (VectorLength > 1) {
const value_type tmp2 = shfl_up(tmp, 1,VectorLength);
if(threadIdx.x > 0)
tmp+=tmp2;
}
if(threadIdx.x%VectorLength == 1)
result_i = tmp;
if (VectorLength > 3) {
const value_type tmp2 = shfl_up(tmp, 2,VectorLength);
if(threadIdx.x > 1)
tmp+=tmp2;
}
if ((threadIdx.x%VectorLength >= 2) &&
(threadIdx.x%VectorLength < 4))
result_i = tmp;
if (VectorLength > 7) {
const value_type tmp2 = shfl_up(tmp, 4,VectorLength);
if(threadIdx.x > 3)
tmp+=tmp2;
}
if ((threadIdx.x%VectorLength >= 4) &&
(threadIdx.x%VectorLength < 8))
result_i = tmp;
if (VectorLength > 15) {
const value_type tmp2 = shfl_up(tmp, 8,VectorLength);
if(threadIdx.x > 7)
tmp+=tmp2;
}
if ((threadIdx.x%VectorLength >= 8) &&
(threadIdx.x%VectorLength < 16))
result_i = tmp;
if (VectorLength > 31) {
const value_type tmp2 = shfl_up(tmp, 16,VectorLength);
if(threadIdx.x > 15)
tmp+=tmp2;
}
if (threadIdx.x%VectorLength >= 16)
result_i = tmp;
val = scan_val + result_i - val;
scan_val += shfl(tmp,VectorLength-1,VectorLength);
if(_i<loop_boundaries.end)
lambda(_i , val , true);
}
#endif
}
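// An illustrative usage sketch (editorial, not part of this patch): an exclusive
// prefix sum over the vector lanes of one thread, turning per-element counts into
// offsets; `counts` and `offsets` are placeholder integer views of extent `m`.
Kokkos::parallel_for( Kokkos::TeamPolicy<Kokkos::Cuda>( 1, 1, 32 ),
  KOKKOS_LAMBDA( const Kokkos::TeamPolicy<Kokkos::Cuda>::member_type & team ) {
    Kokkos::parallel_scan( Kokkos::ThreadVectorRange( team, m ),
      [&]( const int j, int & partial, const bool final ) {
        if ( final ) offsets( j ) = partial;   // exclusive prefix value for element j
        partial += counts( j );                // contribute on both passes, as documented
      } );
  } );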
}
namespace Kokkos {
template<class FunctorType>
KOKKOS_INLINE_FUNCTION
void single(const Impl::VectorSingleStruct<Impl::CudaTeamMember>& , const FunctorType& lambda) {
#ifdef __CUDA_ARCH__
if(threadIdx.x == 0) lambda();
#endif
}
template<class FunctorType>
KOKKOS_INLINE_FUNCTION
void single(const Impl::ThreadSingleStruct<Impl::CudaTeamMember>& , const FunctorType& lambda) {
#ifdef __CUDA_ARCH__
if(threadIdx.x == 0 && threadIdx.y == 0) lambda();
#endif
}
template<class FunctorType, class ValueType>
KOKKOS_INLINE_FUNCTION
void single(const Impl::VectorSingleStruct<Impl::CudaTeamMember>& , const FunctorType& lambda, ValueType& val) {
#ifdef __CUDA_ARCH__
if(threadIdx.x == 0) lambda(val);
val = shfl(val,0,blockDim.x);
#endif
}
template<class FunctorType, class ValueType>
KOKKOS_INLINE_FUNCTION
void single(const Impl::ThreadSingleStruct<Impl::CudaTeamMember>& single_struct, const FunctorType& lambda, ValueType& val) {
#ifdef __CUDA_ARCH__
if(threadIdx.x == 0 && threadIdx.y == 0) {
lambda(val);
}
single_struct.team_member.team_broadcast(val,0);
#endif
}
}
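// An illustrative usage sketch (editorial, not part of this patch): Kokkos::single()
// inside a TeamPolicy<Cuda> kernel; `flags` is a placeholder device view sized to
// the league.
Kokkos::parallel_for( Kokkos::TeamPolicy<Kokkos::Cuda>( nteams, 8, 32 ),
  KOKKOS_LAMBDA( const Kokkos::TeamPolicy<Kokkos::Cuda>::member_type & team ) {
    Kokkos::single( Kokkos::PerThread( team ), [&]() {
      // executed by one vector lane of each thread
    } );
    double pivot = 0.0;
    Kokkos::single( Kokkos::PerTeam( team ), [&]( double & val ) {
      val = flags( team.league_rank() );   // one thread computes the value ...
    }, pivot );                            // ... which is then broadcast to the whole team
  } );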
-#endif // KOKKOS_HAVE_CXX11
-
namespace Kokkos {
-template<int N>
-struct Vectorization<Cuda,N> {
- typedef Kokkos::TeamPolicy< Cuda > team_policy ;
- typedef typename team_policy::member_type team_member ;
- enum {increment = N};
-#ifdef __CUDA_ARCH__
- KOKKOS_FORCEINLINE_FUNCTION
- static int begin() { return threadIdx.y%N;}
-#else
- KOKKOS_FORCEINLINE_FUNCTION
- static int begin() { return 0;}
+namespace Impl {
+ template< class FunctorType, class ExecPolicy, class ValueType , class Tag = typename ExecPolicy::work_tag>
+ struct CudaFunctorAdapter {
+ const FunctorType f;
+ typedef ValueType value_type;
+ CudaFunctorAdapter(const FunctorType& f_):f(f_) {}
+
+ __device__ inline
+ void operator() (typename ExecPolicy::work_tag, const typename ExecPolicy::member_type& i, ValueType& val) const {
+ //Insert Static Assert with decltype on ValueType equals third argument type of FunctorType::operator()
+ f(typename ExecPolicy::work_tag(), i,val);
+ }
+ };
+
+ template< class FunctorType, class ExecPolicy, class ValueType >
+ struct CudaFunctorAdapter<FunctorType,ExecPolicy,ValueType,void> {
+ const FunctorType f;
+ typedef ValueType value_type;
+ CudaFunctorAdapter(const FunctorType& f_):f(f_) {}
+
+ __device__ inline
+ void operator() (const typename ExecPolicy::member_type& i, ValueType& val) const {
+ //Insert Static Assert with decltype on ValueType equals second argument type of FunctorType::operator()
+ f(i,val);
+ }
+
+ };
+
+ template< class FunctorType, class Enable = void>
+ struct ReduceFunctorHasInit {
+ enum {value = false};
+ };
+
+ template< class FunctorType>
+ struct ReduceFunctorHasInit<FunctorType, typename Impl::enable_if< 0 < sizeof( & FunctorType::init ) >::type > {
+ enum {value = true};
+ };
+
+ template< class FunctorType, class Enable = void>
+ struct ReduceFunctorHasJoin {
+ enum {value = false};
+ };
+
+ template< class FunctorType>
+ struct ReduceFunctorHasJoin<FunctorType, typename Impl::enable_if< 0 < sizeof( & FunctorType::join ) >::type > {
+ enum {value = true};
+ };
+
+ template< class FunctorType, class Enable = void>
+ struct ReduceFunctorHasFinal {
+ enum {value = false};
+ };
+
+ template< class FunctorType>
+ struct ReduceFunctorHasFinal<FunctorType, typename Impl::enable_if< 0 < sizeof( & FunctorType::final ) >::type > {
+ enum {value = true};
+ };
+
+ template< class FunctorType, bool Enable =
+ ( FunctorDeclaresValueType<FunctorType,void>::value) ||
+ ( ReduceFunctorHasInit<FunctorType>::value ) ||
+ ( ReduceFunctorHasJoin<FunctorType>::value ) ||
+ ( ReduceFunctorHasFinal<FunctorType>::value )
+ >
+ struct IsNonTrivialReduceFunctor {
+ enum {value = false};
+ };
+
+ template< class FunctorType>
+ struct IsNonTrivialReduceFunctor<FunctorType, true> {
+ enum {value = true};
+ };
+
+ template<class FunctorType, class ResultType, class Tag, bool Enable = IsNonTrivialReduceFunctor<FunctorType>::value >
+ struct FunctorReferenceType {
+ typedef ResultType& reference_type;
+ };
+
+ template<class FunctorType, class ResultType, class Tag>
+ struct FunctorReferenceType<FunctorType, ResultType, Tag, true> {
+ typedef typename Kokkos::Impl::FunctorValueTraits< FunctorType ,Tag >::reference_type reference_type;
+ };
+
+}
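// An illustrative sketch (editorial, not part of this patch) of a functor that the
// detection traits above classify as "non-trivial": it declares value_type and
// provides init/join/final. The view `x` is a placeholder and would be assigned
// before use.
struct MaxAbsFunctor {
  typedef double value_type;                 // picked up by FunctorDeclaresValueType
  Kokkos::View<const double*, Kokkos::Cuda> x;

  KOKKOS_INLINE_FUNCTION
  void operator()( const int i, value_type & update ) const {
    const double a = ( x(i) < 0.0 ) ? -x(i) : x(i);
    if ( a > update ) update = a;
  }
  KOKKOS_INLINE_FUNCTION
  void init( value_type & update ) const { update = 0.0; }        // ReduceFunctorHasInit
  KOKKOS_INLINE_FUNCTION
  void join( volatile value_type & update,
             const volatile value_type & input ) const {          // ReduceFunctorHasJoin
    if ( input > update ) update = input;
  }
  KOKKOS_INLINE_FUNCTION
  void final( value_type & ) const {}                             // ReduceFunctorHasFinal
};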
+
+// general policy and view output
+template< class ExecPolicy , class FunctorTypeIn , class ViewType >
+inline
+void parallel_reduce( const ExecPolicy & policy
+ , const FunctorTypeIn & functor_in
+ , const ViewType & result_view
+ , const std::string& str = ""
+ , typename Impl::enable_if<
+ ( Impl::is_view<ViewType>::value && ! Impl::is_integral< ExecPolicy >::value &&
+ Impl::is_same<typename ExecPolicy::execution_space,Kokkos::Cuda>::value
+ )>::type * = 0 )
+{
+ enum {FunctorHasValueType = Impl::IsNonTrivialReduceFunctor<FunctorTypeIn>::value };
+ typedef typename Kokkos::Impl::if_c<FunctorHasValueType, FunctorTypeIn, Impl::CudaFunctorAdapter<FunctorTypeIn,ExecPolicy,typename ViewType::value_type> >::type FunctorType;
+ FunctorType functor = Impl::if_c<FunctorHasValueType,FunctorTypeIn,FunctorType>::select(functor_in,FunctorType(functor_in));
+
+#ifdef KOKKOSP_ENABLE_PROFILING
+ uint64_t kpID = 0;
+ if(Kokkos::Experimental::profileLibraryLoaded()) {
+ Kokkos::Experimental::beginParallelScan("" == str ? typeid(FunctorType).name() : str, 0, &kpID);
+ }
#endif
+
+ (void) Impl::ParallelReduce< FunctorType, ExecPolicy >( functor , policy , result_view );
+
+#ifdef KOKKOSP_ENABLE_PROFILING
+ if(Kokkos::Experimental::profileLibraryLoaded()) {
+ Kokkos::Experimental::endParallelScan(kpID);
+ }
+#endif
+}
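// An illustrative usage sketch (editorial, not part of this patch) of the
// View-output overload above: the scalar result is written into a rank-0 host
// view; `n` is a placeholder and device-lambda support is assumed.
Kokkos::View<double, Kokkos::HostSpace> total( "total" );
Kokkos::parallel_reduce( Kokkos::RangePolicy<Kokkos::Cuda>( 0, n ),
  KOKKOS_LAMBDA( const int i, double & val ) { val += 1.0 / ( i + 1.0 ); },
  total );
Kokkos::fence();          // make sure the device work has completed before reading
const double h = total(); // rank-0 view: access with operator() and no arguments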
- KOKKOS_FORCEINLINE_FUNCTION
- static int thread_rank(const team_member &dev) {
- return dev.team_rank()/increment;
- }
+// general policy and pod or array of pod output
+template< class ExecPolicy , class FunctorTypeIn , class ResultType>
+inline
+void parallel_reduce( const ExecPolicy & policy
+ , const FunctorTypeIn & functor_in
+ , ResultType& result_ref
+ , const std::string& str = ""
+ , typename Impl::enable_if<
+ ( ! Impl::is_view<ResultType>::value &&
+ ! Impl::IsNonTrivialReduceFunctor<FunctorTypeIn>::value &&
+ ! Impl::is_integral< ExecPolicy >::value &&
+ Impl::is_same<typename ExecPolicy::execution_space,Kokkos::Cuda>::value )>::type * = 0 )
+{
+ typedef typename Impl::CudaFunctorAdapter<FunctorTypeIn,ExecPolicy,ResultType> FunctorType;
+
+ typedef Kokkos::Impl::FunctorValueTraits< FunctorType , typename ExecPolicy::work_tag > ValueTraits ;
+ typedef Kokkos::Impl::FunctorValueOps< FunctorType , typename ExecPolicy::work_tag > ValueOps ;
+
+ // Wrap the result output request in a view to inform the implementation
+ // of the type and memory space.
+
+ typedef typename Kokkos::Impl::if_c< (ValueTraits::StaticValueSize != 0)
+ , typename ValueTraits::value_type
+ , typename ValueTraits::pointer_type
+ >::type value_type ;
+ Kokkos::View< value_type
+ , HostSpace
+ , Kokkos::MemoryUnmanaged
+ >
+ result_view( ValueOps::pointer( result_ref )
+ , 1
+ );
+
+#ifdef KOKKOSP_ENABLE_PROFILING
+ uint64_t kpID = 0;
+ if(Kokkos::Experimental::profileLibraryLoaded()) {
+ Kokkos::Experimental::beginParallelScan("" == str ? typeid(FunctorType).name() : str, 0, &kpID);
+ }
+#endif
+
+ (void) Impl::ParallelReduce< FunctorType, ExecPolicy >( FunctorType(functor_in) , policy , result_view );
+
+#ifdef KOKKOSP_ENABLE_PROFILING
+ if(Kokkos::Experimental::profileLibraryLoaded()) {
+ Kokkos::Experimental::endParallelScan(kpID);
+ }
+#endif
+}
- KOKKOS_FORCEINLINE_FUNCTION
- static int team_rank(const team_member &dev) {
- return dev.team_rank()/increment;
- }
+// general policy and pod or array of pod output
+template< class ExecPolicy , class FunctorType>
+inline
+void parallel_reduce( const ExecPolicy & policy
+ , const FunctorType & functor
+ , typename Kokkos::Impl::FunctorValueTraits< FunctorType , typename ExecPolicy::work_tag >::reference_type result_ref
+ , const std::string& str = ""
+ , typename Impl::enable_if<
+ ( Impl::IsNonTrivialReduceFunctor<FunctorType>::value &&
+ ! Impl::is_integral< ExecPolicy >::value &&
+ Impl::is_same<typename ExecPolicy::execution_space,Kokkos::Cuda>::value )>::type * = 0 )
+{
+ typedef Kokkos::Impl::FunctorValueTraits< FunctorType , typename ExecPolicy::work_tag > ValueTraits ;
+ typedef Kokkos::Impl::FunctorValueOps< FunctorType , typename ExecPolicy::work_tag > ValueOps ;
+
+ // Wrap the result output request in a view to inform the implementation
+ // of the type and memory space.
+
+ typedef typename Kokkos::Impl::if_c< (ValueTraits::StaticValueSize != 0)
+ , typename ValueTraits::value_type
+ , typename ValueTraits::pointer_type
+ >::type value_type ;
+
+ Kokkos::View< value_type
+ , HostSpace
+ , Kokkos::MemoryUnmanaged
+ >
+ result_view( ValueOps::pointer( result_ref )
+ , ValueTraits::value_count( functor )
+ );
+
+#ifdef KOKKOSP_ENABLE_PROFILING
+ uint64_t kpID = 0;
+ if(Kokkos::Experimental::profileLibraryLoaded()) {
+ Kokkos::Experimental::beginParallelScan("" == str ? typeid(FunctorType).name() : str, 0, &kpID);
+ }
+#endif
+
+ (void) Impl::ParallelReduce< FunctorType, ExecPolicy >( functor , policy , result_view );
+
+#ifdef KOKKOSP_ENABLE_PROFILING
+ if(Kokkos::Experimental::profileLibraryLoaded()) {
+ Kokkos::Experimental::endParallelScan(kpID);
+ }
+#endif
+}
- KOKKOS_FORCEINLINE_FUNCTION
- static int team_size(const team_member &dev) {
- return dev.team_size()/increment;
- }
+// integral range policy and view output
+template< class FunctorTypeIn , class ViewType >
+inline
+void parallel_reduce( const size_t work_count
+ , const FunctorTypeIn & functor_in
+ , const ViewType & result_view
+ , const std::string& str = ""
+ , typename Impl::enable_if<( Impl::is_view<ViewType>::value &&
+ Impl::is_same<
+ typename Impl::FunctorPolicyExecutionSpace< FunctorTypeIn , void >::execution_space,
+ Kokkos::Cuda>::value
+ )>::type * = 0 )
+{
+ enum {FunctorHasValueType = Impl::IsNonTrivialReduceFunctor<FunctorTypeIn>::value };
+ typedef typename
+ Impl::FunctorPolicyExecutionSpace< FunctorTypeIn , void >::execution_space
+ execution_space ;
- KOKKOS_FORCEINLINE_FUNCTION
- static int global_thread_rank(const team_member &dev) {
- return (dev.league_rank()*dev.team_size()+dev.team_rank())/increment;
- }
+ typedef RangePolicy< execution_space > ExecPolicy ;
- KOKKOS_FORCEINLINE_FUNCTION
- static bool is_lane_0(const team_member &dev) {
- return (dev.team_rank()%increment)==0;
- }
+ typedef typename Kokkos::Impl::if_c<FunctorHasValueType, FunctorTypeIn, Impl::CudaFunctorAdapter<FunctorTypeIn,ExecPolicy,typename ViewType::value_type> >::type FunctorType;
- template<class Scalar>
- KOKKOS_INLINE_FUNCTION
- static Scalar reduce(const Scalar& val) {
- #ifdef __CUDA_ARCH__
- __shared__ Scalar result[256];
- Scalar myresult;
- for(int k=0;k<blockDim.y;k+=256) {
- const int tid = threadIdx.y - k;
- if(tid > 0 && tid<256) {
- result[tid] = val;
- if ( (N > 1) && (tid%2==0) )
- result[tid] += result[tid+1];
- if ( (N > 2) && (tid%4==0) )
- result[tid] += result[tid+2];
- if ( (N > 4) && (tid%8==0) )
- result[tid] += result[tid+4];
- if ( (N > 8) && (tid%16==0) )
- result[tid] += result[tid+8];
- if ( (N > 16) && (tid%32==0) )
- result[tid] += result[tid+16];
- myresult = result[tid];
- }
- if(blockDim.y>256)
- __syncthreads();
- }
- return myresult;
- #else
- return val;
- #endif
- }
+ FunctorType functor = Impl::if_c<FunctorHasValueType,FunctorTypeIn,FunctorType>::select(functor_in,FunctorType(functor_in));
-#ifdef __CUDA_ARCH__
- #if (__CUDA_ARCH__ >= 300)
- KOKKOS_INLINE_FUNCTION
- static int reduce(const int& val) {
- int result = val;
- if (N > 1)
- result += shfl_down(result, 1,N);
- if (N > 2)
- result += shfl_down(result, 2,N);
- if (N > 4)
- result += shfl_down(result, 4,N);
- if (N > 8)
- result += shfl_down(result, 8,N);
- if (N > 16)
- result += shfl_down(result, 16,N);
- return result;
- }
+#ifdef KOKKOSP_ENABLE_PROFILING
+ uint64_t kpID = 0;
+ if(Kokkos::Experimental::profileLibraryLoaded()) {
+ Kokkos::Experimental::beginParallelScan("" == str ? typeid(FunctorType).name() : str, 0, &kpID);
+ }
+#endif
+
+ (void) Impl::ParallelReduce< FunctorType, ExecPolicy >( functor , ExecPolicy(0,work_count) , result_view );
- KOKKOS_INLINE_FUNCTION
- static unsigned int reduce(const unsigned int& val) {
- unsigned int result = val;
- if (N > 1)
- result += shfl_down(result, 1,N);
- if (N > 2)
- result += shfl_down(result, 2,N);
- if (N > 4)
- result += shfl_down(result, 4,N);
- if (N > 8)
- result += shfl_down(result, 8,N);
- if (N > 16)
- result += shfl_down(result, 16,N);
- return result;
- }
+#ifdef KOKKOSP_ENABLE_PROFILING
+ if(Kokkos::Experimental::profileLibraryLoaded()) {
+ Kokkos::Experimental::endParallelScan(kpID);
+ }
+#endif
- KOKKOS_INLINE_FUNCTION
- static long int reduce(const long int& val) {
- long int result = val;
- if (N > 1)
- result += shfl_down(result, 1,N);
- if (N > 2)
- result += shfl_down(result, 2,N);
- if (N > 4)
- result += shfl_down(result, 4,N);
- if (N > 8)
- result += shfl_down(result, 8,N);
- if (N > 16)
- result += shfl_down(result, 16,N);
- return result;
- }
+}
- KOKKOS_INLINE_FUNCTION
- static unsigned long int reduce(const unsigned long int& val) {
- unsigned long int result = val;
- if (N > 1)
- result += shfl_down(result, 1,N);
- if (N > 2)
- result += shfl_down(result, 2,N);
- if (N > 4)
- result += shfl_down(result, 4,N);
- if (N > 8)
- result += shfl_down(result, 8,N);
- if (N > 16)
- result += shfl_down(result, 16,N);
- return result;
- }
+// integral range policy and pod or array of pod output
+template< class FunctorTypeIn , class ResultType>
+inline
+void parallel_reduce( const size_t work_count
+ , const FunctorTypeIn & functor_in
+ , ResultType& result
+ , const std::string& str = ""
+ , typename Impl::enable_if< ! Impl::is_view<ResultType>::value &&
+ ! Impl::IsNonTrivialReduceFunctor<FunctorTypeIn>::value &&
+ Impl::is_same<
+ typename Impl::FunctorPolicyExecutionSpace< FunctorTypeIn , void >::execution_space,
+ Kokkos::Cuda>::value >::type * = 0 )
+{
+ typedef typename
+ Kokkos::Impl::FunctorPolicyExecutionSpace< FunctorTypeIn , void >::execution_space
+ execution_space ;
+ typedef Kokkos::RangePolicy< execution_space > ExecPolicy ;
- KOKKOS_INLINE_FUNCTION
- static float reduce(const float& val) {
- float result = val;
- if (N > 1)
- result += shfl_down(result, 1,N);
- if (N > 2)
- result += shfl_down(result, 2,N);
- if (N > 4)
- result += shfl_down(result, 4,N);
- if (N > 8)
- result += shfl_down(result, 8,N);
- if (N > 16)
- result += shfl_down(result, 16,N);
- return result;
- }
+ typedef Impl::CudaFunctorAdapter<FunctorTypeIn,ExecPolicy,ResultType> FunctorType;
- KOKKOS_INLINE_FUNCTION
- static double reduce(const double& val) {
- double result = val;
- if (N > 1)
- result += shfl_down(result, 1,N);
- if (N > 2)
- result += shfl_down(result, 2,N);
- if (N > 4)
- result += shfl_down(result, 4,N);
- if (N > 8)
- result += shfl_down(result, 8,N);
- if (N > 16)
- result += shfl_down(result, 16,N);
- return result;
- }
- #endif
+
+ typedef Kokkos::Impl::FunctorValueTraits< FunctorType , void > ValueTraits ;
+ typedef Kokkos::Impl::FunctorValueOps< FunctorType , void > ValueOps ;
+
+
+ // Wrap the result output request in a view to inform the implementation
+ // of the type and memory space.
+
+ typedef typename Kokkos::Impl::if_c< (ValueTraits::StaticValueSize != 0)
+ , typename ValueTraits::value_type
+ , typename ValueTraits::pointer_type
+ >::type value_type ;
+
+ Kokkos::View< value_type
+ , HostSpace
+ , Kokkos::MemoryUnmanaged
+ >
+ result_view( ValueOps::pointer( result )
+ , 1
+ );
+
+#ifdef KOKKOSP_ENABLE_PROFILING
+ uint64_t kpID = 0;
+ if(Kokkos::Experimental::profileLibraryLoaded()) {
+ Kokkos::Experimental::beginParallelScan("" == str ? typeid(FunctorType).name() : str, 0, &kpID);
+ }
#endif
+
+ (void) Impl::ParallelReduce< FunctorType , ExecPolicy >( FunctorType(functor_in) , ExecPolicy(0,work_count) , result_view );
+
+#ifdef KOKKOSP_ENABLE_PROFILING
+ if(Kokkos::Experimental::profileLibraryLoaded()) {
+ Kokkos::Experimental::endParallelScan(kpID);
+ }
+#endif
+}
-};
+template< class FunctorType>
+inline
+void parallel_reduce( const size_t work_count
+ , const FunctorType & functor
+ , typename Kokkos::Impl::FunctorValueTraits< FunctorType , void >::reference_type result
+ , const std::string& str = ""
+ , typename Impl::enable_if< Impl::IsNonTrivialReduceFunctor<FunctorType>::value &&
+ Impl::is_same<
+ typename Impl::FunctorPolicyExecutionSpace< FunctorType , void >::execution_space,
+ Kokkos::Cuda>::value >::type * = 0 )
+{
+
+ typedef typename
+ Kokkos::Impl::FunctorPolicyExecutionSpace< FunctorType , void >::execution_space
+ execution_space ;
+ typedef Kokkos::RangePolicy< execution_space > ExecPolicy ;
+
+
+
+ typedef Kokkos::Impl::FunctorValueTraits< FunctorType , void > ValueTraits ;
+ typedef Kokkos::Impl::FunctorValueOps< FunctorType , void > ValueOps ;
+
+
+ // Wrap the result output request in a view to inform the implementation
+ // of the type and memory space.
+
+ typedef typename Kokkos::Impl::if_c< (ValueTraits::StaticValueSize != 0)
+ , typename ValueTraits::value_type
+ , typename ValueTraits::pointer_type
+ >::type value_type ;
+
+ Kokkos::View< value_type
+ , HostSpace
+ , Kokkos::MemoryUnmanaged
+ >
+ result_view( ValueOps::pointer( result )
+ , ValueTraits::value_count( functor )
+ );
+
+#ifdef KOKKOSP_ENABLE_PROFILING
+ uint64_t kpID = 0;
+ if(Kokkos::Experimental::profileLibraryLoaded()) {
+ Kokkos::Experimental::beginParallelScan("" == str ? typeid(FunctorType).name() : str, 0, &kpID);
+ }
+#endif
+
+ (void) Impl::ParallelReduce< FunctorType , ExecPolicy >( functor , ExecPolicy(0,work_count) , result_view );
+
+#ifdef KOKKOSP_ENABLE_PROFILING
+ if(Kokkos::Experimental::profileLibraryLoaded()) {
+ Kokkos::Experimental::endParallelScan(kpID);
+ }
+#endif
}
+} // namespace Kokkos
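// An illustrative usage sketch (editorial, not part of this patch) of the simplest
// call path handled by the overloads above: an integral work count, a lambda, and
// a plain double result; `n` is a placeholder.
double pi_estimate = 0.0;
Kokkos::parallel_reduce( n,
  KOKKOS_LAMBDA( const int i, double & val ) {
    const double xm = ( i + 0.5 ) / n;
    val += 4.0 / ( 1.0 + xm * xm );
  },
  pi_estimate );
pi_estimate /= n;   // midpoint-rule estimate of pi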
#endif /* defined( __CUDACC__ ) */
#endif /* #ifndef KOKKOS_CUDA_PARALLEL_HPP */
diff --git a/lib/kokkos/core/src/Cuda/Kokkos_Cuda_ReduceScan.hpp b/lib/kokkos/core/src/Cuda/Kokkos_Cuda_ReduceScan.hpp
index a723f629a..5ef16711e 100755
--- a/lib/kokkos/core/src/Cuda/Kokkos_Cuda_ReduceScan.hpp
+++ b/lib/kokkos/core/src/Cuda/Kokkos_Cuda_ReduceScan.hpp
@@ -1,421 +1,424 @@
/*
//@HEADER
// ************************************************************************
//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_CUDA_REDUCESCAN_HPP
#define KOKKOS_CUDA_REDUCESCAN_HPP
-#if defined( __CUDACC__ )
+#include <Kokkos_Macros.hpp>
+
+/* only compile this file if CUDA is enabled for Kokkos */
+#if defined( __CUDACC__ ) && defined( KOKKOS_HAVE_CUDA )
#include <utility>
#include <Kokkos_Parallel.hpp>
#include <impl/Kokkos_FunctorAdapter.hpp>
#include <impl/Kokkos_Error.hpp>
#include <Cuda/Kokkos_Cuda_Vectorization.hpp>
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
//Shfl based reductions
/*
* Algorithmic constraints:
* (a) threads with same threadIdx.y have same value
* (b) blockDim.x == power of two
* (c) blockDim.z == 1
*/
template< class ValueType , class JoinOp>
__device__
inline void cuda_intra_warp_reduction( ValueType& result,
const JoinOp& join,
const int max_active_thread = blockDim.y) {
unsigned int shift = 1;
//Reduce over values from threads with different threadIdx.y
while(blockDim.x * shift < 32 ) {
const ValueType tmp = shfl_down(result, blockDim.x*shift,32u);
//Only join if the upper thread is active (this allows a non-power-of-two blockDim.y)
if(threadIdx.y + shift < max_active_thread)
join(result , tmp);
shift*=2;
}
result = shfl(result,0,32);
}
template< class ValueType , class JoinOp>
__device__
inline void cuda_inter_warp_reduction( ValueType& value,
const JoinOp& join,
const int max_active_thread = blockDim.y) {
#define STEP_WIDTH 4
__shared__ char sh_result[sizeof(ValueType)*STEP_WIDTH];
ValueType* result = (ValueType*) & sh_result;
const unsigned step = 32 / blockDim.x;
unsigned shift = STEP_WIDTH;
const int id = threadIdx.y%step==0?threadIdx.y/step:65000;
if(id < STEP_WIDTH ) {
result[id] = value;
}
__syncthreads();
while (shift<=max_active_thread/step) {
if(shift<=id && shift+STEP_WIDTH>id && threadIdx.x==0) {
join(result[id%STEP_WIDTH],value);
}
__syncthreads();
shift+=STEP_WIDTH;
}
value = result[0];
for(int i = 1; (i*step<=max_active_thread) && i<STEP_WIDTH; i++)
join(value,result[i]);
}
template< class ValueType , class JoinOp>
__device__
inline void cuda_intra_block_reduction( ValueType& value,
const JoinOp& join,
const int max_active_thread = blockDim.y) {
cuda_intra_warp_reduction(value,join,max_active_thread);
cuda_inter_warp_reduction(value,join,max_active_thread);
}
template< class FunctorType , class JoinOp>
__device__
bool cuda_inter_block_reduction( typename FunctorValueTraits< FunctorType , void >::reference_type value,
const JoinOp& join,
Cuda::size_type * const m_scratch_space,
typename FunctorValueTraits< FunctorType , void >::pointer_type const result,
Cuda::size_type * const m_scratch_flags,
const int max_active_thread = blockDim.y) {
typedef typename FunctorValueTraits< FunctorType , void >::pointer_type pointer_type;
typedef typename FunctorValueTraits< FunctorType , void >::value_type value_type;
//Do the intra-block reduction with shfl operations and static shared memory
cuda_intra_block_reduction(value,join,max_active_thread);
const unsigned id = threadIdx.y*blockDim.x + threadIdx.x;
//One thread in the block writes block result to global scratch_memory
if(id == 0 ) {
pointer_type global = ((pointer_type) m_scratch_space) + blockIdx.x;
*global = value;
}
//One warp of last block performs inter block reduction through loading the block values from global scratch_memory
bool last_block = false;
__syncthreads();
if ( id < 32 ) {
Cuda::size_type count;
//Figure out whether this is the last block
if(id == 0)
count = Kokkos::atomic_fetch_add(m_scratch_flags,1);
count = Kokkos::shfl(count,0,32);
//Last block does the inter block reduction
if( count == gridDim.x - 1) {
//set flag back to zero
if(id == 0)
*m_scratch_flags = 0;
last_block = true;
value = 0;
pointer_type const volatile global = (pointer_type) m_scratch_space ;
//Reduce all global values with splitting work over threads in one warp
const int step_size = blockDim.x*blockDim.y < 32 ? blockDim.x*blockDim.y : 32;
for(int i=id; i<gridDim.x; i+=step_size) {
value_type tmp = global[i];
join(value, tmp);
}
//Perform shfl reductions within the warp; only join if the contribution is valid (allows gridDim.x to be non-power-of-two and <32)
if (blockDim.x*blockDim.y > 1) {
value_type tmp = Kokkos::shfl_down(value, 1,32);
if( id + 1 < gridDim.x )
join(value, tmp);
}
if (blockDim.x*blockDim.y > 2) {
value_type tmp = Kokkos::shfl_down(value, 2,32);
if( id + 2 < gridDim.x )
join(value, tmp);
}
if (blockDim.x*blockDim.y > 4) {
value_type tmp = Kokkos::shfl_down(value, 4,32);
if( id + 4 < gridDim.x )
join(value, tmp);
}
if (blockDim.x*blockDim.y > 8) {
value_type tmp = Kokkos::shfl_down(value, 8,32);
if( id + 8 < gridDim.x )
join(value, tmp);
}
if (blockDim.x*blockDim.y > 16) {
value_type tmp = Kokkos::shfl_down(value, 16,32);
if( id + 16 < gridDim.x )
join(value, tmp);
}
}
}
//The last block has in its thread=0 the global reduction value through "value"
return last_block;
}
//----------------------------------------------------------------------------
// See section B.17 of Cuda C Programming Guide Version 3.2
// for discussion of
// __launch_bounds__(maxThreadsPerBlock,minBlocksPerMultiprocessor)
// function qualifier which could be used to improve performance.
//----------------------------------------------------------------------------
// Maximize shared memory and minimize L1 cache:
// cudaFuncSetCacheConfig(MyKernel, cudaFuncCachePreferShared );
// For 2.0 capability: 48 KB shared and 16 KB L1
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
/*
* Algorithmic constraints:
* (a) blockDim.y is a power of two
* (b) blockDim.y <= 512
* (c) blockDim.x == blockDim.z == 1
*/
template< bool DoScan , class FunctorType , class ArgTag >
__device__
void cuda_intra_block_reduce_scan( const FunctorType & functor ,
const typename FunctorValueTraits< FunctorType , ArgTag >::pointer_type base_data )
{
typedef FunctorValueTraits< FunctorType , ArgTag > ValueTraits ;
typedef FunctorValueJoin< FunctorType , ArgTag > ValueJoin ;
typedef typename ValueTraits::pointer_type pointer_type ;
const unsigned value_count = ValueTraits::value_count( functor );
const unsigned BlockSizeMask = blockDim.y - 1 ;
// Must have power of two thread count
if ( BlockSizeMask & blockDim.y ) { Kokkos::abort("Cuda::cuda_intra_block_scan requires power-of-two blockDim"); }
#define BLOCK_REDUCE_STEP( R , TD , S ) \
if ( ! ( R & ((1<<(S+1))-1) ) ) { ValueJoin::join( functor , TD , (TD - (value_count<<S)) ); }
#define BLOCK_SCAN_STEP( TD , N , S ) \
if ( N == (1<<S) ) { ValueJoin::join( functor , TD , (TD - (value_count<<S))); }
const unsigned rtid_intra = threadIdx.y ^ BlockSizeMask ;
const pointer_type tdata_intra = base_data + value_count * threadIdx.y ;
{ // Intra-warp reduction:
BLOCK_REDUCE_STEP(rtid_intra,tdata_intra,0)
BLOCK_REDUCE_STEP(rtid_intra,tdata_intra,1)
BLOCK_REDUCE_STEP(rtid_intra,tdata_intra,2)
BLOCK_REDUCE_STEP(rtid_intra,tdata_intra,3)
BLOCK_REDUCE_STEP(rtid_intra,tdata_intra,4)
}
__syncthreads(); // Wait for all warps to reduce
{ // Inter-warp reduce-scan by a single warp to avoid extra synchronizations
const unsigned rtid_inter = ( threadIdx.y ^ BlockSizeMask ) << CudaTraits::WarpIndexShift ;
if ( rtid_inter < blockDim.y ) {
const pointer_type tdata_inter = base_data + value_count * ( rtid_inter ^ BlockSizeMask );
if ( (1<<5) < BlockSizeMask ) { BLOCK_REDUCE_STEP(rtid_inter,tdata_inter,5) }
if ( (1<<6) < BlockSizeMask ) { __threadfence_block(); BLOCK_REDUCE_STEP(rtid_inter,tdata_inter,6) }
if ( (1<<7) < BlockSizeMask ) { __threadfence_block(); BLOCK_REDUCE_STEP(rtid_inter,tdata_inter,7) }
if ( (1<<8) < BlockSizeMask ) { __threadfence_block(); BLOCK_REDUCE_STEP(rtid_inter,tdata_inter,8) }
if ( DoScan ) {
int n = ( rtid_inter & 32 ) ? 32 : (
( rtid_inter & 64 ) ? 64 : (
( rtid_inter & 128 ) ? 128 : (
( rtid_inter & 256 ) ? 256 : 0 )));
if ( ! ( rtid_inter + n < blockDim.y ) ) n = 0 ;
BLOCK_SCAN_STEP(tdata_inter,n,8)
BLOCK_SCAN_STEP(tdata_inter,n,7)
BLOCK_SCAN_STEP(tdata_inter,n,6)
BLOCK_SCAN_STEP(tdata_inter,n,5)
}
}
}
__syncthreads(); // Wait for inter-warp reduce-scan to complete
if ( DoScan ) {
int n = ( rtid_intra & 1 ) ? 1 : (
( rtid_intra & 2 ) ? 2 : (
( rtid_intra & 4 ) ? 4 : (
( rtid_intra & 8 ) ? 8 : (
( rtid_intra & 16 ) ? 16 : 0 ))));
if ( ! ( rtid_intra + n < blockDim.y ) ) n = 0 ;
BLOCK_SCAN_STEP(tdata_intra,n,4) __threadfence_block();
BLOCK_SCAN_STEP(tdata_intra,n,3) __threadfence_block();
BLOCK_SCAN_STEP(tdata_intra,n,2) __threadfence_block();
BLOCK_SCAN_STEP(tdata_intra,n,1) __threadfence_block();
BLOCK_SCAN_STEP(tdata_intra,n,0)
}
#undef BLOCK_SCAN_STEP
#undef BLOCK_REDUCE_STEP
}
//----------------------------------------------------------------------------
/**\brief Input value-per-thread starting at 'shared_data'.
* Reduction value at last thread's location.
*
* If 'DoScan' then write blocks' scan values and block-groups' scan values.
*
* Global reduce result is in the last threads' 'shared_data' location.
*/
template< bool DoScan , class FunctorType , class ArgTag >
__device__
bool cuda_single_inter_block_reduce_scan( const FunctorType & functor ,
const Cuda::size_type block_id ,
const Cuda::size_type block_count ,
Cuda::size_type * const shared_data ,
Cuda::size_type * const global_data ,
Cuda::size_type * const global_flags )
{
typedef Cuda::size_type size_type ;
typedef FunctorValueTraits< FunctorType , ArgTag > ValueTraits ;
typedef FunctorValueJoin< FunctorType , ArgTag > ValueJoin ;
typedef FunctorValueInit< FunctorType , ArgTag > ValueInit ;
typedef FunctorValueOps< FunctorType , ArgTag > ValueOps ;
typedef typename ValueTraits::pointer_type pointer_type ;
typedef typename ValueTraits::reference_type reference_type ;
const unsigned BlockSizeMask = blockDim.y - 1 ;
const unsigned BlockSizeShift = power_of_two_if_valid( blockDim.y );
// Must have power of two thread count
if ( BlockSizeMask & blockDim.y ) { Kokkos::abort("Cuda::cuda_single_inter_block_reduce_scan requires power-of-two blockDim"); }
const integral_nonzero_constant< size_type , ValueTraits::StaticValueSize / sizeof(size_type) >
word_count( ValueTraits::value_size( functor ) / sizeof(size_type) );
// Reduce the accumulation for the entire block.
cuda_intra_block_reduce_scan<false,FunctorType,ArgTag>( functor , pointer_type(shared_data) );
{
// Write accumulation total to global scratch space.
// Accumulation total is the last thread's data.
size_type * const shared = shared_data + word_count.value * BlockSizeMask ;
size_type * const global = global_data + word_count.value * block_id ;
for ( size_type i = threadIdx.y ; i < word_count.value ; i += blockDim.y ) { global[i] = shared[i] ; }
}
// Contributing blocks note that their contribution has been completed via an atomic-increment flag
// If this block is not the last block to contribute to this group then the block is done.
const bool is_last_block =
! __syncthreads_or( threadIdx.y ? 0 : ( 1 + atomicInc( global_flags , block_count - 1 ) < block_count ) );
if ( is_last_block ) {
const size_type b = ( long(block_count) * long(threadIdx.y) ) >> BlockSizeShift ;
const size_type e = ( long(block_count) * long( threadIdx.y + 1 ) ) >> BlockSizeShift ;
{
void * const shared_ptr = shared_data + word_count.value * threadIdx.y ;
reference_type shared_value = ValueInit::init( functor , shared_ptr );
for ( size_type i = b ; i < e ; ++i ) {
ValueJoin::join( functor , shared_ptr , global_data + word_count.value * i );
}
}
cuda_intra_block_reduce_scan<DoScan,FunctorType,ArgTag>( functor , pointer_type(shared_data) );
if ( DoScan ) {
size_type * const shared_value = shared_data + word_count.value * ( threadIdx.y ? threadIdx.y - 1 : blockDim.y );
if ( ! threadIdx.y ) { ValueInit::init( functor , shared_value ); }
// Join previous inclusive scan value to each member
for ( size_type i = b ; i < e ; ++i ) {
size_type * const global_value = global_data + word_count.value * i ;
ValueJoin::join( functor , shared_value , global_value );
ValueOps ::copy( functor , global_value , shared_value );
}
}
}
return is_last_block ;
}
// Size in bytes required for inter block reduce or scan
template< bool DoScan , class FunctorType , class ArgTag >
inline
unsigned cuda_single_inter_block_reduce_scan_shmem( const FunctorType & functor , const unsigned BlockSize )
{
return ( BlockSize + 2 ) * Impl::FunctorValueTraits< FunctorType , ArgTag >::value_size( functor );
}
} // namespace Impl
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
#endif /* #if defined( __CUDACC__ ) */
#endif /* KOKKOS_CUDA_REDUCESCAN_HPP */
diff --git a/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Vectorization.hpp b/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Vectorization.hpp
index d6d1baa4e..0b8427cbe 100755
--- a/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Vectorization.hpp
+++ b/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Vectorization.hpp
@@ -1,291 +1,298 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_CUDA_VECTORIZATION_HPP
#define KOKKOS_CUDA_VECTORIZATION_HPP
+
+#include <Kokkos_Macros.hpp>
+
+/* only compile this file if CUDA is enabled for Kokkos */
+#ifdef KOKKOS_HAVE_CUDA
+
#include <Kokkos_Cuda.hpp>
namespace Kokkos {
// Shuffle only makes sense on >= Kepler GPUs; it doesn't work on CPUs
// or other GPUs. We provide a generic definition (which is trivial
// and doesn't do what it claims to do) because we don't actually use
// this function unless we are on a suitable GPU, with a suitable
// Scalar type. (For example, in the mat-vec, the "ThreadsPerRow"
// internal parameter depends both on the ExecutionSpace and the Scalar type,
// and it controls whether shfl_down() gets called.)
namespace Impl {
template< typename Scalar >
struct shfl_union {
enum {n = sizeof(Scalar)/4};
float fval[n];
KOKKOS_INLINE_FUNCTION
Scalar value() {
return *(Scalar*) fval;
}
KOKKOS_INLINE_FUNCTION
- void operator= (Scalar& value) {
- float* const val_ptr = (float*) &value;
+ void operator= (Scalar& value_) {
+ float* const val_ptr = (float*) &value_;
for(int i=0; i<n ; i++) {
fval[i] = val_ptr[i];
}
}
KOKKOS_INLINE_FUNCTION
- void operator= (const Scalar& value) {
- float* const val_ptr = (float*) &value;
+ void operator= (const Scalar& value_) {
+ float* const val_ptr = (float*) &value_;
for(int i=0; i<n ; i++) {
fval[i] = val_ptr[i];
}
}
};
}
#ifdef __CUDA_ARCH__
#if (__CUDA_ARCH__ >= 300)
KOKKOS_INLINE_FUNCTION
int shfl(const int &val, const int& srcLane, const int& width ) {
return __shfl(val,srcLane,width);
}
KOKKOS_INLINE_FUNCTION
float shfl(const float &val, const int& srcLane, const int& width ) {
return __shfl(val,srcLane,width);
}
template<typename Scalar>
KOKKOS_INLINE_FUNCTION
Scalar shfl(const Scalar &val, const int& srcLane, const typename Impl::enable_if< (sizeof(Scalar) == 4) , int >::type& width
) {
Scalar tmp1 = val;
float tmp = *reinterpret_cast<float*>(&tmp1);
tmp = __shfl(tmp,srcLane,width);
return *reinterpret_cast<Scalar*>(&tmp);
}
KOKKOS_INLINE_FUNCTION
double shfl(const double &val, const int& srcLane, const int& width) {
int lo = __double2loint(val);
int hi = __double2hiint(val);
lo = __shfl(lo,srcLane,width);
hi = __shfl(hi,srcLane,width);
return __hiloint2double(hi,lo);
}
template<typename Scalar>
KOKKOS_INLINE_FUNCTION
Scalar shfl(const Scalar &val, const int& srcLane, const typename Impl::enable_if< (sizeof(Scalar) == 8) ,int>::type& width) {
int lo = __double2loint(*reinterpret_cast<const double*>(&val));
int hi = __double2hiint(*reinterpret_cast<const double*>(&val));
lo = __shfl(lo,srcLane,width);
hi = __shfl(hi,srcLane,width);
const double tmp = __hiloint2double(hi,lo);
return *(reinterpret_cast<const Scalar*>(&tmp));
}
template<typename Scalar>
KOKKOS_INLINE_FUNCTION
Scalar shfl(const Scalar &val, const int& srcLane, const typename Impl::enable_if< (sizeof(Scalar) > 8) ,int>::type& width) {
Impl::shfl_union<Scalar> s_val;
Impl::shfl_union<Scalar> r_val;
s_val = val;
for(int i = 0; i<s_val.n; i++)
r_val.fval[i] = __shfl(s_val.fval[i],srcLane,width);
return r_val.value();
}
KOKKOS_INLINE_FUNCTION
int shfl_down(const int &val, const int& delta, const int& width) {
return __shfl_down(val,delta,width);
}
KOKKOS_INLINE_FUNCTION
float shfl_down(const float &val, const int& delta, const int& width) {
return __shfl_down(val,delta,width);
}
template<typename Scalar>
KOKKOS_INLINE_FUNCTION
Scalar shfl_down(const Scalar &val, const int& delta, const typename Impl::enable_if< (sizeof(Scalar) == 4) , int >::type & width) {
Scalar tmp1 = val;
float tmp = *reinterpret_cast<float*>(&tmp1);
tmp = __shfl_down(tmp,delta,width);
return *reinterpret_cast<Scalar*>(&tmp);
}
KOKKOS_INLINE_FUNCTION
double shfl_down(const double &val, const int& delta, const int& width) {
int lo = __double2loint(val);
int hi = __double2hiint(val);
lo = __shfl_down(lo,delta,width);
hi = __shfl_down(hi,delta,width);
return __hiloint2double(hi,lo);
}
template<typename Scalar>
KOKKOS_INLINE_FUNCTION
Scalar shfl_down(const Scalar &val, const int& delta, const typename Impl::enable_if< (sizeof(Scalar) == 8) , int >::type & width) {
int lo = __double2loint(*reinterpret_cast<const double*>(&val));
int hi = __double2hiint(*reinterpret_cast<const double*>(&val));
lo = __shfl_down(lo,delta,width);
hi = __shfl_down(hi,delta,width);
const double tmp = __hiloint2double(hi,lo);
return *(reinterpret_cast<const Scalar*>(&tmp));
}
template<typename Scalar>
KOKKOS_INLINE_FUNCTION
Scalar shfl_down(const Scalar &val, const int& delta, const typename Impl::enable_if< (sizeof(Scalar) > 8) , int >::type & width) {
Impl::shfl_union<Scalar> s_val;
Impl::shfl_union<Scalar> r_val;
s_val = val;
for(int i = 0; i<s_val.n; i++)
r_val.fval[i] = __shfl_down(s_val.fval[i],delta,width);
return r_val.value();
}
KOKKOS_INLINE_FUNCTION
int shfl_up(const int &val, const int& delta, const int& width ) {
return __shfl_up(val,delta,width);
}
KOKKOS_INLINE_FUNCTION
float shfl_up(const float &val, const int& delta, const int& width ) {
return __shfl_up(val,delta,width);
}
template<typename Scalar>
KOKKOS_INLINE_FUNCTION
Scalar shfl_up(const Scalar &val, const int& delta, const typename Impl::enable_if< (sizeof(Scalar) == 4) , int >::type & width) {
Scalar tmp1 = val;
float tmp = *reinterpret_cast<float*>(&tmp1);
tmp = __shfl_up(tmp,delta,width);
return *reinterpret_cast<Scalar*>(&tmp);
}
KOKKOS_INLINE_FUNCTION
double shfl_up(const double &val, const int& delta, const int& width ) {
int lo = __double2loint(val);
int hi = __double2hiint(val);
lo = __shfl_up(lo,delta,width);
hi = __shfl_up(hi,delta,width);
return __hiloint2double(hi,lo);
}
template<typename Scalar>
KOKKOS_INLINE_FUNCTION
Scalar shfl_up(const Scalar &val, const int& delta, const typename Impl::enable_if< (sizeof(Scalar) == 8) , int >::type & width) {
int lo = __double2loint(*reinterpret_cast<const double*>(&val));
int hi = __double2hiint(*reinterpret_cast<const double*>(&val));
lo = __shfl_up(lo,delta,width);
hi = __shfl_up(hi,delta,width);
const double tmp = __hiloint2double(hi,lo);
return *(reinterpret_cast<const Scalar*>(&tmp));
}
template<typename Scalar>
KOKKOS_INLINE_FUNCTION
Scalar shfl_up(const Scalar &val, const int& delta, const typename Impl::enable_if< (sizeof(Scalar) > 8) , int >::type & width) {
Impl::shfl_union<Scalar> s_val;
Impl::shfl_union<Scalar> r_val;
s_val = val;
for(int i = 0; i<s_val.n; i++)
r_val.fval[i] = __shfl_up(s_val.fval[i],delta,width);
return r_val.value();
}
#else
template<typename Scalar>
KOKKOS_INLINE_FUNCTION
Scalar shfl(const Scalar &val, const int& srcLane, const int& width) {
if(width > 1) Kokkos::abort("Error: calling shfl from a device with CC<3.0.");
return val;
}
template<typename Scalar>
KOKKOS_INLINE_FUNCTION
Scalar shfl_down(const Scalar &val, const int& delta, const int& width) {
if(width > 1) Kokkos::abort("Error: calling shfl_down from a device with CC<3.0.");
return val;
}
template<typename Scalar>
KOKKOS_INLINE_FUNCTION
Scalar shfl_up(const Scalar &val, const int& delta, const int& width) {
if(width > 1) Kokkos::abort("Error: calling shfl_up from a device with CC<3.0.");
return val;
}
#endif
#else
template<typename Scalar>
inline
Scalar shfl(const Scalar &val, const int& srcLane, const int& width) {
if(width > 1) Kokkos::abort("Error: calling shfl from a device with CC<3.0.");
return val;
}
template<typename Scalar>
inline
Scalar shfl_down(const Scalar &val, const int& delta, const int& width) {
if(width > 1) Kokkos::abort("Error: calling shfl_down from a device with CC<3.0.");
return val;
}
template<typename Scalar>
inline
Scalar shfl_up(const Scalar &val, const int& delta, const int& width) {
if(width > 1) Kokkos::abort("Error: calling shfl_up from a device with CC<3.0.");
return val;
}
#endif
}
+#endif // KOKKOS_HAVE_CUDA
#endif
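// An illustrative sketch (editorial, not part of this patch): a tree-style warp sum
// built from the shfl_down wrapper above, assuming all 32 lanes hold a value and
// compute capability >= 3.0 (the fallbacks above abort at runtime otherwise).
__device__ inline double warp_sum( double v ) {
  for ( int delta = 16; delta > 0; delta >>= 1 )
    v += Kokkos::shfl_down( v, delta, 32 );
  return v;   // lane 0 holds the full sum; Kokkos::shfl( v, 0, 32 ) broadcasts it
}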
diff --git a/lib/kokkos/core/src/Cuda/Kokkos_Cuda_View.hpp b/lib/kokkos/core/src/Cuda/Kokkos_Cuda_View.hpp
index 67c7214f8..a78ead0cb 100755
--- a/lib/kokkos/core/src/Cuda/Kokkos_Cuda_View.hpp
+++ b/lib/kokkos/core/src/Cuda/Kokkos_Cuda_View.hpp
@@ -1,299 +1,312 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_CUDA_VIEW_HPP
#define KOKKOS_CUDA_VIEW_HPP
+#include <Kokkos_Macros.hpp>
+
+/* only compile this file if CUDA is enabled for Kokkos */
+#ifdef KOKKOS_HAVE_CUDA
+
#include <cstring>
#include <Kokkos_HostSpace.hpp>
#include <Kokkos_CudaSpace.hpp>
-#include <Kokkos_CudaTypes.hpp>
#include <Kokkos_View.hpp>
+#include <Cuda/Kokkos_Cuda_BasicAllocators.hpp>
+
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
template<>
struct AssertShapeBoundsAbort< CudaSpace >
{
KOKKOS_INLINE_FUNCTION
static void apply( const size_t /* rank */ ,
const size_t /* n0 */ , const size_t /* n1 */ ,
const size_t /* n2 */ , const size_t /* n3 */ ,
const size_t /* n4 */ , const size_t /* n5 */ ,
const size_t /* n6 */ , const size_t /* n7 */ ,
const size_t /* arg_rank */ ,
const size_t /* i0 */ , const size_t /* i1 */ ,
const size_t /* i2 */ , const size_t /* i3 */ ,
const size_t /* i4 */ , const size_t /* i5 */ ,
const size_t /* i6 */ , const size_t /* i7 */ )
{
Kokkos::abort("Kokkos::View array bounds violation");
}
};
}
}
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
-// Cuda 5.0 <texture_types.h> defines 'cudaTextureObject_t'
-// to be an 'unsigned long long'. This chould change with
-// future version of Cuda and this typedef would have to
-// change accordingly.
-
-#if defined( CUDA_VERSION ) && ( 5000 <= CUDA_VERSION )
-
-typedef enable_if<
- sizeof(::cudaTextureObject_t) == sizeof(const void *) ,
- ::cudaTextureObject_t >::type cuda_texture_object_type ;
-
-#else
-
-typedef const void * cuda_texture_object_type ;
-
-#endif
-
//----------------------------------------------------------------------------
// Cuda Texture fetches can be performed for 4, 8 and 16 byte objects (int,int2,int4)
// Via reinterpret_cast this can be used to support all scalar types of those sizes.
// Any other scalar type falls back to either normal reads out of global memory,
// or using the __ldg intrinsic on Kepler GPUs or newer (Compute Capability >= 3.0)
template< typename ValueType
, class MemorySpace
, class AliasType =
typename Kokkos::Impl::if_c< ( sizeof(ValueType) == 4 ) , int ,
- typename Kokkos::Impl::if_c< ( sizeof(ValueType) == 8 ) , int2 ,
- typename Kokkos::Impl::if_c< ( sizeof(ValueType) == 16 ) , int4 , void
+ typename Kokkos::Impl::if_c< ( sizeof(ValueType) == 8 ) , ::int2 ,
+ typename Kokkos::Impl::if_c< ( sizeof(ValueType) == 16 ) , ::int4 , void
>::type
>::type
>::type
>
class CudaTextureFetch {
private:
cuda_texture_object_type m_obj ;
const ValueType * m_alloc_ptr ;
int m_offset ;
+ void attach( const ValueType * const arg_ptr, AllocationTracker const & tracker )
+ {
+ typedef char const * const byte;
+
+ m_alloc_ptr = reinterpret_cast<ValueType *>(tracker.alloc_ptr());
+
+ size_t byte_offset = reinterpret_cast<byte>(arg_ptr) - reinterpret_cast<byte>(m_alloc_ptr);
+ const bool ok_aligned = 0 == byte_offset % sizeof(ValueType);
+
+ const size_t count = tracker.alloc_size() / sizeof(ValueType);
+ const bool ok_contains = (m_alloc_ptr <= arg_ptr) && (arg_ptr < (m_alloc_ptr + count));
+
+ if (ok_aligned && ok_contains) {
+ if (tracker.attribute() == NULL ) {
+ MemorySpace::texture_object_attach(
+ tracker
+ , sizeof(ValueType)
+ , cudaCreateChannelDesc< AliasType >()
+ );
+ }
+ m_obj = dynamic_cast<TextureAttribute*>(tracker.attribute())->m_tex_obj;
+ m_offset = arg_ptr - m_alloc_ptr;
+ }
+ else if( !ok_contains ) {
+ throw_runtime_exception("Error: cannot attach a texture object to a tracker which does not bound the pointer.");
+ }
+ else {
+ throw_runtime_exception("Error: cannot attach a texture object to an incorrectly aligned pointer.");
+ }
+ }
+
public:
KOKKOS_INLINE_FUNCTION
- CudaTextureFetch() : m_obj( 0 ) , m_alloc_ptr(0) , m_offset(0) {}
+ CudaTextureFetch() : m_obj() , m_alloc_ptr() , m_offset() {}
KOKKOS_INLINE_FUNCTION
~CudaTextureFetch() {}
KOKKOS_INLINE_FUNCTION
CudaTextureFetch( const CudaTextureFetch & rhs )
: m_obj( rhs.m_obj )
, m_alloc_ptr( rhs.m_alloc_ptr )
, m_offset( rhs.m_offset )
{}
KOKKOS_INLINE_FUNCTION
CudaTextureFetch & operator = ( const CudaTextureFetch & rhs )
{
m_obj = rhs.m_obj ;
m_alloc_ptr = rhs.m_alloc_ptr ;
m_offset = rhs.m_offset ;
return *this ;
}
-
KOKKOS_INLINE_FUNCTION explicit
- CudaTextureFetch( const ValueType * const arg_ptr )
+ CudaTextureFetch( const ValueType * const arg_ptr, AllocationTracker const & tracker )
: m_obj( 0 ) , m_alloc_ptr(0) , m_offset(0)
{
-#if defined( __CUDACC__ ) && ! defined( __CUDA_ARCH__ )
- MemorySpace::texture_object_attach( arg_ptr
- , sizeof(ValueType)
- , cudaCreateChannelDesc< AliasType >()
- , & m_obj
- , reinterpret_cast<const void **>( & m_alloc_ptr )
- , & m_offset
- );
-#endif
- }
-
- KOKKOS_INLINE_FUNCTION
- CudaTextureFetch & operator = ( const ValueType * arg_ptr )
- {
-#if defined( __CUDACC__ ) && ! defined( __CUDA_ARCH__ )
- MemorySpace::texture_object_attach( arg_ptr
- , sizeof(ValueType)
- , cudaCreateChannelDesc< AliasType >()
- , & m_obj
- , reinterpret_cast<const void **>( & m_alloc_ptr )
- , & m_offset
- );
-#endif
- return *this ;
+ #if defined( KOKKOS_USE_LDG_INTRINSIC )
+ m_alloc_ptr(arg_ptr);
+ #elif defined( __CUDACC__ ) && ! defined( __CUDA_ARCH__ )
+ if ( arg_ptr != NULL ) {
+ if ( tracker.is_valid() ) {
+ attach( arg_ptr, tracker );
+ }
+ else {
+ AllocationTracker found_tracker = AllocationTracker::find<typename MemorySpace::allocator>(arg_ptr);
+ if ( found_tracker.is_valid() ) {
+ attach( arg_ptr, found_tracker );
+ } else {
+ throw_runtime_exception("Error: cannot attach a texture object to an untracked pointer!");
+ }
+ }
+ }
+ #endif
}
-
KOKKOS_INLINE_FUNCTION
operator const ValueType * () const { return m_alloc_ptr + m_offset ; }
template< typename iType >
KOKKOS_INLINE_FUNCTION
ValueType operator[]( const iType & i ) const
{
-#if defined( __CUDA_ARCH__ ) && ( 300 <= __CUDA_ARCH__ )
-#if defined( KOKKOS_USE_LDG_INTRINSIC )
- // Enable the usage of the _ldg intrinsic even in cases where texture fetches work
- // Currently texture fetches are faster, but that might change in the future
- return _ldg( & m_alloc_ptr[i+m_offset] );
-#else /* ! defined( KOKKOS_USE_LDG_INTRINSIC ) */
- AliasType v = tex1Dfetch<AliasType>( m_obj , i + m_offset );
-
- return *(reinterpret_cast<ValueType*> (&v));
-#endif /* ! defined( KOKKOS_USE_LDG_INTRINSIC ) */
-#else /* ! defined( __CUDA_ARCH__ ) && ( 300 <= __CUDA_ARCH__ ) */
- return m_alloc_ptr[ i + m_offset ];
-#endif
+ #if defined( KOKKOS_USE_LDG_INTRINSIC ) && defined( __CUDA_ARCH__ ) && ( 300 <= __CUDA_ARCH__ )
+ AliasType v = __ldg(reinterpret_cast<const AliasType*>(&m_alloc_ptr[i]));
+ return *(reinterpret_cast<ValueType*> (&v));
+ #elif defined( __CUDA_ARCH__ ) && ( 300 <= __CUDA_ARCH__ )
+ AliasType v = tex1Dfetch<AliasType>( m_obj , i + m_offset );
+ return *(reinterpret_cast<ValueType*> (&v));
+ #else
+ return m_alloc_ptr[ i + m_offset ];
+ #endif
}
};
-template< typename ValueType >
-class CudaTextureFetch< const ValueType, void >
+template< typename ValueType, class MemorySpace >
+class CudaTextureFetch< const ValueType, MemorySpace, void >
{
private:
const ValueType * m_ptr ;
public:
KOKKOS_INLINE_FUNCTION
CudaTextureFetch() : m_ptr(0) {};
KOKKOS_INLINE_FUNCTION
~CudaTextureFetch() {
}
+ KOKKOS_INLINE_FUNCTION
+ CudaTextureFetch( const ValueType * ptr, const AllocationTracker & ) : m_ptr(ptr) {}
+
KOKKOS_INLINE_FUNCTION
CudaTextureFetch( const CudaTextureFetch & rhs ) : m_ptr(rhs.m_ptr) {}
KOKKOS_INLINE_FUNCTION
CudaTextureFetch & operator = ( const CudaTextureFetch & rhs ) {
m_ptr = rhs.m_ptr;
return *this ;
}
explicit KOKKOS_INLINE_FUNCTION
- CudaTextureFetch( ValueType * const base_view_ptr ) {
+ CudaTextureFetch( ValueType * const base_view_ptr, AllocationTracker const & /*tracker*/ ) {
m_ptr = base_view_ptr;
}
KOKKOS_INLINE_FUNCTION
CudaTextureFetch & operator = (const ValueType* base_view_ptr) {
m_ptr = base_view_ptr;
return *this;
}
KOKKOS_INLINE_FUNCTION
operator const ValueType * () const { return m_ptr ; }
template< typename iType >
KOKKOS_INLINE_FUNCTION
ValueType operator[]( const iType & i ) const
{
- #if defined( __CUDA_ARCH__ ) && ( 300 <= __CUDA_ARCH__ )
- return _ldg(&m_ptr[i]);
- #else
return m_ptr[ i ];
- #endif
}
};
} // namespace Impl
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
/** \brief Replace Default ViewDataHandle with Cuda texture fetch specialization
* if 'const' value type, CudaSpace and random access.
*/
template< class ViewTraits >
class ViewDataHandle< ViewTraits ,
typename enable_if< ( is_same< typename ViewTraits::memory_space,CudaSpace>::value ||
is_same< typename ViewTraits::memory_space,CudaUVMSpace>::value )
&&
is_same<typename ViewTraits::const_value_type,typename ViewTraits::value_type>::value
&&
ViewTraits::memory_traits::RandomAccess
>::type >
{
public:
enum { ReturnTypeIsReference = false };
typedef Impl::CudaTextureFetch< typename ViewTraits::value_type
- , typename ViewTraits::memory_space > handle_type;
+ , typename ViewTraits::memory_space> handle_type;
+
+ KOKKOS_INLINE_FUNCTION
+ static handle_type create_handle( typename ViewTraits::value_type * arg_data_ptr, AllocationTracker const & arg_tracker )
+ {
+ return handle_type(arg_data_ptr, arg_tracker);
+ }
typedef typename ViewTraits::value_type return_type;
};
}
}
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
+#endif // KOKKOS_HAVE_CUDA
#endif /* #ifndef KOKKOS_CUDA_VIEW_HPP */
diff --git a/lib/kokkos/core/src/Cuda/Kokkos_Cuda_abort.hpp b/lib/kokkos/core/src/Cuda/Kokkos_Cuda_abort.hpp
index bd85259ce..deb955ccd 100755
--- a/lib/kokkos/core/src/Cuda/Kokkos_Cuda_abort.hpp
+++ b/lib/kokkos/core/src/Cuda/Kokkos_Cuda_abort.hpp
@@ -1,117 +1,119 @@
/*
//@HEADER
// ************************************************************************
//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_CUDA_ABORT_HPP
#define KOKKOS_CUDA_ABORT_HPP
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
-
-#if defined( __CUDACC__ ) && defined( __CUDA_ARCH__ )
+#include "Kokkos_Macros.hpp"
+#if defined( __CUDACC__ ) && defined( __CUDA_ARCH__ ) && defined( KOKKOS_HAVE_CUDA )
#include <cuda.h>
#if ! defined( CUDA_VERSION ) || ( CUDA_VERSION < 4010 )
#error "Cuda version 4.1 or greater required"
#endif
#if ( __CUDA_ARCH__ < 200 )
#error "Cuda device capability 2.0 or greater required"
#endif
extern "C" {
/* Cuda runtime function, declared in <crt/device_runtime.h>
* Requires capability 2.x or better.
*/
extern __device__ void __assertfail(
const void *message,
const void *file,
unsigned int line,
const void *function,
size_t charsize);
}
namespace Kokkos {
namespace Impl {
__device__ inline
void cuda_abort( const char * const message )
{
+#ifndef __APPLE__
const char empty[] = "" ;
__assertfail( (const void *) message ,
(const void *) empty ,
(unsigned int) 0 ,
(const void *) empty ,
sizeof(char) );
+#endif
}
} // namespace Impl
} // namespace Kokkos
#else
namespace Kokkos {
namespace Impl {
KOKKOS_INLINE_FUNCTION
void cuda_abort( const char * const ) {}
}
}
#endif /* #if defined( __CUDACC__ ) && defined( __CUDA_ARCH__ ) */
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_CUDA )
namespace Kokkos {
__device__ inline
void abort( const char * const message ) { Kokkos::Impl::cuda_abort(message); }
}
#endif /* defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_CUDA ) */
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
#endif /* #ifndef KOKKOS_CUDA_ABORT_HPP */
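For illustration, a minimal sketch of calling this hook from device code (the checking function below is hypothetical and not part of the patch; it assumes a CUDA-enabled build in which KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_CUDA is defined):

    #include <Kokkos_Core.hpp>

    KOKKOS_INLINE_FUNCTION
    void check_positive( const double x )
    {
      // In device code Kokkos::abort(message) forwards to Impl::cuda_abort(message),
      // which raises a device-side assertion failure through __assertfail.
      if ( x <= 0 ) { Kokkos::abort( "check_positive: argument must be > 0" ); }
    }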
diff --git a/lib/kokkos/core/src/KokkosExp_View.hpp b/lib/kokkos/core/src/KokkosExp_View.hpp
new file mode 100755
index 000000000..a2226f3de
--- /dev/null
+++ b/lib/kokkos/core/src/KokkosExp_View.hpp
@@ -0,0 +1,1945 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#ifndef KOKKOS_EXPERIMENTAL_VIEW_HPP
+#define KOKKOS_EXPERIMENTAL_VIEW_HPP
+
+#include <string>
+#include <type_traits>
+#include <initializer_list>
+
+#include <Kokkos_Core_fwd.hpp>
+#include <Kokkos_HostSpace.hpp>
+#include <Kokkos_MemoryTraits.hpp>
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+namespace Kokkos {
+namespace Experimental {
+namespace Impl {
+
+template< class > struct ViewDataAnalysis ;
+
+template< class , class = void , typename Enable = void >
+class ViewMapping { enum { is_assignable = false }; };
+
+template< class DstMemorySpace , class SrcMemorySpace >
+struct DeepCopy ;
+
+} /* namespace Impl */
+} /* namespace Experimental */
+} /* namespace Kokkos */
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+namespace Kokkos {
+namespace Experimental {
+
+/** \class ViewTraits
+ * \brief Traits class for accessing attributes of a View.
+ *
+ * This is an implementation detail of View. It is only of interest
+ * to developers implementing a new specialization of View.
+ *
+ * Template argument permutations:
+ * - View< DataType , void , void , void >
+ * - View< DataType , Space , void , void >
+ * - View< DataType , Space , MemoryTraits , void >
+ * - View< DataType , Space , void , MemoryTraits >
+ * - View< DataType , ArrayLayout , void , void >
+ * - View< DataType , ArrayLayout , Space , void >
+ * - View< DataType , ArrayLayout , MemoryTraits , void >
+ * - View< DataType , ArrayLayout , Space , MemoryTraits >
+ * - View< DataType , MemoryTraits , void , void >
+ */
+
+template< class DataType ,
+ class Arg1 = void ,
+ class Arg2 = void ,
+ class Arg3 = void >
+class ViewTraits {
+private:
+
+ // Layout, Space, and MemoryTraits are optional
+ // but need to appear in that order. That means Layout
+ // can only be Arg1, Space can be Arg1 or Arg2, and
+ // MemoryTraits can be Arg1, Arg2 or Arg3
+
+ enum { Arg1IsLayout = Kokkos::Impl::is_array_layout<Arg1>::value };
+
+ enum { Arg1IsSpace = Kokkos::Impl::is_space<Arg1>::value };
+ enum { Arg2IsSpace = Kokkos::Impl::is_space<Arg2>::value };
+
+ enum { Arg1IsMemoryTraits = Kokkos::Impl::is_memory_traits<Arg1>::value };
+ enum { Arg2IsMemoryTraits = Kokkos::Impl::is_memory_traits<Arg2>::value };
+ enum { Arg3IsMemoryTraits = Kokkos::Impl::is_memory_traits<Arg3>::value };
+
+ enum { Arg1IsVoid = std::is_same< Arg1 , void >::value };
+ enum { Arg2IsVoid = std::is_same< Arg2 , void >::value };
+ enum { Arg3IsVoid = std::is_same< Arg3 , void >::value };
+
+ static_assert( 1 == Arg1IsLayout + Arg1IsSpace + Arg1IsMemoryTraits + Arg1IsVoid
+ , "Template argument #1 must be layout, space, traits, or void" );
+
+ // If Arg1 is Layout then Arg2 is Space, MemoryTraits, or void
+ // If Arg1 is Space then Arg2 is MemoryTraits or void
+ // If Arg1 is MemoryTraits then Arg2 is void
+ // If Arg1 is Void then Arg2 is void
+
+ static_assert( ( Arg1IsLayout && ( 1 == Arg2IsSpace + Arg2IsMemoryTraits + Arg2IsVoid ) ) ||
+ ( Arg1IsSpace && ( 0 == Arg2IsSpace ) && ( 1 == Arg2IsMemoryTraits + Arg2IsVoid ) ) ||
+ ( Arg1IsMemoryTraits && Arg2IsVoid ) ||
+ ( Arg1IsVoid && Arg2IsVoid )
+ , "Template argument #2 must be space, traits, or void" );
+
+ // Arg3 is MemoryTraits or void and at most one argument is MemoryTraits
+ static_assert( ( 1 == Arg3IsMemoryTraits + Arg3IsVoid ) &&
+ ( Arg1IsMemoryTraits + Arg2IsMemoryTraits + Arg3IsMemoryTraits <= 1 )
+ , "Template argument #3 must be traits or void" );
+
+ typedef
+ typename std::conditional< Arg1IsSpace , Arg1 ,
+ typename std::conditional< Arg2IsSpace , Arg2 , Kokkos::DefaultExecutionSpace
+ >::type >::type::execution_space
+ ExecutionSpace ;
+
+ typedef
+ typename std::conditional< Arg1IsSpace , Arg1 ,
+ typename std::conditional< Arg2IsSpace , Arg2 , Kokkos::DefaultExecutionSpace
+ >::type >::type::memory_space
+ MemorySpace ;
+
+ typedef
+ typename Kokkos::Impl::is_space<
+ typename std::conditional< Arg1IsSpace , Arg1 ,
+ typename std::conditional< Arg2IsSpace , Arg2 , Kokkos::DefaultExecutionSpace
+ >::type >::type >::host_mirror_space
+ HostMirrorSpace ;
+
+ typedef
+ typename std::conditional< Arg1IsLayout , Arg1 , typename ExecutionSpace::array_layout >::type
+ ArrayLayout ;
+
+ // Arg1, Arg2, or Arg3 may be memory traits
+ typedef
+ typename std::conditional< Arg1IsMemoryTraits , Arg1 ,
+ typename std::conditional< Arg2IsMemoryTraits , Arg2 ,
+ typename std::conditional< Arg3IsMemoryTraits , Arg3 , MemoryManaged
+ >::type >::type >::type
+ MemoryTraits ;
+
+ typedef Kokkos::Experimental::Impl::ViewDataAnalysis< DataType > analysis ;
+
+public:
+
+ //------------------------------------
+ // Data type traits:
+
+ typedef typename analysis::type data_type ;
+ typedef typename analysis::const_type const_data_type ;
+ typedef typename analysis::non_const_type non_const_data_type ;
+
+ //------------------------------------
+ // Compatible array of trivial type traits:
+
+ typedef typename analysis::array_scalar_type array_scalar_type ;
+ typedef typename analysis::const_array_scalar_type const_array_scalar_type ;
+ typedef typename analysis::non_const_array_scalar_type non_const_array_scalar_type ;
+
+ //------------------------------------
+ // Value type traits:
+
+ typedef typename analysis::value_type value_type ;
+ typedef typename analysis::const_value_type const_value_type ;
+ typedef typename analysis::non_const_value_type non_const_value_type ;
+
+ //------------------------------------
+ // Mapping traits:
+
+ typedef ArrayLayout array_layout ;
+ typedef typename analysis::dimension dimension ;
+ typedef typename analysis::specialize specialize /* mapping specialization tag */ ;
+
+ enum { rank = dimension::rank };
+ enum { rank_dynamic = dimension::rank_dynamic };
+
+ //------------------------------------
+ // Execution space, memory space, memory access traits, and host mirror space.
+
+ typedef ExecutionSpace execution_space ;
+ typedef MemorySpace memory_space ;
+ typedef Device<ExecutionSpace,MemorySpace> device_type ;
+ typedef MemoryTraits memory_traits ;
+ typedef HostMirrorSpace host_mirror_space ;
+
+ typedef typename memory_space::size_type size_type ;
+
+ enum { is_hostspace = std::is_same< memory_space , HostSpace >::value };
+ enum { is_managed = memory_traits::Unmanaged == 0 };
+ enum { is_random_access = memory_traits::RandomAccess == 1 };
+
+ //------------------------------------
+};
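As a reading aid, a minimal sketch of argument orderings the static_asserts above accept; it assumes a CUDA-enabled build, and Kokkos::Cuda, Kokkos::LayoutLeft, and Kokkos::MemoryTraits<Kokkos::RandomAccess> merely stand in for any space, layout, and memory-traits type (this snippet is illustrative, not part of the patch):

    // Layout, Space, and MemoryTraits are each optional, but must appear in that order.
    typedef Kokkos::Experimental::ViewTraits< double* >                 t_default ; // all defaults
    typedef Kokkos::Experimental::ViewTraits< double* , Kokkos::Cuda >  t_space ;   // Space as Arg1
    typedef Kokkos::Experimental::ViewTraits< double*
                                            , Kokkos::LayoutLeft
                                            , Kokkos::Cuda
                                            , Kokkos::MemoryTraits< Kokkos::RandomAccess >
                                            >                           t_full ;    // Layout, Space, MemoryTraits
    static_assert( t_space::rank == 1 , "double* carries one runtime dimension" );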
+
+/** \class View
+ * \brief View to an array of data.
+ *
+ * A View represents an array of one or more dimensions.
+ * For details, please refer to Kokkos' tutorial materials.
+ *
+ * \section Kokkos_View_TemplateParameters Template parameters
+ *
+ * This class has both required and optional template parameters. The
+ * \c DataType parameter must always be provided, and must always be
+ * first. The parameters \c Arg1, \c Arg2, and \c Arg3 are placeholders
+ * for the optional Layout, Space, and MemoryTraits arguments. When
+ * explaining the template parameters, we won't refer to \c Arg1,
+ * \c Arg2, and \c Arg3; instead, we will refer to the valid categories
+ * of template parameters, in whatever order they may occur.
+ *
+ * Valid ways in which template arguments may be specified:
+ * - View< DataType , Space >
+ * - View< DataType , Space , MemoryTraits >
+ * - View< DataType , Space , void , MemoryTraits >
+ * - View< DataType , Layout , Space >
+ * - View< DataType , Layout , Space , MemoryTraits >
+ *
+ * \tparam DataType (required) This indicates both the type of each
+ * entry of the array, and the combination of compile-time and
+ * run-time array dimension(s). For example, <tt>double*</tt>
+ * indicates a one-dimensional array of \c double with run-time
+ * dimension, and <tt>int*[3]</tt> a two-dimensional array of \c int
+ * with run-time first dimension and compile-time second dimension
+ * (of 3). In general, the run-time dimensions (if any) must go
+ * first, followed by zero or more compile-time dimensions. For
+ * more examples, please refer to the tutorial materials.
+ *
+ * \tparam Space (required) The memory space.
+ *
+ * \tparam Layout (optional) The array's layout in memory. For
+ * example, LayoutLeft indicates a column-major (Fortran style)
+ * layout, and LayoutRight a row-major (C style) layout. If not
+ * specified, this defaults to the preferred layout for the
+ * <tt>Space</tt>.
+ *
+ * \tparam MemoryTraits (optional) Assertion of the user's intended
+ * access behavior. For example, RandomAccess indicates read-only
+ * access with limited spatial locality, and Unmanaged lets users
+ * wrap externally allocated memory in a View without automatic
+ * deallocation.
+ *
+ * \section Kokkos_View_MT MemoryTraits discussion
+ *
+ * \subsection Kokkos_View_MT_Interp MemoryTraits interpretation depends on Space
+ *
+ * Some \c MemoryTraits options may have different interpretations for
+ * different \c Space types. For example, with the Cuda device,
+ * \c RandomAccess tells Kokkos to fetch the data through the texture
+ * cache, whereas the non-GPU devices have no such hardware construct.
+ *
+ * \subsection Kokkos_View_MT_PrefUse Preferred use of MemoryTraits
+ *
+ * Users should defer applying the optional \c MemoryTraits parameter
+ * until the point at which they actually plan to rely on it in a
+ * computational kernel. This minimizes the number of template
+ * parameters exposed in their code, which reduces the cost of
+ * compilation. Users may always assign a View without specified
+ * \c MemoryTraits to a compatible View with that specification.
+ * For example:
+ * \code
+ * // Pass in the simplest types of View possible.
+ * void
+ * doSomething (View<double*, Cuda> out,
+ * View<const double*, Cuda> in)
+ * {
+ * // Assign the "generic" View in to a RandomAccess View in_rr.
+ * // Note that RandomAccess View objects must have const data.
+ * View<const double*, Cuda, RandomAccess> in_rr = in;
+ * // ... do something with in_rr and out ...
+ * }
+ * \endcode
+ */
+template< class DataType
+ , class Arg1 = void /* ArrayLayout, SpaceType, or MemoryTraits */
+ , class Arg2 = void /* SpaceType or MemoryTraits */
+ , class Arg3 = void /* MemoryTraits */ >
+class View ;
+
+} /* namespace Experimental */
+} /* namespace Kokkos */
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+#include <impl/KokkosExp_ViewMapping.hpp>
+#include <impl/KokkosExp_ViewAllocProp.hpp>
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+namespace Kokkos {
+namespace Experimental {
+
+namespace {
+
+constexpr Kokkos::Experimental::Impl::ALL_t
+ ALL = Kokkos::Experimental::Impl::ALL_t();
+
+constexpr Kokkos::Experimental::Impl::WithoutInitializing_t
+ WithoutInitializing = Kokkos::Experimental::Impl::WithoutInitializing_t();
+
+constexpr Kokkos::Experimental::Impl::AllowPadding_t
+ AllowPadding = Kokkos::Experimental::Impl::AllowPadding_t();
+
+}
+
+/** \brief Create View allocation parameter bundle from argument list.
+ *
+ * Valid argument list members are:
+ * 1) label as a "string" or std::string
+ * 2) memory space instance of the View::memory_space type
+ * 3) execution space instance compatible with the View::memory_space
+ * 4) Kokkos::WithoutInitializing to bypass initialization
+ * 5) Kokkos::AllowPadding to allow allocation to pad dimensions for memory alignment
+ */
+template< class ... Args >
+inline
+Kokkos::Experimental::Impl::ViewAllocProp< Args ... >
+view_alloc( Args ... args )
+{
+ return Kokkos::Experimental::Impl::ViewAllocProp< Args ... >( args ... );
+}
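A minimal usage sketch (the label, extent N, and element type are hypothetical): the returned property bundle is what the allocating View constructors defined further below consume.

    using namespace Kokkos::Experimental ;

    const size_t N = 1000 ;   // hypothetical run-time extent

    // Bundle a label, skip element initialization, and allow padded dimensions.
    View< double*[3] , Kokkos::DefaultExecutionSpace >
      x( view_alloc( "node_coordinates" , WithoutInitializing , AllowPadding ) , N );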
+
+} /* namespace Experimental */
+} /* namespace Kokkos */
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+namespace Kokkos {
+namespace Experimental {
+
+/**\brief Each R? template argument designates whether the subview argument is a range */
+template< class V
+ , bool R0 = false , bool R1 = false , bool R2 = false , bool R3 = false
+ , bool R4 = false , bool R5 = false , bool R6 = false , bool R7 = false >
+using Subview = typename Kokkos::Experimental::Impl::SubviewType< V, R0 , R1 , R2 , R3 , R4 , R5 , R6 , R7 >::type ;
+
+template< class DataType , class Arg1 , class Arg2 , class Arg3 >
+class View : public ViewTraits< DataType , Arg1 , Arg2 , Arg3 > {
+private:
+
+ template< class , class , class , class > friend class View ;
+
+ typedef ViewTraits< DataType , Arg1 , Arg2 , Arg3 > traits ;
+ typedef Kokkos::Experimental::Impl::ViewMapping< traits > map_type ;
+ typedef Kokkos::Experimental::Impl::SharedAllocationTracker track_type ;
+
+ track_type m_track ;
+ map_type m_map ;
+
+public:
+
+ //----------------------------------------
+ /** \brief Compatible view of array of scalar types */
+ typedef View< typename traits::array_scalar_type ,
+ typename traits::array_layout ,
+ typename traits::device_type ,
+ typename traits::memory_traits >
+ array_type ;
+
+ /** \brief Compatible view of const data type */
+ typedef View< typename traits::const_data_type ,
+ typename traits::array_layout ,
+ typename traits::device_type ,
+ typename traits::memory_traits >
+ const_type ;
+
+ /** \brief Compatible view of non-const data type */
+ typedef View< typename traits::non_const_data_type ,
+ typename traits::array_layout ,
+ typename traits::device_type ,
+ typename traits::memory_traits >
+ non_const_type ;
+
+ /** \brief Compatible HostMirror view */
+ typedef View< typename traits::non_const_data_type ,
+ typename traits::array_layout ,
+ typename traits::host_mirror_space ,
+ void >
+ HostMirror ;
+
+ //----------------------------------------
+ // Domain dimensions
+
+ enum { Rank = map_type::Rank };
+
+ KOKKOS_INLINE_FUNCTION constexpr size_t dimension_0() const { return m_map.dimension_0(); }
+ KOKKOS_INLINE_FUNCTION constexpr size_t dimension_1() const { return m_map.dimension_1(); }
+ KOKKOS_INLINE_FUNCTION constexpr size_t dimension_2() const { return m_map.dimension_2(); }
+ KOKKOS_INLINE_FUNCTION constexpr size_t dimension_3() const { return m_map.dimension_3(); }
+ KOKKOS_INLINE_FUNCTION constexpr size_t dimension_4() const { return m_map.dimension_4(); }
+ KOKKOS_INLINE_FUNCTION constexpr size_t dimension_5() const { return m_map.dimension_5(); }
+ KOKKOS_INLINE_FUNCTION constexpr size_t dimension_6() const { return m_map.dimension_6(); }
+ KOKKOS_INLINE_FUNCTION constexpr size_t dimension_7() const { return m_map.dimension_7(); }
+
+ KOKKOS_INLINE_FUNCTION constexpr size_t stride_0() const { return m_map.stride_0(); }
+ KOKKOS_INLINE_FUNCTION constexpr size_t stride_1() const { return m_map.stride_1(); }
+ KOKKOS_INLINE_FUNCTION constexpr size_t stride_2() const { return m_map.stride_2(); }
+ KOKKOS_INLINE_FUNCTION constexpr size_t stride_3() const { return m_map.stride_3(); }
+ KOKKOS_INLINE_FUNCTION constexpr size_t stride_4() const { return m_map.stride_4(); }
+ KOKKOS_INLINE_FUNCTION constexpr size_t stride_5() const { return m_map.stride_5(); }
+ KOKKOS_INLINE_FUNCTION constexpr size_t stride_6() const { return m_map.stride_6(); }
+ KOKKOS_INLINE_FUNCTION constexpr size_t stride_7() const { return m_map.stride_7(); }
+
+ //----------------------------------------
+ // Range span
+
+ typedef typename map_type::reference_type reference_type ;
+
+ enum { reference_type_is_lvalue_reference = std::is_lvalue_reference< reference_type >::value };
+
+ KOKKOS_INLINE_FUNCTION constexpr size_t span() const { return m_map.span(); }
+ KOKKOS_INLINE_FUNCTION constexpr bool span_is_contiguous() const { return m_map.span_is_contiguous(); }
+ KOKKOS_INLINE_FUNCTION constexpr typename traits::value_type * data() const { return m_map.data(); }
+
+ // Deprecated, use 'span_is_contiguous()' instead
+ KOKKOS_INLINE_FUNCTION constexpr bool is_contiguous() const { return m_map.span_is_contiguous(); }
+ // Deprecated, use 'data()' instead
+ KOKKOS_INLINE_FUNCTION constexpr typename traits::value_type * ptr_on_device() const { return m_map.data(); }
+
+ //----------------------------------------
+
+private:
+
+ typedef typename
+ std::conditional< Rank == 0 , reference_type
+ , Kokkos::Experimental::Impl::Error_view_scalar_reference_to_non_scalar_view >::type
+ scalar_operator_reference_type ;
+
+ typedef typename
+ std::conditional< Rank == 0 , const int
+ , Kokkos::Experimental::Impl::Error_view_scalar_reference_to_non_scalar_view >::type
+ scalar_operator_index_type ;
+
+public:
+
+ // Rank == 0
+
+ KOKKOS_FORCEINLINE_FUNCTION
+ scalar_operator_reference_type operator()() const
+ {
+ KOKKOS_ASSERT_VIEW_MAPPING_ACCESS( typename traits::memory_space, m_map, Rank, 0, 0, 0, 0, 0, 0, 0, 0 );
+ return scalar_operator_reference_type( m_map.reference() );
+ }
+
+ KOKKOS_FORCEINLINE_FUNCTION
+ reference_type
+ operator()( scalar_operator_index_type i0
+ , const int i1 = 0 , const int i2 = 0 , const int i3 = 0
+ , const int i4 = 0 , const int i5 = 0 , const int i6 = 0 , const int i7 = 0 ) const
+ {
+ KOKKOS_ASSERT_VIEW_MAPPING_ACCESS( typename traits::memory_space, m_map, Rank, i0, i1, i2, i3, i4, i5, i6, i7 );
+ return m_map.reference();
+ }
+
+ // Rank == 1
+
+ template< typename I0 >
+ KOKKOS_FORCEINLINE_FUNCTION
+ typename std::enable_if<( Rank == 1 && std::is_integral<I0>::value
+ ), reference_type >::type
+ operator[]( const I0 & i0 ) const
+ {
+ KOKKOS_ASSERT_VIEW_MAPPING_ACCESS( typename traits::memory_space, m_map, Rank, i0, 0, 0, 0, 0, 0, 0, 0 );
+ return m_map.reference(i0);
+ }
+
+ template< typename I0 >
+ KOKKOS_FORCEINLINE_FUNCTION
+ typename std::enable_if<( Rank == 1 && std::is_integral<I0>::value
+ ), reference_type >::type
+ operator()( const I0 & i0 ) const
+ {
+ KOKKOS_ASSERT_VIEW_MAPPING_ACCESS( typename traits::memory_space, m_map, Rank, i0, 0, 0, 0, 0, 0, 0, 0 );
+ return m_map.reference(i0);
+ }
+
+ template< typename I0 >
+ KOKKOS_FORCEINLINE_FUNCTION
+ reference_type
+ operator()( const I0 & i0
+ , typename std::enable_if<( Rank == 1 && std::is_integral<I0>::value ), const int >::type i1
+ , const int i2 = 0 , const int i3 = 0
+ , const int i4 = 0 , const int i5 = 0 , const int i6 = 0 , const int i7 = 0 ) const
+ {
+ KOKKOS_ASSERT_VIEW_MAPPING_ACCESS( typename traits::memory_space, m_map, Rank, i0, i1, i2, i3, i4, i5, i6, i7 );
+ return m_map.reference(i0);
+ }
+
+ // Rank == 2
+
+ template< typename I0 , typename I1 >
+ KOKKOS_FORCEINLINE_FUNCTION
+ typename std::enable_if<( Rank == 2 &&
+ std::is_integral<I0>::value &&
+ std::is_integral<I1>::value
+ ), reference_type >::type
+ operator()( const I0 & i0 , const I1 & i1 ) const
+ {
+ KOKKOS_ASSERT_VIEW_MAPPING_ACCESS( typename traits::memory_space, m_map, Rank, i0, i1, 0, 0, 0, 0, 0, 0 );
+ return m_map.reference(i0,i1);
+ }
+
+ template< typename I0 , typename I1 >
+ KOKKOS_FORCEINLINE_FUNCTION
+ reference_type
+ operator()( const I0 & i0 , const I1 & i1
+ , typename std::enable_if<( Rank == 2 &&
+ std::is_integral<I0>::value &&
+ std::is_integral<I1>::value
+ ), const int >::type i2
+ , const int i3 = 0
+ , const int i4 = 0 , const int i5 = 0 , const int i6 = 0 , const int i7 = 0 ) const
+ {
+ KOKKOS_ASSERT_VIEW_MAPPING_ACCESS( typename traits::memory_space, m_map, Rank, i0, i1, i2, i3, i4, i5, i6, i7 );
+ return m_map.reference(i0,i1);
+ }
+
+ // Rank == 3
+
+ template< typename I0 , typename I1 , typename I2 >
+ KOKKOS_FORCEINLINE_FUNCTION
+ typename std::enable_if<( Rank == 3 &&
+ std::is_integral<I0>::value &&
+ std::is_integral<I1>::value &&
+ std::is_integral<I2>::value
+ ), reference_type >::type
+ operator()( const I0 & i0 , const I1 & i1 , const I2 & i2 ) const
+ {
+ KOKKOS_ASSERT_VIEW_MAPPING_ACCESS( typename traits::memory_space, m_map, Rank, i0, i1, i2, 0, 0, 0, 0, 0 );
+ return m_map.reference(i0,i1,i2);
+ }
+
+ template< typename I0 , typename I1 , typename I2 >
+ KOKKOS_FORCEINLINE_FUNCTION
+ reference_type
+ operator()( const I0 & i0 , const I1 & i1 , const I2 & i2
+ , typename std::enable_if<( Rank == 3 &&
+ std::is_integral<I0>::value &&
+ std::is_integral<I1>::value &&
+ std::is_integral<I2>::value
+ ), const int >::type i3
+ , const int i4 = 0 , const int i5 = 0 , const int i6 = 0 , const int i7 = 0 ) const
+ {
+ KOKKOS_ASSERT_VIEW_MAPPING_ACCESS( typename traits::memory_space, m_map, Rank, i0, i1, i2, i3, i4, i5, i6, i7 );
+ return m_map.reference(i0,i1,i2);
+ }
+
+ // Rank == 4
+
+ template< typename I0 , typename I1 , typename I2 , typename I3 >
+ KOKKOS_FORCEINLINE_FUNCTION
+ typename std::enable_if<( Rank == 4 &&
+ std::is_integral<I0>::value &&
+ std::is_integral<I1>::value &&
+ std::is_integral<I2>::value &&
+ std::is_integral<I3>::value
+ ), reference_type >::type
+ operator()( const I0 & i0 , const I1 & i1 , const I2 & i2 , const I3 & i3 ) const
+ {
+ KOKKOS_ASSERT_VIEW_MAPPING_ACCESS( typename traits::memory_space, m_map, Rank, i0, i1, i2, i3, 0, 0, 0, 0 );
+ return m_map.reference(i0,i1,i2,i3);
+ }
+
+ template< typename I0 , typename I1 , typename I2 , typename I3 >
+ KOKKOS_FORCEINLINE_FUNCTION
+ reference_type
+ operator()( const I0 & i0 , const I1 & i1 , const I2 & i2 , const I3 & i3
+ , typename std::enable_if<( Rank == 4 &&
+ std::is_integral<I0>::value &&
+ std::is_integral<I1>::value &&
+ std::is_integral<I2>::value &&
+ std::is_integral<I3>::value
+ ), const int >::type i4
+ , const int i5 = 0 , const int i6 = 0 , const int i7 = 0 ) const
+ {
+ KOKKOS_ASSERT_VIEW_MAPPING_ACCESS( typename traits::memory_space, m_map, Rank, i0, i1, i2, i3, i4, i5, i6, i7 );
+ return m_map.reference(i0,i1,i2,i3);
+ }
+
+ // Rank == 5
+
+ template< typename I0 , typename I1 , typename I2 , typename I3
+ , typename I4 >
+ KOKKOS_FORCEINLINE_FUNCTION
+ typename std::enable_if<( Rank == 5 &&
+ std::is_integral<I0>::value &&
+ std::is_integral<I1>::value &&
+ std::is_integral<I2>::value &&
+ std::is_integral<I3>::value &&
+ std::is_integral<I4>::value
+ ), reference_type >::type
+ operator()( const I0 & i0 , const I1 & i1 , const I2 & i2 , const I3 & i3
+ , const I4 & i4 ) const
+ {
+ KOKKOS_ASSERT_VIEW_MAPPING_ACCESS( typename traits::memory_space, m_map, Rank, i0, i1, i2, i3, i4, 0, 0, 0 );
+ return m_map.reference(i0,i1,i2,i3,i4);
+ }
+
+ template< typename I0 , typename I1 , typename I2 , typename I3
+ , typename I4 >
+ KOKKOS_FORCEINLINE_FUNCTION
+ reference_type
+ operator()( const I0 & i0 , const I1 & i1 , const I2 & i2 , const I3 & i3
+ , const I4 & i4
+ , typename std::enable_if<( Rank == 5 &&
+ std::is_integral<I0>::value &&
+ std::is_integral<I1>::value &&
+ std::is_integral<I2>::value &&
+ std::is_integral<I3>::value &&
+ std::is_integral<I4>::value
+ ), const int >::type i5
+ , const int i6 = 0 , const int i7 = 0 ) const
+ {
+ KOKKOS_ASSERT_VIEW_MAPPING_ACCESS( typename traits::memory_space, m_map, Rank, i0, i1, i2, i3, i4, i5, i6, i7 );
+ return m_map.reference(i0,i1,i2,i3,i4);
+ }
+
+ // Rank == 6
+
+ template< typename I0 , typename I1 , typename I2 , typename I3
+ , typename I4 , typename I5 >
+ KOKKOS_FORCEINLINE_FUNCTION
+ typename std::enable_if<( Rank == 6 &&
+ std::is_integral<I0>::value &&
+ std::is_integral<I1>::value &&
+ std::is_integral<I2>::value &&
+ std::is_integral<I3>::value &&
+ std::is_integral<I4>::value &&
+ std::is_integral<I5>::value
+ ), reference_type >::type
+ operator()( const I0 & i0 , const I1 & i1 , const I2 & i2 , const I3 & i3
+ , const I4 & i4 , const I5 & i5 ) const
+ {
+ KOKKOS_ASSERT_VIEW_MAPPING_ACCESS( typename traits::memory_space, m_map, Rank, i0, i1, i2, i3, i4, i5, 0, 0 );
+ return m_map.reference(i0,i1,i2,i3,i4,i5);
+ }
+
+ template< typename I0 , typename I1 , typename I2 , typename I3
+ , typename I4 , typename I5 >
+ KOKKOS_FORCEINLINE_FUNCTION
+ reference_type
+ operator()( const I0 & i0 , const I1 & i1 , const I2 & i2 , const I3 & i3
+ , const I4 & i4 , const I5 & i5
+ , typename std::enable_if<( Rank == 6 &&
+ std::is_integral<I0>::value &&
+ std::is_integral<I1>::value &&
+ std::is_integral<I2>::value &&
+ std::is_integral<I3>::value &&
+ std::is_integral<I4>::value &&
+ std::is_integral<I5>::value
+ ), const int >::type i6
+ , const int i7 = 0 ) const
+ {
+ KOKKOS_ASSERT_VIEW_MAPPING_ACCESS( typename traits::memory_space, m_map, Rank, i0, i1, i2, i3, i4, i5, i6, i7 );
+ return m_map.reference(i0,i1,i2,i3,i4,i5);
+ }
+
+ // Rank == 7
+
+ template< typename I0 , typename I1 , typename I2 , typename I3
+ , typename I4 , typename I5 , typename I6 >
+ KOKKOS_FORCEINLINE_FUNCTION
+ typename std::enable_if<( Rank == 7 &&
+ std::is_integral<I0>::value &&
+ std::is_integral<I1>::value &&
+ std::is_integral<I2>::value &&
+ std::is_integral<I3>::value &&
+ std::is_integral<I4>::value &&
+ std::is_integral<I5>::value &&
+ std::is_integral<I6>::value
+ ), reference_type >::type
+ operator()( const I0 & i0 , const I1 & i1 , const I2 & i2 , const I3 & i3
+ , const I4 & i4 , const I5 & i5 , const I6 & i6 ) const
+ {
+ KOKKOS_ASSERT_VIEW_MAPPING_ACCESS( typename traits::memory_space, m_map, Rank, i0, i1, i2, i3, i4, i5, i6, 0 );
+ return m_map.reference(i0,i1,i2,i3,i4,i5,i6);
+ }
+
+ template< typename I0 , typename I1 , typename I2 , typename I3
+ , typename I4 , typename I5 , typename I6 >
+ KOKKOS_FORCEINLINE_FUNCTION
+ reference_type
+ operator()( const I0 & i0 , const I1 & i1 , const I2 & i2 , const I3 & i3
+ , const I4 & i4 , const I5 & i5 , const I6 & i6
+ , typename std::enable_if<( Rank == 7 &&
+ std::is_integral<I0>::value &&
+ std::is_integral<I1>::value &&
+ std::is_integral<I2>::value &&
+ std::is_integral<I3>::value &&
+ std::is_integral<I4>::value &&
+ std::is_integral<I5>::value &&
+ std::is_integral<I6>::value
+ ), const int >::type i7
+ ) const
+ {
+ KOKKOS_ASSERT_VIEW_MAPPING_ACCESS( typename traits::memory_space, m_map, Rank, i0, i1, i2, i3, i4, i5, i6, i7 );
+ return m_map.reference(i0,i1,i2,i3,i4,i5,i6);
+ }
+
+ // Rank == 8
+
+ template< typename I0 , typename I1 , typename I2 , typename I3
+ , typename I4 , typename I5 , typename I6 , typename I7 >
+ KOKKOS_FORCEINLINE_FUNCTION
+ typename std::enable_if<( Rank == 8 &&
+ std::is_integral<I0>::value &&
+ std::is_integral<I1>::value &&
+ std::is_integral<I2>::value &&
+ std::is_integral<I3>::value &&
+ std::is_integral<I4>::value &&
+ std::is_integral<I5>::value &&
+ std::is_integral<I6>::value &&
+ std::is_integral<I7>::value
+ ), reference_type >::type
+ operator()( const I0 & i0 , const I1 & i1 , const I2 & i2 , const I3 & i3
+ , const I4 & i4 , const I5 & i5 , const I6 & i6 , const I7 & i7 ) const
+ {
+ KOKKOS_ASSERT_VIEW_MAPPING_ACCESS( typename traits::memory_space, m_map, Rank, i0, i1, i2, i3, i4, i5, i6, i7 );
+ return m_map.reference(i0,i1,i2,i3,i4,i5,i6,i7);
+ }
+
+ //----------------------------------------
+
+ KOKKOS_INLINE_FUNCTION
+ ~View() {}
+
+ KOKKOS_INLINE_FUNCTION
+ View() : m_track(), m_map() {}
+
+ KOKKOS_INLINE_FUNCTION
+ View( const View & rhs ) : m_track( rhs.m_track ), m_map( rhs.m_map ) {}
+
+ KOKKOS_INLINE_FUNCTION
+ View( View && rhs ) : m_track( rhs.m_track ), m_map( rhs.m_map ) {}
+
+ KOKKOS_INLINE_FUNCTION
+ View & operator = ( const View & rhs ) { m_track = rhs.m_track ; m_map = rhs.m_map ; return *this ; }
+
+ KOKKOS_INLINE_FUNCTION
+ View & operator = ( View && rhs ) { m_track = rhs.m_track ; m_map = rhs.m_map ; return *this ; }
+
+ //----------------------------------------
+
+ template< class RT , class R1 , class R2 , class R3 >
+ KOKKOS_INLINE_FUNCTION
+ View( const View<RT,R1,R2,R3> & rhs )
+ : m_track( rhs.m_track )
+ , m_map()
+ {
+ typedef typename View<RT,R1,R2,R3>::traits SrcTraits ;
+ typedef Kokkos::Experimental::Impl::ViewMapping< traits , SrcTraits > Mapping ;
+ static_assert( Mapping::is_assignable , "Incompatible View copy construction" );
+ Mapping::assign( m_map , rhs.m_map , rhs.m_track );
+ }
+
+ template< class RT , class R1 , class R2 , class R3 >
+ KOKKOS_INLINE_FUNCTION
+ View( View<RT,R1,R2,R3> && rhs )
+ : m_track( rhs.m_track )
+ , m_map()
+ {
+ typedef typename View<RT,R1,R2,R3>::traits SrcTraits ;
+ typedef Kokkos::Experimental::Impl::ViewMapping< traits , SrcTraits > Mapping ;
+ static_assert( Mapping::is_assignable , "Incompatible View move construction" );
+ Mapping::assign( m_map , rhs.m_map , rhs.m_track );
+ }
+
+ template< class RT , class R1 , class R2 , class R3 >
+ KOKKOS_INLINE_FUNCTION
+ View & operator = ( const View<RT,R1,R2,R3> & rhs )
+ {
+ typedef typename View<RT,R1,R2,R3>::traits SrcTraits ;
+ typedef Kokkos::Experimental::Impl::ViewMapping< traits , SrcTraits > Mapping ;
+ static_assert( Mapping::is_assignable , "Incompatible View copy assignment" );
+ Mapping::assign( m_map , rhs.m_map , rhs.m_track );
+ m_track.operator=( rhs.m_track );
+ return *this ;
+ }
+
+ template< class RT , class R1 , class R2 , class R3 >
+ KOKKOS_INLINE_FUNCTION
+ View & operator = ( View<RT,R1,R2,R3> && rhs )
+ {
+ typedef typename View<RT,R1,R2,R3>::traits SrcTraits ;
+ typedef Kokkos::Experimental::Impl::ViewMapping< traits , SrcTraits > Mapping ;
+ static_assert( Mapping::is_assignable , "Incompatible View move assignment" );
+ Mapping::assign( m_map , rhs.m_map , rhs.m_track );
+ m_track.operator=( rhs.m_track );
+ return *this ;
+ }
+
+ //----------------------------------------
+ // Allocation according to allocation properties
+
+private:
+
+ // Must call destructor for non-trivial types
+ template< class ExecSpace >
+ struct DestroyFunctor {
+ map_type m_map ;
+ ExecSpace m_space ;
+
+ KOKKOS_INLINE_FUNCTION
+ void destroy_shared_allocation() { m_map.destroy( m_space ); }
+ };
+
+public:
+
+ inline
+ const std::string label() const { return m_track.template get_label< typename traits::memory_space >(); }
+
+ template< class Prop >
+ explicit inline
+ View( const Prop & arg_prop
+ , const size_t arg_N0 = 0
+ , const size_t arg_N1 = 0
+ , const size_t arg_N2 = 0
+ , const size_t arg_N3 = 0
+ , const size_t arg_N4 = 0
+ , const size_t arg_N5 = 0
+ , const size_t arg_N6 = 0
+ , const size_t arg_N7 = 0
+ )
+ : m_track()
+ , m_map()
+ {
+ // Merge the < execution_space , memory_space > into the properties.
+ typedef Kokkos::Experimental::Impl::ViewAllocProp< typename traits::device_type , Prop > alloc_prop ;
+
+ typedef typename alloc_prop::execution_space execution_space ;
+ typedef typename traits::memory_space memory_space ;
+ typedef DestroyFunctor< execution_space > destroy_functor ;
+ typedef Kokkos::Experimental::Impl::SharedAllocationRecord< memory_space , destroy_functor > record_type ;
+
+ static_assert( traits::is_managed , "View allocation constructor requires managed memory" );
+
+ const alloc_prop prop( arg_prop );
+
+ // Query the mapping for byte-size of allocation.
+ const size_t alloc_size = map_type::memory_span( prop.allow_padding
+ , arg_N0 , arg_N1 , arg_N2 , arg_N3
+ , arg_N4 , arg_N5 , arg_N6 , arg_N7 );
+
+ // Allocate memory from the memory space.
+ record_type * const record = record_type::allocate( prop.memory , prop.label , alloc_size );
+
+ // Construct the mapping object prior to start of tracking
+ // to assign destroy functor and possibly initialize.
+ m_map = map_type( record->data()
+ , prop.allow_padding
+ , arg_N0 , arg_N1 , arg_N2 , arg_N3
+ , arg_N4 , arg_N5 , arg_N6 , arg_N7 );
+
+ // Copy the destroy functor into the allocation record before initiating tracking.
+ record->m_destroy.m_map = m_map ;
+ record->m_destroy.m_space = prop.execution ;
+
+ if ( prop.initialize.value ) {
+ m_map.construct( prop.execution );
+ }
+
+ // Destroy functor assigned and initialization complete, start tracking
+ m_track = track_type( record );
+ }
+
+ template< class Prop >
+ explicit inline
+ View( const Prop & arg_prop
+ , const typename traits::array_layout & arg_layout
+ )
+ : m_track()
+ , m_map()
+ {
+ // Merge the < execution_space , memory_space > into the properties.
+ typedef Kokkos::Experimental::Impl::ViewAllocProp< typename traits::device_type , Prop > alloc_prop ;
+
+ typedef typename alloc_prop::execution_space execution_space ;
+ typedef typename traits::memory_space memory_space ;
+ typedef DestroyFunctor< execution_space > destroy_functor ;
+ typedef Kokkos::Experimental::Impl::SharedAllocationRecord< memory_space , destroy_functor > record_type ;
+
+ static_assert( traits::is_managed , "View allocation constructor requires managed memory" );
+
+ const alloc_prop prop( arg_prop );
+
+ // Query the mapping for byte-size of allocation.
+ const size_t alloc_size = map_type::memory_span( prop.allow_padding , arg_layout );
+
+ // Allocate memory from the memory space.
+ record_type * const record = record_type::allocate( prop.memory , prop.label , alloc_size );
+
+ // Construct the mapping object prior to start of tracking
+ // to assign destroy functor and possibly initialize.
+ m_map = map_type( record->data() , prop.allow_padding , arg_layout );
+
+ // Copy the destroy functor into the allocation record before initiating tracking.
+ record->m_destroy.m_map = m_map ;
+ record->m_destroy.m_space = prop.execution ;
+
+ if ( prop.initialize.value ) {
+ m_map.construct( prop.execution );
+ }
+
+ // Destroy functor assigned and initialization complete, start tracking
+ m_track = track_type( record );
+ }
+
+ //----------------------------------------
+ // Memory span required to wrap these dimensions.
+ static constexpr size_t memory_span( const size_t arg_N0 = 0
+ , const size_t arg_N1 = 0
+ , const size_t arg_N2 = 0
+ , const size_t arg_N3 = 0
+ , const size_t arg_N4 = 0
+ , const size_t arg_N5 = 0
+ , const size_t arg_N6 = 0
+ , const size_t arg_N7 = 0
+ )
+ {
+ return map_type::memory_span( std::integral_constant<bool,false>()
+ , arg_N0 , arg_N1 , arg_N2 , arg_N3
+ , arg_N4 , arg_N5 , arg_N6 , arg_N7 );
+ }
+
+ explicit inline
+ View( typename traits::value_type * const arg_ptr
+ , const size_t arg_N0 = 0
+ , const size_t arg_N1 = 0
+ , const size_t arg_N2 = 0
+ , const size_t arg_N3 = 0
+ , const size_t arg_N4 = 0
+ , const size_t arg_N5 = 0
+ , const size_t arg_N6 = 0
+ , const size_t arg_N7 = 0
+ )
+ : m_track() // No memory tracking
+ , m_map( arg_ptr , std::integral_constant<bool,false>()
+ , arg_N0 , arg_N1 , arg_N2 , arg_N3
+ , arg_N4 , arg_N5 , arg_N6 , arg_N7 )
+ {}
+
+ explicit inline
+ View( typename traits::value_type * const arg_ptr
+ , typename traits::array_layout & arg_layout
+ )
+ : m_track() // No memory tracking
+ , m_map( arg_ptr , std::integral_constant<bool,false>(), arg_layout )
+ {}
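A minimal sketch of wrapping user-owned memory with these unmanaged constructors (buffer and extents below are hypothetical); memory_span() reports the number of bytes the wrapped pointer must provide for the requested dimensions:

    typedef Kokkos::Experimental::View< double** , Kokkos::HostSpace >  host_matrix ;

    const size_t n0 = 100 , n1 = 8 ;

    // Size the user allocation from the mapping's own span computation.
    double * const buffer = new double[ host_matrix::memory_span( n0 , n1 ) / sizeof(double) ];

    // Wrap the existing allocation: m_track stays empty, so there is no
    // reference counting and the caller keeps ownership of 'buffer'.
    host_matrix a( buffer , n0 , n1 );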
+
+ //----------------------------------------
+ // Shared scratch memory constructor
+
+ static inline
+ size_t shmem_size( const size_t arg_N0 = 0 ,
+ const size_t arg_N1 = 0 ,
+ const size_t arg_N2 = 0 ,
+ const size_t arg_N3 = 0 ,
+ const size_t arg_N4 = 0 ,
+ const size_t arg_N5 = 0 ,
+ const size_t arg_N6 = 0 ,
+ const size_t arg_N7 = 0 )
+ {
+ return map_type::memory_span( std::integral_constant<bool,false>()
+ , arg_N0 , arg_N1 , arg_N2 , arg_N3
+ , arg_N4 , arg_N5 , arg_N6 , arg_N7 );
+ }
+
+ explicit KOKKOS_INLINE_FUNCTION
+ View( const typename traits::execution_space::scratch_memory_space & arg_space
+ , const size_t arg_N0 = 0
+ , const size_t arg_N1 = 0
+ , const size_t arg_N2 = 0
+ , const size_t arg_N3 = 0
+ , const size_t arg_N4 = 0
+ , const size_t arg_N5 = 0
+ , const size_t arg_N6 = 0
+ , const size_t arg_N7 = 0 )
+ : m_track() // No memory tracking
+ , m_map( arg_space.get_shmem( map_type::memory_span( std::integral_constant<bool,false>()
+ , arg_N0 , arg_N1 , arg_N2 , arg_N3
+ , arg_N4 , arg_N5 , arg_N6 , arg_N7 ) )
+ , std::integral_constant<bool,false>()
+ , arg_N0 , arg_N1 , arg_N2 , arg_N3
+ , arg_N4 , arg_N5 , arg_N6 , arg_N7 )
+ {}
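A rough sketch of the intended use from a team-parallel functor; the team_shmem_size() request protocol and the member's team_shmem() accessor are assumed from the mainline Kokkos team interface and are not defined in this patch:

    typedef Kokkos::TeamPolicy< Kokkos::DefaultExecutionSpace >  policy_type ;
    typedef Kokkos::Experimental::View< double* , Kokkos::DefaultExecutionSpace >  scratch_view ;

    struct ScratchUser {
      // Per-team shared-memory request, queried by the parallel dispatch.
      unsigned team_shmem_size( int /*team_size*/ ) const
        { return static_cast<unsigned>( scratch_view::shmem_size( 128 ) ); }

      KOKKOS_INLINE_FUNCTION
      void operator()( const policy_type::member_type & team ) const
      {
        // Carve a rank-1 view of 128 doubles out of the team's scratch space.
        scratch_view tmp( team.team_shmem() , 128 );
        /* ... fill and use tmp cooperatively within the team ... */
      }
    };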
+
+ //----------------------------------------
+ // Subviews
+
+private:
+
+ explicit KOKKOS_INLINE_FUNCTION
+ View( const track_type & rhs )
+ : m_track( rhs )
+ , m_map()
+ {}
+
+public:
+
+ template< class D , class A1 , class A2 , class A3
+ , class T0 , class T1 , class T2 , class T3
+ , class T4 , class T5 , class T6 , class T7 >
+ friend
+ KOKKOS_INLINE_FUNCTION
+ Kokkos::Experimental::Subview< View< D , A1 , A2 , A3 >
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T0>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T1>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T2>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T3>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T4>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T5>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T6>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T7>::is_range
+ >
+ subview( const View< D , A1 , A2 , A3 > & src
+ , T0 const & arg0 , T1 const & arg1 , T2 const & arg2 , T3 const & arg3
+ , T4 const & arg4 , T5 const & arg5 , T6 const & arg6 , T7 const & arg7
+ );
+
+ template< class D , class A1 , class A2 , class A3
+ , class T0 , class T1 , class T2 , class T3
+ , class T4 , class T5 , class T6 >
+ friend
+ KOKKOS_INLINE_FUNCTION
+ Kokkos::Experimental::Subview< View< D , A1 , A2 , A3 >
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T0>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T1>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T2>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T3>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T4>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T5>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T6>::is_range
+ >
+ subview( const View< D , A1 , A2 , A3 > & src
+ , T0 const & arg0 , T1 const & arg1 , T2 const & arg2 , T3 const & arg3
+ , T4 const & arg4 , T5 const & arg5 , T6 const & arg6
+ );
+
+ template< class D , class A1 , class A2 , class A3
+ , class T0 , class T1 , class T2 , class T3
+ , class T4 , class T5 >
+ friend
+ KOKKOS_INLINE_FUNCTION
+ Kokkos::Experimental::Subview< View< D , A1 , A2 , A3 >
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T0>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T1>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T2>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T3>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T4>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T5>::is_range
+ >
+ subview( const View< D , A1 , A2 , A3 > & src
+ , T0 const & arg0 , T1 const & arg1 , T2 const & arg2 , T3 const & arg3
+ , T4 const & arg4 , T5 const & arg5
+ );
+
+ template< class D , class A1 , class A2 , class A3
+ , class T0 , class T1 , class T2 , class T3
+ , class T4 >
+ friend
+ KOKKOS_INLINE_FUNCTION
+ Kokkos::Experimental::Subview< View< D , A1 , A2 , A3 >
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T0>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T1>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T2>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T3>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T4>::is_range
+ >
+ subview( const View< D , A1 , A2 , A3 > & src
+ , T0 const & arg0 , T1 const & arg1 , T2 const & arg2 , T3 const & arg3
+ , T4 const & arg4
+ );
+
+ template< class D , class A1 , class A2 , class A3
+ , class T0 , class T1 , class T2 , class T3 >
+ friend
+ KOKKOS_INLINE_FUNCTION
+ Kokkos::Experimental::Subview< View< D , A1 , A2 , A3 >
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T0>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T1>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T2>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T3>::is_range
+ >
+ subview( const View< D , A1 , A2 , A3 > & src
+ , T0 const & arg0 , T1 const & arg1 , T2 const & arg2 , T3 const & arg3
+ );
+
+ template< class D , class A1 , class A2 , class A3
+ , class T0 , class T1 , class T2 >
+ friend
+ KOKKOS_INLINE_FUNCTION
+ Kokkos::Experimental::Subview< View< D , A1 , A2 , A3 >
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T0>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T1>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T2>::is_range
+ >
+ subview( const View< D , A1 , A2 , A3 > & src
+ , T0 const & arg0 , T1 const & arg1 , T2 const & arg2
+ );
+
+ template< class D , class A1 , class A2 , class A3
+ , class T0 , class T1 >
+ friend
+ KOKKOS_INLINE_FUNCTION
+ Kokkos::Experimental::Subview< View< D , A1 , A2 , A3 >
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T0>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T1>::is_range
+ >
+ subview( const View< D , A1 , A2 , A3 > & src
+ , T0 const & arg0 , T1 const & arg1
+ );
+
+ template< class D, class A1, class A2, class A3, class T0 >
+ friend
+ KOKKOS_INLINE_FUNCTION
+ Kokkos::Experimental::Subview< View< D, A1, A2, A3 >
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T0>::is_range
+ >
+ subview( const View< D, A1, A2, A3 > & src , T0 const & arg0 );
+
+};
+
+template< class > struct is_view : public std::false_type {};
+
+template< class D, class A1, class A2, class A3 >
+struct is_view< View<D,A1,A2,A3> > : public std::true_type {};
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+template< class D, class A1, class A2, class A3
+ , class T0 , class T1 , class T2 , class T3
+ , class T4 , class T5 , class T6 , class T7 >
+KOKKOS_INLINE_FUNCTION
+Kokkos::Experimental::Subview< View< D, A1, A2, A3 >
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T0>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T1>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T2>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T3>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T4>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T5>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T6>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T7>::is_range
+ >
+subview( const View< D, A1, A2, A3 > & src
+ , T0 const & arg0 , T1 const & arg1 , T2 const & arg2 , T3 const & arg3
+ , T4 const & arg4 , T5 const & arg5 , T6 const & arg6 , T7 const & arg7
+ )
+{
+ typedef View< D, A1, A2, A3 > SrcView ;
+
+ typedef Kokkos::Experimental::Impl::SubviewMapping
+ < typename SrcView::traits
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T0>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T1>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T2>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T3>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T4>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T5>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T6>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T7>::is_range
+ > Mapping ;
+
+ typedef typename Mapping::type DstView ;
+
+ static_assert( SrcView::Rank == 8 , "Subview of rank 8 View requires 8 arguments" );
+
+ DstView dst( src.m_track );
+
+ Mapping::assign( dst.m_map, src.m_map, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7 );
+
+ return dst ;
+}
+
+template< class D, class A1, class A2, class A3
+ , class T0 , class T1 , class T2 , class T3
+ , class T4 , class T5 , class T6 >
+KOKKOS_INLINE_FUNCTION
+Kokkos::Experimental::Subview< View< D, A1, A2, A3 >
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T0>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T1>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T2>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T3>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T4>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T5>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T6>::is_range
+ >
+subview( const View< D, A1, A2, A3 > & src
+ , T0 const & arg0 , T1 const & arg1 , T2 const & arg2 , T3 const & arg3
+ , T4 const & arg4 , T5 const & arg5 , T6 const & arg6
+ )
+{
+ typedef View< D, A1, A2, A3 > SrcView ;
+
+ typedef Kokkos::Experimental::Impl::SubviewMapping
+ < typename SrcView::traits
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T0>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T1>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T2>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T3>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T4>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T5>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T6>::is_range
+ > Mapping ;
+
+ typedef typename Mapping::type DstView ;
+
+ static_assert( SrcView::Rank == 7 , "Subview of rank 7 View requires 7 arguments" );
+
+ DstView dst( src.m_track );
+
+ Mapping::assign( dst.m_map, src.m_map, arg0, arg1, arg2, arg3, arg4, arg5, arg6, 0 );
+
+ return dst ;
+}
+
+template< class D, class A1, class A2, class A3
+ , class T0 , class T1 , class T2 , class T3
+ , class T4 , class T5 >
+KOKKOS_INLINE_FUNCTION
+Kokkos::Experimental::Subview< View< D, A1, A2, A3 >
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T0>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T1>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T2>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T3>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T4>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T5>::is_range
+ >
+subview( const View< D, A1, A2, A3 > & src
+ , T0 const & arg0 , T1 const & arg1 , T2 const & arg2 , T3 const & arg3
+ , T4 const & arg4 , T5 const & arg5
+ )
+{
+ typedef View< D, A1, A2, A3 > SrcView ;
+
+ typedef Kokkos::Experimental::Impl::SubviewMapping
+ < typename SrcView::traits
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T0>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T1>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T2>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T3>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T4>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T5>::is_range
+ > Mapping ;
+
+ typedef typename Mapping::type DstView ;
+
+ static_assert( SrcView::Rank == 6 , "Subview of rank 6 View requires 6 arguments" );
+
+ DstView dst( src.m_track );
+
+ Mapping::assign( dst.m_map, src.m_map, arg0, arg1, arg2, arg3, arg4, arg5, 0, 0 );
+
+ return dst ;
+}
+
+template< class D, class A1, class A2, class A3
+ , class T0 , class T1 , class T2 , class T3
+ , class T4 >
+KOKKOS_INLINE_FUNCTION
+Kokkos::Experimental::Subview< View< D, A1, A2, A3 >
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T0>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T1>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T2>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T3>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T4>::is_range
+ >
+subview( const View< D, A1, A2, A3 > & src
+ , T0 const & arg0 , T1 const & arg1 , T2 const & arg2 , T3 const & arg3
+ , T4 const & arg4
+ )
+{
+ typedef View< D, A1, A2, A3 > SrcView ;
+
+ typedef Kokkos::Experimental::Impl::SubviewMapping
+ < typename SrcView::traits
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T0>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T1>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T2>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T3>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T4>::is_range
+ > Mapping ;
+
+ typedef typename Mapping::type DstView ;
+
+ static_assert( SrcView::Rank == 5 , "Subview of rank 5 View requires 5 arguments" );
+
+ DstView dst( src.m_track );
+
+ Mapping::assign( dst.m_map, src.m_map, arg0, arg1, arg2, arg3, arg4, 0, 0, 0 );
+
+ return dst ;
+}
+
+template< class D, class A1, class A2, class A3
+ , class T0 , class T1 , class T2 , class T3 >
+KOKKOS_INLINE_FUNCTION
+Kokkos::Experimental::Subview< View< D, A1, A2, A3 >
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T0>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T1>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T2>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T3>::is_range
+ >
+subview( const View< D, A1, A2, A3 > & src
+ , T0 const & arg0 , T1 const & arg1 , T2 const & arg2 , T3 const & arg3
+ )
+{
+ typedef View< D, A1, A2, A3 > SrcView ;
+
+ typedef Kokkos::Experimental::Impl::SubviewMapping
+ < typename SrcView::traits
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T0>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T1>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T2>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T3>::is_range
+ > Mapping ;
+
+ typedef typename Mapping::type DstView ;
+
+ static_assert( SrcView::Rank == 4 , "Subview of rank 4 View requires 4 arguments" );
+
+ DstView dst( src.m_track );
+
+ Mapping::assign( dst.m_map, src.m_map, arg0, arg1, arg2, arg3, 0, 0, 0, 0 );
+
+ return dst ;
+}
+
+template< class D, class A1, class A2, class A3
+ , class T0 , class T1 , class T2 >
+KOKKOS_INLINE_FUNCTION
+Kokkos::Experimental::Subview< View< D, A1, A2, A3 >
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T0>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T1>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T2>::is_range
+ >
+subview( const View< D, A1, A2, A3 > & src
+ , T0 const & arg0 , T1 const & arg1 , T2 const & arg2
+ )
+{
+ typedef View< D, A1, A2, A3 > SrcView ;
+
+ typedef Kokkos::Experimental::Impl::SubviewMapping
+ < typename SrcView::traits
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T0>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T1>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T2>::is_range
+ > Mapping ;
+
+ typedef typename Mapping::type DstView ;
+
+ static_assert( SrcView::Rank == 3 , "Subview of rank 3 View requires 3 arguments" );
+
+ DstView dst( src.m_track );
+
+ Mapping::assign( dst.m_map, src.m_map, arg0, arg1, arg2, 0, 0, 0, 0, 0 );
+
+ return dst ;
+}
+
+template< class D, class A1, class A2, class A3
+ , class T0 , class T1 >
+KOKKOS_INLINE_FUNCTION
+Kokkos::Experimental::Subview< View< D, A1, A2, A3 >
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T0>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T1>::is_range
+ >
+subview( const View< D, A1, A2, A3 > & src
+ , T0 const & arg0 , T1 const & arg1
+ )
+{
+ typedef View< D, A1, A2, A3 > SrcView ;
+
+ typedef Kokkos::Experimental::Impl::SubviewMapping
+ < typename SrcView::traits
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T0>::is_range
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T1>::is_range
+ > Mapping ;
+
+ typedef typename Mapping::type DstView ;
+
+ static_assert( SrcView::Rank == 2 , "Subview of rank 2 View requires 2 arguments" );
+
+ DstView dst( src.m_track );
+
+ Mapping::assign( dst.m_map, src.m_map, arg0, arg1, 0, 0, 0, 0, 0, 0 );
+
+ return dst ;
+}
+
+template< class D, class A1, class A2, class A3, class T0 >
+KOKKOS_INLINE_FUNCTION
+Kokkos::Experimental::Subview< View< D, A1, A2, A3 >
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T0>::is_range
+ >
+subview( const View< D, A1, A2, A3 > & src , T0 const & arg0 )
+{
+ typedef View< D, A1, A2, A3 > SrcView ;
+
+ typedef Kokkos::Experimental::Impl::SubviewMapping
+ < typename SrcView::traits
+ , Kokkos::Experimental::Impl::ViewOffsetRange<T0>::is_range
+ > Mapping ;
+
+ typedef typename Mapping::type DstView ;
+
+  static_assert( SrcView::Rank == 1 , "Subview of rank 1 View requires 1 argument" );
+
+ DstView dst( src.m_track );
+
+ Mapping::assign( dst.m_map , src.m_map , arg0, 0, 0, 0, 0, 0, 0, 0 );
+
+ return dst ;
+}
+
+} /* namespace Experimental */
+} /* namespace Kokkos */
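The overloads above implement subview extraction for ranks 7 down to 1: each argument is classified by ViewOffsetRange as either a single index (which removes that dimension) or a range (which keeps it), and a static_assert enforces that the argument count equals the source rank. A minimal usage sketch, assuming Kokkos has been initialized and that Kokkos::pair is one of the range types accepted by ViewOffsetRange; the label and extents are arbitrary:

    Kokkos::Experimental::View<double**> a( "A" , 100 , 50 );

    // Rows 10..19 of column 7: the scalar argument collapses the second dimension,
    // so the result is a rank-1 subview.
    auto rows = Kokkos::Experimental::subview( a , Kokkos::pair<int,int>(10,20) , 7 );

    // Row 3, all 50 columns: rank 1 again, but along the other dimension.
    auto row3 = Kokkos::Experimental::subview( a , 3 , Kokkos::pair<int,int>(0,50) );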
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+namespace Kokkos {
+namespace Experimental {
+
+template< class LT , class L1 , class L2 , class L3
+ , class RT , class R1 , class R2 , class R3 >
+KOKKOS_INLINE_FUNCTION
+bool operator == ( const View<LT,L1,L2,L3> & lhs ,
+ const View<RT,R1,R2,R3> & rhs )
+{
+ // Same data, layout, dimensions
+ typedef ViewTraits<LT,L1,L2,L3> lhs_traits ;
+ typedef ViewTraits<RT,R1,R2,R3> rhs_traits ;
+
+ return
+ std::is_same< typename lhs_traits::const_value_type ,
+ typename rhs_traits::const_value_type >::value &&
+ std::is_same< typename lhs_traits::array_layout ,
+ typename rhs_traits::array_layout >::value &&
+ std::is_same< typename lhs_traits::memory_space ,
+ typename rhs_traits::memory_space >::value &&
+ lhs_traits::Rank == rhs_traits::Rank &&
+ lhs.data() == rhs.data() &&
+ lhs.span() == rhs.span() &&
+ lhs.dimension_0() == rhs.dimension_0() &&
+ lhs.dimension_1() == rhs.dimension_1() &&
+ lhs.dimension_2() == rhs.dimension_2() &&
+ lhs.dimension_3() == rhs.dimension_3() &&
+ lhs.dimension_4() == rhs.dimension_4() &&
+ lhs.dimension_5() == rhs.dimension_5() &&
+ lhs.dimension_6() == rhs.dimension_6() &&
+ lhs.dimension_7() == rhs.dimension_7();
+}
+
+template< class LT , class L1 , class L2 , class L3
+ , class RT , class R1 , class R2 , class R3 >
+KOKKOS_INLINE_FUNCTION
+bool operator != ( const View<LT,L1,L2,L3> & lhs ,
+ const View<RT,R1,R2,R3> & rhs )
+{
+ return ! ( operator==(lhs,rhs) );
+}
+
+} /* namespace Experimental */
+} /* namespace Kokkos */
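The comparison above is shallow: two views are equal only when they have the same const value type, layout, memory space, and rank, and additionally share the same data pointer, span, and extents; element values are never inspected. A short sketch of the consequence:

    Kokkos::Experimental::View<int*> a( "A" , 10 );
    Kokkos::Experimental::View<int*> b( "B" , 10 );
    Kokkos::Experimental::View<int*> c = a ;   // shallow copy, shares the allocation

    bool same_allocation = ( a == c );   // true:  identical data pointer and extents
    bool same_contents   = ( a == b );   // false: equal sizes but distinct allocations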
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+namespace Kokkos {
+namespace Experimental {
+namespace Impl {
+
+template< class OutputView , typename Enable = void >
+struct ViewFill {
+
+ typedef typename OutputView::const_value_type const_value_type ;
+
+ const OutputView output ;
+ const_value_type input ;
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const size_t i0 ) const
+ {
+ const size_t n1 = output.dimension_1();
+ const size_t n2 = output.dimension_2();
+ const size_t n3 = output.dimension_3();
+ const size_t n4 = output.dimension_4();
+ const size_t n5 = output.dimension_5();
+ const size_t n6 = output.dimension_6();
+ const size_t n7 = output.dimension_7();
+
+ for ( size_t i1 = 0 ; i1 < n1 ; ++i1 ) {
+ for ( size_t i2 = 0 ; i2 < n2 ; ++i2 ) {
+ for ( size_t i3 = 0 ; i3 < n3 ; ++i3 ) {
+ for ( size_t i4 = 0 ; i4 < n4 ; ++i4 ) {
+ for ( size_t i5 = 0 ; i5 < n5 ; ++i5 ) {
+ for ( size_t i6 = 0 ; i6 < n6 ; ++i6 ) {
+ for ( size_t i7 = 0 ; i7 < n7 ; ++i7 ) {
+ output(i0,i1,i2,i3,i4,i5,i6,i7) = input ;
+ }}}}}}}
+ }
+
+ ViewFill( const OutputView & arg_out , const_value_type & arg_in )
+ : output( arg_out ), input( arg_in )
+ {
+ typedef typename OutputView::execution_space execution_space ;
+ typedef Kokkos::RangePolicy< execution_space > Policy ;
+
+ (void) Kokkos::Impl::ParallelFor< ViewFill , Policy >( *this , Policy( 0 , output.dimension_0() ) );
+
+ execution_space::fence();
+ }
+};
+
+template< class OutputView >
+struct ViewFill< OutputView , typename std::enable_if< OutputView::Rank == 0 >::type > {
+ ViewFill( const OutputView & dst , const typename OutputView::const_value_type & src )
+ {
+ Kokkos::Impl::DeepCopy< typename OutputView::memory_space , Kokkos::HostSpace >
+ ( dst.data() , & src , sizeof(typename OutputView::const_value_type) );
+ }
+};
+
+template< class OutputView , class InputView >
+struct ViewRemap {
+
+ const OutputView output ;
+ const InputView input ;
+ const size_t n0 ;
+ const size_t n1 ;
+ const size_t n2 ;
+ const size_t n3 ;
+ const size_t n4 ;
+ const size_t n5 ;
+ const size_t n6 ;
+ const size_t n7 ;
+
+ ViewRemap( const OutputView & arg_out , const InputView & arg_in )
+ : output( arg_out ), input( arg_in )
+ , n0( std::min( (size_t)arg_out.dimension_0() , (size_t)arg_in.dimension_0() ) )
+ , n1( std::min( (size_t)arg_out.dimension_1() , (size_t)arg_in.dimension_1() ) )
+ , n2( std::min( (size_t)arg_out.dimension_2() , (size_t)arg_in.dimension_2() ) )
+ , n3( std::min( (size_t)arg_out.dimension_3() , (size_t)arg_in.dimension_3() ) )
+ , n4( std::min( (size_t)arg_out.dimension_4() , (size_t)arg_in.dimension_4() ) )
+ , n5( std::min( (size_t)arg_out.dimension_5() , (size_t)arg_in.dimension_5() ) )
+ , n6( std::min( (size_t)arg_out.dimension_6() , (size_t)arg_in.dimension_6() ) )
+ , n7( std::min( (size_t)arg_out.dimension_7() , (size_t)arg_in.dimension_7() ) )
+ {
+ typedef typename OutputView::execution_space execution_space ;
+ typedef Kokkos::RangePolicy< execution_space > Policy ;
+ (void) Kokkos::Impl::ParallelFor< ViewRemap , Policy >( *this , Policy( 0 , n0 ) );
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const size_t i0 ) const
+ {
+ for ( size_t i1 = 0 ; i1 < n1 ; ++i1 ) {
+ for ( size_t i2 = 0 ; i2 < n2 ; ++i2 ) {
+ for ( size_t i3 = 0 ; i3 < n3 ; ++i3 ) {
+ for ( size_t i4 = 0 ; i4 < n4 ; ++i4 ) {
+ for ( size_t i5 = 0 ; i5 < n5 ; ++i5 ) {
+ for ( size_t i6 = 0 ; i6 < n6 ; ++i6 ) {
+ for ( size_t i7 = 0 ; i7 < n7 ; ++i7 ) {
+ output(i0,i1,i2,i3,i4,i5,i6,i7) = input(i0,i1,i2,i3,i4,i5,i6,i7);
+ }}}}}}}
+ }
+};
+
+} /* namespace Impl */
+} /* namespace Experimental */
+} /* namespace Kokkos */
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+namespace Kokkos {
+namespace Experimental {
+
+/** \brief Deep copy a value from Host memory into a view. */
+template< class DT , class D1 , class D2 , class D3 >
+inline
+void deep_copy( const View<DT,D1,D2,D3> & dst
+ , typename ViewTraits<DT,D1,D2,D3>::const_value_type & value )
+{
+ static_assert( std::is_same< typename ViewTraits<DT,D1,D2,D3>::non_const_value_type ,
+ typename ViewTraits<DT,D1,D2,D3>::value_type >::value
+ , "ERROR: Incompatible deep_copy( View , value )" );
+
+ Kokkos::Experimental::Impl::ViewFill< View<DT,D1,D2,D3> >( dst , value );
+}
+
+/** \brief Deep copy into a value in Host memory from a view. */
+template< class ST , class S1 , class S2 , class S3 >
+inline
+void deep_copy( ST & dst , const View<ST,S1,S2,S3> & src )
+{
+ static_assert( ViewTraits<ST,S1,S2,S3>::rank == 0
+ , "ERROR: Non-rank-zero view in deep_copy( value , View )" );
+
+ typedef ViewTraits<ST,S1,S2,S3> src_traits ;
+ typedef typename src_traits::memory_space src_memory_space ;
+ Kokkos::Impl::DeepCopy< HostSpace , src_memory_space >( & dst , src.data() , sizeof(ST) );
+}
+
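The two overloads above transfer a single value between host memory and a view: deep_copy(view, value) assigns the value to every entry of the view (via ViewFill), while deep_copy(scalar, view) requires a rank-zero view and copies its one entry back into a host scalar. A minimal sketch, assuming Kokkos has been initialized:

    Kokkos::Experimental::View<double*> x( "X" , 100 );
    Kokkos::Experimental::View<double>  s( "S" );        // rank-zero view

    Kokkos::Experimental::deep_copy( x , 1.5 );          // all 100 entries become 1.5
    Kokkos::Experimental::deep_copy( s , 2.0 );

    double host_value = 0 ;
    Kokkos::Experimental::deep_copy( host_value , s );   // host_value is now 2.0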
+//----------------------------------------------------------------------------
+/** \brief A deep copy between views of compatible type, and rank zero. */
+template< class DT , class D1 , class D2 , class D3
+ , class ST , class S1 , class S2 , class S3 >
+inline
+void deep_copy( const View<DT,D1,D2,D3> & dst ,
+ const View<ST,S1,S2,S3> & src ,
+ typename std::enable_if<(
+ // Rank zero:
+ ( unsigned(ViewTraits<DT,D1,D2,D3>::rank) == unsigned(0) ) &&
+ ( unsigned(ViewTraits<ST,S1,S2,S3>::rank) == unsigned(0) ) &&
+ // Same type and destination is not constant:
+ std::is_same< typename ViewTraits<DT,D1,D2,D3>::value_type ,
+ typename ViewTraits<ST,S1,S2,S3>::non_const_value_type >::value
+ )>::type * = 0 )
+{
+ typedef View<DT,D1,D2,D3> dst_type ;
+ typedef View<ST,S1,S2,S3> src_type ;
+
+ typedef typename dst_type::value_type value_type ;
+ typedef typename dst_type::memory_space dst_memory_space ;
+ typedef typename src_type::memory_space src_memory_space ;
+
+ if ( dst.data() != src.data() ) {
+ Kokkos::Impl::DeepCopy< dst_memory_space , src_memory_space >( dst.data() , src.data() , sizeof(value_type) );
+ }
+}
+
+//----------------------------------------------------------------------------
+/** \brief A deep copy between views of the default specialization, compatible type,
+ * same non-zero rank, same contiguous layout.
+ */
+template< class DT , class D1 , class D2 , class D3 ,
+ class ST , class S1 , class S2 , class S3 >
+inline
+void deep_copy( const View<DT,D1,D2,D3> & dst ,
+ const View<ST,S1,S2,S3> & src ,
+ typename std::enable_if<(
+ // destination is non-const.
+ std::is_same< typename ViewTraits<DT,D1,D2,D3>::value_type ,
+ typename ViewTraits<DT,D1,D2,D3>::non_const_value_type >::value
+ &&
+ // Same non-zero rank:
+ ( unsigned(ViewTraits<DT,D1,D2,D3>::rank) != 0 )
+ &&
+ ( unsigned(ViewTraits<DT,D1,D2,D3>::rank) ==
+ unsigned(ViewTraits<ST,S1,S2,S3>::rank) )
+ &&
+ // Not specialized, default ViewMapping
+ std::is_same< typename ViewTraits<DT,D1,D2,D3>::specialize , void >::value
+ &&
+ std::is_same< typename ViewTraits<ST,S1,S2,S3>::specialize , void >::value
+ )>::type * = 0 )
+{
+ typedef View<DT,D1,D2,D3> dst_type ;
+ typedef View<ST,S1,S2,S3> src_type ;
+
+ typedef typename dst_type::execution_space dst_execution_space ;
+ typedef typename dst_type::memory_space dst_memory_space ;
+ typedef typename src_type::memory_space src_memory_space ;
+
+ enum { DstExecCanAccessSrc =
+ Kokkos::Impl::VerifyExecutionCanAccessMemorySpace< typename dst_execution_space::memory_space , src_memory_space >::value };
+
+ if ( (void *) dst.data() != (void*) src.data() ) {
+
+    // Concern: if the views overlap, a parallel copy will be erroneous.
+    // ...
+
+    // If same type, equal layout, equal dimensions, equal span, and contiguous memory, then a byte-wise copy is possible
+
+ if ( std::is_same< typename ViewTraits<DT,D1,D2,D3>::value_type ,
+ typename ViewTraits<ST,S1,S2,S3>::non_const_value_type >::value &&
+ std::is_same< typename ViewTraits<DT,D1,D2,D3>::array_layout ,
+ typename ViewTraits<ST,S1,S2,S3>::array_layout >::value &&
+ dst.span_is_contiguous() &&
+ src.span_is_contiguous() &&
+ dst.span() == src.span() &&
+ dst.dimension_0() == src.dimension_0() &&
+ dst.dimension_1() == src.dimension_1() &&
+ dst.dimension_2() == src.dimension_2() &&
+ dst.dimension_3() == src.dimension_3() &&
+ dst.dimension_4() == src.dimension_4() &&
+ dst.dimension_5() == src.dimension_5() &&
+ dst.dimension_6() == src.dimension_6() &&
+ dst.dimension_7() == src.dimension_7() ) {
+
+ const size_t nbytes = sizeof(typename dst_type::value_type) * dst.span();
+
+ Kokkos::Impl::DeepCopy< dst_memory_space , src_memory_space >( dst.data() , src.data() , nbytes );
+ }
+ else if ( DstExecCanAccessSrc ) {
+      // Copying data between views in accessible memory spaces when the spans are non-contiguous or the shapes are incompatible.
+ Kokkos::Experimental::Impl::ViewRemap< dst_type , src_type >( dst , src );
+ }
+ else {
+ Kokkos::Impl::throw_runtime_exception("deep_copy given views that would require a temporary allocation");
+ }
+ }
+}
+
+} /* namespace Experimental */
+} /* namespace Kokkos */
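For same-rank, non-zero-rank views the overload above first checks whether a raw byte copy is legal (identical value type and layout, both spans contiguous, equal span and extents) and otherwise falls back to the element-wise ViewRemap, provided the destination's execution space can reach the source memory; if neither applies it throws rather than allocating a temporary. A sketch of the fast path, assuming two identically shaped views:

    Kokkos::Experimental::View<double**> a( "A" , 100 , 100 );
    Kokkos::Experimental::View<double**> b( "B" , 100 , 100 );

    // Same type, layout, extents, and contiguous spans: one bulk byte copy.
    Kokkos::Experimental::deep_copy( b , a );

A strided source, e.g. a subview, fails the contiguity test and takes the ViewRemap path instead.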
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+namespace Kokkos {
+namespace Experimental {
+
+template< class T , class A1, class A2, class A3 >
+inline
+typename Kokkos::Experimental::View<T,A1,A2,A3>::HostMirror
+create_mirror( const Kokkos::Experimental::View<T,A1,A2,A3> & src
+ , typename std::enable_if<
+ ! std::is_same< typename Kokkos::Experimental::ViewTraits<T,A1,A2,A3>::array_layout
+ , Kokkos::LayoutStride >::value
+ >::type * = 0
+ )
+{
+ typedef View<T,A1,A2,A3> src_type ;
+ typedef typename src_type::HostMirror dst_type ;
+
+ return dst_type( std::string( src.label() ).append("_mirror")
+ , src.dimension_0()
+ , src.dimension_1()
+ , src.dimension_2()
+ , src.dimension_3()
+ , src.dimension_4()
+ , src.dimension_5()
+ , src.dimension_6()
+ , src.dimension_7() );
+}
+
+template< class T , class A1, class A2, class A3 >
+inline
+typename Kokkos::Experimental::View<T,A1,A2,A3>::HostMirror
+create_mirror( const Kokkos::Experimental::View<T,A1,A2,A3> & src
+ , typename std::enable_if<
+ std::is_same< typename Kokkos::Experimental::ViewTraits<T,A1,A2,A3>::array_layout
+ , Kokkos::LayoutStride >::value
+ >::type * = 0
+ )
+{
+ typedef View<T,A1,A2,A3> src_type ;
+ typedef typename src_type::HostMirror dst_type ;
+
+ Kokkos::LayoutStride layout ;
+
+ layout.dimension[0] = src.dimension_0();
+ layout.dimension[1] = src.dimension_1();
+ layout.dimension[2] = src.dimension_2();
+ layout.dimension[3] = src.dimension_3();
+ layout.dimension[4] = src.dimension_4();
+ layout.dimension[5] = src.dimension_5();
+ layout.dimension[6] = src.dimension_6();
+ layout.dimension[7] = src.dimension_7();
+
+ layout.stride[0] = src.stride_0();
+ layout.stride[1] = src.stride_1();
+ layout.stride[2] = src.stride_2();
+ layout.stride[3] = src.stride_3();
+ layout.stride[4] = src.stride_4();
+ layout.stride[5] = src.stride_5();
+ layout.stride[6] = src.stride_6();
+ layout.stride[7] = src.stride_7();
+
+ return dst_type( std::string( src.label() ).append("_mirror") , layout );
+}
+
+template< class T , class A1 , class A2 , class A3 >
+inline
+typename Kokkos::Experimental::View<T,A1,A2,A3>::HostMirror
+create_mirror_view( const Kokkos::Experimental::View<T,A1,A2,A3> & src
+ , typename std::enable_if<(
+ std::is_same< typename Kokkos::Experimental::ViewTraits<T,A1,A2,A3>::memory_space
+ , typename Kokkos::Experimental::ViewTraits<T,A1,A2,A3>::host_mirror_space
+ >::value
+ )>::type * = 0
+ )
+{
+ return src ;
+}
+
+template< class T , class A1 , class A2 , class A3 >
+inline
+typename Kokkos::Experimental::View<T,A1,A2,A3>::HostMirror
+create_mirror_view( const Kokkos::Experimental::View<T,A1,A2,A3> & src
+ , typename std::enable_if<(
+ ! std::is_same< typename Kokkos::Experimental::ViewTraits<T,A1,A2,A3>::memory_space
+ , typename Kokkos::Experimental::ViewTraits<T,A1,A2,A3>::host_mirror_space
+ >::value
+ )>::type * = 0
+ )
+{
+ return Kokkos::Experimental::create_mirror( src );
+}
+
+} /* namespace Experimental */
+} /* namespace Kokkos */
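create_mirror always allocates a HostMirror with the same extents (and, for LayoutStride views, the same strides), whereas create_mirror_view returns the source view itself when its memory space already is the host mirror space, so no extra allocation happens in the host-only case. The usual staging pattern, sketched under the assumption that the deep_copy defined above handles the mirror/device pair:

    typedef Kokkos::Experimental::View<double*> view_type ;

    view_type d( "D" , 1000 );
    view_type::HostMirror h = Kokkos::Experimental::create_mirror_view( d );

    // ... fill h on the host ...
    Kokkos::Experimental::deep_copy( d , h );   // push host data to the device copy
    Kokkos::Experimental::deep_copy( h , d );   // pull results back after kernels run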
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+namespace Kokkos {
+namespace Experimental {
+
+/** \brief Resize a view, copying the old data into the new allocation at the corresponding indices. */
+template< class T , class A1 , class A2 , class A3 >
+inline
+void resize( Kokkos::Experimental::View<T,A1,A2,A3> & v ,
+ const size_t n0 = 0 ,
+ const size_t n1 = 0 ,
+ const size_t n2 = 0 ,
+ const size_t n3 = 0 ,
+ const size_t n4 = 0 ,
+ const size_t n5 = 0 ,
+ const size_t n6 = 0 ,
+ const size_t n7 = 0 )
+{
+ typedef Kokkos::Experimental::View<T,A1,A2,A3> view_type ;
+
+ static_assert( Kokkos::Experimental::ViewTraits<T,A1,A2,A3>::is_managed , "Can only resize managed views" );
+
+ view_type v_resized( v.label(), n0, n1, n2, n3, n4, n5, n6, n7 );
+
+ Kokkos::Experimental::Impl::ViewRemap< view_type , view_type >( v_resized , v );
+
+ v = v_resized ;
+}
+
+/** \brief Reallocate a view to new dimensions, discarding the old data. */
+template< class T , class A1 , class A2 , class A3 >
+inline
+void realloc( Kokkos::Experimental::View<T,A1,A2,A3> & v ,
+ const size_t n0 = 0 ,
+ const size_t n1 = 0 ,
+ const size_t n2 = 0 ,
+ const size_t n3 = 0 ,
+ const size_t n4 = 0 ,
+ const size_t n5 = 0 ,
+ const size_t n6 = 0 ,
+ const size_t n7 = 0 )
+{
+ typedef Kokkos::Experimental::View<T,A1,A2,A3> view_type ;
+
+ static_assert( Kokkos::Experimental::ViewTraits<T,A1,A2,A3>::is_managed , "Can only realloc managed views" );
+
+ const std::string label = v.label();
+
+  v = view_type(); // Deallocate first, if this is the last view of the allocation
+ v = view_type( label, n0, n1, n2, n3, n4, n5, n6, n7 );
+}
+
+} /* namespace Experimental */
+} /* namespace Kokkos */
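resize allocates a view with the new extents, copies the overlapping entries with ViewRemap, and assigns the result over the old handle; realloc skips the copy entirely, deallocating first when it holds the last reference and then allocating fresh storage under the same label. A minimal sketch:

    Kokkos::Experimental::View<int*> v( "V" , 10 );
    Kokkos::Experimental::deep_copy( v , 7 );

    Kokkos::Experimental::resize( v , 20 );    // entries 0..9 still hold 7; 10..19 are newly allocated
    Kokkos::Experimental::realloc( v , 5 );    // new allocation, previous contents discarded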
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+#if defined( KOKKOS_USING_EXPERIMENTAL_VIEW )
+
+namespace Kokkos {
+
+template< class D , class A1 = void , class A2 = void , class A3 = void >
+using ViewTraits = Kokkos::Experimental::ViewTraits<D,A1,A2,A3> ;
+
+template< class D , class A1 = void , class A2 = void , class A3 = void , class S = void >
+using View = Kokkos::Experimental::View<D,A1,A2,A3> ;
+
+using Kokkos::Experimental::deep_copy ;
+using Kokkos::Experimental::create_mirror ;
+using Kokkos::Experimental::create_mirror_view ;
+using Kokkos::Experimental::subview ;
+using Kokkos::Experimental::resize ;
+using Kokkos::Experimental::realloc ;
+
+namespace Impl {
+
+using Kokkos::Experimental::is_view ;
+
+class ViewDefault {};
+
+template< class SrcViewType
+ , class Arg0Type
+ , class Arg1Type
+ , class Arg2Type
+ , class Arg3Type
+ , class Arg4Type
+ , class Arg5Type
+ , class Arg6Type
+ , class Arg7Type
+ >
+struct ViewSubview /* { typedef ... type ; } */ ;
+
+}
+
+} /* namespace Kokkos */
+
+#include <impl/Kokkos_Atomic_View.hpp>
+
+#endif /* #if defined( KOKKOS_USING_EXPERIMENTAL_VIEW ) */
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+#endif
+
diff --git a/lib/kokkos/core/src/Kokkos_Atomic.hpp b/lib/kokkos/core/src/Kokkos_Atomic.hpp
index 856c740ea..60009e6d4 100755
--- a/lib/kokkos/core/src/Kokkos_Atomic.hpp
+++ b/lib/kokkos/core/src/Kokkos_Atomic.hpp
@@ -1,236 +1,285 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
/// \file Kokkos_Atomic.hpp
/// \brief Atomic functions
///
/// This header file defines prototypes for the following atomic functions:
/// - exchange
/// - compare and exchange
/// - add
///
/// Supported types include:
/// - signed and unsigned 4 and 8 byte integers
/// - float
/// - double
///
/// They are implemented through GCC compatible intrinsics, OpenMP
/// directives and native CUDA intrinsics.
///
/// Including this header file requires one of the following
/// compilers:
/// - NVCC (for CUDA device code only)
/// - GCC (for host code only)
/// - Intel (for host code only)
/// - A compiler that supports OpenMP 3.1 (for host code only)
#ifndef KOKKOS_ATOMIC_HPP
#define KOKKOS_ATOMIC_HPP
#include <Kokkos_Macros.hpp>
+#include <Kokkos_HostSpace.hpp>
#include <impl/Kokkos_Traits.hpp>
//----------------------------------------------------------------------------
-
-#if defined( __CUDA_ARCH__ )
+#if defined(_WIN32)
+#define KOKKOS_ATOMICS_USE_WINDOWS
+#else
+#if defined( __CUDA_ARCH__ ) && defined( KOKKOS_HAVE_CUDA )
// Compiling NVIDIA device code, must use Cuda atomics:
#define KOKKOS_ATOMICS_USE_CUDA
#elif ! defined( KOKKOS_ATOMICS_USE_GCC ) && \
! defined( KOKKOS_ATOMICS_USE_INTEL ) && \
! defined( KOKKOS_ATOMICS_USE_OMP31 )
// Compiling for non-Cuda atomic implementation has not been pre-selected.
// Choose the best implementation for the detected compiler.
// Preference: GCC, INTEL, OMP31
#if defined( KOKKOS_COMPILER_GNU ) || \
- defined( KOKKOS_COMPILER_CLANG )
+ defined( KOKKOS_COMPILER_CLANG ) || \
+ ( defined ( KOKKOS_COMPILER_NVCC ) && defined ( __GNUC__ ) )
#define KOKKOS_ATOMICS_USE_GCC
#elif defined( KOKKOS_COMPILER_INTEL ) || \
defined( KOKKOS_COMPILER_CRAYC )
#define KOKKOS_ATOMICS_USE_INTEL
#elif defined( _OPENMP ) && ( 201107 <= _OPENMP )
#define KOKKOS_ATOMICS_USE_OMP31
#else
#error "KOKKOS_ATOMICS_USE : Unsupported compiler"
#endif
#endif /* Not pre-selected atomic implementation */
+#endif
//----------------------------------------------------------------------------
+// Forward declaration of functions supporting arbitrarily sized atomics
+// This is necessary since Kokkos_Atomic.hpp is internally included very early
+// through Kokkos_HostSpace.hpp as well as the allocation tracker.
+#ifdef KOKKOS_HAVE_CUDA
+namespace Kokkos {
+namespace Impl {
+/// \brief Acquire a lock for the address
+///
+/// This function tries to acquire the lock for the hash value derived
+/// from the provided ptr. If the lock is successfully acquired the
+/// function returns true. Otherwise it returns false.
+__device__ inline
+bool lock_address_cuda_space(void* ptr);
+
+/// \brief Release lock for the address
+///
+/// This function releases the lock for the hash value derived
+/// from the provided ptr. This function should only be called
+/// after previously successfully acquiring a lock with
+/// lock_address.
+__device__ inline
+void unlock_address_cuda_space(void* ptr);
+}
+}
+#endif
+
+
namespace Kokkos {
template <typename T>
KOKKOS_INLINE_FUNCTION
void atomic_add(volatile T * const dest, const T src);
// Atomic increment
template<typename T>
KOKKOS_INLINE_FUNCTION
void atomic_increment(volatile T* a);
template<typename T>
KOKKOS_INLINE_FUNCTION
void atomic_decrement(volatile T* a);
}
-
+#if ! defined(_WIN32)
#include<impl/Kokkos_Atomic_Assembly_X86.hpp>
+#endif
namespace Kokkos {
inline
const char * atomic_query_version()
{
#if defined( KOKKOS_ATOMICS_USE_CUDA )
return "KOKKOS_ATOMICS_USE_CUDA" ;
#elif defined( KOKKOS_ATOMICS_USE_GCC )
return "KOKKOS_ATOMICS_USE_GCC" ;
#elif defined( KOKKOS_ATOMICS_USE_INTEL )
return "KOKKOS_ATOMICS_USE_INTEL" ;
#elif defined( KOKKOS_ATOMICS_USE_OMP31 )
return "KOKKOS_ATOMICS_USE_OMP31" ;
+#elif defined( KOKKOS_ATOMICS_USE_WINDOWS )
+ return "KOKKOS_ATOMICS_USE_WINDOWS";
#endif
}
} // namespace Kokkos
+#ifdef _WIN32
+#include "impl/Kokkos_Atomic_Windows.hpp"
+#else
//#include "impl/Kokkos_Atomic_Assembly_X86.hpp"
//----------------------------------------------------------------------------
// Atomic exchange
//
// template< typename T >
// T atomic_exchange( volatile T* const dest , const T val )
// { T tmp = *dest ; *dest = val ; return tmp ; }
#include "impl/Kokkos_Atomic_Exchange.hpp"
//----------------------------------------------------------------------------
// Atomic compare-and-exchange
//
// template<class T>
// bool atomic_compare_exchange_strong(volatile T* const dest, const T compare, const T val)
// { bool equal = compare == *dest ; if ( equal ) { *dest = val ; } return equal ; }
#include "impl/Kokkos_Atomic_Compare_Exchange_Strong.hpp"
//----------------------------------------------------------------------------
// Atomic fetch and add
//
// template<class T>
// T atomic_fetch_add(volatile T* const dest, const T val)
// { T tmp = *dest ; *dest += val ; return tmp ; }
#include "impl/Kokkos_Atomic_Fetch_Add.hpp"
+//----------------------------------------------------------------------------
+// Atomic fetch and sub
+//
+// template<class T>
+// T atomic_fetch_sub(volatile T* const dest, const T val)
+// { T tmp = *dest ; *dest -= val ; return tmp ; }
+
+#include "impl/Kokkos_Atomic_Fetch_Sub.hpp"
+
//----------------------------------------------------------------------------
// Atomic fetch and or
//
// template<class T>
// T atomic_fetch_or(volatile T* const dest, const T val)
// { T tmp = *dest ; *dest = tmp | val ; return tmp ; }
#include "impl/Kokkos_Atomic_Fetch_Or.hpp"
//----------------------------------------------------------------------------
// Atomic fetch and and
//
// template<class T>
// T atomic_fetch_and(volatile T* const dest, const T val)
// { T tmp = *dest ; *dest = tmp & val ; return tmp ; }
#include "impl/Kokkos_Atomic_Fetch_And.hpp"
+#endif /*Not _WIN32*/
//----------------------------------------------------------------------------
// Memory fence
//
// All loads and stores from this thread will be globally consistent before continuing
//
// void memory_fence() {...};
#include "impl/Kokkos_Memory_Fence.hpp"
//----------------------------------------------------------------------------
// Provide volatile_load and safe_load
//
// T volatile_load(T const volatile * const ptr);
//
// T const& safe_load(T const * const ptr);
// XEON PHI
// T safe_load(T const * const ptr);
#include "impl/Kokkos_Volatile_Load.hpp"
+#ifndef _WIN32
#include "impl/Kokkos_Atomic_Generic.hpp"
-
+#endif
//----------------------------------------------------------------------------
// This atomic-style macro should be an inlined function, not a macro
-#if defined( KOKKOS_COMPILER_GNU )
+#if defined( KOKKOS_COMPILER_GNU ) && !defined(__PGIC__)
#define KOKKOS_NONTEMPORAL_PREFETCH_LOAD(addr) __builtin_prefetch(addr,0,0)
#define KOKKOS_NONTEMPORAL_PREFETCH_STORE(addr) __builtin_prefetch(addr,1,0)
#else
#define KOKKOS_NONTEMPORAL_PREFETCH_LOAD(addr) ((void)0)
#define KOKKOS_NONTEMPORAL_PREFETCH_STORE(addr) ((void)0)
#endif
//----------------------------------------------------------------------------
#endif /* KOKKOS_ATOMIC_HPP */
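The headers pulled in above supply the free functions whose semantics are sketched in the comments (atomic_exchange, atomic_compare_exchange_strong, atomic_fetch_add, atomic_fetch_sub, atomic_fetch_or, atomic_fetch_and), with the back-end chosen at compile time from CUDA, GCC/Clang intrinsics, Intel, OpenMP 3.1, or the Windows path. A small host-side sketch of the most common calls, assuming one of the supported compilers:

    #include <Kokkos_Atomic.hpp>

    void atomic_sketch()
    {
      int counter = 0 ;

      int  previous = Kokkos::atomic_fetch_add( &counter , 5 );                      // previous == 0, counter == 5
      bool swapped  = Kokkos::atomic_compare_exchange_strong( &counter , 5 , 42 );   // true, counter == 42

      Kokkos::atomic_increment( &counter );                                          // counter == 43
      (void) previous ; (void) swapped ;
    }

In real use the pointer would refer to memory shared between threads, e.g. a View entry updated inside a parallel_for, which is where the atomicity matters.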
diff --git a/lib/kokkos/core/src/Kokkos_Core.hpp b/lib/kokkos/core/src/Kokkos_Core.hpp
index 8f5f34bfd..c521e2315 100755
--- a/lib/kokkos/core/src/Kokkos_Core.hpp
+++ b/lib/kokkos/core/src/Kokkos_Core.hpp
@@ -1,106 +1,228 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos
-// Manycore Performance-Portable Multidimensional Arrays
-//
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_CORE_HPP
#define KOKKOS_CORE_HPP
//----------------------------------------------------------------------------
// Include the execution space header files for the enabled execution spaces.
#include <Kokkos_Core_fwd.hpp>
#if defined( KOKKOS_HAVE_CUDA )
#include <Kokkos_Cuda.hpp>
#endif
#if defined( KOKKOS_HAVE_OPENMP )
#include <Kokkos_OpenMP.hpp>
#endif
#if defined( KOKKOS_HAVE_SERIAL )
#include <Kokkos_Serial.hpp>
#endif
#if defined( KOKKOS_HAVE_PTHREAD )
#include <Kokkos_Threads.hpp>
#endif
#include <Kokkos_Pair.hpp>
#include <Kokkos_View.hpp>
#include <Kokkos_Vectorization.hpp>
#include <Kokkos_Atomic.hpp>
#include <Kokkos_hwloc.hpp>
+#include <iostream>
+
//----------------------------------------------------------------------------
namespace Kokkos {
struct InitArguments {
int num_threads;
int num_numa;
int device_id;
InitArguments() {
num_threads = -1;
num_numa = -1;
device_id = -1;
}
};
void initialize(int& narg, char* arg[]);
void initialize(const InitArguments& args = InitArguments());
/** \brief Finalize the spaces that were initialized via Kokkos::initialize */
void finalize();
/** \brief Finalize all known execution spaces */
void finalize_all();
void fence();
}
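InitArguments bundles the thread, NUMA, and device selections, each defaulting to -1 ("choose automatically"); initialize accepts either that struct or the program's command-line arguments, and finalize releases whatever spaces were initialized. A typical driver, sketched:

    #include <Kokkos_Core.hpp>

    int main( int argc , char * argv[] )
    {
      Kokkos::initialize( argc , argv );   // or: InitArguments args; args.num_threads = 4; Kokkos::initialize( args );

      // ... allocate views and dispatch kernels ...

      Kokkos::finalize();
      return 0 ;
    }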
+#ifdef KOKKOS_HAVE_CXX11
+namespace Kokkos {
+
+namespace Impl {
+// should only be used by kokkos_malloc and kokkos_free
+struct MallocHelper
+{
+ static void increment_ref_count( AllocationTracker const & tracker )
+ {
+ tracker.increment_ref_count();
+ }
+
+ static void decrement_ref_count( AllocationTracker const & tracker )
+ {
+ tracker.decrement_ref_count();
+ }
+};
+} // namespace Impl
+
+/* Allocate memory from a memory space.
+ * The allocation is tracked in the Kokkos memory tracking system, so
+ * leaked memory can be identified.
+ */
+template< class Arg = DefaultExecutionSpace>
+void* kokkos_malloc(const std::string label, size_t count) {
+ typedef typename Arg::memory_space MemorySpace;
+  Impl::AllocationTracker tracker = MemorySpace::allocate_and_track(label,count);
+ Impl::MallocHelper::increment_ref_count( tracker );
+ return tracker.alloc_ptr();
+}
+
+template< class Arg = DefaultExecutionSpace>
+void* kokkos_malloc(const size_t& count) {
+ return kokkos_malloc<Arg>("DefaultLabel",count);
+}
+
+
+/* Free memory from a memory space.
+ */
+template< class Arg = DefaultExecutionSpace>
+void kokkos_free(const void* ptr) {
+ typedef typename Arg::memory_space MemorySpace;
+ typedef typename MemorySpace::allocator allocator;
+ Impl::AllocationTracker tracker = Impl::AllocationTracker::find<allocator>(ptr);
+ if (tracker.is_valid()) {
+ Impl::MallocHelper::decrement_ref_count( tracker );
+ }
+}
+
+
+template< class Arg = DefaultExecutionSpace>
+const void* kokkos_realloc(const void* old_ptr, size_t size) {
+ typedef typename Arg::memory_space MemorySpace;
+ typedef typename MemorySpace::allocator allocator;
+ Impl::AllocationTracker tracker = Impl::AllocationTracker::find<allocator>(old_ptr);
+
+ tracker.reallocate(size);
+
+ return tracker.alloc_ptr();
+}
+
+} // namespace Kokkos
#endif
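These helpers hand out raw, tracked allocations in the memory space of the chosen execution space; the AllocationTracker reference count is what allows kokkos_free to release the block and leaked allocations to be reported. A sketch, assuming Kokkos::initialize has been called and that the default memory space is host-accessible (the label and size below are arbitrary):

    double * p = static_cast<double*>( Kokkos::kokkos_malloc( "work_buffer" , 100 * sizeof(double) ) );

    // ... use p from code allowed to touch the default memory space ...

    Kokkos::kokkos_free( p );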
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+namespace Kokkos {
+namespace Experimental {
+
+template< class Space = typename Kokkos::DefaultExecutionSpace::memory_space >
+inline
+void * kokkos_malloc( const size_t arg_alloc_size )
+{
+ typedef typename Space::memory_space MemorySpace ;
+ typedef Kokkos::Experimental::Impl::SharedAllocationRecord< void , void > RecordBase ;
+ typedef Kokkos::Experimental::Impl::SharedAllocationRecord< MemorySpace , void > RecordHost ;
+
+ RecordHost * const r = RecordHost::allocate( MemorySpace() , "kokkos_malloc" , arg_alloc_size );
+
+ RecordBase::increment( r );
+
+ return r->data();
+}
+
+template< class Space = typename Kokkos::DefaultExecutionSpace::memory_space >
+inline
+void kokkos_free( void * arg_alloc )
+{
+ typedef typename Space::memory_space MemorySpace ;
+ typedef Kokkos::Experimental::Impl::SharedAllocationRecord< void , void > RecordBase ;
+ typedef Kokkos::Experimental::Impl::SharedAllocationRecord< MemorySpace , void > RecordHost ;
+
+ RecordHost * const r = RecordHost::get_record( arg_alloc );
+
+ RecordBase::decrement( r );
+}
+
+template< class Space = typename Kokkos::DefaultExecutionSpace::memory_space >
+inline
+void * kokkos_realloc( void * arg_alloc , const size_t arg_alloc_size )
+{
+ typedef typename Space::memory_space MemorySpace ;
+ typedef Kokkos::Experimental::Impl::SharedAllocationRecord< void , void > RecordBase ;
+ typedef Kokkos::Experimental::Impl::SharedAllocationRecord< MemorySpace , void > RecordHost ;
+
+ RecordHost * const r_old = RecordHost::get_record( arg_alloc );
+ RecordHost * const r_new = RecordHost::allocate( MemorySpace() , "kokkos_malloc" , arg_alloc_size );
+
+ Kokkos::Impl::DeepCopy<MemorySpace,MemorySpace>( r_new->data() , r_old->data()
+ , std::min( r_old->size() , r_new->size() ) );
+
+ RecordBase::increment( r_new );
+ RecordBase::decrement( r_old );
+
+ return r_new->data();
+}
+
+} // namespace Experimental
+} // namespace Kokkos
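The Experimental variants are parameterized on the memory space instead (defaulting to the default execution space's memory space) and sit on top of SharedAllocationRecord reference counting; kokkos_realloc allocates a new record, deep-copies the smaller of the old and new sizes, and then releases the old record. A HostSpace sketch:

    void * q = Kokkos::Experimental::kokkos_malloc< Kokkos::HostSpace >( 256 );
    q        = Kokkos::Experimental::kokkos_realloc< Kokkos::HostSpace >( q , 512 );   // first 256 bytes carried over
    Kokkos::Experimental::kokkos_free< Kokkos::HostSpace >( q );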
+
+#endif
+
diff --git a/lib/kokkos/core/src/Kokkos_Core_fwd.hpp b/lib/kokkos/core/src/Kokkos_Core_fwd.hpp
index 2661d315a..2cde9299a 100755
--- a/lib/kokkos/core/src/Kokkos_Core_fwd.hpp
+++ b/lib/kokkos/core/src/Kokkos_Core_fwd.hpp
@@ -1,150 +1,170 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos
-// Manycore Performance-Portable Multidimensional Arrays
-//
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_CORE_FWD_HPP
#define KOKKOS_CORE_FWD_HPP
//----------------------------------------------------------------------------
// Kokkos_Macros.hpp does introspection on configuration options
// and compiler environment then sets a collection of #define macros.
#include <Kokkos_Macros.hpp>
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
// Forward declarations for class inter-relationships
namespace Kokkos {
class HostSpace ; ///< Memory space for main process and CPU execution spaces
#if defined( KOKKOS_HAVE_SERIAL )
class Serial ; ///< Execution space main process on CPU
#endif // defined( KOKKOS_HAVE_SERIAL )
#if defined( KOKKOS_HAVE_PTHREAD )
class Threads ; ///< Execution space with pthreads back-end
#endif
#if defined( KOKKOS_HAVE_OPENMP )
class OpenMP ; ///< OpenMP execution space
#endif
#if defined( KOKKOS_HAVE_CUDA )
class CudaSpace ; ///< Memory space on Cuda GPU
class CudaUVMSpace ; ///< Memory space on Cuda GPU with UVM
class CudaHostPinnedSpace ; ///< Memory space on Host accessible to Cuda GPU
class Cuda ; ///< Execution space for Cuda GPU
#endif
+template<class ExecutionSpace, class MemorySpace>
+struct Device;
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
// Set the default execution space.
/// Define Kokkos::DefaultExecutionSpace as per configuration option
/// or chosen from the enabled execution spaces in the following order:
/// Kokkos::Cuda, Kokkos::OpenMP, Kokkos::Threads, Kokkos::Serial
namespace Kokkos {
#if defined ( KOKKOS_HAVE_DEFAULT_DEVICE_TYPE_CUDA )
- typedef Kokkos::Cuda DefaultExecutionSpace ;
+ typedef Cuda DefaultExecutionSpace ;
#elif defined ( KOKKOS_HAVE_DEFAULT_DEVICE_TYPE_OPENMP )
typedef OpenMP DefaultExecutionSpace ;
#elif defined ( KOKKOS_HAVE_DEFAULT_DEVICE_TYPE_THREADS )
typedef Threads DefaultExecutionSpace ;
#elif defined ( KOKKOS_HAVE_DEFAULT_DEVICE_TYPE_SERIAL )
typedef Serial DefaultExecutionSpace ;
#else
# error "At least one of the following execution spaces must be defined in order to use Kokkos: Kokkos::Cuda, Kokkos::OpenMP, Kokkos::Serial, or Kokkos::Threads."
#endif
+#if defined ( KOKKOS_HAVE_DEFAULT_DEVICE_TYPE_OPENMP )
+ typedef OpenMP DefaultHostExecutionSpace ;
+#elif defined ( KOKKOS_HAVE_DEFAULT_DEVICE_TYPE_THREADS )
+ typedef Threads DefaultHostExecutionSpace ;
+#elif defined ( KOKKOS_HAVE_DEFAULT_DEVICE_TYPE_SERIAL )
+ typedef Serial DefaultHostExecutionSpace ;
+#elif defined ( KOKKOS_HAVE_OPENMP )
+ typedef OpenMP DefaultHostExecutionSpace ;
+#elif defined ( KOKKOS_HAVE_PTHREAD )
+ typedef Threads DefaultHostExecutionSpace ;
+#elif defined ( KOKKOS_HAVE_SERIAL )
+ typedef Serial DefaultHostExecutionSpace ;
+#else
+# error "At least one of the following execution spaces must be defined in order to use Kokkos: Kokkos::OpenMP, Kokkos::Serial, or Kokkos::Threads."
+#endif
+
} // namespace Kokkos
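With these typedefs, generic code can target Kokkos::DefaultExecutionSpace (the highest-priority enabled back-end) while host-side staging work uses the newly added Kokkos::DefaultHostExecutionSpace, which is always a CPU space. A tiny compile-time check, sketched:

    #include <type_traits>
    #include <Kokkos_Core.hpp>

    // True only when the overall default space is itself a host space,
    // i.e. Cuda is not the configured default back-end.
    enum { default_is_host =
             std::is_same< Kokkos::DefaultExecutionSpace ,
                           Kokkos::DefaultHostExecutionSpace >::value };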
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
// Detect the active execution space and define its memory space.
// This is used to verify whether a running kernel can access
// a given memory space.
namespace Kokkos {
namespace Impl {
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_CUDA ) && defined (KOKKOS_HAVE_CUDA)
typedef Kokkos::CudaSpace ActiveExecutionMemorySpace ;
#elif defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
typedef Kokkos::HostSpace ActiveExecutionMemorySpace ;
#else
typedef void ActiveExecutionMemorySpace ;
#endif
template< class ActiveSpace , class MemorySpace >
struct VerifyExecutionCanAccessMemorySpace {
enum {value = 0};
};
template< class Space >
struct VerifyExecutionCanAccessMemorySpace< Space , Space >
{
enum {value = 1};
KOKKOS_INLINE_FUNCTION static void verify(void) {}
KOKKOS_INLINE_FUNCTION static void verify(const void *) {}
};
} // namespace Impl
} // namespace Kokkos
#define KOKKOS_RESTRICT_EXECUTION_TO_DATA( DATA_SPACE , DATA_PTR ) \
Kokkos::Impl::VerifyExecutionCanAccessMemorySpace< \
Kokkos::Impl::ActiveExecutionMemorySpace , DATA_SPACE >::verify( DATA_PTR )
#define KOKKOS_RESTRICT_EXECUTION_TO_( DATA_SPACE ) \
Kokkos::Impl::VerifyExecutionCanAccessMemorySpace< \
Kokkos::Impl::ActiveExecutionMemorySpace , DATA_SPACE >::verify()
+namespace Kokkos {
+ void fence();
+}
+
#endif /* #ifndef KOKKOS_CORE_FWD_HPP */
diff --git a/lib/kokkos/core/src/Kokkos_CrsArray.hpp b/lib/kokkos/core/src/Kokkos_CrsArray.hpp
deleted file mode 100755
index 53ab15b21..000000000
--- a/lib/kokkos/core/src/Kokkos_CrsArray.hpp
+++ /dev/null
@@ -1,171 +0,0 @@
-/*
-//@HEADER
-// ************************************************************************
-//
-// Kokkos
-// Manycore Performance-Portable Multidimensional Arrays
-//
-// Copyright (2012) Sandia Corporation
-//
-// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
-// the U.S. Government retains certain rights in this software.
-//
-// Redistribution and use in source and binary forms, with or without
-// modification, are permitted provided that the following conditions are
-// met:
-//
-// 1. Redistributions of source code must retain the above copyright
-// notice, this list of conditions and the following disclaimer.
-//
-// 2. Redistributions in binary form must reproduce the above copyright
-// notice, this list of conditions and the following disclaimer in the
-// documentation and/or other materials provided with the distribution.
-//
-// 3. Neither the name of the Corporation nor the names of the
-// contributors may be used to endorse or promote products derived from
-// this software without specific prior written permission.
-//
-// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
-// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
-// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
-// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
-// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
-// ************************************************************************
-//@HEADER
-*/
-
-#ifndef KOKKOS_CRSARRAY_HPP
-#define KOKKOS_CRSARRAY_HPP
-
-#include <string>
-#include <vector>
-
-#include <Kokkos_View.hpp>
-
-namespace Kokkos {
-
-/// \class CrsArray
-/// \brief Compressed row storage array.
-///
-/// \tparam DataType The type of stored entries. If a CrsArray is
-/// used as the graph of a sparse matrix, then this is usually an
-/// integer type, the type of the column indices in the sparse
-/// matrix.
-///
-/// \tparam Arg1Type The second template parameter, corresponding
-/// either to the Space type (if there are no more template
-/// parameters) or to the Layout type (if there is at least one more
-/// template parameter).
-///
-/// \tparam Arg2Type The third template parameter, which if provided
-/// corresponds to the Space type.
-///
-/// \tparam SizeType The type of row offsets. Usually the default
-/// parameter suffices. However, setting a nondefault value is
-/// necessary in some cases, for example, if you want to have a
-/// sparse matrices with dimensions (and therefore column indices)
-/// that fit in \c int, but want to store more than <tt>INT_MAX</tt>
-/// entries in the sparse matrix.
-///
-/// A row has a range of entries:
-/// <ul>
-/// <li> <tt> row_map[i0] <= entry < row_map[i0+1] </tt> </li>
-/// <li> <tt> 0 <= i1 < row_map[i0+1] - row_map[i0] </tt> </li>
-/// <li> <tt> entries( entry , i2 , i3 , ... ); </tt> </li>
-/// <li> <tt> entries( row_map[i0] + i1 , i2 , i3 , ... ); </tt> </li>
-/// </ul>
-template< class DataType,
- class Arg1Type,
- class Arg2Type = void,
- typename SizeType = typename ViewTraits<DataType*, Arg1Type, Arg2Type, void >::size_type>
-class CrsArray {
-private:
- typedef ViewTraits<DataType*, Arg1Type, Arg2Type, void> traits ;
-
-public:
- typedef DataType data_type;
- typedef typename traits::array_layout array_layout;
- typedef typename traits::execution_space execution_space ;
- typedef typename traits::memory_space memory_space ;
- typedef SizeType size_type;
-
- typedef CrsArray< DataType , Arg1Type , Arg2Type , SizeType > crsarray_type;
- typedef CrsArray< DataType , array_layout , typename traits::host_mirror_space , SizeType > HostMirror;
- typedef View< const size_type* , array_layout, execution_space > row_map_type;
- typedef View< DataType* , array_layout, execution_space > entries_type;
-
- entries_type entries;
- row_map_type row_map;
-
- //! Construct an empty view.
- CrsArray () : entries(), row_map() {}
-
- //! Copy constructor (shallow copy).
- CrsArray (const CrsArray& rhs) : entries (rhs.entries), row_map (rhs.row_map)
- {}
-
- /** \brief Assign to a view of the rhs array.
- * If the old view is the last view
- * then allocated memory is deallocated.
- */
- CrsArray& operator= (const CrsArray& rhs) {
- entries = rhs.entries;
- row_map = rhs.row_map;
- return *this;
- }
-
- /** \brief Destroy this view of the array.
- * If the last view then allocated memory is deallocated.
- */
- ~CrsArray() {}
-};
-
-//----------------------------------------------------------------------------
-
-template< class CrsArrayType , class InputSizeType >
-typename CrsArrayType::crsarray_type
-create_crsarray( const std::string & label ,
- const std::vector< InputSizeType > & input );
-
-template< class CrsArrayType , class InputSizeType >
-typename CrsArrayType::crsarray_type
-create_crsarray( const std::string & label ,
- const std::vector< std::vector< InputSizeType > > & input );
-
-//----------------------------------------------------------------------------
-
-template< class DataType ,
- class Arg1Type ,
- class Arg2Type ,
- typename SizeType >
-typename CrsArray< DataType , Arg1Type , Arg2Type , SizeType >::HostMirror
-create_mirror_view( const CrsArray<DataType,Arg1Type,Arg2Type,SizeType > & input );
-
-template< class DataType ,
- class Arg1Type ,
- class Arg2Type ,
- typename SizeType >
-typename CrsArray< DataType , Arg1Type , Arg2Type , SizeType >::HostMirror
-create_mirror( const CrsArray<DataType,Arg1Type,Arg2Type,SizeType > & input );
-
-} // namespace Kokkos
-
-//----------------------------------------------------------------------------
-//----------------------------------------------------------------------------
-
-#include <impl/Kokkos_CrsArray_factory.hpp>
-
-//----------------------------------------------------------------------------
-//----------------------------------------------------------------------------
-
-#endif /* #ifndef KOKKOS_CRSARRAY_HPP */
-
diff --git a/lib/kokkos/core/src/Kokkos_Cuda.hpp b/lib/kokkos/core/src/Kokkos_Cuda.hpp
index e3325a975..d736459b5 100755
--- a/lib/kokkos/core/src/Kokkos_Cuda.hpp
+++ b/lib/kokkos/core/src/Kokkos_Cuda.hpp
@@ -1,263 +1,268 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos
-// Manycore Performance-Portable Multidimensional Arrays
-//
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_CUDA_HPP
#define KOKKOS_CUDA_HPP
#include <Kokkos_Core_fwd.hpp>
// If CUDA execution space is enabled then use this header file.
#if defined( KOKKOS_HAVE_CUDA )
#include <iosfwd>
#include <vector>
#include <Kokkos_CudaSpace.hpp>
#include <Kokkos_Parallel.hpp>
#include <Kokkos_Layout.hpp>
#include <Kokkos_ScratchSpace.hpp>
#include <Kokkos_MemoryTraits.hpp>
#include <impl/Kokkos_Tags.hpp>
/*--------------------------------------------------------------------------*/
namespace Kokkos {
namespace Impl {
class CudaExec ;
} // namespace Impl
} // namespace Kokkos
/*--------------------------------------------------------------------------*/
namespace Kokkos {
/// \class Cuda
/// \brief Kokkos Execution Space that uses CUDA to run on GPUs.
///
/// An "execution space" represents a parallel execution model. It tells Kokkos
/// how to parallelize the execution of kernels in a parallel_for or
/// parallel_reduce. For example, the Threads execution space uses Pthreads or
/// C++11 threads on a CPU, the OpenMP execution space uses the OpenMP language
/// extensions, and the Serial execution space executes "parallel" kernels
/// sequentially. The Cuda execution space uses NVIDIA's CUDA programming
/// model to execute kernels in parallel on GPUs.
class Cuda {
public:
//! \name Type declarations that all Kokkos execution spaces must provide.
//@{
//! Tag this class as a kokkos execution space
typedef Cuda execution_space ;
#if defined( KOKKOS_USE_CUDA_UVM )
//! This execution space's preferred memory space.
typedef CudaUVMSpace memory_space ;
#else
//! This execution space's preferred memory space.
typedef CudaSpace memory_space ;
#endif
+  //! This execution space's preferred device_type
+ typedef Kokkos::Device<execution_space,memory_space> device_type;
+
//! The size_type best suited for this execution space.
typedef memory_space::size_type size_type ;
//! This execution space's preferred array layout.
typedef LayoutLeft array_layout ;
- //! For backward compatibility
- typedef Cuda device_type ;
//!
typedef ScratchMemorySpace< Cuda > scratch_memory_space ;
//@}
//--------------------------------------------------
//! \name Functions that all Kokkos devices must implement.
//@{
/// \brief True if and only if this method is being called in a
/// thread-parallel function.
KOKKOS_INLINE_FUNCTION static int in_parallel() {
#if defined( __CUDA_ARCH__ )
return true;
#else
return false;
#endif
}
/** \brief Set the device in a "sleep" state.
*
* This function sets the device in a "sleep" state in which it is
* not ready for work. This may consume less resources than if the
* device were in an "awake" state, but it may also take time to
* bring the device from a sleep state to be ready for work.
*
* \return True if the device is in the "sleep" state, else false if
* the device is actively working and could not enter the "sleep"
* state.
*/
static bool sleep();
/// \brief Wake the device from the 'sleep' state so it is ready for work.
///
/// \return True if the device is in the "ready" state, else "false"
/// if the device is actively working (which also means that it's
/// awake).
static bool wake();
/// \brief Wait until all dispatched functors complete.
///
/// The parallel_for or parallel_reduce dispatch of a functor may
/// return asynchronously, before the functor completes. This
/// method does not return until all dispatched functors on this
/// device have completed.
static void fence();
//! Free any resources being consumed by the device.
static void finalize();
//! Has been initialized
static int is_initialized();
//! Print configuration information to the given output stream.
static void print_configuration( std::ostream & , const bool detail = false );
//@}
//--------------------------------------------------
//! \name Cuda space instances
~Cuda() {}
Cuda();
explicit Cuda( const int instance_id );
-#if defined( KOKKOS_HAVE_CXX11 )
- Cuda & operator = ( const Cuda & ) = delete ;
-#else
-private:
- Cuda & operator = ( const Cuda & );
-public:
-#endif
+ Cuda( const Cuda & ) = default ;
+ Cuda( Cuda && ) = default ;
+ Cuda & operator = ( const Cuda & ) = default ;
+ Cuda & operator = ( Cuda && ) = default ;
//--------------------------------------------------------------------------
//! \name Device-specific functions
//@{
struct SelectDevice {
int cuda_device_id ;
SelectDevice() : cuda_device_id(0) {}
explicit SelectDevice( int id ) : cuda_device_id( id ) {}
};
//! Initialize, telling the CUDA run-time library which device to use.
static void initialize( const SelectDevice = SelectDevice()
, const size_t num_instances = 1 );
/// \brief Cuda device architecture of the selected device.
///
/// This matches the __CUDA_ARCH__ specification.
static size_type device_arch();
//! Query device count.
static size_type detect_device_count();
/** \brief Detect the available devices and their architecture
* as defined by the __CUDA_ARCH__ specification.
*/
static std::vector<unsigned> detect_device_arch();
+ cudaStream_t cuda_stream() const { return m_stream ; }
+ int cuda_device() const { return m_device ; }
+
//@}
//--------------------------------------------------------------------------
- const cudaStream_t m_stream ;
- const int m_device ;
+private:
+
+ cudaStream_t m_stream ;
+ int m_device ;
};
} // namespace Kokkos
/*--------------------------------------------------------------------------*/
/*--------------------------------------------------------------------------*/
namespace Kokkos {
namespace Impl {
template<>
struct VerifyExecutionCanAccessMemorySpace
< Kokkos::CudaSpace
, Kokkos::Cuda::scratch_memory_space
>
{
enum { value = true };
KOKKOS_INLINE_FUNCTION static void verify( void ) { }
KOKKOS_INLINE_FUNCTION static void verify( const void * ) { }
};
template<>
struct VerifyExecutionCanAccessMemorySpace
< Kokkos::HostSpace
, Kokkos::Cuda::scratch_memory_space
>
{
enum { value = false };
inline static void verify( void ) { CudaSpace::access_error(); }
inline static void verify( const void * p ) { CudaSpace::access_error(p); }
};
} // namespace Impl
} // namespace Kokkos
/*--------------------------------------------------------------------------*/
/*--------------------------------------------------------------------------*/
#include <Cuda/Kokkos_CudaExec.hpp>
#include <Cuda/Kokkos_Cuda_View.hpp>
+
+#include <KokkosExp_View.hpp>
+#include <Cuda/KokkosExp_Cuda_View.hpp>
+
#include <Cuda/Kokkos_Cuda_Parallel.hpp>
//----------------------------------------------------------------------------
#endif /* #if defined( KOKKOS_HAVE_CUDA ) */
#endif /* #ifndef KOKKOS_CUDA_HPP */
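The Cuda class declared above is the handle that parallel dispatches are templated on: work is expressed through RangePolicy/TeamPolicy over the Cuda execution space, and data lives in its preferred memory space. A minimal usage sketch follows; it assumes a build with KOKKOS_HAVE_CUDA, and the functor names and problem size are illustrative only, not part of this header.

#include <Kokkos_Core.hpp>

// Illustrative functors for the sketch (not part of Kokkos_Cuda.hpp).
struct FillFunctor {
  Kokkos::View< double * , Kokkos::Cuda > x ;
  explicit FillFunctor( Kokkos::View< double * , Kokkos::Cuda > x_ ) : x( x_ ) {}
  KOKKOS_INLINE_FUNCTION void operator()( const int i ) const { x(i) = 1.0 * i ; }
};

struct SumFunctor {
  typedef double value_type ;
  Kokkos::View< const double * , Kokkos::Cuda > x ;
  explicit SumFunctor( Kokkos::View< const double * , Kokkos::Cuda > x_ ) : x( x_ ) {}
  KOKKOS_INLINE_FUNCTION void operator()( const int i , double & sum ) const { sum += x(i) ; }
};

int main( int argc , char * argv[] )
{
  Kokkos::initialize( argc , argv );   // initializes the Cuda execution space when enabled
  {
    const int n = 1 << 20 ;
    // Allocated in Cuda's preferred memory space (CudaSpace, or CudaUVMSpace with KOKKOS_USE_CUDA_UVM).
    Kokkos::View< double * , Kokkos::Cuda > x( "x" , n );

    Kokkos::parallel_for( Kokkos::RangePolicy< Kokkos::Cuda >( 0 , n ) , FillFunctor( x ) );

    double sum = 0 ;
    Kokkos::parallel_reduce( Kokkos::RangePolicy< Kokkos::Cuda >( 0 , n ) , SumFunctor( x ) , sum );

    Kokkos::Cuda::fence();             // dispatches may return before the kernels complete
  }
  Kokkos::finalize();
  return 0 ;
}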
diff --git a/lib/kokkos/core/src/Kokkos_CudaSpace.hpp b/lib/kokkos/core/src/Kokkos_CudaSpace.hpp
index 2c4686126..34915fd38 100755
--- a/lib/kokkos/core/src/Kokkos_CudaSpace.hpp
+++ b/lib/kokkos/core/src/Kokkos_CudaSpace.hpp
@@ -1,468 +1,656 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_CUDASPACE_HPP
#define KOKKOS_CUDASPACE_HPP
+#include <Kokkos_Core_fwd.hpp>
+
#if defined( KOKKOS_HAVE_CUDA )
#include <iosfwd>
#include <typeinfo>
#include <string>
-#include <Kokkos_Core_fwd.hpp>
#include <Kokkos_HostSpace.hpp>
+
+#include <impl/Kokkos_AllocationTracker.hpp>
+
#include <Cuda/Kokkos_Cuda_abort.hpp>
+#include <Cuda/Kokkos_Cuda_BasicAllocators.hpp>
/*--------------------------------------------------------------------------*/
namespace Kokkos {
/** \brief Cuda on-device memory management */
class CudaSpace {
public:
//! Tag this class as a kokkos memory space
typedef CudaSpace memory_space ;
typedef Kokkos::Cuda execution_space ;
+ typedef Kokkos::Device<execution_space,memory_space> device_type;
+
typedef unsigned int size_type ;
- /** \brief Allocate a contiguous block of memory on the Cuda device.
+ typedef Impl::CudaMallocAllocator allocator;
+
+ /** \brief Allocate a contiguous block of memory.
*
* The input label is associated with the block of memory.
* The block of memory is tracked via reference counting where
* allocation gives it a reference count of one.
- *
- * Allocation may only occur on the master thread of the process.
- */
- static void * allocate( const std::string & label , const size_t size );
-
- /** \brief Increment the reference count of the block of memory
- * in which the input pointer resides.
- *
- * Reference counting only occurs on the master thread.
- */
- static void increment( const void * );
-
- /** \brief Decrement the reference count of the block of memory
- * in which the input pointer resides. If the reference
- * count falls to zero the memory is deallocated.
- *
- * Reference counting only occurs on the master thread.
*/
- static void decrement( const void * );
-
- /** \brief Get the reference count of the block of memory
- * in which the input pointer resides. If the reference
- * count is zero the memory region is not tracked.
- *
- * Reference counting only occurs on the master thread.
- */
- static int count( const void * );
-
- /** \brief Print all tracked memory to the output stream. */
- static void print_memory_view( std::ostream & );
-
- /** \brief Retrieve label associated with the input pointer */
- static std::string query_label( const void * );
+ static Impl::AllocationTracker allocate_and_track( const std::string & label, const size_t size );
/*--------------------------------*/
 /** \brief Cuda specific function to attach a texture object to an allocation.
* Output the texture object, base pointer, and offset from the input pointer.
*/
#if defined( __CUDACC__ )
- static void texture_object_attach( const void * const arg_ptr
- , const unsigned arg_type_size
- , ::cudaChannelFormatDesc const & arg_desc
- , ::cudaTextureObject_t * const arg_tex_obj
- , void const ** const arg_alloc_ptr
- , int * const arg_offset
+ static void texture_object_attach( Impl::AllocationTracker const & tracker
+ , unsigned type_size
+ , ::cudaChannelFormatDesc const & desc
);
#endif
+ /*--------------------------------*/
+
+ CudaSpace();
+ CudaSpace( const CudaSpace & rhs ) = default ;
+ CudaSpace & operator = ( const CudaSpace & rhs ) = default ;
+ ~CudaSpace() = default ;
+
+ /**\brief Allocate memory in the cuda space */
+ void * allocate( const size_t arg_alloc_size ) const ;
+
+ /**\brief Deallocate memory in the cuda space */
+ void deallocate( void * const arg_alloc_ptr
+ , const size_t arg_alloc_size ) const ;
+
/*--------------------------------*/
/** \brief Error reporting for HostSpace attempt to access CudaSpace */
static void access_error();
static void access_error( const void * const );
+
+private:
+
+ int m_device ; ///< Which Cuda device
+
+ // friend class Kokkos::Experimental::Impl::SharedAllocationRecord< Kokkos::CudaSpace , void > ;
};
+namespace Impl {
+/// \brief Initialize lock array for arbitrary size atomics.
+///
+/// Arbitrary atomics are implemented using a hash table of locks
+/// where the hash value is derived from the address of the
+/// object for which an atomic operation is performed.
+/// This function initializes the locks to zero (unset).
+void init_lock_array_cuda_space();
+
+/// \brief Retrieve the pointer to the lock array for arbitrary size atomics.
+///
+/// Arbitrary atomics are implemented using a hash table of locks
+/// where the hash value is derived from the address of the
+/// object for which an atomic operation is performed.
+/// This function retrieves the lock array pointer.
+/// If the array is not yet allocated it will do so.
+int* lock_array_cuda_space_ptr(bool deallocate = false);
+}
} // namespace Kokkos
/*--------------------------------------------------------------------------*/
/*--------------------------------------------------------------------------*/
namespace Kokkos {
/** \brief Cuda memory that is accessible to Host execution space
* through Cuda's unified virtual memory (UVM) runtime.
*/
class CudaUVMSpace {
public:
//! Tag this class as a kokkos memory space
typedef CudaUVMSpace memory_space ;
typedef Cuda execution_space ;
+ typedef Kokkos::Device<execution_space,memory_space> device_type;
typedef unsigned int size_type ;
/** \brief If UVM capability is available */
static bool available();
- /** \brief Allocate a contiguous block of memory on the Cuda device.
+ typedef Impl::CudaUVMAllocator allocator;
+
+ /** \brief Allocate a contiguous block of memory.
*
* The input label is associated with the block of memory.
* The block of memory is tracked via reference counting where
* allocation gives it a reference count of one.
- *
- * Allocation may only occur on the master thread of the process.
*/
- static void * allocate( const std::string & label , const size_t size );
+ static Impl::AllocationTracker allocate_and_track( const std::string & label, const size_t size );
- /** \brief Increment the reference count of the block of memory
- * in which the input pointer resides.
- *
- * Reference counting only occurs on the master thread.
- */
- static void increment( const void * );
-
- /** \brief Decrement the reference count of the block of memory
- * in which the input pointer resides. If the reference
- * count falls to zero the memory is deallocated.
- *
- * Reference counting only occurs on the master thread.
- */
- static void decrement( const void * );
-
- /** \brief Get the reference count of the block of memory
- * in which the input pointer resides. If the reference
- * count is zero the memory region is not tracked.
- *
- * Reference counting only occurs on the master thread.
- */
- static int count( const void * );
-
- /** \brief Print all tracked memory to the output stream. */
- static void print_memory_view( std::ostream & );
-
- /** \brief Retrieve label associated with the input pointer */
- static std::string query_label( const void * );
 /** \brief Cuda specific function to attach a texture object to an allocation.
* Output the texture object, base pointer, and offset from the input pointer.
*/
#if defined( __CUDACC__ )
- static void texture_object_attach( const void * const arg_ptr
- , const unsigned arg_type_size
- , ::cudaChannelFormatDesc const & arg_desc
- , ::cudaTextureObject_t * const arg_tex_obj
- , void const ** const arg_alloc_ptr
- , int * const arg_offset
+ static void texture_object_attach( Impl::AllocationTracker const & tracker
+ , unsigned type_size
+ , ::cudaChannelFormatDesc const & desc
);
#endif
+ /*--------------------------------*/
+
+ CudaUVMSpace();
+ CudaUVMSpace( const CudaUVMSpace & rhs ) = default ;
+ CudaUVMSpace & operator = ( const CudaUVMSpace & rhs ) = default ;
+ ~CudaUVMSpace() = default ;
+
+ /**\brief Allocate memory in the cuda space */
+ void * allocate( const size_t arg_alloc_size ) const ;
+
+ /**\brief Deallocate memory in the cuda space */
+ void deallocate( void * const arg_alloc_ptr
+ , const size_t arg_alloc_size ) const ;
+
+ /*--------------------------------*/
+
+private:
+
+ int m_device ; ///< Which Cuda device
};
} // namespace Kokkos
/*--------------------------------------------------------------------------*/
/*--------------------------------------------------------------------------*/
namespace Kokkos {
/** \brief Host memory that is accessible to Cuda execution space
* through Cuda's host-pinned memory allocation.
*/
class CudaHostPinnedSpace {
public:
//! Tag this class as a kokkos memory space
+ /** \brief Memory is in HostSpace so use the HostSpace::execution_space */
+ typedef HostSpace::execution_space execution_space ;
typedef CudaHostPinnedSpace memory_space ;
+ typedef Kokkos::Device<execution_space,memory_space> device_type;
typedef unsigned int size_type ;
- /** \brief Memory is in HostSpace so use the HostSpace::execution_space */
- typedef HostSpace::execution_space execution_space ;
- /** \brief Allocate a contiguous block of memory on the Cuda device.
+ typedef Impl::CudaHostAllocator allocator ;
+
+ /** \brief Allocate a contiguous block of memory.
*
* The input label is associated with the block of memory.
* The block of memory is tracked via reference counting where
* allocation gives it a reference count of one.
- *
- * Allocation may only occur on the master thread of the process.
*/
- static void * allocate( const std::string & label , const size_t size );
+ static Impl::AllocationTracker allocate_and_track( const std::string & label, const size_t size );
- /** \brief Increment the reference count of the block of memory
- * in which the input pointer resides.
- *
- * Reference counting only occurs on the master thread.
- */
- static void increment( const void * );
+ /*--------------------------------*/
- /** \brief Get the reference count of the block of memory
- * in which the input pointer resides. If the reference
- * count is zero the memory region is not tracked.
- *
- * Reference counting only occurs on the master thread.
- */
- static int count( const void * );
+ CudaHostPinnedSpace();
+ CudaHostPinnedSpace( const CudaHostPinnedSpace & rhs ) = default ;
+ CudaHostPinnedSpace & operator = ( const CudaHostPinnedSpace & rhs ) = default ;
+ ~CudaHostPinnedSpace() = default ;
- /** \brief Decrement the reference count of the block of memory
- * in which the input pointer resides. If the reference
- * count falls to zero the memory is deallocated.
- *
- * Reference counting only occurs on the master thread.
- */
- static void decrement( const void * );
+ /**\brief Allocate memory in the cuda space */
+ void * allocate( const size_t arg_alloc_size ) const ;
- /** \brief Print all tracked memory to the output stream. */
- static void print_memory_view( std::ostream & );
+ /**\brief Deallocate memory in the cuda space */
+ void deallocate( void * const arg_alloc_ptr
+ , const size_t arg_alloc_size ) const ;
- /** \brief Retrieve label associated with the input pointer */
- static std::string query_label( const void * );
+ /*--------------------------------*/
};
} // namespace Kokkos
/*--------------------------------------------------------------------------*/
/*--------------------------------------------------------------------------*/
namespace Kokkos {
namespace Impl {
template<> struct DeepCopy< CudaSpace , CudaSpace >
{
DeepCopy( void * dst , const void * src , size_t );
DeepCopy( const Cuda & , void * dst , const void * src , size_t );
};
template<> struct DeepCopy< CudaSpace , HostSpace >
{
DeepCopy( void * dst , const void * src , size_t );
DeepCopy( const Cuda & , void * dst , const void * src , size_t );
};
template<> struct DeepCopy< HostSpace , CudaSpace >
{
DeepCopy( void * dst , const void * src , size_t );
DeepCopy( const Cuda & , void * dst , const void * src , size_t );
};
template<> struct DeepCopy< CudaSpace , CudaUVMSpace >
{
inline
DeepCopy( void * dst , const void * src , size_t n )
{ (void) DeepCopy< CudaSpace , CudaSpace >( dst , src , n ); }
};
template<> struct DeepCopy< CudaSpace , CudaHostPinnedSpace >
{
inline
DeepCopy( void * dst , const void * src , size_t n )
{ (void) DeepCopy< CudaSpace , HostSpace >( dst , src , n ); }
};
template<> struct DeepCopy< CudaUVMSpace , CudaSpace >
{
inline
DeepCopy( void * dst , const void * src , size_t n )
{ (void) DeepCopy< CudaSpace , CudaSpace >( dst , src , n ); }
};
template<> struct DeepCopy< CudaUVMSpace , CudaUVMSpace >
{
inline
DeepCopy( void * dst , const void * src , size_t n )
{ (void) DeepCopy< CudaSpace , CudaSpace >( dst , src , n ); }
};
template<> struct DeepCopy< CudaUVMSpace , CudaHostPinnedSpace >
{
inline
DeepCopy( void * dst , const void * src , size_t n )
{ (void) DeepCopy< CudaSpace , HostSpace >( dst , src , n ); }
};
template<> struct DeepCopy< CudaUVMSpace , HostSpace >
{
inline
DeepCopy( void * dst , const void * src , size_t n )
{ (void) DeepCopy< CudaSpace , HostSpace >( dst , src , n ); }
};
template<> struct DeepCopy< CudaHostPinnedSpace , CudaSpace >
{
inline
DeepCopy( void * dst , const void * src , size_t n )
{ (void) DeepCopy< HostSpace , CudaSpace >( dst , src , n ); }
};
template<> struct DeepCopy< CudaHostPinnedSpace , CudaUVMSpace >
{
inline
DeepCopy( void * dst , const void * src , size_t n )
{ (void) DeepCopy< HostSpace , CudaSpace >( dst , src , n ); }
};
template<> struct DeepCopy< CudaHostPinnedSpace , CudaHostPinnedSpace >
{
inline
DeepCopy( void * dst , const void * src , size_t n )
{ (void) DeepCopy< HostSpace , HostSpace >( dst , src , n ); }
};
template<> struct DeepCopy< CudaHostPinnedSpace , HostSpace >
{
inline
DeepCopy( void * dst , const void * src , size_t n )
{ (void) DeepCopy< HostSpace , HostSpace >( dst , src , n ); }
};
template<> struct DeepCopy< HostSpace , CudaUVMSpace >
{
inline
DeepCopy( void * dst , const void * src , size_t n )
{ (void) DeepCopy< HostSpace , CudaSpace >( dst , src , n ); }
};
template<> struct DeepCopy< HostSpace , CudaHostPinnedSpace >
{
inline
DeepCopy( void * dst , const void * src , size_t n )
{ (void) DeepCopy< HostSpace , HostSpace >( dst , src , n ); }
};
} // namespace Impl
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
/** Running in CudaSpace attempting to access HostSpace: error */
template<>
struct VerifyExecutionCanAccessMemorySpace< Kokkos::CudaSpace , Kokkos::HostSpace >
{
enum { value = false };
KOKKOS_INLINE_FUNCTION static void verify( void )
{ Kokkos::abort("Cuda code attempted to access HostSpace memory"); }
KOKKOS_INLINE_FUNCTION static void verify( const void * )
{ Kokkos::abort("Cuda code attempted to access HostSpace memory"); }
};
/** Running in CudaSpace accessing CudaUVMSpace: ok */
template<>
struct VerifyExecutionCanAccessMemorySpace< Kokkos::CudaSpace , Kokkos::CudaUVMSpace >
{
enum { value = true };
KOKKOS_INLINE_FUNCTION static void verify( void ) { }
KOKKOS_INLINE_FUNCTION static void verify( const void * ) { }
};
/** Running in CudaSpace accessing CudaHostPinnedSpace: ok */
template<>
struct VerifyExecutionCanAccessMemorySpace< Kokkos::CudaSpace , Kokkos::CudaHostPinnedSpace >
{
enum { value = true };
KOKKOS_INLINE_FUNCTION static void verify( void ) { }
KOKKOS_INLINE_FUNCTION static void verify( const void * ) { }
};
/** Running in CudaSpace attempting to access an unknown space: error */
template< class OtherSpace >
struct VerifyExecutionCanAccessMemorySpace<
typename enable_if< ! is_same<Kokkos::CudaSpace,OtherSpace>::value , Kokkos::CudaSpace >::type ,
OtherSpace >
{
enum { value = false };
KOKKOS_INLINE_FUNCTION static void verify( void )
{ Kokkos::abort("Cuda code attempted to access unknown Space memory"); }
KOKKOS_INLINE_FUNCTION static void verify( const void * )
{ Kokkos::abort("Cuda code attempted to access unknown Space memory"); }
};
//----------------------------------------------------------------------------
/** Running in HostSpace attempting to access CudaSpace */
template<>
struct VerifyExecutionCanAccessMemorySpace< Kokkos::HostSpace , Kokkos::CudaSpace >
{
enum { value = false };
inline static void verify( void ) { CudaSpace::access_error(); }
inline static void verify( const void * p ) { CudaSpace::access_error(p); }
};
/** Running in HostSpace accessing CudaUVMSpace is OK */
template<>
struct VerifyExecutionCanAccessMemorySpace< Kokkos::HostSpace , Kokkos::CudaUVMSpace >
{
enum { value = true };
inline static void verify( void ) { }
inline static void verify( const void * ) { }
};
/** Running in HostSpace accessing CudaHostPinnedSpace is OK */
template<>
struct VerifyExecutionCanAccessMemorySpace< Kokkos::HostSpace , Kokkos::CudaHostPinnedSpace >
{
enum { value = true };
KOKKOS_INLINE_FUNCTION static void verify( void ) {}
KOKKOS_INLINE_FUNCTION static void verify( const void * ) {}
};
+} // namespace Impl
+} // namespace Kokkos
+
+//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
+namespace Kokkos {
+namespace Experimental {
+namespace Impl {
+
+template<>
+class SharedAllocationRecord< Kokkos::CudaSpace , void >
+ : public SharedAllocationRecord< void , void >
+{
+private:
+
+ friend class SharedAllocationRecord< Kokkos::CudaUVMSpace , void > ;
+
+ typedef SharedAllocationRecord< void , void > RecordBase ;
+
+ SharedAllocationRecord( const SharedAllocationRecord & ) = delete ;
+ SharedAllocationRecord & operator = ( const SharedAllocationRecord & ) = delete ;
+
+ static void deallocate( RecordBase * );
+
+ static ::cudaTextureObject_t
+ attach_texture_object( const unsigned sizeof_alias
+ , void * const alloc_ptr
+ , const size_t alloc_size );
+
+ static RecordBase s_root_record ;
+
+ ::cudaTextureObject_t m_tex_obj ;
+ const Kokkos::CudaSpace m_space ;
+
+protected:
+
+ ~SharedAllocationRecord();
+ SharedAllocationRecord() : RecordBase(), m_tex_obj(0), m_space() {}
+
+ SharedAllocationRecord( const Kokkos::CudaSpace & arg_space
+ , const std::string & arg_label
+ , const size_t arg_alloc_size
+ , const RecordBase::function_type arg_dealloc = & deallocate
+ );
+
+public:
+
+ std::string get_label() const ;
+
+ static SharedAllocationRecord * allocate( const Kokkos::CudaSpace & arg_space
+ , const std::string & arg_label
+ , const size_t arg_alloc_size
+ );
+
+ template< typename AliasType >
+ inline
+ ::cudaTextureObject_t attach_texture_object()
+ {
+ static_assert( ( std::is_same< AliasType , int >::value ||
+ std::is_same< AliasType , ::int2 >::value ||
+ std::is_same< AliasType , ::int4 >::value )
+ , "Cuda texture fetch only supported for alias types of int, ::int2, or ::int4" );
+
+ if ( m_tex_obj == 0 ) {
+ m_tex_obj = attach_texture_object( sizeof(AliasType)
+ , (void*) RecordBase::m_alloc_ptr
+ , RecordBase::m_alloc_size );
+ }
+
+ return m_tex_obj ;
+ }
+
+ template< typename AliasType >
+ inline
+ int attach_texture_object_offset( const AliasType * const ptr )
+ {
+ // Texture object is attached to the entire allocation range
+ return ptr - reinterpret_cast<AliasType*>( RecordBase::m_alloc_ptr );
+ }
+
+ static SharedAllocationRecord * get_record( void * arg_alloc_ptr );
+
+ static void print_records( std::ostream & , const Kokkos::CudaSpace & , bool detail = false );
+};
+
+
+template<>
+class SharedAllocationRecord< Kokkos::CudaUVMSpace , void >
+ : public SharedAllocationRecord< void , void >
+{
+private:
+
+ typedef SharedAllocationRecord< void , void > RecordBase ;
+
+ SharedAllocationRecord( const SharedAllocationRecord & ) = delete ;
+ SharedAllocationRecord & operator = ( const SharedAllocationRecord & ) = delete ;
+
+ static void deallocate( RecordBase * );
+
+ static RecordBase s_root_record ;
+
+ ::cudaTextureObject_t m_tex_obj ;
+ const Kokkos::CudaUVMSpace m_space ;
+
+protected:
+
+ ~SharedAllocationRecord();
+ SharedAllocationRecord() : RecordBase(), m_tex_obj(0), m_space() {}
+
+ SharedAllocationRecord( const Kokkos::CudaUVMSpace & arg_space
+ , const std::string & arg_label
+ , const size_t arg_alloc_size
+ , const RecordBase::function_type arg_dealloc = & deallocate
+ );
+
+public:
+
+ std::string get_label() const ;
+
+ static SharedAllocationRecord * allocate( const Kokkos::CudaUVMSpace & arg_space
+ , const std::string & arg_label
+ , const size_t arg_alloc_size
+ );
+
+ template< typename AliasType >
+ inline
+ ::cudaTextureObject_t attach_texture_object()
+ {
+ static_assert( ( std::is_same< AliasType , int >::value ||
+ std::is_same< AliasType , ::int2 >::value ||
+ std::is_same< AliasType , ::int4 >::value )
+ , "Cuda texture fetch only supported for alias types of int, ::int2, or ::int4" );
+
+ if ( m_tex_obj == 0 ) {
+ m_tex_obj = SharedAllocationRecord< Kokkos::CudaSpace , void >::
+ attach_texture_object( sizeof(AliasType)
+ , (void*) RecordBase::m_alloc_ptr
+ , RecordBase::m_alloc_size );
+ }
+
+ return m_tex_obj ;
+ }
+
+ template< typename AliasType >
+ inline
+ int attach_texture_object_offset( const AliasType * const ptr )
+ {
+ // Texture object is attached to the entire allocation range
+ return ptr - reinterpret_cast<AliasType*>( RecordBase::m_alloc_ptr );
+ }
+
+ static SharedAllocationRecord * get_record( void * arg_alloc_ptr );
+
+ static void print_records( std::ostream & , const Kokkos::CudaUVMSpace & , bool detail = false );
+};
+
+template<>
+class SharedAllocationRecord< Kokkos::CudaHostPinnedSpace , void >
+ : public SharedAllocationRecord< void , void >
+{
+private:
+
+ typedef SharedAllocationRecord< void , void > RecordBase ;
+
+ SharedAllocationRecord( const SharedAllocationRecord & ) = delete ;
+ SharedAllocationRecord & operator = ( const SharedAllocationRecord & ) = delete ;
+
+ static void deallocate( RecordBase * );
+
+ static RecordBase s_root_record ;
+
+ const Kokkos::CudaHostPinnedSpace m_space ;
+
+protected:
+
+ ~SharedAllocationRecord();
+ SharedAllocationRecord() : RecordBase(), m_space() {}
+
+ SharedAllocationRecord( const Kokkos::CudaHostPinnedSpace & arg_space
+ , const std::string & arg_label
+ , const size_t arg_alloc_size
+ , const RecordBase::function_type arg_dealloc = & deallocate
+ );
+
+public:
+
+ std::string get_label() const ;
+
+ static SharedAllocationRecord * allocate( const Kokkos::CudaHostPinnedSpace & arg_space
+ , const std::string & arg_label
+ , const size_t arg_alloc_size
+ );
+
+ static SharedAllocationRecord * get_record( void * arg_alloc_ptr );
+
+ static void print_records( std::ostream & , const Kokkos::CudaHostPinnedSpace & , bool detail = false );
+};
+
} // namespace Impl
+} // namespace Experimental
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
#endif /* #if defined( KOKKOS_HAVE_CUDA ) */
#endif /* #define KOKKOS_CUDASPACE_HPP */
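The memory spaces and DeepCopy specializations above are what Kokkos::deep_copy resolves to when data moves between host and device. A short sketch of the usual host-mirror pattern (view names and sizes are illustrative):

#include <Kokkos_Core.hpp>

// Illustrative host <-> device transfer that lands in the
// DeepCopy<CudaSpace,HostSpace> / DeepCopy<HostSpace,CudaSpace> specializations above.
void mirror_copy_example()
{
  const int n = 1000 ;

  // Device data in CudaSpace.
  Kokkos::View< double * , Kokkos::CudaSpace > d_a( "d_a" , n );

  // Host mirror with matching layout, allocated in HostSpace.
  Kokkos::View< double * , Kokkos::CudaSpace >::HostMirror h_a =
    Kokkos::create_mirror_view( d_a );

  for ( int i = 0 ; i < n ; ++i ) h_a(i) = double(i);

  Kokkos::deep_copy( d_a , h_a );   // HostSpace -> CudaSpace
  // ... run device kernels on d_a ...
  Kokkos::deep_copy( h_a , d_a );   // CudaSpace -> HostSpace
}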
diff --git a/lib/kokkos/core/src/Kokkos_ExecPolicy.hpp b/lib/kokkos/core/src/Kokkos_ExecPolicy.hpp
index 209ea4a50..807cb5cb4 100755
--- a/lib/kokkos/core/src/Kokkos_ExecPolicy.hpp
+++ b/lib/kokkos/core/src/Kokkos_ExecPolicy.hpp
@@ -1,439 +1,497 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_EXECPOLICY_HPP
#define KOKKOS_EXECPOLICY_HPP
#include <Kokkos_Core_fwd.hpp>
#include <impl/Kokkos_Traits.hpp>
#include <impl/Kokkos_StaticAssert.hpp>
#include <impl/Kokkos_Tags.hpp>
//----------------------------------------------------------------------------
namespace Kokkos {
/** \brief Execution policy for work over a range of an integral type.
*
* Valid template argument options:
*
* With a specified execution space:
* < ExecSpace , WorkTag , { IntConst | IntType } >
* < ExecSpace , WorkTag , void >
* < ExecSpace , { IntConst | IntType } , void >
* < ExecSpace , void , void >
*
* With the default execution space:
* < WorkTag , { IntConst | IntType } , void >
* < WorkTag , void , void >
* < { IntConst | IntType } , void , void >
* < void , void , void >
*
* IntType is a fundamental integral type
* IntConst is an Impl::integral_constant< IntType , Blocking >
*
* Blocking is the granularity of partitioning the range among threads.
*/
template< class Arg0 = void , class Arg1 = void , class Arg2 = void
, class ExecSpace =
// The first argument is the execution space,
// otherwise use the default execution space.
typename Impl::if_c< Impl::is_execution_space< Arg0 >::value , Arg0
, Kokkos::DefaultExecutionSpace >::type
>
class RangePolicy {
private:
// Default integral type and blocking factor:
typedef int DefaultIntType ;
enum { DefaultIntValue = 8 };
enum { Arg0_Void = Impl::is_same< Arg0 , void >::value };
enum { Arg1_Void = Impl::is_same< Arg1 , void >::value };
enum { Arg2_Void = Impl::is_same< Arg2 , void >::value };
enum { Arg0_ExecSpace = Impl::is_execution_space< Arg0 >::value };
enum { Arg0_IntConst = Impl::is_integral_constant< Arg0 >::value };
enum { Arg1_IntConst = Impl::is_integral_constant< Arg1 >::value };
enum { Arg2_IntConst = Impl::is_integral_constant< Arg2 >::value };
enum { Arg0_IntType = Impl::is_integral< Arg0 >::value };
enum { Arg1_IntType = Impl::is_integral< Arg1 >::value };
enum { Arg2_IntType = Impl::is_integral< Arg2 >::value };
enum { Arg0_WorkTag = ! Arg0_ExecSpace && ! Arg0_IntConst && ! Arg0_IntType && ! Arg0_Void };
enum { Arg1_WorkTag = Arg0_ExecSpace && ! Arg1_IntConst && ! Arg1_IntType && ! Arg1_Void };
enum { ArgOption_OK = Impl::StaticAssert< (
( Arg0_ExecSpace && Arg1_WorkTag && ( Arg2_IntConst || Arg2_IntType ) ) ||
( Arg0_ExecSpace && Arg1_WorkTag && Arg2_Void ) ||
( Arg0_ExecSpace && ( Arg1_IntConst || Arg1_IntType ) && Arg2_Void ) ||
( Arg0_ExecSpace && Arg1_Void && Arg2_Void ) ||
( Arg0_WorkTag && ( Arg1_IntConst || Arg1_IntType ) && Arg2_Void ) ||
( Arg0_WorkTag && Arg1_Void && Arg2_Void ) ||
( ( Arg0_IntConst || Arg0_IntType ) && Arg1_Void && Arg2_Void ) ||
( Arg0_Void && Arg1_Void && Arg2_Void )
) >::value };
// The work argument tag is the first or second argument
typedef typename Impl::if_c< Arg0_WorkTag , Arg0 ,
typename Impl::if_c< Arg1_WorkTag , Arg1 , void
>::type >::type
WorkTag ;
enum { Granularity = Arg0_IntConst ? unsigned(Impl::is_integral_constant<Arg0>::integral_value) : (
Arg1_IntConst ? unsigned(Impl::is_integral_constant<Arg1>::integral_value) : (
Arg2_IntConst ? unsigned(Impl::is_integral_constant<Arg2>::integral_value) : (
unsigned(DefaultIntValue) ))) };
// Only accept the integral type if the blocking is a power of two
typedef typename Impl::enable_if< Impl::is_power_of_two< Granularity >::value ,
typename Impl::if_c< Arg0_IntType , Arg0 ,
typename Impl::if_c< Arg1_IntType , Arg1 ,
typename Impl::if_c< Arg2_IntType , Arg2 ,
typename Impl::if_c< Arg0_IntConst , typename Impl::is_integral_constant<Arg0>::integral_type ,
typename Impl::if_c< Arg1_IntConst , typename Impl::is_integral_constant<Arg1>::integral_type ,
typename Impl::if_c< Arg2_IntConst , typename Impl::is_integral_constant<Arg2>::integral_type ,
DefaultIntType
>::type >::type >::type
>::type >::type >::type
>::type
IntType ;
enum { GranularityMask = IntType(Granularity) - 1 };
ExecSpace m_space ;
IntType m_begin ;
IntType m_end ;
public:
//! Tag this class as an execution policy
typedef ExecSpace execution_space ;
typedef RangePolicy execution_policy ;
typedef WorkTag work_tag ;
typedef IntType member_type ;
KOKKOS_INLINE_FUNCTION const execution_space & space() const { return m_space ; }
KOKKOS_INLINE_FUNCTION member_type begin() const { return m_begin ; }
KOKKOS_INLINE_FUNCTION member_type end() const { return m_end ; }
inline RangePolicy() : m_space(), m_begin(0), m_end(0) {}
/** \brief Total range */
inline
RangePolicy( const member_type work_begin
, const member_type work_end
)
: m_space()
, m_begin( work_begin < work_end ? work_begin : 0 )
, m_end( work_begin < work_end ? work_end : 0 )
{}
/** \brief Total range */
inline
RangePolicy( const execution_space & work_space
, const member_type work_begin
, const member_type work_end
)
: m_space( work_space )
, m_begin( work_begin < work_end ? work_begin : 0 )
, m_end( work_begin < work_end ? work_end : 0 )
{}
/** \brief Subrange for a partition's rank and size.
*
* Typically used to partition a range over a group of threads.
*/
struct WorkRange {
typedef RangePolicy::work_tag work_tag ;
typedef RangePolicy::member_type member_type ;
KOKKOS_INLINE_FUNCTION member_type begin() const { return m_begin ; }
KOKKOS_INLINE_FUNCTION member_type end() const { return m_end ; }
/** \brief Subrange for a partition's rank and size.
*
* Typically used to partition a range over a group of threads.
*/
KOKKOS_INLINE_FUNCTION
WorkRange( const RangePolicy & range
, const int part_rank
, const int part_size
)
: m_begin(0), m_end(0)
{
if ( part_size ) {
// Split evenly among partitions, then round up to the granularity.
const member_type work_part =
( ( ( ( range.end() - range.begin() ) + ( part_size - 1 ) ) / part_size )
+ GranularityMask ) & ~member_type(GranularityMask);
m_begin = range.begin() + work_part * part_rank ;
m_end = m_begin + work_part ;
if ( range.end() < m_begin ) m_begin = range.end() ;
if ( range.end() < m_end ) m_end = range.end() ;
}
}
private:
member_type m_begin ;
member_type m_end ;
WorkRange();
WorkRange & operator = ( const WorkRange & );
};
};
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
/** \brief Execution policy for parallel work over a league of teams of threads.
*
* The work functor is called for each thread of each team such that
* the team's member threads are guaranteed to be concurrent.
*
* The team's threads have access to team shared scratch memory and
* team collective operations.
*
* If the WorkTag is non-void then the first calling argument of the
* work functor's parentheses operator is 'const WorkTag &'.
* This allows a functor to have multiple work member functions.
*
* template argument option with specified execution space:
* < ExecSpace , WorkTag >
* < ExecSpace , void >
*
* template argument option with default execution space:
* < WorkTag , void >
* < void , void >
*/
template< class Arg0 = void
, class Arg1 = void
, class ExecSpace =
 // If the first argument is not an execution space,
// then use the default execution space.
typename Impl::if_c< Impl::is_execution_space< Arg0 >::value , Arg0
, Kokkos::DefaultExecutionSpace >::type
>
class TeamPolicy {
private:
enum { Arg0_ExecSpace = Impl::is_execution_space< Arg0 >::value };
enum { Arg1_Void = Impl::is_same< Arg1 , void >::value };
enum { ArgOption_OK = Impl::StaticAssert< ( Arg0_ExecSpace || Arg1_Void ) >::value };
typedef typename Impl::if_c< Arg0_ExecSpace , Arg1 , Arg0 >::type WorkTag ;
public:
//! Tag this class as an execution policy
typedef TeamPolicy execution_policy ;
typedef ExecSpace execution_space ;
typedef WorkTag work_tag ;
//----------------------------------------
/** \brief Query maximum team size for a given functor.
*
* This size takes into account execution space concurrency limitations and
* scratch memory space limitations for reductions, team reduce/scan, and
* team shared memory.
*/
template< class FunctorType >
static int team_size_max( const FunctorType & );
/** \brief Query recommended team size for a given functor.
*
* This size takes into account execution space concurrency limitations and
* scratch memory space limitations for reductions, team reduce/scan, and
* team shared memory.
*/
template< class FunctorType >
static int team_size_recommended( const FunctorType & );
+ template< class FunctorType >
+ static int team_size_recommended( const FunctorType & , const int&);
//----------------------------------------
/** \brief Construct policy with the given instance of the execution space */
TeamPolicy( const execution_space & , int league_size_request , int team_size_request );
/** \brief Construct policy with the default instance of the execution space */
TeamPolicy( int league_size_request , int team_size_request );
/** \brief The actual league size (number of teams) of the policy.
*
* This may be smaller than the requested league size due to limitations
* of the execution space.
*/
KOKKOS_INLINE_FUNCTION int league_size() const ;
/** \brief The actual team size (number of threads per team) of the policy.
*
* This may be smaller than the requested team size due to limitations
* of the execution space.
*/
KOKKOS_INLINE_FUNCTION int team_size() const ;
/** \brief Parallel execution of a functor calls the functor once with
* each member of the execution policy.
*/
struct member_type {
/** \brief Handle to the currently executing team shared scratch memory */
KOKKOS_INLINE_FUNCTION
typename execution_space::scratch_memory_space team_shmem() const ;
/** \brief Rank of this team within the league of teams */
KOKKOS_INLINE_FUNCTION int league_rank() const ;
/** \brief Number of teams in the league */
KOKKOS_INLINE_FUNCTION int league_size() const ;
/** \brief Rank of this thread within this team */
KOKKOS_INLINE_FUNCTION int team_rank() const ;
/** \brief Number of threads in this team */
KOKKOS_INLINE_FUNCTION int team_size() const ;
/** \brief Barrier among the threads of this team */
KOKKOS_INLINE_FUNCTION void team_barrier() const ;
/** \brief Intra-team reduction. Returns join of all values of the team members. */
template< class JoinOp >
KOKKOS_INLINE_FUNCTION
typename JoinOp::value_type team_reduce( const typename JoinOp::value_type
, const JoinOp & ) const ;
/** \brief Intra-team exclusive prefix sum with team_rank() ordering.
*
* The highest rank thread can compute the reduction total as
* reduction_total = dev.team_scan( value ) + value ;
*/
template< typename Type >
KOKKOS_INLINE_FUNCTION Type team_scan( const Type & value ) const ;
/** \brief Intra-team exclusive prefix sum with team_rank() ordering
* with intra-team non-deterministic ordering accumulation.
*
* The global inter-team accumulation value will, at the end of the
* league's parallel execution, be the scan's total.
* Parallel execution ordering of the league's teams is non-deterministic.
* As such the base value for each team's scan operation is similarly
* non-deterministic.
*/
template< typename Type >
KOKKOS_INLINE_FUNCTION Type team_scan( const Type & value , Type * const global_accum ) const ;
};
};
} // namespace Kokkos
namespace Kokkos {
namespace Impl {
- template<typename iType, class TeamMemberType>
- struct TeamThreadLoopBoundariesStruct {
- typedef iType index_type;
- const iType start;
- const iType end;
- enum {increment = 1};
- const TeamMemberType& thread;
- KOKKOS_INLINE_FUNCTION
- TeamThreadLoopBoundariesStruct (const TeamMemberType& thread_, const iType& count):
- start( ( (count + thread_.team_size()-1) / thread_.team_size() ) * thread_.team_rank() ),
- end( ( (count + thread_.team_size()-1) / thread_.team_size() ) * ( thread_.team_rank() + 1 ) <= count?
- ( (count + thread_.team_size()-1) / thread_.team_size() ) * ( thread_.team_rank() + 1 ):count),
- thread(thread_)
+template<typename iType, class TeamMemberType>
+struct TeamThreadRangeBoundariesStruct {
+private:
+
+ KOKKOS_INLINE_FUNCTION static
+ iType ibegin( const iType & arg_begin
+ , const iType & arg_end
+ , const iType & arg_rank
+ , const iType & arg_size
+ )
+ {
+ return arg_begin + ( ( arg_end - arg_begin + arg_size - 1 ) / arg_size ) * arg_rank ;
+ }
+
+ KOKKOS_INLINE_FUNCTION static
+ iType iend( const iType & arg_begin
+ , const iType & arg_end
+ , const iType & arg_rank
+ , const iType & arg_size
+ )
+ {
+ const iType end_ = arg_begin + ( ( arg_end - arg_begin + arg_size - 1 ) / arg_size ) * ( arg_rank + 1 );
+ return end_ < arg_end ? end_ : arg_end ;
+ }
+
+public:
+
+ typedef iType index_type;
+ const iType start;
+ const iType end;
+ enum {increment = 1};
+ const TeamMemberType& thread;
+
+ KOKKOS_INLINE_FUNCTION
+ TeamThreadRangeBoundariesStruct( const TeamMemberType& arg_thread
+ , const iType& arg_end
+ )
+ : start( ibegin( 0 , arg_end , arg_thread.team_rank() , arg_thread.team_size() ) )
+ , end( iend( 0 , arg_end , arg_thread.team_rank() , arg_thread.team_size() ) )
+ , thread( arg_thread )
{}
- };
+
+ KOKKOS_INLINE_FUNCTION
+ TeamThreadRangeBoundariesStruct( const TeamMemberType& arg_thread
+ , const iType& arg_begin
+ , const iType& arg_end
+ )
+ : start( ibegin( arg_begin , arg_end , arg_thread.team_rank() , arg_thread.team_size() ) )
+ , end( iend( arg_begin , arg_end , arg_thread.team_rank() , arg_thread.team_size() ) )
+ , thread( arg_thread )
+ {}
+};
template<typename iType, class TeamMemberType>
- struct ThreadVectorLoopBoundariesStruct {
+ struct ThreadVectorRangeBoundariesStruct {
typedef iType index_type;
enum {start = 0};
const iType end;
enum {increment = 1};
KOKKOS_INLINE_FUNCTION
- ThreadVectorLoopBoundariesStruct (const TeamMemberType& thread, const iType& count):
+ ThreadVectorRangeBoundariesStruct (const TeamMemberType& thread, const iType& count):
end( count )
{}
};
template<class TeamMemberType>
struct ThreadSingleStruct {
const TeamMemberType& team_member;
KOKKOS_INLINE_FUNCTION
ThreadSingleStruct(const TeamMemberType& team_member_):team_member(team_member_){}
};
template<class TeamMemberType>
struct VectorSingleStruct {
const TeamMemberType& team_member;
KOKKOS_INLINE_FUNCTION
VectorSingleStruct(const TeamMemberType& team_member_):team_member(team_member_){}
};
} // namespace Impl
-/*template<typename iType, class TeamMemberType>
+/** \brief Execution policy for parallel work over the threads within a team.
+ *
+ * The range is split over all threads in a team. The mapping scheme depends on the architecture.
+ * This policy is used together with a parallel pattern as a nested layer within a kernel launched
+ * with the TeamPolicy. This variant expects a single count, so the range is [0,count).
+ */
+template<typename iType, class TeamMemberType>
KOKKOS_INLINE_FUNCTION
-Impl::TeamThreadLoopBoundariesStruct<iType,TeamMemberType>
- TeamThreadLoop(TeamMemberType thread, const iType count);
+Impl::TeamThreadRangeBoundariesStruct<iType,TeamMemberType> TeamThreadRange(const TeamMemberType&, const iType& count);
+/** \brief Execution policy for parallel work over the threads within a team.
+ *
+ * The range is split over all threads in a team. The mapping scheme depends on the architecture.
+ * This policy is used together with a parallel pattern as a nested layer within a kernel launched
+ * with the TeamPolicy. This variant expects a begin and end, so the range is [begin,end).
+ */
template<typename iType, class TeamMemberType>
KOKKOS_INLINE_FUNCTION
-Impl::ThreadVectorLoopBoundariesStruct<iType,TeamMemberType>
- ThreadVectorLoop(TeamMemberType thread, const iType count);*/
+Impl::TeamThreadRangeBoundariesStruct<iType,TeamMemberType> TeamThreadRange(const TeamMemberType&, const iType& begin, const iType& end);
+/** \brief Execution policy for a vector parallel loop.
+ *
+ * The range is split over all vector lanes in a thread. The mapping scheme depends on the architecture.
+ * This policy is used together with a parallel pattern as a nested layer within a kernel launched
+ * with the TeamPolicy. This variant expects a single count, so the range is [0,count).
+ */
+template<typename iType, class TeamMemberType>
+KOKKOS_INLINE_FUNCTION
+Impl::ThreadVectorRangeBoundariesStruct<iType,TeamMemberType> ThreadVectorRange(const TeamMemberType&, const iType& count);
} // namespace Kokkos
#endif /* #define KOKKOS_EXECPOLICY_HPP */
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
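The renamed TeamThreadRange and ThreadVectorRange policies above are the nested layer of hierarchical parallelism: an outer TeamPolicy kernel splits its range over teams, and inside the kernel the team's threads (and vector lanes) split further work. A hedged sketch of the pattern for the default execution space; the functor, view names, and sizes are illustrative:

#include <Kokkos_Core.hpp>

// Illustrative team functor: one team per matrix row, the row's columns are
// split over the threads of the team with TeamThreadRange.
struct RowSumFunctor {
  typedef Kokkos::TeamPolicy<>::member_type member_type ;

  Kokkos::View< const double ** > a ;
  Kokkos::View< double * > row_sum ;

  RowSumFunctor( Kokkos::View< const double ** > a_ , Kokkos::View< double * > s_ )
    : a( a_ ) , row_sum( s_ ) {}

  KOKKOS_INLINE_FUNCTION
  void operator()( const member_type & team ) const
  {
    const int row  = team.league_rank();
    const int ncol = int( a.dimension_1() );

    double sum = 0 ;
    // Nested reduction over the columns, shared by the team's threads.
    Kokkos::parallel_reduce( Kokkos::TeamThreadRange( team , ncol ) ,
      [&]( const int j , double & s ) { s += a( row , j ); } , sum );

    if ( team.team_rank() == 0 ) row_sum( row ) = sum ;
  }
};

void run_row_sums( Kokkos::View< const double ** > a , Kokkos::View< double * > row_sum )
{
  RowSumFunctor f( a , row_sum );
  const int league_size = int( a.dimension_0() );          // one team per row
  const int team_size   = Kokkos::TeamPolicy<>::team_size_recommended( f );
  Kokkos::parallel_for( Kokkos::TeamPolicy<>( league_size , team_size ) , f );
}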
diff --git a/lib/kokkos/core/src/Kokkos_HostSpace.hpp b/lib/kokkos/core/src/Kokkos_HostSpace.hpp
index 80abf0b50..012743d43 100755
--- a/lib/kokkos/core/src/Kokkos_HostSpace.hpp
+++ b/lib/kokkos/core/src/Kokkos_HostSpace.hpp
@@ -1,161 +1,270 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_HOSTSPACE_HPP
#define KOKKOS_HOSTSPACE_HPP
+#include <cstring>
+#include <string>
#include <iosfwd>
#include <typeinfo>
-#include <string>
#include <Kokkos_Core_fwd.hpp>
#include <Kokkos_MemoryTraits.hpp>
+
#include <impl/Kokkos_Traits.hpp>
#include <impl/Kokkos_Error.hpp>
+#include <impl/Kokkos_AllocationTracker.hpp>
+#include <impl/Kokkos_BasicAllocators.hpp>
+
+#include <impl/KokkosExp_SharedAlloc.hpp>
+
/*--------------------------------------------------------------------------*/
+namespace Kokkos {
+namespace Impl {
+
+/// \brief Initialize lock array for arbitrary size atomics.
+///
+/// Arbitrary atomics are implemented using a hash table of locks
+/// where the hash value is derived from the address of the
+/// object for which an atomic operation is performed.
+/// This function initializes the locks to zero (unset).
+void init_lock_array_host_space();
+
+/// \brief Acquire a lock for the address
+///
+/// This function tries to acquire the lock for the hash value derived
+/// from the provided ptr. If the lock is successfully acquired the
+/// function returns true. Otherwise it returns false.
+bool lock_address_host_space(void* ptr);
+
+/// \brief Release lock for the address
+///
+/// This function releases the lock for the hash value derived
+/// from the provided ptr. This function should only be called
+/// after previously successfully acquiring a lock with
+/// lock_address.
+void unlock_address_host_space(void* ptr);
+
+} // namespace Impl
+} // namespace Kokkos
+
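These lock-array hooks are the fallback path for atomic updates on types too large for the hardware atomics: the address is hashed to a lock, the lock is spun on, and the update happens inside the critical section. A hedged sketch of that pattern (the helper name is illustrative, not part of this header):

// Illustrative locked update for an oversized type T.
template< typename T >
void illustrative_locked_atomic_add( T * const dest , const T & val )
{
  // Spin until the lock hashed from 'dest' is acquired.
  while ( ! Kokkos::Impl::lock_address_host_space( (void *) dest ) ) ;
  *dest = *dest + val ;                                    // critical section
  Kokkos::Impl::unlock_address_host_space( (void *) dest );
}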
namespace Kokkos {
/// \class HostSpace
/// \brief Memory management for host memory.
///
/// HostSpace is a memory space that governs host memory. "Host"
/// memory means the usual CPU-accessible memory.
class HostSpace {
public:
//! Tag this class as a kokkos memory space
typedef HostSpace memory_space ;
typedef size_t size_type ;
/// \typedef execution_space
/// \brief Default execution space for this memory space.
///
/// Every memory space has a default execution space. This is
/// useful for things like initializing a View (which happens in
/// parallel using the View's default execution space).
#if defined( KOKKOS_HAVE_DEFAULT_DEVICE_TYPE_OPENMP )
typedef Kokkos::OpenMP execution_space ;
#elif defined( KOKKOS_HAVE_DEFAULT_DEVICE_TYPE_THREADS )
typedef Kokkos::Threads execution_space ;
#elif defined( KOKKOS_HAVE_OPENMP )
typedef Kokkos::OpenMP execution_space ;
#elif defined( KOKKOS_HAVE_PTHREAD )
typedef Kokkos::Threads execution_space ;
#elif defined( KOKKOS_HAVE_SERIAL )
typedef Kokkos::Serial execution_space ;
#else
# error "At least one of the following host execution spaces must be defined: Kokkos::OpenMP, Kokkos::Serial, or Kokkos::Threads. You might be seeing this message if you disabled the Kokkos::Serial device explicitly using the Kokkos_ENABLE_Serial:BOOL=OFF CMake option, but did not enable any of the other host execution space devices."
#endif
+ //! This memory space preferred device_type
+ typedef Kokkos::Device<execution_space,memory_space> device_type;
+
+
+#if defined( KOKKOS_USE_PAGE_ALIGNED_HOST_MEMORY )
+ typedef Impl::PageAlignedAllocator allocator ;
+#else
+ typedef Impl::AlignedAllocator allocator ;
+#endif
+
/** \brief Allocate a contiguous block of memory.
*
* The input label is associated with the block of memory.
* The block of memory is tracked via reference counting where
* allocation gives it a reference count of one.
- *
- * Allocation may only occur on the master thread of the process.
*/
- static void * allocate( const std::string & label , const size_t size );
+ static Impl::AllocationTracker allocate_and_track( const std::string & label, const size_t size );
- /** \brief Increment the reference count of the block of memory
- * in which the input pointer resides.
- *
- * Reference counting only occurs on the master thread.
- */
- static void increment( const void * );
-
- /** \brief Decrement the reference count of the block of memory
- * in which the input pointer resides. If the reference
- * count falls to zero the memory is deallocated.
- *
- * Reference counting only occurs on the master thread.
- */
- static void decrement( const void * );
+ /*--------------------------------*/
+ /* Functions unique to the HostSpace */
+ static int in_parallel();
- /** \brief Get the reference count of the block of memory
- * in which the input pointer resides. If the reference
- * count is zero the memory region is not tracked.
- *
- * Reference counting only occurs on the master thread.
- */
- static int count( const void * );
+ static void register_in_parallel( int (*)() );
/*--------------------------------*/
- /** \brief Print all tracked memory to the output stream. */
- static void print_memory_view( std::ostream & );
+ /**\brief Default memory space instance */
+ HostSpace();
+ HostSpace( const HostSpace & rhs ) = default ;
+ HostSpace & operator = ( const HostSpace & ) = default ;
+ ~HostSpace() = default ;
- /** \brief Retrieve label associated with the input pointer */
- static std::string query_label( const void * );
+ /**\brief Non-default memory space instance to choose allocation mechanism, if available */
- /*--------------------------------*/
- /* Functions unique to the HostSpace */
+ enum AllocationMechanism { STD_MALLOC , POSIX_MEMALIGN , POSIX_MMAP , INTEL_MM_ALLOC };
- static int in_parallel();
+ explicit
+ HostSpace( const AllocationMechanism & );
- static void register_in_parallel( int (*)() );
+ /**\brief Allocate memory in the host space */
+ void * allocate( const size_t arg_alloc_size ) const ;
+
+ /**\brief Deallocate memory in the host space */
+ void deallocate( void * const arg_alloc_ptr
+ , const size_t arg_alloc_size ) const ;
+
+private:
+
+ AllocationMechanism m_alloc_mech ;
+
+ friend class Kokkos::Experimental::Impl::SharedAllocationRecord< Kokkos::HostSpace , void > ;
};
+} // namespace Kokkos
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+namespace Kokkos {
+namespace Experimental {
+namespace Impl {
+
+template<>
+class SharedAllocationRecord< Kokkos::HostSpace , void >
+ : public SharedAllocationRecord< void , void >
+{
+private:
+
+ friend Kokkos::HostSpace ;
+
+ typedef SharedAllocationRecord< void , void > RecordBase ;
+ SharedAllocationRecord( const SharedAllocationRecord & ) = delete ;
+ SharedAllocationRecord & operator = ( const SharedAllocationRecord & ) = delete ;
+
+ static void deallocate( RecordBase * );
+
+ /**\brief Root record for tracked allocations from this HostSpace instance */
+ static RecordBase s_root_record ;
+
+ const Kokkos::HostSpace m_space ;
+
+protected:
+
+ ~SharedAllocationRecord();
+ SharedAllocationRecord() = default ;
+
+ SharedAllocationRecord( const Kokkos::HostSpace & arg_space
+ , const std::string & arg_label
+ , const size_t arg_alloc_size
+ , const RecordBase::function_type arg_dealloc = & deallocate
+ );
+
+public:
+
+ inline
+ std::string get_label() const
+ {
+ return std::string( RecordBase::head()->m_label );
+ }
+
+ KOKKOS_INLINE_FUNCTION static
+ SharedAllocationRecord * allocate( const Kokkos::HostSpace & arg_space
+ , const std::string & arg_label
+ , const size_t arg_alloc_size
+ )
+ {
+#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
+ return new SharedAllocationRecord( arg_space , arg_label , arg_alloc_size );
+#else
+ return (SharedAllocationRecord *) 0 ;
+#endif
+ }
+
+
+ static SharedAllocationRecord * get_record( void * arg_alloc_ptr );
+
+ static void print_records( std::ostream & , const Kokkos::HostSpace & , bool detail = false );
+};
+
+} // namespace Impl
+} // namespace Experimental
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
template< class , class > struct DeepCopy ;
template<>
struct DeepCopy<HostSpace,HostSpace> {
DeepCopy( void * dst , const void * src , size_t n );
};
} // namespace Impl
} // namespace Kokkos
+
#endif /* #define KOKKOS_HOSTSPACE_HPP */
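For orientation, a minimal host-only sketch of the instance-based allocation API declared above (not part of this patch; the buffer size and the explicit STD_MALLOC choice are illustrative assumptions):

#include <Kokkos_Core.hpp>

void hostspace_sketch()
{
  Kokkos::HostSpace space ;                                        // default allocation mechanism
  Kokkos::HostSpace malloc_space( Kokkos::HostSpace::STD_MALLOC ); // or request plain malloc

  const size_t nbytes = 1024 * sizeof(double);
  void * const ptr = space.allocate( nbytes );

  // ... use the raw host buffer ...

  space.deallocate( ptr , nbytes );
  (void) malloc_space ;
}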
diff --git a/lib/kokkos/core/src/Kokkos_Layout.hpp b/lib/kokkos/core/src/Kokkos_Layout.hpp
index 1440ad84a..32822889d 100755
--- a/lib/kokkos/core/src/Kokkos_Layout.hpp
+++ b/lib/kokkos/core/src/Kokkos_Layout.hpp
@@ -1,176 +1,174 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos
-// Manycore Performance-Portable Multidimensional Arrays
-//
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
/// \file Kokkos_Layout.hpp
/// \brief Declaration of various \c MemoryLayout options.
#ifndef KOKKOS_LAYOUT_HPP
#define KOKKOS_LAYOUT_HPP
#include <stddef.h>
#include <impl/Kokkos_Traits.hpp>
#include <impl/Kokkos_Tags.hpp>
namespace Kokkos {
//----------------------------------------------------------------------------
/// \struct LayoutLeft
/// \brief Memory layout tag indicating left-to-right (Fortran scheme)
/// striding of multi-indices.
///
/// This is an example of a \c MemoryLayout template parameter of
/// View. The memory layout describes how View maps from a
/// multi-index (i0, i1, ..., ik) to a memory location.
///
/// "Layout left" indicates a mapping where the leftmost index i0
/// refers to contiguous access, and strides increase for dimensions
/// going right from there (i1, i2, ...). This layout imitates how
/// Fortran stores multi-dimensional arrays. For the special case of
/// a two-dimensional array, "layout left" is also called "column
/// major."
struct LayoutLeft {
//! Tag this class as a kokkos array layout
typedef LayoutLeft array_layout ;
};
//----------------------------------------------------------------------------
/// \struct LayoutRight
/// \brief Memory layout tag indicating right-to-left (C or
/// lexicographical scheme) striding of multi-indices.
///
/// This is an example of a \c MemoryLayout template parameter of
/// View. The memory layout describes how View maps from a
/// multi-index (i0, i1, ..., ik) to a memory location.
///
/// "Right layout" indicates a mapping where the rightmost index ik
/// refers to contiguous access, and strides increase for dimensions
/// going left from there. This layout imitates how C stores
/// multi-dimensional arrays. For the special case of a
/// two-dimensional array, "layout right" is also called "row major."
struct LayoutRight {
//! Tag this class as a kokkos array layout
typedef LayoutRight array_layout ;
};
//----------------------------------------------------------------------------
/// \struct LayoutStride
/// \brief Memory layout tag indicating arbitrarily strided
/// multi-index mapping into contiguous memory.
struct LayoutStride {
//! Tag this class as a kokkos array layout
typedef LayoutStride array_layout ;
enum { MAX_RANK = 8 };
size_t dimension[ MAX_RANK ] ;
size_t stride[ MAX_RANK ] ;
/** \brief Compute strides from ordered dimensions.
*
* The values of 'order' must be a permutation of [0..rank)
* and specify the ordering of the dimensions.
* Order = {0,1,2,...} is LayoutLeft
* Order = {...,2,1,0} is LayoutRight
*/
template< typename iTypeOrder , typename iTypeDimen >
KOKKOS_INLINE_FUNCTION static
LayoutStride order_dimensions( int const rank
, iTypeOrder const * const order
, iTypeDimen const * const dimen )
{
LayoutStride tmp ;
// Verify valid rank order:
int check_input = MAX_RANK < rank ? 0 : int( 1 << rank ) - 1 ;
for ( int r = 0 ; r < MAX_RANK ; ++r ) {
tmp.dimension[r] = 0 ;
tmp.stride[r] = 0 ;
check_input &= ~int( 1 << order[r] );
}
if ( 0 == check_input ) {
size_t n = 1 ;
for ( int r = 0 ; r < rank ; ++r ) {
tmp.stride[ order[r] ] = n ;
n *= ( dimen[order[r]] );
tmp.dimension[r] = dimen[r];
}
}
return tmp ;
}
};
//----------------------------------------------------------------------------
/// \struct LayoutTileLeft
/// \brief Memory layout tag indicating left-to-right (Fortran scheme)
/// striding of multi-indices by tiles.
///
/// This is an example of a \c MemoryLayout template parameter of
/// View. The memory layout describes how View maps from a
/// multi-index (i0, i1, ..., ik) to a memory location.
///
/// "Tiled layout" indicates a mapping to contiguously stored
/// <tt>ArgN0</tt> by <tt>ArgN1</tt> tiles for the rightmost two
/// dimensions. Indices are LayoutLeft within each tile, and the
/// tiles themselves are arranged using LayoutLeft. Note that the
/// dimensions <tt>ArgN0</tt> and <tt>ArgN1</tt> of the tiles must be
/// compile-time constants. This speeds up index calculations. If
/// both tile dimensions are powers of two, Kokkos can optimize
/// further.
template < unsigned ArgN0 , unsigned ArgN1 ,
bool IsPowerOfTwo = ( Impl::is_power_of_two<ArgN0>::value &&
Impl::is_power_of_two<ArgN1>::value )
>
struct LayoutTileLeft {
//! Tag this class as a kokkos array layout
typedef LayoutTileLeft<ArgN0,ArgN1,IsPowerOfTwo> array_layout ;
enum { N0 = ArgN0 };
enum { N1 = ArgN1 };
};
} // namespace Kokkos
#endif // #ifndef KOKKOS_LAYOUT_HPP
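A small sketch of LayoutStride::order_dimensions as documented above (the extents {3,4,5} and the left-to-right ordering are illustrative assumptions):

#include <Kokkos_Core.hpp>
#include <cstdio>

void layout_stride_sketch()
{
  const int order[3] = { 0 , 1 , 2 };   // leftmost index fastest, i.e. LayoutLeft ordering
  const int dimen[3] = { 3 , 4 , 5 };

  const Kokkos::LayoutStride ls =
    Kokkos::LayoutStride::order_dimensions( 3 , order , dimen );

  // Expected strides for this ordering: 1 , 3 , 12.
  for ( int r = 0 ; r < 3 ; ++r ) {
    std::printf( "dim %d : extent %lu stride %lu\n" , r ,
                 (unsigned long) ls.dimension[r] , (unsigned long) ls.stride[r] );
  }
}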
diff --git a/lib/kokkos/core/src/Kokkos_Macros.hpp b/lib/kokkos/core/src/Kokkos_Macros.hpp
index a67aa1adc..3978a0622 100755
--- a/lib/kokkos/core/src/Kokkos_Macros.hpp
+++ b/lib/kokkos/core/src/Kokkos_Macros.hpp
@@ -1,433 +1,397 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos
-// Manycore Performance-Portable Multidimensional Arrays
-//
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_MACROS_HPP
#define KOKKOS_MACROS_HPP
//----------------------------------------------------------------------------
/** Pick up configure/build options via #define macros:
*
* KOKKOS_HAVE_CUDA Kokkos::Cuda execution and memory spaces
* KOKKOS_HAVE_PTHREAD Kokkos::Threads execution space
* KOKKOS_HAVE_QTHREAD Kokkos::Qthread execution space
* KOKKOS_HAVE_OPENMP Kokkos::OpenMP execution space
* KOKKOS_HAVE_HWLOC HWLOC library is available
- * KOKKOS_HAVE_EXPRESSION_CHECK insert array bounds checks, is expensive!
+ * KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK insert array bounds checks, is expensive!
* KOKKOS_HAVE_CXX11 enable C++11 features
*
* KOKKOS_HAVE_MPI negotiate MPI/execution space interactions
*
* KOKKOS_USE_CUDA_UVM Use CUDA UVM for Cuda memory space
*/
#ifndef KOKKOS_DONT_INCLUDE_CORE_CONFIG_H
#include <KokkosCore_config.h>
#endif
//----------------------------------------------------------------------------
/** Pick up compiler specific #define macros:
*
* Macros for known compilers evaluate to an integral version value
*
* KOKKOS_COMPILER_NVCC
* KOKKOS_COMPILER_GNU
* KOKKOS_COMPILER_INTEL
* KOKKOS_COMPILER_IBM
* KOKKOS_COMPILER_CRAYC
* KOKKOS_COMPILER_APPLECC
* KOKKOS_COMPILER_CLANG
* KOKKOS_COMPILER_PGI
*
 * Macros for which compiler extension to use for atomics on intrinsic types
*
* KOKKOS_ATOMICS_USE_CUDA
* KOKKOS_ATOMICS_USE_GNU
* KOKKOS_ATOMICS_USE_INTEL
* KOKKOS_ATOMICS_USE_OPENMP31
*
 * A suite of 'KOKKOS_HAVE_PRAGMA_...' macros is defined for internal use.
*
* Macros for marking functions to run in an execution space:
*
* KOKKOS_FUNCTION
* KOKKOS_INLINE_FUNCTION request compiler to inline
* KOKKOS_FORCEINLINE_FUNCTION force compiler to inline, use with care!
*/
//----------------------------------------------------------------------------
#if defined( KOKKOS_HAVE_CUDA ) && defined( __CUDACC__ )
/* Compiling with a CUDA compiler.
*
* Include <cuda.h> to pick up the CUDA_VERSION macro defined as:
* CUDA_VERSION = ( MAJOR_VERSION * 1000 ) + ( MINOR_VERSION * 10 )
*
* When generating device code the __CUDA_ARCH__ macro is defined as:
* __CUDA_ARCH__ = ( MAJOR_CAPABILITY * 100 ) + ( MINOR_CAPABILITY * 10 )
*/
#include <cuda_runtime.h>
#include <cuda.h>
#if ! defined( CUDA_VERSION )
#error "#include <cuda.h> did not define CUDA_VERSION"
#endif
-#if ( CUDA_VERSION < 4010 )
-#error "Cuda version 4.1 or greater required"
+#if ( CUDA_VERSION < 6050 )
+// CUDA supports C++11 in device code (unofficially) starting with
+// version 6.5. This includes auto type deduction and lambdas used
+// internally in device code.
+#error "Cuda version 6.5 or greater required"
#endif
-#if defined( __CUDA_ARCH__ ) && ( __CUDA_ARCH__ < 200 )
+#if defined( __CUDA_ARCH__ ) && ( __CUDA_ARCH__ < 300 )
/* Compiling with CUDA compiler for device code. */
-#error "Cuda device capability >= 2.0 is required"
+#error "Cuda device capability >= 3.0 is required"
#endif
#endif /* #if defined( KOKKOS_HAVE_CUDA ) && defined( __CUDACC__ ) */
/*--------------------------------------------------------------------------*/
/* Language info: C++, CUDA, OPENMP */
#if defined( __CUDA_ARCH__ ) && defined( KOKKOS_HAVE_CUDA )
// Compiling Cuda code to 'ptx'
#define KOKKOS_FORCEINLINE_FUNCTION __device__ __host__ __forceinline__
#define KOKKOS_INLINE_FUNCTION __device__ __host__ inline
#define KOKKOS_FUNCTION __device__ __host__
#endif /* #if defined( __CUDA_ARCH__ ) */
#if defined( _OPENMP )
/* Compiling with OpenMP.
* The value of _OPENMP is an integer value YYYYMM
* where YYYY and MM are the year and month designation
* of the supported OpenMP API version.
*/
#endif /* #if defined( _OPENMP ) */
/*--------------------------------------------------------------------------*/
/* Mapping compiler built-ins to KOKKOS_COMPILER_*** macros */
#if defined( __NVCC__ )
// NVIDIA compiler is being used.
// Code is parsed and separated into host and device code.
// Host code is compiled again with another compiler.
// Device code is compiled to 'ptx'.
#define KOKKOS_COMPILER_NVCC __NVCC__
- #if defined( KOKKOS_HAVE_CXX11 ) && defined (KOKKOS_HAVE_CUDA)
- // CUDA supports (inofficially) C++11 in device code starting with
- // version 6.5. This includes auto type and device code internal
- // lambdas.
- #if ( CUDA_VERSION < 6050 )
- #error "NVCC does not support C++11"
- #endif
- #endif
#else
- #if defined( KOKKOS_HAVE_CXX11 )
+#if defined( KOKKOS_HAVE_CXX11 ) && ! defined( KOKKOS_HAVE_CXX11_DISPATCH_LAMBDA )
// CUDA (including version 6.5) does not support giving lambdas as
// arguments to global functions. Thus it is not currently possible
// to dispatch lambdas from the host.
#define KOKKOS_HAVE_CXX11_DISPATCH_LAMBDA 1
#endif
#endif /* #if defined( __NVCC__ ) */
#if defined( KOKKOS_HAVE_CXX11 ) && !defined (KOKKOS_LAMBDA)
#define KOKKOS_LAMBDA [=]
#endif
#if ! defined( __CUDA_ARCH__ ) /* Not compiling Cuda code to 'ptx'. */
/* Intel compiler for host code */
#if defined( __INTEL_COMPILER )
#define KOKKOS_COMPILER_INTEL __INTEL_COMPILER
#elif defined( __ICC )
// Old define
#define KOKKOS_COMPILER_INTEL __ICC
-#elif defined( __ECC )
+#elif defined( __ECC )
// Very old define
#define KOKKOS_COMPILER_INTEL __ECC
#endif
/* CRAY compiler for host code */
#if defined( _CRAYC )
#define KOKKOS_COMPILER_CRAYC _CRAYC
#endif
#if defined( __IBMCPP__ )
// IBM C++
#define KOKKOS_COMPILER_IBM __IBMCPP__
#elif defined( __IBMC__ )
#define KOKKOS_COMPILER_IBM __IBMC__
#endif
#if defined( __APPLE_CC__ )
#define KOKKOS_COMPILER_APPLECC __APPLE_CC__
#endif
#if defined (__clang__) && !defined (KOKKOS_COMPILER_INTEL)
#define KOKKOS_COMPILER_CLANG __clang_major__*100+__clang_minor__*10+__clang_patchlevel__
#endif
#if ! defined( __clang__ ) && ! defined( KOKKOS_COMPILER_INTEL ) &&defined( __GNUC__ )
#define KOKKOS_COMPILER_GNU __GNUC__*100+__GNUC_MINOR__*10+__GNUC_PATCHLEVEL__
#endif
#if defined( __PGIC__ ) && ! defined( __GNUC__ )
#define KOKKOS_COMPILER_PGI __PGIC__*100+__PGIC_MINOR__*10+__PGIC_PATCHLEVEL__
#endif
#endif /* #if ! defined( __CUDA_ARCH__ ) */
/*--------------------------------------------------------------------------*/
/*--------------------------------------------------------------------------*/
/* Intel compiler macros */
#if defined( KOKKOS_COMPILER_INTEL )
#define KOKKOS_HAVE_PRAGMA_UNROLL 1
#define KOKKOS_HAVE_PRAGMA_IVDEP 1
#define KOKKOS_HAVE_PRAGMA_LOOPCOUNT 1
#define KOKKOS_HAVE_PRAGMA_VECTOR 1
#define KOKKOS_HAVE_PRAGMA_SIMD 1
- #if ( 1200 <= KOKKOS_COMPILER_INTEL ) && ! defined( KOKKOS_ENABLE_ASM )
+#if ( 1200 <= KOKKOS_COMPILER_INTEL ) && ! defined( KOKKOS_ENABLE_ASM ) && ! defined( _WIN32 )
#define KOKKOS_ENABLE_ASM 1
#endif
#if ( 1200 <= KOKKOS_COMPILER_INTEL ) && ! defined( KOKKOS_FORCEINLINE_FUNCTION )
- #define KOKKOS_FORCEINLINE_FUNCTION inline __attribute__((always_inline))
+ #if !defined (_WIN32)
+ #define KOKKOS_FORCEINLINE_FUNCTION inline __attribute__((always_inline))
+ #else
+ #define KOKKOS_FORCEINLINE_FUNCTION inline
+ #endif
#endif
#if defined( __MIC__ )
// Compiling for Xeon Phi
#endif
#endif
/*--------------------------------------------------------------------------*/
/* Cray compiler macros */
#if defined( KOKKOS_COMPILER_CRAYC )
#endif
/*--------------------------------------------------------------------------*/
/* IBM Compiler macros */
#if defined( KOKKOS_COMPILER_IBM )
#define KOKKOS_HAVE_PRAGMA_UNROLL 1
//#define KOKKOS_HAVE_PRAGMA_IVDEP 1
//#define KOKKOS_HAVE_PRAGMA_LOOPCOUNT 1
//#define KOKKOS_HAVE_PRAGMA_VECTOR 1
//#define KOKKOS_HAVE_PRAGMA_SIMD 1
#endif
/*--------------------------------------------------------------------------*/
/* CLANG compiler macros */
#if defined( KOKKOS_COMPILER_CLANG )
//#define KOKKOS_HAVE_PRAGMA_UNROLL 1
//#define KOKKOS_HAVE_PRAGMA_IVDEP 1
//#define KOKKOS_HAVE_PRAGMA_LOOPCOUNT 1
//#define KOKKOS_HAVE_PRAGMA_VECTOR 1
//#define KOKKOS_HAVE_PRAGMA_SIMD 1
#if ! defined( KOKKOS_FORCEINLINE_FUNCTION )
#define KOKKOS_FORCEINLINE_FUNCTION inline __attribute__((always_inline))
#endif
#endif
/*--------------------------------------------------------------------------*/
/* GNU Compiler macros */
-#if defined( KOKKOS_COMPILER_GNU )
+#if defined( KOKKOS_COMPILER_GNU )
//#define KOKKOS_HAVE_PRAGMA_UNROLL 1
//#define KOKKOS_HAVE_PRAGMA_IVDEP 1
//#define KOKKOS_HAVE_PRAGMA_LOOPCOUNT 1
//#define KOKKOS_HAVE_PRAGMA_VECTOR 1
//#define KOKKOS_HAVE_PRAGMA_SIMD 1
#if ! defined( KOKKOS_FORCEINLINE_FUNCTION )
#define KOKKOS_FORCEINLINE_FUNCTION inline __attribute__((always_inline))
#endif
#if ! defined( KOKKOS_ENABLE_ASM ) && \
! ( defined( __powerpc) || \
defined(__powerpc__) || \
defined(__powerpc64__) || \
defined(__POWERPC__) || \
defined(__ppc__) || \
- defined(__ppc64__) )
+ defined(__ppc64__) || \
+ defined(__PGIC__) )
#define KOKKOS_ENABLE_ASM 1
#endif
#endif
/*--------------------------------------------------------------------------*/
#if defined( KOKKOS_COMPILER_PGI )
#define KOKKOS_HAVE_PRAGMA_UNROLL 1
#define KOKKOS_HAVE_PRAGMA_IVDEP 1
//#define KOKKOS_HAVE_PRAGMA_LOOPCOUNT 1
#define KOKKOS_HAVE_PRAGMA_VECTOR 1
//#define KOKKOS_HAVE_PRAGMA_SIMD 1
#endif
/*--------------------------------------------------------------------------*/
#if defined( KOKKOS_COMPILER_NVCC )
#if defined(__CUDA_ARCH__ )
#define KOKKOS_HAVE_PRAGMA_UNROLL 1
#endif
#endif
-/*--------------------------------------------------------------------------*/
-/* Select compiler dependent interface for atomics */
-
-#if ! defined( KOKKOS_ATOMICS_USE_CUDA ) || \
- ! defined( KOKKOS_ATOMICS_USE_GNU ) || \
- ! defined( KOKKOS_ATOMICS_USE_INTEL ) || \
- ! defined( KOKKOS_ATOMICS_USE_OPENMP31 )
-
-/* Atomic selection is not pre-defined, choose from language and compiler. */
-
-#if defined( __CUDA_ARCH__ ) && defined (KOKKOS_HAVE_CUDA)
-
- #define KOKKOS_ATOMICS_USE_CUDA
-
-#elif defined( KOKKOS_COMPILER_GNU ) || defined( KOKKOS_COMPILER_CLANG )
-
- #define KOKKOS_ATOMICS_USE_GNU
-
-#elif defined( KOKKOS_COMPILER_INTEL ) || defined( KOKKOS_COMPILER_CRAYC )
-
- #define KOKKOS_ATOMICS_USE_INTEL
-
-#elif defined( _OPENMP ) && ( 201107 <= _OPENMP )
-
- #define KOKKOS_ATOMICS_USE_OMP31
-
-#else
-
- #error "Compiler does not support atomic operations"
-
-#endif
-
-#endif
-
//----------------------------------------------------------------------------
/** Define function marking macros if compiler specific macros are undefined: */
#if ! defined( KOKKOS_FORCEINLINE_FUNCTION )
#define KOKKOS_FORCEINLINE_FUNCTION inline
#endif
#if ! defined( KOKKOS_INLINE_FUNCTION )
#define KOKKOS_INLINE_FUNCTION inline
#endif
#if ! defined( KOKKOS_FUNCTION )
#define KOKKOS_FUNCTION /**/
#endif
//----------------------------------------------------------------------------
/** Determine the default execution space for parallel dispatch.
 * At most one default execution space may be specified.
*/
#if 1 < ( ( defined ( KOKKOS_HAVE_DEFAULT_DEVICE_TYPE_CUDA ) ? 1 : 0 ) + \
( defined ( KOKKOS_HAVE_DEFAULT_DEVICE_TYPE_OPENMP ) ? 1 : 0 ) + \
( defined ( KOKKOS_HAVE_DEFAULT_DEVICE_TYPE_THREADS ) ? 1 : 0 ) + \
( defined ( KOKKOS_HAVE_DEFAULT_DEVICE_TYPE_SERIAL ) ? 1 : 0 ) )
#error "More than one KOKKOS_HAVE_DEFAULT_DEVICE_TYPE_* specified" ;
#endif
/** If default is not specified then choose from enabled execution spaces.
* Priority: CUDA, OPENMP, THREADS, SERIAL
*/
#if defined ( KOKKOS_HAVE_DEFAULT_DEVICE_TYPE_CUDA )
#elif defined ( KOKKOS_HAVE_DEFAULT_DEVICE_TYPE_OPENMP )
#elif defined ( KOKKOS_HAVE_DEFAULT_DEVICE_TYPE_THREADS )
#elif defined ( KOKKOS_HAVE_DEFAULT_DEVICE_TYPE_SERIAL )
#elif defined ( KOKKOS_HAVE_CUDA )
#define KOKKOS_HAVE_DEFAULT_DEVICE_TYPE_CUDA
#elif defined ( KOKKOS_HAVE_OPENMP )
#define KOKKOS_HAVE_DEFAULT_DEVICE_TYPE_OPENMP
#elif defined ( KOKKOS_HAVE_PTHREAD )
#define KOKKOS_HAVE_DEFAULT_DEVICE_TYPE_THREADS
#else
#define KOKKOS_HAVE_DEFAULT_DEVICE_TYPE_SERIAL
#endif
//----------------------------------------------------------------------------
/** Determine for what space the code is being compiled: */
#if defined( __CUDACC__ ) && defined( __CUDA_ARCH__ ) && defined (KOKKOS_HAVE_CUDA)
#define KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_CUDA
#else
#define KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST
#endif
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
#endif /* #ifndef KOKKOS_MACROS_HPP */
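To illustrate the function-marking macros defined here, a sketch of code that compiles for whichever execution space is active (the functor, view names, and scaling factor are made-up examples, not part of the patch):

#include <Kokkos_Core.hpp>

struct Scale {
  double a ;
  Kokkos::View<double*> x ;

  // Marked so the body can run in host or device code.
  KOKKOS_INLINE_FUNCTION
  void operator()( const int i ) const { x(i) *= a ; }
};

void scale_example( Kokkos::View<double*> x )
{
  Kokkos::parallel_for( x.dimension_0() , Scale{ 2.0 , x } );

#if defined( KOKKOS_HAVE_CXX11_DISPATCH_LAMBDA )
  // KOKKOS_LAMBDA expands to a [=] capture list when lambda dispatch is enabled.
  Kokkos::parallel_for( x.dimension_0() , KOKKOS_LAMBDA( const int i ) { x(i) += 1.0 ; } );
#endif
}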
diff --git a/lib/kokkos/core/src/Kokkos_MemoryTraits.hpp b/lib/kokkos/core/src/Kokkos_MemoryTraits.hpp
index 0817e20a8..b581c7da2 100755
--- a/lib/kokkos/core/src/Kokkos_MemoryTraits.hpp
+++ b/lib/kokkos/core/src/Kokkos_MemoryTraits.hpp
@@ -1,118 +1,116 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos
-// Manycore Performance-Portable Multidimensional Arrays
-//
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_MEMORYTRAITS_HPP
#define KOKKOS_MEMORYTRAITS_HPP
#include <impl/Kokkos_Traits.hpp>
#include <impl/Kokkos_Tags.hpp>
//----------------------------------------------------------------------------
namespace Kokkos {
/** \brief Memory access traits for views, an extension point.
*
* These traits should be orthogonal. If there are dependencies then
* the MemoryTraits template must detect and enforce dependencies.
*
* A zero value is the default for a View, indicating that none of
* these traits are present.
*/
enum MemoryTraitsFlags
{ Unmanaged = 0x01
, RandomAccess = 0x02
, Atomic = 0x04
};
template < unsigned T >
struct MemoryTraits {
//! Tag this class as a kokkos memory traits:
typedef MemoryTraits memory_traits ;
enum { Unmanaged = T & unsigned(Kokkos::Unmanaged) };
enum { RandomAccess = T & unsigned(Kokkos::RandomAccess) };
enum { Atomic = T & unsigned(Kokkos::Atomic) };
};
} // namespace Kokkos
//----------------------------------------------------------------------------
namespace Kokkos {
typedef Kokkos::MemoryTraits<0> MemoryManaged ;
typedef Kokkos::MemoryTraits< Kokkos::Unmanaged > MemoryUnmanaged ;
typedef Kokkos::MemoryTraits< Kokkos::Unmanaged | Kokkos::RandomAccess > MemoryRandomAccess ;
} // namespace Kokkos
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
/** \brief Memory alignment settings
*
 * Sets the global value for memory alignment. Must be a power of two!
 * Enables compatibility of Views with static stride across different devices.
 * Use the KOKKOS_MEMORY_ALIGNMENT compiler flag to override the default.
*/
enum { MEMORY_ALIGNMENT =
#if defined( KOKKOS_MEMORY_ALIGNMENT )
( 1 << Kokkos::Impl::power_of_two< KOKKOS_MEMORY_ALIGNMENT >::value )
#else
( 1 << Kokkos::Impl::power_of_two< 128 >::value )
#endif
, MEMORY_ALIGNMENT_THRESHOLD = 4
};
} //namespace Impl
} // namespace Kokkos
#endif /* #ifndef KOKKOS_MEMORYTRAITS_HPP */
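One common use of the traits above, sketched here with an illustrative buffer name and size: wrapping memory that Kokkos does not own in an unmanaged View, so that no reference counting or deallocation is performed.

#include <Kokkos_Core.hpp>
#include <vector>

void wrap_user_buffer()
{
  std::vector<double> buf( 128 , 0.0 );

  // Unmanaged: Kokkos neither allocates nor deallocates this memory.
  typedef Kokkos::View< double* , Kokkos::HostSpace , Kokkos::MemoryUnmanaged > unmanaged_view ;
  unmanaged_view v( buf.data() , buf.size() );

  v(0) = 42.0 ;   // reads and writes go straight to buf
}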
diff --git a/lib/kokkos/core/src/Kokkos_OpenMP.hpp b/lib/kokkos/core/src/Kokkos_OpenMP.hpp
index 7c75357e5..508da04c8 100755
--- a/lib/kokkos/core/src/Kokkos_OpenMP.hpp
+++ b/lib/kokkos/core/src/Kokkos_OpenMP.hpp
@@ -1,176 +1,175 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos
-// Manycore Performance-Portable Multidimensional Arrays
-//
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_OPENMP_HPP
#define KOKKOS_OPENMP_HPP
#include <Kokkos_Core_fwd.hpp>
#if defined( KOKKOS_HAVE_OPENMP ) && defined( _OPENMP )
#include <omp.h>
#include <cstddef>
#include <iosfwd>
#include <Kokkos_HostSpace.hpp>
#include <Kokkos_ScratchSpace.hpp>
#include <Kokkos_Parallel.hpp>
#include <Kokkos_Layout.hpp>
#include <impl/Kokkos_Tags.hpp>
/*--------------------------------------------------------------------------*/
namespace Kokkos {
/// \class OpenMP
/// \brief Kokkos device for multicore processors in the host memory space.
class OpenMP {
public:
//------------------------------------
//! \name Type declarations that all Kokkos devices must provide.
//@{
//! Tag this class as a kokkos execution space
typedef OpenMP execution_space ;
typedef HostSpace memory_space ;
+ //! This execution space's preferred device_type
+ typedef Kokkos::Device<execution_space,memory_space> device_type;
+
typedef LayoutRight array_layout ;
typedef HostSpace::size_type size_type ;
typedef ScratchMemorySpace< OpenMP > scratch_memory_space ;
- //! For backward compatibility
- typedef OpenMP device_type ;
//@}
//------------------------------------
//! \name Functions that all Kokkos devices must implement.
//@{
inline static bool in_parallel() { return omp_in_parallel(); }
/** \brief Set the device in a "sleep" state. A noop for OpenMP. */
static bool sleep();
/** \brief Wake the device from the 'sleep' state. A noop for OpenMP. */
static bool wake();
/** \brief Wait until all dispatched functors complete. A noop for OpenMP. */
static void fence() {}
/// \brief Print configuration information to the given output stream.
static void print_configuration( std::ostream & , const bool detail = false );
/// \brief Free any resources being consumed by the device.
static void finalize();
/** \brief Initialize the device.
*
* 1) If the hardware locality library is enabled and OpenMP has not
* already bound threads then bind OpenMP threads to maximize
* core utilization and group for memory hierarchy locality.
*
* 2) Allocate a HostThread for each OpenMP thread to hold its
* topology and fan in/out data.
*/
static void initialize( unsigned thread_count = 0 ,
unsigned use_numa_count = 0 ,
unsigned use_cores_per_numa = 0 );
static int is_initialized();
//@}
//------------------------------------
/** \brief This execution space has a topological thread pool which can be queried.
*
* All threads within a pool have a common memory space for which they are cache coherent.
* depth = 0 gives the number of threads in the whole pool.
* depth = 1 gives the number of threads in a NUMA region, typically sharing L3 cache.
* depth = 2 gives the number of threads at the finest granularity, typically sharing L1 cache.
*/
inline static int thread_pool_size( int depth = 0 );
/** \brief The rank of the executing thread in this thread pool */
KOKKOS_INLINE_FUNCTION static int thread_pool_rank();
//------------------------------------
inline static unsigned max_hardware_threads() { return thread_pool_size(0); }
KOKKOS_INLINE_FUNCTION static
unsigned hardware_thread_id() { return thread_pool_rank(); }
};
} // namespace Kokkos
/*--------------------------------------------------------------------------*/
/*--------------------------------------------------------------------------*/
namespace Kokkos {
namespace Impl {
template<>
struct VerifyExecutionCanAccessMemorySpace
< Kokkos::OpenMP::memory_space
, Kokkos::OpenMP::scratch_memory_space
>
{
enum { value = true };
inline static void verify( void ) { }
inline static void verify( const void * ) { }
};
} // namespace Impl
} // namespace Kokkos
/*--------------------------------------------------------------------------*/
/*--------------------------------------------------------------------------*/
#include <OpenMP/Kokkos_OpenMPexec.hpp>
#include <OpenMP/Kokkos_OpenMP_Parallel.hpp>
/*--------------------------------------------------------------------------*/
#endif /* #if defined( KOKKOS_HAVE_OPENMP ) && defined( _OPENMP ) */
#endif /* #ifndef KOKKOS_OPENMP_HPP */
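A host-only sketch of driving this execution space directly; the thread count of 8 is an assumed example value (passing 0 lets the library choose defaults):

#include <Kokkos_Core.hpp>
#include <iostream>

int main()
{
#if defined( KOKKOS_HAVE_OPENMP )
  Kokkos::OpenMP::initialize( 8 );

  std::cout << "threads in pool : " << Kokkos::OpenMP::thread_pool_size(0) << std::endl;
  std::cout << "threads per NUMA: " << Kokkos::OpenMP::thread_pool_size(1) << std::endl;

  Kokkos::OpenMP::print_configuration( std::cout );
  Kokkos::OpenMP::finalize();
#endif
  return 0 ;
}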
diff --git a/lib/kokkos/core/src/Kokkos_Pair.hpp b/lib/kokkos/core/src/Kokkos_Pair.hpp
index c69273cb8..52de637a5 100755
--- a/lib/kokkos/core/src/Kokkos_Pair.hpp
+++ b/lib/kokkos/core/src/Kokkos_Pair.hpp
@@ -1,457 +1,498 @@
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+
/// \file Kokkos_Pair.hpp
/// \brief Declaration and definition of Kokkos::pair.
///
/// This header file declares and defines Kokkos::pair and its related
/// nonmember functions.
#ifndef KOKKOS_PAIR_HPP
#define KOKKOS_PAIR_HPP
#include <Kokkos_Macros.hpp>
#include <utility>
namespace Kokkos {
/// \struct pair
/// \brief Replacement for std::pair that works on CUDA devices.
///
/// The instance methods of std::pair, including its constructors, are
/// not marked as <tt>__device__</tt> functions. Thus, they cannot be
/// called on a CUDA device, such as an NVIDIA GPU. This struct
/// implements the same interface as std::pair, but can be used on a
/// CUDA device as well as on the host.
template <class T1, class T2>
struct pair
{
//! The first template parameter of this class.
typedef T1 first_type;
//! The second template parameter of this class.
typedef T2 second_type;
//! The first element of the pair.
first_type first;
//! The second element of the pair.
second_type second;
/// \brief Default constructor.
///
/// This calls the default constructors of T1 and T2. It won't
/// compile if those default constructors are not defined and
/// public.
KOKKOS_FORCEINLINE_FUNCTION
pair()
: first(), second()
{}
/// \brief Constructor that takes both elements of the pair.
///
/// This calls the copy constructors of T1 and T2. It won't compile
/// if those copy constructors are not defined and public.
KOKKOS_FORCEINLINE_FUNCTION
pair(first_type const& f, second_type const& s)
: first(f), second(s)
{}
/// \brief Copy constructor.
///
/// This calls the copy constructors of T1 and T2. It won't compile
/// if those copy constructors are not defined and public.
template <class U, class V>
KOKKOS_FORCEINLINE_FUNCTION
pair( const pair<U,V> &p)
: first(p.first), second(p.second)
{}
/// \brief Assignment operator.
///
/// This calls the assignment operators of T1 and T2. It won't
/// compile if the assignment operators are not defined and public.
template <class U, class V>
KOKKOS_FORCEINLINE_FUNCTION
pair<T1, T2> & operator=(const pair<U,V> &p)
{
first = p.first;
second = p.second;
return *this;
}
// from std::pair<U,V>
template <class U, class V>
pair( const std::pair<U,V> &p)
: first(p.first), second(p.second)
{}
/// \brief Return the std::pair version of this object.
///
/// This is <i>not</i> a device function; you may not call it on a
/// CUDA device. It is meant to be called on the host, if the user
/// wants an std::pair instead of a Kokkos::pair.
///
/// \note This is not a conversion operator, since defining a
/// conversion operator made the relational operators have
/// ambiguous definitions.
std::pair<T1,T2> to_std_pair() const
{ return std::make_pair(first,second); }
};
template <class T1, class T2>
struct pair<T1&, T2&>
{
//! The first template parameter of this class.
typedef T1& first_type;
//! The second template parameter of this class.
typedef T2& second_type;
//! The first element of the pair.
first_type first;
//! The second element of the pair.
second_type second;
/// \brief Constructor that takes both elements of the pair.
///
/// This calls the copy constructors of T1 and T2. It won't compile
/// if those copy constructors are not defined and public.
KOKKOS_FORCEINLINE_FUNCTION
pair(first_type f, second_type s)
: first(f), second(s)
{}
/// \brief Copy constructor.
///
/// This calls the copy constructors of T1 and T2. It won't compile
/// if those copy constructors are not defined and public.
template <class U, class V>
KOKKOS_FORCEINLINE_FUNCTION
pair( const pair<U,V> &p)
: first(p.first), second(p.second)
{}
// from std::pair<U,V>
template <class U, class V>
pair( const std::pair<U,V> &p)
: first(p.first), second(p.second)
{}
/// \brief Assignment operator.
///
/// This calls the assignment operators of T1 and T2. It won't
/// compile if the assignment operators are not defined and public.
template <class U, class V>
KOKKOS_FORCEINLINE_FUNCTION
pair<first_type, second_type> & operator=(const pair<U,V> &p)
{
first = p.first;
second = p.second;
return *this;
}
/// \brief Return the std::pair version of this object.
///
/// This is <i>not</i> a device function; you may not call it on a
/// CUDA device. It is meant to be called on the host, if the user
/// wants an std::pair instead of a Kokkos::pair.
///
/// \note This is not a conversion operator, since defining a
/// conversion operator made the relational operators have
/// ambiguous definitions.
std::pair<T1,T2> to_std_pair() const
{ return std::make_pair(first,second); }
};
template <class T1, class T2>
struct pair<T1, T2&>
{
//! The first template parameter of this class.
typedef T1 first_type;
//! The second template parameter of this class.
typedef T2& second_type;
//! The first element of the pair.
first_type first;
//! The second element of the pair.
second_type second;
/// \brief Constructor that takes both elements of the pair.
///
/// This calls the copy constructors of T1 and T2. It won't compile
/// if those copy constructors are not defined and public.
KOKKOS_FORCEINLINE_FUNCTION
pair(first_type const& f, second_type s)
: first(f), second(s)
{}
/// \brief Copy constructor.
///
/// This calls the copy constructors of T1 and T2. It won't compile
/// if those copy constructors are not defined and public.
template <class U, class V>
KOKKOS_FORCEINLINE_FUNCTION
pair( const pair<U,V> &p)
: first(p.first), second(p.second)
{}
// from std::pair<U,V>
template <class U, class V>
pair( const std::pair<U,V> &p)
: first(p.first), second(p.second)
{}
/// \brief Assignment operator.
///
/// This calls the assignment operators of T1 and T2. It won't
/// compile if the assignment operators are not defined and public.
template <class U, class V>
KOKKOS_FORCEINLINE_FUNCTION
pair<first_type, second_type> & operator=(const pair<U,V> &p)
{
first = p.first;
second = p.second;
return *this;
}
/// \brief Return the std::pair version of this object.
///
/// This is <i>not</i> a device function; you may not call it on a
/// CUDA device. It is meant to be called on the host, if the user
/// wants an std::pair instead of a Kokkos::pair.
///
/// \note This is not a conversion operator, since defining a
/// conversion operator made the relational operators have
/// ambiguous definitions.
std::pair<T1,T2> to_std_pair() const
{ return std::make_pair(first,second); }
};
template <class T1, class T2>
struct pair<T1&, T2>
{
//! The first template parameter of this class.
typedef T1& first_type;
//! The second template parameter of this class.
typedef T2 second_type;
//! The first element of the pair.
first_type first;
//! The second element of the pair.
second_type second;
/// \brief Constructor that takes both elements of the pair.
///
/// This calls the copy constructors of T1 and T2. It won't compile
/// if those copy constructors are not defined and public.
KOKKOS_FORCEINLINE_FUNCTION
pair(first_type f, second_type const& s)
: first(f), second(s)
{}
/// \brief Copy constructor.
///
/// This calls the copy constructors of T1 and T2. It won't compile
/// if those copy constructors are not defined and public.
template <class U, class V>
KOKKOS_FORCEINLINE_FUNCTION
pair( const pair<U,V> &p)
: first(p.first), second(p.second)
{}
// from std::pair<U,V>
template <class U, class V>
pair( const std::pair<U,V> &p)
: first(p.first), second(p.second)
{}
/// \brief Assignment operator.
///
/// This calls the assignment operators of T1 and T2. It won't
/// compile if the assignment operators are not defined and public.
template <class U, class V>
KOKKOS_FORCEINLINE_FUNCTION
pair<first_type, second_type> & operator=(const pair<U,V> &p)
{
first = p.first;
second = p.second;
return *this;
}
/// \brief Return the std::pair version of this object.
///
/// This is <i>not</i> a device function; you may not call it on a
/// CUDA device. It is meant to be called on the host, if the user
/// wants an std::pair instead of a Kokkos::pair.
///
/// \note This is not a conversion operator, since defining a
/// conversion operator made the relational operators have
/// ambiguous definitions.
std::pair<T1,T2> to_std_pair() const
{ return std::make_pair(first,second); }
};
//! Equality operator for Kokkos::pair.
template <class T1, class T2>
KOKKOS_FORCEINLINE_FUNCTION
bool operator== (const pair<T1,T2>& lhs, const pair<T1,T2>& rhs)
{ return lhs.first==rhs.first && lhs.second==rhs.second; }
//! Inequality operator for Kokkos::pair.
template <class T1, class T2>
KOKKOS_FORCEINLINE_FUNCTION
bool operator!= (const pair<T1,T2>& lhs, const pair<T1,T2>& rhs)
{ return !(lhs==rhs); }
//! Less-than operator for Kokkos::pair.
template <class T1, class T2>
KOKKOS_FORCEINLINE_FUNCTION
bool operator< (const pair<T1,T2>& lhs, const pair<T1,T2>& rhs)
{ return lhs.first<rhs.first || (!(rhs.first<lhs.first) && lhs.second<rhs.second); }
//! Less-than-or-equal-to operator for Kokkos::pair.
template <class T1, class T2>
KOKKOS_FORCEINLINE_FUNCTION
bool operator<= (const pair<T1,T2>& lhs, const pair<T1,T2>& rhs)
{ return !(rhs<lhs); }
//! Greater-than operator for Kokkos::pair.
template <class T1, class T2>
KOKKOS_FORCEINLINE_FUNCTION
bool operator> (const pair<T1,T2>& lhs, const pair<T1,T2>& rhs)
{ return rhs<lhs; }
//! Greater-than-or-equal-to operator for Kokkos::pair.
template <class T1, class T2>
KOKKOS_FORCEINLINE_FUNCTION
bool operator>= (const pair<T1,T2>& lhs, const pair<T1,T2>& rhs)
{ return !(lhs<rhs); }
/// \brief Return a new pair.
///
/// This is a "nonmember constructor" for Kokkos::pair. It works just
/// like std::make_pair.
template <class T1,class T2>
KOKKOS_FORCEINLINE_FUNCTION
pair<T1,T2> make_pair (T1 x, T2 y)
{ return ( pair<T1,T2>(x,y) ); }
/// \brief Return a pair of references to the input arguments.
///
/// This is analogous to std::tie (new in C++11). You can use it to
/// assign to two variables at once, from the result of a function
/// that returns a pair. For example (<tt>__device__</tt> and
/// <tt>__host__</tt> attributes omitted for brevity):
/// \code
/// // Declaration of the function to call.
/// // First return value: operation count.
/// // Second return value: whether all operations succeeded.
/// Kokkos::pair<int, bool> someFunction ();
///
/// // Code that uses Kokkos::tie.
/// int myFunction () {
/// int count = 0;
/// bool success = false;
///
/// // This assigns to both count and success.
/// Kokkos::tie (count, success) = someFunction ();
///
/// if (! success) {
/// // ... Some operation failed;
/// // take corrective action ...
/// }
/// return count;
/// }
/// \endcode
///
/// The line that uses tie() could have been written like this:
/// \code
/// Kokkos::pair<int, bool> result = someFunction ();
/// count = result.first;
/// success = result.second;
/// \endcode
///
/// Using tie() saves two lines of code and avoids a copy of each
/// element of the pair. The latter could be significant if one or
/// both elements of the pair are more substantial objects than \c int
/// or \c bool.
template <class T1,class T2>
KOKKOS_FORCEINLINE_FUNCTION
pair<T1 &,T2 &> tie (T1 & x, T2 & y)
{ return ( pair<T1 &,T2 &>(x,y) ); }
//
// Specialization of Kokkos::pair for a \c void second argument. This
// is not actually a "pair"; it only contains one element, the first.
//
template <class T1>
struct pair<T1,void>
{
typedef T1 first_type;
typedef void second_type;
first_type first;
enum { second = 0 };
KOKKOS_FORCEINLINE_FUNCTION
pair()
: first()
{}
KOKKOS_FORCEINLINE_FUNCTION
pair(const first_type & f)
: first(f)
{}
KOKKOS_FORCEINLINE_FUNCTION
pair(const first_type & f, int)
: first(f)
{}
template <class U>
KOKKOS_FORCEINLINE_FUNCTION
pair( const pair<U,void> &p)
: first(p.first)
{}
template <class U>
KOKKOS_FORCEINLINE_FUNCTION
pair<T1, void> & operator=(const pair<U,void> &p)
{
first = p.first;
return *this;
}
};
//
// Specialization of relational operators for Kokkos::pair<T1,void>.
//
template <class T1>
KOKKOS_FORCEINLINE_FUNCTION
bool operator== (const pair<T1,void>& lhs, const pair<T1,void>& rhs)
{ return lhs.first==rhs.first; }
template <class T1>
KOKKOS_FORCEINLINE_FUNCTION
bool operator!= (const pair<T1,void>& lhs, const pair<T1,void>& rhs)
{ return !(lhs==rhs); }
template <class T1>
KOKKOS_FORCEINLINE_FUNCTION
bool operator< (const pair<T1,void>& lhs, const pair<T1,void>& rhs)
{ return lhs.first<rhs.first; }
template <class T1>
KOKKOS_FORCEINLINE_FUNCTION
bool operator<= (const pair<T1,void>& lhs, const pair<T1,void>& rhs)
{ return !(rhs<lhs); }
template <class T1>
KOKKOS_FORCEINLINE_FUNCTION
bool operator> (const pair<T1,void>& lhs, const pair<T1,void>& rhs)
{ return rhs<lhs; }
template <class T1>
KOKKOS_FORCEINLINE_FUNCTION
bool operator>= (const pair<T1,void>& lhs, const pair<T1,void>& rhs)
{ return !(lhs<rhs); }
} // namespace Kokkos
#endif //KOKKOS_PAIR_HPP
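Beyond tie(), a typical reason for Kokkos::pair is that it can serve as a reduction value type in device code where std::pair cannot. A sketch under that assumption (the functor name, sentinel values, and dispatch comment are illustrative, not from this patch):

#include <Kokkos_Core.hpp>

struct MinMax {
  typedef Kokkos::pair<double,double> value_type ;   // (minimum, maximum)

  Kokkos::View<const double*> x ;

  KOKKOS_INLINE_FUNCTION
  void init( value_type & v ) const
  { v.first = 1.0e+300 ; v.second = -1.0e+300 ; }

  KOKKOS_INLINE_FUNCTION
  void operator()( const int i , value_type & v ) const
  {
    if ( x(i) < v.first  ) v.first  = x(i);
    if ( x(i) > v.second ) v.second = x(i);
  }

  KOKKOS_INLINE_FUNCTION
  void join( volatile value_type & dst , const volatile value_type & src ) const
  {
    if ( src.first  < dst.first  ) dst.first  = src.first ;
    if ( src.second > dst.second ) dst.second = src.second ;
  }
  // Dispatched e.g. via Kokkos::parallel_reduce( n , MinMax{ x } , result );
};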
diff --git a/lib/kokkos/core/src/Kokkos_Parallel.hpp b/lib/kokkos/core/src/Kokkos_Parallel.hpp
index 609dfee4b..d714485e7 100755
--- a/lib/kokkos/core/src/Kokkos_Parallel.hpp
+++ b/lib/kokkos/core/src/Kokkos_Parallel.hpp
@@ -1,598 +1,908 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
/// \file Kokkos_Parallel.hpp
/// \brief Declaration of parallel operators
#ifndef KOKKOS_PARALLEL_HPP
#define KOKKOS_PARALLEL_HPP
#include <cstddef>
#include <Kokkos_Core_fwd.hpp>
#include <Kokkos_View.hpp>
#include <Kokkos_ExecPolicy.hpp>
+#ifdef KOKKOSP_ENABLE_PROFILING
+#include <impl/Kokkos_Profiling_Interface.hpp>
+#include <typeinfo>
+#endif
+
+#include <impl/Kokkos_AllocationTracker.hpp>
#include <impl/Kokkos_Tags.hpp>
#include <impl/Kokkos_Traits.hpp>
#include <impl/Kokkos_FunctorAdapter.hpp>
+#ifdef KOKKOS_HAVE_DEBUG
+#include<iostream>
+#endif
+
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
//----------------------------------------------------------------------------
/** \brief Given a Functor and Execution Policy query an execution space.
*
* if the Policy has an execution space use that
* else if the Functor has an execution_space use that
* else if the Functor has a device_type use that for backward compatibility
* else use the default
*/
template< class Functor
, class Policy
, class EnableFunctor = void
, class EnablePolicy = void
>
struct FunctorPolicyExecutionSpace {
typedef Kokkos::DefaultExecutionSpace execution_space ;
};
template< class Functor , class Policy >
struct FunctorPolicyExecutionSpace
< Functor , Policy
, typename enable_if_type< typename Functor::device_type >::type
, typename enable_if_type< typename Policy ::execution_space >::type
>
{
typedef typename Policy ::execution_space execution_space ;
};
template< class Functor , class Policy >
struct FunctorPolicyExecutionSpace
< Functor , Policy
, typename enable_if_type< typename Functor::execution_space >::type
, typename enable_if_type< typename Policy ::execution_space >::type
>
{
typedef typename Policy ::execution_space execution_space ;
};
template< class Functor , class Policy , class EnableFunctor >
struct FunctorPolicyExecutionSpace
< Functor , Policy
, EnableFunctor
, typename enable_if_type< typename Policy::execution_space >::type
>
{
typedef typename Policy ::execution_space execution_space ;
};
template< class Functor , class Policy , class EnablePolicy >
struct FunctorPolicyExecutionSpace
< Functor , Policy
, typename enable_if_type< typename Functor::device_type >::type
, EnablePolicy
>
{
typedef typename Functor::device_type execution_space ;
};
template< class Functor , class Policy , class EnablePolicy >
struct FunctorPolicyExecutionSpace
< Functor , Policy
, typename enable_if_type< typename Functor::execution_space >::type
, EnablePolicy
>
{
typedef typename Functor::execution_space execution_space ;
};
//----------------------------------------------------------------------------
/// \class ParallelFor
/// \brief Implementation of the ParallelFor operator that has a
/// partial specialization for the device.
///
/// This is an implementation detail of parallel_for. Users should
/// skip this and go directly to the nonmember function parallel_for.
template< class FunctorType , class ExecPolicy > class ParallelFor ;
/// \class ParallelReduce
/// \brief Implementation detail of parallel_reduce.
///
/// This is an implementation detail of parallel_reduce. Users should
/// skip this and go directly to the nonmember function parallel_reduce.
template< class FunctorType , class ExecPolicy > class ParallelReduce ;
/// \class ParallelScan
/// \brief Implementation detail of parallel_scan.
///
/// This is an implementation detail of parallel_scan. Users should
/// skip this and go directly to the documentation of the nonmember
/// template function Kokkos::parallel_scan.
template< class FunctorType , class ExecPolicy > class ParallelScan ;
} // namespace Impl
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
/** \brief Execute \c functor in parallel according to the execution \c policy.
*
* A "functor" is a class containing the function to execute in parallel,
* data needed for that execution, and an optional \c execution_space
* typedef. Here is an example functor for parallel_for:
*
* \code
* class FunctorType {
* public:
* typedef ... execution_space ;
* void operator() ( WorkType iwork ) const ;
* };
* \endcode
*
* In the above example, \c WorkType is any integer type for which a
 * valid conversion from \c size_t to \c WorkType exists. Its
* <tt>operator()</tt> method defines the operation to parallelize,
* over the range of integer indices <tt>iwork=[0,work_count-1]</tt>.
 * This corresponds to a single iteration \c iwork of a \c for loop.
* If \c execution_space is not defined DefaultExecutionSpace will be used.
*/
template< class ExecPolicy , class FunctorType >
inline
void parallel_for( const ExecPolicy & policy
, const FunctorType & functor
+ , const std::string& str = ""
, typename Impl::enable_if< ! Impl::is_integral< ExecPolicy >::value >::type * = 0
)
{
- (void) Impl::ParallelFor< FunctorType , ExecPolicy >( functor , policy );
+#ifdef KOKKOSP_ENABLE_PROFILING
+ uint64_t kpID = 0;
+ if(Kokkos::Experimental::profileLibraryLoaded()) {
+ Kokkos::Experimental::beginParallelFor("" == str ? typeid(FunctorType).name() : str, 0, &kpID);
+ }
+#endif
+
+ (void) Impl::ParallelFor< FunctorType , ExecPolicy >( Impl::CopyWithoutTracking::apply(functor) , policy );
+
+#ifdef KOKKOSP_ENABLE_PROFILING
+ if(Kokkos::Experimental::profileLibraryLoaded()) {
+ Kokkos::Experimental::endParallelFor(kpID);
+ }
+#endif
}
template< class FunctorType >
inline
-void parallel_for( const size_t work_count ,
- const FunctorType & functor )
+void parallel_for( const size_t work_count
+ , const FunctorType & functor
+ , const std::string& str = ""
+ )
{
typedef typename
Impl::FunctorPolicyExecutionSpace< FunctorType , void >::execution_space
execution_space ;
typedef RangePolicy< execution_space > policy ;
- (void) Impl::ParallelFor< FunctorType , policy >( functor , policy(0,work_count) );
+
+#ifdef KOKKOSP_ENABLE_PROFILING
+ uint64_t kpID = 0;
+ if(Kokkos::Experimental::profileLibraryLoaded()) {
+ Kokkos::Experimental::beginParallelFor("" == str ? typeid(FunctorType).name() : str, 0, &kpID);
+ }
+#endif
+
+ (void) Impl::ParallelFor< FunctorType , policy >( Impl::CopyWithoutTracking::apply(functor) , policy(0,work_count) );
+
+#ifdef KOKKOSP_ENABLE_PROFILING
+ if(Kokkos::Experimental::profileLibraryLoaded()) {
+ Kokkos::Experimental::endParallelFor(kpID);
+ }
+#endif
+}
+
+template< class ExecPolicy , class FunctorType >
+inline
+void parallel_for( const std::string & str
+ , const ExecPolicy & policy
+ , const FunctorType & functor )
+{
+ #if KOKKOS_ENABLE_DEBUG_PRINT_KERNEL_NAMES
+ Kokkos::fence();
+ std::cout << "KOKKOS_DEBUG Start parallel_for kernel: " << str << std::endl;
+ #endif
+
+ parallel_for(policy,functor,str);
+
+ #if KOKKOS_ENABLE_DEBUG_PRINT_KERNEL_NAMES
+ Kokkos::fence();
+ std::cout << "KOKKOS_DEBUG End parallel_for kernel: " << str << std::endl;
+ #endif
+ (void) str;
}
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
/** \brief Parallel reduction
*
* Example of a parallel_reduce functor for a POD (plain old data) value type:
* \code
* class FunctorType { // For POD value type
* public:
* typedef ... execution_space ;
* typedef <podType> value_type ;
* void operator()( <intType> iwork , <podType> & update ) const ;
* void init( <podType> & update ) const ;
* void join( volatile <podType> & update ,
* volatile const <podType> & input ) const ;
*
* typedef true_type has_final ;
* void final( <podType> & update ) const ;
* };
* \endcode
*
* Example of a parallel_reduce functor for an array of POD (plain old data) values:
* \code
* class FunctorType { // For array of POD value
* public:
* typedef ... execution_space ;
* typedef <podType> value_type[] ;
* void operator()( <intType> , <podType> update[] ) const ;
* void init( <podType> update[] ) const ;
* void join( volatile <podType> update[] ,
* volatile const <podType> input[] ) const ;
*
* typedef true_type has_final ;
* void final( <podType> update[] ) const ;
* };
* \endcode
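 *
 * A minimal usage sketch for a scalar sum (the functor name \c SumFunctor,
 * the view \c x, and the length \c n are illustrative):
 * \code
 * struct SumFunctor {
 *   typedef double value_type ;
 *   Kokkos::View<const double*> x ;
 *   SumFunctor( Kokkos::View<const double*> x_ ) : x(x_) {}
 *   KOKKOS_INLINE_FUNCTION
 *   void operator()( const size_t i , value_type & update ) const { update += x(i) ; }
 *   KOKKOS_INLINE_FUNCTION
 *   void init( value_type & update ) const { update = 0 ; }
 *   KOKKOS_INLINE_FUNCTION
 *   void join( volatile value_type & update , volatile const value_type & input ) const { update += input ; }
 * };
 *
 * double sum = 0 ;
 * Kokkos::parallel_reduce( n , SumFunctor(x) , sum );
 * \endcode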
*/
template< class ExecPolicy , class FunctorType >
inline
void parallel_reduce( const ExecPolicy & policy
, const FunctorType & functor
+ , const std::string& str = ""
, typename Impl::enable_if< ! Impl::is_integral< ExecPolicy >::value >::type * = 0
)
{
- (void) Impl::ParallelReduce< FunctorType , ExecPolicy >( functor , policy );
+ // typedef typename
+ // Impl::FunctorPolicyExecutionSpace< FunctorType , ExecPolicy >::execution_space
+ // execution_space ;
+
+ typedef Kokkos::Impl::FunctorValueTraits< FunctorType , typename ExecPolicy::work_tag > ValueTraits ;
+
+ typedef typename Kokkos::Impl::if_c< (ValueTraits::StaticValueSize != 0)
+ , typename ValueTraits::value_type
+ , typename ValueTraits::pointer_type
+ >::type value_type ;
+
+ Kokkos::View< value_type
+ , HostSpace
+ , Kokkos::MemoryUnmanaged
+ >
+ result_view ;
+
+#ifdef KOKKOSP_ENABLE_PROFILING
+ uint64_t kpID = 0;
+ if(Kokkos::Experimental::profileLibraryLoaded()) {
+ Kokkos::Experimental::beginParallelReduce("" == str ? typeid(FunctorType).name() : str, 0, &kpID);
+ }
+#endif
+
+ (void) Impl::ParallelReduce< FunctorType , ExecPolicy >( Impl::CopyWithoutTracking::apply(functor) , policy , result_view );
+
+#ifdef KOKKOSP_ENABLE_PROFILING
+ if(Kokkos::Experimental::profileLibraryLoaded()) {
+ Kokkos::Experimental::endParallelReduce(kpID);
+ }
+#endif
}
// integral range policy
template< class FunctorType >
inline
void parallel_reduce( const size_t work_count
, const FunctorType & functor
+ , const std::string& str = ""
)
{
typedef typename
Impl::FunctorPolicyExecutionSpace< FunctorType , void >::execution_space
execution_space ;
typedef RangePolicy< execution_space > policy ;
typedef Kokkos::Impl::FunctorValueTraits< FunctorType , void > ValueTraits ;
typedef typename Kokkos::Impl::if_c< (ValueTraits::StaticValueSize != 0)
, typename ValueTraits::value_type
, typename ValueTraits::pointer_type
>::type value_type ;
Kokkos::View< value_type
, HostSpace
, Kokkos::MemoryUnmanaged
>
result_view ;
- (void) Impl::ParallelReduce< FunctorType , policy >( functor , policy(0,work_count) , result_view );
+#ifdef KOKKOSP_ENABLE_PROFILING
+ uint64_t kpID = 0;
+ if(Kokkos::Experimental::profileLibraryLoaded()) {
+ Kokkos::Experimental::beginParallelReduce("" == str ? typeid(FunctorType).name() : str, 0, &kpID);
+ }
+#endif
+
+ (void) Impl::ParallelReduce< FunctorType , policy >( Impl::CopyWithoutTracking::apply(functor) , policy(0,work_count) , result_view );
+
+#ifdef KOKKOSP_ENABLE_PROFILING
+ if(Kokkos::Experimental::profileLibraryLoaded()) {
+ Kokkos::Experimental::endParallelReduce(kpID);
+ }
+#endif
+
}
// general policy and view output
template< class ExecPolicy , class FunctorType , class ViewType >
inline
void parallel_reduce( const ExecPolicy & policy
, const FunctorType & functor
, const ViewType & result_view
+ , const std::string& str = ""
, typename Impl::enable_if<
( Impl::is_view<ViewType>::value && ! Impl::is_integral< ExecPolicy >::value
+#ifdef KOKKOS_HAVE_CUDA
+ && ! Impl::is_same<typename ExecPolicy::execution_space,Kokkos::Cuda>::value
+#endif
)>::type * = 0 )
{
- (void) Impl::ParallelReduce< FunctorType, ExecPolicy >( functor , policy , result_view );
+
+#ifdef KOKKOSP_ENABLE_PROFILING
+ uint64_t kpID = 0;
+ if(Kokkos::Experimental::profileLibraryLoaded()) {
+ Kokkos::Experimental::beginParallelReduce("" == str ? typeid(FunctorType).name() : str, 0, &kpID);
+ }
+#endif
+
+ (void) Impl::ParallelReduce< FunctorType, ExecPolicy >( Impl::CopyWithoutTracking::apply(functor) , policy , Impl::CopyWithoutTracking::apply(result_view) );
+
+#ifdef KOKKOSP_ENABLE_PROFILING
+ if(Kokkos::Experimental::profileLibraryLoaded()) {
+ Kokkos::Experimental::endParallelReduce(kpID);
+ }
+#endif
+
}
// general policy and pod or array of pod output
template< class ExecPolicy , class FunctorType >
-inline
void parallel_reduce( const ExecPolicy & policy
, const FunctorType & functor
+#ifdef KOKKOS_HAVE_CUDA
, typename Impl::enable_if<
- ( ! Impl::is_integral< ExecPolicy >::value )
- , typename Kokkos::Impl::FunctorValueTraits< FunctorType , typename ExecPolicy::work_tag >::reference_type
- >::type result_ref )
+ ( ! Impl::is_integral< ExecPolicy >::value &&
+ ! Impl::is_same<typename ExecPolicy::execution_space,Kokkos::Cuda>::value )
+ , typename Kokkos::Impl::FunctorValueTraits< FunctorType , typename ExecPolicy::work_tag >::reference_type>::type result_ref
+ , const std::string& str = ""
+ , typename Impl::enable_if<! Impl::is_same<typename ExecPolicy::execution_space,Kokkos::Cuda>::value >::type* = 0
+ )
+#else
+ , typename Impl::enable_if<
+ ( ! Impl::is_integral< ExecPolicy >::value)
+ , typename Kokkos::Impl::FunctorValueTraits< FunctorType , typename ExecPolicy::work_tag >::reference_type
+ >::type result_ref
+ , const std::string& str = ""
+ )
+#endif
{
typedef Kokkos::Impl::FunctorValueTraits< FunctorType , typename ExecPolicy::work_tag > ValueTraits ;
typedef Kokkos::Impl::FunctorValueOps< FunctorType , typename ExecPolicy::work_tag > ValueOps ;
// Wrap the result output request in a view to inform the implementation
// of the type and memory space.
typedef typename Kokkos::Impl::if_c< (ValueTraits::StaticValueSize != 0)
, typename ValueTraits::value_type
, typename ValueTraits::pointer_type
>::type value_type ;
Kokkos::View< value_type
, HostSpace
, Kokkos::MemoryUnmanaged
>
result_view( ValueOps::pointer( result_ref )
, ValueTraits::value_count( functor )
);
- (void) Impl::ParallelReduce< FunctorType, ExecPolicy >( functor , policy , result_view );
+#ifdef KOKKOSP_ENABLE_PROFILING
+ uint64_t kpID = 0;
+ if(Kokkos::Experimental::profileLibraryLoaded()) {
+ Kokkos::Experimental::beginParallelReduce("" == str ? typeid(FunctorType).name() : str, 0, &kpID);
+ }
+#endif
+
+ (void) Impl::ParallelReduce< FunctorType, ExecPolicy >( Impl::CopyWithoutTracking::apply(functor) , policy , Impl::CopyWithoutTracking::apply(result_view) );
+
+#ifdef KOKKOSP_ENABLE_PROFILING
+ if(Kokkos::Experimental::profileLibraryLoaded()) {
+ Kokkos::Experimental::endParallelReduce(kpID);
+ }
+#endif
+
}
// integral range policy and view output
template< class FunctorType , class ViewType >
inline
void parallel_reduce( const size_t work_count
, const FunctorType & functor
, const ViewType & result_view
- , typename Impl::enable_if<( Impl::is_view<ViewType>::value )>::type * = 0 )
+ , const std::string& str = ""
+ , typename Impl::enable_if<( Impl::is_view<ViewType>::value
+#ifdef KOKKOS_HAVE_CUDA
+ && ! Impl::is_same<
+ typename Impl::FunctorPolicyExecutionSpace< FunctorType , void >::execution_space,
+ Kokkos::Cuda>::value
+#endif
+ )>::type * = 0 )
{
typedef typename
Impl::FunctorPolicyExecutionSpace< FunctorType , void >::execution_space
execution_space ;
typedef RangePolicy< execution_space > ExecPolicy ;
- (void) Impl::ParallelReduce< FunctorType, ExecPolicy >( functor , ExecPolicy(0,work_count) , result_view );
+#ifdef KOKKOSP_ENABLE_PROFILING
+ uint64_t kpID = 0;
+ if(Kokkos::Experimental::profileLibraryLoaded()) {
+ Kokkos::Experimental::beginParallelReduce("" == str ? typeid(FunctorType).name() : str, 0, &kpID);
+ }
+#endif
+
+ (void) Impl::ParallelReduce< FunctorType, ExecPolicy >( Impl::CopyWithoutTracking::apply(functor) , ExecPolicy(0,work_count) , Impl::CopyWithoutTracking::apply(result_view) );
+
+#ifdef KOKKOSP_ENABLE_PROFILING
+ if(Kokkos::Experimental::profileLibraryLoaded()) {
+ Kokkos::Experimental::endParallelReduce(kpID);
+ }
+#endif
+
}
// integral range policy and pod or array of pod output
template< class FunctorType >
inline
-void parallel_reduce( const size_t work_count ,
- const FunctorType & functor ,
- typename Kokkos::Impl::FunctorValueTraits< FunctorType , void >::reference_type result )
+void parallel_reduce( const size_t work_count
+ , const FunctorType & functor
+ , typename Kokkos::Impl::FunctorValueTraits<
+ typename Impl::if_c<Impl::is_execution_policy<FunctorType>::value ||
+ Impl::is_integral<FunctorType>::value,
+ void,FunctorType>::type
+ , void >::reference_type result
+ , const std::string& str = ""
+ , typename Impl::enable_if< true
+#ifdef KOKKOS_HAVE_CUDA
+ && ! Impl::is_same<
+ typename Impl::FunctorPolicyExecutionSpace< FunctorType , void >::execution_space,
+ Kokkos::Cuda>::value
+#endif
+ >::type * = 0 )
{
typedef Kokkos::Impl::FunctorValueTraits< FunctorType , void > ValueTraits ;
typedef Kokkos::Impl::FunctorValueOps< FunctorType , void > ValueOps ;
typedef typename
Kokkos::Impl::FunctorPolicyExecutionSpace< FunctorType , void >::execution_space
execution_space ;
typedef Kokkos::RangePolicy< execution_space > policy ;
// Wrap the result output request in a view to inform the implementation
// of the type and memory space.
typedef typename Kokkos::Impl::if_c< (ValueTraits::StaticValueSize != 0)
, typename ValueTraits::value_type
, typename ValueTraits::pointer_type
>::type value_type ;
Kokkos::View< value_type
, HostSpace
, Kokkos::MemoryUnmanaged
>
result_view( ValueOps::pointer( result )
, ValueTraits::value_count( functor )
);
- (void) Impl::ParallelReduce< FunctorType , policy >( functor , policy(0,work_count) , result_view );
+#ifdef KOKKOSP_ENABLE_PROFILING
+ uint64_t kpID = 0;
+ if(Kokkos::Experimental::profileLibraryLoaded()) {
+ Kokkos::Experimental::beginParallelReduce("" == str ? typeid(FunctorType).name() : str, 0, &kpID);
+ }
+#endif
+
+ (void) Impl::ParallelReduce< FunctorType , policy >( Impl::CopyWithoutTracking::apply(functor) , policy(0,work_count) , Impl::CopyWithoutTracking::apply(result_view) );
+
+#ifdef KOKKOSP_ENABLE_PROFILING
+ if(Kokkos::Experimental::profileLibraryLoaded()) {
+ Kokkos::Experimental::endParallelReduce(kpID);
+ }
+#endif
+
+}
+
+template< class ExecPolicy , class FunctorType , class ResultType >
+inline
+void parallel_reduce( const std::string & str
+ , const ExecPolicy & policy
+ , const FunctorType & functor
+ , ResultType * result)
+{
+ #if KOKKOS_ENABLE_DEBUG_PRINT_KERNEL_NAMES
+ Kokkos::fence();
+ std::cout << "KOKKOS_DEBUG Start parallel_reduce kernel: " << str << std::endl;
+ #endif
+
+ parallel_reduce(policy,functor,result,str);
+
+ #if KOKKOS_ENABLE_DEBUG_PRINT_KERNEL_NAMES
+ Kokkos::fence();
+ std::cout << "KOKKOS_DEBUG End parallel_reduce kernel: " << str << std::endl;
+ #endif
+ (void) str;
+}
+
+template< class ExecPolicy , class FunctorType , class ResultType >
+inline
+void parallel_reduce( const std::string & str
+ , const ExecPolicy & policy
+ , const FunctorType & functor
+ , ResultType & result)
+{
+ #if KOKKOS_ENABLE_DEBUG_PRINT_KERNEL_NAMES
+ Kokkos::fence();
+ std::cout << "KOKKOS_DEBUG Start parallel_reduce kernel: " << str << std::endl;
+ #endif
+
+ parallel_reduce(policy,functor,result,str);
+
+ #if KOKKOS_ENABLE_DEBUG_PRINT_KERNEL_NAMES
+ Kokkos::fence();
+ std::cout << "KOKKOS_DEBUG End parallel_reduce kernel: " << str << std::endl;
+ #endif
+ (void) str;
+}
+
+template< class ExecPolicy , class FunctorType >
+inline
+void parallel_reduce( const std::string & str
+ , const ExecPolicy & policy
+ , const FunctorType & functor)
+{
+ #if KOKKOS_ENABLE_DEBUG_PRINT_KERNEL_NAMES
+ Kokkos::fence();
+ std::cout << "KOKKOS_DEBUG Start parallel_reduce kernel: " << str << std::endl;
+ #endif
+
+ parallel_reduce(policy,functor,str);
+
+ #if KOKKOS_ENABLE_DEBUG_PRINT_KERNEL_NAMES
+ Kokkos::fence();
+ std::cout << "KOKKOS_DEBUG End parallel_reduce kernel: " << str << std::endl;
+ #endif
+ (void) str;
}
+
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
/// \fn parallel_scan
/// \tparam ExecutionPolicy The execution policy type.
/// \tparam FunctorType The scan functor type.
///
/// \param policy [in] The execution policy.
/// \param functor [in] The scan functor.
///
/// This function implements a parallel scan pattern. The scan can
/// be either inclusive or exclusive, depending on how you implement
/// the scan functor.
///
/// A scan functor looks almost exactly like a reduce functor, except
/// that its operator() takes a third \c bool argument, \c final_pass,
/// which indicates whether this is the last pass of the scan
/// operation. We will show below how to use the \c final_pass
/// argument to control whether the scan is inclusive or exclusive.
///
/// Here is the minimum required interface of a scan functor for a POD
/// (plain old data) value type \c PodType. That is, the result is a
/// View of zero or more PodType. It is also possible for the result
/// to be an array of (same-sized) arrays of PodType, but we do not
/// show the required interface for that here.
/// \code
/// template< class ExecPolicy , class FunctorType >
/// class ScanFunctor {
/// public:
/// // The Kokkos device type
/// typedef ... execution_space;
/// // Type of an entry of the array containing the result;
/// // also the type of each of the entries combined using
/// // operator() or join().
/// typedef PodType value_type;
///
/// void operator () (const ExecPolicy::member_type & i, value_type& update, const bool final_pass) const;
/// void init (value_type& update) const;
/// void join (volatile value_type& update, volatile const value_type& input) const;
/// };
/// \endcode
///
/// Here is an example of a functor which computes an inclusive plus-scan
/// of an array of \c int, in place. If given an array [1, 2, 3, 4], this
/// scan will overwrite that array with [1, 3, 6, 10].
///
/// \code
/// template<class SpaceType>
/// class InclScanFunctor {
/// public:
/// typedef SpaceType execution_space;
/// typedef int value_type;
/// typedef typename SpaceType::size_type size_type;
///
/// InclScanFunctor( Kokkos::View<value_type*, execution_space> x
/// , Kokkos::View<value_type*, execution_space> y ) : m_x(x), m_y(y) {}
///
/// void operator () (const size_type i, value_type& update, const bool final_pass) const {
/// update += m_x(i);
/// if (final_pass) {
/// m_y(i) = update;
/// }
/// }
/// void init (value_type& update) const {
/// update = 0;
/// }
/// void join (volatile value_type& update, volatile const value_type& input) const {
/// update += input;
/// }
///
/// private:
/// Kokkos::View<value_type*, execution_space> m_x;
/// Kokkos::View<value_type*, execution_space> m_y;
/// };
/// \endcode
///
/// Here is an example of a functor which computes an <i>exclusive</i>
/// scan of an array of \c int, in place. In operator(), note that
/// the final_pass test and the update have switched places, and that
/// a temporary is used. If given an array [1, 2, 3, 4], this scan
/// will overwrite that array with [0, 1, 3, 6].
///
/// \code
/// template<class SpaceType>
/// class ExclScanFunctor {
/// public:
/// typedef SpaceType execution_space;
/// typedef int value_type;
/// typedef typename SpaceType::size_type size_type;
///
/// ExclScanFunctor (Kokkos::View<value_type*, execution_space> x) : x_ (x) {}
///
/// void operator () (const size_type i, value_type& update, const bool final_pass) const {
/// const value_type x_i = x_(i);
/// if (final_pass) {
/// x_(i) = update;
/// }
/// update += x_i;
/// }
/// void init (value_type& update) const {
/// update = 0;
/// }
/// void join (volatile value_type& update, volatile const value_type& input) const {
/// update += input;
/// }
///
/// private:
/// Kokkos::View<value_type*, execution_space> x_;
/// };
/// \endcode
///
/// Here is an example of a functor which builds on the above
/// exclusive scan example, to compute an offsets array from a
/// population count array, in place. We assume that the pop count
/// array has an extra entry at the end to store the final count. If
/// given an array [1, 2, 3, 4, 0], this scan will overwrite that
/// array with [0, 1, 3, 6, 10].
///
/// \code
/// template<class SpaceType>
/// class OffsetScanFunctor {
/// public:
/// typedef SpaceType execution_space;
/// typedef int value_type;
/// typedef typename SpaceType::size_type size_type;
///
/// // last_index_ is the last valid index (zero-based) of x.
/// // If x has length zero, then last_index_ won't be used anyway.
/// OffsetScanFunctor( Kokkos::View<value_type*, execution_space> x
/// , Kokkos::View<value_type*, execution_space> y )
/// : m_x(x), m_y(y), last_index_ (x.dimension_0 () == 0 ? 0 : x.dimension_0 () - 1)
/// {}
///
/// void operator () (const size_type i, int& update, const bool final_pass) const {
/// if (final_pass) {
/// m_y(i) = update;
/// }
/// update += m_x(i);
/// // The last entry of m_y gets the final sum.
/// if (final_pass && i == last_index_) {
/// m_y(i+1) = update;
/// }
/// }
/// void init (value_type& update) const {
/// update = 0;
/// }
/// void join (volatile value_type& update, volatile const value_type& input) const {
/// update += input;
/// }
///
/// private:
/// Kokkos::View<value_type*, execution_space> m_x;
/// Kokkos::View<value_type*, execution_space> m_y;
/// const size_type last_index_;
/// };
/// \endcode
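///
/// Dispatching any of the scan functors above follows the same pattern as
/// parallel_for; a minimal sketch using the in-place inclusive scan (the
/// view \c x and length \c n are illustrative):
/// \code
/// Kokkos::parallel_scan( n , InclScanFunctor<Kokkos::DefaultExecutionSpace>( x , x ) );
/// \endcode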
///
template< class ExecutionPolicy , class FunctorType >
inline
void parallel_scan( const ExecutionPolicy & policy
, const FunctorType & functor
+ , const std::string& str = ""
, typename Impl::enable_if< ! Impl::is_integral< ExecutionPolicy >::value >::type * = 0
)
{
- Impl::ParallelScan< FunctorType , ExecutionPolicy > scan( functor , policy );
+#ifdef KOKKOSP_ENABLE_PROFILING
+ uint64_t kpID = 0;
+ if(Kokkos::Experimental::profileLibraryLoaded()) {
+ Kokkos::Experimental::beginParallelScan("" == str ? typeid(FunctorType).name() : str, 0, &kpID);
+ }
+#endif
+
+ Impl::ParallelScan< FunctorType , ExecutionPolicy > scan( Impl::CopyWithoutTracking::apply(functor) , policy );
+
+#ifdef KOKKOSP_ENABLE_PROFILING
+ if(Kokkos::Experimental::profileLibraryLoaded()) {
+ Kokkos::Experimental::endParallelScan(kpID);
+ }
+#endif
+
}
template< class FunctorType >
inline
-void parallel_scan( const size_t work_count ,
- const FunctorType & functor )
+void parallel_scan( const size_t work_count
+ , const FunctorType & functor
+ , const std::string& str = "" )
{
typedef typename
Kokkos::Impl::FunctorPolicyExecutionSpace< FunctorType , void >::execution_space
execution_space ;
typedef Kokkos::RangePolicy< execution_space > policy ;
- (void) Impl::ParallelScan< FunctorType , policy >( functor , policy(0,work_count) );
+#ifdef KOKKOSP_ENABLE_PROFILING
+ uint64_t kpID = 0;
+ if(Kokkos::Experimental::profileLibraryLoaded()) {
+ Kokkos::Experimental::beginParallelScan("" == str ? typeid(FunctorType).name() : str, 0, &kpID);
+ }
+#endif
+
+ (void) Impl::ParallelScan< FunctorType , policy >( Impl::CopyWithoutTracking::apply(functor) , policy(0,work_count) );
+
+#ifdef KOKKOSP_ENABLE_PROFILING
+ if(Kokkos::Experimental::profileLibraryLoaded()) {
+ Kokkos::Experimental::endParallelScan(kpID);
+ }
+#endif
+
+}
+
+template< class ExecutionPolicy , class FunctorType >
+inline
+void parallel_scan( const std::string& str
+ , const ExecutionPolicy & policy
+ , const FunctorType & functor)
+{
+ #if KOKKOS_ENABLE_DEBUG_PRINT_KERNEL_NAMES
+ Kokkos::fence();
+ std::cout << "KOKKOS_DEBUG Start parallel_scan kernel: " << str << std::endl;
+ #endif
+
+ parallel_scan(policy,functor,str);
+
+ #if KOKKOS_ENABLE_DEBUG_PRINT_KERNEL_NAMES
+ Kokkos::fence();
+ std::cout << "KOKKOS_DEBUG End parallel_scan kernel: " << str << std::endl;
+ #endif
+ (void) str;
}
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
template< class FunctorType , class Enable = void >
struct FunctorTeamShmemSize
{
static inline size_t value( const FunctorType & , int ) { return 0 ; }
};
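// The partial specializations below are selected, via SFINAE on the address of
// a member function, only for functors that actually declare team_shmem_size()
// or shmem_size(); all other functors fall back to the zero-size default above.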
template< class FunctorType >
-struct FunctorTeamShmemSize< FunctorType , typename enable_if< sizeof( & FunctorType::team_shmem_size ) >::type >
+struct FunctorTeamShmemSize< FunctorType , typename Impl::enable_if< 0 < sizeof( & FunctorType::team_shmem_size ) >::type >
{
static inline size_t value( const FunctorType & f , int team_size ) { return f.team_shmem_size( team_size ) ; }
};
template< class FunctorType >
-struct FunctorTeamShmemSize< FunctorType , typename enable_if< sizeof( & FunctorType::shmem_size ) >::type >
+struct FunctorTeamShmemSize< FunctorType , typename Impl::enable_if< 0 < sizeof( & FunctorType::shmem_size ) >::type >
{
static inline size_t value( const FunctorType & f , int team_size ) { return f.shmem_size( team_size ) ; }
};
} // namespace Impl
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
#endif /* KOKKOS_PARALLEL_HPP */
diff --git a/lib/kokkos/core/src/Kokkos_Qthread.hpp b/lib/kokkos/core/src/Kokkos_Qthread.hpp
index cc6f0f844..4f12c02ba 100755
--- a/lib/kokkos/core/src/Kokkos_Qthread.hpp
+++ b/lib/kokkos/core/src/Kokkos_Qthread.hpp
@@ -1,165 +1,165 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_QTHREAD_HPP
#define KOKKOS_QTHREAD_HPP
#include <cstddef>
#include <iosfwd>
#include <Kokkos_Core.hpp>
#include <Kokkos_Layout.hpp>
#include <Kokkos_MemoryTraits.hpp>
#include <Kokkos_HostSpace.hpp>
#include <Kokkos_ExecPolicy.hpp>
#include <impl/Kokkos_Tags.hpp>
/*--------------------------------------------------------------------------*/
namespace Kokkos {
namespace Impl {
class QthreadExec ;
} // namespace Impl
} // namespace Kokkos
/*--------------------------------------------------------------------------*/
namespace Kokkos {
/** \brief Execution space supported by Qthread */
class Qthread {
public:
//! \name Type declarations that all Kokkos devices must provide.
//@{
//! Tag this class as an execution space
typedef Qthread execution_space ;
typedef Kokkos::HostSpace memory_space ;
+ //! This execution space's preferred device_type
+ typedef Kokkos::Device<execution_space,memory_space> device_type;
+
typedef Kokkos::LayoutRight array_layout ;
typedef memory_space::size_type size_type ;
typedef ScratchMemorySpace< Qthread > scratch_memory_space ;
- //! For backward compatibility:
- typedef Qthread device_type ;
-
//@}
/*------------------------------------------------------------------------*/
/** \brief Initialization will construct one or more instances */
static Qthread & instance( int = 0 );
/** \brief Set the execution space to a "sleep" state.
*
* This function puts the execution space into a "sleep" state in which it
* is not ready for work. This may consume fewer resources than a "ready"
* state, but it may also take time to transition back to the "ready" state.
*
* \return True if it enters or is in the "sleep" state.
* False if functions are currently executing.
*/
bool sleep();
/** \brief Wake from the sleep state.
*
* \return True if it enters or is in the "ready" state.
* False if functions are currently executing.
*/
static bool wake();
/** \brief Wait for all dispatched functions to complete.
*
* The parallel_for or parallel_reduce dispatch of a functor may
* return asynchronously, before the functor completes. This
* method does not return until all dispatched functors on this
* device have completed.
*/
static void fence();
/*------------------------------------------------------------------------*/
static void initialize( int thread_count );
static void finalize();
/** \brief Print configuration information to the given output stream. */
static void print_configuration( std::ostream & , const bool detail = false );
int shepherd_size() const ;
int shepherd_worker_size() const ;
};
/*--------------------------------------------------------------------------*/
} // namespace Kokkos
/*--------------------------------------------------------------------------*/
/*--------------------------------------------------------------------------*/
namespace Kokkos {
namespace Impl {
template<>
struct VerifyExecutionCanAccessMemorySpace
< Kokkos::Qthread::memory_space
, Kokkos::Qthread::scratch_memory_space
>
{
enum { value = true };
inline static void verify( void ) { }
inline static void verify( const void * ) { }
};
} // namespace Impl
} // namespace Kokkos
/*--------------------------------------------------------------------------*/
/*--------------------------------------------------------------------------*/
#include <Kokkos_Parallel.hpp>
#include <Qthread/Kokkos_QthreadExec.hpp>
#include <Qthread/Kokkos_Qthread_Parallel.hpp>
#endif /* #ifndef KOKKOS_QTHREAD_HPP */
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
diff --git a/lib/kokkos/core/src/Kokkos_ScratchSpace.hpp b/lib/kokkos/core/src/Kokkos_ScratchSpace.hpp
index 56b954d9b..6e5b4f962 100755
--- a/lib/kokkos/core/src/Kokkos_ScratchSpace.hpp
+++ b/lib/kokkos/core/src/Kokkos_ScratchSpace.hpp
@@ -1,115 +1,125 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_SCRATCHSPACE_HPP
#define KOKKOS_SCRATCHSPACE_HPP
#include <stdio.h>
#include <Kokkos_Core_fwd.hpp>
#include <impl/Kokkos_Tags.hpp>
/*--------------------------------------------------------------------------*/
namespace Kokkos {
/** \brief Scratch memory space associated with an execution space.
*
*/
template< class ExecSpace >
class ScratchMemorySpace {
public:
// Alignment of memory chunks returned by 'get_shmem'
// must be a power of two
enum { ALIGN = 8 };
private:
mutable char * m_iter ;
char * m_end ;
ScratchMemorySpace();
ScratchMemorySpace & operator = ( const ScratchMemorySpace & );
enum { MASK = ALIGN - 1 }; // Alignment used by View::shmem_size
public:
//! Tag this class as a memory space
typedef ScratchMemorySpace memory_space ;
typedef ExecSpace execution_space ;
+ //! This execution space's preferred device_type
+ typedef Kokkos::Device<execution_space,memory_space> device_type;
+
typedef typename ExecSpace::array_layout array_layout ;
typedef typename ExecSpace::size_type size_type ;
template< typename IntType >
KOKKOS_INLINE_FUNCTION static
IntType align( const IntType & size )
{ return ( size + MASK ) & ~MASK ; }
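// Carve an ALIGN-rounded chunk of 'size' bytes out of the scratch buffer and
// advance the internal cursor past it. If the remaining capacity is too small,
// the cursor is restored and NULL is returned instead.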
template< typename IntType >
KOKKOS_INLINE_FUNCTION
void* get_shmem (const IntType& size) const {
void* tmp = m_iter ;
if (m_end < (m_iter += align (size))) {
m_iter -= align (size); // put it back like it was
- printf ("ScratchMemorySpace<...>::get_shmem: Failed to allocate %ld byte(s); remaining capacity is %ld byte(s)\n", long(size), long(m_end-m_iter));
+ #ifdef KOKKOS_HAVE_DEBUG
+ // mfh 23 Jun 2015: printf call consumes 25 registers
+ // in a CUDA build, so only print in debug mode. The
+ // function still returns NULL if not enough memory.
+ printf ("ScratchMemorySpace<...>::get_shmem: Failed to allocate "
+ "%ld byte(s); remaining capacity is %ld byte(s)\n", long(size),
+ long(m_end-m_iter));
+ #endif // KOKKOS_HAVE_DEBUG
tmp = 0;
}
return tmp;
}
template< typename IntType >
KOKKOS_INLINE_FUNCTION
ScratchMemorySpace( void * ptr , const IntType & size )
: m_iter( (char *) ptr )
, m_end( m_iter + size )
{}
};
} // namespace Kokkos
#endif /* #ifndef KOKKOS_SCRATCHSPACE_HPP */
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
diff --git a/lib/kokkos/core/src/Kokkos_Serial.hpp b/lib/kokkos/core/src/Kokkos_Serial.hpp
index e9495724f..5773a18b3 100755
--- a/lib/kokkos/core/src/Kokkos_Serial.hpp
+++ b/lib/kokkos/core/src/Kokkos_Serial.hpp
@@ -1,879 +1,892 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
/// \file Kokkos_Serial.hpp
/// \brief Declaration and definition of Kokkos::Serial device.
#ifndef KOKKOS_SERIAL_HPP
#define KOKKOS_SERIAL_HPP
#include <cstddef>
#include <iosfwd>
#include <Kokkos_Parallel.hpp>
#include <Kokkos_Layout.hpp>
#include <Kokkos_HostSpace.hpp>
#include <Kokkos_ScratchSpace.hpp>
#include <Kokkos_MemoryTraits.hpp>
#include <impl/Kokkos_Tags.hpp>
#include <impl/Kokkos_FunctorAdapter.hpp>
#if defined( KOKKOS_HAVE_SERIAL )
namespace Kokkos {
/// \class Serial
/// \brief Kokkos device for non-parallel execution
///
/// A "device" represents a parallel execution model. It tells Kokkos
/// how to parallelize the execution of kernels in a parallel_for or
/// parallel_reduce. For example, the Threads device uses Pthreads or
/// C++11 threads on a CPU, the OpenMP device uses the OpenMP language
/// extensions, and the Cuda device uses NVIDIA's CUDA programming
/// model. The Serial device executes "parallel" kernels
/// sequentially. This is useful if you really do not want to use
/// threads, or if you want to explore different combinations of MPI
/// and shared-memory parallel programming models.
class Serial {
public:
//! \name Type declarations that all Kokkos devices must provide.
//@{
//! Tag this class as an execution space:
typedef Serial execution_space ;
//! The size_type typedef best suited for this device.
typedef HostSpace::size_type size_type ;
//! This device's preferred memory space.
typedef HostSpace memory_space ;
+ //! This execution space's preferred device_type
+ typedef Kokkos::Device<execution_space,memory_space> device_type;
+
//! This device's preferred array layout.
typedef LayoutRight array_layout ;
/// \brief Scratch memory space
typedef ScratchMemorySpace< Kokkos::Serial > scratch_memory_space ;
- //! For backward compatibility:
- typedef Serial device_type ;
-
//@}
/// \brief True if and only if this method is being called in a
/// thread-parallel function.
///
/// For the Serial device, this method <i>always</i> returns false,
/// because parallel_for or parallel_reduce with the Serial device
/// always execute sequentially.
inline static int in_parallel() { return false ; }
/** \brief Set the device in a "sleep" state.
*
* This function sets the device in a "sleep" state in which it is
* not ready for work. This may consume fewer resources than if the
* device were in an "awake" state, but it may also take time to
* bring the device from a sleep state to be ready for work.
*
* \return True if the device is in the "sleep" state, else false if
* the device is actively working and could not enter the "sleep"
* state.
*/
static bool sleep();
/// \brief Wake the device from the 'sleep' state so it is ready for work.
///
/// \return True if the device is in the "ready" state, else "false"
/// if the device is actively working (which also means that it's
/// awake).
static bool wake();
/// \brief Wait until all dispatched functors complete.
///
/// The parallel_for or parallel_reduce dispatch of a functor may
/// return asynchronously, before the functor completes. This
/// method does not return until all dispatched functors on this
/// device have completed.
static void fence() {}
static void initialize( unsigned threads_count = 1 ,
unsigned use_numa_count = 0 ,
unsigned use_cores_per_numa = 0 ,
bool allow_asynchronous_threadpool = false) {
(void) threads_count;
(void) use_numa_count;
(void) use_cores_per_numa;
(void) allow_asynchronous_threadpool;
+
+ // Init the array of locks used for arbitrarily sized atomics
+ Impl::init_lock_array_host_space();
+
}
static int is_initialized() { return 1 ; }
//! Free any resources being consumed by the device.
static void finalize() {}
//! Print configuration information to the given output stream.
- static void print_configuration( std::ostream & , const bool detail = false );
+ static void print_configuration( std::ostream & , const bool detail = false ) {}
//--------------------------------------------------------------------------
inline static int thread_pool_size( int = 0 ) { return 1 ; }
KOKKOS_INLINE_FUNCTION static int thread_pool_rank() { return 0 ; }
//--------------------------------------------------------------------------
KOKKOS_INLINE_FUNCTION static unsigned hardware_thread_id() { return thread_pool_rank(); }
inline static unsigned max_hardware_threads() { return thread_pool_size(0); }
//--------------------------------------------------------------------------
static void * scratch_memory_resize( unsigned reduce_size , unsigned shared_size );
//--------------------------------------------------------------------------
};
} // namespace Kokkos
/*--------------------------------------------------------------------------*/
/*--------------------------------------------------------------------------*/
namespace Kokkos {
namespace Impl {
template<>
struct VerifyExecutionCanAccessMemorySpace
< Kokkos::Serial::memory_space
, Kokkos::Serial::scratch_memory_space
>
{
enum { value = true };
inline static void verify( void ) { }
inline static void verify( const void * ) { }
};
namespace SerialImpl {
struct Sentinel {
void * m_scratch ;
unsigned m_reduce_end ;
unsigned m_shared_end ;
Sentinel();
~Sentinel();
static Sentinel & singleton();
};
inline
unsigned align( unsigned n );
}
} // namespace Impl
} // namespace Kokkos
/*--------------------------------------------------------------------------*/
/*--------------------------------------------------------------------------*/
namespace Kokkos {
namespace Impl {
class SerialTeamMember {
private:
typedef Kokkos::ScratchMemorySpace< Kokkos::Serial > scratch_memory_space ;
const scratch_memory_space m_space ;
const int m_league_rank ;
const int m_league_size ;
SerialTeamMember & operator = ( const SerialTeamMember & );
public:
KOKKOS_INLINE_FUNCTION
const scratch_memory_space & team_shmem() const { return m_space ; }
KOKKOS_INLINE_FUNCTION int league_rank() const { return m_league_rank ; }
KOKKOS_INLINE_FUNCTION int league_size() const { return m_league_size ; }
KOKKOS_INLINE_FUNCTION int team_rank() const { return 0 ; }
KOKKOS_INLINE_FUNCTION int team_size() const { return 1 ; }
KOKKOS_INLINE_FUNCTION void team_barrier() const {}
template<class ValueType>
KOKKOS_INLINE_FUNCTION
void team_broadcast(const ValueType& , const int& ) const {}
template< class ValueType, class JoinOp >
KOKKOS_INLINE_FUNCTION
- ValueType team_reduce( const ValueType & value
- , const JoinOp & ) const
+ ValueType team_reduce( const ValueType & value , const JoinOp & ) const
{
return value ;
}
/** \brief Intra-team exclusive prefix sum with team_rank() ordering
* with intra-team non-deterministic ordering accumulation.
*
* The global inter-team accumulation value will, at the end of the
* league's parallel execution, be the scan's total.
* Parallel execution ordering of the league's teams is non-deterministic.
* As such the base value for each team's scan operation is similarly
* non-deterministic.
*/
template< typename Type >
KOKKOS_INLINE_FUNCTION Type team_scan( const Type & value , Type * const global_accum ) const
{
const Type tmp = global_accum ? *global_accum : Type(0) ;
if ( global_accum ) { *global_accum += value ; }
return tmp ;
}
/** \brief Intra-team exclusive prefix sum with team_rank() ordering.
*
* The highest rank thread can compute the reduction total as
* reduction_total = dev.team_scan( value ) + value ;
*/
template< typename Type >
KOKKOS_INLINE_FUNCTION Type team_scan( const Type & ) const
{ return Type(0); }
-#ifdef KOKKOS_HAVE_CXX11
-
- /** \brief Executes op(iType i) for each i=0..N-1.
- *
- * This functionality requires C++11 support.*/
- template< typename iType, class Operation>
- KOKKOS_INLINE_FUNCTION void team_par_for(const iType n, const Operation & op) const {
- for(int i=0; i<n ; i++) {
- op(i);
- }
- }
-
-#endif
-
//----------------------------------------
// Execution space specific:
SerialTeamMember( int arg_league_rank
, int arg_league_size
, int arg_shared_size
);
};
} // namespace Impl
/*
* < Kokkos::Serial , WorkArgTag >
* < WorkArgTag , Impl::enable_if< Impl::is_same< Kokkos::Serial , Kokkos::DefaultExecutionSpace >::value >::type >
*
*/
template< class Arg0 , class Arg1 >
class TeamPolicy< Arg0 , Arg1 , Kokkos::Serial >
{
private:
const int m_league_size ;
public:
//! Tag this class as a kokkos execution policy
typedef TeamPolicy execution_policy ;
//! Execution space of this execution policy:
typedef Kokkos::Serial execution_space ;
typedef typename
Impl::if_c< ! Impl::is_same< Kokkos::Serial , Arg0 >::value , Arg0 , Arg1 >::type
work_tag ;
//----------------------------------------
template< class FunctorType >
static
int team_size_max( const FunctorType & ) { return 1 ; }
template< class FunctorType >
static
int team_size_recommended( const FunctorType & ) { return 1 ; }
+ template< class FunctorType >
+ static
+ int team_size_recommended( const FunctorType & , const int& ) { return 1 ; }
+
//----------------------------------------
inline int team_size() const { return 1 ; }
inline int league_size() const { return m_league_size ; }
/** \brief Specify league size, request team size */
TeamPolicy( execution_space & , int league_size_request , int /* team_size_request */ , int vector_length_request = 1 )
: m_league_size( league_size_request )
{ (void) vector_length_request; }
TeamPolicy( int league_size_request , int /* team_size_request */ , int vector_length_request = 1 )
: m_league_size( league_size_request )
{ (void) vector_length_request; }
typedef Impl::SerialTeamMember member_type ;
};
} /* namespace Kokkos */
/*--------------------------------------------------------------------------*/
/*--------------------------------------------------------------------------*/
/*--------------------------------------------------------------------------*/
/*--------------------------------------------------------------------------*/
namespace Kokkos {
namespace Impl {
template< class FunctorType , class Arg0 , class Arg1 , class Arg2 >
class ParallelFor< FunctorType , Kokkos::RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::Serial > >
{
private:
typedef Kokkos::RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::Serial > Policy ;
public:
// work tag is void
template< class PType >
inline
ParallelFor( typename Impl::enable_if<
( Impl::is_same< PType , Policy >::value &&
Impl::is_same< typename PType::work_tag , void >::value
), const FunctorType & >::type functor
, const PType & policy )
{
const typename PType::member_type e = policy.end();
for ( typename PType::member_type i = policy.begin() ; i < e ; ++i ) {
functor( i );
}
}
// work tag is non-void
template< class PType >
inline
ParallelFor( typename Impl::enable_if<
( Impl::is_same< PType , Policy >::value &&
! Impl::is_same< typename PType::work_tag , void >::value
), const FunctorType & >::type functor
, const PType & policy )
{
const typename PType::member_type e = policy.end();
for ( typename PType::member_type i = policy.begin() ; i < e ; ++i ) {
functor( typename PType::work_tag() , i );
}
}
};
template< class FunctorType , class Arg0 , class Arg1 , class Arg2 >
class ParallelReduce< FunctorType , Kokkos::RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::Serial > >
{
public:
typedef Kokkos::RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::Serial > Policy ;
typedef typename Policy::work_tag WorkTag ;
typedef Kokkos::Impl::FunctorValueTraits< FunctorType , WorkTag > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< FunctorType , WorkTag > ValueInit ;
typedef typename ValueTraits::pointer_type pointer_type ;
typedef typename ValueTraits::reference_type reference_type ;
// Work tag is void
template< class ViewType , class PType >
ParallelReduce( typename Impl::enable_if<
( Impl::is_view< ViewType >::value &&
Impl::is_same< typename ViewType::memory_space , HostSpace >::value &&
Impl::is_same< PType , Policy >::value &&
Impl::is_same< typename PType::work_tag , void >::value
), const FunctorType & >::type functor
, const PType & policy
, const ViewType & result
)
{
pointer_type result_ptr = result.ptr_on_device();
if ( ! result_ptr ) {
result_ptr = (pointer_type)
Kokkos::Serial::scratch_memory_resize( ValueTraits::value_size( functor ) , 0 );
}
reference_type update = ValueInit::init( functor , result_ptr );
const typename PType::member_type e = policy.end();
for ( typename PType::member_type i = policy.begin() ; i < e ; ++i ) {
functor( i , update );
}
Kokkos::Impl::FunctorFinal< FunctorType , WorkTag >::final( functor , result_ptr );
}
// Work tag is non-void
template< class ViewType , class PType >
ParallelReduce( typename Impl::enable_if<
( Impl::is_view< ViewType >::value &&
Impl::is_same< typename ViewType::memory_space , HostSpace >::value &&
Impl::is_same< PType , Policy >::value &&
! Impl::is_same< typename PType::work_tag , void >::value
), const FunctorType & >::type functor
, const PType & policy
, const ViewType & result
)
{
pointer_type result_ptr = result.ptr_on_device();
if ( ! result_ptr ) {
result_ptr = (pointer_type)
Kokkos::Serial::scratch_memory_resize( ValueTraits::value_size( functor ) , 0 );
}
typename ValueTraits::reference_type update = ValueInit::init( functor , result_ptr );
const typename PType::member_type e = policy.end();
for ( typename PType::member_type i = policy.begin() ; i < e ; ++i ) {
functor( typename PType::work_tag() , i , update );
}
Kokkos::Impl::FunctorFinal< FunctorType , WorkTag >::final( functor , result_ptr );
}
};
template< class FunctorType , class Arg0 , class Arg1 , class Arg2 >
class ParallelScan< FunctorType , Kokkos::RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::Serial > >
{
private:
typedef Kokkos::RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::Serial > Policy ;
typedef Kokkos::Impl::FunctorValueTraits< FunctorType , typename Policy::work_tag > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< FunctorType , typename Policy::work_tag > ValueInit ;
public:
typedef typename ValueTraits::pointer_type pointer_type ;
typedef typename ValueTraits::reference_type reference_type ;
// work tag is void
template< class PType >
inline
ParallelScan( typename Impl::enable_if<
( Impl::is_same< PType , Policy >::value &&
Impl::is_same< typename PType::work_tag , void >::value
), const FunctorType & >::type functor
, const PType & policy )
{
pointer_type result_ptr = (pointer_type)
Kokkos::Serial::scratch_memory_resize( ValueTraits::value_size( functor ) , 0 );
reference_type update = ValueInit::init( functor , result_ptr );
const typename PType::member_type e = policy.end();
for ( typename PType::member_type i = policy.begin() ; i < e ; ++i ) {
functor( i , update , true );
}
Kokkos::Impl::FunctorFinal< FunctorType , typename Policy::work_tag >::final( functor , result_ptr );
}
// work tag is non-void
template< class PType >
inline
ParallelScan( typename Impl::enable_if<
( Impl::is_same< PType , Policy >::value &&
! Impl::is_same< typename PType::work_tag , void >::value
), const FunctorType & >::type functor
, const PType & policy )
{
pointer_type result_ptr = (pointer_type)
Kokkos::Serial::scratch_memory_resize( ValueTraits::value_size( functor ) , 0 );
reference_type update = ValueInit::init( functor , result_ptr );
const typename PType::member_type e = policy.end();
for ( typename PType::member_type i = policy.begin() ; i < e ; ++i ) {
functor( typename PType::work_tag() , i , update , true );
}
Kokkos::Impl::FunctorFinal< FunctorType , typename Policy::work_tag >::final( functor , result_ptr );
}
};
} // namespace Impl
} // namespace Kokkos
/*--------------------------------------------------------------------------*/
/*--------------------------------------------------------------------------*/
namespace Kokkos {
namespace Impl {
template< class FunctorType , class Arg0 , class Arg1 >
class ParallelFor< FunctorType , Kokkos::TeamPolicy< Arg0 , Arg1 , Kokkos::Serial > >
{
private:
typedef Kokkos::TeamPolicy< Arg0 , Arg1 , Kokkos::Serial > Policy ;
template< class TagType >
KOKKOS_FORCEINLINE_FUNCTION static
void driver( typename Impl::enable_if< Impl::is_same< TagType , void >::value ,
const FunctorType & >::type functor
, const typename Policy::member_type & member )
{ functor( member ); }
template< class TagType >
KOKKOS_FORCEINLINE_FUNCTION static
void driver( typename Impl::enable_if< ! Impl::is_same< TagType , void >::value ,
const FunctorType & >::type functor
, const typename Policy::member_type & member )
{ functor( TagType() , member ); }
public:
ParallelFor( const FunctorType & functor
, const Policy & policy )
{
const int shared_size = FunctorTeamShmemSize< FunctorType >::value( functor , policy.team_size() );
Kokkos::Serial::scratch_memory_resize( 0 , shared_size );
for ( int ileague = 0 ; ileague < policy.league_size() ; ++ileague ) {
ParallelFor::template driver< typename Policy::work_tag >
( functor , typename Policy::member_type(ileague,policy.league_size(),shared_size) );
// functor( typename Policy::member_type(ileague,policy.league_size(),shared_size) );
}
}
};
template< class FunctorType , class Arg0 , class Arg1 >
class ParallelReduce< FunctorType , Kokkos::TeamPolicy< Arg0 , Arg1 , Kokkos::Serial > >
{
private:
typedef Kokkos::TeamPolicy< Arg0 , Arg1 , Kokkos::Serial > Policy ;
typedef Kokkos::Impl::FunctorValueTraits< FunctorType , typename Policy::work_tag > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< FunctorType , typename Policy::work_tag > ValueInit ;
public:
typedef typename ValueTraits::pointer_type pointer_type ;
typedef typename ValueTraits::reference_type reference_type ;
private:
template< class TagType >
KOKKOS_FORCEINLINE_FUNCTION static
void driver( typename Impl::enable_if< Impl::is_same< TagType , void >::value ,
const FunctorType & >::type functor
, const typename Policy::member_type & member
, reference_type update )
{ functor( member , update ); }
template< class TagType >
KOKKOS_FORCEINLINE_FUNCTION static
void driver( typename Impl::enable_if< ! Impl::is_same< TagType , void >::value ,
const FunctorType & >::type functor
, const typename Policy::member_type & member
, reference_type update )
{ functor( TagType() , member , update ); }
public:
template< class ViewType >
ParallelReduce( const FunctorType & functor
, const Policy & policy
, const ViewType & result
)
{
const int reduce_size = ValueTraits::value_size( functor );
const int shared_size = FunctorTeamShmemSize< FunctorType >::value( functor , policy.team_size() );
void * const scratch_reduce = Kokkos::Serial::scratch_memory_resize( reduce_size , shared_size );
const pointer_type result_ptr =
result.ptr_on_device() ? result.ptr_on_device()
: (pointer_type) scratch_reduce ;
reference_type update = ValueInit::init( functor , result_ptr );
for ( int ileague = 0 ; ileague < policy.league_size() ; ++ileague ) {
ParallelReduce::template driver< typename Policy::work_tag >
( functor , typename Policy::member_type(ileague,policy.league_size(),shared_size) , update );
}
Kokkos::Impl::FunctorFinal< FunctorType , typename Policy::work_tag >::final( functor , result_ptr );
}
};
} // namespace Impl
} // namespace Kokkos
-#ifdef KOKKOS_HAVE_CXX11
-
namespace Kokkos {
namespace Impl {
- template<typename iType>
- struct TeamThreadLoopBoundariesStruct<iType,SerialTeamMember> {
- typedef iType index_type;
- enum {start = 0};
- const iType end;
- enum {increment = 1};
- const SerialTeamMember& thread;
- KOKKOS_INLINE_FUNCTION
- TeamThreadLoopBoundariesStruct (const SerialTeamMember& thread_, const iType& count):
- end(count),
- thread(thread_)
+template<typename iType>
+struct TeamThreadRangeBoundariesStruct<iType,SerialTeamMember> {
+ typedef iType index_type;
+ const iType begin ;
+ const iType end ;
+ enum {increment = 1};
+ const SerialTeamMember& thread;
+
+ KOKKOS_INLINE_FUNCTION
+ TeamThreadRangeBoundariesStruct (const SerialTeamMember& arg_thread, const iType& arg_count)
+ : begin(0)
+ , end(arg_count)
+ , thread(arg_thread)
{}
- };
+
+ KOKKOS_INLINE_FUNCTION
+ TeamThreadRangeBoundariesStruct (const SerialTeamMember& arg_thread, const iType& arg_begin, const iType & arg_end )
+ : begin( arg_begin )
+ , end( arg_end)
+ , thread( arg_thread )
+ {}
+};
template<typename iType>
- struct ThreadVectorLoopBoundariesStruct<iType,SerialTeamMember> {
+ struct ThreadVectorRangeBoundariesStruct<iType,SerialTeamMember> {
typedef iType index_type;
enum {start = 0};
const iType end;
enum {increment = 1};
KOKKOS_INLINE_FUNCTION
- ThreadVectorLoopBoundariesStruct (const SerialTeamMember& thread, const iType& count):
+ ThreadVectorRangeBoundariesStruct (const SerialTeamMember& thread, const iType& count):
end( count )
{}
};
+
} // namespace Impl
template<typename iType>
KOKKOS_INLINE_FUNCTION
-Impl::TeamThreadLoopBoundariesStruct<iType,Impl::SerialTeamMember>
- TeamThreadLoop(const Impl::SerialTeamMember& thread, const iType& count) {
- return Impl::TeamThreadLoopBoundariesStruct<iType,Impl::SerialTeamMember>(thread,count);
+Impl::TeamThreadRangeBoundariesStruct<iType,Impl::SerialTeamMember>
+TeamThreadRange( const Impl::SerialTeamMember& thread, const iType & count )
+{
+ return Impl::TeamThreadRangeBoundariesStruct<iType,Impl::SerialTeamMember>(thread,count);
}
template<typename iType>
KOKKOS_INLINE_FUNCTION
-Impl::ThreadVectorLoopBoundariesStruct<iType,Impl::SerialTeamMember >
- ThreadVectorLoop(const Impl::SerialTeamMember& thread, const iType& count) {
- return Impl::ThreadVectorLoopBoundariesStruct<iType,Impl::SerialTeamMember >(thread,count);
+Impl::TeamThreadRangeBoundariesStruct<iType,Impl::SerialTeamMember>
+TeamThreadRange( const Impl::SerialTeamMember& thread, const iType & begin , const iType & end )
+{
+ return Impl::TeamThreadRangeBoundariesStruct<iType,Impl::SerialTeamMember>(thread,begin,end);
+}
+
+template<typename iType>
+KOKKOS_INLINE_FUNCTION
+Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::SerialTeamMember >
+ ThreadVectorRange(const Impl::SerialTeamMember& thread, const iType& count) {
+ return Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::SerialTeamMember >(thread,count);
}
KOKKOS_INLINE_FUNCTION
Impl::ThreadSingleStruct<Impl::SerialTeamMember> PerTeam(const Impl::SerialTeamMember& thread) {
return Impl::ThreadSingleStruct<Impl::SerialTeamMember>(thread);
}
KOKKOS_INLINE_FUNCTION
Impl::VectorSingleStruct<Impl::SerialTeamMember> PerThread(const Impl::SerialTeamMember& thread) {
return Impl::VectorSingleStruct<Impl::SerialTeamMember>(thread);
}
} // namespace Kokkos
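/* Illustrative usage sketch of the range factories above (not part of the
 * upstream change; `league_size`, `n` and `values` are hypothetical and the
 * TeamPolicy dispatch is assumed from Kokkos_ExecPolicy.hpp):
 *
 * \code
 * typedef Kokkos::TeamPolicy< Kokkos::Serial > policy_type ;
 * typedef policy_type::member_type             member_type ;
 *
 * Kokkos::parallel_for( policy_type( league_size , 1 ) ,
 *   KOKKOS_LAMBDA( const member_type & member ) {
 *     // Each league member iterates a thread range of length n.
 *     Kokkos::parallel_for( Kokkos::TeamThreadRange( member , n ) ,
 *       [&]( const int i ) { values( member.league_rank() , i ) = i ; } );
 *   } );
 * \endcode
 */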
namespace Kokkos {
/** \brief Inter-thread parallel_for. Executes lambda(iType i) for each i=0..N-1.
*
 * The range i=0..N-1 is mapped to all threads of the calling thread team.
* This functionality requires C++11 support.*/
template<typename iType, class Lambda>
KOKKOS_INLINE_FUNCTION
-void parallel_for(const Impl::TeamThreadLoopBoundariesStruct<iType,Impl::SerialTeamMember>& loop_boundaries, const Lambda& lambda) {
- for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment)
+void parallel_for(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::SerialTeamMember>& loop_boundaries, const Lambda& lambda) {
+ for( iType i = loop_boundaries.begin; i < loop_boundaries.end; i+=loop_boundaries.increment)
lambda(i);
}
/** \brief Inter-thread parallel_reduce. Executes lambda(iType i, ValueType & val) for each i=0..N-1.
*
 * The range i=0..N-1 is mapped to all threads of the calling thread team and a summation of
* val is performed and put into result. This functionality requires C++11 support.*/
template< typename iType, class Lambda, typename ValueType >
KOKKOS_INLINE_FUNCTION
-void parallel_reduce(const Impl::TeamThreadLoopBoundariesStruct<iType,Impl::SerialTeamMember>& loop_boundaries,
+void parallel_reduce(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::SerialTeamMember>& loop_boundaries,
const Lambda & lambda, ValueType& result) {
result = ValueType();
- for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
+ for( iType i = loop_boundaries.begin; i < loop_boundaries.end; i+=loop_boundaries.increment) {
ValueType tmp = ValueType();
lambda(i,tmp);
result+=tmp;
}
result = loop_boundaries.thread.team_reduce(result,Impl::JoinAdd<ValueType>());
}
+#ifdef KOKKOS_HAVE_CXX11
+
/** \brief Inter-thread parallel_reduce. Executes lambda(iType i, ValueType & val) for each i=0..N-1.
*
 * The range i=0..N-1 is mapped to all threads of the calling thread team and a reduction of
* val is performed using JoinType(ValueType& val, const ValueType& update) and put into init_result.
* The input value of init_result is used as initializer for temporary variables of ValueType. Therefore
* the input value should be the neutral element with respect to the join operation (e.g. '0 for +-' or
* '1 for *'). This functionality requires C++11 support.*/
template< typename iType, class Lambda, typename ValueType, class JoinType >
KOKKOS_INLINE_FUNCTION
-void parallel_reduce(const Impl::TeamThreadLoopBoundariesStruct<iType,Impl::SerialTeamMember>& loop_boundaries,
+void parallel_reduce(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::SerialTeamMember>& loop_boundaries,
const Lambda & lambda, const JoinType& join, ValueType& init_result) {
ValueType result = init_result;
- for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
+ for( iType i = loop_boundaries.begin; i < loop_boundaries.end; i+=loop_boundaries.increment) {
ValueType tmp = ValueType();
lambda(i,tmp);
join(result,tmp);
}
init_result = loop_boundaries.thread.team_reduce(result,Impl::JoinLambdaAdapter<ValueType,JoinType>(join));
}
+#endif // KOKKOS_HAVE_CXX11
+
} //namespace Kokkos
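/* Illustrative sketch of the team-level reduction defined above (not part of
 * the upstream change; `member`, `n` and `data` are hypothetical names that
 * would live inside a TeamPolicy functor body):
 *
 * \code
 * double team_sum = 0 ;
 * Kokkos::parallel_reduce( Kokkos::TeamThreadRange( member , n ) ,
 *   [&]( const int i , double & partial ) { partial += data(i); } ,
 *   team_sum );
 * // team_sum now holds the sum over i = 0..n-1, reduced across the team.
 * \endcode
 */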
namespace Kokkos {
/** \brief Intra-thread vector parallel_for. Executes lambda(iType i) for each i=0..N-1.
*
 * The range i=0..N-1 is mapped to all vector lanes of the calling thread.
* This functionality requires C++11 support.*/
template<typename iType, class Lambda>
KOKKOS_INLINE_FUNCTION
-void parallel_for(const Impl::ThreadVectorLoopBoundariesStruct<iType,Impl::SerialTeamMember >&
+void parallel_for(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::SerialTeamMember >&
loop_boundaries, const Lambda& lambda) {
#ifdef KOKKOS_HAVE_PRAGMA_IVDEP
#pragma ivdep
#endif
for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment)
lambda(i);
}
/** \brief Intra-thread vector parallel_reduce. Executes lambda(iType i, ValueType & val) for each i=0..N-1.
*
 * The range i=0..N-1 is mapped to all vector lanes of the calling thread and a summation of
* val is performed and put into result. This functionality requires C++11 support.*/
template< typename iType, class Lambda, typename ValueType >
KOKKOS_INLINE_FUNCTION
-void parallel_reduce(const Impl::ThreadVectorLoopBoundariesStruct<iType,Impl::SerialTeamMember >&
+void parallel_reduce(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::SerialTeamMember >&
loop_boundaries, const Lambda & lambda, ValueType& result) {
result = ValueType();
#ifdef KOKKOS_HAVE_PRAGMA_IVDEP
#pragma ivdep
#endif
for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
ValueType tmp = ValueType();
lambda(i,tmp);
result+=tmp;
}
}
/** \brief Intra-thread vector parallel_reduce. Executes lambda(iType i, ValueType & val) for each i=0..N-1.
*
 * The range i=0..N-1 is mapped to all vector lanes of the calling thread and a reduction of
* val is performed using JoinType(ValueType& val, const ValueType& update) and put into init_result.
* The input value of init_result is used as initializer for temporary variables of ValueType. Therefore
* the input value should be the neutral element with respect to the join operation (e.g. '0 for +-' or
* '1 for *'). This functionality requires C++11 support.*/
template< typename iType, class Lambda, typename ValueType, class JoinType >
KOKKOS_INLINE_FUNCTION
-void parallel_reduce(const Impl::ThreadVectorLoopBoundariesStruct<iType,Impl::SerialTeamMember >&
+void parallel_reduce(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::SerialTeamMember >&
loop_boundaries, const Lambda & lambda, const JoinType& join, ValueType& init_result) {
ValueType result = init_result;
#ifdef KOKKOS_HAVE_PRAGMA_IVDEP
#pragma ivdep
#endif
for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
ValueType tmp = ValueType();
lambda(i,tmp);
join(result,tmp);
}
init_result = result;
}
/** \brief Intra-thread vector parallel exclusive prefix sum. Executes lambda(iType i, ValueType & val, bool final)
* for each i=0..N-1.
*
* The range i=0..N-1 is mapped to all vector lanes in the thread and a scan operation is performed.
* Depending on the target execution space the operator might be called twice: once with final=false
* and once with final=true. When final==true val contains the prefix sum value. The contribution of this
* "i" needs to be added to val no matter whether final==true or not. In a serial execution
* (i.e. team_size==1) the operator is only called once with final==true. Scan_val will be set
* to the final sum value over all vector lanes.
* This functionality requires C++11 support.*/
template< typename iType, class FunctorType >
KOKKOS_INLINE_FUNCTION
-void parallel_scan(const Impl::ThreadVectorLoopBoundariesStruct<iType,Impl::SerialTeamMember >&
+void parallel_scan(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::SerialTeamMember >&
loop_boundaries, const FunctorType & lambda) {
typedef Kokkos::Impl::FunctorValueTraits< FunctorType , void > ValueTraits ;
typedef typename ValueTraits::value_type value_type ;
value_type scan_val = value_type();
#ifdef KOKKOS_HAVE_PRAGMA_IVDEP
#pragma ivdep
#endif
for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
lambda(i,scan_val,true);
}
}
} // namespace Kokkos
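/* Illustrative sketch of the vector-range operations above (not part of the
 * upstream change; `member`, `n`, `counts` and `offsets` are hypothetical):
 *
 * \code
 * // Vector-level summation over the fastest-running index.
 * double lane_sum = 0 ;
 * Kokkos::parallel_reduce( Kokkos::ThreadVectorRange( member , n ) ,
 *   [&]( const int i , double & partial ) { partial += counts(i); } ,
 *   lane_sum );
 *
 * // Exclusive prefix sum: each i adds its own contribution to val,
 * // and val holds the prefix sum when final == true.
 * Kokkos::parallel_scan( Kokkos::ThreadVectorRange( member , n ) ,
 *   [&]( const int i , int & val , const bool final ) {
 *     if ( final ) offsets(i) = val ;
 *     val += counts(i) ;
 *   } );
 * \endcode
 */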
namespace Kokkos {
template<class FunctorType>
KOKKOS_INLINE_FUNCTION
void single(const Impl::VectorSingleStruct<Impl::SerialTeamMember>& , const FunctorType& lambda) {
lambda();
}
template<class FunctorType>
KOKKOS_INLINE_FUNCTION
void single(const Impl::ThreadSingleStruct<Impl::SerialTeamMember>& , const FunctorType& lambda) {
lambda();
}
template<class FunctorType, class ValueType>
KOKKOS_INLINE_FUNCTION
void single(const Impl::VectorSingleStruct<Impl::SerialTeamMember>& , const FunctorType& lambda, ValueType& val) {
lambda(val);
}
template<class FunctorType, class ValueType>
KOKKOS_INLINE_FUNCTION
void single(const Impl::ThreadSingleStruct<Impl::SerialTeamMember>& , const FunctorType& lambda, ValueType& val) {
lambda(val);
}
}
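/* Illustrative sketch of the single() overloads above (not part of the
 * upstream change; `member` and `compute_flag` are hypothetical). The lambda
 * runs once per team (PerTeam) or once per thread (PerThread); the optional
 * value argument is broadcast to all callers:
 *
 * \code
 * int flag = 0 ;
 * Kokkos::single( Kokkos::PerTeam( member ) ,
 *   [&]( int & bcast ) { bcast = compute_flag(); } , flag );
 * \endcode
 */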
-#endif // KOKKOS_HAVE_CXX11
#endif // defined( KOKKOS_HAVE_SERIAL )
#endif /* #define KOKKOS_SERIAL_HPP */
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
diff --git a/lib/kokkos/core/src/Kokkos_TaskPolicy.hpp b/lib/kokkos/core/src/Kokkos_TaskPolicy.hpp
index 27139107d..6f6453fd4 100755
--- a/lib/kokkos/core/src/Kokkos_TaskPolicy.hpp
+++ b/lib/kokkos/core/src/Kokkos_TaskPolicy.hpp
@@ -1,467 +1,376 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
// Experimental unified task-data parallel manycore LDRD
#ifndef KOKKOS_TASKPOLICY_HPP
#define KOKKOS_TASKPOLICY_HPP
#include <Kokkos_Core_fwd.hpp>
#include <impl/Kokkos_Traits.hpp>
#include <impl/Kokkos_Tags.hpp>
#include <impl/Kokkos_StaticAssert.hpp>
+#include <impl/Kokkos_AllocationTracker.hpp>
//----------------------------------------------------------------------------
namespace Kokkos {
+namespace Experimental {
namespace Impl {
struct FutureValueTypeIsVoidError {};
template < class ExecSpace , class ResultType , class FunctorType >
class TaskMember ;
template< class ExecPolicy , class ResultType , class FunctorType >
class TaskForEach ;
template< class ExecPolicy , class ResultType , class FunctorType >
class TaskReduce ;
template< class ExecPolicy , class ResultType , class FunctorType >
struct TaskScan ;
} /* namespace Impl */
+} /* namespace Experimental */
} /* namespace Kokkos */
//----------------------------------------------------------------------------
namespace Kokkos {
+namespace Experimental {
/**\brief States of a task */
enum TaskState
{ TASK_STATE_NULL = 0 ///< Does not exist
, TASK_STATE_CONSTRUCTING = 1 ///< Is under construction
, TASK_STATE_WAITING = 2 ///< Is waiting for execution
, TASK_STATE_EXECUTING = 4 ///< Is executing
, TASK_STATE_COMPLETE = 8 ///< Execution is complete
};
-template< class Arg1 = void , class Arg2 = void >
-class FutureArray ;
-
/**
*
* Future< space > // value_type == void
* Future< value > // space == Default
* Future< value , space >
*
*/
template< class Arg1 = void , class Arg2 = void >
class Future {
private:
template< class , class , class > friend class Impl::TaskMember ;
template< class > friend class TaskPolicy ;
template< class , class > friend class Future ;
- template< class , class > friend class FutureArray ;
// Argument #2, if not void, must be the space.
- enum { Arg1_is_space = Impl::is_execution_space< Arg1 >::value };
- enum { Arg2_is_space = Impl::is_execution_space< Arg2 >::value };
- enum { Arg2_is_void = Impl::is_same< Arg2 , void >::value };
+ enum { Arg1_is_space = Kokkos::Impl::is_execution_space< Arg1 >::value };
+ enum { Arg2_is_space = Kokkos::Impl::is_execution_space< Arg2 >::value };
+ enum { Arg2_is_void = Kokkos::Impl::is_same< Arg2 , void >::value };
struct ErrorNoExecutionSpace {};
enum { Opt1 = Arg1_is_space && Arg2_is_void
, Opt2 = ! Arg1_is_space && Arg2_is_void
, Opt3 = ! Arg1_is_space && Arg2_is_space
- , OptOK = Impl::StaticAssert< Opt1 || Opt2 || Opt3 , ErrorNoExecutionSpace >::value
+ , OptOK = Kokkos::Impl::StaticAssert< Opt1 || Opt2 || Opt3 , ErrorNoExecutionSpace >::value
};
typedef typename
- Impl::if_c< Opt2 || Opt3 , Arg1 , void >::type
+ Kokkos::Impl::if_c< Opt2 || Opt3 , Arg1 , void >::type
ValueType ;
typedef typename
- Impl::if_c< Opt1 , Arg1 , typename
- Impl::if_c< Opt2 , Kokkos::DefaultExecutionSpace , typename
- Impl::if_c< Opt3 , Arg2 , void
+ Kokkos::Impl::if_c< Opt1 , Arg1 , typename
+ Kokkos::Impl::if_c< Opt2 , Kokkos::DefaultExecutionSpace , typename
+ Kokkos::Impl::if_c< Opt3 , Arg2 , void
>::type >::type >::type
ExecutionSpace ;
typedef Impl::TaskMember< ExecutionSpace , void , void > TaskRoot ;
typedef Impl::TaskMember< ExecutionSpace , ValueType , void > TaskValue ;
TaskRoot * m_task ;
public:
typedef ValueType value_type;
typedef ExecutionSpace execution_space ;
//----------------------------------------
KOKKOS_INLINE_FUNCTION
TaskState get_task_state() const
{ return 0 != m_task ? m_task->get_state() : TASK_STATE_NULL ; }
//----------------------------------------
explicit
Future( TaskRoot * task )
: m_task(0)
{ TaskRoot::assign( & m_task , TaskRoot::template verify_type< value_type >( task ) ); }
//----------------------------------------
KOKKOS_INLINE_FUNCTION
- ~Future() { TaskRoot::assign( & m_task , 0 , true /* no_throw */ ); }
+ ~Future() { TaskRoot::assign( & m_task , 0 ); }
//----------------------------------------
KOKKOS_INLINE_FUNCTION
Future() : m_task(0) {}
KOKKOS_INLINE_FUNCTION
Future( const Future & rhs )
: m_task(0)
{ TaskRoot::assign( & m_task , rhs.m_task ); }
KOKKOS_INLINE_FUNCTION
Future & operator = ( const Future & rhs )
{ TaskRoot::assign( & m_task , rhs.m_task ); return *this ; }
//----------------------------------------
template< class A1 , class A2 >
KOKKOS_INLINE_FUNCTION
Future( const Future<A1,A2> & rhs )
: m_task(0)
{ TaskRoot::assign( & m_task , TaskRoot::template verify_type< value_type >( rhs.m_task ) ); }
template< class A1 , class A2 >
KOKKOS_INLINE_FUNCTION
Future & operator = ( const Future<A1,A2> & rhs )
{ TaskRoot::assign( & m_task , TaskRoot::template verify_type< value_type >( rhs.m_task ) ); return *this ; }
//----------------------------------------
typedef typename TaskValue::get_result_type get_result_type ;
KOKKOS_INLINE_FUNCTION
get_result_type get() const
{ return static_cast<TaskValue*>( m_task )->get(); }
};
namespace Impl {
template< class T >
struct is_future : public Kokkos::Impl::bool_< false > {};
template< class Arg0 , class Arg1 >
-struct is_future< Kokkos::Future<Arg0,Arg1> > : public Kokkos::Impl::bool_< true > {};
+struct is_future< Kokkos::Experimental::Future<Arg0,Arg1> > : public Kokkos::Impl::bool_< true > {};
} /* namespace Impl */
+} /* namespace Experimental */
} /* namespace Kokkos */
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
-
-template< class Arg1 , class Arg2 >
-class FutureArray {
-private:
-
- typedef Future<Arg1,Arg2> future_type ;
-
- typedef typename future_type::execution_space ExecutionSpace ;
- typedef typename ExecutionSpace::memory_space MemorySpace ;
-
- typedef Impl::TaskMember< ExecutionSpace , void , void > TaskRoot ;
-
- future_type * m_future ;
-
- //----------------------------------------
-
-public:
-
- typedef ExecutionSpace execution_space ;
- typedef future_type value_type ;
-
- //----------------------------------------
-
- KOKKOS_INLINE_FUNCTION
- size_t size() const
- { return m_future ? reinterpret_cast<size_t>(m_future->m_task) : size_t(0) ; }
-
- KOKKOS_INLINE_FUNCTION
- value_type & operator[]( const int i ) const
- { return m_future[i+1]; }
-
- //----------------------------------------
-
- KOKKOS_INLINE_FUNCTION
- ~FutureArray()
- {
-#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
- if ( m_future ) {
- const size_t n = size();
- for ( size_t i = 1 ; i <= n ; ++i ) {
- TaskRoot::assign( & m_future[i].m_task , 0 );
- }
- m_future[0].m_task = 0 ;
- MemorySpace::decrement( m_future );
- }
-#endif
- }
-
- KOKKOS_INLINE_FUNCTION
- FutureArray() : m_future(0) {}
-
- inline
- FutureArray( const size_t n )
- : m_future(0)
- {
- if ( n ) {
- m_future = (future_type *) MemorySpace::allocate( "FutureArray" , sizeof(future_type) * ( n + 1 ) );
- for ( size_t i = 0 ; i <= n ; ++i ) m_future[i].m_task = 0 ;
- }
- }
-
- KOKKOS_INLINE_FUNCTION
- FutureArray( const FutureArray & rhs )
- : m_future( rhs.m_future )
- {
-#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
- MemorySpace::increment( m_future );
-#endif
- }
-
- KOKKOS_INLINE_FUNCTION
- FutureArray & operator = ( const FutureArray & rhs )
- {
-#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
- MemorySpace::decrement( m_future );
- MemorySpace::increment( rhs.m_future );
-#endif
- m_future = rhs.m_future ;
- return *this ;
- }
-};
-
-} /* namespace Kokkos */
-
-//----------------------------------------------------------------------------
-
-namespace Kokkos {
+namespace Experimental {
/** \brief  If the argument is an execution space then this policy creates serial tasks in that space */
template< class Arg0 = Kokkos::DefaultExecutionSpace >
class TaskPolicy {
public:
typedef typename Arg0::execution_space execution_space ;
//----------------------------------------
/** \brief Create a serial task with storage for dependences.
*
* Postcondition: Task is in the 'constructing' state.
*/
template< class FunctorType >
Future< typename FunctorType::value_type , execution_space >
create( const FunctorType & functor
, const unsigned dependence_capacity /* = default */ ) const ;
/** \brief Create a foreach task with storage for dependences. */
template< class ExecPolicy , class FunctorType >
Future< typename FunctorType::value_type , execution_space >
create_foreach( const ExecPolicy & policy
, const FunctorType & functor
, const unsigned dependence_capacity /* = default */ ) const ;
/** \brief Create a reduce task with storage for dependences. */
template< class ExecPolicy , class FunctorType >
Future< typename FunctorType::value_type , execution_space >
create_reduce( const ExecPolicy & policy
, const FunctorType & functor
, const unsigned dependence_capacity /* = default */ ) const ;
/** \brief Create a scan task with storage for dependences. */
template< class ExecPolicy , class FunctorType >
Future< typename FunctorType::value_type , execution_space >
create_scan( const ExecPolicy & policy
, const FunctorType & functor
, const unsigned dependence_capacity /* = default */ ) const ;
/** \brief Set dependence that 'after' cannot start execution
* until 'before' has completed.
*
 *  Precondition: The 'after' task must be in the 'Constructing' state.
*/
template< class TA , class TB >
void set_dependence( const Future<TA,execution_space> & after
, const Future<TB,execution_space> & before ) const ;
/** \brief Spawn a task in the 'Constructing' state
*
* Precondition: Task is in the 'constructing' state.
* Postcondition: Task is waiting, executing, or complete.
*/
template< class T >
const Future<T,execution_space> &
spawn( const Future<T,execution_space> & ) const ;
//----------------------------------------
/** \brief Query dependence of an executing task */
template< class FunctorType >
Future< execution_space >
get_dependence( FunctorType * , const int ) const ;
//----------------------------------------
/** \brief Clear current dependences of an executing task
* in preparation for setting new dependences and
* respawning.
*
* Precondition: The functor must be a task in the executing state.
*/
template< class FunctorType >
void clear_dependence( FunctorType * ) const ;
/** \brief Set dependence that 'after' cannot start execution
* until 'before' has completed.
*
* The 'after' functor must be in the executing state
*/
template< class FunctorType , class TB >
void set_dependence( FunctorType * after
, const Future<TB,execution_space> & before ) const ;
/** \brief Respawn (reschedule) an executing task to be called again
* after all dependences have completed.
*/
template< class FunctorType >
void respawn( FunctorType * ) const ;
};
//----------------------------------------------------------------------------
/** \brief Create and spawn a single-thread task */
template< class ExecSpace , class FunctorType >
inline
Future< typename FunctorType::value_type , ExecSpace >
spawn( TaskPolicy<ExecSpace> & policy , const FunctorType & functor )
{ return policy.spawn( policy.create( functor ) ); }
/** \brief Create and spawn a single-thread task with dependences */
template< class ExecSpace , class FunctorType , class Arg0 , class Arg1 >
inline
Future< typename FunctorType::value_type , ExecSpace >
spawn( TaskPolicy<ExecSpace> & policy
, const FunctorType & functor
, const Future<Arg0,Arg1> & before_0
, const Future<Arg0,Arg1> & before_1 )
{
Future< typename FunctorType::value_type , ExecSpace > f ;
f = policy.create( functor , 2 );
policy.add_dependence( f , before_0 );
policy.add_dependence( f , before_1 );
policy.spawn( f );
return f ;
}
//----------------------------------------------------------------------------
/** \brief Create and spawn a parallel_for task */
template< class ExecSpace , class ParallelPolicyType , class FunctorType >
inline
Future< typename FunctorType::value_type , ExecSpace >
spawn_foreach( TaskPolicy<ExecSpace> & task_policy
, const ParallelPolicyType & parallel_policy
, const FunctorType & functor )
{ return task_policy.spawn( task_policy.create_foreach( parallel_policy , functor ) ); }
/** \brief Create and spawn a parallel_reduce task */
template< class ExecSpace , class ParallelPolicyType , class FunctorType >
inline
Future< typename FunctorType::value_type , ExecSpace >
spawn_reduce( TaskPolicy<ExecSpace> & task_policy
, const ParallelPolicyType & parallel_policy
, const FunctorType & functor )
{ return task_policy.spawn( task_policy.create_reduce( parallel_policy , functor ) ); }
//----------------------------------------------------------------------------
/** \brief Respawn a task functor with dependences */
template< class ExecSpace , class FunctorType , class Arg0 , class Arg1 >
inline
void respawn( TaskPolicy<ExecSpace> & policy
, FunctorType * functor
, const Future<Arg0,Arg1> & before_0
, const Future<Arg0,Arg1> & before_1
)
{
policy.clear_dependence( functor );
policy.add_dependence( functor , before_0 );
policy.add_dependence( functor , before_1 );
policy.respawn( functor );
}
//----------------------------------------------------------------------------
template< class ExecSpace >
void wait( TaskPolicy< ExecSpace > & );
-template< class A0 , class A1 >
-inline
-void wait( const Future<A0,A1> & future )
-{
- wait( Future< void , typename Future<A0,A1>::execution_space >( future ) );
-}
-
+} /* namespace Experimental */
} /* namespace Kokkos */
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
#endif /* #define KOKKOS_TASKPOLICY_HPP */
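/* Illustrative sketch of the experimental task interface declared above (not
 * part of the upstream change; the HelloTask functor and its apply() protocol
 * are assumed from the task-DAG examples shipped with Kokkos and may differ):
 *
 * \code
 * struct HelloTask {
 *   typedef long value_type ;
 *   void apply( long & result ) { result = 42 ; }
 * };
 *
 * Kokkos::Experimental::TaskPolicy< Kokkos::Serial > policy ;
 * Kokkos::Experimental::Future< long , Kokkos::Serial > f =
 *   Kokkos::Experimental::spawn( policy , HelloTask() );   // create + spawn
 * Kokkos::Experimental::wait( policy );                     // drain the task queue
 * long value = f.get();                                     // fetch the result
 * \endcode
 */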
diff --git a/lib/kokkos/core/src/Kokkos_Threads.hpp b/lib/kokkos/core/src/Kokkos_Threads.hpp
index 3d6a64c4b..4661b714b 100755
--- a/lib/kokkos/core/src/Kokkos_Threads.hpp
+++ b/lib/kokkos/core/src/Kokkos_Threads.hpp
@@ -1,214 +1,217 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_THREADS_HPP
#define KOKKOS_THREADS_HPP
#include <Kokkos_Core_fwd.hpp>
#if defined( KOKKOS_HAVE_PTHREAD )
#include <cstddef>
#include <iosfwd>
#include <Kokkos_HostSpace.hpp>
#include <Kokkos_ScratchSpace.hpp>
#include <Kokkos_Layout.hpp>
#include <Kokkos_MemoryTraits.hpp>
#include <impl/Kokkos_Tags.hpp>
/*--------------------------------------------------------------------------*/
namespace Kokkos {
namespace Impl {
class ThreadsExec ;
} // namespace Impl
} // namespace Kokkos
/*--------------------------------------------------------------------------*/
namespace Kokkos {
/** \brief Execution space for a pool of Pthreads or C11 threads on a CPU. */
class Threads {
public:
//! \name Type declarations that all Kokkos devices must provide.
//@{
//! Tag this class as a kokkos execution space
typedef Threads execution_space ;
typedef Kokkos::HostSpace memory_space ;
+
+  //! This execution space's preferred device_type
+ typedef Kokkos::Device<execution_space,memory_space> device_type;
+
typedef Kokkos::LayoutRight array_layout ;
typedef memory_space::size_type size_type ;
typedef ScratchMemorySpace< Threads > scratch_memory_space ;
- //! For backward compatibility
- typedef Threads device_type ;
//@}
/*------------------------------------------------------------------------*/
//! \name Static functions that all Kokkos devices must implement.
//@{
/// \brief True if and only if this method is being called in a
/// thread-parallel function.
static int in_parallel();
/** \brief Set the device in a "sleep" state.
*
* This function sets the device in a "sleep" state in which it is
* not ready for work. This may consume less resources than if the
* device were in an "awake" state, but it may also take time to
* bring the device from a sleep state to be ready for work.
*
* \return True if the device is in the "sleep" state, else false if
* the device is actively working and could not enter the "sleep"
* state.
*/
static bool sleep();
/// \brief Wake the device from the 'sleep' state so it is ready for work.
///
/// \return True if the device is in the "ready" state, else "false"
/// if the device is actively working (which also means that it's
/// awake).
static bool wake();
/// \brief Wait until all dispatched functors complete.
///
/// The parallel_for or parallel_reduce dispatch of a functor may
/// return asynchronously, before the functor completes. This
/// method does not return until all dispatched functors on this
/// device have completed.
static void fence();
/// \brief Free any resources being consumed by the device.
///
/// For the Threads device, this terminates spawned worker threads.
static void finalize();
/// \brief Print configuration information to the given output stream.
static void print_configuration( std::ostream & , const bool detail = false );
//@}
/*------------------------------------------------------------------------*/
/*------------------------------------------------------------------------*/
//! \name Space-specific functions
//@{
/** \brief Initialize the device in the "ready to work" state.
*
* The device is initialized in a "ready to work" or "awake" state.
* This state reduces latency and thus improves performance when
* dispatching work. However, the "awake" state consumes resources
* even when no work is being done. You may call sleep() to put
* the device in a "sleeping" state that does not consume as many
* resources, but it will take time (latency) to awaken the device
* again (via the wake()) method so that it is ready for work.
*
* Teams of threads are distributed as evenly as possible across
* the requested number of numa regions and cores per numa region.
* A team will not be split across a numa region.
*
 *  If the 'use_' arguments are not supplied, hwloc is queried
* to use all available cores.
*/
- static void initialize( unsigned threads_count = 1 ,
+ static void initialize( unsigned threads_count = 0 ,
unsigned use_numa_count = 0 ,
unsigned use_cores_per_numa = 0 ,
bool allow_asynchronous_threadpool = false );
static int is_initialized();
static Threads & instance( int = 0 );
//----------------------------------------
static int thread_pool_size( int depth = 0 );
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
static int thread_pool_rank();
#else
KOKKOS_INLINE_FUNCTION static int thread_pool_rank() { return 0 ; }
#endif
inline static unsigned max_hardware_threads() { return thread_pool_size(0); }
KOKKOS_INLINE_FUNCTION static unsigned hardware_thread_id() { return thread_pool_rank(); }
//@}
//----------------------------------------
};
} // namespace Kokkos
/*--------------------------------------------------------------------------*/
namespace Kokkos {
namespace Impl {
template<>
struct VerifyExecutionCanAccessMemorySpace
< Kokkos::Threads::memory_space
, Kokkos::Threads::scratch_memory_space
>
{
enum { value = true };
inline static void verify( void ) { }
inline static void verify( const void * ) { }
};
} // namespace Impl
} // namespace Kokkos
/*--------------------------------------------------------------------------*/
#include <Kokkos_ExecPolicy.hpp>
#include <Kokkos_Parallel.hpp>
#include <Threads/Kokkos_ThreadsExec.hpp>
+#include <Threads/Kokkos_ThreadsTeam.hpp>
#include <Threads/Kokkos_Threads_Parallel.hpp>
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
#endif /* #if defined( KOKKOS_HAVE_PTHREAD ) */
#endif /* #define KOKKOS_THREADS_HPP */
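/* Illustrative sketch of driving the Threads execution space with the changed
 * default above (not part of the upstream change; `functor` and N are
 * hypothetical, and threads_count = 0 is taken to mean "let hwloc pick all
 * available cores"):
 *
 * \code
 * Kokkos::Threads::initialize();                     // auto-detected thread pool
 * Kokkos::parallel_for( Kokkos::RangePolicy< Kokkos::Threads >( 0 , N ) , functor );
 * Kokkos::Threads::fence();                          // wait for dispatched work
 * Kokkos::Threads::finalize();                       // shut down worker threads
 * \endcode
 */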
diff --git a/lib/kokkos/core/src/Kokkos_Vectorization.hpp b/lib/kokkos/core/src/Kokkos_Vectorization.hpp
index 8a91f2529..a60c0ecaa 100755
--- a/lib/kokkos/core/src/Kokkos_Vectorization.hpp
+++ b/lib/kokkos/core/src/Kokkos_Vectorization.hpp
@@ -1,100 +1,53 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
/// \file Kokkos_Vectorization.hpp
/// \brief Declaration and definition of Kokkos::Vectorization interface.
#ifndef KOKKOS_VECTORIZATION_HPP
#define KOKKOS_VECTORIZATION_HPP
-#include <Kokkos_Core_fwd.hpp>
-#include <Kokkos_ExecPolicy.hpp>
-
-namespace Kokkos {
-template<class Space, int N>
-struct Vectorization {
- typedef Kokkos::TeamPolicy< Space > team_policy ;
- typedef typename team_policy::member_type team_member ;
-
- enum {increment = 1};
-
- KOKKOS_FORCEINLINE_FUNCTION
- static int begin() { return 0;}
-
- KOKKOS_FORCEINLINE_FUNCTION
- static int thread_rank(const team_member &dev) {
- return dev.team_rank();
- }
-
- KOKKOS_FORCEINLINE_FUNCTION
- static int team_rank(const team_member &dev) {
- return dev.team_rank()/increment;
- }
-
- KOKKOS_FORCEINLINE_FUNCTION
- static int team_size(const team_member &dev) {
- return dev.team_size()/increment;
- }
-
- KOKKOS_FORCEINLINE_FUNCTION
- static int global_thread_rank(const team_member &dev) {
- return (dev.league_rank()*dev.team_size()+dev.team_rank());
- }
-
- KOKKOS_FORCEINLINE_FUNCTION
- static bool is_lane_0(const team_member &dev) {
- return true;
- }
-
- template<class Scalar>
- KOKKOS_FORCEINLINE_FUNCTION
- static Scalar reduce(const Scalar& val) {
- return val;
- }
-};
-}
-
#if defined( KOKKOS_HAVE_CUDA )
#include <Cuda/Kokkos_Cuda_Vectorization.hpp>
#endif
#endif
diff --git a/lib/kokkos/core/src/Kokkos_View.hpp b/lib/kokkos/core/src/Kokkos_View.hpp
index c215214bf..cd6c8af9f 100755
--- a/lib/kokkos/core/src/Kokkos_View.hpp
+++ b/lib/kokkos/core/src/Kokkos_View.hpp
@@ -1,1863 +1,1915 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_VIEW_HPP
#define KOKKOS_VIEW_HPP
#include <string>
#include <Kokkos_Core_fwd.hpp>
#include <Kokkos_HostSpace.hpp>
#include <Kokkos_MemoryTraits.hpp>
+#if ! defined( KOKKOS_USING_EXPERIMENTAL_VIEW )
+
#include <impl/Kokkos_StaticAssert.hpp>
#include <impl/Kokkos_Traits.hpp>
#include <impl/Kokkos_Shape.hpp>
#include <impl/Kokkos_AnalyzeShape.hpp>
#include <impl/Kokkos_ViewOffset.hpp>
#include <impl/Kokkos_ViewSupport.hpp>
#include <impl/Kokkos_Tags.hpp>
+#include <type_traits>
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
/** \brief View specialization mapping of view traits to a specialization tag */
template< class ValueType ,
class ArraySpecialize ,
class ArrayLayout ,
class MemorySpace ,
class MemoryTraits >
struct ViewSpecialize ;
/** \brief Defines the type of a subview given a source view type
* and subview argument types.
*/
template< class SrcViewType
, class Arg0Type
, class Arg1Type
, class Arg2Type
, class Arg3Type
, class Arg4Type
, class Arg5Type
, class Arg6Type
, class Arg7Type
>
struct ViewSubview /* { typedef ... type ; } */ ;
template< class DstViewSpecialize ,
class SrcViewSpecialize = void ,
class Enable = void >
struct ViewAssignment ;
template< class DstMemorySpace , class SrcMemorySpace >
struct DeepCopy ;
} /* namespace Impl */
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
/** \class ViewTraits
* \brief Traits class for accessing attributes of a View.
*
* This is an implementation detail of View. It is only of interest
* to developers implementing a new specialization of View.
*
* Template argument permutations:
* - View< DataType , void , void , void >
* - View< DataType , Space , void , void >
* - View< DataType , Space , MemoryTraits , void >
* - View< DataType , Space , void , MemoryTraits >
* - View< DataType , ArrayLayout , void , void >
* - View< DataType , ArrayLayout , Space , void >
* - View< DataType , ArrayLayout , MemoryTraits , void >
* - View< DataType , ArrayLayout , Space , MemoryTraits >
* - View< DataType , MemoryTraits , void , void >
*/
template< class DataType ,
class Arg1 = void ,
class Arg2 = void ,
class Arg3 = void >
class ViewTraits {
private:
// Layout, Space, and MemoryTraits are optional
// but need to appear in that order. That means Layout
// can only be Arg1, Space can be Arg1 or Arg2, and
// MemoryTraits can be Arg1, Arg2 or Arg3
enum { Arg1IsLayout = Impl::is_array_layout<Arg1>::value };
enum { Arg1IsSpace = Impl::is_space<Arg1>::value };
enum { Arg2IsSpace = Impl::is_space<Arg2>::value };
enum { Arg1IsMemoryTraits = Impl::is_memory_traits<Arg1>::value };
enum { Arg2IsMemoryTraits = Impl::is_memory_traits<Arg2>::value };
enum { Arg3IsMemoryTraits = Impl::is_memory_traits<Arg3>::value };
enum { Arg1IsVoid = Impl::is_same< Arg1 , void >::value };
enum { Arg2IsVoid = Impl::is_same< Arg2 , void >::value };
enum { Arg3IsVoid = Impl::is_same< Arg3 , void >::value };
// Arg1 is Layout, Space, MemoryTraits, or void
typedef typename
Impl::StaticAssert<
( 1 == Arg1IsLayout + Arg1IsSpace + Arg1IsMemoryTraits + Arg1IsVoid )
, Arg1 >::type Arg1Verified ;
// If Arg1 is Layout then Arg2 is Space, MemoryTraits, or void
// If Arg1 is Space then Arg2 is MemoryTraits or void
// If Arg1 is MemoryTraits then Arg2 is void
// If Arg1 is Void then Arg2 is void
typedef typename
Impl::StaticAssert<
( Arg1IsLayout && ( 1 == Arg2IsSpace + Arg2IsMemoryTraits + Arg2IsVoid ) ) ||
( Arg1IsSpace && ( 0 == Arg2IsSpace ) && ( 1 == Arg2IsMemoryTraits + Arg2IsVoid ) ) ||
( Arg1IsMemoryTraits && Arg2IsVoid ) ||
( Arg1IsVoid && Arg2IsVoid )
, Arg2 >::type Arg2Verified ;
// Arg3 is MemoryTraits or void and at most one argument is MemoryTraits
typedef typename
Impl::StaticAssert<
( 1 == Arg3IsMemoryTraits + Arg3IsVoid ) &&
( Arg1IsMemoryTraits + Arg2IsMemoryTraits + Arg3IsMemoryTraits <= 1 )
, Arg3 >::type Arg3Verified ;
// Arg1 or Arg2 may have execution and memory spaces
typedef typename Impl::if_c<( Arg1IsSpace ), Arg1Verified ,
typename Impl::if_c<( Arg2IsSpace ), Arg2Verified ,
Kokkos::DefaultExecutionSpace
>::type >::type::execution_space ExecutionSpace ;
typedef typename Impl::if_c<( Arg1IsSpace ), Arg1Verified ,
typename Impl::if_c<( Arg2IsSpace ), Arg2Verified ,
Kokkos::DefaultExecutionSpace
>::type >::type::memory_space MemorySpace ;
typedef typename Impl::is_space<
typename Impl::if_c<( Arg1IsSpace ), Arg1Verified ,
typename Impl::if_c<( Arg2IsSpace ), Arg2Verified ,
Kokkos::DefaultExecutionSpace
>::type >::type >::host_mirror_space HostMirrorSpace ;
// Arg1 may be array layout
typedef typename Impl::if_c< Arg1IsLayout , Arg1Verified ,
typename ExecutionSpace::array_layout
>::type ArrayLayout ;
// Arg1, Arg2, or Arg3 may be memory traits
typedef typename Impl::if_c< Arg1IsMemoryTraits , Arg1Verified ,
typename Impl::if_c< Arg2IsMemoryTraits , Arg2Verified ,
typename Impl::if_c< Arg3IsMemoryTraits , Arg3Verified ,
MemoryManaged
>::type >::type >::type MemoryTraits ;
typedef Impl::AnalyzeShape<DataType> analysis ;
public:
//------------------------------------
// Data type traits:
typedef DataType data_type ;
typedef typename analysis::const_type const_data_type ;
typedef typename analysis::non_const_type non_const_data_type ;
//------------------------------------
// Array of intrinsic scalar type traits:
typedef typename analysis::array_intrinsic_type array_intrinsic_type ;
typedef typename analysis::const_array_intrinsic_type const_array_intrinsic_type ;
typedef typename analysis::non_const_array_intrinsic_type non_const_array_intrinsic_type ;
//------------------------------------
// Value type traits:
typedef typename analysis::value_type value_type ;
typedef typename analysis::const_value_type const_value_type ;
typedef typename analysis::non_const_value_type non_const_value_type ;
//------------------------------------
// Layout and shape traits:
typedef ArrayLayout array_layout ;
typedef typename analysis::shape shape_type ;
enum { rank = shape_type::rank };
enum { rank_dynamic = shape_type::rank_dynamic };
//------------------------------------
// Execution space, memory space, memory access traits, and host mirror space.
typedef ExecutionSpace execution_space ;
typedef MemorySpace memory_space ;
+ typedef Device<ExecutionSpace,MemorySpace> device_type ;
typedef MemoryTraits memory_traits ;
typedef HostMirrorSpace host_mirror_space ;
typedef typename memory_space::size_type size_type ;
enum { is_hostspace = Impl::is_same< memory_space , HostSpace >::value };
enum { is_managed = memory_traits::Unmanaged == 0 };
enum { is_random_access = memory_traits::RandomAccess == 1 };
//------------------------------------
- typedef ExecutionSpace device_type ; // for backward compatibility, to be removed
//------------------------------------
// Specialization tag:
typedef typename
Impl::ViewSpecialize< value_type
, typename analysis::specialize
, array_layout
, memory_space
, memory_traits
>::type specialize ;
};
} /* namespace Kokkos */
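/* Illustrative compile-time sketch of how ViewTraits resolves its optional
 * template arguments (not part of the upstream change; assumes the default
 * execution space is enabled):
 *
 * \code
 * typedef Kokkos::ViewTraits< double** , Kokkos::LayoutLeft ,
 *                             Kokkos::DefaultExecutionSpace > traits ;
 * static_assert( traits::rank == 2 , "two run-time dimensions" );
 * static_assert( Kokkos::Impl::is_same< traits::array_layout ,
 *                                       Kokkos::LayoutLeft >::value ,
 *                "Arg1 was recognized as the layout" );
 * \endcode
 */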
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
class ViewDefault {};
/** \brief Default view specialization has LayoutLeft, LayoutRight, or LayoutStride.
*/
template< class ValueType , class MemorySpace , class MemoryTraits >
struct ViewSpecialize< ValueType , void , LayoutLeft , MemorySpace , MemoryTraits >
{ typedef ViewDefault type ; };
template< class ValueType , class MemorySpace , class MemoryTraits >
struct ViewSpecialize< ValueType , void , LayoutRight , MemorySpace , MemoryTraits >
{ typedef ViewDefault type ; };
template< class ValueType , class MemorySpace , class MemoryTraits >
struct ViewSpecialize< ValueType , void , LayoutStride , MemorySpace , MemoryTraits >
{ typedef ViewDefault type ; };
} /* namespace Impl */
} /* namespace Kokkos */
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
/** \brief Types for compile-time detection of View usage errors */
namespace ViewError {
struct allocation_constructor_requires_managed {};
struct allocation_constructor_requires_nonconst {};
struct user_pointer_constructor_requires_unmanaged {};
struct device_shmem_constructor_requires_unmanaged {};
struct scalar_operator_called_from_non_scalar_view {};
} /* namespace ViewError */
//----------------------------------------------------------------------------
/** \brief Enable view parentheses operator for
* match of layout and integral arguments.
* If correct rank define type from traits,
* otherwise define type as an error message.
*/
template< class ReturnType , class Traits , class Layout , unsigned Rank ,
typename iType0 = int , typename iType1 = int ,
typename iType2 = int , typename iType3 = int ,
typename iType4 = int , typename iType5 = int ,
typename iType6 = int , typename iType7 = int ,
class Enable = void >
struct ViewEnableArrayOper ;
template< class ReturnType , class Traits , class Layout , unsigned Rank ,
typename iType0 , typename iType1 ,
typename iType2 , typename iType3 ,
typename iType4 , typename iType5 ,
typename iType6 , typename iType7 >
struct ViewEnableArrayOper<
ReturnType , Traits , Layout , Rank ,
iType0 , iType1 , iType2 , iType3 ,
iType4 , iType5 , iType6 , iType7 ,
typename enable_if<
iType0(0) == 0 && iType1(0) == 0 && iType2(0) == 0 && iType3(0) == 0 &&
iType4(0) == 0 && iType5(0) == 0 && iType6(0) == 0 && iType7(0) == 0 &&
is_same< typename Traits::array_layout , Layout >::value &&
( unsigned(Traits::rank) == Rank )
>::type >
{
typedef ReturnType type ;
};
} /* namespace Impl */
} /* namespace Kokkos */
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
/** \class View
* \brief View to an array of data.
*
* A View represents an array of one or more dimensions.
* For details, please refer to Kokkos' tutorial materials.
*
* \section Kokkos_View_TemplateParameters Template parameters
*
* This class has both required and optional template parameters. The
* \c DataType parameter must always be provided, and must always be
* first. The parameters \c Arg1Type, \c Arg2Type, and \c Arg3Type are
* placeholders for different template parameters. The default value
* of the fifth template parameter \c Specialize suffices for most use
* cases. When explaining the template parameters, we won't refer to
* \c Arg1Type, \c Arg2Type, and \c Arg3Type; instead, we will refer
* to the valid categories of template parameters, in whatever order
* they may occur.
*
* Valid ways in which template arguments may be specified:
* - View< DataType , Space >
* - View< DataType , Space , MemoryTraits >
* - View< DataType , Space , void , MemoryTraits >
* - View< DataType , Layout , Space >
* - View< DataType , Layout , Space , MemoryTraits >
*
* \tparam DataType (required) This indicates both the type of each
* entry of the array, and the combination of compile-time and
* run-time array dimension(s). For example, <tt>double*</tt>
* indicates a one-dimensional array of \c double with run-time
* dimension, and <tt>int*[3]</tt> a two-dimensional array of \c int
* with run-time first dimension and compile-time second dimension
* (of 3). In general, the run-time dimensions (if any) must go
* first, followed by zero or more compile-time dimensions. For
* more examples, please refer to the tutorial materials.
*
* \tparam Space (required) The memory space.
*
* \tparam Layout (optional) The array's layout in memory. For
* example, LayoutLeft indicates a column-major (Fortran style)
* layout, and LayoutRight a row-major (C style) layout. If not
* specified, this defaults to the preferred layout for the
* <tt>Space</tt>.
*
* \tparam MemoryTraits (optional) Assertion of the user's intended
* access behavior. For example, RandomAccess indicates read-only
* access with limited spatial locality, and Unmanaged lets users
* wrap externally allocated memory in a View without automatic
* deallocation.
*
* \section Kokkos_View_MT MemoryTraits discussion
*
* \subsection Kokkos_View_MT_Interp MemoryTraits interpretation depends on Space
*
* Some \c MemoryTraits options may have different interpretations for
* different \c Space types. For example, with the Cuda device,
* \c RandomAccess tells Kokkos to fetch the data through the texture
* cache, whereas the non-GPU devices have no such hardware construct.
*
* \subsection Kokkos_View_MT_PrefUse Preferred use of MemoryTraits
*
* Users should defer applying the optional \c MemoryTraits parameter
* until the point at which they actually plan to rely on it in a
* computational kernel. This minimizes the number of template
* parameters exposed in their code, which reduces the cost of
* compilation. Users may always assign a View without specified
* \c MemoryTraits to a compatible View with that specification.
* For example:
* \code
* // Pass in the simplest types of View possible.
* void
* doSomething (View<double*, Cuda> out,
* View<const double*, Cuda> in)
* {
* // Assign the "generic" View in to a RandomAccess View in_rr.
* // Note that RandomAccess View objects must have const data.
* View<const double*, Cuda, RandomAccess> in_rr = in;
* // ... do something with in_rr and out ...
* }
* \endcode
*/
template< class DataType ,
class Arg1Type = void , /* ArrayLayout, SpaceType, or MemoryTraits */
class Arg2Type = void , /* SpaceType or MemoryTraits */
class Arg3Type = void , /* MemoryTraits */
class Specialize =
typename ViewTraits<DataType,Arg1Type,Arg2Type,Arg3Type>::specialize >
class View ;
namespace Impl {
template< class C >
struct is_view : public bool_< false > {};
template< class D , class A1 , class A2 , class A3 , class S >
struct is_view< View< D , A1 , A2 , A3 , S > > : public bool_< true > {};
}
//----------------------------------------------------------------------------
template< class DataType ,
class Arg1Type ,
class Arg2Type ,
class Arg3Type >
class View< DataType , Arg1Type , Arg2Type , Arg3Type , Impl::ViewDefault >
: public ViewTraits< DataType , Arg1Type , Arg2Type, Arg3Type >
{
public:
typedef ViewTraits< DataType , Arg1Type , Arg2Type, Arg3Type > traits ;
private:
// Assignment of compatible views requirement:
template< class , class , class , class , class > friend class View ;
// Assignment of compatible subview requirement:
template< class , class , class > friend struct Impl::ViewAssignment ;
// Dimensions, cardinality, capacity, and offset computation for
// multidimensional array view of contiguous memory.
// Inherits from Impl::Shape
typedef Impl::ViewOffset< typename traits::shape_type
, typename traits::array_layout
> offset_map_type ;
// Intermediary class for data management and access
typedef Impl::ViewDataManagement< traits > view_data_management ;
//----------------------------------------
// Data members:
typename view_data_management::handle_type m_ptr_on_device ;
offset_map_type m_offset_map ;
view_data_management m_management ;
+ Impl::AllocationTracker m_tracker ;
//----------------------------------------
public:
/** return type for all indexing operators */
typedef typename view_data_management::return_type reference_type ;
+ enum { reference_type_is_lvalue = view_data_management::ReturnTypeIsReference };
+
typedef View< typename traits::array_intrinsic_type ,
typename traits::array_layout ,
- typename traits::execution_space ,
+ typename traits::device_type ,
typename traits::memory_traits > array_type ;
typedef View< typename traits::const_data_type ,
typename traits::array_layout ,
- typename traits::execution_space ,
+ typename traits::device_type ,
typename traits::memory_traits > const_type ;
typedef View< typename traits::non_const_data_type ,
typename traits::array_layout ,
- typename traits::execution_space ,
+ typename traits::device_type ,
typename traits::memory_traits > non_const_type ;
typedef View< typename traits::non_const_data_type ,
typename traits::array_layout ,
typename traits::host_mirror_space ,
void > HostMirror ;
//------------------------------------
// Shape
enum { Rank = traits::rank };
KOKKOS_INLINE_FUNCTION offset_map_type shape() const { return m_offset_map ; }
KOKKOS_INLINE_FUNCTION typename traits::size_type dimension_0() const { return m_offset_map.N0 ; }
KOKKOS_INLINE_FUNCTION typename traits::size_type dimension_1() const { return m_offset_map.N1 ; }
KOKKOS_INLINE_FUNCTION typename traits::size_type dimension_2() const { return m_offset_map.N2 ; }
KOKKOS_INLINE_FUNCTION typename traits::size_type dimension_3() const { return m_offset_map.N3 ; }
KOKKOS_INLINE_FUNCTION typename traits::size_type dimension_4() const { return m_offset_map.N4 ; }
KOKKOS_INLINE_FUNCTION typename traits::size_type dimension_5() const { return m_offset_map.N5 ; }
KOKKOS_INLINE_FUNCTION typename traits::size_type dimension_6() const { return m_offset_map.N6 ; }
KOKKOS_INLINE_FUNCTION typename traits::size_type dimension_7() const { return m_offset_map.N7 ; }
KOKKOS_INLINE_FUNCTION typename traits::size_type size() const { return m_offset_map.cardinality(); }
template< typename iType >
KOKKOS_INLINE_FUNCTION
typename traits::size_type dimension( const iType & i ) const
{ return Impl::dimension( m_offset_map , i ); }
//------------------------------------
// Destructor, constructors, assignment operators:
KOKKOS_INLINE_FUNCTION
- ~View()
- { m_management.decrement( m_ptr_on_device ); }
+ ~View() {}
KOKKOS_INLINE_FUNCTION
View()
- : m_ptr_on_device((typename traits::value_type*) NULL)
+ : m_ptr_on_device()
, m_offset_map()
, m_management()
+ , m_tracker()
{ m_offset_map.assign(0, 0,0,0,0,0,0,0,0); }
KOKKOS_INLINE_FUNCTION
View( const View & rhs )
- : m_ptr_on_device((typename traits::value_type*) NULL)
+ : m_ptr_on_device()
, m_offset_map()
, m_management()
+ , m_tracker()
{
(void) Impl::ViewAssignment<
typename traits::specialize ,
typename traits::specialize >( *this , rhs );
}
KOKKOS_INLINE_FUNCTION
View & operator = ( const View & rhs )
{
(void) Impl::ViewAssignment<
typename traits::specialize ,
typename traits::specialize >( *this , rhs );
return *this ;
}
//------------------------------------
// Construct or assign compatible view:
template< class RT , class RL , class RD , class RM , class RS >
KOKKOS_INLINE_FUNCTION
View( const View<RT,RL,RD,RM,RS> & rhs )
- : m_ptr_on_device((typename traits::value_type*) NULL)
+ : m_ptr_on_device()
, m_offset_map()
, m_management()
+ , m_tracker()
{
(void) Impl::ViewAssignment<
typename traits::specialize , RS >( *this , rhs );
}
template< class RT , class RL , class RD , class RM , class RS >
KOKKOS_INLINE_FUNCTION
View & operator = ( const View<RT,RL,RD,RM,RS> & rhs )
{
(void) Impl::ViewAssignment<
typename traits::specialize , RS >( *this , rhs );
return *this ;
}
//------------------------------------
/**\brief Allocation of a managed view with possible alignment padding.
*
* Allocation properties for allocating and initializing to the default value_type:
* Kokkos::ViewAllocate()
* Kokkos::ViewAllocate("label") OR "label"
* Kokkos::ViewAllocate(std::string("label")) OR std::string("label")
*
* Allocation properties for allocating and bypassing initialization:
* Kokkos::ViewAllocateWithoutInitializing()
* Kokkos::ViewAllocateWithoutInitializing("label")
*/
template< class AllocationProperties >
explicit inline
View( const AllocationProperties & prop ,
// Impl::ViewAllocProp::size_type exists when the traits and allocation properties
// are valid for allocating viewed memory.
const typename Impl::ViewAllocProp< traits , AllocationProperties >::size_type n0 = 0 ,
const size_t n1 = 0 ,
const size_t n2 = 0 ,
const size_t n3 = 0 ,
const size_t n4 = 0 ,
const size_t n5 = 0 ,
const size_t n6 = 0 ,
const size_t n7 = 0 ,
const size_t n8 = 0 )
- : m_ptr_on_device(0)
+ : m_ptr_on_device()
, m_offset_map()
, m_management()
+ , m_tracker()
{
typedef Impl::ViewAllocProp< traits , AllocationProperties > Alloc ;
+ static_assert(!std::is_same<typename traits::array_layout, LayoutStride>::value,
+ "LayoutStride does not support View constructor which takes dimensions directly!");
+
m_offset_map.assign( n0, n1, n2, n3, n4, n5, n6, n7, n8 );
if(Alloc::AllowPadding)
m_offset_map.set_padding();
- m_ptr_on_device = view_data_management::template allocate< Alloc::Initialize >( Alloc::label(prop) , m_offset_map );
+ m_ptr_on_device = view_data_management::template allocate< Alloc::Initialize >( Alloc::label(prop) , m_offset_map, m_tracker );
+
}
template< class AllocationProperties >
explicit inline
View( const AllocationProperties & prop ,
const typename traits::array_layout & layout ,
// Impl::ViewAllocProp::size_type exists when the traits and allocation properties
// are valid for allocating viewed memory.
const typename Impl::ViewAllocProp< traits , AllocationProperties >::size_type = 0 )
- : m_ptr_on_device(0)
+ : m_ptr_on_device()
, m_offset_map()
, m_management()
+ , m_tracker()
{
typedef Impl::ViewAllocProp< traits , AllocationProperties > Alloc ;
m_offset_map.assign( layout );
if(Alloc::AllowPadding)
m_offset_map.set_padding();
- m_ptr_on_device = view_data_management::template allocate< Alloc::Initialize >( Alloc::label(prop) , m_offset_map );
+ m_ptr_on_device = view_data_management::template allocate< Alloc::Initialize >( Alloc::label(prop) , m_offset_map, m_tracker );
m_management.set_noncontiguous();
}
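// Usage sketch (illustrative names, assuming the default execution space):
// the allocation-property constructors above are what the common
// label-plus-dimensions form resolves to.
//
//   Kokkos::View<double**> a( "A" , n0 , n1 );                                        // initialized
//   Kokkos::View<double**> b( Kokkos::ViewAllocateWithoutInitializing("B") , n0 , n1 ); // uninitialized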
//------------------------------------
// Assign an unmanaged View from a pointer; can be called in functors.
// No alignment padding is performed.
template< class Type >
explicit KOKKOS_INLINE_FUNCTION
View( Type * ptr ,
typename Impl::ViewRawPointerProp< traits , Type >::size_type n0 = 0 ,
const size_t n1 = 0 ,
const size_t n2 = 0 ,
const size_t n3 = 0 ,
const size_t n4 = 0 ,
const size_t n5 = 0 ,
const size_t n6 = 0 ,
const size_t n7 = 0 ,
const size_t n8 = 0 )
: m_ptr_on_device(ptr)
, m_offset_map()
, m_management()
+ , m_tracker()
{
m_offset_map.assign( n0, n1, n2, n3, n4, n5, n6, n7, n8 );
m_management.set_unmanaged();
}
template< class Type >
explicit KOKKOS_INLINE_FUNCTION
View( Type * ptr ,
typename traits::array_layout const & layout ,
typename Impl::ViewRawPointerProp< traits , Type >::size_type = 0 )
: m_ptr_on_device(ptr)
, m_offset_map()
, m_management()
+ , m_tracker()
{
m_offset_map.assign( layout );
m_management.set_unmanaged();
m_management.set_noncontiguous();
}
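// Usage sketch (illustrative): 'buf' is caller-owned memory that must outlive
// the view; no deallocation happens when the unmanaged view is destroyed.
//
//   double buf[100];
//   Kokkos::View< double* , Kokkos::LayoutRight , Kokkos::HostSpace ,
//                 Kokkos::MemoryUnmanaged > v( buf , 100 );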
+
+
+ //------------------------------------
+ // Assign a View from an AllocationTracker.
+ // The allocator used must be compatible with the memory space of the view.
+ // No alignment padding is performed.
+ // TODO: Should these allow padding??? DJS 01/15/15
+ explicit
+ View( Impl::AllocationTracker const &arg_tracker ,
+ const size_t n0 = 0 ,
+ const size_t n1 = 0 ,
+ const size_t n2 = 0 ,
+ const size_t n3 = 0 ,
+ const size_t n4 = 0 ,
+ const size_t n5 = 0 ,
+ const size_t n6 = 0 ,
+ const size_t n7 = 0 ,
+ const size_t n8 = 0 )
+ : m_ptr_on_device(reinterpret_cast<typename traits::value_type*>(arg_tracker.alloc_ptr()))
+ , m_offset_map()
+ , m_management()
+ , m_tracker(arg_tracker)
+ {
+ m_offset_map.assign( n0, n1, n2, n3, n4, n5, n6, n7, n8 );
+
+ const size_t req_size = m_offset_map.capacity() * sizeof(typename traits::value_type);
+ if ( m_tracker.alloc_size() < req_size ) {
+ Impl::throw_runtime_exception("Error: tracker.alloc_size() < req_size");
+ }
+ }
+
+ explicit
+ View( Impl::AllocationTracker const & arg_tracker
+ , typename traits::array_layout const & layout )
+ : m_ptr_on_device(reinterpret_cast<typename traits::value_type*>(arg_tracker.alloc_ptr()))
+ , m_offset_map()
+ , m_management()
+ , m_tracker(arg_tracker)
+ {
+ m_offset_map.assign( layout );
+
+ const size_t req_size = m_offset_map.capacity() * sizeof(typename traits::value_type);
+ if ( m_tracker.alloc_size() < req_size ) {
+ Impl::throw_runtime_exception("Error: tracker.alloc_size() < req_size");
+ }
+
+ m_management.set_noncontiguous();
+ }
+
//------------------------------------
/** \brief Constructors for subviews require the following
* type-compatibility condition, enforced via static assertion.
*
* Impl::is_same< View ,
* typename Impl::ViewSubview< View<D,A1,A2,A3,Impl::ViewDefault>
* , ArgType0 , ArgType1 , ArgType2 , ArgType3
* , ArgType4 , ArgType5 , ArgType6 , ArgType7
* >::type >::value
*/
template< class D , class A1 , class A2 , class A3
, class SubArg0_type , class SubArg1_type , class SubArg2_type , class SubArg3_type
, class SubArg4_type , class SubArg5_type , class SubArg6_type , class SubArg7_type
>
KOKKOS_INLINE_FUNCTION
View( const View<D,A1,A2,A3,Impl::ViewDefault> & src
, const SubArg0_type & arg0 , const SubArg1_type & arg1
, const SubArg2_type & arg2 , const SubArg3_type & arg3
, const SubArg4_type & arg4 , const SubArg5_type & arg5
, const SubArg6_type & arg6 , const SubArg7_type & arg7
);
template< class D , class A1 , class A2 , class A3
, class SubArg0_type , class SubArg1_type , class SubArg2_type , class SubArg3_type
, class SubArg4_type , class SubArg5_type , class SubArg6_type
>
KOKKOS_INLINE_FUNCTION
View( const View<D,A1,A2,A3,Impl::ViewDefault> & src
, const SubArg0_type & arg0 , const SubArg1_type & arg1
, const SubArg2_type & arg2 , const SubArg3_type & arg3
, const SubArg4_type & arg4 , const SubArg5_type & arg5
, const SubArg6_type & arg6
);
template< class D , class A1 , class A2 , class A3
, class SubArg0_type , class SubArg1_type , class SubArg2_type , class SubArg3_type
, class SubArg4_type , class SubArg5_type
>
KOKKOS_INLINE_FUNCTION
View( const View<D,A1,A2,A3,Impl::ViewDefault> & src
, const SubArg0_type & arg0 , const SubArg1_type & arg1
, const SubArg2_type & arg2 , const SubArg3_type & arg3
, const SubArg4_type & arg4 , const SubArg5_type & arg5
);
template< class D , class A1 , class A2 , class A3
, class SubArg0_type , class SubArg1_type , class SubArg2_type , class SubArg3_type
, class SubArg4_type
>
KOKKOS_INLINE_FUNCTION
View( const View<D,A1,A2,A3,Impl::ViewDefault> & src
, const SubArg0_type & arg0 , const SubArg1_type & arg1
, const SubArg2_type & arg2 , const SubArg3_type & arg3
, const SubArg4_type & arg4
);
template< class D , class A1 , class A2 , class A3
, class SubArg0_type , class SubArg1_type , class SubArg2_type , class SubArg3_type
>
KOKKOS_INLINE_FUNCTION
View( const View<D,A1,A2,A3,Impl::ViewDefault> & src
, const SubArg0_type & arg0 , const SubArg1_type & arg1
, const SubArg2_type & arg2 , const SubArg3_type & arg3
);
template< class D , class A1 , class A2 , class A3
, class SubArg0_type , class SubArg1_type , class SubArg2_type
>
KOKKOS_INLINE_FUNCTION
View( const View<D,A1,A2,A3,Impl::ViewDefault> & src
, const SubArg0_type & arg0 , const SubArg1_type & arg1
, const SubArg2_type & arg2
);
template< class D , class A1 , class A2 , class A3
, class SubArg0_type , class SubArg1_type
>
KOKKOS_INLINE_FUNCTION
View( const View<D,A1,A2,A3,Impl::ViewDefault> & src
, const SubArg0_type & arg0 , const SubArg1_type & arg1
);
template< class D , class A1 , class A2 , class A3
, class SubArg0_type
>
KOKKOS_INLINE_FUNCTION
View( const View<D,A1,A2,A3,Impl::ViewDefault> & src
, const SubArg0_type & arg0
);
//------------------------------------
// Assign an unmanaged View to a portion of the execution space's shared memory.
typedef Impl::if_c< ! traits::is_managed ,
const typename traits::execution_space::scratch_memory_space & ,
Impl::ViewError::device_shmem_constructor_requires_unmanaged >
if_scratch_memory_constructor ;
explicit KOKKOS_INLINE_FUNCTION
View( typename if_scratch_memory_constructor::type space ,
const unsigned n0 = 0 ,
const unsigned n1 = 0 ,
const unsigned n2 = 0 ,
const unsigned n3 = 0 ,
const unsigned n4 = 0 ,
const unsigned n5 = 0 ,
const unsigned n6 = 0 ,
const unsigned n7 = 0 )
- : m_ptr_on_device(0)
+ : m_ptr_on_device()
, m_offset_map()
, m_management()
+ , m_tracker()
{
typedef typename traits::value_type value_type_ ;
enum { align = 8 };
enum { mask = align - 1 };
m_offset_map.assign( n0, n1, n2, n3, n4, n5, n6, n7 );
typedef Impl::if_c< ! traits::is_managed ,
value_type_ * ,
Impl::ViewError::device_shmem_constructor_requires_unmanaged >
if_device_shmem_pointer ;
// Select the first argument:
m_ptr_on_device = if_device_shmem_pointer::select(
(value_type_*) space.get_shmem( unsigned( sizeof(value_type_) * m_offset_map.capacity() + unsigned(mask) ) & ~unsigned(mask) ) );
}
explicit KOKKOS_INLINE_FUNCTION
View( typename if_scratch_memory_constructor::type space ,
typename traits::array_layout const & layout)
- : m_ptr_on_device(0)
+ : m_ptr_on_device()
, m_offset_map()
, m_management()
+ , m_tracker()
{
typedef typename traits::value_type value_type_ ;
typedef Impl::if_c< ! traits::is_managed ,
value_type_ * ,
Impl::ViewError::device_shmem_constructor_requires_unmanaged >
if_device_shmem_pointer ;
m_offset_map.assign( layout );
m_management.set_unmanaged();
m_management.set_noncontiguous();
enum { align = 8 };
enum { mask = align - 1 };
// Select the first argument:
m_ptr_on_device = if_device_shmem_pointer::select(
(value_type_*) space.get_shmem( unsigned( sizeof(value_type_) * m_offset_map.capacity() + unsigned(mask) ) & ~unsigned(mask) ) );
}
static inline
unsigned shmem_size( const unsigned n0 = 0 ,
const unsigned n1 = 0 ,
const unsigned n2 = 0 ,
const unsigned n3 = 0 ,
const unsigned n4 = 0 ,
const unsigned n5 = 0 ,
const unsigned n6 = 0 ,
const unsigned n7 = 0 )
{
enum { align = 8 };
enum { mask = align - 1 };
typedef typename traits::value_type value_type_ ;
offset_map_type offset_map ;
offset_map.assign( n0, n1, n2, n3, n4, n5, n6, n7 );
return unsigned( sizeof(value_type_) * offset_map.capacity() + unsigned(mask) ) & ~unsigned(mask) ;
}
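// Usage sketch (illustrative): inside a TeamPolicy functor, 'team' is assumed
// to be the team member handle; the per-team scratch request is shmem_size(n).
//
//   typedef Kokkos::View< double* , execution_space::scratch_memory_space ,
//                         Kokkos::MemoryUnmanaged > shared_1d ;
//   shared_1d tmp( team.team_shmem() , n );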
//------------------------------------
// Is not allocated
KOKKOS_FORCEINLINE_FUNCTION
bool is_null() const { return 0 == ptr_on_device() ; }
//------------------------------------
// Operators for scalar (rank zero) views.
typedef Impl::if_c< traits::rank == 0 ,
typename traits::value_type ,
Impl::ViewError::scalar_operator_called_from_non_scalar_view >
if_scalar_operator ;
KOKKOS_INLINE_FUNCTION
const View & operator = ( const typename if_scalar_operator::type & rhs ) const
{
KOKKOS_RESTRICT_EXECUTION_TO_DATA( typename traits::memory_space , ptr_on_device() );
*m_ptr_on_device = if_scalar_operator::select( rhs );
return *this ;
}
KOKKOS_FORCEINLINE_FUNCTION
operator typename if_scalar_operator::type & () const
{
KOKKOS_RESTRICT_EXECUTION_TO_DATA( typename traits::memory_space , ptr_on_device() );
return if_scalar_operator::select( *m_ptr_on_device );
}
KOKKOS_FORCEINLINE_FUNCTION
typename if_scalar_operator::type & operator()() const
{
KOKKOS_RESTRICT_EXECUTION_TO_DATA( typename traits::memory_space , ptr_on_device() );
return if_scalar_operator::select( *m_ptr_on_device );
}
KOKKOS_FORCEINLINE_FUNCTION
typename if_scalar_operator::type & operator*() const
{
KOKKOS_RESTRICT_EXECUTION_TO_DATA( typename traits::memory_space , ptr_on_device() );
return if_scalar_operator::select( *m_ptr_on_device );
}
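// Usage sketch (illustrative, host-accessible rank-zero view): the operators
// above let the view behave like a single value.
//
//   Kokkos::View< double , Kokkos::HostSpace > s( "s" );
//   s() = 3.0 ;       // write through operator()
//   double x = s ;    // read through the conversion operator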
//------------------------------------
// Array member access operators are enabled if
// (1) a zero value of every argument type is compile-time comparable to zero,
// (2) the rank matches the number of arguments, and
// (3) the memory space is valid for the access.
//------------------------------------
// rank 1:
+ // Specialisation for LayoutLeft and LayoutRight since we know the stride is 1
+
+ template< typename iType0 >
+ KOKKOS_FORCEINLINE_FUNCTION
+ typename Impl::ViewEnableArrayOper< reference_type , traits, LayoutLeft, 1, iType0 >::type
+ operator[] ( const iType0 & i0 ) const
+ {
+ KOKKOS_ASSERT_SHAPE_BOUNDS_1( m_offset_map, i0 );
+ KOKKOS_RESTRICT_EXECUTION_TO_DATA( typename traits::memory_space , ptr_on_device() );
+
+ return m_ptr_on_device[ i0 ];
+ }
+
+ template< typename iType0 >
+ KOKKOS_FORCEINLINE_FUNCTION
+ typename Impl::ViewEnableArrayOper< reference_type , traits, LayoutLeft, 1, iType0 >::type
+ operator() ( const iType0 & i0 ) const
+ {
+ KOKKOS_ASSERT_SHAPE_BOUNDS_1( m_offset_map, i0 );
+ KOKKOS_RESTRICT_EXECUTION_TO_DATA( typename traits::memory_space , ptr_on_device() );
+
+ return m_ptr_on_device[ i0 ];
+ }
template< typename iType0 >
KOKKOS_FORCEINLINE_FUNCTION
- typename Impl::ViewEnableArrayOper< reference_type , traits, typename traits::array_layout, 1, iType0 >::type
+ typename Impl::ViewEnableArrayOper< reference_type , traits, LayoutLeft, 1, iType0 >::type
+ at( const iType0 & i0 , const int , const int , const int ,
+ const int , const int , const int , const int ) const
+ {
+ KOKKOS_ASSERT_SHAPE_BOUNDS_1( m_offset_map, i0 );
+ KOKKOS_RESTRICT_EXECUTION_TO_DATA( typename traits::memory_space , ptr_on_device() );
+
+ return m_ptr_on_device[ i0 ];
+ }
+
+ template< typename iType0 >
+ KOKKOS_FORCEINLINE_FUNCTION
+ typename Impl::ViewEnableArrayOper< reference_type , traits, LayoutRight, 1, iType0 >::type
operator[] ( const iType0 & i0 ) const
{
KOKKOS_ASSERT_SHAPE_BOUNDS_1( m_offset_map, i0 );
KOKKOS_RESTRICT_EXECUTION_TO_DATA( typename traits::memory_space , ptr_on_device() );
return m_ptr_on_device[ i0 ];
}
template< typename iType0 >
KOKKOS_FORCEINLINE_FUNCTION
- typename Impl::ViewEnableArrayOper< reference_type , traits, typename traits::array_layout, 1, iType0 >::type
+ typename Impl::ViewEnableArrayOper< reference_type , traits, LayoutRight, 1, iType0 >::type
operator() ( const iType0 & i0 ) const
{
KOKKOS_ASSERT_SHAPE_BOUNDS_1( m_offset_map, i0 );
KOKKOS_RESTRICT_EXECUTION_TO_DATA( typename traits::memory_space , ptr_on_device() );
return m_ptr_on_device[ i0 ];
}
template< typename iType0 >
KOKKOS_FORCEINLINE_FUNCTION
- typename Impl::ViewEnableArrayOper< reference_type , traits, typename traits::array_layout, 1, iType0 >::type
+ typename Impl::ViewEnableArrayOper< reference_type , traits, LayoutRight, 1, iType0 >::type
at( const iType0 & i0 , const int , const int , const int ,
const int , const int , const int , const int ) const
{
KOKKOS_ASSERT_SHAPE_BOUNDS_1( m_offset_map, i0 );
KOKKOS_RESTRICT_EXECUTION_TO_DATA( typename traits::memory_space , ptr_on_device() );
return m_ptr_on_device[ i0 ];
}
+ template< typename iType0 >
+ KOKKOS_FORCEINLINE_FUNCTION
+ typename Impl::ViewEnableArrayOper< reference_type , traits,
+ typename Impl::if_c<
+ Impl::is_same<typename traits::array_layout, LayoutRight>::value ||
+ Impl::is_same<typename traits::array_layout, LayoutLeft>::value ,
+ void, typename traits::array_layout>::type,
+ 1, iType0 >::type
+ operator[] ( const iType0 & i0 ) const
+ {
+ KOKKOS_ASSERT_SHAPE_BOUNDS_1( m_offset_map, i0 );
+ KOKKOS_RESTRICT_EXECUTION_TO_DATA( typename traits::memory_space , ptr_on_device() );
+
+ return m_ptr_on_device[ m_offset_map(i0) ];
+ }
+
+ template< typename iType0 >
+ KOKKOS_FORCEINLINE_FUNCTION
+ typename Impl::ViewEnableArrayOper< reference_type , traits,
+ typename Impl::if_c<
+ Impl::is_same<typename traits::array_layout, LayoutRight>::value ||
+ Impl::is_same<typename traits::array_layout, LayoutLeft>::value ,
+ void, typename traits::array_layout>::type,
+ 1, iType0 >::type
+ operator() ( const iType0 & i0 ) const
+ {
+ KOKKOS_ASSERT_SHAPE_BOUNDS_1( m_offset_map, i0 );
+ KOKKOS_RESTRICT_EXECUTION_TO_DATA( typename traits::memory_space , ptr_on_device() );
+
+ return m_ptr_on_device[ m_offset_map(i0) ];
+ }
+
+ template< typename iType0 >
+ KOKKOS_FORCEINLINE_FUNCTION
+ typename Impl::ViewEnableArrayOper< reference_type , traits,
+ typename Impl::if_c<
+ Impl::is_same<typename traits::array_layout, LayoutRight>::value ||
+ Impl::is_same<typename traits::array_layout, LayoutLeft>::value ,
+ void, typename traits::array_layout>::type,
+ 1, iType0 >::type
+ at( const iType0 & i0 , const int , const int , const int ,
+ const int , const int , const int , const int ) const
+ {
+ KOKKOS_ASSERT_SHAPE_BOUNDS_1( m_offset_map, i0 );
+ KOKKOS_RESTRICT_EXECUTION_TO_DATA( typename traits::memory_space , ptr_on_device() );
+
+ return m_ptr_on_device[ m_offset_map(i0) ];
+ }
+
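+ // Usage sketch (illustrative, host-accessible view): for rank-1 views the
+ // bracket and parenthesis operators are equivalent; the LayoutLeft/LayoutRight
+ // overloads above index the pointer directly, while the generic overloads go
+ // through m_offset_map.
+ //
+ //   Kokkos::View< double* , Kokkos::HostSpace > v( "v" , 100 );
+ //   v(7) = 1.0 ;
+ //   v[7] = 1.0 ;   // same element
+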
// rank 2:
template< typename iType0 , typename iType1 >
KOKKOS_FORCEINLINE_FUNCTION
typename Impl::ViewEnableArrayOper< reference_type ,
traits, typename traits::array_layout, 2, iType0, iType1 >::type
operator() ( const iType0 & i0 , const iType1 & i1 ) const
{
KOKKOS_ASSERT_SHAPE_BOUNDS_2( m_offset_map, i0,i1 );
KOKKOS_RESTRICT_EXECUTION_TO_DATA( typename traits::memory_space , ptr_on_device() );
return m_ptr_on_device[ m_offset_map(i0,i1) ];
}
template< typename iType0 , typename iType1 >
KOKKOS_FORCEINLINE_FUNCTION
typename Impl::ViewEnableArrayOper< reference_type ,
traits, typename traits::array_layout, 2, iType0, iType1 >::type
at( const iType0 & i0 , const iType1 & i1 , const int , const int ,
const int , const int , const int , const int ) const
{
KOKKOS_ASSERT_SHAPE_BOUNDS_2( m_offset_map, i0,i1 );
KOKKOS_RESTRICT_EXECUTION_TO_DATA( typename traits::memory_space , ptr_on_device() );
return m_ptr_on_device[ m_offset_map(i0,i1) ];
}
// rank 3:
template< typename iType0 , typename iType1 , typename iType2 >
KOKKOS_FORCEINLINE_FUNCTION
typename Impl::ViewEnableArrayOper< reference_type ,
traits, typename traits::array_layout, 3, iType0, iType1, iType2 >::type
operator() ( const iType0 & i0 , const iType1 & i1 , const iType2 & i2 ) const
{
KOKKOS_ASSERT_SHAPE_BOUNDS_3( m_offset_map, i0,i1,i2 );
KOKKOS_RESTRICT_EXECUTION_TO_DATA( typename traits::memory_space , ptr_on_device() );
return m_ptr_on_device[ m_offset_map(i0,i1,i2) ];
}
template< typename iType0 , typename iType1 , typename iType2 >
KOKKOS_FORCEINLINE_FUNCTION
typename Impl::ViewEnableArrayOper< reference_type ,
traits, typename traits::array_layout, 3, iType0, iType1, iType2 >::type
at( const iType0 & i0 , const iType1 & i1 , const iType2 & i2 , const int ,
const int , const int , const int , const int ) const
{
KOKKOS_ASSERT_SHAPE_BOUNDS_3( m_offset_map, i0,i1,i2 );
KOKKOS_RESTRICT_EXECUTION_TO_DATA( typename traits::memory_space , ptr_on_device() );
return m_ptr_on_device[ m_offset_map(i0,i1,i2) ];
}
// rank 4:
template< typename iType0 , typename iType1 , typename iType2 , typename iType3 >
KOKKOS_FORCEINLINE_FUNCTION
typename Impl::ViewEnableArrayOper< reference_type ,
traits, typename traits::array_layout, 4, iType0, iType1, iType2, iType3 >::type
operator() ( const iType0 & i0 , const iType1 & i1 , const iType2 & i2 , const iType3 & i3 ) const
{
KOKKOS_ASSERT_SHAPE_BOUNDS_4( m_offset_map, i0,i1,i2,i3 );
KOKKOS_RESTRICT_EXECUTION_TO_DATA( typename traits::memory_space , ptr_on_device() );
return m_ptr_on_device[ m_offset_map(i0,i1,i2,i3) ];
}
template< typename iType0 , typename iType1 , typename iType2 , typename iType3 >
KOKKOS_FORCEINLINE_FUNCTION
typename Impl::ViewEnableArrayOper< reference_type ,
traits, typename traits::array_layout, 4, iType0, iType1, iType2, iType3 >::type
at( const iType0 & i0 , const iType1 & i1 , const iType2 & i2 , const iType3 & i3 ,
const int , const int , const int , const int ) const
{
KOKKOS_ASSERT_SHAPE_BOUNDS_4( m_offset_map, i0,i1,i2,i3 );
KOKKOS_RESTRICT_EXECUTION_TO_DATA( typename traits::memory_space , ptr_on_device() );
return m_ptr_on_device[ m_offset_map(i0,i1,i2,i3) ];
}
// rank 5:
template< typename iType0 , typename iType1 , typename iType2 , typename iType3 ,
typename iType4 >
KOKKOS_FORCEINLINE_FUNCTION
typename Impl::ViewEnableArrayOper< reference_type ,
traits, typename traits::array_layout, 5, iType0, iType1, iType2, iType3 , iType4 >::type
operator() ( const iType0 & i0 , const iType1 & i1 , const iType2 & i2 , const iType3 & i3 ,
const iType4 & i4 ) const
{
KOKKOS_ASSERT_SHAPE_BOUNDS_5( m_offset_map, i0,i1,i2,i3,i4 );
KOKKOS_RESTRICT_EXECUTION_TO_DATA( typename traits::memory_space , ptr_on_device() );
return m_ptr_on_device[ m_offset_map(i0,i1,i2,i3,i4) ];
}
template< typename iType0 , typename iType1 , typename iType2 , typename iType3 ,
typename iType4 >
KOKKOS_FORCEINLINE_FUNCTION
typename Impl::ViewEnableArrayOper< reference_type ,
traits, typename traits::array_layout, 5, iType0, iType1, iType2, iType3 , iType4 >::type
at( const iType0 & i0 , const iType1 & i1 , const iType2 & i2 , const iType3 & i3 ,
const iType4 & i4 , const int , const int , const int ) const
{
KOKKOS_ASSERT_SHAPE_BOUNDS_5( m_offset_map, i0,i1,i2,i3,i4 );
KOKKOS_RESTRICT_EXECUTION_TO_DATA( typename traits::memory_space , ptr_on_device() );
return m_ptr_on_device[ m_offset_map(i0,i1,i2,i3,i4) ];
}
// rank 6:
template< typename iType0 , typename iType1 , typename iType2 , typename iType3 ,
typename iType4 , typename iType5 >
KOKKOS_FORCEINLINE_FUNCTION
typename Impl::ViewEnableArrayOper< reference_type ,
traits, typename traits::array_layout, 6,
iType0, iType1, iType2, iType3 , iType4, iType5 >::type
operator() ( const iType0 & i0 , const iType1 & i1 , const iType2 & i2 , const iType3 & i3 ,
const iType4 & i4 , const iType5 & i5 ) const
{
KOKKOS_ASSERT_SHAPE_BOUNDS_6( m_offset_map, i0,i1,i2,i3,i4,i5 );
KOKKOS_RESTRICT_EXECUTION_TO_DATA( typename traits::memory_space , ptr_on_device() );
return m_ptr_on_device[ m_offset_map(i0,i1,i2,i3,i4,i5) ];
}
template< typename iType0 , typename iType1 , typename iType2 , typename iType3 ,
typename iType4 , typename iType5 >
KOKKOS_FORCEINLINE_FUNCTION
typename Impl::ViewEnableArrayOper< reference_type ,
traits, typename traits::array_layout, 6,
iType0, iType1, iType2, iType3 , iType4, iType5 >::type
at( const iType0 & i0 , const iType1 & i1 , const iType2 & i2 , const iType3 & i3 ,
const iType4 & i4 , const iType5 & i5 , const int , const int ) const
{
KOKKOS_ASSERT_SHAPE_BOUNDS_6( m_offset_map, i0,i1,i2,i3,i4,i5 );
KOKKOS_RESTRICT_EXECUTION_TO_DATA( typename traits::memory_space , ptr_on_device() );
return m_ptr_on_device[ m_offset_map(i0,i1,i2,i3,i4,i5) ];
}
// rank 7:
template< typename iType0 , typename iType1 , typename iType2 , typename iType3 ,
typename iType4 , typename iType5 , typename iType6 >
KOKKOS_FORCEINLINE_FUNCTION
typename Impl::ViewEnableArrayOper< reference_type ,
traits, typename traits::array_layout, 7,
iType0, iType1, iType2, iType3 , iType4, iType5, iType6 >::type
operator() ( const iType0 & i0 , const iType1 & i1 , const iType2 & i2 , const iType3 & i3 ,
const iType4 & i4 , const iType5 & i5 , const iType6 & i6 ) const
{
KOKKOS_ASSERT_SHAPE_BOUNDS_7( m_offset_map, i0,i1,i2,i3,i4,i5,i6 );
KOKKOS_RESTRICT_EXECUTION_TO_DATA( typename traits::memory_space , ptr_on_device() );
return m_ptr_on_device[ m_offset_map(i0,i1,i2,i3,i4,i5,i6) ];
}
template< typename iType0 , typename iType1 , typename iType2 , typename iType3 ,
typename iType4 , typename iType5 , typename iType6 >
KOKKOS_FORCEINLINE_FUNCTION
typename Impl::ViewEnableArrayOper< reference_type ,
traits, typename traits::array_layout, 7,
iType0, iType1, iType2, iType3 , iType4, iType5, iType6 >::type
at( const iType0 & i0 , const iType1 & i1 , const iType2 & i2 , const iType3 & i3 ,
const iType4 & i4 , const iType5 & i5 , const iType6 & i6 , const int ) const
{
KOKKOS_ASSERT_SHAPE_BOUNDS_7( m_offset_map, i0,i1,i2,i3,i4,i5,i6 );
KOKKOS_RESTRICT_EXECUTION_TO_DATA( typename traits::memory_space , ptr_on_device() );
return m_ptr_on_device[ m_offset_map(i0,i1,i2,i3,i4,i5,i6) ];
}
// rank 8:
template< typename iType0 , typename iType1 , typename iType2 , typename iType3 ,
typename iType4 , typename iType5 , typename iType6 , typename iType7 >
KOKKOS_FORCEINLINE_FUNCTION
typename Impl::ViewEnableArrayOper< reference_type ,
traits, typename traits::array_layout, 8,
iType0, iType1, iType2, iType3 , iType4, iType5, iType6, iType7 >::type
operator() ( const iType0 & i0 , const iType1 & i1 , const iType2 & i2 , const iType3 & i3 ,
const iType4 & i4 , const iType5 & i5 , const iType6 & i6 , const iType7 & i7 ) const
{
KOKKOS_ASSERT_SHAPE_BOUNDS_8( m_offset_map, i0,i1,i2,i3,i4,i5,i6,i7 );
KOKKOS_RESTRICT_EXECUTION_TO_DATA( typename traits::memory_space , ptr_on_device() );
return m_ptr_on_device[ m_offset_map(i0,i1,i2,i3,i4,i5,i6,i7) ];
}
template< typename iType0 , typename iType1 , typename iType2 , typename iType3 ,
typename iType4 , typename iType5 , typename iType6 , typename iType7 >
KOKKOS_FORCEINLINE_FUNCTION
typename Impl::ViewEnableArrayOper< reference_type ,
traits, typename traits::array_layout, 8,
iType0, iType1, iType2, iType3 , iType4, iType5, iType6, iType7 >::type
at( const iType0 & i0 , const iType1 & i1 , const iType2 & i2 , const iType3 & i3 ,
const iType4 & i4 , const iType5 & i5 , const iType6 & i6 , const iType7 & i7 ) const
{
KOKKOS_ASSERT_SHAPE_BOUNDS_8( m_offset_map, i0,i1,i2,i3,i4,i5,i6,i7 );
KOKKOS_RESTRICT_EXECUTION_TO_DATA( typename traits::memory_space , ptr_on_device() );
return m_ptr_on_device[ m_offset_map(i0,i1,i2,i3,i4,i5,i6,i7) ];
}
//------------------------------------
// Access to the underlying contiguous storage of this view specialization.
// These methods are specific to specialization of a view.
KOKKOS_FORCEINLINE_FUNCTION
typename traits::value_type * ptr_on_device() const
{ return (typename traits::value_type *) m_ptr_on_device ; }
// Stride of physical storage, dimensioned to at least Rank
template< typename iType >
KOKKOS_INLINE_FUNCTION
void stride( iType * const s ) const
{ m_offset_map.stride(s); }
// Count of contiguously allocated data members including padding.
KOKKOS_INLINE_FUNCTION
typename traits::size_type capacity() const
{ return m_offset_map.capacity(); }
// Whether the view data can be treated (deep copied)
// as a contiguous block of memory.
KOKKOS_INLINE_FUNCTION
bool is_contiguous() const
{ return m_management.is_contiguous(); }
+
+ const Impl::AllocationTracker & tracker() const { return m_tracker; }
};
} /* namespace Kokkos */
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
template< class LT , class LL , class LD , class LM , class LS ,
class RT , class RL , class RD , class RM , class RS >
KOKKOS_INLINE_FUNCTION
typename Impl::enable_if<( Impl::is_same< LS , RS >::value ), bool >::type
operator == ( const View<LT,LL,LD,LM,LS> & lhs ,
const View<RT,RL,RD,RM,RS> & rhs )
{
// Same data, layout, dimensions
typedef ViewTraits<LT,LL,LD,LM> lhs_traits ;
typedef ViewTraits<RT,RL,RD,RM> rhs_traits ;
return
Impl::is_same< typename lhs_traits::const_data_type ,
typename rhs_traits::const_data_type >::value &&
Impl::is_same< typename lhs_traits::array_layout ,
typename rhs_traits::array_layout >::value &&
Impl::is_same< typename lhs_traits::memory_space ,
typename rhs_traits::memory_space >::value &&
Impl::is_same< typename lhs_traits::specialize ,
typename rhs_traits::specialize >::value &&
lhs.ptr_on_device() == rhs.ptr_on_device() &&
lhs.shape() == rhs.shape() ;
}
template< class LT , class LL , class LD , class LM , class LS ,
class RT , class RL , class RD , class RM , class RS >
KOKKOS_INLINE_FUNCTION
bool operator != ( const View<LT,LL,LD,LM,LS> & lhs ,
const View<RT,RL,RD,RM,RS> & rhs )
{
return ! operator==( lhs , rhs );
}
//----------------------------------------------------------------------------
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
//----------------------------------------------------------------------------
/** \brief Deep copy a value into a view.
*/
template< class DT , class DL , class DD , class DM , class DS >
inline
void deep_copy( const View<DT,DL,DD,DM,DS> & dst ,
typename Impl::enable_if<(
Impl::is_same< typename ViewTraits<DT,DL,DD,DM>::non_const_value_type ,
typename ViewTraits<DT,DL,DD,DM>::value_type >::value
), typename ViewTraits<DT,DL,DD,DM>::const_value_type >::type & value )
{
Impl::ViewFill< View<DT,DL,DD,DM,DS> >( dst , value );
}
template< class ST , class SL , class SD , class SM , class SS >
inline
typename Impl::enable_if<( ViewTraits<ST,SL,SD,SM>::rank == 0 )>::type
deep_copy( ST & dst , const View<ST,SL,SD,SM,SS> & src )
{
typedef ViewTraits<ST,SL,SD,SM> src_traits ;
typedef typename src_traits::memory_space src_memory_space ;
Impl::DeepCopy< HostSpace , src_memory_space >( & dst , src.ptr_on_device() , sizeof(ST) );
}
//----------------------------------------------------------------------------
/** \brief A deep copy between views of compatible type, and rank zero.
*/
template< class DT , class DL , class DD , class DM , class DS ,
class ST , class SL , class SD , class SM , class SS >
inline
void deep_copy( const View<DT,DL,DD,DM,DS> & dst ,
const View<ST,SL,SD,SM,SS> & src ,
typename Impl::enable_if<(
// Same type and destination is not constant:
Impl::is_same< typename View<DT,DL,DD,DM,DS>::value_type ,
typename View<ST,SL,SD,SM,SS>::non_const_value_type >::value
&&
// Rank zero:
( unsigned(View<DT,DL,DD,DM,DS>::rank) == unsigned(0) ) &&
( unsigned(View<ST,SL,SD,SM,SS>::rank) == unsigned(0) )
)>::type * = 0 )
{
typedef View<DT,DL,DD,DM,DS> dst_type ;
typedef View<ST,SL,SD,SM,SS> src_type ;
typedef typename dst_type::memory_space dst_memory_space ;
typedef typename src_type::memory_space src_memory_space ;
typedef typename src_type::value_type value_type ;
if ( dst.ptr_on_device() != src.ptr_on_device() ) {
Impl::DeepCopy< dst_memory_space , src_memory_space >( dst.ptr_on_device() , src.ptr_on_device() , sizeof(value_type) );
}
}
//----------------------------------------------------------------------------
/** \brief A deep copy between views of the default specialization, compatible type,
* same non-zero rank, same contiguous layout.
*/
template< class DT , class DL , class DD , class DM ,
class ST , class SL , class SD , class SM >
inline
void deep_copy( const View<DT,DL,DD,DM,Impl::ViewDefault> & dst ,
const View<ST,SL,SD,SM,Impl::ViewDefault> & src ,
typename Impl::enable_if<(
// Same type and destination is not constant:
Impl::is_same< typename View<DT,DL,DD,DM,Impl::ViewDefault>::value_type ,
typename View<ST,SL,SD,SM,Impl::ViewDefault>::non_const_value_type >::value
&&
// Same non-zero rank:
( unsigned(View<DT,DL,DD,DM,Impl::ViewDefault>::rank) ==
unsigned(View<ST,SL,SD,SM,Impl::ViewDefault>::rank) )
&&
( 0 < unsigned(View<DT,DL,DD,DM,Impl::ViewDefault>::rank) )
&&
// Same layout:
Impl::is_same< typename View<DT,DL,DD,DM,Impl::ViewDefault>::array_layout ,
typename View<ST,SL,SD,SM,Impl::ViewDefault>::array_layout >::value
)>::type * = 0 )
{
typedef View<DT,DL,DD,DM,Impl::ViewDefault> dst_type ;
typedef View<ST,SL,SD,SM,Impl::ViewDefault> src_type ;
typedef typename dst_type::memory_space dst_memory_space ;
typedef typename src_type::memory_space src_memory_space ;
enum { is_contiguous = // Contiguous (e.g., non-strided, non-tiled) layout
Impl::is_same< typename View<DT,DL,DD,DM,Impl::ViewDefault>::array_layout , LayoutLeft >::value ||
Impl::is_same< typename View<DT,DL,DD,DM,Impl::ViewDefault>::array_layout , LayoutRight >::value };
if ( dst.ptr_on_device() != src.ptr_on_device() ) {
// Same shape (dimensions)
- Impl::assert_shapes_are_equal( dst.shape() , src.shape() );
- if ( is_contiguous && dst.capacity() == src.capacity() ) {
+ const bool shapes_are_equal = dst.shape() == src.shape();
+
+ if ( shapes_are_equal && is_contiguous && dst.capacity() == src.capacity() ) {
// The views span contiguous ranges of equal length,
// so a straight memory copy can be performed over this range.
const size_t nbytes = sizeof(typename dst_type::value_type) * dst.capacity();
Impl::DeepCopy< dst_memory_space , src_memory_space >( dst.ptr_on_device() , src.ptr_on_device() , nbytes );
}
else {
// The destination view's execution space must be able to directly access the source
// memory space in order for the ViewRemap functor to run in the destination memory
// space's execution space.
- Impl::ViewRemap< dst_type , src_type >( dst , src );
+ size_t stride[8];
+ src.stride(stride);
+ size_t size_stride = stride[0]*src.dimension_0();
+ size_t size_dim = src.dimension_0();
+ for(int i = 1; i<src.rank; i++) {
+ if(stride[i]*src.dimension(i)>size_stride)
+ size_stride = stride[i]*src.dimension(i);
+ size_dim*=src.dimension(i);
+ }
+
+ if( shapes_are_equal && size_stride == size_dim) {
+ const size_t nbytes = sizeof(typename dst_type::value_type) * dst.capacity();
+
+ Impl::DeepCopy< dst_memory_space , src_memory_space >( dst.ptr_on_device() , src.ptr_on_device() , nbytes );
+ } else {
+ Impl::ViewRemap< dst_type , src_type >( dst , src );
+ }
}
}
}
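// Usage sketch (illustrative names): both forms resolve to the overloads in
// this header; when the views share layout and capacity the copy above
// degenerates to a single contiguous DeepCopy.
//
//   Kokkos::deep_copy( a , 0.0 );    // fill 'a' with a value
//   Kokkos::deep_copy( dst , src );  // copy 'src' into 'dst'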
/** \brief Deep copy equal dimension arrays in the same space which
* have different layouts or specializations.
*/
template< class DT , class DL , class DD , class DM , class DS ,
class ST , class SL , class SD , class SM , class SS >
inline
void deep_copy( const View< DT, DL, DD, DM, DS > & dst ,
const View< ST, SL, SD, SM, SS > & src ,
const typename Impl::enable_if<(
// Same type and destination is not constant:
Impl::is_same< typename View<DT,DL,DD,DM,DS>::value_type ,
typename View<DT,DL,DD,DM,DS>::non_const_value_type >::value
&&
// Source memory space is accessible to destination memory space
Impl::VerifyExecutionCanAccessMemorySpace< typename View<DT,DL,DD,DM,DS>::memory_space
, typename View<ST,SL,SD,SM,SS>::memory_space >::value
&&
// Same non-zero rank
( unsigned( View<DT,DL,DD,DM,DS>::rank ) ==
unsigned( View<ST,SL,SD,SM,SS>::rank ) )
&&
( 0 < unsigned( View<DT,DL,DD,DM,DS>::rank ) )
&&
// Different layout or different specialization:
( ( ! Impl::is_same< typename View<DT,DL,DD,DM,DS>::array_layout ,
typename View<ST,SL,SD,SM,SS>::array_layout >::value )
||
( ! Impl::is_same< DS , SS >::value )
)
)>::type * = 0 )
{
typedef View< DT, DL, DD, DM, DS > dst_type ;
typedef View< ST, SL, SD, SM, SS > src_type ;
assert_shapes_equal_dimension( dst.shape() , src.shape() );
Impl::ViewRemap< dst_type , src_type >( dst , src );
}
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
template< class T , class L , class D , class M , class S >
typename Impl::enable_if<(
- View<T,L,D,M,S>::is_managed
+ View<T,L,D,M,S>::is_managed &&
+ !Impl::is_same<L,LayoutStride>::value
), typename View<T,L,D,M,S>::HostMirror >::type
inline
create_mirror( const View<T,L,D,M,S> & src )
{
typedef View<T,L,D,M,S> view_type ;
typedef typename view_type::HostMirror host_view_type ;
- typedef typename view_type::memory_space memory_space ;
// 'view' is managed therefore we can allocate a
// compatible host_view through the ordinary constructor.
- std::string label = memory_space::query_label( src.ptr_on_device() );
+ std::string label = src.tracker().label();
label.append("_mirror");
return host_view_type( label ,
src.dimension_0() ,
src.dimension_1() ,
src.dimension_2() ,
src.dimension_3() ,
src.dimension_4() ,
src.dimension_5() ,
src.dimension_6() ,
src.dimension_7() );
}
+template< class T , class L , class D , class M , class S >
+typename Impl::enable_if<(
+ View<T,L,D,M,S>::is_managed &&
+ Impl::is_same<L,LayoutStride>::value
+ ), typename View<T,L,D,M,S>::HostMirror >::type
+inline
+create_mirror( const View<T,L,D,M,S> & src )
+{
+ typedef View<T,L,D,M,S> view_type ;
+ typedef typename view_type::HostMirror host_view_type ;
+
+ // 'view' is managed therefore we can allocate a
+ // compatible host_view through the ordinary constructor.
+
+ std::string label = src.tracker().label();
+ label.append("_mirror");
+ LayoutStride layout;
+ src.stride(layout.stride);
+ layout.dimension[0] = src.dimension_0();
+ layout.dimension[1] = src.dimension_1();
+ layout.dimension[2] = src.dimension_2();
+ layout.dimension[3] = src.dimension_3();
+ layout.dimension[4] = src.dimension_4();
+ layout.dimension[5] = src.dimension_5();
+ layout.dimension[6] = src.dimension_6();
+ layout.dimension[7] = src.dimension_7();
+
+ return host_view_type( label , layout );
+}
template< class T , class L , class D , class M , class S >
typename Impl::enable_if<(
View<T,L,D,M,S>::is_managed &&
Impl::ViewAssignable< typename View<T,L,D,M,S>::HostMirror , View<T,L,D,M,S> >::value
), typename View<T,L,D,M,S>::HostMirror >::type
inline
create_mirror_view( const View<T,L,D,M,S> & src )
{
return src ;
}
template< class T , class L , class D , class M , class S >
typename Impl::enable_if<(
View<T,L,D,M,S>::is_managed &&
! Impl::ViewAssignable< typename View<T,L,D,M,S>::HostMirror , View<T,L,D,M,S> >::value
), typename View<T,L,D,M,S>::HostMirror >::type
inline
create_mirror_view( const View<T,L,D,M,S> & src )
{
return create_mirror( src );
}
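// Usage sketch (illustrative): the usual host/device round trip. When 'd' is
// already host-accessible, create_mirror_view returns 'd' itself and the
// copies become no-ops on identical pointers.
//
//   Kokkos::View<double*> d( "d" , n );
//   Kokkos::View<double*>::HostMirror h = Kokkos::create_mirror_view( d );
//   Kokkos::deep_copy( h , d );   // device -> host
//   Kokkos::deep_copy( d , h );   // host -> device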
//----------------------------------------------------------------------------
/** \brief Resize a view, copying old data to new data at the corresponding indices. */
template< class T , class L , class D , class M , class S >
inline
void resize( View<T,L,D,M,S> & v ,
const typename Impl::enable_if< ViewTraits<T,L,D,M>::is_managed , size_t >::type n0 ,
const size_t n1 = 0 ,
const size_t n2 = 0 ,
const size_t n3 = 0 ,
const size_t n4 = 0 ,
const size_t n5 = 0 ,
const size_t n6 = 0 ,
const size_t n7 = 0 )
{
typedef View<T,L,D,M,S> view_type ;
- typedef typename view_type::memory_space memory_space ;
- const std::string label = memory_space::query_label( v.ptr_on_device() );
+ const std::string label = v.tracker().label();
view_type v_resized( label, n0, n1, n2, n3, n4, n5, n6, n7 );
Impl::ViewRemap< view_type , view_type >( v_resized , v );
v = v_resized ;
}
/** \brief Reallocate a view without copying old data to new data */
template< class T , class L , class D , class M , class S >
inline
void realloc( View<T,L,D,M,S> & v ,
const typename Impl::enable_if< ViewTraits<T,L,D,M>::is_managed , size_t >::type n0 ,
const size_t n1 = 0 ,
const size_t n2 = 0 ,
const size_t n3 = 0 ,
const size_t n4 = 0 ,
const size_t n5 = 0 ,
const size_t n6 = 0 ,
const size_t n7 = 0 )
{
typedef View<T,L,D,M,S> view_type ;
- typedef typename view_type::memory_space memory_space ;
// Query the current label and reuse it.
- const std::string label = memory_space::query_label( v.ptr_on_device() );
+ const std::string label = v.tracker().label();
v = view_type(); // deallocate first, if the only view to memory.
v = view_type( label, n0, n1, n2, n3, n4, n5, n6, n7 );
}
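// Usage sketch (illustrative): resize preserves existing entries at matching
// indices via ViewRemap, realloc discards them.
//
//   Kokkos::resize ( a , 2 * n0 );   // grow, keeping old data
//   Kokkos::realloc( b , n0 );       // fresh allocation, old data lost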
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
+/** \brief Tag denoting that a subview should capture all of a dimension */
struct ALL { KOKKOS_INLINE_FUNCTION ALL(){} };
-template< class DstViewType ,
- class T , class L , class D , class M , class S ,
- class ArgType0 >
-KOKKOS_INLINE_FUNCTION
-DstViewType
-subview( const View<T,L,D,M,S> & src ,
- const ArgType0 & arg0 )
-{
- DstViewType dst ;
-
- Impl::ViewAssignment<typename DstViewType::specialize,S>( dst , src , arg0 );
-
- return dst ;
-}
-
-template< class DstViewType ,
- class T , class L , class D , class M , class S ,
- class ArgType0 , class ArgType1 >
-KOKKOS_INLINE_FUNCTION
-DstViewType
-subview( const View<T,L,D,M,S> & src ,
- const ArgType0 & arg0 ,
- const ArgType1 & arg1 )
-{
- DstViewType dst ;
-
- Impl::ViewAssignment<typename DstViewType::specialize,S>( dst, src, arg0, arg1 );
-
- return dst ;
-}
-
-template< class DstViewType ,
- class T , class L , class D , class M , class S ,
- class ArgType0 , class ArgType1 , class ArgType2 >
-KOKKOS_INLINE_FUNCTION
-DstViewType
-subview( const View<T,L,D,M,S> & src ,
- const ArgType0 & arg0 ,
- const ArgType1 & arg1 ,
- const ArgType2 & arg2 )
-{
- DstViewType dst ;
-
- Impl::ViewAssignment<typename DstViewType::specialize,S>( dst, src, arg0, arg1, arg2 );
-
- return dst ;
-}
-
-template< class DstViewType ,
- class T , class L , class D , class M , class S ,
- class ArgType0 , class ArgType1 , class ArgType2 , class ArgType3 >
-KOKKOS_INLINE_FUNCTION
-DstViewType
-subview( const View<T,L,D,M,S> & src ,
- const ArgType0 & arg0 ,
- const ArgType1 & arg1 ,
- const ArgType2 & arg2 ,
- const ArgType3 & arg3 )
-{
- DstViewType dst ;
-
- Impl::ViewAssignment<typename DstViewType::specialize,S>( dst, src, arg0, arg1, arg2, arg3 );
-
- return dst ;
-}
-
-template< class DstViewType ,
- class T , class L , class D , class M , class S ,
- class ArgType0 , class ArgType1 , class ArgType2 , class ArgType3 ,
- class ArgType4 >
-KOKKOS_INLINE_FUNCTION
-DstViewType
-subview( const View<T,L,D,M,S> & src ,
- const ArgType0 & arg0 ,
- const ArgType1 & arg1 ,
- const ArgType2 & arg2 ,
- const ArgType3 & arg3 ,
- const ArgType4 & arg4 )
-{
- DstViewType dst ;
-
- Impl::ViewAssignment<typename DstViewType::specialize,S>( dst, src, arg0, arg1, arg2, arg3, arg4 );
-
- return dst ;
-}
-
-template< class DstViewType ,
- class T , class L , class D , class M , class S ,
- class ArgType0 , class ArgType1 , class ArgType2 , class ArgType3 ,
- class ArgType4 , class ArgType5 >
-KOKKOS_INLINE_FUNCTION
-DstViewType
-subview( const View<T,L,D,M,S> & src ,
- const ArgType0 & arg0 ,
- const ArgType1 & arg1 ,
- const ArgType2 & arg2 ,
- const ArgType3 & arg3 ,
- const ArgType4 & arg4 ,
- const ArgType5 & arg5 )
-{
- DstViewType dst ;
-
- Impl::ViewAssignment<typename DstViewType::specialize,S>( dst, src, arg0, arg1, arg2, arg3, arg4, arg5 );
-
- return dst ;
-}
-
-template< class DstViewType ,
- class T , class L , class D , class M , class S ,
- class ArgType0 , class ArgType1 , class ArgType2 , class ArgType3 ,
- class ArgType4 , class ArgType5 , class ArgType6 >
-KOKKOS_INLINE_FUNCTION
-DstViewType
-subview( const View<T,L,D,M,S> & src ,
- const ArgType0 & arg0 ,
- const ArgType1 & arg1 ,
- const ArgType2 & arg2 ,
- const ArgType3 & arg3 ,
- const ArgType4 & arg4 ,
- const ArgType5 & arg5 ,
- const ArgType6 & arg6 )
-{
- DstViewType dst ;
-
- Impl::ViewAssignment<typename DstViewType::specialize,S>( dst, src, arg0, arg1, arg2, arg3, arg4, arg5, arg6 );
-
- return dst ;
-}
-
-template< class DstViewType ,
- class T , class L , class D , class M , class S ,
- class ArgType0 , class ArgType1 , class ArgType2 , class ArgType3 ,
- class ArgType4 , class ArgType5 , class ArgType6 , class ArgType7 >
-KOKKOS_INLINE_FUNCTION
-DstViewType
-subview( const View<T,L,D,M,S> & src ,
- const ArgType0 & arg0 ,
- const ArgType1 & arg1 ,
- const ArgType2 & arg2 ,
- const ArgType3 & arg3 ,
- const ArgType4 & arg4 ,
- const ArgType5 & arg5 ,
- const ArgType6 & arg6 ,
- const ArgType7 & arg7 )
-{
- DstViewType dst ;
-
- Impl::ViewAssignment<typename DstViewType::specialize,S>( dst, src, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7 );
-
- return dst ;
-}
-
-} // namespace Kokkos
-
-//----------------------------------------------------------------------------
-
-namespace Kokkos {
-
template< class D , class A1 , class A2 , class A3 , class S ,
class ArgType0 , class ArgType1 , class ArgType2 , class ArgType3 ,
class ArgType4 , class ArgType5 , class ArgType6 , class ArgType7 >
KOKKOS_INLINE_FUNCTION
typename Impl::ViewSubview< View<D,A1,A2,A3,S>
, ArgType0 , ArgType1 , ArgType2 , ArgType3
, ArgType4 , ArgType5 , ArgType6 , ArgType7
>::type
subview( const View<D,A1,A2,A3,S> & src ,
const ArgType0 & arg0 ,
const ArgType1 & arg1 ,
const ArgType2 & arg2 ,
const ArgType3 & arg3 ,
const ArgType4 & arg4 ,
const ArgType5 & arg5 ,
const ArgType6 & arg6 ,
const ArgType7 & arg7 )
{
typedef typename
Impl::ViewSubview< View<D,A1,A2,A3,S>
, ArgType0 , ArgType1 , ArgType2 , ArgType3
, ArgType4 , ArgType5 , ArgType6 , ArgType7
>::type
DstViewType ;
return DstViewType( src, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7 );
}
template< class D , class A1 , class A2 , class A3 , class S ,
class ArgType0 , class ArgType1 , class ArgType2 , class ArgType3 ,
class ArgType4 , class ArgType5 , class ArgType6 >
KOKKOS_INLINE_FUNCTION
typename Impl::ViewSubview< View<D,A1,A2,A3,S>
, ArgType0 , ArgType1 , ArgType2 , ArgType3
, ArgType4 , ArgType5 , ArgType6 , void
>::type
subview( const View<D,A1,A2,A3,S> & src ,
const ArgType0 & arg0 ,
const ArgType1 & arg1 ,
const ArgType2 & arg2 ,
const ArgType3 & arg3 ,
const ArgType4 & arg4 ,
const ArgType5 & arg5 ,
const ArgType6 & arg6 )
{
typedef typename
Impl::ViewSubview< View<D,A1,A2,A3,S>
, ArgType0 , ArgType1 , ArgType2 , ArgType3
, ArgType4 , ArgType5 , ArgType6 , void
>::type
DstViewType ;
return DstViewType( src, arg0, arg1, arg2, arg3, arg4, arg5, arg6 );
}
template< class D , class A1 , class A2 , class A3 , class S ,
class ArgType0 , class ArgType1 , class ArgType2 , class ArgType3 ,
class ArgType4 , class ArgType5 >
KOKKOS_INLINE_FUNCTION
typename Impl::ViewSubview< View<D,A1,A2,A3,S>
, ArgType0 , ArgType1 , ArgType2 , ArgType3
, ArgType4 , ArgType5 , void , void
>::type
subview( const View<D,A1,A2,A3,S> & src ,
const ArgType0 & arg0 ,
const ArgType1 & arg1 ,
const ArgType2 & arg2 ,
const ArgType3 & arg3 ,
const ArgType4 & arg4 ,
const ArgType5 & arg5 )
{
typedef typename
Impl::ViewSubview< View<D,A1,A2,A3,S>
, ArgType0 , ArgType1 , ArgType2 , ArgType3
, ArgType4 , ArgType5 , void , void
>::type
DstViewType ;
return DstViewType( src, arg0, arg1, arg2, arg3, arg4, arg5 );
}
template< class D , class A1 , class A2 , class A3 , class S ,
class ArgType0 , class ArgType1 , class ArgType2 , class ArgType3 ,
class ArgType4 >
KOKKOS_INLINE_FUNCTION
typename Impl::ViewSubview< View<D,A1,A2,A3,S>
, ArgType0 , ArgType1 , ArgType2 , ArgType3
, ArgType4 , void , void , void
>::type
subview( const View<D,A1,A2,A3,S> & src ,
const ArgType0 & arg0 ,
const ArgType1 & arg1 ,
const ArgType2 & arg2 ,
const ArgType3 & arg3 ,
const ArgType4 & arg4 )
{
typedef typename
Impl::ViewSubview< View<D,A1,A2,A3,S>
, ArgType0 , ArgType1 , ArgType2 , ArgType3
, ArgType4 , void , void , void
>::type
DstViewType ;
return DstViewType( src, arg0, arg1, arg2, arg3, arg4 );
}
template< class D , class A1 , class A2 , class A3 , class S ,
class ArgType0 , class ArgType1 , class ArgType2 , class ArgType3 >
KOKKOS_INLINE_FUNCTION
typename Impl::ViewSubview< View<D,A1,A2,A3,S>
, ArgType0 , ArgType1 , ArgType2 , ArgType3
, void , void , void , void
>::type
subview( const View<D,A1,A2,A3,S> & src ,
const ArgType0 & arg0 ,
const ArgType1 & arg1 ,
const ArgType2 & arg2 ,
const ArgType3 & arg3 )
{
typedef typename
Impl::ViewSubview< View<D,A1,A2,A3,S>
, ArgType0 , ArgType1 , ArgType2 , ArgType3
, void , void , void , void
>::type
DstViewType ;
return DstViewType( src, arg0, arg1, arg2, arg3 );
}
template< class D , class A1 , class A2 , class A3 , class S ,
class ArgType0 , class ArgType1 , class ArgType2 >
KOKKOS_INLINE_FUNCTION
typename Impl::ViewSubview< View<D,A1,A2,A3,S>
, ArgType0 , ArgType1 , ArgType2 , void
, void , void , void , void
>::type
subview( const View<D,A1,A2,A3,S> & src ,
const ArgType0 & arg0 ,
const ArgType1 & arg1 ,
const ArgType2 & arg2 )
{
typedef typename
Impl::ViewSubview< View<D,A1,A2,A3,S>
, ArgType0 , ArgType1 , ArgType2 , void
, void , void , void , void
>::type
DstViewType ;
return DstViewType( src, arg0, arg1, arg2 );
}
template< class D , class A1 , class A2 , class A3 , class S ,
class ArgType0 , class ArgType1 >
KOKKOS_INLINE_FUNCTION
typename Impl::ViewSubview< View<D,A1,A2,A3,S>
, ArgType0 , ArgType1 , void , void
, void , void , void , void
>::type
subview( const View<D,A1,A2,A3,S> & src ,
const ArgType0 & arg0 ,
const ArgType1 & arg1 )
{
typedef typename
Impl::ViewSubview< View<D,A1,A2,A3,S>
, ArgType0 , ArgType1 , void , void
, void , void , void , void
>::type
DstViewType ;
return DstViewType( src, arg0, arg1 );
}
template< class D , class A1 , class A2 , class A3 , class S ,
class ArgType0 >
KOKKOS_INLINE_FUNCTION
typename Impl::ViewSubview< View<D,A1,A2,A3,S>
, ArgType0 , void , void , void
, void , void , void , void
>::type
subview( const View<D,A1,A2,A3,S> & src ,
const ArgType0 & arg0 )
{
typedef typename
Impl::ViewSubview< View<D,A1,A2,A3,S>
, ArgType0 , void , void , void
, void , void , void , void
>::type
DstViewType ;
return DstViewType( src, arg0 );
}
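// Usage sketch (illustrative): ALL() keeps an entire dimension and a
// std::pair selects a half-open index range; the result type is the
// Impl::ViewSubview deduction above, so 'auto' is the convenient spelling.
//
//   Kokkos::View<double**> a( "A" , N , M );
//   auto col0 = Kokkos::subview( a , Kokkos::ALL() , 0 );
//   auto rows = Kokkos::subview( a , std::make_pair( 2 , 5 ) , Kokkos::ALL() );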
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
#include <impl/Kokkos_ViewDefault.hpp>
#include <impl/Kokkos_Atomic_View.hpp>
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
+#else
+
+#include <impl/Kokkos_ViewOffset.hpp>
+#include <impl/Kokkos_ViewSupport.hpp>
+
+#endif /* #if defined( KOKKOS_USING_EXPERIMENTAL_VIEW ) */
+
+#include <KokkosExp_View.hpp>
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
#endif
diff --git a/lib/kokkos/core/src/Kokkos_hwloc.hpp b/lib/kokkos/core/src/Kokkos_hwloc.hpp
index 6b8aea148..a0b007f64 100755
--- a/lib/kokkos/core/src/Kokkos_hwloc.hpp
+++ b/lib/kokkos/core/src/Kokkos_hwloc.hpp
@@ -1,140 +1,140 @@
/*
//@HEADER
// ************************************************************************
//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_HWLOC_HPP
#define KOKKOS_HWLOC_HPP
#include <utility>
namespace Kokkos {
/** \brief Minimal subset of logical 'hwloc' functionality available
* from http://www.open-mpi.org/projects/hwloc/.
*
* The calls are NOT thread safe in order to avoid mutexes,
* memory allocations, or other actions which could give the
* runtime system an opportunity to migrate the threads or
* touch allocated memory during the function calls.
*
* All calls to these functions should be performed by a thread
* when it has guaranteed exclusive access; e.g., for OpenMP
* within a 'critical' region.
*/
namespace hwloc {
/** \brief Query if hwloc is available */
bool available();
/** \brief Query number of available NUMA regions.
* This will be less than the hardware capacity
* if the MPI process is pinned to a NUMA region.
*/
unsigned get_available_numa_count();
/** \brief Query number of available cores per NUMA regions.
* This will be less than the hardware capacity
* if the MPI process is pinned to a set of cores.
*/
unsigned get_available_cores_per_numa();
/** \brief Query number of available "hard" threads per core; i.e., hyperthreads */
unsigned get_available_threads_per_core();
} /* namespace hwloc */
} /* namespace Kokkos */
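// Usage sketch (illustrative): query the process-visible topology before
// choosing a thread count; each count reflects the process binding, not
// necessarily the full hardware.
//
//   if ( Kokkos::hwloc::available() ) {
//     const unsigned nthreads = Kokkos::hwloc::get_available_numa_count()
//                             * Kokkos::hwloc::get_available_cores_per_numa()
//                             * Kokkos::hwloc::get_available_threads_per_core();
//   }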
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
// Internal functions for binding persistent spawned threads.
namespace Kokkos {
namespace hwloc {
/** \brief Recommend mapping of threads onto cores.
*
* If thread_count == 0 then choose and set a value.
* If use_numa_count == 0 then choose and set a value.
* If use_cores_per_numa == 0 then choose and set a value.
*
* Return 0 if asynchronous;
* return 1 if synchronous and threads_coord[0] is the process core.
*/
unsigned thread_mapping( const char * const label ,
const bool allow_async ,
unsigned & thread_count ,
unsigned & use_numa_count ,
unsigned & use_cores_per_numa ,
std::pair<unsigned,unsigned> threads_coord[] );
/** \brief Query core-coordinate of the current thread
* with respect to the core_topology.
*
* As long as the thread is running within the
* process binding the following condition holds.
*
* core_coordinate.first < core_topology.first
* core_coordinate.second < core_topology.second
*/
std::pair<unsigned,unsigned> get_this_thread_coordinate();
/** \brief Bind the current thread to a core. */
bool bind_this_thread( const std::pair<unsigned,unsigned> );
/** \brief Bind the current thread to one of the cores in the list.
* Set that entry to (~0,~0) and return the index.
* If binding fails return ~0.
*/
unsigned bind_this_thread( const unsigned coordinate_count ,
std::pair<unsigned,unsigned> coordinate[] );
/** \brief Unbind the current thread back to the original process binding */
bool unbind_this_thread();
} /* namespace hwloc */
} /* namespace Kokkos */
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
#endif /* #define KOKKOS_HWLOC_HPP */
diff --git a/lib/kokkos/core/src/Makefile b/lib/kokkos/core/src/Makefile
new file mode 100755
index 000000000..24d8e465f
--- /dev/null
+++ b/lib/kokkos/core/src/Makefile
@@ -0,0 +1,118 @@
+KOKKOS_PATH = ../..
+
+PREFIX ?= /usr/local/lib/kokkos
+
+default: messages build-lib
+ echo "End Build"
+
+
+include $(KOKKOS_PATH)/Makefile.kokkos
+
+ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
+ CXX = nvcc_wrapper
+ CXXFLAGS ?= -O3
+ LINK = nvcc_wrapper
+ LINKFLAGS ?=
+else
+ CXX ?= g++
+ CXXFLAGS ?= -O3
+ LINK ?= g++
+ LINKFLAGS ?=
+endif
+
+PWD = $(shell pwd)
+
+KOKKOS_HEADERS_INCLUDE = $(wildcard $(KOKKOS_PATH)/core/src/*.hpp)
+KOKKOS_HEADERS_INCLUDE_IMPL = $(wildcard $(KOKKOS_PATH)/core/src/impl/*.hpp)
+KOKKOS_HEADERS_INCLUDE += $(wildcard $(KOKKOS_PATH)/containers/src/*.hpp)
+KOKKOS_HEADERS_INCLUDE_IMPL += $(wildcard $(KOKKOS_PATH)/containers/src/impl/*.hpp)
+KOKKOS_HEADERS_INCLUDE += $(wildcard $(KOKKOS_PATH)/algorithms/src/*.hpp)
+
+CONDITIONAL_COPIES =
+
+ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
+ KOKKOS_HEADERS_CUDA += $(wildcard $(KOKKOS_PATH)/core/src/Cuda/*.hpp)
+ CONDITIONAL_COPIES += copy-cuda
+endif
+
+ifeq ($(KOKKOS_INTERNAL_USE_PTHREADS), 1)
+ KOKKOS_HEADERS_THREADS += $(wildcard $(KOKKOS_PATH)/core/src/Threads/*.hpp)
+ CONDITIONAL_COPIES += copy-threads
+endif
+
+ifeq ($(KOKKOS_INTERNAL_USE_OPENMP), 1)
+ KOKKOS_HEADERS_OPENMP += $(wildcard $(KOKKOS_PATH)/core/src/OpenMP/*.hpp)
+ CONDITIONAL_COPIES += copy-openmp
+endif
+
+messages:
+ echo "Start Build"
+
+build-makefile-kokkos:
+ rm -f Makefile.kokkos
+ echo "#Global Settings used to generate this library" >> Makefile.kokkos
+ echo "KOKKOS_PATH = $(PREFIX)" >> Makefile.kokkos
+ echo "KOKKOS_DEVICES = $(KOKKOS_DEVICES)" >> Makefile.kokkos
+ echo "KOKKOS_ARCH = $(KOKKOS_ARCH)" >> Makefile.kokkos
+ echo "KOKKOS_DEBUG = $(KOKKOS_DEBUG)" >> Makefile.kokkos
+ echo "KOKKOS_USE_TPLS = $(KOKKOS_USE_TPLS)" >> Makefile.kokkos
+ echo "KOKKOS_CXX_STANDARD = $(KOKKOS_CXX_STANDARD)" >> Makefile.kokkos
+ echo "KOKKOS_CUDA_OPTIONS = $(KOKKOS_CUDA_OPTIONS)" >> Makefile.kokkos
+ echo "CXX ?= $(CXX)" >> Makefile.kokkos
+ echo "" >> Makefile.kokkos
+ echo "#Source and Header files of Kokkos relative to KOKKOS_PATH" >> Makefile.kokkos
+ echo "KOKKOS_HEADERS = $(KOKKOS_HEADERS)" >> Makefile.kokkos
+ echo "KOKKOS_SRC = $(KOKKOS_SRC)" >> Makefile.kokkos
+ echo "" >> Makefile.kokkos
+ echo "#Variables used in application Makefiles" >> Makefile.kokkos
+ echo "KOKKOS_CPP_DEPENDS = $(KOKKOS_CPP_DEPENDS)" >> Makefile.kokkos
+ echo "KOKKOS_CXXFLAGS = $(KOKKOS_CXXFLAGS)" >> Makefile.kokkos
+ echo "KOKKOS_CPPFLAGS = $(KOKKOS_CPPFLAGS)" >> Makefile.kokkos
+ echo "KOKKOS_LINK_DEPENDS = $(KOKKOS_LINK_DEPENDS)" >> Makefile.kokkos
+ echo "KOKKOS_LIBS = $(KOKKOS_LIBS)" >> Makefile.kokkos
+ echo "KOKKOS_LDFLAGS = $(KOKKOS_LDFLAGS)" >> Makefile.kokkos
+ sed \
+ -e 's|$(KOKKOS_PATH)/core/src|$(PREFIX)/include|g' \
+ -e 's|$(KOKKOS_PATH)/containers/src|$(PREFIX)/include|g' \
+ -e 's|$(KOKKOS_PATH)/algorithms/src|$(PREFIX)/include|g' \
+ -e 's|-L$(PWD)|-L$(PREFIX)/lib|g' \
+ -e 's|= libkokkos.a|= $(PREFIX)/lib/libkokkos.a|g' \
+ -e 's|= KokkosCore_config.h|= $(PREFIX)/include/KokkosCore_config.h|g' Makefile.kokkos \
+ > Makefile.kokkos.tmp
+ mv -f Makefile.kokkos.tmp Makefile.kokkos
+
+build-lib: build-makefile-kokkos $(KOKKOS_LINK_DEPENDS)
+
+mkdir:
+ mkdir -p $(PREFIX)
+ mkdir -p $(PREFIX)/include
+ mkdir -p $(PREFIX)/lib
+ mkdir -p $(PREFIX)/include/impl
+
+copy-cuda: mkdir
+ mkdir -p $(PREFIX)/include/Cuda
+ cp $(KOKKOS_HEADERS_CUDA) $(PREFIX)/include/Cuda
+
+copy-threads: mkdir
+ mkdir -p $(PREFIX)/include/Threads
+ cp $(KOKKOS_HEADERS_THREADS) $(PREFIX)/include/Threads
+
+copy-openmp: mkdir
+ mkdir -p $(PREFIX)/include/OpenMP
+ cp $(KOKKOS_HEADERS_OPENMP) $(PREFIX)/include/OpenMP
+
+install: mkdir $(CONDITIONAL_COPIES) build-lib
+ cp $(KOKKOS_HEADERS_INCLUDE) $(PREFIX)/include
+ cp $(KOKKOS_HEADERS_INCLUDE_IMPL) $(PREFIX)/include/impl
+ cp Makefile.kokkos $(PREFIX)
+ cp libkokkos.a $(PREFIX)/lib
+ cp KokkosCore_config.h $(PREFIX)/include
+
+
+
+clean: kokkos-clean
+ rm Makefile.kokkos
+
+
+
+
diff --git a/lib/kokkos/core/src/OpenMP/Kokkos_OpenMP_Parallel.hpp b/lib/kokkos/core/src/OpenMP/Kokkos_OpenMP_Parallel.hpp
index 7c338e585..f8393611e 100755
--- a/lib/kokkos/core/src/OpenMP/Kokkos_OpenMP_Parallel.hpp
+++ b/lib/kokkos/core/src/OpenMP/Kokkos_OpenMP_Parallel.hpp
@@ -1,496 +1,496 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_OPENMP_PARALLEL_HPP
#define KOKKOS_OPENMP_PARALLEL_HPP
#include <omp.h>
#include <Kokkos_Parallel.hpp>
#include <OpenMP/Kokkos_OpenMPexec.hpp>
#include <impl/Kokkos_FunctorAdapter.hpp>
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
template< class FunctorType , class Arg0 , class Arg1 , class Arg2 >
class ParallelFor< FunctorType , Kokkos::RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::OpenMP > >
{
private:
typedef Kokkos::RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::OpenMP > Policy ;
template< class PType >
KOKKOS_FORCEINLINE_FUNCTION static
void driver( typename Impl::enable_if< Impl::is_same< typename PType::work_tag , void >::value ,
const FunctorType & >::type functor
, const PType & range )
{
const typename PType::member_type work_end = range.end();
for ( typename PType::member_type iwork = range.begin() ; iwork < work_end ; ++iwork ) {
functor( iwork );
}
}
template< class PType >
KOKKOS_FORCEINLINE_FUNCTION static
void driver( typename Impl::enable_if< ! Impl::is_same< typename PType::work_tag , void >::value ,
const FunctorType & >::type functor
, const PType & range )
{
const typename PType::member_type work_end = range.end();
for ( typename PType::member_type iwork = range.begin() ; iwork < work_end ; ++iwork ) {
functor( typename PType::work_tag() , iwork );
}
}
public:
inline
ParallelFor( const FunctorType & functor
, const Policy & policy )
{
OpenMPexec::verify_is_process("Kokkos::OpenMP parallel_for");
OpenMPexec::verify_initialized("Kokkos::OpenMP parallel_for");
#pragma omp parallel
{
OpenMPexec & exec = * OpenMPexec::get_thread_omp();
driver( functor , typename Policy::WorkRange( policy , exec.pool_rank() , exec.pool_size() ) );
}
/* END #pragma omp parallel */
}
};
} // namespace Impl
} // namespace Kokkos
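A minimal usage sketch of the specialization above (hypothetical functor; assumes the OpenMP backend has been initialized). Each pool thread receives a WorkRange slice of the index range and invokes the functor for every index in its slice:

  struct scale_functor {
    Kokkos::View<double*,Kokkos::OpenMP> x ;
    double a ;
    KOKKOS_INLINE_FUNCTION
    void operator()( const int i ) const { x(i) *= a ; }
  };

  // scale_functor f ; f.x = x ; f.a = 2.0 ;
  // Kokkos::parallel_for( Kokkos::RangePolicy<Kokkos::OpenMP>( 0 , n ) , f );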
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
template< class FunctorType , class Arg0 , class Arg1 , class Arg2 >
class ParallelReduce< FunctorType , Kokkos::RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::OpenMP > >
{
private:
typedef Kokkos::RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::OpenMP > Policy ;
typedef typename Policy::work_tag WorkTag ;
typedef Kokkos::Impl::FunctorValueTraits< FunctorType , WorkTag > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< FunctorType , WorkTag > ValueInit ;
typedef Kokkos::Impl::FunctorValueJoin< FunctorType , WorkTag > ValueJoin ;
typedef typename ValueTraits::pointer_type pointer_type ;
typedef typename ValueTraits::reference_type reference_type ;
template< class PType >
KOKKOS_FORCEINLINE_FUNCTION static
void driver( typename Impl::enable_if< Impl::is_same< typename PType::work_tag , void >::value ,
const FunctorType & >::type functor
, reference_type update
, const PType & range )
{
const typename PType::member_type work_end = range.end();
for ( typename PType::member_type iwork = range.begin() ; iwork < work_end ; ++iwork ) {
functor( iwork , update );
}
}
template< class PType >
KOKKOS_FORCEINLINE_FUNCTION static
void driver( typename Impl::enable_if< ! Impl::is_same< typename PType::work_tag , void >::value ,
const FunctorType & >::type functor
, reference_type update
, const PType & range )
{
const typename PType::member_type work_end = range.end();
for ( typename PType::member_type iwork = range.begin() ; iwork < work_end ; ++iwork ) {
functor( typename PType::work_tag() , iwork , update );
}
}
public:
//----------------------------------------
template< class ViewType >
inline
ParallelReduce( typename Impl::enable_if<
( Impl::is_view< ViewType >::value &&
Impl::is_same< typename ViewType::memory_space , HostSpace >::value
), const FunctorType & >::type functor
, const Policy & policy
, const ViewType & result_view )
{
OpenMPexec::verify_is_process("Kokkos::OpenMP parallel_reduce");
OpenMPexec::verify_initialized("Kokkos::OpenMP parallel_reduce");
OpenMPexec::resize_scratch( ValueTraits::value_size( functor ) , 0 );
#pragma omp parallel
{
OpenMPexec & exec = * OpenMPexec::get_thread_omp();
driver( functor
, ValueInit::init( functor , exec.scratch_reduce() )
, typename Policy::WorkRange( policy , exec.pool_rank() , exec.pool_size() )
);
}
/* END #pragma omp parallel */
{
const pointer_type ptr = pointer_type( OpenMPexec::pool_rev(0)->scratch_reduce() );
for ( int i = 1 ; i < OpenMPexec::pool_size() ; ++i ) {
ValueJoin::join( functor , ptr , OpenMPexec::pool_rev(i)->scratch_reduce() );
}
Kokkos::Impl::FunctorFinal< FunctorType , WorkTag >::final( functor , ptr );
if ( result_view.ptr_on_device() ) {
const int n = ValueTraits::value_count( functor );
for ( int j = 0 ; j < n ; ++j ) { result_view.ptr_on_device()[j] = ptr[j] ; }
}
}
}
};
} // namespace Impl
} // namespace Kokkos
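A minimal reduction sketch (hypothetical functor). Each thread initializes and accumulates a partial value in its scratch_reduce() buffer; the specialization above then join()s the per-thread partials and applies final():

  struct sum_functor {
    typedef double value_type ;
    Kokkos::View<const double*,Kokkos::OpenMP> x ;
    KOKKOS_INLINE_FUNCTION
    void operator()( const int i , value_type & update ) const { update += x(i); }
  };

  // double result = 0 ;
  // Kokkos::parallel_reduce( Kokkos::RangePolicy<Kokkos::OpenMP>( 0 , n ) , f , result );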
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
template< class FunctorType , class Arg0 , class Arg1 , class Arg2 >
class ParallelScan< FunctorType , Kokkos::RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::OpenMP > >
{
private:
typedef Kokkos::RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::OpenMP > Policy ;
typedef typename Policy::work_tag WorkTag ;
typedef Kokkos::Impl::FunctorValueTraits< FunctorType , WorkTag > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< FunctorType , WorkTag > ValueInit ;
typedef Kokkos::Impl::FunctorValueJoin< FunctorType , WorkTag > ValueJoin ;
typedef Kokkos::Impl::FunctorValueOps< FunctorType , WorkTag > ValueOps ;
typedef typename ValueTraits::pointer_type pointer_type ;
typedef typename ValueTraits::reference_type reference_type ;
template< class PType >
KOKKOS_FORCEINLINE_FUNCTION static
void driver( typename Impl::enable_if< Impl::is_same< typename PType::work_tag , void >::value ,
const FunctorType & >::type functor
, reference_type update
, const PType & range
, const bool final )
{
const typename PType::member_type work_end = range.end();
for ( typename PType::member_type iwork = range.begin() ; iwork < work_end ; ++iwork ) {
functor( iwork , update , final );
}
}
template< class PType >
KOKKOS_FORCEINLINE_FUNCTION static
void driver( typename Impl::enable_if< ! Impl::is_same< typename PType::work_tag , void >::value ,
const FunctorType & >::type functor
, reference_type update
, const PType & range
, const bool final )
{
const typename PType::member_type work_end = range.end();
for ( typename PType::member_type iwork = range.begin() ; iwork < work_end ; ++iwork ) {
functor( typename PType::work_tag() , iwork , update , final );
}
}
public:
//----------------------------------------
inline
ParallelScan( const FunctorType & functor
, const Policy & policy )
{
OpenMPexec::verify_is_process("Kokkos::OpenMP parallel_scan");
OpenMPexec::verify_initialized("Kokkos::OpenMP parallel_scan");
OpenMPexec::resize_scratch( 2 * ValueTraits::value_size( functor ) , 0 );
#pragma omp parallel
{
OpenMPexec & exec = * OpenMPexec::get_thread_omp();
driver( functor
, ValueInit::init( functor , pointer_type( exec.scratch_reduce() ) + ValueTraits::value_count( functor ) )
, typename Policy::WorkRange( policy , exec.pool_rank() , exec.pool_size() )
, false );
}
/* END #pragma omp parallel */
{
const unsigned thread_count = OpenMPexec::pool_size();
const unsigned value_count = ValueTraits::value_count( functor );
pointer_type ptr_prev = 0 ;
for ( unsigned rank_rev = thread_count ; rank_rev-- ; ) {
pointer_type ptr = pointer_type( OpenMPexec::pool_rev(rank_rev)->scratch_reduce() );
if ( ptr_prev ) {
for ( unsigned i = 0 ; i < value_count ; ++i ) { ptr[i] = ptr_prev[ i + value_count ] ; }
ValueJoin::join( functor , ptr + value_count , ptr );
}
else {
ValueInit::init( functor , ptr );
}
ptr_prev = ptr ;
}
}
#pragma omp parallel
{
OpenMPexec & exec = * OpenMPexec::get_thread_omp();
driver( functor
, ValueOps::reference( pointer_type( exec.scratch_reduce() ) )
, typename Policy::WorkRange( policy , exec.pool_rank() , exec.pool_size() )
, true );
}
/* END #pragma omp parallel */
}
//----------------------------------------
};
} // namespace Impl
} // namespace Kokkos
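A minimal exclusive prefix-sum sketch (hypothetical functor). The implementation above invokes operator() twice per index: once with final==false to accumulate per-thread partial sums, and once with final==true after the inter-thread offsets have been joined:

  struct offset_functor {
    typedef int value_type ;
    Kokkos::View<const int*,Kokkos::OpenMP> counts ;
    Kokkos::View<int*,Kokkos::OpenMP>       offsets ;
    KOKKOS_INLINE_FUNCTION
    void operator()( const int i , value_type & update , const bool final ) const {
      if ( final ) { offsets(i) = update ; }
      update += counts(i);
    }
  };

  // Kokkos::parallel_scan( Kokkos::RangePolicy<Kokkos::OpenMP>( 0 , n ) , f );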
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
template< class FunctorType , class Arg0 , class Arg1 >
class ParallelFor< FunctorType , Kokkos::TeamPolicy< Arg0 , Arg1 , Kokkos::OpenMP > >
{
private:
typedef Kokkos::TeamPolicy< Arg0 , Arg1 , Kokkos::OpenMP > Policy ;
template< class TagType >
KOKKOS_FORCEINLINE_FUNCTION static
void driver( typename Impl::enable_if< Impl::is_same< TagType , void >::value ,
const FunctorType & >::type functor
, const typename Policy::member_type & member )
{ functor( member ); }
template< class TagType >
KOKKOS_FORCEINLINE_FUNCTION static
void driver( typename Impl::enable_if< ! Impl::is_same< TagType , void >::value ,
const FunctorType & >::type functor
, const typename Policy::member_type & member )
{ functor( TagType() , member ); }
public:
inline
ParallelFor( const FunctorType & functor ,
const Policy & policy )
{
OpenMPexec::verify_is_process("Kokkos::OpenMP parallel_for");
OpenMPexec::verify_initialized("Kokkos::OpenMP parallel_for");
const size_t team_reduce_size = Policy::member_type::team_reduce_size();
const size_t team_shmem_size = FunctorTeamShmemSize< FunctorType >::value( functor , policy.team_size() );
OpenMPexec::resize_scratch( 0 , team_reduce_size + team_shmem_size );
#pragma omp parallel
{
typename Policy::member_type member( * OpenMPexec::get_thread_omp() , policy , team_shmem_size );
for ( ; member.valid() ; member.next() ) {
ParallelFor::template driver< typename Policy::work_tag >( functor , member );
}
}
/* END #pragma omp parallel */
}
void wait() {}
};
template< class FunctorType , class Arg0 , class Arg1 >
class ParallelReduce< FunctorType , Kokkos::TeamPolicy< Arg0 , Arg1 , Kokkos::OpenMP > >
{
private:
typedef Kokkos::TeamPolicy< Arg0 , Arg1 , Kokkos::OpenMP > Policy ;
typedef typename Policy::work_tag WorkTag ;
typedef Kokkos::Impl::FunctorValueTraits< FunctorType , WorkTag > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< FunctorType , WorkTag > ValueInit ;
typedef Kokkos::Impl::FunctorValueJoin< FunctorType , WorkTag > ValueJoin ;
typedef typename ValueTraits::pointer_type pointer_type ;
typedef typename ValueTraits::reference_type reference_type ;
template< class PType >
KOKKOS_FORCEINLINE_FUNCTION static
void driver( typename Impl::enable_if< Impl::is_same< typename PType::work_tag , void >::value ,
const FunctorType & >::type functor
, const typename PType::member_type & member
, reference_type update )
{ functor( member , update ); }
template< class PType >
KOKKOS_FORCEINLINE_FUNCTION static
void driver( typename Impl::enable_if< ! Impl::is_same< typename PType::work_tag , void >::value ,
const FunctorType & >::type functor
, const typename PType::member_type & member
, reference_type update )
{ functor( typename PType::work_tag() , member , update ); }
public:
inline
ParallelReduce( const FunctorType & functor ,
const Policy & policy )
{
OpenMPexec::verify_is_process("Kokkos::OpenMP parallel_reduce");
const size_t team_reduce_size = Policy::member_type::team_reduce_size();
const size_t team_shmem_size = FunctorTeamShmemSize< FunctorType >::value( functor , policy.team_size() );
OpenMPexec::resize_scratch( ValueTraits::value_size( functor ) , team_reduce_size + team_shmem_size );
#pragma omp parallel
{
OpenMPexec & exec = * OpenMPexec::get_thread_omp();
reference_type update = ValueInit::init( functor , exec.scratch_reduce() );
for ( typename Policy::member_type member( exec , policy , team_shmem_size ); member.valid() ; member.next() ) {
ParallelReduce::template driver< Policy >( functor , member , update );
}
}
/* END #pragma omp parallel */
{
typedef Kokkos::Impl::FunctorValueJoin< FunctorType , WorkTag , reference_type > Join ;
const pointer_type ptr = pointer_type( OpenMPexec::pool_rev(0)->scratch_reduce() );
for ( int i = 1 ; i < OpenMPexec::pool_size() ; ++i ) {
Join::join( functor , ptr , OpenMPexec::pool_rev(i)->scratch_reduce() );
}
Kokkos::Impl::FunctorFinal< FunctorType , WorkTag >::final( functor , ptr );
}
}
template< class ViewType >
inline
ParallelReduce( const FunctorType & functor ,
const Policy & policy ,
const ViewType & result )
{
OpenMPexec::verify_is_process("Kokkos::OpenMP parallel_reduce");
const size_t team_reduce_size = Policy::member_type::team_reduce_size();
const size_t team_shmem_size = FunctorTeamShmemSize< FunctorType >::value( functor , policy.team_size() );
OpenMPexec::resize_scratch( ValueTraits::value_size( functor ) , team_reduce_size + team_shmem_size );
#pragma omp parallel
{
OpenMPexec & exec = * OpenMPexec::get_thread_omp();
reference_type update = ValueInit::init( functor , exec.scratch_reduce() );
for ( typename Policy::member_type member( exec , policy , team_shmem_size ); member.valid() ; member.next() ) {
ParallelReduce::template driver< Policy >( functor , member , update );
}
}
/* END #pragma omp parallel */
{
const pointer_type ptr = pointer_type( OpenMPexec::pool_rev(0)->scratch_reduce() );
for ( int i = 1 ; i < OpenMPexec::pool_size() ; ++i ) {
ValueJoin::join( functor , ptr , OpenMPexec::pool_rev(i)->scratch_reduce() );
}
Kokkos::Impl::FunctorFinal< FunctorType , WorkTag >::final( functor , ptr );
const int n = ValueTraits::value_count( functor );
for ( int j = 0 ; j < n ; ++j ) { result.ptr_on_device()[j] = ptr[j] ; }
}
}
void wait() {}
};
} // namespace Impl
} // namespace Kokkos
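A minimal team-parallel sketch (hypothetical functor). Every member of a team sees the same league_rank(); team_rank()/team_size() index the thread within its team, matching the member construction in the specializations above:

  typedef Kokkos::TeamPolicy<Kokkos::OpenMP> team_policy ;

  struct team_functor {
    KOKKOS_INLINE_FUNCTION
    void operator()( const team_policy::member_type & member ) const {
      const int work_index = member.league_rank() * member.team_size() + member.team_rank();
      (void) work_index ; // one unit of work per (league_rank, team_rank) pair
    }
  };

  // Kokkos::parallel_for( team_policy( league_size , team_size ) , team_functor() );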
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
#endif /* KOKKOS_OPENMP_PARALLEL_HPP */
diff --git a/lib/kokkos/core/src/OpenMP/Kokkos_OpenMPexec.cpp b/lib/kokkos/core/src/OpenMP/Kokkos_OpenMPexec.cpp
index 25683182d..ed98fd2f9 100755
--- a/lib/kokkos/core/src/OpenMP/Kokkos_OpenMPexec.cpp
+++ b/lib/kokkos/core/src/OpenMP/Kokkos_OpenMPexec.cpp
@@ -1,365 +1,364 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
#include <stdio.h>
#include <limits>
#include <iostream>
#include <vector>
#include <Kokkos_Core.hpp>
#include <impl/Kokkos_Error.hpp>
#include <iostream>
#ifdef KOKKOS_HAVE_OPENMP
namespace Kokkos {
namespace Impl {
namespace {
KOKKOS_INLINE_FUNCTION
int kokkos_omp_in_parallel();
int kokkos_omp_in_critical_region = ( Kokkos::HostSpace::register_in_parallel( kokkos_omp_in_parallel ) , 0 );
KOKKOS_INLINE_FUNCTION
int kokkos_omp_in_parallel()
{
#ifndef __CUDA_ARCH__
return omp_in_parallel() && ! kokkos_omp_in_critical_region ;
#else
return 0;
#endif
}
bool s_using_hwloc = false;
} // namespace
} // namespace Impl
} // namespace Kokkos
namespace Kokkos {
namespace Impl {
int OpenMPexec::m_map_rank[ OpenMPexec::MAX_THREAD_COUNT ] = { 0 };
int OpenMPexec::m_pool_topo[ 4 ] = { 0 };
-OpenMPexec * OpenMPexec::m_pool[ OpenMPexec::MAX_THREAD_COUNT ] = { 0 };
+OpenMPexec::Pool OpenMPexec::m_pool;
void OpenMPexec::verify_is_process( const char * const label )
{
if ( omp_in_parallel() ) {
std::string msg( label );
msg.append( " ERROR: in parallel" );
Kokkos::Impl::throw_runtime_exception( msg );
}
}
void OpenMPexec::verify_initialized( const char * const label )
{
if ( 0 == m_pool[0] ) {
std::string msg( label );
msg.append( " ERROR: not initialized" );
Kokkos::Impl::throw_runtime_exception( msg );
}
}
void OpenMPexec::clear_scratch()
{
#pragma omp parallel
{
const int rank_rev = m_map_rank[ omp_get_thread_num() ];
-
-#pragma omp critical
- {
- kokkos_omp_in_critical_region = 1 ;
-
- m_pool[ rank_rev ]->~OpenMPexec();
- HostSpace::decrement( m_pool[ rank_rev ] );
- m_pool[ rank_rev ] = 0 ;
-
- kokkos_omp_in_critical_region = 0 ;
- }
-/* END #pragma omp critical */
+ m_pool.at(rank_rev).clear();
}
/* END #pragma omp parallel */
}
void OpenMPexec::resize_scratch( size_t reduce_size , size_t thread_size )
{
enum { ALIGN_MASK = Kokkos::Impl::MEMORY_ALIGNMENT - 1 };
enum { ALLOC_EXEC = ( sizeof(OpenMPexec) + ALIGN_MASK ) & ~ALIGN_MASK };
const size_t old_reduce_size = m_pool[0] ? m_pool[0]->m_scratch_reduce_end : 0 ;
const size_t old_thread_size = m_pool[0] ? m_pool[0]->m_scratch_thread_end - m_pool[0]->m_scratch_reduce_end : 0 ;
reduce_size = ( reduce_size + ALIGN_MASK ) & ~ALIGN_MASK ;
thread_size = ( thread_size + ALIGN_MASK ) & ~ALIGN_MASK ;
// Allocate only when the requested sizes exceed the old allocation:
const bool allocate = ( old_reduce_size < reduce_size ) ||
( old_thread_size < thread_size );
if ( allocate ) {
if ( reduce_size < old_reduce_size ) { reduce_size = old_reduce_size ; }
if ( thread_size < old_thread_size ) { thread_size = old_thread_size ; }
}
const size_t alloc_size = allocate ? ALLOC_EXEC + reduce_size + thread_size : 0 ;
const int pool_size = m_pool_topo[0] ;
if ( allocate ) {
clear_scratch();
#pragma omp parallel
{
const int rank_rev = m_map_rank[ omp_get_thread_num() ];
const int rank = pool_size - ( rank_rev + 1 );
-#pragma omp critical
- {
- kokkos_omp_in_critical_region = 1 ;
-
- m_pool[ rank_rev ] =
- (OpenMPexec *) HostSpace::allocate( "openmp_scratch" , alloc_size );
- new( m_pool[ rank_rev ] ) OpenMPexec( rank , ALLOC_EXEC , reduce_size , thread_size );
-
- kokkos_omp_in_critical_region = 0 ;
- }
-/* END #pragma omp critical */
+ m_pool.at(rank_rev) = HostSpace::allocate_and_track( "openmp_scratch", alloc_size );
+ new ( m_pool[ rank_rev ] ) OpenMPexec( rank , ALLOC_EXEC , reduce_size , thread_size );
}
/* END #pragma omp parallel */
}
}
} // namespace Impl
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
//----------------------------------------------------------------------------
int OpenMP::is_initialized()
{ return 0 != Impl::OpenMPexec::m_pool[0]; }
void OpenMP::initialize( unsigned thread_count ,
unsigned use_numa_count ,
unsigned use_cores_per_numa )
{
// Before any other call to OMP query the maximum number of threads
// and save the value for re-initialization unit testing.
- static int omp_max_threads = omp_get_max_threads();
+
+ // Using omp_get_max_threads() is problematic in conjunction with
+ // hwloc on Intel: an initial call into the OpenMP runtime made before
+ // any parallel region sets a process mask for a single core. On entering
+ // the first parallel region the runtime then binds threads to other
+ // cores and makes the process mask the aggregate of the thread masks.
+ // The intent seems to be to make serial code run fast when compiled
+ // with OpenMP enabled but no parallel regions are actually used.
+ //static int omp_max_threads = omp_get_max_threads();
+ int nthreads = 0;
+ #pragma omp parallel
+ {
+ #pragma omp atomic
+ nthreads++;
+ }
+
+ static int omp_max_threads = nthreads;
const bool is_initialized = 0 != Impl::OpenMPexec::m_pool[0] ;
bool thread_spawn_failed = false ;
if ( ! is_initialized ) {
// Use hwloc thread pinning if concerned with locality.
// If spreading threads across multiple NUMA regions.
// If hyperthreading is enabled.
Impl::s_using_hwloc = hwloc::available() && (
( 1 < Kokkos::hwloc::get_available_numa_count() ) ||
( 1 < Kokkos::hwloc::get_available_threads_per_core() ) );
std::pair<unsigned,unsigned> threads_coord[ Impl::OpenMPexec::MAX_THREAD_COUNT ];
// If hwloc is available then use its maximum value.
if ( thread_count == 0 ) {
thread_count = Impl::s_using_hwloc
? Kokkos::hwloc::get_available_numa_count() *
Kokkos::hwloc::get_available_cores_per_numa() *
Kokkos::hwloc::get_available_threads_per_core()
: omp_max_threads ;
}
if(Impl::s_using_hwloc)
hwloc::thread_mapping( "Kokkos::OpenMP::initialize" ,
false /* do not allow asynchronous */ ,
thread_count ,
use_numa_count ,
use_cores_per_numa ,
threads_coord );
// Spawn threads:
omp_set_num_threads( thread_count );
// Verify OMP interaction:
if ( int(thread_count) != omp_get_max_threads() ) {
thread_spawn_failed = true ;
}
// Verify spawning and bind threads:
#pragma omp parallel
{
#pragma omp critical
{
if ( int(thread_count) != omp_get_num_threads() ) {
thread_spawn_failed = true ;
}
// Call to 'bind_this_thread' is not thread safe so place this whole block in a critical region.
// Call to 'new' may not be thread safe as well.
// Reverse the rank for threads so that the scan operation reduces to the highest rank thread.
const unsigned omp_rank = omp_get_thread_num();
const unsigned thread_r = Impl::s_using_hwloc ? Kokkos::hwloc::bind_this_thread( thread_count , threads_coord ) : omp_rank ;
Impl::OpenMPexec::m_map_rank[ omp_rank ] = thread_r ;
}
/* END #pragma omp critical */
}
/* END #pragma omp parallel */
if ( ! thread_spawn_failed ) {
Impl::OpenMPexec::m_pool_topo[0] = thread_count ;
Impl::OpenMPexec::m_pool_topo[1] = Impl::s_using_hwloc ? thread_count / use_numa_count : thread_count;
Impl::OpenMPexec::m_pool_topo[2] = Impl::s_using_hwloc ? thread_count / ( use_numa_count * use_cores_per_numa ) : 1;
Impl::OpenMPexec::resize_scratch( 1024 , 1024 );
}
}
if ( is_initialized || thread_spawn_failed ) {
std::string msg("Kokkos::OpenMP::initialize ERROR");
if ( is_initialized ) { msg.append(" : already initialized"); }
if ( thread_spawn_failed ) { msg.append(" : failed spawning threads"); }
Kokkos::Impl::throw_runtime_exception(msg);
}
+
+ // Init the lock array used for arbitrarily sized atomics
+ Impl::init_lock_array_host_space();
}
//----------------------------------------------------------------------------
void OpenMP::finalize()
{
Impl::OpenMPexec::verify_initialized( "OpenMP::finalize" );
Impl::OpenMPexec::verify_is_process( "OpenMP::finalize" );
Impl::OpenMPexec::clear_scratch();
Impl::OpenMPexec::m_pool_topo[0] = 0 ;
Impl::OpenMPexec::m_pool_topo[1] = 0 ;
Impl::OpenMPexec::m_pool_topo[2] = 0 ;
- omp_set_num_threads(0);
+ omp_set_num_threads(1);
if ( Impl::s_using_hwloc ) {
hwloc::unbind_this_thread();
}
}
//----------------------------------------------------------------------------
void OpenMP::print_configuration( std::ostream & s , const bool detail )
{
Impl::OpenMPexec::verify_is_process( "OpenMP::print_configuration" );
s << "Kokkos::OpenMP" ;
#if defined( KOKKOS_HAVE_OPENMP )
s << " KOKKOS_HAVE_OPENMP" ;
#endif
#if defined( KOKKOS_HAVE_HWLOC )
- const unsigned numa_count = Kokkos::hwloc::get_available_numa_count();
+ const unsigned numa_count_ = Kokkos::hwloc::get_available_numa_count();
const unsigned cores_per_numa = Kokkos::hwloc::get_available_cores_per_numa();
const unsigned threads_per_core = Kokkos::hwloc::get_available_threads_per_core();
- s << " hwloc[" << numa_count << "x" << cores_per_numa << "x" << threads_per_core << "]"
+ s << " hwloc[" << numa_count_ << "x" << cores_per_numa << "x" << threads_per_core << "]"
<< " hwloc_binding_" << ( Impl::s_using_hwloc ? "enabled" : "disabled" )
;
#endif
const bool is_initialized = 0 != Impl::OpenMPexec::m_pool[0] ;
if ( is_initialized ) {
const int numa_count = Kokkos::Impl::OpenMPexec::m_pool_topo[0] / Kokkos::Impl::OpenMPexec::m_pool_topo[1] ;
const int core_per_numa = Kokkos::Impl::OpenMPexec::m_pool_topo[1] / Kokkos::Impl::OpenMPexec::m_pool_topo[2] ;
const int thread_per_core = Kokkos::Impl::OpenMPexec::m_pool_topo[2] ;
s << " thread_pool_topology[ " << numa_count
<< " x " << core_per_numa
<< " x " << thread_per_core
<< " ]"
<< std::endl ;
if ( detail ) {
std::vector< std::pair<unsigned,unsigned> > coord( Kokkos::Impl::OpenMPexec::m_pool_topo[0] );
#pragma omp parallel
{
#pragma omp critical
{
coord[ omp_get_thread_num() ] = hwloc::get_this_thread_coordinate();
}
/* END #pragma omp critical */
}
/* END #pragma omp parallel */
for ( unsigned i = 0 ; i < coord.size() ; ++i ) {
s << " thread omp_rank[" << i << "]"
<< " kokkos_rank[" << Impl::OpenMPexec::m_map_rank[ i ] << "]"
<< " hwloc_coord[" << coord[i].first << "." << coord[i].second << "]"
<< std::endl ;
}
}
}
else {
s << " not initialized" << std::endl ;
}
}
} // namespace Kokkos
#endif //KOKKOS_HAVE_OPENMP
diff --git a/lib/kokkos/core/src/OpenMP/Kokkos_OpenMPexec.hpp b/lib/kokkos/core/src/OpenMP/Kokkos_OpenMPexec.hpp
index 82b27b97b..1ab08f648 100755
--- a/lib/kokkos/core/src/OpenMP/Kokkos_OpenMPexec.hpp
+++ b/lib/kokkos/core/src/OpenMP/Kokkos_OpenMPexec.hpp
@@ -1,758 +1,767 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_OPENMPEXEC_HPP
#define KOKKOS_OPENMPEXEC_HPP
#include <impl/Kokkos_Traits.hpp>
#include <impl/Kokkos_spinwait.hpp>
+#include <impl/Kokkos_AllocationTracker.hpp>
#include <Kokkos_Atomic.hpp>
namespace Kokkos {
namespace Impl {
//----------------------------------------------------------------------------
/** \brief Data for OpenMP thread execution */
class OpenMPexec {
public:
enum { MAX_THREAD_COUNT = 4096 };
+ struct Pool
+ {
+ Pool() : m_trackers() {}
+
+ AllocationTracker m_trackers[ MAX_THREAD_COUNT ];
+
+ OpenMPexec * operator[](int i)
+ {
+ return reinterpret_cast<OpenMPexec *>(m_trackers[i].alloc_ptr());
+ }
+
+ AllocationTracker & at(int i)
+ {
+ return m_trackers[i];
+ }
+ };
+
private:
static int m_pool_topo[ 4 ];
static int m_map_rank[ MAX_THREAD_COUNT ];
- static OpenMPexec * m_pool[ MAX_THREAD_COUNT ]; // Indexed by: m_pool_rank_rev
+ static Pool m_pool; // Indexed by: m_pool_rank_rev
friend class Kokkos::OpenMP ;
int const m_pool_rank ;
int const m_pool_rank_rev ;
int const m_scratch_exec_end ;
int const m_scratch_reduce_end ;
int const m_scratch_thread_end ;
int volatile m_barrier_state ;
OpenMPexec();
OpenMPexec( const OpenMPexec & );
OpenMPexec & operator = ( const OpenMPexec & );
static void clear_scratch();
public:
// Topology of a cache coherent thread pool:
// TOTAL = NUMA x GRAIN
// pool_size( depth = 0 )
// pool_size(0) = total number of threads
// pool_size(1) = number of threads per NUMA
// pool_size(2) = number of threads sharing finest grain memory hierarchy
inline static
int pool_size( int depth = 0 ) { return m_pool_topo[ depth ]; }
inline static
OpenMPexec * pool_rev( int pool_rank_rev ) { return m_pool[ pool_rank_rev ]; }
inline int pool_rank() const { return m_pool_rank ; }
inline int pool_rank_rev() const { return m_pool_rank_rev ; }
inline void * scratch_reduce() const { return ((char *) this) + m_scratch_exec_end ; }
inline void * scratch_thread() const { return ((char *) this) + m_scratch_reduce_end ; }
inline
void state_wait( int state )
{ Impl::spinwait( m_barrier_state , state ); }
inline
void state_set( int state ) { m_barrier_state = state ; }
~OpenMPexec() {}
- OpenMPexec( const int poolRank
+ OpenMPexec( const int poolRank
, const int scratch_exec_size
, const int scratch_reduce_size
, const int scratch_thread_size )
: m_pool_rank( poolRank )
, m_pool_rank_rev( pool_size() - ( poolRank + 1 ) )
, m_scratch_exec_end( scratch_exec_size )
, m_scratch_reduce_end( m_scratch_exec_end + scratch_reduce_size )
, m_scratch_thread_end( m_scratch_reduce_end + scratch_thread_size )
, m_barrier_state(0)
{}
static void finalize();
static void initialize( const unsigned team_count ,
const unsigned threads_per_team ,
const unsigned numa_count ,
const unsigned cores_per_numa );
static void verify_is_process( const char * const );
static void verify_initialized( const char * const );
static void resize_scratch( size_t reduce_size , size_t thread_size );
inline static
OpenMPexec * get_thread_omp() { return m_pool[ m_map_rank[ omp_get_thread_num() ] ]; }
};
} // namespace Impl
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
class OpenMPexecTeamMember {
private:
enum { TEAM_REDUCE_SIZE = 512 };
/** \brief Thread states for team synchronization */
enum { Active = 0 , Rendezvous = 1 };
typedef Kokkos::OpenMP execution_space ;
typedef execution_space::scratch_memory_space scratch_memory_space ;
Impl::OpenMPexec & m_exec ;
scratch_memory_space m_team_shared ;
int m_team_shmem ;
int m_team_base_rev ;
int m_team_rank_rev ;
int m_team_rank ;
int m_team_size ;
int m_league_rank ;
int m_league_end ;
int m_league_size ;
// Fan-in team threads, root of the fan-in which does not block returns true
inline
bool team_fan_in() const
{
for ( int n = 1 , j ; ( ( j = m_team_rank_rev + n ) < m_team_size ) && ! ( m_team_rank_rev & n ) ; n <<= 1 ) {
m_exec.pool_rev( m_team_base_rev + j )->state_wait( Active );
}
if ( m_team_rank_rev ) {
m_exec.state_set( Rendezvous );
m_exec.state_wait( Rendezvous );
}
return 0 == m_team_rank_rev ;
}
inline
void team_fan_out() const
{
for ( int n = 1 , j ; ( ( j = m_team_rank_rev + n ) < m_team_size ) && ! ( m_team_rank_rev & n ) ; n <<= 1 ) {
m_exec.pool_rev( m_team_base_rev + j )->state_set( Active );
}
}
public:
KOKKOS_INLINE_FUNCTION
const execution_space::scratch_memory_space & team_shmem() const
{ return m_team_shared ; }
KOKKOS_INLINE_FUNCTION int league_rank() const { return m_league_rank ; }
KOKKOS_INLINE_FUNCTION int league_size() const { return m_league_size ; }
KOKKOS_INLINE_FUNCTION int team_rank() const { return m_team_rank ; }
KOKKOS_INLINE_FUNCTION int team_size() const { return m_team_size ; }
KOKKOS_INLINE_FUNCTION void team_barrier() const
#if ! defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
{}
#else
{
if ( 1 < m_team_size ) {
team_fan_in();
team_fan_out();
}
}
#endif
template<class ValueType>
KOKKOS_INLINE_FUNCTION
void team_broadcast(ValueType& value, const int& thread_id) const
{
#if ! defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
{ }
#else
// Make sure there is enough scratch space:
typedef typename if_c< sizeof(ValueType) < TEAM_REDUCE_SIZE
, ValueType , void >::type type ;
type * const local_value = ((type*) m_exec.scratch_thread());
if(team_rank() == thread_id)
*local_value = value;
memory_fence();
team_barrier();
value = *local_value;
#endif
}
#ifdef KOKKOS_HAVE_CXX11
template< class ValueType, class JoinOp >
KOKKOS_INLINE_FUNCTION ValueType
team_reduce( const ValueType & value
, const JoinOp & op_in ) const
#if ! defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
{ return ValueType(); }
#else
{
typedef ValueType value_type;
const JoinLambdaAdapter<value_type,JoinOp> op(op_in);
#endif
#else // KOKKOS_HAVE_CXX11
template< class JoinOp >
KOKKOS_INLINE_FUNCTION typename JoinOp::value_type
team_reduce( const typename JoinOp::value_type & value
, const JoinOp & op ) const
#if ! defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
{ return typename JoinOp::value_type(); }
#else
{
typedef typename JoinOp::value_type value_type;
#endif
#endif // KOKKOS_HAVE_CXX11
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
// Make sure there is enough scratch space:
typedef typename if_c< sizeof(value_type) < TEAM_REDUCE_SIZE
, value_type , void >::type type ;
type * const local_value = ((type*) m_exec.scratch_thread());
// Set this thread's contribution
*local_value = value ;
// Fence to make sure the base team member has access:
memory_fence();
if ( team_fan_in() ) {
// The last thread to synchronize returns true, all other threads wait for team_fan_out()
type * const team_value = ((type*) m_exec.pool_rev( m_team_base_rev )->scratch_thread());
// Join to the team value:
for ( int i = 1 ; i < m_team_size ; ++i ) {
op.join( *team_value , *((type*) m_exec.pool_rev( m_team_base_rev + i )->scratch_thread()) );
}
// The base team member may "lap" the other team members,
// copy to their local value before proceeding.
for ( int i = 1 ; i < m_team_size ; ++i ) {
*((type*) m_exec.pool_rev( m_team_base_rev + i )->scratch_thread()) = *team_value ;
}
// Fence to make sure all team members have access
memory_fence();
}
team_fan_out();
return *((type volatile const *)local_value);
}
#endif
/** \brief Intra-team exclusive prefix sum with team_rank() ordering
* with intra-team non-deterministic ordering accumulation.
*
* The global inter-team accumulation value will, at the end of the
* league's parallel execution, be the scan's total.
* Parallel execution ordering of the league's teams is non-deterministic.
* As such the base value for each team's scan operation is similarly
* non-deterministic.
*/
template< typename ArgType >
KOKKOS_INLINE_FUNCTION ArgType team_scan( const ArgType & value , ArgType * const global_accum ) const
#if ! defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
{ return ArgType(); }
#else
{
// Make sure there is enough scratch space:
typedef typename if_c< sizeof(ArgType) < TEAM_REDUCE_SIZE , ArgType , void >::type type ;
volatile type * const work_value = ((type*) m_exec.scratch_thread());
*work_value = value ;
memory_fence();
if ( team_fan_in() ) {
// The last thread to synchronize returns true, all other threads wait for team_fan_out()
// m_team_base[0] == highest ranking team member
// m_team_base[ m_team_size - 1 ] == lowest ranking team member
//
// 1) copy from lower to higher rank, initialize lowest rank to zero
// 2) prefix sum from lowest to highest rank, skipping lowest rank
type accum = 0 ;
if ( global_accum ) {
for ( int i = m_team_size ; i-- ; ) {
type & val = *((type*) m_exec.pool_rev( m_team_base_rev + i )->scratch_thread());
accum += val ;
}
accum = atomic_fetch_add( global_accum , accum );
}
for ( int i = m_team_size ; i-- ; ) {
type & val = *((type*) m_exec.pool_rev( m_team_base_rev + i )->scratch_thread());
- const type offset = accum ;
+ const type offset = accum ;
accum += val ;
val = offset ;
}
memory_fence();
}
team_fan_out();
return *work_value ;
}
#endif
/** \brief Intra-team exclusive prefix sum with team_rank() ordering.
*
* The highest rank thread can compute the reduction total as
* reduction_total = dev.team_scan( value ) + value ;
*/
template< typename Type >
KOKKOS_INLINE_FUNCTION Type team_scan( const Type & value ) const
{ return this-> template team_scan<Type>( value , 0 ); }
-#ifdef KOKKOS_HAVE_CXX11
-
- /** \brief Inter-thread parallel for. Executes op(iType i) for each i=0..N-1.
- *
- * The range i=0..N-1 is mapped to all threads of the the calling thread team.
- * This functionality requires C++11 support.*/
- template< typename iType, class Operation>
- KOKKOS_INLINE_FUNCTION void team_par_for(const iType n, const Operation & op) const {
- const int chunk = ((n+m_team_size-1)/m_team_size);
- const int start = chunk*m_team_rank;
- const int end = start+chunk<n?start+chunk:n;
- for(int i=start; i<end ; i++) {
- op(i);
- }
- }
-#endif
-
//----------------------------------------
// Private for the driver
private:
typedef execution_space::scratch_memory_space space ;
public:
template< class Arg0 , class Arg1 >
inline
OpenMPexecTeamMember( Impl::OpenMPexec & exec
, const TeamPolicy< Arg0 , Arg1 , Kokkos::OpenMP > & team
, const int shmem_size
)
: m_exec( exec )
, m_team_shared(0,0)
, m_team_shmem( shmem_size )
, m_team_base_rev(0)
, m_team_rank_rev(0)
, m_team_rank(0)
, m_team_size( team.team_size() )
, m_league_rank(0)
, m_league_end(0)
, m_league_size( team.league_size() )
{
const int pool_rank_rev = m_exec.pool_rank_rev();
const int pool_team_rank_rev = pool_rank_rev % team.team_alloc();
const int pool_league_rank_rev = pool_rank_rev / team.team_alloc();
const int league_iter_end = team.league_size() - pool_league_rank_rev * team.team_iter();
if ( pool_team_rank_rev < m_team_size && 0 < league_iter_end ) {
m_team_base_rev = team.team_alloc() * pool_league_rank_rev ;
m_team_rank_rev = pool_team_rank_rev ;
m_team_rank = m_team_size - ( m_team_rank_rev + 1 );
m_league_end = league_iter_end ;
m_league_rank = league_iter_end > team.team_iter() ? league_iter_end - team.team_iter() : 0 ;
new( (void*) &m_team_shared ) space( ( (char*) m_exec.pool_rev(m_team_base_rev)->scratch_thread() ) + TEAM_REDUCE_SIZE , m_team_shmem );
}
}
bool valid() const
{ return m_league_rank < m_league_end ; }
void next()
{
if ( ++m_league_rank < m_league_end ) {
team_barrier();
new( (void*) &m_team_shared ) space( ( (char*) m_exec.pool_rev(m_team_base_rev)->scratch_thread() ) + TEAM_REDUCE_SIZE , m_team_shmem );
}
}
static inline int team_reduce_size() { return TEAM_REDUCE_SIZE ; }
};
} // namespace Impl
template< class Arg0 , class Arg1 >
class TeamPolicy< Arg0 , Arg1 , Kokkos::OpenMP >
{
public:
//! Tag this class as a kokkos execution policy
typedef TeamPolicy execution_policy ;
//! Execution space of this execution policy.
typedef Kokkos::OpenMP execution_space ;
typedef typename
Impl::if_c< ! Impl::is_same< Kokkos::OpenMP , Arg0 >::value , Arg0 , Arg1 >::type
work_tag ;
//----------------------------------------
template< class FunctorType >
inline static
int team_size_max( const FunctorType & )
{ return execution_space::thread_pool_size(1); }
template< class FunctorType >
inline static
int team_size_recommended( const FunctorType & )
{ return execution_space::thread_pool_size(2); }
+ template< class FunctorType >
+ inline static
+ int team_size_recommended( const FunctorType &, const int& )
+ { return execution_space::thread_pool_size(2); }
+
//----------------------------------------
private:
int m_league_size ;
int m_team_size ;
int m_team_alloc ;
int m_team_iter ;
inline void init( const int league_size_request
, const int team_size_request )
{
const int pool_size = execution_space::thread_pool_size(0);
const int team_max = execution_space::thread_pool_size(1);
const int team_grain = execution_space::thread_pool_size(2);
m_league_size = league_size_request ;
m_team_size = team_size_request < team_max ?
team_size_request : team_max ;
// Round team size up to a multiple of 'team_grain'
const int team_size_grain = team_grain * ( ( m_team_size + team_grain - 1 ) / team_grain );
const int team_count = pool_size / team_size_grain ;
// Constraint : pool_size = m_team_alloc * team_count
m_team_alloc = pool_size / team_count ;
// Maximum number of iterations each team will take:
m_team_iter = ( m_league_size + team_count - 1 ) / team_count ;
}
public:
inline int team_size() const { return m_team_size ; }
inline int league_size() const { return m_league_size ; }
/** \brief Specify league size, request team size */
TeamPolicy( execution_space & , int league_size_request , int team_size_request , int vector_length_request = 1)
{ init( league_size_request , team_size_request ); (void) vector_length_request; }
TeamPolicy( int league_size_request , int team_size_request , int vector_length_request = 1 )
{ init( league_size_request , team_size_request ); (void) vector_length_request; }
inline int team_alloc() const { return m_team_alloc ; }
inline int team_iter() const { return m_team_iter ; }
typedef Impl::OpenMPexecTeamMember member_type ;
};
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
inline
int OpenMP::thread_pool_size( int depth )
{
return Impl::OpenMPexec::pool_size(depth);
}
KOKKOS_INLINE_FUNCTION
int OpenMP::thread_pool_rank()
{
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
return Impl::OpenMPexec::m_map_rank[ omp_get_thread_num() ];
#else
return -1 ;
#endif
}
} // namespace Kokkos
-#ifdef KOKKOS_HAVE_CXX11
-
namespace Kokkos {
template<typename iType>
KOKKOS_INLINE_FUNCTION
-Impl::TeamThreadLoopBoundariesStruct<iType,Impl::OpenMPexecTeamMember>
- TeamThreadLoop(const Impl::OpenMPexecTeamMember& thread, const iType& count) {
- return Impl::TeamThreadLoopBoundariesStruct<iType,Impl::OpenMPexecTeamMember>(thread,count);
+Impl::TeamThreadRangeBoundariesStruct<iType,Impl::OpenMPexecTeamMember>
+ TeamThreadRange(const Impl::OpenMPexecTeamMember& thread, const iType& count) {
+ return Impl::TeamThreadRangeBoundariesStruct<iType,Impl::OpenMPexecTeamMember>(thread,count);
+}
+
+template<typename iType>
+KOKKOS_INLINE_FUNCTION
+Impl::TeamThreadRangeBoundariesStruct<iType,Impl::OpenMPexecTeamMember>
+ TeamThreadRange(const Impl::OpenMPexecTeamMember& thread, const iType& begin, const iType& end) {
+ return Impl::TeamThreadRangeBoundariesStruct<iType,Impl::OpenMPexecTeamMember>(thread,begin,end);
}
template<typename iType>
KOKKOS_INLINE_FUNCTION
-Impl::ThreadVectorLoopBoundariesStruct<iType,Impl::OpenMPexecTeamMember >
- ThreadVectorLoop(const Impl::OpenMPexecTeamMember& thread, const iType& count) {
- return Impl::ThreadVectorLoopBoundariesStruct<iType,Impl::OpenMPexecTeamMember >(thread,count);
+Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::OpenMPexecTeamMember >
+ ThreadVectorRange(const Impl::OpenMPexecTeamMember& thread, const iType& count) {
+ return Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::OpenMPexecTeamMember >(thread,count);
}
KOKKOS_INLINE_FUNCTION
Impl::ThreadSingleStruct<Impl::OpenMPexecTeamMember> PerTeam(const Impl::OpenMPexecTeamMember& thread) {
return Impl::ThreadSingleStruct<Impl::OpenMPexecTeamMember>(thread);
}
KOKKOS_INLINE_FUNCTION
Impl::VectorSingleStruct<Impl::OpenMPexecTeamMember> PerThread(const Impl::OpenMPexecTeamMember& thread) {
return Impl::VectorSingleStruct<Impl::OpenMPexecTeamMember>(thread);
}
} // namespace Kokkos
namespace Kokkos {
/** \brief Inter-thread parallel_for. Executes lambda(iType i) for each i=0..N-1.
*
 * The range i=0..N-1 is mapped to all threads of the calling thread team.
* This functionality requires C++11 support.*/
template<typename iType, class Lambda>
KOKKOS_INLINE_FUNCTION
-void parallel_for(const Impl::TeamThreadLoopBoundariesStruct<iType,Impl::OpenMPexecTeamMember>& loop_boundaries, const Lambda& lambda) {
+void parallel_for(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::OpenMPexecTeamMember>& loop_boundaries, const Lambda& lambda) {
for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment)
lambda(i);
}
/** \brief Inter-thread vector parallel_reduce. Executes lambda(iType i, ValueType & val) for each i=0..N-1.
*
 * The range i=0..N-1 is mapped to all threads of the calling thread team and a summation of
* val is performed and put into result. This functionality requires C++11 support.*/
template< typename iType, class Lambda, typename ValueType >
KOKKOS_INLINE_FUNCTION
-void parallel_reduce(const Impl::TeamThreadLoopBoundariesStruct<iType,Impl::OpenMPexecTeamMember>& loop_boundaries,
+void parallel_reduce(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::OpenMPexecTeamMember>& loop_boundaries,
const Lambda & lambda, ValueType& result) {
result = ValueType();
for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
ValueType tmp = ValueType();
lambda(i,tmp);
result+=tmp;
}
result = loop_boundaries.thread.team_reduce(result,Impl::JoinAdd<ValueType>());
}
/** \brief Intra-thread vector parallel_reduce. Executes lambda(iType i, ValueType & val) for each i=0..N-1.
*
 * The range i=0..N-1 is mapped to all vector lanes of the calling thread and a reduction of
* val is performed using JoinType(ValueType& val, const ValueType& update) and put into init_result.
* The input value of init_result is used as initializer for temporary variables of ValueType. Therefore
* the input value should be the neutral element with respect to the join operation (e.g. '0 for +-' or
* '1 for *'). This functionality requires C++11 support.*/
template< typename iType, class Lambda, typename ValueType, class JoinType >
KOKKOS_INLINE_FUNCTION
-void parallel_reduce(const Impl::TeamThreadLoopBoundariesStruct<iType,Impl::OpenMPexecTeamMember>& loop_boundaries,
+void parallel_reduce(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::OpenMPexecTeamMember>& loop_boundaries,
const Lambda & lambda, const JoinType& join, ValueType& init_result) {
ValueType result = init_result;
for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
ValueType tmp = ValueType();
lambda(i,tmp);
join(result,tmp);
}
init_result = loop_boundaries.thread.team_reduce(result,join);
}
} //namespace Kokkos
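A minimal sketch of the renamed nested-parallelism interface (hypothetical functor body; requires C++11 lambdas). TeamThreadRange splits the index range across the threads of one team, and the team-level parallel_reduce above leaves the joined result visible to every team member:

  struct nested_functor {
    typedef Kokkos::TeamPolicy<Kokkos::OpenMP>::member_type member_type ;
    KOKKOS_INLINE_FUNCTION
    void operator()( const member_type & member ) const {
      double team_sum = 0 ;
      Kokkos::parallel_reduce( Kokkos::TeamThreadRange( member , 100 ) ,
        [&] ( const int i , double & update ) { update += double(i); } , team_sum );
      // team_sum holds the same value on every thread of the team
    }
  };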
namespace Kokkos {
/** \brief Intra-thread vector parallel_for. Executes lambda(iType i) for each i=0..N-1.
*
 * The range i=0..N-1 is mapped to all vector lanes of the calling thread.
* This functionality requires C++11 support.*/
template<typename iType, class Lambda>
KOKKOS_INLINE_FUNCTION
-void parallel_for(const Impl::ThreadVectorLoopBoundariesStruct<iType,Impl::OpenMPexecTeamMember >&
+void parallel_for(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::OpenMPexecTeamMember >&
loop_boundaries, const Lambda& lambda) {
#ifdef KOKKOS_HAVE_PRAGMA_IVDEP
#pragma ivdep
#endif
for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment)
lambda(i);
}
/** \brief Intra-thread vector parallel_reduce. Executes lambda(iType i, ValueType & val) for each i=0..N-1.
*
 * The range i=0..N-1 is mapped to all vector lanes of the calling thread and a summation of
* val is performed and put into result. This functionality requires C++11 support.*/
template< typename iType, class Lambda, typename ValueType >
KOKKOS_INLINE_FUNCTION
-void parallel_reduce(const Impl::ThreadVectorLoopBoundariesStruct<iType,Impl::OpenMPexecTeamMember >&
+void parallel_reduce(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::OpenMPexecTeamMember >&
loop_boundaries, const Lambda & lambda, ValueType& result) {
result = ValueType();
#ifdef KOKKOS_HAVE_PRAGMA_IVDEP
#pragma ivdep
#endif
for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
ValueType tmp = ValueType();
lambda(i,tmp);
result+=tmp;
}
}
/** \brief Intra-thread vector parallel_reduce. Executes lambda(iType i, ValueType & val) for each i=0..N-1.
*
 * The range i=0..N-1 is mapped to all vector lanes of the calling thread and a reduction of
* val is performed using JoinType(ValueType& val, const ValueType& update) and put into init_result.
* The input value of init_result is used as initializer for temporary variables of ValueType. Therefore
* the input value should be the neutral element with respect to the join operation (e.g. '0 for +-' or
* '1 for *'). This functionality requires C++11 support.*/
template< typename iType, class Lambda, typename ValueType, class JoinType >
KOKKOS_INLINE_FUNCTION
-void parallel_reduce(const Impl::ThreadVectorLoopBoundariesStruct<iType,Impl::OpenMPexecTeamMember >&
+void parallel_reduce(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::OpenMPexecTeamMember >&
loop_boundaries, const Lambda & lambda, const JoinType& join, ValueType& init_result) {
ValueType result = init_result;
#ifdef KOKKOS_HAVE_PRAGMA_IVDEP
#pragma ivdep
#endif
for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
ValueType tmp = ValueType();
lambda(i,tmp);
join(result,tmp);
}
init_result = result;
}
/** \brief Intra-thread vector parallel exclusive prefix sum. Executes lambda(iType i, ValueType & val, bool final)
* for each i=0..N-1.
*
* The range i=0..N-1 is mapped to all vector lanes in the thread and a scan operation is performed.
* Depending on the target execution space the operator might be called twice: once with final=false
* and once with final=true. When final==true val contains the prefix sum value. The contribution of this
* "i" needs to be added to val no matter whether final==true or not. In a serial execution
* (i.e. team_size==1) the operator is only called once with final==true. Scan_val will be set
* to the final sum value over all vector lanes.
* This functionality requires C++11 support.*/
template< typename iType, class FunctorType >
KOKKOS_INLINE_FUNCTION
-void parallel_scan(const Impl::ThreadVectorLoopBoundariesStruct<iType,Impl::OpenMPexecTeamMember >&
+void parallel_scan(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::OpenMPexecTeamMember >&
loop_boundaries, const FunctorType & lambda) {
typedef Kokkos::Impl::FunctorValueTraits< FunctorType , void > ValueTraits ;
typedef typename ValueTraits::value_type value_type ;
value_type scan_val = value_type();
#ifdef KOKKOS_HAVE_PRAGMA_IVDEP
#pragma ivdep
#endif
for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
lambda(i,scan_val,true);
}
}
} // namespace Kokkos
namespace Kokkos {
template<class FunctorType>
KOKKOS_INLINE_FUNCTION
void single(const Impl::VectorSingleStruct<Impl::OpenMPexecTeamMember>& single_struct, const FunctorType& lambda) {
lambda();
}
template<class FunctorType>
KOKKOS_INLINE_FUNCTION
void single(const Impl::ThreadSingleStruct<Impl::OpenMPexecTeamMember>& single_struct, const FunctorType& lambda) {
if(single_struct.team_member.team_rank()==0) lambda();
}
template<class FunctorType, class ValueType>
KOKKOS_INLINE_FUNCTION
void single(const Impl::VectorSingleStruct<Impl::OpenMPexecTeamMember>& single_struct, const FunctorType& lambda, ValueType& val) {
lambda(val);
}
template<class FunctorType, class ValueType>
KOKKOS_INLINE_FUNCTION
void single(const Impl::ThreadSingleStruct<Impl::OpenMPexecTeamMember>& single_struct, const FunctorType& lambda, ValueType& val) {
if(single_struct.team_member.team_rank()==0) {
lambda(val);
}
single_struct.team_member.team_broadcast(val,0);
}
}
-#endif // KOKKOS_HAVE_CXX11
-
#endif /* #ifndef KOKKOS_OPENMPEXEC_HPP */
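The nested team- and vector-level operations defined above (the TeamThreadRange / ThreadVectorRange variants of parallel_for, parallel_reduce and parallel_scan, plus single with PerTeam / PerThread) are meant to be invoked from inside a team functor. The following is a minimal editorial sketch, not part of the patch: it assumes the OpenMP backend with C++11 lambdas enabled, and the functor name BlockSums, the views A and sums, and the extents nj / nk are purely illustrative.

#include <Kokkos_Core.hpp>

// Hypothetical functor: one team per 'i', threads split 'j', vector lanes split 'k'.
struct BlockSums {
  Kokkos::View<double***, Kokkos::OpenMP> A ;    // ni x nj x nk input (illustrative)
  Kokkos::View<double*,   Kokkos::OpenMP> sums ; // one entry per 'i' (illustrative)
  int nj , nk ;

  KOKKOS_INLINE_FUNCTION
  void operator()( const Kokkos::TeamPolicy<Kokkos::OpenMP>::member_type & member ) const
  {
    const int i = member.league_rank();
    double team_sum = 0 ;
    // Threads of the team split the 'j' direction ...
    Kokkos::parallel_reduce( Kokkos::TeamThreadRange( member , nj ) ,
      [&]( const int j , double & thread_sum ) {
        double lane_sum = 0 ;
        // ... and each thread's vector lanes split the 'k' direction.
        Kokkos::parallel_reduce( Kokkos::ThreadVectorRange( member , nk ) ,
          [&]( const int k , double & val ) { val += A(i,j,k); } , lane_sum );
        // Only one lane per thread folds the lane result into the thread result.
        Kokkos::single( Kokkos::PerThread( member ) , [&]() { thread_sum += lane_sum ; } );
      } , team_sum );
    // Only one thread of the team stores the per-'i' total.
    Kokkos::single( Kokkos::PerTeam( member ) , [&]() { sums(i) = team_sum ; } );
  }
};
// Dispatched roughly as:
//   Kokkos::parallel_for( Kokkos::TeamPolicy<Kokkos::OpenMP>( ni , team_size ) ,
//                         BlockSums{ A , sums , nj , nk } );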
diff --git a/lib/kokkos/core/src/Qthread/Kokkos_QthreadExec.cpp b/lib/kokkos/core/src/Qthread/Kokkos_QthreadExec.cpp
index ca76784a5..d8b40943d 100755
--- a/lib/kokkos/core/src/Qthread/Kokkos_QthreadExec.cpp
+++ b/lib/kokkos/core/src/Qthread/Kokkos_QthreadExec.cpp
@@ -1,380 +1,484 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
#include <Kokkos_Core_fwd.hpp>
#if defined( KOKKOS_HAVE_QTHREAD )
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <sstream>
#include <utility>
#include <Kokkos_Qthread.hpp>
#include <Kokkos_Atomic.hpp>
#include <impl/Kokkos_Error.hpp>
+// Defines to enable experimental Qthread functionality
+
#define QTHREAD_LOCAL_PRIORITY
+#define CLONED_TASKS
#include <qthread/qthread.h>
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
namespace {
enum { MAXIMUM_QTHREAD_WORKERS = 1024 };
/** s_exec is indexed by the reverse rank of the workers
* for faster fan-in / fan-out lookups
* [ n - 1 , n - 2 , ... , 0 ]
*/
QthreadExec * s_exec[ MAXIMUM_QTHREAD_WORKERS ];
int s_number_shepherds = 0 ;
int s_number_workers_per_shepherd = 0 ;
int s_number_workers = 0 ;
inline
QthreadExec ** worker_exec()
{
return s_exec + s_number_workers - ( qthread_shep() * s_number_workers_per_shepherd + qthread_worker_local(NULL) + 1 );
}
const int s_base_size = QthreadExec::align_alloc( sizeof(QthreadExec) );
int s_worker_reduce_end = 0 ; /* End of worker reduction memory */
int s_worker_shared_end = 0 ; /* Total of worker scratch memory */
int s_worker_shared_begin = 0 ; /* Beginning of worker shared memory */
-QthreadExecFunctionPointer s_active_function = 0 ;
-const void * s_active_function_arg = 0 ;
+QthreadExecFunctionPointer volatile s_active_function = 0 ;
+const void * volatile s_active_function_arg = 0 ;
} /* namespace */
} /* namespace Impl */
} /* namespace Kokkos */
//----------------------------------------------------------------------------
namespace Kokkos {
void Qthread::initialize( int thread_count )
{
// Environment variable: QTHREAD_NUM_SHEPHERDS
// Environment variable: QTHREAD_NUM_WORKERS_PER_SHEP
// Environment variable: QTHREAD_HWPAR
{
char buffer[256];
snprintf(buffer,sizeof(buffer),"QTHREAD_HWPAR=%d",thread_count);
putenv(buffer);
}
const bool ok_init = ( QTHREAD_SUCCESS == qthread_initialize() ) &&
( thread_count == qthread_num_shepherds() * qthread_num_workers_local(NO_SHEPHERD) ) &&
( thread_count == qthread_num_workers() );
bool ok_symmetry = true ;
if ( ok_init ) {
Impl::s_number_shepherds = qthread_num_shepherds();
Impl::s_number_workers_per_shepherd = qthread_num_workers_local(NO_SHEPHERD);
Impl::s_number_workers = Impl::s_number_shepherds * Impl::s_number_workers_per_shepherd ;
for ( int i = 0 ; ok_symmetry && i < Impl::s_number_shepherds ; ++i ) {
ok_symmetry = ( Impl::s_number_workers_per_shepherd == qthread_num_workers_local(i) );
}
}
if ( ! ok_init || ! ok_symmetry ) {
std::ostringstream msg ;
msg << "Kokkos::Qthread::initialize(" << thread_count << ") FAILED" ;
msg << " : qthread_num_shepherds = " << qthread_num_shepherds();
msg << " : qthread_num_workers_per_shepherd = " << qthread_num_workers_local(NO_SHEPHERD);
msg << " : qthread_num_workers = " << qthread_num_workers();
if ( ! ok_symmetry ) {
msg << " : qthread_num_workers_local = {" ;
for ( int i = 0 ; i < Impl::s_number_shepherds ; ++i ) {
msg << " " << qthread_num_workers_local(i) ;
}
msg << " }" ;
}
Impl::s_number_workers = 0 ;
Impl::s_number_shepherds = 0 ;
Impl::s_number_workers_per_shepherd = 0 ;
if ( ok_init ) { qthread_finalize(); }
Kokkos::Impl::throw_runtime_exception( msg.str() );
}
Impl::QthreadExec::resize_worker_scratch( 256 , 256 );
+
+  // Init the lock array used for arbitrarily sized atomics
+ Impl::init_lock_array_host_space();
+
}
void Qthread::finalize()
{
Impl::QthreadExec::clear_workers();
if ( Impl::s_number_workers ) {
qthread_finalize();
}
Impl::s_number_workers = 0 ;
Impl::s_number_shepherds = 0 ;
Impl::s_number_workers_per_shepherd = 0 ;
}
void Qthread::print_configuration( std::ostream & s , const bool detail )
{
s << "Kokkos::Qthread {"
<< " num_shepherds(" << Impl::s_number_shepherds << ")"
<< " num_workers_per_shepherd(" << Impl::s_number_workers_per_shepherd << ")"
<< " }" << std::endl ;
}
Qthread & Qthread::instance( int )
{
static Qthread q ;
return q ;
}
void Qthread::fence()
{
}
int Qthread::shepherd_size() const { return Impl::s_number_shepherds ; }
int Qthread::shepherd_worker_size() const { return Impl::s_number_workers_per_shepherd ; }
} /* namespace Kokkos */
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
namespace {
aligned_t driver_exec_all( void * arg )
{
- (*s_active_function)( ** worker_exec() , s_active_function_arg );
+ QthreadExec & exec = **worker_exec();
+
+ (*s_active_function)( exec , s_active_function_arg );
+
+/*
+ fprintf( stdout
+ , "QthreadExec driver worker(%d:%d) shepherd(%d:%d) shepherd_worker(%d:%d) done\n"
+ , exec.worker_rank()
+ , exec.worker_size()
+ , exec.shepherd_rank()
+ , exec.shepherd_size()
+ , exec.shepherd_worker_rank()
+ , exec.shepherd_worker_size()
+ );
+ fflush(stdout);
+*/
return 0 ;
}
aligned_t driver_resize_worker_scratch( void * arg )
{
static volatile int lock_begin = 0 ;
static volatile int lock_end = 0 ;
QthreadExec ** const exec = worker_exec();
//----------------------------------------
// Serialize allocation for thread safety
while ( ! atomic_compare_exchange_strong( & lock_begin , 0 , 1 ) ); // Spin wait to claim lock
const bool ok = 0 == *exec ;
if ( ok ) { *exec = (QthreadExec *) malloc( s_base_size + s_worker_shared_end ); }
lock_begin = 0 ; // release lock
if ( ok ) { new( *exec ) QthreadExec(); }
//----------------------------------------
 // Wait for all calls to complete to ensure that each worker has executed.
if ( s_number_workers == 1 + atomic_fetch_add( & lock_end , 1 ) ) { lock_end = 0 ; }
while ( lock_end );
+/*
+ fprintf( stdout
+ , "QthreadExec resize worker(%d:%d) shepherd(%d:%d) shepherd_worker(%d:%d) done\n"
+ , (**exec).worker_rank()
+ , (**exec).worker_size()
+ , (**exec).shepherd_rank()
+ , (**exec).shepherd_size()
+ , (**exec).shepherd_worker_rank()
+ , (**exec).shepherd_worker_size()
+ );
+ fflush(stdout);
+*/
+
//----------------------------------------
+ if ( ! ok ) {
+ fprintf( stderr , "Kokkos::QthreadExec resize failed\n" );
+ fflush( stderr );
+ }
+
return 0 ;
}
void verify_is_process( const char * const label , bool not_active = false )
{
const bool not_process = 0 != qthread_shep() || 0 != qthread_worker_local(NULL);
const bool is_active = not_active && ( s_active_function || s_active_function_arg );
if ( not_process || is_active ) {
std::string msg( label );
msg.append( " : FAILED" );
if ( not_process ) msg.append(" : not called by main process");
if ( is_active ) msg.append(" : parallel execution in progress");
Kokkos::Impl::throw_runtime_exception( msg );
}
}
}
+int QthreadExec::worker_per_shepherd()
+{
+ return s_number_workers_per_shepherd ;
+}
+
QthreadExec::QthreadExec()
{
const int shepherd_rank = qthread_shep();
const int shepherd_worker_rank = qthread_worker_local(NULL);
const int worker_rank = shepherd_rank * s_number_workers_per_shepherd + shepherd_worker_rank ;
m_worker_base = s_exec ;
m_shepherd_base = s_exec + s_number_workers_per_shepherd * ( ( s_number_shepherds - ( shepherd_rank + 1 ) ) );
m_scratch_alloc = ( (unsigned char *) this ) + s_base_size ;
m_reduce_end = s_worker_reduce_end ;
m_shepherd_rank = shepherd_rank ;
m_shepherd_size = s_number_shepherds ;
m_shepherd_worker_rank = shepherd_worker_rank ;
m_shepherd_worker_size = s_number_workers_per_shepherd ;
m_worker_rank = worker_rank ;
m_worker_size = s_number_workers ;
m_worker_state = QthreadExec::Active ;
}
void QthreadExec::clear_workers()
{
for ( int iwork = 0 ; iwork < s_number_workers ; ++iwork ) {
- free( s_exec[iwork] );
+ QthreadExec * const exec = s_exec[iwork] ;
s_exec[iwork] = 0 ;
+ free( exec );
}
}
void QthreadExec::shared_reset( Qthread::scratch_memory_space & space )
{
new( & space )
Qthread::scratch_memory_space(
((unsigned char *) (**m_shepherd_base).m_scratch_alloc ) + s_worker_shared_begin ,
s_worker_shared_end - s_worker_shared_begin
);
}
void QthreadExec::resize_worker_scratch( const int reduce_size , const int shared_size )
{
const int exec_all_reduce_alloc = align_alloc( reduce_size );
const int shepherd_scan_alloc = align_alloc( 8 );
const int shepherd_shared_end = exec_all_reduce_alloc + shepherd_scan_alloc + align_alloc( shared_size );
if ( s_worker_reduce_end < exec_all_reduce_alloc ||
s_worker_shared_end < shepherd_shared_end ) {
+/*
+ fprintf( stdout , "QthreadExec::resize\n");
+ fflush(stdout);
+*/
+
// Clear current worker memory before allocating new worker memory
clear_workers();
// Increase the buffers to an aligned allocation
s_worker_reduce_end = exec_all_reduce_alloc ;
s_worker_shared_begin = exec_all_reduce_alloc + shepherd_scan_alloc ;
s_worker_shared_end = shepherd_shared_end ;
// Need to query which shepherd this main 'process' is running...
+
+ const int main_shep = qthread_shep();
// Have each worker resize its memory for proper first-touch
+#if 0
for ( int jshep = 0 ; jshep < s_number_shepherds ; ++jshep ) {
- for ( int i = jshep ? 0 : 1 ; i < s_number_workers_per_shepherd ; ++i ) {
-
- // Unit tests hang with this call:
- //
- // qthread_fork_to_local_priority( driver_resize_workers , NULL , NULL , jshep );
- //
-
+ for ( int i = jshep != main_shep ? 0 : 1 ; i < s_number_workers_per_shepherd ; ++i ) {
qthread_fork_to( driver_resize_worker_scratch , NULL , NULL , jshep );
}}
+#else
+  // If this function is used before the 'qthread.task_policy' unit test,
+  // that unit test fails with a seg-fault within libqthread.so.
+ for ( int jshep = 0 ; jshep < s_number_shepherds ; ++jshep ) {
+ const int num_clone = jshep != main_shep ? s_number_workers_per_shepherd : s_number_workers_per_shepherd - 1 ;
+
+ if ( num_clone ) {
+ const int ret = qthread_fork_clones_to_local_priority
+ ( driver_resize_worker_scratch /* function */
+ , NULL /* function data block */
+ , NULL /* pointer to return value feb */
+ , jshep /* shepherd number */
+ , num_clone - 1 /* number of instances - 1 */
+ );
+
+ assert(ret == QTHREAD_SUCCESS);
+ }
+ }
+#endif
driver_resize_worker_scratch( NULL );
// Verify all workers allocated
bool ok = true ;
for ( int iwork = 0 ; ok && iwork < s_number_workers ; ++iwork ) { ok = 0 != s_exec[iwork] ; }
if ( ! ok ) {
std::ostringstream msg ;
msg << "Kokkos::Impl::QthreadExec::resize : FAILED for workers {" ;
for ( int iwork = 0 ; iwork < s_number_workers ; ++iwork ) {
if ( 0 == s_exec[iwork] ) { msg << " " << ( s_number_workers - ( iwork + 1 ) ); }
}
msg << " }" ;
Kokkos::Impl::throw_runtime_exception( msg.str() );
}
}
}
void QthreadExec::exec_all( Qthread & , QthreadExecFunctionPointer func , const void * arg )
{
verify_is_process("QthreadExec::exec_all(...)",true);
+/*
+ fprintf( stdout , "QthreadExec::exec_all\n");
+ fflush(stdout);
+*/
+
s_active_function = func ;
s_active_function_arg = arg ;
// Need to query which shepherd this main 'process' is running...
const int main_shep = qthread_shep();
+#if 0
for ( int jshep = 0 , iwork = 0 ; jshep < s_number_shepherds ; ++jshep ) {
for ( int i = jshep != main_shep ? 0 : 1 ; i < s_number_workers_per_shepherd ; ++i , ++iwork ) {
-
- // Unit tests hang with this call:
- //
- // qthread_fork_to_local_priority( driver_exec_all , NULL , NULL , jshep );
- //
-
qthread_fork_to( driver_exec_all , NULL , NULL , jshep );
}}
+#else
+  // If this function is used before the 'qthread.task_policy' unit test,
+  // that unit test fails with a seg-fault within libqthread.so.
+ for ( int jshep = 0 ; jshep < s_number_shepherds ; ++jshep ) {
+ const int num_clone = jshep != main_shep ? s_number_workers_per_shepherd : s_number_workers_per_shepherd - 1 ;
+
+ if ( num_clone ) {
+ const int ret = qthread_fork_clones_to_local_priority
+ ( driver_exec_all /* function */
+ , NULL /* function data block */
+ , NULL /* pointer to return value feb */
+ , jshep /* shepherd number */
+ , num_clone - 1 /* number of instances - 1 */
+ );
+
+ assert(ret == QTHREAD_SUCCESS);
+ }
+ }
+#endif
driver_exec_all( NULL );
s_active_function = 0 ;
s_active_function_arg = 0 ;
}
void * QthreadExec::exec_all_reduce_result()
{
return s_exec[0]->m_scratch_alloc ;
}
} /* namespace Impl */
} /* namespace Kokkos */
+namespace Kokkos {
+namespace Impl {
+
+QthreadTeamPolicyMember::QthreadTeamPolicyMember()
+ : m_exec( **worker_exec() )
+ , m_team_shared(0,0)
+ , m_team_size( 1 ) // s_number_workers_per_shepherd )
+ , m_team_rank( 0 ) // m_exec.shepherd_worker_rank() )
+ , m_league_size(1)
+ , m_league_end(1)
+ , m_league_rank(0)
+{
+ m_exec.shared_reset( m_team_shared );
+}
+
+} /* namespace Impl */
+} /* namespace Kokkos */
+
//----------------------------------------------------------------------------
#endif /* #if defined( KOKKOS_HAVE_QTHREAD ) */
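For orientation, a minimal editorial sketch of the host-side sequence exercised by the initialize / finalize code above. It assumes a build with KOKKOS_HAVE_QTHREAD and the static interface declared in Kokkos_Qthread.hpp; the worker count of 4 is arbitrary and must equal the number of shepherds times the workers per shepherd, otherwise initialize() throws as implemented above.

#include <Kokkos_Qthread.hpp>
#include <iostream>

int main()
{
  // Sets QTHREAD_HWPAR=4 and verifies the shepherd/worker layout is symmetric.
  Kokkos::Qthread::initialize( 4 );

  // Prints num_shepherds and num_workers_per_shepherd as formatted above.
  Kokkos::Qthread::print_configuration( std::cout , false );

  // ... parallel kernels dispatched on the Qthread execution space ...

  Kokkos::Qthread::finalize();
  return 0 ;
}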
diff --git a/lib/kokkos/core/src/Qthread/Kokkos_QthreadExec.hpp b/lib/kokkos/core/src/Qthread/Kokkos_QthreadExec.hpp
index 5ed544c13..365883685 100755
--- a/lib/kokkos/core/src/Qthread/Kokkos_QthreadExec.hpp
+++ b/lib/kokkos/core/src/Qthread/Kokkos_QthreadExec.hpp
@@ -1,580 +1,614 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_QTHREADEXEC_HPP
#define KOKKOS_QTHREADEXEC_HPP
#include <impl/Kokkos_spinwait.hpp>
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
//----------------------------------------------------------------------------
class QthreadExec ;
typedef void (*QthreadExecFunctionPointer)( QthreadExec & , const void * );
class QthreadExec {
private:
enum { Inactive = 0 , Active = 1 };
const QthreadExec * const * m_worker_base ;
const QthreadExec * const * m_shepherd_base ;
void * m_scratch_alloc ; ///< Scratch memory [ reduce , team , shared ]
int m_reduce_end ; ///< End of scratch reduction memory
int m_shepherd_rank ;
int m_shepherd_size ;
int m_shepherd_worker_rank ;
int m_shepherd_worker_size ;
/*
* m_worker_rank = m_shepherd_rank * m_shepherd_worker_size + m_shepherd_worker_rank
* m_worker_size = m_shepherd_size * m_shepherd_worker_size
*/
int m_worker_rank ;
int m_worker_size ;
int mutable volatile m_worker_state ;
friend class Kokkos::Qthread ;
~QthreadExec();
QthreadExec( const QthreadExec & );
QthreadExec & operator = ( const QthreadExec & );
public:
QthreadExec();
/** Execute the input function on all available Qthread workers */
static void exec_all( Qthread & , QthreadExecFunctionPointer , const void * );
//----------------------------------------
/** Barrier across all workers participating in the 'exec_all' */
void exec_all_barrier() const
{
const int rev_rank = m_worker_size - ( m_worker_rank + 1 );
int n , j ;
for ( n = 1 ; ( ! ( rev_rank & n ) ) && ( ( j = rev_rank + n ) < m_worker_size ) ; n <<= 1 ) {
Impl::spinwait( m_worker_base[j]->m_worker_state , QthreadExec::Active );
}
if ( rev_rank ) {
m_worker_state = QthreadExec::Inactive ;
Impl::spinwait( m_worker_state , QthreadExec::Inactive );
}
for ( n = 1 ; ( ! ( rev_rank & n ) ) && ( ( j = rev_rank + n ) < m_worker_size ) ; n <<= 1 ) {
m_worker_base[j]->m_worker_state = QthreadExec::Active ;
}
}
/** Barrier across workers within the shepherd with rank < team_rank */
void shepherd_barrier( const int team_size ) const
{
if ( m_shepherd_worker_rank < team_size ) {
const int rev_rank = team_size - ( m_shepherd_worker_rank + 1 );
int n , j ;
for ( n = 1 ; ( ! ( rev_rank & n ) ) && ( ( j = rev_rank + n ) < team_size ) ; n <<= 1 ) {
Impl::spinwait( m_shepherd_base[j]->m_worker_state , QthreadExec::Active );
}
if ( rev_rank ) {
m_worker_state = QthreadExec::Inactive ;
Impl::spinwait( m_worker_state , QthreadExec::Inactive );
}
for ( n = 1 ; ( ! ( rev_rank & n ) ) && ( ( j = rev_rank + n ) < team_size ) ; n <<= 1 ) {
m_shepherd_base[j]->m_worker_state = QthreadExec::Active ;
}
}
}
//----------------------------------------
/** Reduce across all workers participating in the 'exec_all' */
template< class FunctorType , class ArgTag >
inline
void exec_all_reduce( const FunctorType & func ) const
{
typedef Kokkos::Impl::FunctorValueJoin< FunctorType , ArgTag > ValueJoin ;
- typedef Kokkos::Impl::FunctorValueOps< FunctorType , ArgTag > ValueOps ;
const int rev_rank = m_worker_size - ( m_worker_rank + 1 );
int n , j ;
for ( n = 1 ; ( ! ( rev_rank & n ) ) && ( ( j = rev_rank + n ) < m_worker_size ) ; n <<= 1 ) {
const QthreadExec & fan = *m_worker_base[j];
Impl::spinwait( fan.m_worker_state , QthreadExec::Active );
ValueJoin::join( func , m_scratch_alloc , fan.m_scratch_alloc );
}
if ( rev_rank ) {
m_worker_state = QthreadExec::Inactive ;
Impl::spinwait( m_worker_state , QthreadExec::Inactive );
}
for ( n = 1 ; ( ! ( rev_rank & n ) ) && ( ( j = rev_rank + n ) < m_worker_size ) ; n <<= 1 ) {
m_worker_base[j]->m_worker_state = QthreadExec::Active ;
}
}
//----------------------------------------
 /** Scan across all workers participating in the 'exec_all' */
template< class FunctorType , class ArgTag >
inline
void exec_all_scan( const FunctorType & func ) const
{
typedef Kokkos::Impl::FunctorValueInit< FunctorType , ArgTag > ValueInit ;
typedef Kokkos::Impl::FunctorValueJoin< FunctorType , ArgTag > ValueJoin ;
typedef Kokkos::Impl::FunctorValueOps< FunctorType , ArgTag > ValueOps ;
const int rev_rank = m_worker_size - ( m_worker_rank + 1 );
int n , j ;
for ( n = 1 ; ( ! ( rev_rank & n ) ) && ( ( j = rev_rank + n ) < m_worker_size ) ; n <<= 1 ) {
Impl::spinwait( m_worker_base[j]->m_worker_state , QthreadExec::Active );
}
if ( rev_rank ) {
m_worker_state = QthreadExec::Inactive ;
Impl::spinwait( m_worker_state , QthreadExec::Inactive );
}
else {
// Root thread scans across values before releasing threads
// Worker data is in reverse order, so m_worker_base[0] is the
// highest ranking thread.
// Copy from lower ranking to higher ranking worker.
- for ( int i = 1 ; i < n ; ++i ) {
- ValueOps::copy( func , m_worker_base[i-1]->m_scratch_alloc
- , m_worker_base[i]->m_scratch_alloc );
+ for ( int i = 1 ; i < m_worker_size ; ++i ) {
+ ValueOps::copy( func
+ , m_worker_base[i-1]->m_scratch_alloc
+ , m_worker_base[i]->m_scratch_alloc
+ );
}
- ValueInit::init( func , m_worker_base[n-1]->m_scratch_alloc );
+ ValueInit::init( func , m_worker_base[m_worker_size-1]->m_scratch_alloc );
// Join from lower ranking to higher ranking worker.
 // Value at m_worker_base[m_worker_size-1] is zero so skip adding it to m_worker_base[m_worker_size-2].
- for ( int i = n - 1 ; --i ; ) {
+ for ( int i = m_worker_size - 1 ; --i ; ) {
ValueJoin::join( func , m_worker_base[i-1]->m_scratch_alloc , m_worker_base[i]->m_scratch_alloc );
}
}
for ( n = 1 ; ( ! ( rev_rank & n ) ) && ( ( j = rev_rank + n ) < m_worker_size ) ; n <<= 1 ) {
m_worker_base[j]->m_worker_state = QthreadExec::Active ;
}
}
//----------------------------------------
template< class Type>
inline
volatile Type * shepherd_team_scratch_value() const
{ return (volatile Type*)(((unsigned char *) m_scratch_alloc) + m_reduce_end); }
+ template< class Type >
+ inline
+ void shepherd_broadcast( Type & value , const int team_size , const int team_rank ) const
+ {
+ if ( m_shepherd_base ) {
+ Type * const shared_value = m_shepherd_base[0]->shepherd_team_scratch_value<Type>();
+ if ( m_shepherd_worker_rank == team_rank ) { *shared_value = value ; }
+ memory_fence();
+ shepherd_barrier( team_size );
+ value = *shared_value ;
+ }
+ }
+
template< class Type >
inline
Type shepherd_reduce( const int team_size , const Type & value ) const
{
*shepherd_team_scratch_value<Type>() = value ;
memory_fence();
const int rev_rank = team_size - ( m_shepherd_worker_rank + 1 );
int n , j ;
for ( n = 1 ; ( ! ( rev_rank & n ) ) && ( ( j = rev_rank + n ) < team_size ) ; n <<= 1 ) {
Impl::spinwait( m_shepherd_base[j]->m_worker_state , QthreadExec::Active );
}
if ( rev_rank ) {
m_worker_state = QthreadExec::Inactive ;
Impl::spinwait( m_worker_state , QthreadExec::Inactive );
}
else {
Type & accum = * m_shepherd_base[0]->shepherd_team_scratch_value<Type>();
for ( int i = 1 ; i < n ; ++i ) {
accum += * m_shepherd_base[i]->shepherd_team_scratch_value<Type>();
}
for ( int i = 1 ; i < n ; ++i ) {
* m_shepherd_base[i]->shepherd_team_scratch_value<Type>() = accum ;
}
memory_fence();
}
for ( n = 1 ; ( ! ( rev_rank & n ) ) && ( ( j = rev_rank + n ) < team_size ) ; n <<= 1 ) {
m_shepherd_base[j]->m_worker_state = QthreadExec::Active ;
}
return *shepherd_team_scratch_value<Type>();
}
template< class JoinOp >
inline
typename JoinOp::value_type
shepherd_reduce( const int team_size
, const typename JoinOp::value_type & value
, const JoinOp & op ) const
{
typedef typename JoinOp::value_type Type ;
*shepherd_team_scratch_value<Type>() = value ;
memory_fence();
const int rev_rank = team_size - ( m_shepherd_worker_rank + 1 );
int n , j ;
for ( n = 1 ; ( ! ( rev_rank & n ) ) && ( ( j = rev_rank + n ) < team_size ) ; n <<= 1 ) {
Impl::spinwait( m_shepherd_base[j]->m_worker_state , QthreadExec::Active );
}
if ( rev_rank ) {
m_worker_state = QthreadExec::Inactive ;
Impl::spinwait( m_worker_state , QthreadExec::Inactive );
}
else {
volatile Type & accum = * m_shepherd_base[0]->shepherd_team_scratch_value<Type>();
- for ( int i = 1 ; i < n ; ++i ) {
+ for ( int i = 1 ; i < team_size ; ++i ) {
op.join( accum , * m_shepherd_base[i]->shepherd_team_scratch_value<Type>() );
}
- for ( int i = 1 ; i < n ; ++i ) {
+ for ( int i = 1 ; i < team_size ; ++i ) {
* m_shepherd_base[i]->shepherd_team_scratch_value<Type>() = accum ;
}
memory_fence();
}
for ( n = 1 ; ( ! ( rev_rank & n ) ) && ( ( j = rev_rank + n ) < team_size ) ; n <<= 1 ) {
m_shepherd_base[j]->m_worker_state = QthreadExec::Active ;
}
return *shepherd_team_scratch_value<Type>();
}
template< class Type >
inline
Type shepherd_scan( const int team_size
, const Type & value
, Type * const global_value = 0 ) const
{
*shepherd_team_scratch_value<Type>() = value ;
memory_fence();
const int rev_rank = team_size - ( m_shepherd_worker_rank + 1 );
int n , j ;
for ( n = 1 ; ( ! ( rev_rank & n ) ) && ( ( j = rev_rank + n ) < team_size ) ; n <<= 1 ) {
Impl::spinwait( m_shepherd_base[j]->m_worker_state , QthreadExec::Active );
}
if ( rev_rank ) {
m_worker_state = QthreadExec::Inactive ;
Impl::spinwait( m_worker_state , QthreadExec::Inactive );
}
else {
// Root thread scans across values before releasing threads
// Worker data is in reverse order, so m_shepherd_base[0] is the
// highest ranking thread.
// Copy from lower ranking to higher ranking worker.
Type accum = * m_shepherd_base[0]->shepherd_team_scratch_value<Type>();
- for ( int i = 1 ; i < n ; ++i ) {
+ for ( int i = 1 ; i < team_size ; ++i ) {
const Type tmp = * m_shepherd_base[i]->shepherd_team_scratch_value<Type>();
accum += tmp ;
* m_shepherd_base[i-1]->shepherd_team_scratch_value<Type>() = tmp ;
}
- * m_shepherd_base[n-1]->shepherd_team_scratch_value<Type>() =
+ * m_shepherd_base[team_size-1]->shepherd_team_scratch_value<Type>() =
global_value ? atomic_fetch_add( global_value , accum ) : 0 ;
// Join from lower ranking to higher ranking worker.
- for ( int i = n ; --i ; ) {
+ for ( int i = team_size ; --i ; ) {
* m_shepherd_base[i-1]->shepherd_team_scratch_value<Type>() += * m_shepherd_base[i]->shepherd_team_scratch_value<Type>();
}
memory_fence();
}
for ( n = 1 ; ( ! ( rev_rank & n ) ) && ( ( j = rev_rank + n ) < team_size ) ; n <<= 1 ) {
m_shepherd_base[j]->m_worker_state = QthreadExec::Active ;
}
return *shepherd_team_scratch_value<Type>();
}
//----------------------------------------
static inline
int align_alloc( int size )
{
enum { ALLOC_GRAIN = 1 << 6 /* power of two, 64bytes */};
enum { ALLOC_GRAIN_MASK = ALLOC_GRAIN - 1 };
return ( size + ALLOC_GRAIN_MASK ) & ~ALLOC_GRAIN_MASK ;
}
void shared_reset( Qthread::scratch_memory_space & );
void * exec_all_reduce_value() const { return m_scratch_alloc ; }
static void * exec_all_reduce_result();
static void resize_worker_scratch( const int reduce_size , const int shared_size );
static void clear_workers();
//----------------------------------------
inline int worker_rank() const { return m_worker_rank ; }
inline int worker_size() const { return m_worker_size ; }
inline int shepherd_worker_rank() const { return m_shepherd_worker_rank ; }
inline int shepherd_worker_size() const { return m_shepherd_worker_size ; }
inline int shepherd_rank() const { return m_shepherd_rank ; }
inline int shepherd_size() const { return m_shepherd_size ; }
+
+ static int worker_per_shepherd();
};
} /* namespace Impl */
} /* namespace Kokkos */
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
class QthreadTeamPolicyMember {
private:
typedef Kokkos::Qthread execution_space ;
typedef execution_space::scratch_memory_space scratch_memory_space ;
Impl::QthreadExec & m_exec ;
scratch_memory_space m_team_shared ;
const int m_team_size ;
const int m_team_rank ;
const int m_league_size ;
const int m_league_end ;
int m_league_rank ;
public:
KOKKOS_INLINE_FUNCTION
const scratch_memory_space & team_shmem() const { return m_team_shared ; }
KOKKOS_INLINE_FUNCTION int league_rank() const { return m_league_rank ; }
KOKKOS_INLINE_FUNCTION int league_size() const { return m_league_size ; }
KOKKOS_INLINE_FUNCTION int team_rank() const { return m_team_rank ; }
KOKKOS_INLINE_FUNCTION int team_size() const { return m_team_size ; }
KOKKOS_INLINE_FUNCTION void team_barrier() const
#if ! defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
{}
#else
{ m_exec.shepherd_barrier( m_team_size ); }
#endif
+ template< typename Type >
+ KOKKOS_INLINE_FUNCTION Type team_broadcast( const Type & value , int rank ) const
+#if ! defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
+ { return Type(); }
+#else
+ { return m_exec.template shepherd_broadcast<Type>( value , m_team_size , rank ); }
+#endif
+
template< typename Type >
KOKKOS_INLINE_FUNCTION Type team_reduce( const Type & value ) const
#if ! defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
{ return Type(); }
#else
{ return m_exec.template shepherd_reduce<Type>( m_team_size , value ); }
#endif
template< typename JoinOp >
KOKKOS_INLINE_FUNCTION typename JoinOp::value_type
team_reduce( const typename JoinOp::value_type & value
, const JoinOp & op ) const
#if ! defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
{ return typename JoinOp::value_type(); }
#else
{ return m_exec.template shepherd_reduce<JoinOp>( m_team_size , value , op ); }
#endif
/** \brief Intra-team exclusive prefix sum with team_rank() ordering.
*
* The highest rank thread can compute the reduction total as
* reduction_total = dev.team_scan( value ) + value ;
*/
template< typename Type >
KOKKOS_INLINE_FUNCTION Type team_scan( const Type & value ) const
#if ! defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
{ return Type(); }
#else
{ return m_exec.template shepherd_scan<Type>( m_team_size , value ); }
#endif
/** \brief Intra-team exclusive prefix sum with team_rank() ordering
* with intra-team non-deterministic ordering accumulation.
*
* The global inter-team accumulation value will, at the end of the
* league's parallel execution, be the scan's total.
* Parallel execution ordering of the league's teams is non-deterministic.
* As such the base value for each team's scan operation is similarly
* non-deterministic.
*/
template< typename Type >
KOKKOS_INLINE_FUNCTION Type team_scan( const Type & value , Type * const global_accum ) const
#if ! defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
{ return Type(); }
#else
{ return m_exec.template shepherd_scan<Type>( m_team_size , value , global_accum ); }
#endif
+ //----------------------------------------
+ // Private driver for task-team parallel
+
+ QthreadTeamPolicyMember();
+
//----------------------------------------
// Private for the driver ( for ( member_type i(exec,team); i ; i.next_team() ) { ... }
// Initialize
template< class Arg0 , class Arg1 >
QthreadTeamPolicyMember( Impl::QthreadExec & exec , const TeamPolicy<Arg0,Arg1,Qthread> & team )
: m_exec( exec )
, m_team_shared(0,0)
, m_team_size( team.m_team_size )
, m_team_rank( exec.shepherd_worker_rank() )
, m_league_size( team.m_league_size )
, m_league_end( team.m_league_size - team.m_shepherd_iter * ( exec.shepherd_size() - ( exec.shepherd_rank() + 1 ) ) )
, m_league_rank( m_league_end > team.m_shepherd_iter ? m_league_end - team.m_shepherd_iter : 0 )
{
m_exec.shared_reset( m_team_shared );
}
// Continue
operator bool () const { return m_league_rank < m_league_end ; }
// iterate
void next_team() { ++m_league_rank ; m_exec.shared_reset( m_team_shared ); }
};
} // namespace Impl
template< class Arg0 , class Arg1 >
class TeamPolicy< Arg0 , Arg1 , Kokkos::Qthread >
{
private:
const int m_league_size ;
const int m_team_size ;
const int m_shepherd_iter ;
public:
//! Tag this class as a kokkos execution policy
typedef TeamPolicy execution_policy ;
typedef Qthread execution_space ;
typedef typename
Impl::if_c< ! Impl::is_same< Kokkos::Qthread , Arg0 >::value , Arg0 , Arg1 >::type
work_tag ;
//----------------------------------------
template< class FunctorType >
inline static
int team_size_max( const FunctorType & )
{ return Qthread::instance().shepherd_worker_size(); }
template< class FunctorType >
static int team_size_recommended( const FunctorType & f )
{ return team_size_max( f ); }
+ template< class FunctorType >
+ inline static
+ int team_size_recommended( const FunctorType & f , const int& )
+ { return team_size_max( f ); }
+
//----------------------------------------
inline int team_size() const { return m_team_size ; }
inline int league_size() const { return m_league_size ; }
// One active team per shepherd
TeamPolicy( Kokkos::Qthread & q
, const int league_size
, const int team_size
)
: m_league_size( league_size )
, m_team_size( team_size < q.shepherd_worker_size()
? team_size : q.shepherd_worker_size() )
, m_shepherd_iter( ( league_size + q.shepherd_size() - 1 ) / q.shepherd_size() )
{
}
// One active team per shepherd
TeamPolicy( const int league_size
, const int team_size
)
: m_league_size( league_size )
, m_team_size( team_size < Qthread::instance().shepherd_worker_size()
? team_size : Qthread::instance().shepherd_worker_size() )
, m_shepherd_iter( ( league_size + Qthread::instance().shepherd_size() - 1 ) / Qthread::instance().shepherd_size() )
{
}
typedef Impl::QthreadTeamPolicyMember member_type ;
friend class Impl::QthreadTeamPolicyMember ;
};
} /* namespace Kokkos */
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
#endif /* #define KOKKOS_QTHREADEXEC_HPP */
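A brief editorial sketch, not part of the patch, of how the TeamPolicy specialization above is typically constructed and used; the functor TeamCount and the view team_sizes are hypothetical. Note that one team is active per shepherd and the requested team size is clamped to shepherd_worker_size().

#include <Kokkos_Core.hpp>

struct TeamCount {
  typedef Kokkos::TeamPolicy< Kokkos::Qthread > policy_type ;

  Kokkos::View<int*, Kokkos::Qthread> team_sizes ; // one entry per league rank (illustrative)

  KOKKOS_INLINE_FUNCTION
  void operator()( const policy_type::member_type & member ) const
  {
    // Each worker contributes 1; team_reduce() above sums across the team.
    const int active = member.team_reduce( 1 );
    if ( member.team_rank() == 0 ) { team_sizes( member.league_rank() ) = active ; }
  }
};

// Request 8 teams of up to 4 workers each:
//   Kokkos::parallel_for( TeamCount::policy_type( 8 , 4 ) , TeamCount{ team_sizes } );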
diff --git a/lib/kokkos/core/src/Qthread/Kokkos_Qthread_Parallel.hpp b/lib/kokkos/core/src/Qthread/Kokkos_Qthread_Parallel.hpp
index ab89c0519..dc76a0c42 100755
--- a/lib/kokkos/core/src/Qthread/Kokkos_Qthread_Parallel.hpp
+++ b/lib/kokkos/core/src/Qthread/Kokkos_Qthread_Parallel.hpp
@@ -1,418 +1,643 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_QTHREAD_PARALLEL_HPP
#define KOKKOS_QTHREAD_PARALLEL_HPP
#include <vector>
#include <Kokkos_Parallel.hpp>
#include <impl/Kokkos_StaticAssert.hpp>
#include <impl/Kokkos_FunctorAdapter.hpp>
#include <Qthread/Kokkos_QthreadExec.hpp>
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
//----------------------------------------------------------------------------
template< class FunctorType , class Arg0 , class Arg1 , class Arg2 >
class ParallelFor< FunctorType , Kokkos::RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::Qthread > >
{
private:
typedef Kokkos::RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::Qthread > Policy ;
const FunctorType m_func ;
const Policy m_policy ;
template< class PType >
KOKKOS_FORCEINLINE_FUNCTION static
void driver( typename Impl::enable_if<
( Impl::is_same< typename PType::work_tag , void >::value )
, const FunctorType & >::type functor
, const PType & range )
{
const typename PType::member_type e = range.end();
for ( typename PType::member_type i = range.begin() ; i < e ; ++i ) {
functor( i );
}
}
template< class PType >
KOKKOS_FORCEINLINE_FUNCTION static
void driver( typename Impl::enable_if<
( ! Impl::is_same< typename PType::work_tag , void >::value )
, const FunctorType & >::type functor
, const PType & range )
{
const typename PType::member_type e = range.end();
for ( typename PType::member_type i = range.begin() ; i < e ; ++i ) {
functor( typename PType::work_tag() , i );
}
}
// Function is called once by every concurrent thread.
static void execute( QthreadExec & exec , const void * arg )
{
const ParallelFor & self = * ((const ParallelFor *) arg );
driver( self.m_func , typename Policy::WorkRange( self.m_policy , exec.worker_rank() , exec.worker_size() ) );
// All threads wait for completion.
exec.exec_all_barrier();
}
public:
ParallelFor( const FunctorType & functor
, const Policy & policy
)
: m_func( functor )
, m_policy( policy )
{
Impl::QthreadExec::exec_all( Qthread::instance() , & ParallelFor::execute , this );
}
};
//----------------------------------------------------------------------------
template< class FunctorType , class Arg0 , class Arg1 , class Arg2 >
class ParallelReduce< FunctorType , Kokkos::RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::Qthread > >
{
private:
typedef Kokkos::RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::Qthread > Policy ;
typedef Kokkos::Impl::FunctorValueTraits< FunctorType , typename Policy::work_tag > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< FunctorType , typename Policy::work_tag > ValueInit ;
typedef typename ValueTraits::pointer_type pointer_type ;
typedef typename ValueTraits::reference_type reference_type ;
const FunctorType m_func ;
const Policy m_policy ;
template< class PType >
KOKKOS_FORCEINLINE_FUNCTION static
void driver( typename Impl::enable_if<
( Impl::is_same< typename PType::work_tag , void >::value )
, const FunctorType & >::type functor
, reference_type update
, const PType & range )
{
const typename PType::member_type e = range.end();
for ( typename PType::member_type i = range.begin() ; i < e ; ++i ) {
functor( i , update );
}
}
template< class PType >
KOKKOS_FORCEINLINE_FUNCTION static
void driver( typename Impl::enable_if<
( ! Impl::is_same< typename PType::work_tag , void >::value )
, const FunctorType & >::type functor
, reference_type update
, const PType & range )
{
const typename PType::member_type e = range.end();
for ( typename PType::member_type i = range.begin() ; i < e ; ++i ) {
functor( typename PType::work_tag() , i , update );
}
}
static void execute( QthreadExec & exec , const void * arg )
{
const ParallelReduce & self = * ((const ParallelReduce *) arg );
driver( self.m_func
, ValueInit::init( self.m_func , exec.exec_all_reduce_value() )
, typename Policy::WorkRange( self.m_policy , exec.worker_rank() , exec.worker_size() )
);
exec.template exec_all_reduce<FunctorType, typename Policy::work_tag >( self.m_func );
}
public:
template< class HostViewType >
ParallelReduce( const FunctorType & functor
, const Policy & policy
, const HostViewType & result_view )
: m_func( functor )
, m_policy( policy )
{
QthreadExec::resize_worker_scratch( ValueTraits::value_size( m_func ) , 0 );
Impl::QthreadExec::exec_all( Qthread::instance() , & ParallelReduce::execute , this );
const pointer_type data = (pointer_type) QthreadExec::exec_all_reduce_result();
Kokkos::Impl::FunctorFinal< FunctorType , typename Policy::work_tag >::final( m_func , data );
if ( result_view.ptr_on_device() ) {
const unsigned n = ValueTraits::value_count( m_func );
for ( unsigned i = 0 ; i < n ; ++i ) { result_view.ptr_on_device()[i] = data[i]; }
}
}
};
//----------------------------------------------------------------------------
template< class FunctorType , class Arg0 , class Arg1 >
class ParallelFor< FunctorType , TeamPolicy< Arg0 , Arg1 , Kokkos::Qthread > >
{
private:
typedef TeamPolicy< Arg0 , Arg1 , Kokkos::Qthread > Policy ;
const FunctorType m_func ;
const Policy m_team ;
template< class TagType >
KOKKOS_FORCEINLINE_FUNCTION
void driver( typename Impl::enable_if< Impl::is_same< TagType , void >::value ,
const typename Policy::member_type & >::type member ) const
{ m_func( member ); }
template< class TagType >
KOKKOS_FORCEINLINE_FUNCTION
void driver( typename Impl::enable_if< ! Impl::is_same< TagType , void >::value ,
const typename Policy::member_type & >::type member ) const
{ m_func( TagType() , member ); }
static void execute( QthreadExec & exec , const void * arg )
{
const ParallelFor & self = * ((const ParallelFor *) arg );
typename Policy::member_type member( exec , self.m_team );
while ( member ) {
self.ParallelFor::template driver< typename Policy::work_tag >( member );
member.team_barrier();
member.next_team();
}
exec.exec_all_barrier();
}
public:
ParallelFor( const FunctorType & functor ,
const Policy & policy )
: m_func( functor )
, m_team( policy )
{
QthreadExec::resize_worker_scratch
( /* reduction memory */ 0
, /* team shared memory */ FunctorTeamShmemSize< FunctorType >::value( functor , policy.team_size() ) );
Impl::QthreadExec::exec_all( Qthread::instance() , & ParallelFor::execute , this );
}
};
//----------------------------------------------------------------------------
template< class FunctorType , class Arg0 , class Arg1 >
class ParallelReduce< FunctorType , TeamPolicy< Arg0 , Arg1 , Kokkos::Qthread > >
{
private:
typedef TeamPolicy< Arg0 , Arg1 , Kokkos::Qthread > Policy ;
typedef Kokkos::Impl::FunctorValueTraits< FunctorType , typename Policy::work_tag > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< FunctorType , typename Policy::work_tag > ValueInit ;
typedef typename ValueTraits::pointer_type pointer_type ;
typedef typename ValueTraits::reference_type reference_type ;
const FunctorType m_func ;
const Policy m_team ;
template< class TagType >
KOKKOS_FORCEINLINE_FUNCTION
void driver( typename Impl::enable_if< Impl::is_same< TagType , void >::value ,
const typename Policy::member_type & >::type member
, reference_type update ) const
{ m_func( member , update ); }
template< class TagType >
KOKKOS_FORCEINLINE_FUNCTION
void driver( typename Impl::enable_if< ! Impl::is_same< TagType , void >::value ,
const typename Policy::member_type & >::type member
, reference_type update ) const
{ m_func( TagType() , member , update ); }
static void execute( QthreadExec & exec , const void * arg )
{
const ParallelReduce & self = * ((const ParallelReduce *) arg );
// Initialize thread-local value
reference_type update = ValueInit::init( self.m_func , exec.exec_all_reduce_value() );
typename Policy::member_type member( exec , self.m_team );
while ( member ) {
self.ParallelReduce::template driver< typename Policy::work_tag >( member , update );
member.team_barrier();
member.next_team();
}
exec.template exec_all_reduce< FunctorType , typename Policy::work_tag >( self.m_func );
}
public:
template< class ViewType >
ParallelReduce( const FunctorType & functor ,
const Policy & policy ,
const ViewType & result )
: m_func( functor )
, m_team( policy )
{
QthreadExec::resize_worker_scratch
( /* reduction memory */ ValueTraits::value_size( functor )
, /* team shared memory */ FunctorTeamShmemSize< FunctorType >::value( functor , policy.team_size() ) );
Impl::QthreadExec::exec_all( Qthread::instance() , & ParallelReduce::execute , this );
const pointer_type data = (pointer_type) QthreadExec::exec_all_reduce_result();
Kokkos::Impl::FunctorFinal< FunctorType , typename Policy::work_tag >::final( m_func , data );
const unsigned n = ValueTraits::value_count( m_func );
for ( unsigned i = 0 ; i < n ; ++i ) { result.ptr_on_device()[i] = data[i]; }
}
};
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
template< class FunctorType , class Arg0 , class Arg1 , class Arg2 >
class ParallelScan< FunctorType , Kokkos::RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::Qthread > >
{
private:
typedef Kokkos::RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::Qthread > Policy ;
typedef Kokkos::Impl::FunctorValueTraits< FunctorType , typename Policy::work_tag > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< FunctorType , typename Policy::work_tag > ValueInit ;
typedef typename ValueTraits::pointer_type pointer_type ;
typedef typename ValueTraits::reference_type reference_type ;
const FunctorType m_func ;
const Policy m_policy ;
template< class PType >
KOKKOS_FORCEINLINE_FUNCTION static
void driver( typename Impl::enable_if<
( Impl::is_same< typename PType::work_tag , void >::value )
, const FunctorType & >::type functor
, reference_type update
, const bool final
, const PType & range )
{
const typename PType::member_type e = range.end();
for ( typename PType::member_type i = range.begin() ; i < e ; ++i ) {
functor( i , update , final );
}
}
template< class PType >
KOKKOS_FORCEINLINE_FUNCTION static
void driver( typename Impl::enable_if<
( ! Impl::is_same< typename PType::work_tag , void >::value )
, const FunctorType & >::type functor
, reference_type update
, const bool final
, const PType & range )
{
const typename PType::member_type e = range.end();
for ( typename PType::member_type i = range.begin() ; i < e ; ++i ) {
functor( typename PType::work_tag() , i , update , final );
}
}
static void execute( QthreadExec & exec , const void * arg )
{
const ParallelScan & self = * ((const ParallelScan *) arg );
const typename Policy::WorkRange range( self.m_policy , exec.worker_rank() , exec.worker_size() );
// Initialize thread-local value
reference_type update = ValueInit::init( self.m_func , exec.exec_all_reduce_value() );
driver( self.m_func , update , false , range );
exec.template exec_all_scan< FunctorType , typename Policy::work_tag >( self.m_func );
driver( self.m_func , update , true , range );
exec.exec_all_barrier();
}
public:
ParallelScan( const FunctorType & functor
, const Policy & policy
)
: m_func( functor )
, m_policy( policy )
{
QthreadExec::resize_worker_scratch( ValueTraits::value_size( m_func ) , 0 );
Impl::QthreadExec::exec_all( Qthread::instance() , & ParallelScan::execute , this );
}
};
} // namespace Impl
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
+namespace Kokkos {
+
+template<typename iType>
+KOKKOS_INLINE_FUNCTION
+Impl::TeamThreadRangeBoundariesStruct<iType,Impl::QthreadTeamPolicyMember>
+TeamThreadRange(const Impl::QthreadTeamPolicyMember& thread, const iType& count)
+{
+ return Impl::TeamThreadRangeBoundariesStruct<iType,Impl::QthreadTeamPolicyMember>(thread,count);
+}
+
+template<typename iType>
+KOKKOS_INLINE_FUNCTION
+Impl::TeamThreadRangeBoundariesStruct<iType,Impl::QthreadTeamPolicyMember>
+TeamThreadRange( const Impl::QthreadTeamPolicyMember& thread
+ , const iType & begin
+ , const iType & end
+ )
+{
+ return Impl::TeamThreadRangeBoundariesStruct<iType,Impl::QthreadTeamPolicyMember>(thread,begin,end);
+}
+
+
+template<typename iType>
+KOKKOS_INLINE_FUNCTION
+Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::QthreadTeamPolicyMember >
+ ThreadVectorRange(const Impl::QthreadTeamPolicyMember& thread, const iType& count) {
+ return Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::QthreadTeamPolicyMember >(thread,count);
+}
+
+
+KOKKOS_INLINE_FUNCTION
+Impl::ThreadSingleStruct<Impl::QthreadTeamPolicyMember> PerTeam(const Impl::QthreadTeamPolicyMember& thread) {
+ return Impl::ThreadSingleStruct<Impl::QthreadTeamPolicyMember>(thread);
+}
+
+KOKKOS_INLINE_FUNCTION
+Impl::VectorSingleStruct<Impl::QthreadTeamPolicyMember> PerThread(const Impl::QthreadTeamPolicyMember& thread) {
+ return Impl::VectorSingleStruct<Impl::QthreadTeamPolicyMember>(thread);
+}
+
+} // namespace Kokkos
+
+namespace Kokkos {
+
+ /** \brief Inter-thread parallel_for. Executes lambda(iType i) for each i=0..N-1.
+ *
+ * The range i=0..N-1 is mapped to all threads of the calling thread team.
+ * This functionality requires C++11 support.*/
+template<typename iType, class Lambda>
+KOKKOS_INLINE_FUNCTION
+void parallel_for(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::QthreadTeamPolicyMember>& loop_boundaries, const Lambda& lambda) {
+ for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment)
+ lambda(i);
+}
+
+/** \brief Inter-thread parallel_reduce. Executes lambda(iType i, ValueType & val) for each i=0..N-1.
+ *
+ * The range i=0..N-1 is mapped to all threads of the calling thread team and a summation of
+ * val is performed and put into result. This functionality requires C++11 support.*/
+template< typename iType, class Lambda, typename ValueType >
+KOKKOS_INLINE_FUNCTION
+void parallel_reduce(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::QthreadTeamPolicyMember>& loop_boundaries,
+ const Lambda & lambda, ValueType& result) {
+
+ result = ValueType();
+
+ for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
+ ValueType tmp = ValueType();
+ lambda(i,tmp);
+ result+=tmp;
+ }
+
+ result = loop_boundaries.thread.team_reduce(result,Impl::JoinAdd<ValueType>());
+}
+
+#if defined( KOKKOS_HAVE_CXX11 )
+
+/** \brief Inter-thread parallel_reduce with a user-provided join. Executes lambda(iType i, ValueType & val) for each i=0..N-1.
+ *
+ * The range i=0..N-1 is mapped to all threads of the calling thread team and a reduction of
+ * val is performed using JoinType(ValueType& val, const ValueType& update) and put into init_result.
+ * The input value of init_result is used as initializer for temporary variables of ValueType. Therefore
+ * the input value should be the neutral element with respect to the join operation (e.g. '0 for +-' or
+ * '1 for *'). This functionality requires C++11 support.*/
+template< typename iType, class Lambda, typename ValueType, class JoinType >
+KOKKOS_INLINE_FUNCTION
+void parallel_reduce(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::QthreadTeamPolicyMember>& loop_boundaries,
+ const Lambda & lambda, const JoinType& join, ValueType& init_result) {
+
+ ValueType result = init_result;
+
+ for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
+ ValueType tmp = ValueType();
+ lambda(i,tmp);
+ join(result,tmp);
+ }
+
+ init_result = loop_boundaries.thread.team_reduce(result,Impl::JoinLambdaAdapter<ValueType,JoinType>(join));
+}
+
+#endif /* #if defined( KOKKOS_HAVE_CXX11 ) */
+
+} // namespace Kokkos
+
+namespace Kokkos {
+/** \brief Intra-thread vector parallel_for. Executes lambda(iType i) for each i=0..N-1.
+ *
+ * The range i=0..N-1 is mapped to all vector lanes of the calling thread.
+ * This functionality requires C++11 support.*/
+template<typename iType, class Lambda>
+KOKKOS_INLINE_FUNCTION
+void parallel_for(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::QthreadTeamPolicyMember >&
+ loop_boundaries, const Lambda& lambda) {
+ #ifdef KOKKOS_HAVE_PRAGMA_IVDEP
+ #pragma ivdep
+ #endif
+ for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment)
+ lambda(i);
+}
+
+/** \brief Intra-thread vector parallel_reduce. Executes lambda(iType i, ValueType & val) for each i=0..N-1.
+ *
+ * The range i=0..N-1 is mapped to all vector lanes of the calling thread and a summation of
+ * val is performed and put into result. This functionality requires C++11 support.*/
+template< typename iType, class Lambda, typename ValueType >
+KOKKOS_INLINE_FUNCTION
+void parallel_reduce(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::QthreadTeamPolicyMember >&
+ loop_boundaries, const Lambda & lambda, ValueType& result) {
+ result = ValueType();
+#ifdef KOKKOS_HAVE_PRAGMA_IVDEP
+#pragma ivdep
+#endif
+ for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
+ ValueType tmp = ValueType();
+ lambda(i,tmp);
+ result+=tmp;
+ }
+}
+
+/** \brief Intra-thread vector parallel_reduce. Executes lambda(iType i, ValueType & val) for each i=0..N-1.
+ *
+ * The range i=0..N-1 is mapped to all vector lanes of the calling thread and a reduction of
+ * val is performed using JoinType(ValueType& val, const ValueType& update) and put into init_result.
+ * The input value of init_result is used as initializer for temporary variables of ValueType. Therefore
+ * the input value should be the neutral element with respect to the join operation (e.g. '0 for +-' or
+ * '1 for *'). This functionality requires C++11 support.*/
+template< typename iType, class Lambda, typename ValueType, class JoinType >
+KOKKOS_INLINE_FUNCTION
+void parallel_reduce(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::QthreadTeamPolicyMember >&
+ loop_boundaries, const Lambda & lambda, const JoinType& join, ValueType& init_result) {
+
+ ValueType result = init_result;
+#ifdef KOKKOS_HAVE_PRAGMA_IVDEP
+#pragma ivdep
+#endif
+ for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
+ ValueType tmp = ValueType();
+ lambda(i,tmp);
+ join(result,tmp);
+ }
+ init_result = result;
+}
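
For orientation, a sketch (not from the patch) of how the vector-level loops above nest inside a thread-level loop. The names 'A', 'y', 'M', 'N' and the TeamThreadRange / ThreadVectorRange / PerThread helpers used below are assumptions taken from the rest of this header.

// Illustrative sketch, not part of the patch: per-row sums of an M x N matrix,
// rows distributed over the team's threads, columns over each thread's vector lanes.
KOKKOS_INLINE_FUNCTION
void row_sums_sketch( const Kokkos::Impl::QthreadTeamPolicyMember & member
                    , const double * A , double * y , const int M , const int N )
{
  Kokkos::parallel_for( Kokkos::TeamThreadRange( member , M ) , [&]( const int row ) {
    double sum = 0 ;
    Kokkos::parallel_reduce( Kokkos::ThreadVectorRange( member , N )
      , [&]( const int col , double & val ) { val += A[ row * N + col ] ; }
      , sum );
    // write the result once per thread, not once per vector lane
    Kokkos::single( Kokkos::PerThread( member ) , [&]() { y[ row ] = sum ; } );
  });
}
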
+
+/** \brief Intra-thread vector parallel exclusive prefix sum. Executes lambda(iType i, ValueType & val, bool final)
+ * for each i=0..N-1.
+ *
+ * The range i=0..N-1 is mapped to all vector lanes in the thread and a scan operation is performed.
+ * Depending on the target execution space the operator might be called twice: once with final=false
+ * and once with final=true. When final==true val contains the prefix sum value. The contribution of this
+ * "i" needs to be added to val no matter whether final==true or not. In a serial execution
+ * (i.e. team_size==1) the operator is only called once with final==true. Scan_val will be set
+ * to the final sum value over all vector lanes.
+ * This functionality requires C++11 support.*/
+template< typename iType, class FunctorType >
+KOKKOS_INLINE_FUNCTION
+void parallel_scan(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::QthreadTeamPolicyMember >&
+ loop_boundaries, const FunctorType & lambda) {
+
+ typedef Kokkos::Impl::FunctorValueTraits< FunctorType , void > ValueTraits ;
+ typedef typename ValueTraits::value_type value_type ;
+
+ value_type scan_val = value_type();
+
+#ifdef KOKKOS_HAVE_PRAGMA_IVDEP
+#pragma ivdep
+#endif
+ for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
+ lambda(i,scan_val,true);
+ }
+}
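
A sketch of the scan operator the parallel_scan above expects (not part of the patch). 'counts' and 'offsets' are illustrative, and value-type deduction from the lambda is assumed to work as for the other nested operations.

// Illustrative sketch, not part of the patch: exclusive prefix sum of 'counts'
// into 'offsets' over a thread's vector range.  The contribution counts[i] is
// added to 'val' on every call; 'val' is a valid prefix only when final == true.
KOKKOS_INLINE_FUNCTION
void exclusive_offsets_sketch( const Kokkos::Impl::QthreadTeamPolicyMember & member
                             , const int * counts , int * offsets , const int N )
{
  Kokkos::parallel_scan( Kokkos::ThreadVectorRange( member , N )
    , [&]( const int i , int & val , const bool final ) {
        if ( final ) { offsets[i] = val ; }
        val += counts[i] ;
      } );
}
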
+
+} // namespace Kokkos
+
+namespace Kokkos {
+
+template<class FunctorType>
+KOKKOS_INLINE_FUNCTION
+void single(const Impl::VectorSingleStruct<Impl::QthreadTeamPolicyMember>& single_struct, const FunctorType& lambda) {
+ lambda();
+}
+
+template<class FunctorType>
+KOKKOS_INLINE_FUNCTION
+void single(const Impl::ThreadSingleStruct<Impl::QthreadTeamPolicyMember>& single_struct, const FunctorType& lambda) {
+ if(single_struct.team_member.team_rank()==0) lambda();
+}
+
+template<class FunctorType, class ValueType>
+KOKKOS_INLINE_FUNCTION
+void single(const Impl::VectorSingleStruct<Impl::QthreadTeamPolicyMember>& single_struct, const FunctorType& lambda, ValueType& val) {
+ lambda(val);
+}
+
+template<class FunctorType, class ValueType>
+KOKKOS_INLINE_FUNCTION
+void single(const Impl::ThreadSingleStruct<Impl::QthreadTeamPolicyMember>& single_struct, const FunctorType& lambda, ValueType& val) {
+ if(single_struct.team_member.team_rank()==0) {
+ lambda(val);
+ }
+ single_struct.team_member.team_broadcast(val,0);
+}
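
A sketch (not part of the patch) of the value-broadcast form of single() defined just above, assuming the usual PerTeam helper for this member type.

// Illustrative sketch, not part of the patch: team rank 0 computes a value and
// team_broadcast() inside single() makes it visible to every thread of the team.
KOKKOS_INLINE_FUNCTION
void broadcast_sketch( const Kokkos::Impl::QthreadTeamPolicyMember & member , double & step )
{
  Kokkos::single( Kokkos::PerTeam( member ) , [&]( double & val ) {
      val = 0.5 ; // executed by team rank 0 only
    } , step );
  // 'step' now holds the same value on every thread of the team.
}
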
+
+} // namespace Kokkos
+
+
#endif /* #define KOKKOS_QTHREAD_PARALLEL_HPP */
diff --git a/lib/kokkos/core/src/Qthread/Kokkos_Qthread_TaskPolicy.cpp b/lib/kokkos/core/src/Qthread/Kokkos_Qthread_TaskPolicy.cpp
index b830079af..9787d2646 100755
--- a/lib/kokkos/core/src/Qthread/Kokkos_Qthread_TaskPolicy.cpp
+++ b/lib/kokkos/core/src/Qthread/Kokkos_Qthread_TaskPolicy.cpp
@@ -1,299 +1,451 @@
/*
//@HEADER
// ************************************************************************
//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
//
// ************************************************************************
//@HEADER
*/
// Experimental unified task-data parallel manycore LDRD
#include <Kokkos_Core_fwd.hpp>
#if defined( KOKKOS_HAVE_QTHREAD )
#include <stdio.h>
#include <stdlib.h>
#include <stdexcept>
#include <iostream>
#include <sstream>
#include <string>
#include <Kokkos_Atomic.hpp>
#include <Qthread/Kokkos_Qthread_TaskPolicy.hpp>
//----------------------------------------------------------------------------
namespace Kokkos {
+namespace Experimental {
namespace Impl {
typedef TaskMember< Kokkos::Qthread , void , void > Task ;
namespace {
inline
unsigned padded_sizeof_derived( unsigned sizeof_derived )
{
return sizeof_derived +
( sizeof_derived % sizeof(Task*) ? sizeof(Task*) - sizeof_derived % sizeof(Task*) : 0 );
}
+// int lock_alloc_dealloc = 0 ;
+
} // namespace
void Task::deallocate( void * ptr )
{
// Counting on 'free' thread safety so lock/unlock not required.
// However, isolate calls here to mitigate future need to introduce lock/unlock.
// lock
+ // while ( ! Kokkos::atomic_compare_exchange_strong( & lock_alloc_dealloc , 0 , 1 ) );
+
free( ptr );
// unlock
+
+ // Kokkos::atomic_compare_exchange_strong( & lock_alloc_dealloc , 1 , 0 );
}
void * Task::allocate( const unsigned arg_sizeof_derived
, const unsigned arg_dependence_capacity )
{
// Counting on 'malloc' thread safety so lock/unlock not required.
// However, isolate calls here to mitigate future need to introduce lock/unlock.
// lock
+ // while ( ! Kokkos::atomic_compare_exchange_strong( & lock_alloc_dealloc , 0 , 1 ) );
+
void * const ptr = malloc( padded_sizeof_derived( arg_sizeof_derived ) + arg_dependence_capacity * sizeof(Task*) );
// unlock
+ // Kokkos::atomic_compare_exchange_strong( & lock_alloc_dealloc , 1 , 0 );
+
return ptr ;
}
Task::~TaskMember()
{
}
-Task::TaskMember( const function_verify_type arg_verify
- , const function_dealloc_type arg_dealloc
- , const function_apply_type arg_apply
- , const unsigned arg_sizeof_derived
- , const unsigned arg_dependence_capacity
+Task::TaskMember( const function_verify_type arg_verify
+ , const function_dealloc_type arg_dealloc
+ , const function_apply_single_type arg_apply_single
+ , const function_apply_team_type arg_apply_team
+ , volatile int & arg_active_count
+ , const unsigned arg_sizeof_derived
+ , const unsigned arg_dependence_capacity
)
: m_dealloc( arg_dealloc )
, m_verify( arg_verify )
- , m_apply( arg_apply )
+ , m_apply_single( arg_apply_single )
+ , m_apply_team( arg_apply_team )
+ , m_active_count( & arg_active_count )
+ , m_qfeb(0)
, m_dep( (Task **)( ((unsigned char *) this) + padded_sizeof_derived( arg_sizeof_derived ) ) )
, m_dep_capacity( arg_dependence_capacity )
, m_dep_size( 0 )
, m_ref_count( 0 )
- , m_state( Kokkos::TASK_STATE_CONSTRUCTING )
- , m_qfeb(0)
+ , m_state( Kokkos::Experimental::TASK_STATE_CONSTRUCTING )
{
qthread_empty( & m_qfeb ); // Set to full when complete
for ( unsigned i = 0 ; i < arg_dependence_capacity ; ++i ) m_dep[i] = 0 ;
}
-Task::TaskMember( const function_dealloc_type arg_dealloc
- , const function_apply_type arg_apply
- , const unsigned arg_sizeof_derived
- , const unsigned arg_dependence_capacity
+Task::TaskMember( const function_dealloc_type arg_dealloc
+ , const function_apply_single_type arg_apply_single
+ , const function_apply_team_type arg_apply_team
+ , volatile int & arg_active_count
+ , const unsigned arg_sizeof_derived
+ , const unsigned arg_dependence_capacity
)
: m_dealloc( arg_dealloc )
, m_verify( & Task::verify_type<void> )
- , m_apply( arg_apply )
+ , m_apply_single( arg_apply_single )
+ , m_apply_team( arg_apply_team )
+ , m_active_count( & arg_active_count )
+ , m_qfeb(0)
, m_dep( (Task **)( ((unsigned char *) this) + padded_sizeof_derived( arg_sizeof_derived ) ) )
, m_dep_capacity( arg_dependence_capacity )
, m_dep_size( 0 )
, m_ref_count( 0 )
- , m_state( Kokkos::TASK_STATE_CONSTRUCTING )
- , m_qfeb(0)
+ , m_state( Kokkos::Experimental::TASK_STATE_CONSTRUCTING )
{
qthread_empty( & m_qfeb ); // Set to full when complete
for ( unsigned i = 0 ; i < arg_dependence_capacity ; ++i ) m_dep[i] = 0 ;
}
//----------------------------------------------------------------------------
void Task::throw_error_add_dependence() const
{
std::cerr << "TaskMember< Qthread >::add_dependence ERROR"
<< " state(" << m_state << ")"
<< " dep_size(" << m_dep_size << ")"
<< std::endl ;
throw std::runtime_error("TaskMember< Qthread >::add_dependence ERROR");
}
void Task::throw_error_verify_type()
{
throw std::runtime_error("TaskMember< Qthread >::verify_type ERROR");
}
//----------------------------------------------------------------------------
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
void Task::assign( Task ** const lhs , Task * rhs , const bool no_throw )
{
static const char msg_error_header[] = "Kokkos::Impl::TaskManager<Kokkos::Qthread>::assign ERROR" ;
static const char msg_error_count[] = ": negative reference count" ;
static const char msg_error_complete[] = ": destroy task that is not complete" ;
static const char msg_error_dependences[] = ": destroy task that has dependences" ;
static const char msg_error_exception[] = ": caught internal exception" ;
- const char * msg_error = 0 ;
+ if ( rhs ) { Kokkos::atomic_fetch_add( & (*rhs).m_ref_count , 1 ); }
+
+ Task * const lhs_val = Kokkos::atomic_exchange( lhs , rhs );
+
+ if ( lhs_val ) {
- try {
+ const int count = Kokkos::atomic_fetch_add( & (*lhs_val).m_ref_count , -1 );
- if ( *lhs ) {
+ const char * msg_error = 0 ;
- const int count = Kokkos::atomic_fetch_add( & (**lhs).m_ref_count , -1 );
+ try {
if ( 1 == count ) {
// Reference count at zero, delete it
// Should only be deallocating a completed task
- if ( (**lhs).m_state == Kokkos::TASK_STATE_COMPLETE ) {
+ if ( (*lhs_val).m_state == Kokkos::Experimental::TASK_STATE_COMPLETE ) {
// A completed task should not have dependences...
- for ( int i = 0 ; i < (**lhs).m_dep_size && 0 == msg_error ; ++i ) {
- if ( (**lhs).m_dep[i] ) msg_error = msg_error_dependences ;
+ for ( int i = 0 ; i < (*lhs_val).m_dep_size && 0 == msg_error ; ++i ) {
+ if ( (*lhs_val).m_dep[i] ) msg_error = msg_error_dependences ;
}
}
else {
msg_error = msg_error_complete ;
}
if ( 0 == msg_error ) {
// Get deletion function and apply it
- const Task::function_dealloc_type d = (**lhs).m_dealloc ;
+ const Task::function_dealloc_type d = (*lhs_val).m_dealloc ;
- (*d)( *lhs );
+ (*d)( lhs_val );
}
}
else if ( count <= 0 ) {
msg_error = msg_error_count ;
}
}
-
- if ( 0 == msg_error && rhs ) { Kokkos::atomic_fetch_add( & (*rhs).m_ref_count , 1 ); }
-
- *lhs = rhs ;
- }
- catch( ... ) {
- if ( 0 == msg_error ) msg_error = msg_error_exception ;
- }
-
- if ( 0 != msg_error ) {
- if ( no_throw ) {
- std::cerr << msg_error_header << msg_error << std::endl ;
- std::cerr.flush();
+ catch( ... ) {
+ if ( 0 == msg_error ) msg_error = msg_error_exception ;
}
- else {
- std::string msg(msg_error_header);
- msg.append(msg_error);
- throw std::runtime_error( msg );
+
+ if ( 0 != msg_error ) {
+ if ( no_throw ) {
+ std::cerr << msg_error_header << msg_error << std::endl ;
+ std::cerr.flush();
+ }
+ else {
+ std::string msg(msg_error_header);
+ msg.append(msg_error);
+ throw std::runtime_error( msg );
+ }
}
}
}
#endif
//----------------------------------------------------------------------------
aligned_t Task::qthread_func( void * arg )
{
Task * const task = reinterpret_cast< Task * >(arg);
- task->m_state = Kokkos::TASK_STATE_EXECUTING ;
+  // The first member of the team changes the task state to executing.
+ // Use compare-exchange to avoid race condition with a respawn.
+ Kokkos::atomic_compare_exchange_strong( & task->m_state
+ , int(Kokkos::Experimental::TASK_STATE_WAITING)
+ , int(Kokkos::Experimental::TASK_STATE_EXECUTING)
+ );
+
+ // It is a single thread's responsibility to close out
+ // this task's execution.
+ bool close_out = false ;
+
+ if ( task->m_apply_team ) {
+
+ Kokkos::Impl::QthreadTeamPolicyMember member ;
+
+ (*task->m_apply_team)( task , member );
+
+fprintf( stdout
+ , "worker(%d.%d) task 0x%.12lx executed by member(%d:%d)\n"
+ , qthread_shep()
+ , qthread_worker_local(NULL)
+ , reinterpret_cast<unsigned long>(task)
+ , member.team_rank()
+ , member.team_size()
+ );
+fflush(stdout);
+
+ member.team_barrier();
+
+ close_out = member.team_rank() == 0 ;
+ }
+ else {
+ (*task->m_apply_single)( task );
+
+ close_out = true ;
+ }
+
+ if ( close_out ) {
+
+ // When dependent tasks run there would be a race
+ // condition between destroying this task and
+ // querying the active count pointer from this task.
+ int volatile * active_count = task->m_active_count ;
+
+ if ( task->m_state == ( Kokkos::Experimental::TASK_STATE_WAITING | Kokkos::Experimental::TASK_STATE_EXECUTING ) ) {
+
+#if 0
+fprintf( stdout
+ , "worker(%d.%d) task 0x%.12lx respawn\n"
+ , qthread_shep()
+ , qthread_worker_local(NULL)
+ , reinterpret_cast<unsigned long>(task)
+ );
+fflush(stdout);
+#endif
+
+ // Task respawned, set state to waiting and reschedule the task
+ task->m_state = Kokkos::Experimental::TASK_STATE_WAITING ;
+ task->schedule();
+ }
+ else {
- (*task->m_apply)( task );
+ // Task did not respawn, is complete
+ task->m_state = Kokkos::Experimental::TASK_STATE_COMPLETE ;
- if ( task->m_state == Kokkos::TASK_STATE_EXECUTING ) {
- // Task did not respawn, is complete
- task->m_state = Kokkos::TASK_STATE_COMPLETE ;
+ // Release dependences before allowing dependent tasks to run.
+ // Otherwise there is a thread race condition for removing dependences.
+ for ( int i = 0 ; i < task->m_dep_size ; ++i ) {
+ assign( & task->m_dep[i] , 0 );
+ }
- // Release dependences before allowing dependent tasks to run.
- // Otherwise their is a thread race condition for removing dependences.
- for ( int i = 0 ; i < task->m_dep_size ; ++i ) {
- assign( & task->m_dep[i] , 0 );
+ // Set qthread FEB to full so that dependent tasks are allowed to execute.
+ // This 'task' may be deleted immediately following this function call.
+ qthread_fill( & task->m_qfeb );
}
- // Set qthread FEB to full so that dependent tasks are allowed to execute
- qthread_fill( & task->m_qfeb );
+ // Decrement active task count before returning.
+ Kokkos::atomic_decrement( active_count );
}
+#if 0
+fprintf( stdout
+ , "worker(%d.%d) task 0x%.12lx return\n"
+ , qthread_shep()
+ , qthread_worker_local(NULL)
+ , reinterpret_cast<unsigned long>(task)
+ );
+fflush(stdout);
+#endif
+
return 0 ;
}
+void Task::respawn()
+{
+ // Change state from pure executing to ( waiting | executing )
+ // to avoid confusion with simply waiting.
+ Kokkos::atomic_compare_exchange_strong( & m_state
+ , int(Kokkos::Experimental::TASK_STATE_EXECUTING)
+ , int(Kokkos::Experimental::TASK_STATE_WAITING |
+ Kokkos::Experimental::TASK_STATE_EXECUTING)
+ );
+}
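
The intended call pattern for respawn(), sketched below, goes through TaskPolicy< Kokkos::Qthread >::respawn() declared in the companion header. The functor shape and the idea of keeping a copy of the policy inside the functor are illustrative assumptions, not something prescribed by the patch.

// Illustrative sketch, not part of the patch.  A running task flips its state to
// ( WAITING | EXECUTING ) via respawn(); when qthread_func() closes the task out it
// sees that combined state, resets it to WAITING and schedules the task again
// instead of marking it complete.
struct IteratePassesSketch {
  typedef void value_type ;
  Kokkos::Experimental::TaskPolicy< Kokkos::Qthread > policy ; // assumed copied in at construction
  int remaining ;

  void apply( Kokkos::Impl::QthreadTeamPolicyMember & member )
  {
    // ... one team-parallel pass of work ...
    if ( member.team_rank() == 0 && --remaining > 0 ) {
      policy.respawn( this ); // run this same task again after it returns
    }
  }
};
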
+
void Task::schedule()
{
// Is waiting for execution
+ // Increment active task count before spawning.
+ Kokkos::atomic_increment( m_active_count );
+
// spawn in qthread. must malloc the precondition array and give to qthread.
// qthread will eventually free this allocation so memory will not be leaked.
// concern with thread safety of malloc, does this need to be guarded?
aligned_t ** qprecon = (aligned_t **) malloc( ( m_dep_size + 1 ) * sizeof(aligned_t *) );
qprecon[0] = reinterpret_cast<aligned_t *>( uintptr_t(m_dep_size) );
for ( int i = 0 ; i < m_dep_size ; ++i ) {
qprecon[i+1] = & m_dep[i]->m_qfeb ; // Qthread precondition flag
}
- m_state = Kokkos::TASK_STATE_WAITING ;
+ if ( m_apply_single ) {
+ qthread_spawn( & Task::qthread_func /* function */
+ , this /* function argument */
+ , 0
+ , NULL
+ , m_dep_size , qprecon /* dependences */
+ , NO_SHEPHERD
+ , QTHREAD_SPAWN_SIMPLE /* allows optimization for non-blocking task */
+ );
+ }
+ else {
+ // If more than one shepherd spawn on a shepherd other than this shepherd
+ const int num_shepherd = qthread_num_shepherds();
+ const int num_worker_per_shepherd = qthread_num_workers_local(NO_SHEPHERD);
+ const int this_shepherd = qthread_shep();
+
+ int spawn_shepherd = ( this_shepherd + 1 ) % num_shepherd ;
+
+fprintf( stdout
+ , "worker(%d.%d) task 0x%.12lx spawning on shepherd(%d) clone(%d)\n"
+ , qthread_shep()
+ , qthread_worker_local(NULL)
+ , reinterpret_cast<unsigned long>(this)
+ , spawn_shepherd
+ , num_worker_per_shepherd - 1
+ );
+fflush(stdout);
+
+ qthread_spawn_cloneable
+ ( & Task::qthread_func
+ , this
+ , 0
+ , NULL
+ , m_dep_size , qprecon /* dependences */
+ , spawn_shepherd
+ // , unsigned( QTHREAD_SPAWN_SIMPLE | QTHREAD_SPAWN_LOCAL_PRIORITY )
+ , unsigned( QTHREAD_SPAWN_LOCAL_PRIORITY )
+ , num_worker_per_shepherd - 1
+ );
+ }
+}
+
+} // namespace Impl
+} // namespace Experimental
+} // namespace Kokkos
+
+namespace Kokkos {
+namespace Experimental {
- qthread_spawn( & Task::qthread_func , this , 0 , NULL
- , m_dep_size , qprecon
- , NO_SHEPHERD , QTHREAD_SPAWN_SIMPLE );
+TaskPolicy< Kokkos::Qthread >::member_type &
+TaskPolicy< Kokkos::Qthread >::member_single()
+{
+ static member_type s ;
+ return s ;
}
-void Task::wait( const Future< void, Kokkos::Qthread> & f )
+void wait( Kokkos::Experimental::TaskPolicy< Kokkos::Qthread > & policy )
{
- if ( f.m_task ) {
- aligned_t tmp ;
- qthread_readFF( & tmp , & f.m_task->m_qfeb );
- }
+ volatile int * const active_task_count = & policy.m_active_count ;
+ while ( *active_task_count ) qthread_yield();
}
-} // namespace Impl
+} // namespace Experimental
} // namespace Kokkos
#endif /* #if defined( KOKKOS_HAVE_QTHREAD ) */
diff --git a/lib/kokkos/core/src/Qthread/Kokkos_Qthread_TaskPolicy.hpp b/lib/kokkos/core/src/Qthread/Kokkos_Qthread_TaskPolicy.hpp
index 3764f10b3..af44b62a1 100755
--- a/lib/kokkos/core/src/Qthread/Kokkos_Qthread_TaskPolicy.hpp
+++ b/lib/kokkos/core/src/Qthread/Kokkos_Qthread_TaskPolicy.hpp
@@ -1,736 +1,646 @@
/*
//@HEADER
// ************************************************************************
//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
//
// ************************************************************************
//@HEADER
*/
// Experimental unified task-data parallel manycore LDRD
#ifndef KOKKOS_QTHREAD_TASKPOLICY_HPP
#define KOKKOS_QTHREAD_TASKPOLICY_HPP
#include <string>
#include <typeinfo>
#include <stdexcept>
+//----------------------------------------------------------------------------
+// Defines to enable experimental Qthread functionality
+
+#define QTHREAD_LOCAL_PRIORITY
+#define CLONED_TASKS
+
#include <qthread.h>
+#undef QTHREAD_LOCAL_PRIORITY
+#undef CLONED_TASKS
+
+//----------------------------------------------------------------------------
+
#include <Kokkos_Qthread.hpp>
#include <Kokkos_TaskPolicy.hpp>
#include <Kokkos_View.hpp>
#include <impl/Kokkos_FunctorAdapter.hpp>
//----------------------------------------------------------------------------
namespace Kokkos {
+namespace Experimental {
namespace Impl {
template<>
class TaskMember< Kokkos::Qthread , void , void >
{
public:
- typedef void (* function_apply_type) ( TaskMember * );
+ typedef void (* function_apply_single_type) ( TaskMember * );
+ typedef void (* function_apply_team_type) ( TaskMember * , Kokkos::Impl::QthreadTeamPolicyMember & );
typedef void (* function_dealloc_type)( TaskMember * );
typedef TaskMember * (* function_verify_type) ( TaskMember * );
private:
- const function_dealloc_type m_dealloc ; ///< Deallocation
- const function_verify_type m_verify ; ///< Result type verification
- const function_apply_type m_apply ; ///< Apply function
- TaskMember ** const m_dep ; ///< Dependences
- const int m_dep_capacity ; ///< Capacity of dependences
- int m_dep_size ; ///< Actual count of dependences
- int m_ref_count ; ///< Reference count
- int m_state ; ///< State of the task
- aligned_t m_qfeb ; ///< Qthread full/empty bit
+ const function_dealloc_type m_dealloc ; ///< Deallocation
+ const function_verify_type m_verify ; ///< Result type verification
+ const function_apply_single_type m_apply_single ; ///< Apply function
+ const function_apply_team_type m_apply_team ; ///< Apply function
+ int volatile * const m_active_count ; ///< Count of active tasks on this policy
+ aligned_t m_qfeb ; ///< Qthread full/empty bit
+ TaskMember ** const m_dep ; ///< Dependences
+ const int m_dep_capacity ; ///< Capacity of dependences
+ int m_dep_size ; ///< Actual count of dependences
+ int m_ref_count ; ///< Reference count
+ int m_state ; ///< State of the task
TaskMember() /* = delete */ ;
TaskMember( const TaskMember & ) /* = delete */ ;
TaskMember & operator = ( const TaskMember & ) /* = delete */ ;
static aligned_t qthread_func( void * arg );
static void * allocate( const unsigned arg_sizeof_derived , const unsigned arg_dependence_capacity );
static void deallocate( void * );
void throw_error_add_dependence() const ;
static void throw_error_verify_type();
template < class DerivedTaskType >
static
void deallocate( TaskMember * t )
{
DerivedTaskType * ptr = static_cast< DerivedTaskType * >(t);
ptr->~DerivedTaskType();
deallocate( (void *) ptr );
}
+ void schedule();
+
protected :
~TaskMember();
// Used by TaskMember< Qthread , ResultType , void >
- TaskMember( const function_verify_type arg_verify
- , const function_dealloc_type arg_dealloc
- , const function_apply_type arg_apply
- , const unsigned arg_sizeof_derived
- , const unsigned arg_dependence_capacity
+ TaskMember( const function_verify_type arg_verify
+ , const function_dealloc_type arg_dealloc
+ , const function_apply_single_type arg_apply_single
+ , const function_apply_team_type arg_apply_team
+ , volatile int & arg_active_count
+ , const unsigned arg_sizeof_derived
+ , const unsigned arg_dependence_capacity
);
// Used for TaskMember< Qthread , void , void >
- TaskMember( const function_dealloc_type arg_dealloc
- , const function_apply_type arg_apply
- , const unsigned arg_sizeof_derived
- , const unsigned arg_dependence_capacity
+ TaskMember( const function_dealloc_type arg_dealloc
+ , const function_apply_single_type arg_apply_single
+ , const function_apply_team_type arg_apply_team
+ , volatile int & arg_active_count
+ , const unsigned arg_sizeof_derived
+ , const unsigned arg_dependence_capacity
);
public:
template< typename ResultType >
KOKKOS_FUNCTION static
TaskMember * verify_type( TaskMember * t )
{
- enum { check_type = ! Impl::is_same< ResultType , void >::value };
+ enum { check_type = ! Kokkos::Impl::is_same< ResultType , void >::value };
if ( check_type && t != 0 ) {
// Verify that t->m_verify is this function
const function_verify_type self = & TaskMember::template verify_type< ResultType > ;
if ( t->m_verify != self ) {
t = 0 ;
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
throw_error_verify_type();
#endif
}
}
return t ;
}
//----------------------------------------
/* Inheritence Requirements on task types:
* typedef FunctorType::value_type value_type ;
* class DerivedTaskType
* : public TaskMember< Qthread , value_type , FunctorType >
* { ... };
* class TaskMember< Qthread , value_type , FunctorType >
* : public TaskMember< Qthread , value_type , void >
* , public Functor
* { ... };
* If value_type != void
* class TaskMember< Qthread , value_type , void >
* : public TaskMember< Qthread , void , void >
*
* Allocate space for DerivedTaskType followed by TaskMember*[ dependence_capacity ]
*
*/
/** \brief Allocate and construct a single-thread task */
template< class DerivedTaskType >
static
- TaskMember * create( const typename DerivedTaskType::functor_type & arg_functor
- , const unsigned arg_dependence_capacity )
+ TaskMember * create_single( const typename DerivedTaskType::functor_type & arg_functor
+ , volatile int & arg_active_count
+ , const unsigned arg_dependence_capacity )
{
typedef typename DerivedTaskType::functor_type functor_type ;
typedef typename functor_type::value_type value_type ;
DerivedTaskType * const task =
new( allocate( sizeof(DerivedTaskType) , arg_dependence_capacity ) )
DerivedTaskType( & TaskMember::template deallocate< DerivedTaskType >
, & TaskMember::template apply_single< functor_type , value_type >
+ , 0
+ , arg_active_count
, sizeof(DerivedTaskType)
, arg_dependence_capacity
, arg_functor );
return static_cast< TaskMember * >( task );
}
- /** \brief Allocate and construct a data parallel task */
+ /** \brief Allocate and construct a team-thread task */
template< class DerivedTaskType >
static
- TaskMember * create( const typename DerivedTaskType::policy_type & arg_policy
- , const typename DerivedTaskType::functor_type & arg_functor
- , const unsigned arg_dependence_capacity )
+ TaskMember * create_team( const typename DerivedTaskType::functor_type & arg_functor
+ , volatile int & arg_active_count
+ , const unsigned arg_dependence_capacity )
{
+ typedef typename DerivedTaskType::functor_type functor_type ;
+ typedef typename functor_type::value_type value_type ;
+
DerivedTaskType * const task =
new( allocate( sizeof(DerivedTaskType) , arg_dependence_capacity ) )
DerivedTaskType( & TaskMember::template deallocate< DerivedTaskType >
+ , 0
+ , & TaskMember::template apply_team< functor_type , value_type >
+ , arg_active_count
, sizeof(DerivedTaskType)
, arg_dependence_capacity
- , arg_policy
- , arg_functor
- );
+ , arg_functor );
return static_cast< TaskMember * >( task );
}
- void schedule();
- static void wait( const Future< void , Kokkos::Qthread > & );
+ void respawn();
+ void spawn()
+ {
+ m_state = Kokkos::Experimental::TASK_STATE_WAITING ;
+ schedule();
+ }
//----------------------------------------
typedef FutureValueTypeIsVoidError get_result_type ;
KOKKOS_INLINE_FUNCTION
get_result_type get() const { return get_result_type() ; }
KOKKOS_INLINE_FUNCTION
- Kokkos::TaskState get_state() const { return Kokkos::TaskState( m_state ); }
+ Kokkos::Experimental::TaskState get_state() const { return Kokkos::Experimental::TaskState( m_state ); }
//----------------------------------------
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
static
void assign( TaskMember ** const lhs , TaskMember * const rhs , const bool no_throw = false );
#else
KOKKOS_INLINE_FUNCTION static
void assign( TaskMember ** const lhs , TaskMember * const rhs , const bool no_throw = false ) {}
#endif
KOKKOS_INLINE_FUNCTION
TaskMember * get_dependence( int i ) const
- { return ( Kokkos::TASK_STATE_EXECUTING == m_state && 0 <= i && i < m_dep_size ) ? m_dep[i] : (TaskMember*) 0 ; }
+ { return ( Kokkos::Experimental::TASK_STATE_EXECUTING == m_state && 0 <= i && i < m_dep_size ) ? m_dep[i] : (TaskMember*) 0 ; }
KOKKOS_INLINE_FUNCTION
int get_dependence() const
{ return m_dep_size ; }
KOKKOS_INLINE_FUNCTION
void clear_dependence()
{
for ( int i = 0 ; i < m_dep_size ; ++i ) assign( m_dep + i , 0 );
m_dep_size = 0 ;
}
KOKKOS_INLINE_FUNCTION
void add_dependence( TaskMember * before )
{
- if ( ( Kokkos::TASK_STATE_CONSTRUCTING == m_state ||
- Kokkos::TASK_STATE_EXECUTING == m_state ) &&
+ if ( ( Kokkos::Experimental::TASK_STATE_CONSTRUCTING == m_state ||
+ Kokkos::Experimental::TASK_STATE_EXECUTING == m_state ) &&
m_dep_size < m_dep_capacity ) {
assign( m_dep + m_dep_size , before );
++m_dep_size ;
}
else {
throw_error_add_dependence();
}
}
//----------------------------------------
template< class FunctorType , class ResultType >
KOKKOS_INLINE_FUNCTION static
- void apply_single( typename Impl::enable_if< ! Impl::is_same< ResultType , void >::value , TaskMember * >::type t )
+ void apply_single( typename Kokkos::Impl::enable_if< ! Kokkos::Impl::is_same< ResultType , void >::value , TaskMember * >::type t )
{
typedef TaskMember< Kokkos::Qthread , ResultType , FunctorType > derived_type ;
// TaskMember< Kokkos::Qthread , ResultType , FunctorType >
// : public TaskMember< Kokkos::Qthread , ResultType , void >
// , public FunctorType
// { ... };
derived_type & m = * static_cast< derived_type * >( t );
- Impl::FunctorApply< FunctorType , void , ResultType & >::apply( (FunctorType &) m , & m.m_result );
+ Kokkos::Impl::FunctorApply< FunctorType , void , ResultType & >::apply( (FunctorType &) m , & m.m_result );
}
template< class FunctorType , class ResultType >
KOKKOS_INLINE_FUNCTION static
- void apply_single( typename Impl::enable_if< Impl::is_same< ResultType , void >::value , TaskMember * >::type t )
+ void apply_single( typename Kokkos::Impl::enable_if< Kokkos::Impl::is_same< ResultType , void >::value , TaskMember * >::type t )
{
typedef TaskMember< Kokkos::Qthread , ResultType , FunctorType > derived_type ;
// TaskMember< Kokkos::Qthread , ResultType , FunctorType >
// : public TaskMember< Kokkos::Qthread , ResultType , void >
// , public FunctorType
// { ... };
derived_type & m = * static_cast< derived_type * >( t );
- Impl::FunctorApply< FunctorType , void , void >::apply( (FunctorType &) m );
+ Kokkos::Impl::FunctorApply< FunctorType , void , void >::apply( (FunctorType &) m );
+ }
+
+ //----------------------------------------
+
+ template< class FunctorType , class ResultType >
+ KOKKOS_INLINE_FUNCTION static
+ void apply_team( typename Kokkos::Impl::enable_if< ! Kokkos::Impl::is_same< ResultType , void >::value , TaskMember * >::type t
+ , Kokkos::Impl::QthreadTeamPolicyMember & member )
+ {
+ typedef TaskMember< Kokkos::Qthread , ResultType , FunctorType > derived_type ;
+
+ derived_type & m = * static_cast< derived_type * >( t );
+
+ m.FunctorType::apply( member , m.m_result );
+ }
+
+ template< class FunctorType , class ResultType >
+ KOKKOS_INLINE_FUNCTION static
+ void apply_team( typename Kokkos::Impl::enable_if< Kokkos::Impl::is_same< ResultType , void >::value , TaskMember * >::type t
+ , Kokkos::Impl::QthreadTeamPolicyMember & member )
+ {
+ typedef TaskMember< Kokkos::Qthread , ResultType , FunctorType > derived_type ;
+
+ derived_type & m = * static_cast< derived_type * >( t );
+
+ m.FunctorType::apply( member );
}
};
//----------------------------------------------------------------------------
/** \brief Base class for tasks with a result value in the Qthread execution space.
*
* The FunctorType must be void because this class is accessed by the
* Future class for the task and result value.
*
* Must be derived from TaskMember<S,void,void> 'root class' so the Future class
* can correctly static_cast from the 'root class' to this class.
*/
template < class ResultType >
class TaskMember< Kokkos::Qthread , ResultType , void >
: public TaskMember< Kokkos::Qthread , void , void >
{
public:
ResultType m_result ;
typedef const ResultType & get_result_type ;
KOKKOS_INLINE_FUNCTION
get_result_type get() const { return m_result ; }
protected:
typedef TaskMember< Kokkos::Qthread , void , void > task_root_type ;
- typedef task_root_type::function_dealloc_type function_dealloc_type ;
- typedef task_root_type::function_apply_type function_apply_type ;
+ typedef task_root_type::function_dealloc_type function_dealloc_type ;
+ typedef task_root_type::function_apply_single_type function_apply_single_type ;
+ typedef task_root_type::function_apply_team_type function_apply_team_type ;
inline
- TaskMember( const function_dealloc_type arg_dealloc
- , const function_apply_type arg_apply
- , const unsigned arg_sizeof_derived
- , const unsigned arg_dependence_capacity
+ TaskMember( const function_dealloc_type arg_dealloc
+ , const function_apply_single_type arg_apply_single
+ , const function_apply_team_type arg_apply_team
+ , volatile int & arg_active_count
+ , const unsigned arg_sizeof_derived
+ , const unsigned arg_dependence_capacity
)
: task_root_type( & task_root_type::template verify_type< ResultType >
, arg_dealloc
- , arg_apply
+ , arg_apply_single
+ , arg_apply_team
+ , arg_active_count
, arg_sizeof_derived
, arg_dependence_capacity )
, m_result()
{}
-
};
template< class ResultType , class FunctorType >
class TaskMember< Kokkos::Qthread , ResultType , FunctorType >
: public TaskMember< Kokkos::Qthread , ResultType , void >
, public FunctorType
{
public:
typedef FunctorType functor_type ;
typedef TaskMember< Kokkos::Qthread , void , void > task_root_type ;
typedef TaskMember< Kokkos::Qthread , ResultType , void > task_base_type ;
- typedef task_root_type::function_dealloc_type function_dealloc_type ;
- typedef task_root_type::function_apply_type function_apply_type ;
+ typedef task_root_type::function_dealloc_type function_dealloc_type ;
+ typedef task_root_type::function_apply_single_type function_apply_single_type ;
+ typedef task_root_type::function_apply_team_type function_apply_team_type ;
inline
- TaskMember( const function_dealloc_type arg_dealloc
- , const function_apply_type arg_apply
- , const unsigned arg_sizeof_derived
- , const unsigned arg_dependence_capacity
- , const functor_type & arg_functor
- )
- : task_base_type( arg_dealloc , arg_apply , arg_sizeof_derived , arg_dependence_capacity )
- , functor_type( arg_functor )
- {}
-};
-
-} /* namespace Impl */
-} /* namespace Kokkos */
-
-//----------------------------------------------------------------------------
-//----------------------------------------------------------------------------
-
-namespace Kokkos {
-namespace Impl {
-
-/** \brief ForEach task in the Qthread execution space
- *
- * Derived from TaskMember< Kokkos::Qthread , ResultType , FunctorType >
- * so that Functor can be cast to task root type without knowing policy.
- */
-template< class Arg0 , class Arg1 , class Arg2 , class ResultType , class FunctorType >
-class TaskForEach< Kokkos::RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::Qthread >
- , ResultType
- , FunctorType >
- : TaskMember< Kokkos::Qthread , ResultType , FunctorType >
-{
-public:
-
- typedef FunctorType functor_type ;
- typedef RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::Qthread > policy_type ;
-
-private:
-
- friend class Kokkos::TaskPolicy< Kokkos::Qthread > ;
- friend class Kokkos::Impl::TaskMember< Kokkos::Qthread , void , void > ;
-
- typedef TaskMember< Kokkos::Qthread , void , void > task_root_type ;
- typedef TaskMember< Kokkos::Qthread , ResultType , FunctorType > task_base_type ;
- typedef task_root_type::function_dealloc_type function_dealloc_type ;
-
- policy_type m_policy ;
-
- template< class Tag >
- inline
- typename Impl::enable_if< Impl::is_same<Tag,void>::value >::type
- apply_policy() const
- {
- const typename policy_type::member_type e = m_policy.end();
- for ( typename policy_type::member_type i = m_policy.begin() ; i < e ; ++i ) {
- functor_type::operator()(i);
- }
- }
-
- template< class Tag >
- inline
- typename Impl::enable_if< ! Impl::is_same<Tag,void>::value >::type
- apply_policy() const
- {
- const Tag tag ;
- const typename policy_type::member_type e = m_policy.end();
- for ( typename policy_type::member_type i = m_policy.begin() ; i < e ; ++i ) {
- functor_type::operator()(tag,i);
- }
- }
-
- static
- void apply_parallel( task_root_type * t )
- {
- static_cast<TaskForEach*>(t)->template apply_policy< typename policy_type::work_tag >();
-
- task_root_type::template apply_single< functor_type , ResultType >( t );
- }
-
- TaskForEach( const function_dealloc_type arg_dealloc
- , const int arg_sizeof_derived
- , const int arg_dependence_capacity
- , const policy_type & arg_policy
- , const functor_type & arg_functor
- )
- : task_base_type( arg_dealloc
- , & apply_parallel
- , arg_sizeof_derived
- , arg_dependence_capacity
- , arg_functor )
- , m_policy( arg_policy )
- {}
-
- TaskForEach() /* = delete */ ;
- TaskForEach( const TaskForEach & ) /* = delete */ ;
- TaskForEach & operator = ( const TaskForEach & ) /* = delete */ ;
-};
-
-//----------------------------------------------------------------------------
-/** \brief Reduce task in the Qthread execution space
- *
- * Derived from TaskMember< Kokkos::Qthread , ResultType , FunctorType >
- * so that Functor can be cast to task root type without knowing policy.
- */
-template< class Arg0 , class Arg1 , class Arg2 , class ResultType , class FunctorType >
-class TaskReduce< Kokkos::RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::Qthread >
- , ResultType
- , FunctorType >
- : TaskMember< Kokkos::Qthread , ResultType , FunctorType >
-{
-public:
-
- typedef FunctorType functor_type ;
- typedef RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::Qthread > policy_type ;
-
-private:
-
- friend class Kokkos::TaskPolicy< Kokkos::Qthread > ;
- friend class Kokkos::Impl::TaskMember< Kokkos::Qthread , void , void > ;
-
- typedef TaskMember< Kokkos::Qthread , void , void > task_root_type ;
- typedef TaskMember< Kokkos::Qthread , ResultType , FunctorType > task_base_type ;
- typedef task_root_type::function_dealloc_type function_dealloc_type ;
-
- policy_type m_policy ;
-
- template< class Tag >
- inline
- void apply_policy( typename Impl::enable_if< Impl::is_same<Tag,void>::value , ResultType & >::type result ) const
- {
- Impl::FunctorValueInit< functor_type , Tag >::init( *this , & result );
- const typename policy_type::member_type e = m_policy.end();
- for ( typename policy_type::member_type i = m_policy.begin() ; i < e ; ++i ) {
- functor_type::operator()( i, result );
- }
- }
-
- template< class Tag >
- inline
- void apply_policy( typename Impl::enable_if< ! Impl::is_same<Tag,void>::value , ResultType & >::type result ) const
- {
- Impl::FunctorValueInit< functor_type , Tag >::init( *this , & result );
- const Tag tag ;
- const typename policy_type::member_type e = m_policy.end();
- for ( typename policy_type::member_type i = m_policy.begin() ; i < e ; ++i ) {
- functor_type::operator()( tag, i, result );
- }
- }
-
- static
- void apply_parallel( task_root_type * t )
- {
- TaskReduce * const task = static_cast<TaskReduce*>(t);
-
- task->template apply_policy< typename policy_type::work_tag >( task->task_base_type::m_result );
-
- task_root_type::template apply_single< functor_type , ResultType >( t );
- }
-
- TaskReduce( const function_dealloc_type arg_dealloc
- , const int arg_sizeof_derived
- , const int arg_dependence_capacity
- , const policy_type & arg_policy
- , const functor_type & arg_functor
+ TaskMember( const function_dealloc_type arg_dealloc
+ , const function_apply_single_type arg_apply_single
+ , const function_apply_team_type arg_apply_team
+ , volatile int & arg_active_count
+ , const unsigned arg_sizeof_derived
+ , const unsigned arg_dependence_capacity
+ , const functor_type & arg_functor
)
: task_base_type( arg_dealloc
- , & apply_parallel
+ , arg_apply_single
+ , arg_apply_team
+ , arg_active_count
, arg_sizeof_derived
- , arg_dependence_capacity
- , arg_functor )
- , m_policy( arg_policy )
+ , arg_dependence_capacity )
+ , functor_type( arg_functor )
{}
-
- TaskReduce() /* = delete */ ;
- TaskReduce( const TaskReduce & ) /* = delete */ ;
- TaskReduce & operator = ( const TaskReduce & ) /* = delete */ ;
};
-
} /* namespace Impl */
+} /* namespace Experimental */
} /* namespace Kokkos */
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
+namespace Experimental {
+
+void wait( TaskPolicy< Kokkos::Qthread > & );
template<>
class TaskPolicy< Kokkos::Qthread >
{
public:
- typedef Kokkos::Qthread execution_space ;
+ typedef Kokkos::Qthread execution_space ;
+ typedef Kokkos::Impl::QthreadTeamPolicyMember member_type ;
private:
typedef Impl::TaskMember< execution_space , void , void > task_root_type ;
TaskPolicy & operator = ( const TaskPolicy & ) /* = delete */ ;
template< class FunctorType >
static inline
const task_root_type * get_task_root( const FunctorType * f )
{
typedef Impl::TaskMember< execution_space , typename FunctorType::value_type , FunctorType > task_type ;
return static_cast< const task_root_type * >( static_cast< const task_type * >(f) );
}
template< class FunctorType >
static inline
task_root_type * get_task_root( FunctorType * f )
{
typedef Impl::TaskMember< execution_space , typename FunctorType::value_type , FunctorType > task_type ;
return static_cast< task_root_type * >( static_cast< task_type * >(f) );
}
- const unsigned m_default_dependence_capacity ;
+ const unsigned m_default_dependence_capacity ;
+ volatile int m_active_count_root ;
+ volatile int & m_active_count ;
public:
KOKKOS_INLINE_FUNCTION
- TaskPolicy() : m_default_dependence_capacity(4) {}
-
- KOKKOS_INLINE_FUNCTION
- TaskPolicy( const TaskPolicy & rhs ) : m_default_dependence_capacity( rhs.m_default_dependence_capacity ) {}
+ TaskPolicy()
+ : m_default_dependence_capacity(4)
+ , m_active_count_root(0)
+ , m_active_count( m_active_count_root )
+ {}
KOKKOS_INLINE_FUNCTION
explicit
TaskPolicy( const unsigned arg_default_dependence_capacity )
- : m_default_dependence_capacity( arg_default_dependence_capacity ) {}
+ : m_default_dependence_capacity( arg_default_dependence_capacity )
+ , m_active_count_root(0)
+ , m_active_count( m_active_count_root )
+ {}
KOKKOS_INLINE_FUNCTION
- TaskPolicy( const TaskPolicy &
+ TaskPolicy( const TaskPolicy & rhs )
+ : m_default_dependence_capacity( rhs.m_default_dependence_capacity )
+ , m_active_count_root(0)
+ , m_active_count( rhs.m_active_count )
+ {}
+
+ KOKKOS_INLINE_FUNCTION
+ TaskPolicy( const TaskPolicy & rhs
, const unsigned arg_default_dependence_capacity )
- : m_default_dependence_capacity( arg_default_dependence_capacity ) {}
+ : m_default_dependence_capacity( arg_default_dependence_capacity )
+ , m_active_count_root(0)
+ , m_active_count( rhs.m_active_count )
+ {}
//----------------------------------------
template< class ValueType >
const Future< ValueType , execution_space > &
spawn( const Future< ValueType , execution_space > & f ) const
{
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
- f.m_task->schedule();
+ f.m_task->spawn();
#endif
return f ;
}
// Create single-thread task
template< class FunctorType >
Future< typename FunctorType::value_type , execution_space >
create( const FunctorType & functor
, const unsigned dependence_capacity = ~0u ) const
{
typedef typename FunctorType::value_type value_type ;
typedef Impl::TaskMember< execution_space , value_type , FunctorType > task_type ;
return Future< value_type , execution_space >(
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
- task_root_type::create< task_type >(
- functor , ( ~0u == dependence_capacity ? m_default_dependence_capacity : dependence_capacity ) )
+ task_root_type::create_single< task_type >
+ ( functor
+ , m_active_count
+ , ( ~0u == dependence_capacity ? m_default_dependence_capacity : dependence_capacity )
+ )
#endif
);
}
- // Create parallel foreach task
+ // Create thread-team task
- template< class PolicyType , class FunctorType >
+ template< class FunctorType >
+ KOKKOS_INLINE_FUNCTION
Future< typename FunctorType::value_type , execution_space >
- create_foreach( const PolicyType & policy
- , const FunctorType & functor
- , const unsigned dependence_capacity = ~0u ) const
+ create_team( const FunctorType & functor
+ , const unsigned dependence_capacity = ~0u ) const
{
- typedef typename FunctorType::value_type value_type ;
- typedef Impl::TaskForEach< PolicyType , value_type , FunctorType > task_type ;
- return Future< value_type , execution_space >(
-#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
- task_root_type::create< task_type >( policy , functor ,
- ( ~0u == dependence_capacity ? m_default_dependence_capacity : dependence_capacity ) )
-#endif
- );
- }
-
- // Create parallel reduce task
+ typedef typename FunctorType::value_type value_type ;
+ typedef Impl::TaskMember< execution_space , value_type , FunctorType > task_type ;
- template< class PolicyType , class FunctorType >
- Future< typename FunctorType::value_type , execution_space >
- create_reduce( const PolicyType & policy
- , const FunctorType & functor
- , const unsigned dependence_capacity = ~0u ) const
- {
- typedef typename FunctorType::value_type value_type ;
- typedef Impl::TaskReduce< PolicyType , value_type , FunctorType > task_type ;
return Future< value_type , execution_space >(
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
- task_root_type::create< task_type >( policy , functor ,
- ( ~0u == dependence_capacity ? m_default_dependence_capacity : dependence_capacity ) )
+ task_root_type::create_team< task_type >
+ ( functor
+ , m_active_count
+ , ( ~0u == dependence_capacity ? m_default_dependence_capacity : dependence_capacity )
+ )
#endif
);
}
// Add dependence
template< class A1 , class A2 , class A3 , class A4 >
void add_dependence( const Future<A1,A2> & after
, const Future<A3,A4> & before
- , typename Impl::enable_if
- < Impl::is_same< typename Future<A1,A2>::execution_space , execution_space >::value
+ , typename Kokkos::Impl::enable_if
+ < Kokkos::Impl::is_same< typename Future<A1,A2>::execution_space , execution_space >::value
&&
- Impl::is_same< typename Future<A3,A4>::execution_space , execution_space >::value
+ Kokkos::Impl::is_same< typename Future<A3,A4>::execution_space , execution_space >::value
>::type * = 0
)
{
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
after.m_task->add_dependence( before.m_task );
#endif
}
//----------------------------------------
// Functions for an executing task functor to query dependences,
// set new dependences, and respawn itself.
template< class FunctorType >
Future< void , execution_space >
get_dependence( const FunctorType * task_functor , int i ) const
{
return Future<void,execution_space>(
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
get_task_root(task_functor)->get_dependence(i)
#endif
);
}
template< class FunctorType >
int get_dependence( const FunctorType * task_functor ) const
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
{ return get_task_root(task_functor)->get_dependence(); }
#else
{ return 0 ; }
#endif
template< class FunctorType >
void clear_dependence( FunctorType * task_functor ) const
{
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
get_task_root(task_functor)->clear_dependence();
#endif
}
template< class FunctorType , class A3 , class A4 >
void add_dependence( FunctorType * task_functor
, const Future<A3,A4> & before
- , typename Impl::enable_if
- < Impl::is_same< typename Future<A3,A4>::execution_space , execution_space >::value
+ , typename Kokkos::Impl::enable_if
+ < Kokkos::Impl::is_same< typename Future<A3,A4>::execution_space , execution_space >::value
>::type * = 0
)
{
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
get_task_root(task_functor)->add_dependence( before.m_task );
#endif
}
template< class FunctorType >
void respawn( FunctorType * task_functor ) const
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
- { get_task_root(task_functor)->schedule(); }
+ { get_task_root(task_functor)->respawn(); }
#else
{}
#endif
-};
-
-inline
-void wait( TaskPolicy< Kokkos::Qthread > & );
+ static member_type & member_single();
-inline
-void wait( const Future< void , Kokkos::Qthread > & future )
-{ Impl::TaskMember< Kokkos::Qthread , void , void >::wait( future ); }
+ friend void wait( TaskPolicy< Kokkos::Qthread > & );
+};
+} /* namespace Experimental */
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
#endif /* #define KOKKOS_QTHREAD_TASK_HPP */
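
Taken together, the public interface above is meant to be driven roughly as sketched below. The create_team / add_dependence / spawn / wait calls are the ones declared in this header; the functor, its body, and the surrounding setup are illustrative assumptions, and the Future type is assumed to live in Kokkos::Experimental alongside TaskPolicy.

// Illustrative sketch, not part of the patch.  Assumes Kokkos and its Qthread
// execution space have already been initialized by the host process.
struct HelloTaskSketch {
  typedef void value_type ;
  void apply( Kokkos::Impl::QthreadTeamPolicyMember & member )
  {
    // team-parallel body using 'member', as in the nested-parallelism sketches above
  }
};

void run_small_task_graph_sketch()
{
  Kokkos::Experimental::TaskPolicy< Kokkos::Qthread > policy ;

  Kokkos::Experimental::Future< void , Kokkos::Qthread > a = policy.create_team( HelloTaskSketch() );
  Kokkos::Experimental::Future< void , Kokkos::Qthread > b = policy.create_team( HelloTaskSketch() );

  policy.add_dependence( b , a ); // 'b' may not start until 'a' has completed
  policy.spawn( a );
  policy.spawn( b );

  Kokkos::Experimental::wait( policy ); // returns when the policy's active task count drains to zero
}
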
diff --git a/lib/kokkos/core/src/Qthread/README b/lib/kokkos/core/src/Qthread/README
new file mode 100755
index 000000000..5d8f29a4e
--- /dev/null
+++ b/lib/kokkos/core/src/Qthread/README
@@ -0,0 +1,28 @@
+
+# This Qthreads back-end uses an experimental branch of the Qthreads repository with special #define options.
+
+# Cloning repository and branch:
+
+git clone https://github.com/stelleg/qthreads qthreads-with-clone
+
+cd qthreads-with-clone
+
+# Added to .git/config
+#
+# [branch "cloned_tasks"]
+# remote = origin
+# merge = refs/heads/cloned_tasks
+#
+
+git branch cloned_tasks
+git checkout cloned_tasks
+git pull
+
+sh autogen.sh
+
+# configure with 'hwloc' installation:
+
+./configure CFLAGS="-DCLONED_TASKS -DQTHREAD_LOCAL_PRIORITY" --with-hwloc=${HWLOCDIR} --prefix=${INSTALLDIR}
+
+
+
diff --git a/lib/kokkos/core/src/Threads/Kokkos_ThreadsExec.cpp b/lib/kokkos/core/src/Threads/Kokkos_ThreadsExec.cpp
index 1c2db5f1a..99553fccb 100755
--- a/lib/kokkos/core/src/Threads/Kokkos_ThreadsExec.cpp
+++ b/lib/kokkos/core/src/Threads/Kokkos_ThreadsExec.cpp
@@ -1,745 +1,758 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
#include <Kokkos_Core_fwd.hpp>
#if defined( KOKKOS_HAVE_PTHREAD ) || defined( KOKKOS_HAVE_WINTHREAD )
#include <stdint.h>
#include <limits>
#include <utility>
#include <iostream>
#include <sstream>
#include <Kokkos_Threads.hpp>
#include <Kokkos_hwloc.hpp>
#include <Kokkos_Atomic.hpp>
#include <impl/Kokkos_Error.hpp>
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
namespace {
ThreadsExec s_threads_process ;
ThreadsExec * s_threads_exec[ ThreadsExec::MAX_THREAD_COUNT ] = { 0 };
pthread_t s_threads_pid[ ThreadsExec::MAX_THREAD_COUNT ] = { 0 };
std::pair<unsigned,unsigned> s_threads_coord[ ThreadsExec::MAX_THREAD_COUNT ];
int s_thread_pool_size[3] = { 0 , 0 , 0 };
unsigned s_current_reduce_size = 0 ;
unsigned s_current_shared_size = 0 ;
void (* volatile s_current_function)( ThreadsExec & , const void * );
const void * volatile s_current_function_arg = 0 ;
struct Sentinel {
Sentinel()
{
HostSpace::register_in_parallel( ThreadsExec::in_parallel );
}
~Sentinel()
{
if ( s_thread_pool_size[0] ||
s_thread_pool_size[1] ||
s_thread_pool_size[2] ||
s_current_reduce_size ||
s_current_shared_size ||
s_current_function ||
s_current_function_arg ||
s_threads_exec[0] ) {
std::cerr << "ERROR : Process exiting without calling Kokkos::Threads::terminate()" << std::endl ;
}
}
};
inline
unsigned fan_size( const unsigned rank , const unsigned size )
{
const unsigned rank_rev = size - ( rank + 1 );
unsigned count = 0 ;
for ( unsigned n = 1 ; ( rank_rev + n < size ) && ! ( rank_rev & n ) ; n <<= 1 ) { ++count ; }
return count ;
}
} // namespace
} // namespace Impl
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
void execute_function_noop( ThreadsExec & , const void * ) {}
void ThreadsExec::driver(void)
{
ThreadsExec this_thread ;
while ( ThreadsExec::Active == this_thread.m_pool_state ) {
(*s_current_function)( this_thread , s_current_function_arg );
// Deactivate thread and wait for reactivation
this_thread.m_pool_state = ThreadsExec::Inactive ;
wait_yield( this_thread.m_pool_state , ThreadsExec::Inactive );
}
}
ThreadsExec::ThreadsExec()
: m_pool_base(0)
- , m_scratch(0)
+ , m_scratch()
, m_scratch_reduce_end(0)
, m_scratch_thread_end(0)
+ , m_numa_rank(0)
+ , m_numa_core_rank(0)
, m_pool_rank(0)
, m_pool_size(0)
, m_pool_fan_size(0)
, m_pool_state( ThreadsExec::Terminating )
{
if ( & s_threads_process != this ) {
// A spawned thread
ThreadsExec * const nil = 0 ;
// Which entry in 's_threads_exec', possibly determined from hwloc binding
const int entry = ((size_t)s_current_function_arg) < size_t(s_thread_pool_size[0])
? ((size_t)s_current_function_arg)
: size_t(Kokkos::hwloc::bind_this_thread( s_thread_pool_size[0] , s_threads_coord ));
// Given a good entry set this thread in the 's_threads_exec' array
if ( entry < s_thread_pool_size[0] &&
nil == atomic_compare_exchange( s_threads_exec + entry , nil , this ) ) {
- m_pool_base = s_threads_exec ;
- m_pool_rank = s_thread_pool_size[0] - ( entry + 1 );
- m_pool_size = s_thread_pool_size[0] ;
- m_pool_fan_size = fan_size( m_pool_rank , m_pool_size );
- m_pool_state = ThreadsExec::Active ;
+ const std::pair<unsigned,unsigned> coord = Kokkos::hwloc::get_this_thread_coordinate();
+
+ m_numa_rank = coord.first ;
+ m_numa_core_rank = coord.second ;
+ m_pool_base = s_threads_exec ;
+ m_pool_rank = s_thread_pool_size[0] - ( entry + 1 );
+ m_pool_size = s_thread_pool_size[0] ;
+ m_pool_fan_size = fan_size( m_pool_rank , m_pool_size );
+ m_pool_state = ThreadsExec::Active ;
s_threads_pid[ m_pool_rank ] = pthread_self();
// Inform spawning process that the threads_exec entry has been set.
s_threads_process.m_pool_state = ThreadsExec::Active ;
}
else {
// Inform spawning process that the threads_exec entry could not be set.
s_threads_process.m_pool_state = ThreadsExec::Terminating ;
}
}
else {
// Enables 'parallel_for' to execute on an uninitialized Threads device
m_pool_rank = 0 ;
m_pool_size = 1 ;
m_pool_state = ThreadsExec::Inactive ;
s_threads_pid[ m_pool_rank ] = pthread_self();
}
}
ThreadsExec::~ThreadsExec()
{
const unsigned entry = m_pool_size - ( m_pool_rank + 1 );
m_pool_base = 0 ;
- m_scratch = 0 ;
+ m_scratch.clear();
m_scratch_reduce_end = 0 ;
m_scratch_thread_end = 0 ;
- m_pool_rank = 0 ;
- m_pool_size = 0 ;
- m_pool_fan_size = 0 ;
+ m_numa_rank = 0 ;
+ m_numa_core_rank = 0 ;
+ m_pool_rank = 0 ;
+ m_pool_size = 0 ;
+ m_pool_fan_size = 0 ;
m_pool_state = ThreadsExec::Terminating ;
if ( & s_threads_process != this && entry < MAX_THREAD_COUNT ) {
ThreadsExec * const nil = 0 ;
atomic_compare_exchange( s_threads_exec + entry , this , nil );
s_threads_process.m_pool_state = ThreadsExec::Terminating ;
}
}
int ThreadsExec::get_thread_count()
{
return s_thread_pool_size[0] ;
}
ThreadsExec * ThreadsExec::get_thread( const int init_thread_rank )
{
ThreadsExec * const th =
init_thread_rank < s_thread_pool_size[0]
? s_threads_exec[ s_thread_pool_size[0] - ( init_thread_rank + 1 ) ] : 0 ;
if ( 0 == th || th->m_pool_rank != init_thread_rank ) {
std::ostringstream msg ;
msg << "Kokkos::Impl::ThreadsExec::get_thread ERROR : "
<< "thread " << init_thread_rank << " of " << s_thread_pool_size[0] ;
if ( 0 == th ) {
msg << " does not exist" ;
}
else {
msg << " has wrong thread_rank " << th->m_pool_rank ;
}
Kokkos::Impl::throw_runtime_exception( msg.str() );
}
return th ;
}
//----------------------------------------------------------------------------
-void ThreadsExec::execute_get_binding( ThreadsExec & exec , const void * )
-{
- s_threads_coord[ exec.m_pool_rank ] = Kokkos::hwloc::get_this_thread_coordinate();
-}
-
void ThreadsExec::execute_sleep( ThreadsExec & exec , const void * )
{
ThreadsExec::global_lock();
ThreadsExec::global_unlock();
const int n = exec.m_pool_fan_size ;
const int rank_rev = exec.m_pool_size - ( exec.m_pool_rank + 1 );
for ( int i = 0 ; i < n ; ++i ) {
Impl::spinwait( exec.m_pool_base[ rank_rev + (1<<i) ]->m_pool_state , ThreadsExec::Active );
}
exec.m_pool_state = ThreadsExec::Inactive ;
}
}
}
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
void ThreadsExec::verify_is_process( const std::string & name , const bool initialized )
{
if ( ! is_process() ) {
std::string msg( name );
msg.append( " FAILED : Called by a worker thread, can only be called by the master process." );
Kokkos::Impl::throw_runtime_exception( msg );
}
if ( initialized && 0 == s_thread_pool_size[0] ) {
std::string msg( name );
msg.append( " FAILED : Threads not initialized." );
Kokkos::Impl::throw_runtime_exception( msg );
}
}
int ThreadsExec::in_parallel()
{
// A thread function is in execution,
// the function argument is not the special threads-process argument, and
// either the master process is part of the worker pool or the caller is not the master process.
return s_current_function &&
( & s_threads_process != s_current_function_arg ) &&
( s_threads_process.m_pool_base || ! is_process() );
}
// Wait for root thread to become inactive
void ThreadsExec::fence()
{
if ( s_thread_pool_size[0] ) {
// Wait for the root thread to complete:
Impl::spinwait( s_threads_exec[0]->m_pool_state , ThreadsExec::Active );
}
s_current_function = 0 ;
s_current_function_arg = 0 ;
}
/** \brief Begin execution of the asynchronous functor */
void ThreadsExec::start( void (*func)( ThreadsExec & , const void * ) , const void * arg )
{
verify_is_process("ThreadsExec::start" , true );
if ( s_current_function || s_current_function_arg ) {
Kokkos::Impl::throw_runtime_exception( std::string( "ThreadsExec::start() FAILED : already executing" ) );
}
s_current_function = func ;
s_current_function_arg = arg ;
// Activate threads:
for ( int i = s_thread_pool_size[0] ; 0 < i-- ; ) {
s_threads_exec[i]->m_pool_state = ThreadsExec::Active ;
}
if ( s_threads_process.m_pool_size ) {
// Master process is the root thread, run it:
(*func)( s_threads_process , arg );
s_threads_process.m_pool_state = ThreadsExec::Inactive ;
}
}
//----------------------------------------------------------------------------
bool ThreadsExec::sleep()
{
verify_is_process("ThreadsExec::sleep", true );
if ( & execute_sleep == s_current_function ) return false ;
fence();
ThreadsExec::global_lock();
s_current_function = & execute_sleep ;
// Activate threads:
for ( unsigned i = s_thread_pool_size[0] ; 0 < i ; ) {
s_threads_exec[--i]->m_pool_state = ThreadsExec::Active ;
}
return true ;
}
bool ThreadsExec::wake()
{
verify_is_process("ThreadsExec::wake", true );
if ( & execute_sleep != s_current_function ) return false ;
ThreadsExec::global_unlock();
if ( s_threads_process.m_pool_base ) {
execute_sleep( s_threads_process , 0 );
s_threads_process.m_pool_state = ThreadsExec::Inactive ;
}
fence();
return true ;
}
//----------------------------------------------------------------------------
void ThreadsExec::execute_serial( void (*func)( ThreadsExec & , const void * ) )
{
s_current_function = func ;
s_current_function_arg = & s_threads_process ;
const unsigned begin = s_threads_process.m_pool_base ? 1 : 0 ;
for ( unsigned i = s_thread_pool_size[0] ; begin < i ; ) {
ThreadsExec & th = * s_threads_exec[ --i ];
th.m_pool_state = ThreadsExec::Active ;
wait_yield( th.m_pool_state , ThreadsExec::Active );
}
if ( s_threads_process.m_pool_base ) {
s_threads_process.m_pool_state = ThreadsExec::Active ;
(*func)( s_threads_process , 0 );
s_threads_process.m_pool_state = ThreadsExec::Inactive ;
}
s_current_function_arg = 0 ;
s_current_function = 0 ;
}
//----------------------------------------------------------------------------
void * ThreadsExec::root_reduce_scratch()
{
return s_threads_process.reduce_memory();
}
void ThreadsExec::execute_resize_scratch( ThreadsExec & exec , const void * )
{
- if ( exec.m_scratch ) {
- HostSpace::decrement( exec.m_scratch );
- exec.m_scratch = 0 ;
- }
+ exec.m_scratch.clear();
exec.m_scratch_reduce_end = s_threads_process.m_scratch_reduce_end ;
exec.m_scratch_thread_end = s_threads_process.m_scratch_thread_end ;
if ( s_threads_process.m_scratch_thread_end ) {
exec.m_scratch =
- HostSpace::allocate( "thread_scratch" , s_threads_process.m_scratch_thread_end );
+ HostSpace::allocate_and_track( "thread_scratch" , s_threads_process.m_scratch_thread_end );
- unsigned * ptr = (unsigned *)( exec.m_scratch );
+ unsigned * ptr = reinterpret_cast<unsigned *>( exec.m_scratch.alloc_ptr() );
unsigned * const end = ptr + s_threads_process.m_scratch_thread_end / sizeof(unsigned);
// touch on this thread
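// (zero-filling from the owning worker is presumably done to get first-touch
// placement of these pages on that thread's NUMA node)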
while ( ptr < end ) *ptr++ = 0 ;
}
}
void * ThreadsExec::resize_scratch( size_t reduce_size , size_t thread_size )
{
enum { ALIGN_MASK = Kokkos::Impl::MEMORY_ALIGNMENT - 1 };
fence();
const size_t old_reduce_size = s_threads_process.m_scratch_reduce_end ;
const size_t old_thread_size = s_threads_process.m_scratch_thread_end - s_threads_process.m_scratch_reduce_end ;
reduce_size = ( reduce_size + ALIGN_MASK ) & ~ALIGN_MASK ;
thread_size = ( thread_size + ALIGN_MASK ) & ~ALIGN_MASK ;
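// (rounds both sizes up to the next multiple of MEMORY_ALIGNMENT,
// assuming MEMORY_ALIGNMENT is a power of two)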
// Increase size or deallocate completely.
if ( ( old_reduce_size < reduce_size ) ||
( old_thread_size < thread_size ) ||
( ( reduce_size == 0 && thread_size == 0 ) &&
( old_reduce_size != 0 || old_thread_size != 0 ) ) ) {
verify_is_process( "ThreadsExec::resize_scratch" , true );
s_threads_process.m_scratch_reduce_end = reduce_size ;
s_threads_process.m_scratch_thread_end = reduce_size + thread_size ;
execute_serial( & execute_resize_scratch );
s_threads_process.m_scratch = s_threads_exec[0]->m_scratch ;
}
- return s_threads_process.m_scratch ;
+ return s_threads_process.m_scratch.alloc_ptr() ;
}
//----------------------------------------------------------------------------
void ThreadsExec::print_configuration( std::ostream & s , const bool detail )
{
verify_is_process("ThreadsExec::print_configuration",false);
fence();
const unsigned numa_count = Kokkos::hwloc::get_available_numa_count();
const unsigned cores_per_numa = Kokkos::hwloc::get_available_cores_per_numa();
const unsigned threads_per_core = Kokkos::hwloc::get_available_threads_per_core();
// Forestall compiler warnings for unused variables.
(void) numa_count;
(void) cores_per_numa;
(void) threads_per_core;
s << "Kokkos::Threads" ;
#if defined( KOKKOS_HAVE_PTHREAD )
s << " KOKKOS_HAVE_PTHREAD" ;
#endif
#if defined( KOKKOS_HAVE_HWLOC )
s << " hwloc[" << numa_count << "x" << cores_per_numa << "x" << threads_per_core << "]" ;
#endif
if ( s_thread_pool_size[0] ) {
s << " threads[" << s_thread_pool_size[0] << "]"
<< " threads_per_numa[" << s_thread_pool_size[1] << "]"
<< " threads_per_core[" << s_thread_pool_size[2] << "]"
;
if ( 0 == s_threads_process.m_pool_base ) { s << " Asynchronous" ; }
s << " ReduceScratch[" << s_current_reduce_size << "]"
<< " SharedScratch[" << s_current_shared_size << "]" ;
s << std::endl ;
if ( detail ) {
- execute_serial( & execute_get_binding );
-
for ( int i = 0 ; i < s_thread_pool_size[0] ; ++i ) {
- ThreadsExec * const th = s_threads_exec[i] ;
- s << " Thread hwloc("
- << s_threads_coord[i].first << "."
- << s_threads_coord[i].second << ")" ;
- s_threads_coord[i].first = ~0u ;
- s_threads_coord[i].second = ~0u ;
+ ThreadsExec * const th = s_threads_exec[i] ;
if ( th ) {
+
const int rank_rev = th->m_pool_size - ( th->m_pool_rank + 1 );
- s << " rank(" << th->m_pool_rank << ")" ;
+ s << " Thread[ " << th->m_pool_rank << " : "
+ << th->m_numa_rank << "." << th->m_numa_core_rank << " ]" ;
- if ( th->m_pool_fan_size ) {
- s << " Fan{" ;
- for ( int j = 0 ; j < th->m_pool_fan_size ; ++j ) {
- s << " " << th->m_pool_base[rank_rev+(1<<j)]->m_pool_rank ;
- }
- s << " }" ;
+ s << " Fan{" ;
+ for ( int j = 0 ; j < th->m_pool_fan_size ; ++j ) {
+ ThreadsExec * const thfan = th->m_pool_base[rank_rev+(1<<j)] ;
+ s << " [ " << thfan->m_pool_rank << " : "
+ << thfan->m_numa_rank << "." << thfan->m_numa_core_rank << " ]" ;
}
+ s << " }" ;
if ( th == & s_threads_process ) {
s << " is_process" ;
}
}
s << std::endl ;
}
}
}
else {
s << " not initialized" << std::endl ;
}
}
//----------------------------------------------------------------------------
int ThreadsExec::is_initialized()
{ return 0 != s_threads_exec[0] ; }
void ThreadsExec::initialize( unsigned thread_count ,
unsigned use_numa_count ,
unsigned use_cores_per_numa ,
bool allow_asynchronous_threadpool )
{
static const Sentinel sentinel ;
const bool is_initialized = 0 != s_thread_pool_size[0] ;
unsigned thread_spawn_failed = 0 ;
for ( int i = 0; i < ThreadsExec::MAX_THREAD_COUNT ; i++)
s_threads_exec[i] = NULL;
if ( ! is_initialized ) {
// If thread_count, use_numa_count, or use_cores_per_numa are zero
// then they will be given default values based upon hwloc detection
// and on whether asynchronous execution is allowed.
const bool hwloc_avail = hwloc::available();
+ if ( thread_count == 0 ) {
+ thread_count = hwloc_avail
+ ? Kokkos::hwloc::get_available_numa_count() *
+ Kokkos::hwloc::get_available_cores_per_numa() *
+ Kokkos::hwloc::get_available_threads_per_core()
+ : 1 ;
+ }
+
const unsigned thread_spawn_begin =
hwloc::thread_mapping( "Kokkos::Threads::initialize" ,
allow_asynchronous_threadpool ,
thread_count ,
use_numa_count ,
use_cores_per_numa ,
s_threads_coord );
const std::pair<unsigned,unsigned> proc_coord = s_threads_coord[0] ;
if ( thread_spawn_begin ) {
// Synchronous with s_threads_coord[0] as the process core
// Claim entry #0 for binding the process core.
s_threads_coord[0] = std::pair<unsigned,unsigned>(~0u,~0u);
}
- s_thread_pool_size[0] = thread_count ;
+ s_thread_pool_size[0] = thread_count ;
s_thread_pool_size[1] = s_thread_pool_size[0] / use_numa_count ;
s_thread_pool_size[2] = s_thread_pool_size[1] / use_cores_per_numa ;
s_current_function = & execute_function_noop ; // Initialization work function
for ( unsigned ith = thread_spawn_begin ; ith < thread_count ; ++ith ) {
s_threads_process.m_pool_state = ThreadsExec::Inactive ;
// If hwloc is available the spawned thread will
// choose its own entry in 's_threads_coord';
// otherwise specify the entry here.
s_current_function_arg = (void*)static_cast<uintptr_t>( hwloc_avail ? ~0u : ith );
// Spawn thread executing the 'driver()' function.
// Wait until spawned thread has attempted to initialize.
// If spawning and initialization are successful then
// an entry in 's_threads_exec' will be assigned.
if ( ThreadsExec::spawn() ) {
wait_yield( s_threads_process.m_pool_state , ThreadsExec::Inactive );
}
if ( s_threads_process.m_pool_state == ThreadsExec::Terminating ) break ;
}
// Wait for all spawned threads to deactivate before zeroing the function.
for ( unsigned ith = thread_spawn_begin ; ith < thread_count ; ++ith ) {
// Try to protect against cache coherency failure by casting to volatile.
ThreadsExec * const th = ((ThreadsExec * volatile *)s_threads_exec)[ith] ;
if ( th ) {
wait_yield( th->m_pool_state , ThreadsExec::Active );
}
else {
++thread_spawn_failed ;
}
}
s_current_function = 0 ;
s_current_function_arg = 0 ;
s_threads_process.m_pool_state = ThreadsExec::Inactive ;
if ( ! thread_spawn_failed ) {
// Bind process to the core on which it was located before spawning occurred
Kokkos::hwloc::bind_this_thread( proc_coord );
if ( thread_spawn_begin ) { // Include process in pool.
- s_threads_exec[0] = & s_threads_process ;
- s_threads_process.m_pool_base = s_threads_exec ;
- s_threads_process.m_pool_rank = thread_count - 1 ; // Reversed for scan-compatible reductions
- s_threads_process.m_pool_size = thread_count ;
- s_threads_process.m_pool_fan_size = fan_size( s_threads_process.m_pool_rank , s_threads_process.m_pool_size );
+ const std::pair<unsigned,unsigned> coord = Kokkos::hwloc::get_this_thread_coordinate();
+
+ s_threads_exec[0] = & s_threads_process ;
+ s_threads_process.m_numa_rank = coord.first ;
+ s_threads_process.m_numa_core_rank = coord.second ;
+ s_threads_process.m_pool_base = s_threads_exec ;
+ s_threads_process.m_pool_rank = thread_count - 1 ; // Reversed for scan-compatible reductions
+ s_threads_process.m_pool_size = thread_count ;
+ s_threads_process.m_pool_fan_size = fan_size( s_threads_process.m_pool_rank , s_threads_process.m_pool_size );
s_threads_pid[ s_threads_process.m_pool_rank ] = pthread_self();
}
else {
s_threads_process.m_pool_base = 0 ;
s_threads_process.m_pool_rank = 0 ;
s_threads_process.m_pool_size = 0 ;
s_threads_process.m_pool_fan_size = 0 ;
}
// Initial allocations:
ThreadsExec::resize_scratch( 1024 , 1024 );
}
else {
- s_thread_pool_size[0] = 0 ;
+ s_thread_pool_size[0] = 0 ;
s_thread_pool_size[1] = 0 ;
s_thread_pool_size[2] = 0 ;
}
}
if ( is_initialized || thread_spawn_failed ) {
std::ostringstream msg ;
msg << "Kokkos::Threads::initialize ERROR" ;
if ( is_initialized ) {
msg << " : already initialized" ;
}
if ( thread_spawn_failed ) {
msg << " : failed to spawn " << thread_spawn_failed << " threads" ;
}
Kokkos::Impl::throw_runtime_exception( msg.str() );
}
+
+ // Initialize the lock array used for arbitrarily sized atomics
+ Impl::init_lock_array_host_space();
+
}
//----------------------------------------------------------------------------
void ThreadsExec::finalize()
{
verify_is_process("ThreadsExec::finalize",false);
fence();
resize_scratch(0,0);
const unsigned begin = s_threads_process.m_pool_base ? 1 : 0 ;
for ( unsigned i = s_thread_pool_size[0] ; begin < i-- ; ) {
if ( s_threads_exec[i] ) {
s_threads_exec[i]->m_pool_state = ThreadsExec::Terminating ;
wait_yield( s_threads_process.m_pool_state , ThreadsExec::Inactive );
s_threads_process.m_pool_state = ThreadsExec::Inactive ;
}
s_threads_pid[i] = 0 ;
}
if ( s_threads_process.m_pool_base ) {
( & s_threads_process )->~ThreadsExec();
s_threads_exec[0] = 0 ;
}
Kokkos::hwloc::unbind_this_thread();
s_thread_pool_size[0] = 0 ;
s_thread_pool_size[1] = 0 ;
s_thread_pool_size[2] = 0 ;
// Reset master thread to run solo.
- s_threads_process.m_pool_base = 0 ;
- s_threads_process.m_pool_rank = 0 ;
- s_threads_process.m_pool_size = 1 ;
- s_threads_process.m_pool_fan_size = 0 ;
+ s_threads_process.m_numa_rank = 0 ;
+ s_threads_process.m_numa_core_rank = 0 ;
+ s_threads_process.m_pool_base = 0 ;
+ s_threads_process.m_pool_rank = 0 ;
+ s_threads_process.m_pool_size = 1 ;
+ s_threads_process.m_pool_fan_size = 0 ;
s_threads_process.m_pool_state = ThreadsExec::Inactive ;
}
//----------------------------------------------------------------------------
} /* namespace Impl */
} /* namespace Kokkos */
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
Threads & Threads::instance(int)
{
static Threads t ;
return t ;
}
int Threads::thread_pool_size( int depth )
{
return Impl::s_thread_pool_size[depth];
}
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
int Threads::thread_pool_rank()
{
const pthread_t pid = pthread_self();
int i = 0;
while ( ( i < Impl::s_thread_pool_size[0] ) && ( pid != Impl::s_threads_pid[i] ) ) { ++i ; }
return i ;
}
#endif
} /* namespace Kokkos */
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
#endif /* #if defined( KOKKOS_HAVE_PTHREAD ) || defined( KOKKOS_HAVE_WINTHREAD ) */
diff --git a/lib/kokkos/core/src/Threads/Kokkos_ThreadsExec.hpp b/lib/kokkos/core/src/Threads/Kokkos_ThreadsExec.hpp
index e60a1094a..382069797 100755
--- a/lib/kokkos/core/src/Threads/Kokkos_ThreadsExec.hpp
+++ b/lib/kokkos/core/src/Threads/Kokkos_ThreadsExec.hpp
@@ -1,1041 +1,465 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_THREADSEXEC_HPP
#define KOKKOS_THREADSEXEC_HPP
#include <stdio.h>
#include <utility>
#include <impl/Kokkos_spinwait.hpp>
#include <impl/Kokkos_FunctorAdapter.hpp>
+#include <impl/Kokkos_AllocationTracker.hpp>
#include <Kokkos_Atomic.hpp>
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
-//----------------------------------------------------------------------------
-
-template< class > struct ThreadsExecAdapter ;
-
-//----------------------------------------------------------------------------
-
-class ThreadsExecTeamMember ;
-
class ThreadsExec {
public:
// Fan array has log_2(NT) reduction threads plus 2 scan threads
// Currently limited to 16k threads.
enum { MAX_FAN_COUNT = 16 };
enum { MAX_THREAD_COUNT = 1 << ( MAX_FAN_COUNT - 2 ) };
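// (i.e. MAX_THREAD_COUNT == 1 << 14 == 16384, the "16k threads" noted above)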
enum { VECTOR_LENGTH = 8 };
/** \brief States of a worker thread */
enum { Terminating ///< Termination in progress
, Inactive ///< Exists, waiting for work
, Active ///< Exists, performing work
, Rendezvous ///< Exists, waiting in a barrier or reduce
, ScanCompleted
, ScanAvailable
, ReductionAvailable
};
private:
- friend class ThreadsExecTeamMember ;
- friend class ThreadsExecTeamVectorMember ;
friend class Kokkos::Threads ;
// Fan-in operations' root is the highest ranking thread,
// so that the 'scan' reduction intermediate values are placed on
// the threads that need them.
// For a simple reduction the root's location is arbitrary.
- /** \brief Reduction memory reserved for team reductions */
- enum { REDUCE_TEAM_BASE = 512 };
-
ThreadsExec * const * m_pool_base ; ///< Base for pool fan-in
- void * m_scratch ;
+ Impl::AllocationTracker m_scratch ;
int m_scratch_reduce_end ;
int m_scratch_thread_end ;
+ int m_numa_rank ;
+ int m_numa_core_rank ;
int m_pool_rank ;
int m_pool_size ;
int m_pool_fan_size ;
int volatile m_pool_state ; ///< State for global synchronizations
static void global_lock();
static void global_unlock();
static bool spawn();
static void execute_resize_scratch( ThreadsExec & , const void * );
static void execute_sleep( ThreadsExec & , const void * );
- static void execute_get_binding( ThreadsExec & , const void * );
ThreadsExec( const ThreadsExec & );
ThreadsExec & operator = ( const ThreadsExec & );
static void execute_serial( void (*)( ThreadsExec & , const void * ) );
public:
KOKKOS_INLINE_FUNCTION int pool_size() const { return m_pool_size ; }
KOKKOS_INLINE_FUNCTION int pool_rank() const { return m_pool_rank ; }
+ KOKKOS_INLINE_FUNCTION int numa_rank() const { return m_numa_rank ; }
+ KOKKOS_INLINE_FUNCTION int numa_core_rank() const { return m_numa_core_rank ; }
static int get_thread_count();
static ThreadsExec * get_thread( const int init_thread_rank );
- inline void * reduce_memory() const { return ((unsigned char *) m_scratch ); }
- KOKKOS_INLINE_FUNCTION void * scratch_memory() const { return ((unsigned char *) m_scratch ) + m_scratch_reduce_end ; }
+ inline void * reduce_memory() const { return reinterpret_cast<unsigned char *>(m_scratch.alloc_ptr()); }
+ KOKKOS_INLINE_FUNCTION void * scratch_memory() const { return reinterpret_cast<unsigned char *>(m_scratch.alloc_ptr()) + m_scratch_reduce_end ; }
+
+ KOKKOS_INLINE_FUNCTION int volatile & state() { return m_pool_state ; }
+ KOKKOS_INLINE_FUNCTION ThreadsExec * const * pool_base() const { return m_pool_base ; }
static void driver(void);
~ThreadsExec();
ThreadsExec();
static void * resize_scratch( size_t reduce_size , size_t thread_size );
static void * root_reduce_scratch();
static bool is_process();
static void verify_is_process( const std::string & , const bool initialized );
static int is_initialized();
static void initialize( unsigned thread_count ,
unsigned use_numa_count ,
unsigned use_cores_per_numa ,
bool allow_asynchronous_threadpool );
static void finalize();
/* Given a requested team size, return valid team size */
static unsigned team_size_valid( unsigned );
static void print_configuration( std::ostream & , const bool detail = false );
//------------------------------------
static void wait_yield( volatile int & , const int );
//------------------------------------
// All-thread functions:
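+ /** \brief Pool-wide integer sum reduction: each thread contributes
+ * 'value'; after the fan-in the root thread sums all contributions,
+ * broadcasts the total, and every caller returns it.
+ */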
+ inline
+ int all_reduce( const int value )
+ {
+ // Each thread's contribution goes into its reduction scratch memory:
+ const int rev_rank = m_pool_size - ( m_pool_rank + 1 );
+
+ *((volatile int*) reduce_memory()) = value ;
+
+ memory_fence();
+
+ // Fan-in reduction with highest ranking thread as the root
+ for ( int i = 0 ; i < m_pool_fan_size ; ++i ) {
+ // Wait: Active -> Rendezvous
+ Impl::spinwait( m_pool_base[ rev_rank + (1<<i) ]->m_pool_state , ThreadsExec::Active );
+ }
+
+ if ( rev_rank ) {
+ m_pool_state = ThreadsExec::Rendezvous ;
+ // Wait: Rendezvous -> Active
+ Impl::spinwait( m_pool_state , ThreadsExec::Rendezvous );
+ }
+ else {
+ // Root thread does the reduction and broadcast
+
+ int accum = 0 ;
+
+ for ( int rank = 0 ; rank < m_pool_size ; ++rank ) {
+ accum += *((volatile int *) get_thread( rank )->reduce_memory());
+ }
+
+ for ( int rank = 0 ; rank < m_pool_size ; ++rank ) {
+ *((volatile int *) get_thread( rank )->reduce_memory()) = accum ;
+ }
+
+ memory_fence();
+
+ for ( int rank = 0 ; rank < m_pool_size ; ++rank ) {
+ get_thread( rank )->m_pool_state = ThreadsExec::Active ;
+ }
+ }
+
+ return *((volatile int*) reduce_memory());
+ }
+
+ //------------------------------------
+ // All-thread functions:
+
template< class FunctorType , class ArgTag >
inline
void fan_in_reduce( const FunctorType & f ) const
{
typedef Kokkos::Impl::FunctorValueJoin< FunctorType , ArgTag > Join ;
typedef Kokkos::Impl::FunctorFinal< FunctorType , ArgTag > Final ;
const int rev_rank = m_pool_size - ( m_pool_rank + 1 );
for ( int i = 0 ; i < m_pool_fan_size ; ++i ) {
ThreadsExec & fan = *m_pool_base[ rev_rank + ( 1 << i ) ] ;
Impl::spinwait( fan.m_pool_state , ThreadsExec::Active );
Join::join( f , reduce_memory() , fan.reduce_memory() );
}
if ( ! rev_rank ) {
Final::final( f , reduce_memory() );
}
}
inline
void fan_in() const
{
const int rev_rank = m_pool_size - ( m_pool_rank + 1 );
for ( int i = 0 ; i < m_pool_fan_size ; ++i ) {
Impl::spinwait( m_pool_base[rev_rank+(1<<i)]->m_pool_state , ThreadsExec::Active );
}
}
template< class FunctorType , class ArgTag >
inline
void scan_large( const FunctorType & f )
{
// Sequence of states:
// 0) Active : entry and exit state
// 1) ReductionAvailable : reduction value available
// 2) ScanAvailable : inclusive scan value available
// 3) Rendezvous : All threads inclusive scan value are available
// 4) ScanCompleted : exclusive scan value copied
typedef Kokkos::Impl::FunctorValueTraits< FunctorType , ArgTag > Traits ;
typedef Kokkos::Impl::FunctorValueJoin< FunctorType , ArgTag > Join ;
typedef Kokkos::Impl::FunctorValueInit< FunctorType , ArgTag > Init ;
typedef typename Traits::value_type scalar_type ;
const int rev_rank = m_pool_size - ( m_pool_rank + 1 );
const unsigned count = Traits::value_count( f );
scalar_type * const work_value = (scalar_type *) reduce_memory();
//--------------------------------
// Fan-in reduction with highest ranking thread as the root
for ( int i = 0 ; i < m_pool_fan_size ; ++i ) {
ThreadsExec & fan = *m_pool_base[ rev_rank + (1<<i) ];
// Wait: Active -> ReductionAvailable (or ScanAvailable)
Impl::spinwait( fan.m_pool_state , ThreadsExec::Active );
Join::join( f , work_value , fan.reduce_memory() );
}
// Copy reduction value to scan value before releasing from this phase.
for ( unsigned i = 0 ; i < count ; ++i ) { work_value[i+count] = work_value[i] ; }
if ( rev_rank ) {
// Set: Active -> ReductionAvailable
m_pool_state = ThreadsExec::ReductionAvailable ;
// Wait for contributing threads' scan value to be available.
if ( ( 1 << m_pool_fan_size ) < ( m_pool_rank + 1 ) ) {
ThreadsExec & th = *m_pool_base[ rev_rank + ( 1 << m_pool_fan_size ) ] ;
// Wait: Active -> ReductionAvailable
// Wait: ReductionAvailable -> ScanAvailable
Impl::spinwait( th.m_pool_state , ThreadsExec::Active );
Impl::spinwait( th.m_pool_state , ThreadsExec::ReductionAvailable );
Join::join( f , work_value + count , ((scalar_type *)th.reduce_memory()) + count );
}
// This thread has completed inclusive scan
// Set: ReductionAvailable -> ScanAvailable
m_pool_state = ThreadsExec::ScanAvailable ;
// Wait for all threads to complete inclusive scan
// Wait: ScanAvailable -> Rendezvous
Impl::spinwait( m_pool_state , ThreadsExec::ScanAvailable );
}
//--------------------------------
for ( int i = 0 ; i < m_pool_fan_size ; ++i ) {
ThreadsExec & fan = *m_pool_base[ rev_rank + (1<<i) ];
// Wait: ReductionAvailable -> ScanAvailable
Impl::spinwait( fan.m_pool_state , ThreadsExec::ReductionAvailable );
// Set: ScanAvailable -> Rendezvous
fan.m_pool_state = ThreadsExec::Rendezvous ;
}
// All threads have completed the inclusive scan.
// All non-root threads are in the Rendezvous state.
// Threads are free to overwrite their reduction value.
//--------------------------------
if ( ( rev_rank + 1 ) < m_pool_size ) {
// Exclusive scan: copy the previous thread's inclusive scan value
ThreadsExec & th = *m_pool_base[ rev_rank + 1 ] ; // Not the root thread
const scalar_type * const src_value = ((scalar_type *)th.reduce_memory()) + count ;
for ( unsigned j = 0 ; j < count ; ++j ) { work_value[j] = src_value[j]; }
}
else {
(void) Init::init( f , work_value );
}
//--------------------------------
// Wait for all threads to copy previous thread's inclusive scan value
// Wait for all threads: Rendezvous -> ScanCompleted
for ( int i = 0 ; i < m_pool_fan_size ; ++i ) {
Impl::spinwait( m_pool_base[ rev_rank + (1<<i) ]->m_pool_state , ThreadsExec::Rendezvous );
}
if ( rev_rank ) {
// Set: ScanAvailable -> ScanCompleted
m_pool_state = ThreadsExec::ScanCompleted ;
// Wait: ScanCompleted -> Active
Impl::spinwait( m_pool_state , ThreadsExec::ScanCompleted );
}
// Set: ScanCompleted -> Active
for ( int i = 0 ; i < m_pool_fan_size ; ++i ) {
m_pool_base[ rev_rank + (1<<i) ]->m_pool_state = ThreadsExec::Active ;
}
}
template< class FunctorType , class ArgTag >
inline
void scan_small( const FunctorType & f )
{
typedef Kokkos::Impl::FunctorValueTraits< FunctorType , ArgTag > Traits ;
typedef Kokkos::Impl::FunctorValueJoin< FunctorType , ArgTag > Join ;
typedef Kokkos::Impl::FunctorValueInit< FunctorType , ArgTag > Init ;
typedef typename Traits::value_type scalar_type ;
const int rev_rank = m_pool_size - ( m_pool_rank + 1 );
const unsigned count = Traits::value_count( f );
scalar_type * const work_value = (scalar_type *) reduce_memory();
//--------------------------------
// Fan-in reduction with highest ranking thread as the root
for ( int i = 0 ; i < m_pool_fan_size ; ++i ) {
// Wait: Active -> Rendezvous
Impl::spinwait( m_pool_base[ rev_rank + (1<<i) ]->m_pool_state , ThreadsExec::Active );
}
for ( unsigned i = 0 ; i < count ; ++i ) { work_value[i+count] = work_value[i]; }
if ( rev_rank ) {
m_pool_state = ThreadsExec::Rendezvous ;
// Wait: Rendezvous -> Active
Impl::spinwait( m_pool_state , ThreadsExec::Rendezvous );
}
else {
// Root thread does the thread-scan before releasing threads
scalar_type * ptr_prev = 0 ;
for ( int rank = 0 ; rank < m_pool_size ; ++rank ) {
scalar_type * const ptr = (scalar_type *) get_thread( rank )->reduce_memory();
if ( rank ) {
for ( unsigned i = 0 ; i < count ; ++i ) { ptr[i] = ptr_prev[ i + count ]; }
Join::join( f , ptr + count , ptr );
}
else {
(void) Init::init( f , ptr );
}
ptr_prev = ptr ;
}
}
for ( int i = 0 ; i < m_pool_fan_size ; ++i ) {
m_pool_base[ rev_rank + (1<<i) ]->m_pool_state = ThreadsExec::Active ;
}
}
//------------------------------------
/** \brief Wait for previous asynchronous functor to
* complete and release the Threads device.
* Acquire the Threads device and start this functor.
*/
static void start( void (*)( ThreadsExec & , const void * ) , const void * );
static int in_parallel();
static void fence();
static bool sleep();
static bool wake();
};
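// Rough usage sketch (not a definitive contract): a parallel driver packs its
// functor, calls ThreadsExec::start( driver_function , &functor ), lets the
// pool run asynchronously, then calls ThreadsExec::fence() before reading a
// reduction result via ThreadsExec::root_reduce_scratch().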
-//----------------------------------------------------------------------------
-//----------------------------------------------------------------------------
-
-class ThreadsExecTeamMember {
-private:
-
- enum { TEAM_REDUCE_SIZE = 512 };
-
- typedef Kokkos::Threads execution_space ;
- typedef execution_space::scratch_memory_space space ;
-
- Impl::ThreadsExec & m_exec ;
- space m_team_shared ;
- ThreadsExec * const * m_team_base ; ///< Base for team fan-in
- int m_team_shared_size ;
- int m_team_size ;
- int m_team_rank ;
- int m_team_rank_rev ;
- int m_league_size ;
- int m_league_end ;
- int m_league_rank ;
-
- inline
- void set_team_shared()
- { new( & m_team_shared ) space( ((char *) (*m_team_base)->scratch_memory()) + TEAM_REDUCE_SIZE , m_team_shared_size ); }
-
- // Fan-in and wait until the matching fan-out is called.
- // The root thread which does not wait will return true.
- // All other threads will return false during the fan-out.
- KOKKOS_INLINE_FUNCTION bool team_fan_in() const
- {
- int n , j ;
-
- // Wait for fan-in threads
- for ( n = 1 ; ( ! ( m_team_rank_rev & n ) ) && ( ( j = m_team_rank_rev + n ) < m_team_size ) ; n <<= 1 ) {
- Impl::spinwait( m_team_base[j]->m_pool_state , ThreadsExec::Active );
- }
-
- // If not root then wait for release
- if ( m_team_rank_rev ) {
- m_exec.m_pool_state = ThreadsExec::Rendezvous ;
- Impl::spinwait( m_exec.m_pool_state , ThreadsExec::Rendezvous );
- }
-
- return ! m_team_rank_rev ;
- }
-
- KOKKOS_INLINE_FUNCTION void team_fan_out() const
- {
- int n , j ;
- for ( n = 1 ; ( ! ( m_team_rank_rev & n ) ) && ( ( j = m_team_rank_rev + n ) < m_team_size ) ; n <<= 1 ) {
- m_team_base[j]->m_pool_state = ThreadsExec::Active ;
- }
- }
-
-public:
-
- KOKKOS_INLINE_FUNCTION static int team_reduce_size() { return TEAM_REDUCE_SIZE ; }
-
- KOKKOS_INLINE_FUNCTION
- const execution_space::scratch_memory_space & team_shmem() const
- { return m_team_shared ; }
-
- KOKKOS_INLINE_FUNCTION int league_rank() const { return m_league_rank ; }
- KOKKOS_INLINE_FUNCTION int league_size() const { return m_league_size ; }
- KOKKOS_INLINE_FUNCTION int team_rank() const { return m_team_rank ; }
- KOKKOS_INLINE_FUNCTION int team_size() const { return m_team_size ; }
-
- KOKKOS_INLINE_FUNCTION void team_barrier() const
- {
- team_fan_in();
- team_fan_out();
- }
-
- template<class ValueType>
- KOKKOS_INLINE_FUNCTION
- void team_broadcast(ValueType& value, const int& thread_id) const
- {
-#if ! defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
- { }
-#else
- // Make sure there is enough scratch space:
- typedef typename if_c< sizeof(ValueType) < TEAM_REDUCE_SIZE
- , ValueType , void >::type type ;
-
- type * const local_value = ((type*) m_exec.scratch_memory());
- if(team_rank() == thread_id)
- *local_value = value;
- memory_fence();
- team_barrier();
- value = *local_value;
-#endif
- }
-
- template< typename Type >
- KOKKOS_INLINE_FUNCTION Type team_reduce( const Type & value ) const
-#if ! defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
- { return Type(); }
-#else
- {
- // Make sure there is enough scratch space:
- typedef typename if_c< sizeof(Type) < ThreadsExec::REDUCE_TEAM_BASE , Type , void >::type type ;
-
- *((volatile type*) m_exec.scratch_memory() ) = value ;
-
- memory_fence();
-
- type & accum = *((type *) m_team_base[0]->scratch_memory() );
-
- if ( team_fan_in() ) {
- for ( int i = 1 ; i < m_team_size ; ++i ) {
- accum += *((type *) m_team_base[i]->scratch_memory() );
- }
- memory_fence();
- }
-
- team_fan_out();
-
- return accum ;
- }
-#endif
-
-#ifdef KOKKOS_HAVE_CXX11
- template< class ValueType, class JoinOp >
- KOKKOS_INLINE_FUNCTION ValueType
- team_reduce( const ValueType & value
- , const JoinOp & op_in ) const
- #if ! defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
- { return ValueType(); }
- #else
- {
- typedef ValueType value_type;
- const JoinLambdaAdapter<value_type,JoinOp> op(op_in);
- #endif
-#else // KOKKOS_HAVE_CXX11
- template< class JoinOp >
- KOKKOS_INLINE_FUNCTION typename JoinOp::value_type
- team_reduce( const typename JoinOp::value_type & value
- , const JoinOp & op ) const
- #if ! defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
- { return typename JoinOp::value_type(); }
- #else
- {
- typedef typename JoinOp::value_type value_type;
- #endif
-#endif // KOKKOS_HAVE_CXX11
-#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
- // Make sure there is enough scratch space:
- typedef typename if_c< sizeof(value_type) < ThreadsExec::REDUCE_TEAM_BASE
- , value_type , void >::type type ;
-
- type * const local_value = ((type*) m_exec.scratch_memory());
-
- // Set this thread's contribution
- *local_value = value ;
-
- // Fence to make sure the base team member has access:
- memory_fence();
-
- if ( team_fan_in() ) {
- // The last thread to synchronize returns true, all other threads wait for team_fan_out()
- type * const team_value = ((type*) m_team_base[0]->scratch_memory());
-
- // Join to the team value:
- for ( int i = 1 ; i < m_team_size ; ++i ) {
- op.join( *team_value , *((type*) m_team_base[i]->scratch_memory()) );
- }
-
- // Team base thread may "lap" member threads so copy out to their local value.
- for ( int i = 1 ; i < m_team_size ; ++i ) {
- *((type*) m_team_base[i]->scratch_memory()) = *team_value ;
- }
-
- // Fence to make sure all team members have access
- memory_fence();
- }
-
- team_fan_out();
-
- // Value was changed by the team base
- return *((type volatile const *) local_value);
- }
-#endif
-
- /** \brief Intra-team exclusive prefix sum with team_rank() ordering
- * with intra-team non-deterministic ordering accumulation.
- *
- * The global inter-team accumulation value will, at the end of the
- * league's parallel execution, be the scan's total.
- * Parallel execution ordering of the league's teams is non-deterministic.
- * As such the base value for each team's scan operation is similarly
- * non-deterministic.
- */
- template< typename ArgType >
- KOKKOS_INLINE_FUNCTION ArgType team_scan( const ArgType & value , ArgType * const global_accum ) const
-#if ! defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
- { return ArgType(); }
-#else
- {
- // Make sure there is enough scratch space:
- typedef typename if_c< sizeof(ArgType) < ThreadsExec::REDUCE_TEAM_BASE , ArgType , void >::type type ;
-
- volatile type * const work_value = ((type*) m_exec.scratch_memory());
-
- *work_value = value ;
-
- memory_fence();
-
- if ( team_fan_in() ) {
- // The last thread to synchronize returns true, all other threads wait for team_fan_out()
- // m_team_base[0] == highest ranking team member
- // m_team_base[ m_team_size - 1 ] == lowest ranking team member
- //
- // 1) copy from lower to higher rank, initialize lowest rank to zero
- // 2) prefix sum from lowest to highest rank, skipping lowest rank
-
- type accum = 0 ;
-
- if ( global_accum ) {
- for ( int i = m_team_size ; i-- ; ) {
- type & val = *((type*) m_team_base[i]->scratch_memory());
- accum += val ;
- }
- accum = atomic_fetch_add( global_accum , accum );
- }
-
- for ( int i = m_team_size ; i-- ; ) {
- type & val = *((type*) m_team_base[i]->scratch_memory());
- const type offset = accum ;
- accum += val ;
- val = offset ;
- }
-
- memory_fence();
- }
-
- team_fan_out();
-
- return *work_value ;
- }
-#endif
-
- /** \brief Intra-team exclusive prefix sum with team_rank() ordering.
- *
- * The highest rank thread can compute the reduction total as
- * reduction_total = dev.team_scan( value ) + value ;
- */
- template< typename ArgType >
- KOKKOS_INLINE_FUNCTION ArgType team_scan( const ArgType & value ) const
- { return this-> template team_scan<ArgType>( value , 0 ); }
-
-#ifdef KOKKOS_HAVE_CXX11
-
- /** \brief Inter-thread parallel for. Executes op(iType i) for each i=0..N-1.
- *
- * The range i=0..N-1 is mapped to all threads of the the calling thread team.
- * This functionality requires C++11 support.*/
- template< typename iType, class Operation>
- KOKKOS_INLINE_FUNCTION void team_par_for(const iType n, const Operation & op) const {
- const int chunk = ((n+m_team_size-1)/m_team_size);
- const int start = chunk*m_team_rank;
- const int end = start+chunk<n?start+chunk:n;
- for(int i=start; i<end ; i++) {
- op(i);
- }
- }
-#endif
- //----------------------------------------
- // Private for the driver
-
- template< class Arg0 , class Arg1 >
- ThreadsExecTeamMember( Impl::ThreadsExec & exec
- , const TeamPolicy< Arg0 , Arg1 , Kokkos::Threads > & team
- , const int shared_size )
- : m_exec( exec )
- , m_team_shared(0,0)
- , m_team_base(0)
- , m_team_shared_size( shared_size )
- , m_team_size(0)
- , m_team_rank(0)
- , m_team_rank_rev(0)
- , m_league_size(0)
- , m_league_end(0)
- , m_league_rank(0)
- {
- if ( team.league_size() ) {
- // Execution is using device-team interface:
-
- const int pool_rank_rev = exec.pool_size() - ( exec.pool_rank() + 1 );
- const int team_rank_rev = pool_rank_rev % team.team_alloc();
-
- // May be using fewer threads per team than a multiple of threads per core,
- // some threads will idle.
-
- if ( team_rank_rev < team.team_size() ) {
- const size_t pool_league_size = exec.pool_size() / team.team_alloc() ;
- const size_t pool_league_rank_rev = pool_rank_rev / team.team_alloc() ;
- const size_t pool_league_rank = pool_league_size - ( pool_league_rank_rev + 1 );
-
- m_team_base = exec.m_pool_base + team.team_alloc() * pool_league_rank_rev ;
- m_team_size = team.team_size() ;
- m_team_rank = team.team_size() - ( team_rank_rev + 1 );
- m_team_rank_rev = team_rank_rev ;
- m_league_size = team.league_size();
- m_league_rank = ( team.league_size() * pool_league_rank ) / pool_league_size ;
- m_league_end = ( team.league_size() * (pool_league_rank+1) ) / pool_league_size ;
-
- set_team_shared();
- }
- }
- }
-
- bool valid() const
- { return m_league_rank < m_league_end ; }
-
- void next()
- {
- if ( ++m_league_rank < m_league_end ) {
- team_barrier();
- set_team_shared();
- }
- }
-};
} /* namespace Impl */
} /* namespace Kokkos */
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
inline int Threads::in_parallel()
{ return Impl::ThreadsExec::in_parallel(); }
inline int Threads::is_initialized()
{ return Impl::ThreadsExec::is_initialized(); }
inline void Threads::initialize(
unsigned threads_count ,
unsigned use_numa_count ,
unsigned use_cores_per_numa ,
bool allow_asynchronous_threadpool )
{
Impl::ThreadsExec::initialize( threads_count , use_numa_count , use_cores_per_numa , allow_asynchronous_threadpool );
}
inline void Threads::finalize()
{
Impl::ThreadsExec::finalize();
}
inline void Threads::print_configuration( std::ostream & s , const bool detail )
{
Impl::ThreadsExec::print_configuration( s , detail );
}
inline bool Threads::sleep()
{ return Impl::ThreadsExec::sleep() ; }
inline bool Threads::wake()
{ return Impl::ThreadsExec::wake() ; }
inline void Threads::fence()
{ Impl::ThreadsExec::fence() ; }
} /* namespace Kokkos */
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
-namespace Kokkos {
-
-template< class Arg0 , class Arg1 >
-class TeamPolicy< Arg0 , Arg1 , Kokkos::Threads >
-{
-private:
-
- int m_league_size ;
- int m_team_size ;
- int m_team_alloc ;
-
- inline
- void init( const int league_size_request
- , const int team_size_request )
- {
- const int pool_size = execution_space::thread_pool_size(0);
- const int team_max = execution_space::thread_pool_size(1);
- const int team_grain = execution_space::thread_pool_size(2);
-
- m_league_size = league_size_request ;
-
- m_team_size = team_size_request < team_max ?
- team_size_request : team_max ;
-
- // Round team size up to a multiple of 'team_gain'
- const int team_size_grain = team_grain * ( ( m_team_size + team_grain - 1 ) / team_grain );
- const int team_count = pool_size / team_size_grain ;
-
- // Constraint : pool_size = m_team_alloc * team_count
- m_team_alloc = pool_size / team_count ;
- }
-
-
-public:
-
- //! Tag this class as a kokkos execution policy
- typedef TeamPolicy execution_policy ;
- typedef Kokkos::Threads execution_space ;
-
- typedef typename
- Impl::if_c< ! Impl::is_same< Kokkos::Threads , Arg0 >::value , Arg0 , Arg1 >::type
- work_tag ;
-
- //----------------------------------------
-
- template< class FunctorType >
- inline static
- int team_size_max( const FunctorType & )
- { return execution_space::thread_pool_size(1); }
-
- template< class FunctorType >
- static int team_size_recommended( const FunctorType & )
- { return execution_space::thread_pool_size(2); }
-
- //----------------------------------------
-
- inline int team_size() const { return m_team_size ; }
- inline int team_alloc() const { return m_team_alloc ; }
- inline int league_size() const { return m_league_size ; }
-
- /** \brief Specify league size, request team size */
- TeamPolicy( execution_space & , int league_size_request , int team_size_request , int vector_length_request = 1 )
- : m_league_size(0)
- , m_team_size(0)
- , m_team_alloc(0)
- { init(league_size_request,team_size_request); (void) vector_length_request; }
-
- TeamPolicy( int league_size_request , int team_size_request , int vector_length_request = 1 )
- : m_league_size(0)
- , m_team_size(0)
- , m_team_alloc(0)
- { init(league_size_request,team_size_request); (void) vector_length_request; }
-
- typedef Impl::ThreadsExecTeamMember member_type ;
-
- friend class Impl::ThreadsExecTeamMember ;
-};
-
-
-} /* namespace Kokkos */
-
-
-#ifdef KOKKOS_HAVE_CXX11
-
-namespace Kokkos {
-
-template<typename iType>
-KOKKOS_INLINE_FUNCTION
-Impl::TeamThreadLoopBoundariesStruct<iType,Impl::ThreadsExecTeamMember>
- TeamThreadLoop(const Impl::ThreadsExecTeamMember& thread, const iType& count) {
- return Impl::TeamThreadLoopBoundariesStruct<iType,Impl::ThreadsExecTeamMember>(thread,count);
-}
-
-template<typename iType>
-KOKKOS_INLINE_FUNCTION
-Impl::ThreadVectorLoopBoundariesStruct<iType,Impl::ThreadsExecTeamMember >
- ThreadVectorLoop(const Impl::ThreadsExecTeamMember& thread, const iType& count) {
- return Impl::ThreadVectorLoopBoundariesStruct<iType,Impl::ThreadsExecTeamMember >(thread,count);
-}
-
-
-KOKKOS_INLINE_FUNCTION
-Impl::ThreadSingleStruct<Impl::ThreadsExecTeamMember> PerTeam(const Impl::ThreadsExecTeamMember& thread) {
- return Impl::ThreadSingleStruct<Impl::ThreadsExecTeamMember>(thread);
-}
-
-KOKKOS_INLINE_FUNCTION
-Impl::VectorSingleStruct<Impl::ThreadsExecTeamMember> PerThread(const Impl::ThreadsExecTeamMember& thread) {
- return Impl::VectorSingleStruct<Impl::ThreadsExecTeamMember>(thread);
-}
-} // namespace Kokkos
-
-namespace Kokkos {
-
- /** \brief Inter-thread parallel_for. Executes lambda(iType i) for each i=0..N-1.
- *
- * The range i=0..N-1 is mapped to all threads of the the calling thread team.
- * This functionality requires C++11 support.*/
-template<typename iType, class Lambda>
-KOKKOS_INLINE_FUNCTION
-void parallel_for(const Impl::TeamThreadLoopBoundariesStruct<iType,Impl::ThreadsExecTeamMember>& loop_boundaries, const Lambda& lambda) {
- for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment)
- lambda(i);
-}
-
-/** \brief Inter-thread vector parallel_reduce. Executes lambda(iType i, ValueType & val) for each i=0..N-1.
- *
- * The range i=0..N-1 is mapped to all threads of the the calling thread team and a summation of
- * val is performed and put into result. This functionality requires C++11 support.*/
-template< typename iType, class Lambda, typename ValueType >
-KOKKOS_INLINE_FUNCTION
-void parallel_reduce(const Impl::TeamThreadLoopBoundariesStruct<iType,Impl::ThreadsExecTeamMember>& loop_boundaries,
- const Lambda & lambda, ValueType& result) {
-
- result = ValueType();
-
- for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
- ValueType tmp = ValueType();
- lambda(i,tmp);
- result+=tmp;
- }
-
- result = loop_boundaries.thread.team_reduce(result,Impl::JoinAdd<ValueType>());
-}
-
-/** \brief Intra-thread vector parallel_reduce. Executes lambda(iType i, ValueType & val) for each i=0..N-1.
- *
- * The range i=0..N-1 is mapped to all vector lanes of the the calling thread and a reduction of
- * val is performed using JoinType(ValueType& val, const ValueType& update) and put into init_result.
- * The input value of init_result is used as initializer for temporary variables of ValueType. Therefore
- * the input value should be the neutral element with respect to the join operation (e.g. '0 for +-' or
- * '1 for *'). This functionality requires C++11 support.*/
-template< typename iType, class Lambda, typename ValueType, class JoinType >
-KOKKOS_INLINE_FUNCTION
-void parallel_reduce(const Impl::TeamThreadLoopBoundariesStruct<iType,Impl::ThreadsExecTeamMember>& loop_boundaries,
- const Lambda & lambda, const JoinType& join, ValueType& init_result) {
-
- ValueType result = init_result;
-
- for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
- ValueType tmp = ValueType();
- lambda(i,tmp);
- join(result,tmp);
- }
-
- init_result = loop_boundaries.thread.team_reduce(result,Impl::JoinLambdaAdapter<ValueType,JoinType>(join));
-}
-
-} //namespace Kokkos
-
-
-namespace Kokkos {
-/** \brief Intra-thread vector parallel_for. Executes lambda(iType i) for each i=0..N-1.
- *
- * The range i=0..N-1 is mapped to all vector lanes of the the calling thread.
- * This functionality requires C++11 support.*/
-template<typename iType, class Lambda>
-KOKKOS_INLINE_FUNCTION
-void parallel_for(const Impl::ThreadVectorLoopBoundariesStruct<iType,Impl::ThreadsExecTeamMember >&
- loop_boundaries, const Lambda& lambda) {
- #ifdef KOKKOS_HAVE_PRAGMA_IVDEP
- #pragma ivdep
- #endif
- for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment)
- lambda(i);
-}
-
-/** \brief Intra-thread vector parallel_reduce. Executes lambda(iType i, ValueType & val) for each i=0..N-1.
- *
- * The range i=0..N-1 is mapped to all vector lanes of the the calling thread and a summation of
- * val is performed and put into result. This functionality requires C++11 support.*/
-template< typename iType, class Lambda, typename ValueType >
-KOKKOS_INLINE_FUNCTION
-void parallel_reduce(const Impl::ThreadVectorLoopBoundariesStruct<iType,Impl::ThreadsExecTeamMember >&
- loop_boundaries, const Lambda & lambda, ValueType& result) {
- result = ValueType();
-#ifdef KOKKOS_HAVE_PRAGMA_IVDEP
-#pragma ivdep
-#endif
- for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
- ValueType tmp = ValueType();
- lambda(i,tmp);
- result+=tmp;
- }
-}
-
-/** \brief Intra-thread vector parallel_reduce. Executes lambda(iType i, ValueType & val) for each i=0..N-1.
- *
- * The range i=0..N-1 is mapped to all vector lanes of the the calling thread and a reduction of
- * val is performed using JoinType(ValueType& val, const ValueType& update) and put into init_result.
- * The input value of init_result is used as initializer for temporary variables of ValueType. Therefore
- * the input value should be the neutral element with respect to the join operation (e.g. '0 for +-' or
- * '1 for *'). This functionality requires C++11 support.*/
-template< typename iType, class Lambda, typename ValueType, class JoinType >
-KOKKOS_INLINE_FUNCTION
-void parallel_reduce(const Impl::ThreadVectorLoopBoundariesStruct<iType,Impl::ThreadsExecTeamMember >&
- loop_boundaries, const Lambda & lambda, const JoinType& join, ValueType& init_result) {
-
- ValueType result = init_result;
-#ifdef KOKKOS_HAVE_PRAGMA_IVDEP
-#pragma ivdep
-#endif
- for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
- ValueType tmp = ValueType();
- lambda(i,tmp);
- join(result,tmp);
- }
- init_result = result;
-}
-
-/** \brief Intra-thread vector parallel exclusive prefix sum. Executes lambda(iType i, ValueType & val, bool final)
- * for each i=0..N-1.
- *
- * The range i=0..N-1 is mapped to all vector lanes in the thread and a scan operation is performed.
- * Depending on the target execution space the operator might be called twice: once with final=false
- * and once with final=true. When final==true val contains the prefix sum value. The contribution of this
- * "i" needs to be added to val no matter whether final==true or not. In a serial execution
- * (i.e. team_size==1) the operator is only called once with final==true. Scan_val will be set
- * to the final sum value over all vector lanes.
- * This functionality requires C++11 support.*/
-template< typename iType, class FunctorType >
-KOKKOS_INLINE_FUNCTION
-void parallel_scan(const Impl::ThreadVectorLoopBoundariesStruct<iType,Impl::ThreadsExecTeamMember >&
- loop_boundaries, const FunctorType & lambda) {
-
- typedef Kokkos::Impl::FunctorValueTraits< FunctorType , void > ValueTraits ;
- typedef typename ValueTraits::value_type value_type ;
-
- value_type scan_val = value_type();
-
-#ifdef KOKKOS_HAVE_PRAGMA_IVDEP
-#pragma ivdep
-#endif
- for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
- lambda(i,scan_val,true);
- }
-}
-
-} // namespace Kokkos
-
-namespace Kokkos {
-
-template<class FunctorType>
-KOKKOS_INLINE_FUNCTION
-void single(const Impl::VectorSingleStruct<Impl::ThreadsExecTeamMember>& single_struct, const FunctorType& lambda) {
- lambda();
-}
-
-template<class FunctorType>
-KOKKOS_INLINE_FUNCTION
-void single(const Impl::ThreadSingleStruct<Impl::ThreadsExecTeamMember>& single_struct, const FunctorType& lambda) {
- if(single_struct.team_member.team_rank()==0) lambda();
-}
-
-template<class FunctorType, class ValueType>
-KOKKOS_INLINE_FUNCTION
-void single(const Impl::VectorSingleStruct<Impl::ThreadsExecTeamMember>& single_struct, const FunctorType& lambda, ValueType& val) {
- lambda(val);
-}
-
-template<class FunctorType, class ValueType>
-KOKKOS_INLINE_FUNCTION
-void single(const Impl::ThreadSingleStruct<Impl::ThreadsExecTeamMember>& single_struct, const FunctorType& lambda, ValueType& val) {
- if(single_struct.team_member.team_rank()==0) {
- lambda(val);
- }
- single_struct.team_member.team_broadcast(val,0);
-}
-}
-#endif // KOKKOS_HAVE_CXX11
-
-//----------------------------------------------------------------------------
-//----------------------------------------------------------------------------
-
#endif /* #define KOKKOS_THREADSEXEC_HPP */
diff --git a/lib/kokkos/core/src/Threads/Kokkos_ThreadsExec_base.cpp b/lib/kokkos/core/src/Threads/Kokkos_ThreadsExec_base.cpp
index 1c875328c..40d5efd0f 100755
--- a/lib/kokkos/core/src/Threads/Kokkos_ThreadsExec_base.cpp
+++ b/lib/kokkos/core/src/Threads/Kokkos_ThreadsExec_base.cpp
@@ -1,254 +1,254 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
#include <Kokkos_Core_fwd.hpp>
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
#if defined( KOKKOS_HAVE_PTHREAD )
/* Standard 'C' Linux libraries */
#include <pthread.h>
#include <sched.h>
#include <errno.h>
/* Standard C++ libraries */
#include <cstdlib>
#include <string>
#include <iostream>
#include <stdexcept>
#include <Kokkos_Threads.hpp>
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
namespace {
pthread_mutex_t host_internal_pthread_mutex = PTHREAD_MUTEX_INITIALIZER ;
// Pthreads compatible driver.
// Recovery from an exception would require constant intra-thread health
// verification, which would negatively impact runtime. As such, simply
// abort the process.
void * internal_pthread_driver( void * )
{
try {
ThreadsExec::driver();
}
catch( const std::exception & x ) {
std::cerr << "Exception thrown from worker thread: " << x.what() << std::endl ;
std::cerr.flush();
std::abort();
}
catch( ... ) {
std::cerr << "Exception thrown from worker thread" << std::endl ;
std::cerr.flush();
std::abort();
}
return NULL ;
}
} // namespace
//----------------------------------------------------------------------------
// Spawn a thread
bool ThreadsExec::spawn()
{
bool result = false ;
pthread_attr_t attr ;
if ( 0 == pthread_attr_init( & attr ) ||
0 == pthread_attr_setscope( & attr, PTHREAD_SCOPE_SYSTEM ) ||
0 == pthread_attr_setdetachstate( & attr, PTHREAD_CREATE_DETACHED ) ) {
pthread_t pt ;
result = 0 == pthread_create( & pt, & attr, internal_pthread_driver, 0 );
}
pthread_attr_destroy( & attr );
return result ;
}
//----------------------------------------------------------------------------
bool ThreadsExec::is_process()
{
static const pthread_t master_pid = pthread_self();
return pthread_equal( master_pid , pthread_self() );
}
void ThreadsExec::global_lock()
{
pthread_mutex_lock( & host_internal_pthread_mutex );
}
void ThreadsExec::global_unlock()
{
pthread_mutex_unlock( & host_internal_pthread_mutex );
}
//----------------------------------------------------------------------------
void ThreadsExec::wait_yield( volatile int & flag , const int value )
{
while ( value == flag ) { sched_yield(); }
}
} // namespace Impl
} // namespace Kokkos
/* end #if defined( KOKKOS_HAVE_PTHREAD ) */
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
#elif defined( KOKKOS_HAVE_WINTHREAD )
/* Windows libraries */
#include <windows.h>
#include <process.h>
/* Standard C++ libraries */
#include <cstdlib>
#include <string>
#include <iostream>
#include <stdexcept>
#include <Kokkos_Threads.hpp>
//----------------------------------------------------------------------------
// Driver for each created worker thread
namespace Kokkos {
namespace Impl {
namespace {
unsigned WINAPI internal_winthread_driver( void * arg )
{
ThreadsExec::driver();
return 0 ;
}
class ThreadLockWindows {
private:
CRITICAL_SECTION m_handle ;
~ThreadLockWindows()
{ DeleteCriticalSection( & m_handle ); }
ThreadLockWindows()
{ InitializeCriticalSection( & m_handle ); }
ThreadLockWindows( const ThreadLockWindows & );
ThreadLockWindows & operator = ( const ThreadLockWindows & );
public:
static ThreadLockWindows & singleton();
void lock()
{ EnterCriticalSection( & m_handle ); }
void unlock()
{ LeaveCriticalSection( & m_handle ); }
};
ThreadLockWindows & ThreadLockWindows::singleton()
{ static ThreadLockWindows self ; return self ; }
} // namespace <>
} // namespace Impl
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
// Spawn this thread
bool ThreadsExec::spawn()
{
unsigned Win32ThreadID = 0 ;
HANDLE handle =
_beginthreadex(0,0,internal_winthread_driver,0,0, & Win32ThreadID );
return ! handle ;
}
bool ThreadsExec::is_process() { return true ; }
void ThreadsExec::global_lock()
{ ThreadLockWindows::singleton().lock(); }
void ThreadsExec::global_unlock()
{ ThreadLockWindows::singleton().unlock(); }
void ThreadsExec::wait_yield( volatile int & flag , const int value )
{
while ( value == flag ) { Sleep(0); }
}
} // namespace Impl
} // namespace Kokkos
#endif /* end #elif defined( KOKKOS_HAVE_WINTHREAD ) */
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
diff --git a/lib/kokkos/core/src/OpenMP/Kokkos_OpenMPexec.hpp b/lib/kokkos/core/src/Threads/Kokkos_ThreadsTeam.hpp
similarity index 59%
copy from lib/kokkos/core/src/OpenMP/Kokkos_OpenMPexec.hpp
copy to lib/kokkos/core/src/Threads/Kokkos_ThreadsTeam.hpp
index 82b27b97b..53b5eb01d 100755
--- a/lib/kokkos/core/src/OpenMP/Kokkos_OpenMPexec.hpp
+++ b/lib/kokkos/core/src/Threads/Kokkos_ThreadsTeam.hpp
@@ -1,758 +1,730 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
-#ifndef KOKKOS_OPENMPEXEC_HPP
-#define KOKKOS_OPENMPEXEC_HPP
+#ifndef KOKKOS_THREADSTEAM_HPP
+#define KOKKOS_THREADSTEAM_HPP
-#include <impl/Kokkos_Traits.hpp>
+#include <stdio.h>
+
+#include <utility>
#include <impl/Kokkos_spinwait.hpp>
+#include <impl/Kokkos_FunctorAdapter.hpp>
#include <Kokkos_Atomic.hpp>
+//----------------------------------------------------------------------------
+
namespace Kokkos {
namespace Impl {
//----------------------------------------------------------------------------
-/** \brief Data for OpenMP thread execution */
-
-class OpenMPexec {
-public:
-
- enum { MAX_THREAD_COUNT = 4096 };
-
-private:
-
- static int m_pool_topo[ 4 ];
- static int m_map_rank[ MAX_THREAD_COUNT ];
- static OpenMPexec * m_pool[ MAX_THREAD_COUNT ]; // Indexed by: m_pool_rank_rev
-
- friend class Kokkos::OpenMP ;
-
- int const m_pool_rank ;
- int const m_pool_rank_rev ;
- int const m_scratch_exec_end ;
- int const m_scratch_reduce_end ;
- int const m_scratch_thread_end ;
-
- int volatile m_barrier_state ;
-
- OpenMPexec();
- OpenMPexec( const OpenMPexec & );
- OpenMPexec & operator = ( const OpenMPexec & );
-
- static void clear_scratch();
-
-public:
-
- // Topology of a cache coherent thread pool:
- // TOTAL = NUMA x GRAIN
- // pool_size( depth = 0 )
- // pool_size(0) = total number of threads
- // pool_size(1) = number of threads per NUMA
- // pool_size(2) = number of threads sharing finest grain memory hierarchy
-
- inline static
- int pool_size( int depth = 0 ) { return m_pool_topo[ depth ]; }
-
- inline static
- OpenMPexec * pool_rev( int pool_rank_rev ) { return m_pool[ pool_rank_rev ]; }
-
- inline int pool_rank() const { return m_pool_rank ; }
- inline int pool_rank_rev() const { return m_pool_rank_rev ; }
- inline void * scratch_reduce() const { return ((char *) this) + m_scratch_exec_end ; }
- inline void * scratch_thread() const { return ((char *) this) + m_scratch_reduce_end ; }
+template< class > struct ThreadsExecAdapter ;
- inline
- void state_wait( int state )
- { Impl::spinwait( m_barrier_state , state ); }
-
- inline
- void state_set( int state ) { m_barrier_state = state ; }
-
- ~OpenMPexec() {}
-
- OpenMPexec( const int poolRank
- , const int scratch_exec_size
- , const int scratch_reduce_size
- , const int scratch_thread_size )
- : m_pool_rank( poolRank )
- , m_pool_rank_rev( pool_size() - ( poolRank + 1 ) )
- , m_scratch_exec_end( scratch_exec_size )
- , m_scratch_reduce_end( m_scratch_exec_end + scratch_reduce_size )
- , m_scratch_thread_end( m_scratch_reduce_end + scratch_thread_size )
- , m_barrier_state(0)
- {}
-
- static void finalize();
-
- static void initialize( const unsigned team_count ,
- const unsigned threads_per_team ,
- const unsigned numa_count ,
- const unsigned cores_per_numa );
-
- static void verify_is_process( const char * const );
- static void verify_initialized( const char * const );
-
- static void resize_scratch( size_t reduce_size , size_t thread_size );
-
- inline static
- OpenMPexec * get_thread_omp() { return m_pool[ m_map_rank[ omp_get_thread_num() ] ]; }
-};
-
-} // namespace Impl
-} // namespace Kokkos
-
-//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
-namespace Kokkos {
-namespace Impl {
-
-class OpenMPexecTeamMember {
+class ThreadsExecTeamMember {
private:
enum { TEAM_REDUCE_SIZE = 512 };
- /** \brief Thread states for team synchronization */
- enum { Active = 0 , Rendezvous = 1 };
-
- typedef Kokkos::OpenMP execution_space ;
- typedef execution_space::scratch_memory_space scratch_memory_space ;
+ typedef Kokkos::Threads execution_space ;
+ typedef execution_space::scratch_memory_space space ;
- Impl::OpenMPexec & m_exec ;
- scratch_memory_space m_team_shared ;
- int m_team_shmem ;
- int m_team_base_rev ;
- int m_team_rank_rev ;
- int m_team_rank ;
+ ThreadsExec * const m_exec ;
+ ThreadsExec * const * m_team_base ; ///< Base for team fan-in
+ space m_team_shared ;
+ int m_team_shared_size ;
int m_team_size ;
- int m_league_rank ;
- int m_league_end ;
+ int m_team_rank ;
+ int m_team_rank_rev ;
int m_league_size ;
+ int m_league_end ;
+ int m_league_rank ;
- // Fan-in team threads, root of the fan-in which does not block returns true
inline
- bool team_fan_in() const
+ void set_team_shared()
+ { new( & m_team_shared ) space( ((char *) (*m_team_base)->scratch_memory()) + TEAM_REDUCE_SIZE , m_team_shared_size ); }
+
+public:
+
+ // Fan-in and wait until the matching fan-out is called.
+ // The root thread which does not wait will return true.
+ // All other threads will return false during the fan-out.
+ KOKKOS_INLINE_FUNCTION bool team_fan_in() const
{
- for ( int n = 1 , j ; ( ( j = m_team_rank_rev + n ) < m_team_size ) && ! ( m_team_rank_rev & n ) ; n <<= 1 ) {
- m_exec.pool_rev( m_team_base_rev + j )->state_wait( Active );
+ int n , j ;
+
+ // Wait for fan-in threads
+ for ( n = 1 ; ( ! ( m_team_rank_rev & n ) ) && ( ( j = m_team_rank_rev + n ) < m_team_size ) ; n <<= 1 ) {
+ Impl::spinwait( m_team_base[j]->state() , ThreadsExec::Active );
}
+ // If not root then wait for release
if ( m_team_rank_rev ) {
- m_exec.state_set( Rendezvous );
- m_exec.state_wait( Rendezvous );
+ m_exec->state() = ThreadsExec::Rendezvous ;
+ Impl::spinwait( m_exec->state() , ThreadsExec::Rendezvous );
}
- return 0 == m_team_rank_rev ;
+ return ! m_team_rank_rev ;
}
- inline
- void team_fan_out() const
+ KOKKOS_INLINE_FUNCTION void team_fan_out() const
{
- for ( int n = 1 , j ; ( ( j = m_team_rank_rev + n ) < m_team_size ) && ! ( m_team_rank_rev & n ) ; n <<= 1 ) {
- m_exec.pool_rev( m_team_base_rev + j )->state_set( Active );
+ int n , j ;
+ for ( n = 1 ; ( ! ( m_team_rank_rev & n ) ) && ( ( j = m_team_rank_rev + n ) < m_team_size ) ; n <<= 1 ) {
+ m_team_base[j]->state() = ThreadsExec::Active ;
}
}
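// Editor's note (not part of this patch): a minimal sketch of how the
// fan-in/fan-out pair above is meant to be used; 'member' is an assumed
// ThreadsExecTeamMember inside a running team.
//
//   if ( member.team_fan_in() ) {
//     // Only the fan-in root reaches this point; the other team members
//     // spin in the Rendezvous state until released.
//   }
//   member.team_fan_out();   // releases the waiting team members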
public:
+ KOKKOS_INLINE_FUNCTION static int team_reduce_size() { return TEAM_REDUCE_SIZE ; }
+
KOKKOS_INLINE_FUNCTION
const execution_space::scratch_memory_space & team_shmem() const
{ return m_team_shared ; }
KOKKOS_INLINE_FUNCTION int league_rank() const { return m_league_rank ; }
KOKKOS_INLINE_FUNCTION int league_size() const { return m_league_size ; }
KOKKOS_INLINE_FUNCTION int team_rank() const { return m_team_rank ; }
KOKKOS_INLINE_FUNCTION int team_size() const { return m_team_size ; }
KOKKOS_INLINE_FUNCTION void team_barrier() const
-#if ! defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
- {}
-#else
{
- if ( 1 < m_team_size ) {
- team_fan_in();
- team_fan_out();
- }
+ team_fan_in();
+ team_fan_out();
}
-#endif
template<class ValueType>
KOKKOS_INLINE_FUNCTION
void team_broadcast(ValueType& value, const int& thread_id) const
{
#if ! defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
{ }
#else
// Make sure there is enough scratch space:
typedef typename if_c< sizeof(ValueType) < TEAM_REDUCE_SIZE
, ValueType , void >::type type ;
- type * const local_value = ((type*) m_exec.scratch_thread());
- if(team_rank() == thread_id)
- *local_value = value;
- memory_fence();
- team_barrier();
- value = *local_value;
+ if ( m_team_base ) {
+ type * const local_value = ((type*) m_team_base[0]->scratch_memory());
+ if(team_rank() == thread_id) *local_value = value;
+ memory_fence();
+ team_barrier();
+ value = *local_value;
+ }
#endif
}
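// Editor's note (not part of this patch): a hedged sketch of team_broadcast()
// used from inside a team functor; 'member' and 'token' are illustrative
// names, not code from this file.
//
//   int token = 0 ;
//   if ( member.team_rank() == 0 ) token = member.league_rank() ; // root produces the value
//   member.team_broadcast( token , 0 ) ;                          // every rank now holds it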
+ template< typename Type >
+ KOKKOS_INLINE_FUNCTION Type team_reduce( const Type & value ) const
+#if ! defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
+ { return Type(); }
+#else
+ {
+ // Make sure there is enough scratch space:
+ typedef typename if_c< sizeof(Type) < TEAM_REDUCE_SIZE , Type , void >::type type ;
+
+ if ( 0 == m_exec ) return value ;
+
+ *((volatile type*) m_exec->scratch_memory() ) = value ;
+
+ memory_fence();
+
+ type & accum = *((type *) m_team_base[0]->scratch_memory() );
+
+ if ( team_fan_in() ) {
+ for ( int i = 1 ; i < m_team_size ; ++i ) {
+ accum += *((type *) m_team_base[i]->scratch_memory() );
+ }
+ memory_fence();
+ }
+
+ team_fan_out();
+
+ return accum ;
+ }
+#endif
+
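// Editor's note (not part of this patch): team_reduce() sums one contribution
// per team member and every member receives the total; a minimal sketch,
// assuming a team functor body with a 'member' object as above.
//
//   const long team_total = member.team_reduce( (long) member.team_rank() );
//   // on every rank: team_total == 0 + 1 + ... + ( member.team_size() - 1 )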
#ifdef KOKKOS_HAVE_CXX11
template< class ValueType, class JoinOp >
KOKKOS_INLINE_FUNCTION ValueType
team_reduce( const ValueType & value
, const JoinOp & op_in ) const
#if ! defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
{ return ValueType(); }
#else
{
typedef ValueType value_type;
const JoinLambdaAdapter<value_type,JoinOp> op(op_in);
#endif
#else // KOKKOS_HAVE_CXX11
template< class JoinOp >
KOKKOS_INLINE_FUNCTION typename JoinOp::value_type
team_reduce( const typename JoinOp::value_type & value
, const JoinOp & op ) const
#if ! defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
{ return typename JoinOp::value_type(); }
#else
{
typedef typename JoinOp::value_type value_type;
#endif
#endif // KOKKOS_HAVE_CXX11
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
// Make sure there is enough scratch space:
typedef typename if_c< sizeof(value_type) < TEAM_REDUCE_SIZE
, value_type , void >::type type ;
- type * const local_value = ((type*) m_exec.scratch_thread());
+ if ( 0 == m_exec ) return value ;
+
+ type * const local_value = ((type*) m_exec->scratch_memory());
// Set this thread's contribution
*local_value = value ;
// Fence to make sure the base team member has access:
memory_fence();
if ( team_fan_in() ) {
// The last thread to synchronize returns true, all other threads wait for team_fan_out()
- type * const team_value = ((type*) m_exec.pool_rev( m_team_base_rev )->scratch_thread());
+ type * const team_value = ((type*) m_team_base[0]->scratch_memory());
// Join to the team value:
for ( int i = 1 ; i < m_team_size ; ++i ) {
- op.join( *team_value , *((type*) m_exec.pool_rev( m_team_base_rev + i )->scratch_thread()) );
+ op.join( *team_value , *((type*) m_team_base[i]->scratch_memory()) );
}
- // The base team member may "lap" the other team members,
- // copy to their local value before proceeding.
+ // Team base thread may "lap" member threads so copy out to their local value.
for ( int i = 1 ; i < m_team_size ; ++i ) {
- *((type*) m_exec.pool_rev( m_team_base_rev + i )->scratch_thread()) = *team_value ;
+ *((type*) m_team_base[i]->scratch_memory()) = *team_value ;
}
// Fence to make sure all team members have access
memory_fence();
}
team_fan_out();
- return *((type volatile const *)local_value);
+ // Value was changed by the team base
+ return *((type volatile const *) local_value);
}
#endif
+
/** \brief Intra-team exclusive prefix sum with team_rank() ordering
* with intra-team non-deterministic ordering accumulation.
*
* The global inter-team accumulation value will, at the end of the
* league's parallel execution, be the scan's total.
* Parallel execution ordering of the league's teams is non-deterministic.
* As such the base value for each team's scan operation is similarly
* non-deterministic.
*/
template< typename ArgType >
KOKKOS_INLINE_FUNCTION ArgType team_scan( const ArgType & value , ArgType * const global_accum ) const
#if ! defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
{ return ArgType(); }
#else
{
// Make sure there is enough scratch space:
typedef typename if_c< sizeof(ArgType) < TEAM_REDUCE_SIZE , ArgType , void >::type type ;
- volatile type * const work_value = ((type*) m_exec.scratch_thread());
+ if ( 0 == m_exec ) return type(0);
+
+ volatile type * const work_value = ((type*) m_exec->scratch_memory());
*work_value = value ;
memory_fence();
if ( team_fan_in() ) {
// The last thread to synchronize returns true, all other threads wait for team_fan_out()
// m_team_base[0] == highest ranking team member
// m_team_base[ m_team_size - 1 ] == lowest ranking team member
//
// 1) copy from lower to higher rank, initialize lowest rank to zero
// 2) prefix sum from lowest to highest rank, skipping lowest rank
type accum = 0 ;
if ( global_accum ) {
for ( int i = m_team_size ; i-- ; ) {
- type & val = *((type*) m_exec.pool_rev( m_team_base_rev + i )->scratch_thread());
+ type & val = *((type*) m_team_base[i]->scratch_memory());
accum += val ;
}
accum = atomic_fetch_add( global_accum , accum );
}
for ( int i = m_team_size ; i-- ; ) {
- type & val = *((type*) m_exec.pool_rev( m_team_base_rev + i )->scratch_thread());
- const type offset = accum ;
+ type & val = *((type*) m_team_base[i]->scratch_memory());
+ const type offset = accum ;
accum += val ;
val = offset ;
}
memory_fence();
}
team_fan_out();
return *work_value ;
}
#endif
/** \brief Intra-team exclusive prefix sum with team_rank() ordering.
*
* The highest rank thread can compute the reduction total as
* reduction_total = dev.team_scan( value ) + value ;
*/
- template< typename Type >
- KOKKOS_INLINE_FUNCTION Type team_scan( const Type & value ) const
- { return this-> template team_scan<Type>( value , 0 ); }
-
-#ifdef KOKKOS_HAVE_CXX11
+ template< typename ArgType >
+ KOKKOS_INLINE_FUNCTION ArgType team_scan( const ArgType & value ) const
+ { return this-> template team_scan<ArgType>( value , 0 ); }
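// Editor's note (not part of this patch): team_scan() is an exclusive prefix
// sum ordered by team_rank(); a sketch with illustrative names.
//
//   const int mine   = 1 ;                          // each rank contributes 1
//   const int offset = member.team_scan( mine ) ;   // equals team_rank() here
//   const int total  = offset + mine ;              // reduction total on the highest rank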
- /** \brief Inter-thread parallel for. Executes op(iType i) for each i=0..N-1.
- *
- * The range i=0..N-1 is mapped to all threads of the the calling thread team.
- * This functionality requires C++11 support.*/
- template< typename iType, class Operation>
- KOKKOS_INLINE_FUNCTION void team_par_for(const iType n, const Operation & op) const {
- const int chunk = ((n+m_team_size-1)/m_team_size);
- const int start = chunk*m_team_rank;
- const int end = start+chunk<n?start+chunk:n;
- for(int i=start; i<end ; i++) {
- op(i);
- }
- }
-#endif
//----------------------------------------
// Private for the driver
-private:
-
- typedef execution_space::scratch_memory_space space ;
-
-public:
-
template< class Arg0 , class Arg1 >
- inline
- OpenMPexecTeamMember( Impl::OpenMPexec & exec
- , const TeamPolicy< Arg0 , Arg1 , Kokkos::OpenMP > & team
- , const int shmem_size
- )
+ ThreadsExecTeamMember( Impl::ThreadsExec * exec
+ , const TeamPolicy< Arg0 , Arg1 , Kokkos::Threads > & team
+ , const int shared_size )
: m_exec( exec )
+ , m_team_base(0)
, m_team_shared(0,0)
- , m_team_shmem( shmem_size )
- , m_team_base_rev(0)
- , m_team_rank_rev(0)
+ , m_team_shared_size( shared_size )
+ , m_team_size(0)
, m_team_rank(0)
- , m_team_size( team.team_size() )
- , m_league_rank(0)
+ , m_team_rank_rev(0)
+ , m_league_size(0)
, m_league_end(0)
- , m_league_size( team.league_size() )
+ , m_league_rank(0)
{
- const int pool_rank_rev = m_exec.pool_rank_rev();
- const int pool_team_rank_rev = pool_rank_rev % team.team_alloc();
- const int pool_league_rank_rev = pool_rank_rev / team.team_alloc();
- const int league_iter_end = team.league_size() - pool_league_rank_rev * team.team_iter();
-
- if ( pool_team_rank_rev < m_team_size && 0 < league_iter_end ) {
- m_team_base_rev = team.team_alloc() * pool_league_rank_rev ;
- m_team_rank_rev = pool_team_rank_rev ;
- m_team_rank = m_team_size - ( m_team_rank_rev + 1 );
- m_league_end = league_iter_end ;
- m_league_rank = league_iter_end > team.team_iter() ? league_iter_end - team.team_iter() : 0 ;
- new( (void*) &m_team_shared ) space( ( (char*) m_exec.pool_rev(m_team_base_rev)->scratch_thread() ) + TEAM_REDUCE_SIZE , m_team_shmem );
+ if ( team.league_size() ) {
+ // Execution is using device-team interface:
+
+ const int pool_rank_rev = m_exec->pool_size() - ( m_exec->pool_rank() + 1 );
+ const int team_rank_rev = pool_rank_rev % team.team_alloc();
+
+ // May be using fewer threads per team than a multiple of threads per core,
+ // so some threads will idle.
+
+ if ( team_rank_rev < team.team_size() ) {
+ const size_t pool_league_size = m_exec->pool_size() / team.team_alloc() ;
+ const size_t pool_league_rank_rev = pool_rank_rev / team.team_alloc() ;
+ const size_t pool_league_rank = pool_league_size - ( pool_league_rank_rev + 1 );
+
+ m_team_base = m_exec->pool_base() + team.team_alloc() * pool_league_rank_rev ;
+ m_team_size = team.team_size() ;
+ m_team_rank = team.team_size() - ( team_rank_rev + 1 );
+ m_team_rank_rev = team_rank_rev ;
+ m_league_size = team.league_size();
+
+ m_league_rank = ( team.league_size() * pool_league_rank ) / pool_league_size ;
+ m_league_end = ( team.league_size() * (pool_league_rank+1) ) / pool_league_size ;
+
+ set_team_shared();
+ }
}
}
+ ThreadsExecTeamMember()
+ : m_exec(0)
+ , m_team_base(0)
+ , m_team_shared(0,0)
+ , m_team_shared_size(0)
+ , m_team_size(1)
+ , m_team_rank(0)
+ , m_team_rank_rev(0)
+ , m_league_size(1)
+ , m_league_end(0)
+ , m_league_rank(0)
+ {}
+
+ inline
+ ThreadsExec & threads_exec_team_base() const { return m_team_base ? **m_team_base : *m_exec ; }
+
bool valid() const
{ return m_league_rank < m_league_end ; }
void next()
{
if ( ++m_league_rank < m_league_end ) {
team_barrier();
- new( (void*) &m_team_shared ) space( ( (char*) m_exec.pool_rev(m_team_base_rev)->scratch_thread() ) + TEAM_REDUCE_SIZE , m_team_shmem );
+ set_team_shared();
}
}
- static inline int team_reduce_size() { return TEAM_REDUCE_SIZE ; }
+ void set_league_shmem( const int arg_league_rank
+ , const int arg_league_size
+ , const int arg_shmem_size
+ )
+ {
+ m_league_rank = arg_league_rank ;
+ m_league_size = arg_league_size ;
+ m_team_shared_size = arg_shmem_size ;
+ set_team_shared();
+ }
};
+} /* namespace Impl */
+} /* namespace Kokkos */
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
-} // namespace Impl
+namespace Kokkos {
template< class Arg0 , class Arg1 >
-class TeamPolicy< Arg0 , Arg1 , Kokkos::OpenMP >
+class TeamPolicy< Arg0 , Arg1 , Kokkos::Threads >
{
-public:
-
- //! Tag this class as a kokkos execution policy
- typedef TeamPolicy execution_policy ;
-
- //! Execution space of this execution policy.
- typedef Kokkos::OpenMP execution_space ;
-
- typedef typename
- Impl::if_c< ! Impl::is_same< Kokkos::OpenMP , Arg0 >::value , Arg0 , Arg1 >::type
- work_tag ;
-
- //----------------------------------------
-
- template< class FunctorType >
- inline static
- int team_size_max( const FunctorType & )
- { return execution_space::thread_pool_size(1); }
-
- template< class FunctorType >
- inline static
- int team_size_recommended( const FunctorType & )
- { return execution_space::thread_pool_size(2); }
-
- //----------------------------------------
-
private:
int m_league_size ;
int m_team_size ;
int m_team_alloc ;
- int m_team_iter ;
- inline void init( const int league_size_request
- , const int team_size_request )
- {
+ inline
+ void init( const int league_size_request
+ , const int team_size_request )
+ {
const int pool_size = execution_space::thread_pool_size(0);
const int team_max = execution_space::thread_pool_size(1);
const int team_grain = execution_space::thread_pool_size(2);
m_league_size = league_size_request ;
m_team_size = team_size_request < team_max ?
team_size_request : team_max ;
// Round team size up to a multiple of 'team_grain'
const int team_size_grain = team_grain * ( ( m_team_size + team_grain - 1 ) / team_grain );
const int team_count = pool_size / team_size_grain ;
// Constraint : pool_size = m_team_alloc * team_count
m_team_alloc = pool_size / team_count ;
+ }
- // Maxumum number of iterations each team will take:
- m_team_iter = ( m_league_size + team_count - 1 ) / team_count ;
- }
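// Editor's note (not part of this patch): a worked example of the rounding
// above with illustrative values pool_size==16, team_max>=6, team_grain==4
// and team_size_request==6 :
//   m_team_size     = 6
//   team_size_grain = 4 * ((6+4-1)/4) = 8
//   team_count      = 16 / 8 = 2
//   m_team_alloc    = 16 / 2 = 8   // each team is allocated 8 pool threads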
public:
- inline int team_size() const { return m_team_size ; }
+ //! Tag this class as a kokkos execution policy
+ typedef TeamPolicy execution_policy ;
+ typedef Kokkos::Threads execution_space ;
+
+ typedef typename
+ Impl::if_c< ! Impl::is_same< Kokkos::Threads , Arg0 >::value , Arg0 , Arg1 >::type
+ work_tag ;
+
+ //----------------------------------------
+
+ template< class FunctorType >
+ inline static
+ int team_size_max( const FunctorType & )
+ { return execution_space::thread_pool_size(1); }
+
+ template< class FunctorType >
+ static int team_size_recommended( const FunctorType & )
+ { return execution_space::thread_pool_size(2); }
+
+
+ template< class FunctorType >
+ inline static
+ int team_size_recommended( const FunctorType &, const int& )
+ { return execution_space::thread_pool_size(2); }
+
+ //----------------------------------------
+
+ inline int team_size() const { return m_team_size ; }
+ inline int team_alloc() const { return m_team_alloc ; }
inline int league_size() const { return m_league_size ; }
/** \brief Specify league size, request team size */
- TeamPolicy( execution_space & , int league_size_request , int team_size_request , int vector_length_request = 1)
- { init( league_size_request , team_size_request ); (void) vector_length_request; }
+ TeamPolicy( execution_space & , int league_size_request , int team_size_request , int vector_length_request = 1 )
+ : m_league_size(0)
+ , m_team_size(0)
+ , m_team_alloc(0)
+ { init(league_size_request,team_size_request); (void) vector_length_request; }
TeamPolicy( int league_size_request , int team_size_request , int vector_length_request = 1 )
- { init( league_size_request , team_size_request ); (void) vector_length_request; }
+ : m_league_size(0)
+ , m_team_size(0)
+ , m_team_alloc(0)
+ { init(league_size_request,team_size_request); (void) vector_length_request; }
- inline int team_alloc() const { return m_team_alloc ; }
- inline int team_iter() const { return m_team_iter ; }
+ typedef Impl::ThreadsExecTeamMember member_type ;
- typedef Impl::OpenMPexecTeamMember member_type ;
+ friend class Impl::ThreadsExecTeamMember ;
};
-} // namespace Kokkos
-//----------------------------------------------------------------------------
-//----------------------------------------------------------------------------
+} /* namespace Kokkos */
+
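// Editor's note (not part of this patch): a hedged sketch of constructing the
// policy above and dispatching a team functor on the Threads backend;
// 'MyTeamFunctor' and the sizes are illustrative assumptions.
//
//   typedef Kokkos::TeamPolicy< Kokkos::Threads > policy_type ;
//   const int league_size = 128 ;
//   const int team_size   = policy_type::team_size_recommended( MyTeamFunctor() );
//   Kokkos::parallel_for( policy_type( league_size , team_size ) , MyTeamFunctor() );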
namespace Kokkos {
-inline
-int OpenMP::thread_pool_size( int depth )
+template<typename iType>
+KOKKOS_INLINE_FUNCTION
+Impl::TeamThreadRangeBoundariesStruct<iType,Impl::ThreadsExecTeamMember>
+TeamThreadRange(const Impl::ThreadsExecTeamMember& thread, const iType& count)
{
- return Impl::OpenMPexec::pool_size(depth);
+ return Impl::TeamThreadRangeBoundariesStruct<iType,Impl::ThreadsExecTeamMember>(thread,count);
}
+template<typename iType>
KOKKOS_INLINE_FUNCTION
-int OpenMP::thread_pool_rank()
+Impl::TeamThreadRangeBoundariesStruct<iType,Impl::ThreadsExecTeamMember>
+TeamThreadRange( const Impl::ThreadsExecTeamMember& thread
+ , const iType & begin
+ , const iType & end
+ )
{
-#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
- return Impl::OpenMPexec::m_map_rank[ omp_get_thread_num() ];
-#else
- return -1 ;
-#endif
+ return Impl::TeamThreadRangeBoundariesStruct<iType,Impl::ThreadsExecTeamMember>(thread,begin,end);
}
-} // namespace Kokkos
-
-
-#ifdef KOKKOS_HAVE_CXX11
-
-namespace Kokkos {
template<typename iType>
KOKKOS_INLINE_FUNCTION
-Impl::TeamThreadLoopBoundariesStruct<iType,Impl::OpenMPexecTeamMember>
- TeamThreadLoop(const Impl::OpenMPexecTeamMember& thread, const iType& count) {
- return Impl::TeamThreadLoopBoundariesStruct<iType,Impl::OpenMPexecTeamMember>(thread,count);
+Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::ThreadsExecTeamMember >
+ ThreadVectorRange(const Impl::ThreadsExecTeamMember& thread, const iType& count) {
+ return Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::ThreadsExecTeamMember >(thread,count);
}
-template<typename iType>
-KOKKOS_INLINE_FUNCTION
-Impl::ThreadVectorLoopBoundariesStruct<iType,Impl::OpenMPexecTeamMember >
- ThreadVectorLoop(const Impl::OpenMPexecTeamMember& thread, const iType& count) {
- return Impl::ThreadVectorLoopBoundariesStruct<iType,Impl::OpenMPexecTeamMember >(thread,count);
-}
KOKKOS_INLINE_FUNCTION
-Impl::ThreadSingleStruct<Impl::OpenMPexecTeamMember> PerTeam(const Impl::OpenMPexecTeamMember& thread) {
- return Impl::ThreadSingleStruct<Impl::OpenMPexecTeamMember>(thread);
+Impl::ThreadSingleStruct<Impl::ThreadsExecTeamMember> PerTeam(const Impl::ThreadsExecTeamMember& thread) {
+ return Impl::ThreadSingleStruct<Impl::ThreadsExecTeamMember>(thread);
}
KOKKOS_INLINE_FUNCTION
-Impl::VectorSingleStruct<Impl::OpenMPexecTeamMember> PerThread(const Impl::OpenMPexecTeamMember& thread) {
- return Impl::VectorSingleStruct<Impl::OpenMPexecTeamMember>(thread);
+Impl::VectorSingleStruct<Impl::ThreadsExecTeamMember> PerThread(const Impl::ThreadsExecTeamMember& thread) {
+ return Impl::VectorSingleStruct<Impl::ThreadsExecTeamMember>(thread);
}
} // namespace Kokkos
namespace Kokkos {
/** \brief Inter-thread parallel_for. Executes lambda(iType i) for each i=0..N-1.
*
 * The range i=0..N-1 is mapped to all threads of the calling thread team.
* This functionality requires C++11 support.*/
template<typename iType, class Lambda>
KOKKOS_INLINE_FUNCTION
-void parallel_for(const Impl::TeamThreadLoopBoundariesStruct<iType,Impl::OpenMPexecTeamMember>& loop_boundaries, const Lambda& lambda) {
+void parallel_for(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::ThreadsExecTeamMember>& loop_boundaries, const Lambda& lambda) {
for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment)
lambda(i);
}
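// Editor's note (not part of this patch): the nested parallel_for above is
// typically called from inside a team functor; 'member', 'n', 'x', 'y' and
// 'a' are illustrative assumptions.
//
//   Kokkos::parallel_for( Kokkos::TeamThreadRange( member , n ) , [&]( const int i ) {
//     y[i] = a * x[i] + y[i] ;   // the range 0..n-1 is split across the team
//   } );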
/** \brief Inter-thread parallel_reduce. Executes lambda(iType i, ValueType & val) for each i=0..N-1.
*
 * The range i=0..N-1 is mapped to all threads of the calling thread team and a summation of
* val is performed and put into result. This functionality requires C++11 support.*/
template< typename iType, class Lambda, typename ValueType >
KOKKOS_INLINE_FUNCTION
-void parallel_reduce(const Impl::TeamThreadLoopBoundariesStruct<iType,Impl::OpenMPexecTeamMember>& loop_boundaries,
+void parallel_reduce(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::ThreadsExecTeamMember>& loop_boundaries,
const Lambda & lambda, ValueType& result) {
result = ValueType();
for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
ValueType tmp = ValueType();
lambda(i,tmp);
result+=tmp;
}
result = loop_boundaries.thread.team_reduce(result,Impl::JoinAdd<ValueType>());
}
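// Editor's note (not part of this patch): a sketch of the team-level sum
// reduction above; 'member', 'n' and 'x' are illustrative assumptions.
//
//   double row_sum = 0 ;
//   Kokkos::parallel_reduce( Kokkos::TeamThreadRange( member , n ) ,
//     [&]( const int i , double & partial ) { partial += x[i] ; } ,
//     row_sum );
//   // the final team_reduce() makes row_sum identical on every team member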
+#if defined( KOKKOS_HAVE_CXX11 )
+
/** \brief Inter-thread parallel_reduce. Executes lambda(iType i, ValueType & val) for each i=0..N-1.
*
 * The range i=0..N-1 is mapped to all threads of the calling thread team and a reduction of
* val is performed using JoinType(ValueType& val, const ValueType& update) and put into init_result.
* The input value of init_result is used as initializer for temporary variables of ValueType. Therefore
* the input value should be the neutral element with respect to the join operation (e.g. '0 for +-' or
* '1 for *'). This functionality requires C++11 support.*/
template< typename iType, class Lambda, typename ValueType, class JoinType >
KOKKOS_INLINE_FUNCTION
-void parallel_reduce(const Impl::TeamThreadLoopBoundariesStruct<iType,Impl::OpenMPexecTeamMember>& loop_boundaries,
+void parallel_reduce(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::ThreadsExecTeamMember>& loop_boundaries,
const Lambda & lambda, const JoinType& join, ValueType& init_result) {
ValueType result = init_result;
for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
ValueType tmp = ValueType();
lambda(i,tmp);
join(result,tmp);
}
- init_result = loop_boundaries.thread.team_reduce(result,join);
+ init_result = loop_boundaries.thread.team_reduce(result,Impl::JoinLambdaAdapter<ValueType,JoinType>(join));
}
+#endif /* #if defined( KOKKOS_HAVE_CXX11 ) */
+
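// Editor's note (not part of this patch): a sketch of the join-based overload
// above computing a maximum; 'member', 'n' and the non-negative values x[i]
// are illustrative assumptions ('0' is then a valid neutral element).
//
//   double row_max = 0 ;
//   Kokkos::parallel_reduce( Kokkos::TeamThreadRange( member , n ) ,
//     [&]( const int i , double & v ) { if ( x[i] > v ) v = x[i] ; } ,
//     [&]( double & dst , const double & src ) { if ( src > dst ) dst = src ; } ,
//     row_max );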
} //namespace Kokkos
namespace Kokkos {
/** \brief Intra-thread vector parallel_for. Executes lambda(iType i) for each i=0..N-1.
*
 * The range i=0..N-1 is mapped to all vector lanes of the calling thread.
* This functionality requires C++11 support.*/
template<typename iType, class Lambda>
KOKKOS_INLINE_FUNCTION
-void parallel_for(const Impl::ThreadVectorLoopBoundariesStruct<iType,Impl::OpenMPexecTeamMember >&
+void parallel_for(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::ThreadsExecTeamMember >&
loop_boundaries, const Lambda& lambda) {
#ifdef KOKKOS_HAVE_PRAGMA_IVDEP
#pragma ivdep
#endif
for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment)
lambda(i);
}
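// Editor's note (not part of this patch): the usual nesting of a vector-level
// loop inside a team-level loop; 'member', 'nrow', 'ncol' and 'A' are
// illustrative assumptions.
//
//   Kokkos::parallel_for( Kokkos::TeamThreadRange( member , nrow ) , [&]( const int i ) {
//     Kokkos::parallel_for( Kokkos::ThreadVectorRange( member , ncol ) , [&]( const int j ) {
//       A[i][j] = 0 ;
//     } );
//   } );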
/** \brief Intra-thread vector parallel_reduce. Executes lambda(iType i, ValueType & val) for each i=0..N-1.
*
 * The range i=0..N-1 is mapped to all vector lanes of the calling thread and a summation of
* val is performed and put into result. This functionality requires C++11 support.*/
template< typename iType, class Lambda, typename ValueType >
KOKKOS_INLINE_FUNCTION
-void parallel_reduce(const Impl::ThreadVectorLoopBoundariesStruct<iType,Impl::OpenMPexecTeamMember >&
+void parallel_reduce(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::ThreadsExecTeamMember >&
loop_boundaries, const Lambda & lambda, ValueType& result) {
result = ValueType();
#ifdef KOKKOS_HAVE_PRAGMA_IVDEP
#pragma ivdep
#endif
for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
ValueType tmp = ValueType();
lambda(i,tmp);
result+=tmp;
}
}
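// Editor's note (not part of this patch): a per-thread vector reduction, e.g.
// one row of a matrix-vector product; names are illustrative assumptions.
//
//   double dot = 0 ;
//   Kokkos::parallel_reduce( Kokkos::ThreadVectorRange( member , ncol ) ,
//     [&]( const int j , double & partial ) { partial += A[i][j] * x[j] ; } ,
//     dot );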
/** \brief Intra-thread vector parallel_reduce. Executes lambda(iType i, ValueType & val) for each i=0..N-1.
*
 * The range i=0..N-1 is mapped to all vector lanes of the calling thread and a reduction of
* val is performed using JoinType(ValueType& val, const ValueType& update) and put into init_result.
* The input value of init_result is used as initializer for temporary variables of ValueType. Therefore
* the input value should be the neutral element with respect to the join operation (e.g. '0 for +-' or
* '1 for *'). This functionality requires C++11 support.*/
template< typename iType, class Lambda, typename ValueType, class JoinType >
KOKKOS_INLINE_FUNCTION
-void parallel_reduce(const Impl::ThreadVectorLoopBoundariesStruct<iType,Impl::OpenMPexecTeamMember >&
+void parallel_reduce(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::ThreadsExecTeamMember >&
loop_boundaries, const Lambda & lambda, const JoinType& join, ValueType& init_result) {
ValueType result = init_result;
#ifdef KOKKOS_HAVE_PRAGMA_IVDEP
#pragma ivdep
#endif
for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
ValueType tmp = ValueType();
lambda(i,tmp);
join(result,tmp);
}
init_result = result;
}
/** \brief Intra-thread vector parallel exclusive prefix sum. Executes lambda(iType i, ValueType & val, bool final)
* for each i=0..N-1.
*
* The range i=0..N-1 is mapped to all vector lanes in the thread and a scan operation is performed.
* Depending on the target execution space the operator might be called twice: once with final=false
* and once with final=true. When final==true val contains the prefix sum value. The contribution of this
* "i" needs to be added to val no matter whether final==true or not. In a serial execution
* (i.e. team_size==1) the operator is only called once with final==true. Scan_val will be set
* to the final sum value over all vector lanes.
* This functionality requires C++11 support.*/
template< typename iType, class FunctorType >
KOKKOS_INLINE_FUNCTION
-void parallel_scan(const Impl::ThreadVectorLoopBoundariesStruct<iType,Impl::OpenMPexecTeamMember >&
+void parallel_scan(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::ThreadsExecTeamMember >&
loop_boundaries, const FunctorType & lambda) {
typedef Kokkos::Impl::FunctorValueTraits< FunctorType , void > ValueTraits ;
typedef typename ValueTraits::value_type value_type ;
value_type scan_val = value_type();
#ifdef KOKKOS_HAVE_PRAGMA_IVDEP
#pragma ivdep
#endif
for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
lambda(i,scan_val,true);
}
}
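// Editor's note (not part of this patch): an exclusive prefix sum over the
// vector lanes of one thread; 'counts' and 'offsets' are illustrative.
//
//   Kokkos::parallel_scan( Kokkos::ThreadVectorRange( member , n ) ,
//     [&]( const int i , int & partial , const bool final ) {
//       if ( final ) offsets[i] = partial ;   // exclusive prefix sum so far
//       partial += counts[i] ;                // contribute whether or not 'final'
//     } );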
} // namespace Kokkos
namespace Kokkos {
template<class FunctorType>
KOKKOS_INLINE_FUNCTION
-void single(const Impl::VectorSingleStruct<Impl::OpenMPexecTeamMember>& single_struct, const FunctorType& lambda) {
+void single(const Impl::VectorSingleStruct<Impl::ThreadsExecTeamMember>& single_struct, const FunctorType& lambda) {
lambda();
}
template<class FunctorType>
KOKKOS_INLINE_FUNCTION
-void single(const Impl::ThreadSingleStruct<Impl::OpenMPexecTeamMember>& single_struct, const FunctorType& lambda) {
+void single(const Impl::ThreadSingleStruct<Impl::ThreadsExecTeamMember>& single_struct, const FunctorType& lambda) {
if(single_struct.team_member.team_rank()==0) lambda();
}
template<class FunctorType, class ValueType>
KOKKOS_INLINE_FUNCTION
-void single(const Impl::VectorSingleStruct<Impl::OpenMPexecTeamMember>& single_struct, const FunctorType& lambda, ValueType& val) {
+void single(const Impl::VectorSingleStruct<Impl::ThreadsExecTeamMember>& single_struct, const FunctorType& lambda, ValueType& val) {
lambda(val);
}
template<class FunctorType, class ValueType>
KOKKOS_INLINE_FUNCTION
-void single(const Impl::ThreadSingleStruct<Impl::OpenMPexecTeamMember>& single_struct, const FunctorType& lambda, ValueType& val) {
+void single(const Impl::ThreadSingleStruct<Impl::ThreadsExecTeamMember>& single_struct, const FunctorType& lambda, ValueType& val) {
if(single_struct.team_member.team_rank()==0) {
lambda(val);
}
single_struct.team_member.team_broadcast(val,0);
}
}
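// Editor's note (not part of this patch): a sketch of the single() overloads
// above; 'member' and 'flags' are illustrative assumptions.
//
//   Kokkos::single( Kokkos::PerTeam( member ) , [&]() {
//     flags[ member.league_rank() ] = 1 ;          // executed by team rank 0 only
//   } );
//
//   int next = 0 ;
//   Kokkos::single( Kokkos::PerTeam( member ) , [&]( int & v ) {
//     v = member.league_rank() ;                   // rank 0 computes the value ...
//   } , next );                                    // ... which is broadcast into 'next'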
-#endif // KOKKOS_HAVE_CXX11
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
-#endif /* #ifndef KOKKOS_OPENMPEXEC_HPP */
+#endif /* #define KOKKOS_THREADSTEAM_HPP */
diff --git a/lib/kokkos/core/src/Threads/Kokkos_Threads_Parallel.hpp b/lib/kokkos/core/src/Threads/Kokkos_Threads_Parallel.hpp
index 4bb1b25f8..4b2a16912 100755
--- a/lib/kokkos/core/src/Threads/Kokkos_Threads_Parallel.hpp
+++ b/lib/kokkos/core/src/Threads/Kokkos_Threads_Parallel.hpp
@@ -1,427 +1,427 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_THREADS_PARALLEL_HPP
#define KOKKOS_THREADS_PARALLEL_HPP
#include <vector>
#include <Kokkos_Parallel.hpp>
#include <impl/Kokkos_StaticAssert.hpp>
#include <impl/Kokkos_FunctorAdapter.hpp>
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
template< class FunctorType , class Arg0 , class Arg1 , class Arg2 >
class ParallelFor< FunctorType , Kokkos::RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::Threads > >
{
private:
typedef Kokkos::RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::Threads > Policy ;
const FunctorType m_func ;
const Policy m_policy ;
template< class PType >
KOKKOS_FORCEINLINE_FUNCTION static
void driver( typename Impl::enable_if<
( Impl::is_same< typename PType::work_tag , void >::value )
, const FunctorType & >::type functor
, const PType & range )
{
const typename PType::member_type e = range.end();
for ( typename PType::member_type i = range.begin() ; i < e ; ++i ) {
functor( i );
}
}
template< class PType >
KOKKOS_FORCEINLINE_FUNCTION static
void driver( typename Impl::enable_if<
( ! Impl::is_same< typename PType::work_tag , void >::value )
, const FunctorType & >::type functor
, const PType & range )
{
const typename PType::member_type e = range.end();
for ( typename PType::member_type i = range.begin() ; i < e ; ++i ) {
functor( typename PType::work_tag() , i );
}
}
static void execute( ThreadsExec & exec , const void * arg )
{
const ParallelFor & self = * ((const ParallelFor *) arg );
driver( self.m_func , typename Policy::WorkRange( self.m_policy , exec.pool_rank() , exec.pool_size() ) );
exec.fan_in();
}
public:
ParallelFor( const FunctorType & functor
, const Policy & policy )
: m_func( functor )
, m_policy( policy )
{
ThreadsExec::start( & ParallelFor::execute , this );
ThreadsExec::fence();
}
};
template< class FunctorType , class Arg0 , class Arg1 >
class ParallelFor< FunctorType , Kokkos::TeamPolicy< Arg0 , Arg1 , Kokkos::Threads > >
{
private:
typedef TeamPolicy< Arg0 , Arg1 , Kokkos::Threads > Policy ;
const FunctorType m_func ;
const Policy m_policy ;
const int m_shared ;
template< class TagType >
KOKKOS_FORCEINLINE_FUNCTION
void driver( typename Impl::enable_if< Impl::is_same< TagType , void >::value ,
const typename Policy::member_type & >::type member ) const
{ m_func( member ); }
template< class TagType >
KOKKOS_FORCEINLINE_FUNCTION
void driver( typename Impl::enable_if< ! Impl::is_same< TagType , void >::value ,
const typename Policy::member_type & >::type member ) const
{ m_func( TagType() , member ); }
static void execute( ThreadsExec & exec , const void * arg )
{
const ParallelFor & self = * ((const ParallelFor *) arg );
- typename Policy::member_type member( exec , self.m_policy , self.m_shared );
+ typename Policy::member_type member( & exec , self.m_policy , self.m_shared );
for ( ; member.valid() ; member.next() ) {
self.ParallelFor::template driver< typename Policy::work_tag >( member );
}
exec.fan_in();
}
public:
ParallelFor( const FunctorType & functor
, const Policy & policy )
: m_func( functor )
, m_policy( policy )
, m_shared( FunctorTeamShmemSize< FunctorType >::value( functor , policy.team_size() ) )
{
ThreadsExec::resize_scratch( 0 , Policy::member_type::team_reduce_size() + m_shared );
ThreadsExec::start( & ParallelFor::execute , this );
ThreadsExec::fence();
}
};
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
template< class FunctorType , class Arg0 , class Arg1 , class Arg2 >
class ParallelReduce< FunctorType , Kokkos::RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::Threads > >
{
private:
typedef Kokkos::RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::Threads > Policy ;
typedef typename Policy::work_tag work_tag ;
typedef Kokkos::Impl::FunctorValueTraits< FunctorType , work_tag > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< FunctorType , work_tag > ValueInit ;
typedef typename ValueTraits::pointer_type pointer_type ;
typedef typename ValueTraits::reference_type reference_type ;
const FunctorType m_func ;
const Policy m_policy ;
template< class PType >
KOKKOS_FORCEINLINE_FUNCTION static
void driver( typename Impl::enable_if<
( Impl::is_same< typename PType::work_tag , void >::value )
, const FunctorType & >::type functor
, reference_type update
, const PType & range )
{
const typename PType::member_type e = range.end();
for ( typename PType::member_type i = range.begin() ; i < e ; ++i ) {
functor( i , update );
}
}
template< class PType >
KOKKOS_FORCEINLINE_FUNCTION static
void driver( typename Impl::enable_if<
( ! Impl::is_same< typename PType::work_tag , void >::value )
, const FunctorType & >::type functor
, reference_type update
, const PType & range )
{
const typename PType::member_type e = range.end();
for ( typename PType::member_type i = range.begin() ; i < e ; ++i ) {
functor( typename PType::work_tag() , i , update );
}
}
static void execute( ThreadsExec & exec , const void * arg )
{
const ParallelReduce & self = * ((const ParallelReduce *) arg );
driver( self.m_func
, ValueInit::init( self.m_func , exec.reduce_memory() )
, typename Policy::WorkRange( self.m_policy , exec.pool_rank() , exec.pool_size() )
);
exec.template fan_in_reduce< FunctorType , work_tag >( self.m_func );
}
public:
template< class HostViewType >
ParallelReduce( const FunctorType & functor ,
const Policy & policy ,
const HostViewType & result_view )
: m_func( functor )
, m_policy( policy )
{
ThreadsExec::resize_scratch( ValueTraits::value_size( m_func ) , 0 );
ThreadsExec::start( & ParallelReduce::execute , this );
const pointer_type data = (pointer_type) ThreadsExec::root_reduce_scratch();
ThreadsExec::fence();
if ( result_view.ptr_on_device() ) {
const unsigned n = ValueTraits::value_count( m_func );
for ( unsigned i = 0 ; i < n ; ++i ) { result_view.ptr_on_device()[i] = data[i]; }
}
}
};
//----------------------------------------------------------------------------
template< class FunctorType , class Arg0 , class Arg1 >
class ParallelReduce< FunctorType , Kokkos::TeamPolicy< Arg0 , Arg1 , Kokkos::Threads > >
{
private:
typedef TeamPolicy< Arg0 , Arg1 , Kokkos::Threads > Policy ;
typedef typename Policy::work_tag work_tag ;
typedef Kokkos::Impl::FunctorValueTraits< FunctorType , work_tag > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< FunctorType , work_tag > ValueInit ;
typedef typename ValueTraits::pointer_type pointer_type ;
typedef typename ValueTraits::reference_type reference_type ;
const FunctorType m_func ;
const Policy m_policy ;
const int m_shared ;
template< class TagType >
KOKKOS_FORCEINLINE_FUNCTION
void driver( typename Impl::enable_if< Impl::is_same< TagType , void >::value ,
const typename Policy::member_type & >::type member
, reference_type update ) const
{ m_func( member , update ); }
template< class TagType >
KOKKOS_FORCEINLINE_FUNCTION
void driver( typename Impl::enable_if< ! Impl::is_same< TagType , void >::value ,
const typename Policy::member_type & >::type member
, reference_type update ) const
{ m_func( TagType() , member , update ); }
static void execute( ThreadsExec & exec , const void * arg )
{
const ParallelReduce & self = * ((const ParallelReduce *) arg );
// Initialize thread-local value
reference_type update = ValueInit::init( self.m_func , exec.reduce_memory() );
- typename Policy::member_type member( exec , self.m_policy , self.m_shared );
+ typename Policy::member_type member( & exec , self.m_policy , self.m_shared );
for ( ; member.valid() ; member.next() ) {
self.ParallelReduce::template driver< work_tag >( member , update );
}
exec.template fan_in_reduce< FunctorType , work_tag >( self.m_func );
}
public:
ParallelReduce( const FunctorType & functor
, const Policy & policy )
: m_func( functor )
, m_policy( policy )
, m_shared( FunctorTeamShmemSize< FunctorType >::value( functor , policy.team_size() ) )
{
ThreadsExec::resize_scratch( ValueTraits::value_size( m_func ) , Policy::member_type::team_reduce_size() + m_shared );
ThreadsExec::start( & ParallelReduce::execute , this );
ThreadsExec::fence();
}
template< class ViewType >
ParallelReduce( const FunctorType & functor
, const Policy & policy
, const ViewType & result )
: m_func( functor )
, m_policy( policy )
, m_shared( FunctorTeamShmemSize< FunctorType >::value( functor , policy.team_size() ) )
{
ThreadsExec::resize_scratch( ValueTraits::value_size( m_func ) , Policy::member_type::team_reduce_size() + m_shared );
ThreadsExec::start( & ParallelReduce::execute , this );
const pointer_type data = (pointer_type) ThreadsExec::root_reduce_scratch();
ThreadsExec::fence();
const unsigned n = ValueTraits::value_count( m_func );
for ( unsigned i = 0 ; i < n ; ++i ) { result.ptr_on_device()[i] = data[i]; }
}
};
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
template< class FunctorType , class Arg0 , class Arg1 , class Arg2 >
class ParallelScan< FunctorType , Kokkos::RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::Threads > >
{
private:
typedef Kokkos::RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::Threads > Policy ;
typedef typename Policy::work_tag work_tag ;
typedef Kokkos::Impl::FunctorValueTraits< FunctorType , work_tag > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< FunctorType , work_tag > ValueInit ;
typedef typename ValueTraits::pointer_type pointer_type ;
typedef typename ValueTraits::reference_type reference_type ;
const FunctorType m_func ;
const Policy m_policy ;
template< class PType >
KOKKOS_FORCEINLINE_FUNCTION static
void driver( typename Impl::enable_if<
( Impl::is_same< typename PType::work_tag , void >::value )
, const FunctorType & >::type functor
, reference_type update
, const bool final
, const PType & range )
{
const typename PType::member_type e = range.end();
for ( typename PType::member_type i = range.begin() ; i < e ; ++i ) {
functor( i , update , final );
}
}
template< class PType >
KOKKOS_FORCEINLINE_FUNCTION static
void driver( typename Impl::enable_if<
( ! Impl::is_same< typename PType::work_tag , void >::value )
, const FunctorType & >::type functor
, reference_type update
, const bool final
, const PType & range )
{
const typename PType::member_type e = range.end();
for ( typename PType::member_type i = range.begin() ; i < e ; ++i ) {
functor( typename PType::work_tag() , i , update , final );
}
}
static void execute( ThreadsExec & exec , const void * arg )
{
const ParallelScan & self = * ((const ParallelScan *) arg );
const typename Policy::WorkRange range( self.m_policy , exec.pool_rank() , exec.pool_size() );
reference_type update = ValueInit::init( self.m_func , exec.reduce_memory() );
driver( self.m_func , update , false , range );
// exec.<FunctorType,work_tag>scan_large( self.m_func );
exec.template scan_small<FunctorType,work_tag>( self.m_func );
driver( self.m_func , update , true , range );
exec.fan_in();
}
public:
ParallelScan( const FunctorType & functor , const Policy & policy )
: m_func( functor )
, m_policy( policy )
{
ThreadsExec::resize_scratch( 2 * ValueTraits::value_size( m_func ) , 0 );
ThreadsExec::start( & ParallelScan::execute , this );
ThreadsExec::fence();
}
};
} // namespace Impl
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
#endif /* #define KOKKOS_THREADS_PARALLEL_HPP */
diff --git a/lib/kokkos/core/src/Threads/Kokkos_Threads_TaskPolicy.cpp b/lib/kokkos/core/src/Threads/Kokkos_Threads_TaskPolicy.cpp
new file mode 100755
index 000000000..8ad7f15ec
--- /dev/null
+++ b/lib/kokkos/core/src/Threads/Kokkos_Threads_TaskPolicy.cpp
@@ -0,0 +1,599 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+// Experimental unified task-data parallel manycore LDRD
+
+#include <stdio.h>
+#include <iostream>
+#include <sstream>
+#include <Threads/Kokkos_Threads_TaskPolicy.hpp>
+
+#if defined( KOKKOS_HAVE_PTHREAD )
+
+namespace Kokkos {
+namespace Experimental {
+namespace Impl {
+
+typedef TaskMember< Kokkos::Threads , void , void > Task ;
+
+namespace {
+
+int volatile s_count_serial = 0 ;
+int volatile s_count_team = 0 ;
+Task * volatile s_ready_team = 0 ;
+Task * volatile s_ready_serial = 0 ;
+Task * const s_lock = reinterpret_cast<Task*>( ~((unsigned long)0) );
+Task * const s_denied = reinterpret_cast<Task*>( ~((unsigned long)0) - 1 );
+
+} /* namespace */
+} /* namespace Impl */
+} /* namespace Experimental */
+} /* namespace Kokkos */
+
+namespace Kokkos {
+namespace Experimental {
+
+TaskPolicy< Kokkos::Threads >::TaskPolicy
+ ( const unsigned arg_default_dependence_capacity
+ , const unsigned arg_team_size
+ )
+ : m_default_dependence_capacity( arg_default_dependence_capacity )
+ , m_team_size( arg_team_size )
+{
+ const int threads_total = Threads::thread_pool_size(0);
+ const int threads_per_numa = Threads::thread_pool_size(1);
+ const int threads_per_core = Threads::thread_pool_size(2);
+
+ if ( 0 == arg_team_size ) {
+ // If a team task then claim for execution until count is zero
+ // Issue: team collectives cannot assume which pool members are in the team.
+ // Issue: team must only span a single NUMA region.
+
+ // If more than one thread per core then map cores to work team,
+ // else map numa to work team.
+
+ if ( 1 < threads_per_core ) m_team_size = threads_per_core ;
+ else if ( 1 < threads_per_numa ) m_team_size = threads_per_numa ;
+ else m_team_size = 1 ;
+ }
+
+ // Verify a valid team size
+ const bool valid_team_size =
+ ( 0 < m_team_size && m_team_size <= threads_total ) &&
+ (
+ ( 1 == m_team_size ) ||
+ ( threads_per_core == m_team_size ) ||
+ ( threads_per_numa == m_team_size )
+ );
+
+ if ( ! valid_team_size ) {
+ std::ostringstream msg ;
+
+ msg << "Kokkos::Experimental::TaskPolicy< Kokkos::Threads > ERROR"
+ << " invalid team_size(" << m_team_size << ")"
+ << " threads_per_core(" << threads_per_core << ")"
+ << " threads_per_numa(" << threads_per_numa << ")"
+ << " threads_total(" << threads_total << ")"
+ ;
+
+ Kokkos::Impl::throw_runtime_exception( msg.str() );
+
+ }
+}
+
+TaskPolicy< Kokkos::Threads >::member_type &
+TaskPolicy< Kokkos::Threads >::member_single()
+{
+ static member_type s ;
+ return s ;
+}
+
+void wait( Kokkos::Experimental::TaskPolicy< Kokkos::Threads > & policy )
+{
+ typedef Kokkos::Impl::ThreadsExecTeamMember member_type ;
+
+ enum { BASE_SHMEM = 1024 };
+
+ void * const arg = reinterpret_cast<void*>( long( policy.m_team_size ) );
+
+ Kokkos::Impl::ThreadsExec::resize_scratch( 0 , member_type::team_reduce_size() + BASE_SHMEM );
+ Kokkos::Impl::ThreadsExec::start( & Impl::Task::execute_ready_tasks_driver , arg );
+ Kokkos::Impl::ThreadsExec::fence();
+}
+
+} /* namespace Experimental */
+} /* namespace Kokkos */
+
+namespace Kokkos {
+namespace Experimental {
+namespace Impl {
+
+//----------------------------------------------------------------------------
+
+void Task::throw_error_verify_type()
+{
+ Kokkos::Impl::throw_runtime_exception("TaskMember< Threads >::verify_type ERROR");
+}
+
+void Task::deallocate( void * ptr )
+{
+ free( ptr );
+}
+
+void * Task::allocate( const unsigned n )
+{
+ void * const ptr = malloc(n);
+
+ return ptr ;
+}
+
+Task::~TaskMember()
+{
+}
+
+//----------------------------------------------------------------------------
+
+void Task::reschedule()
+{
+ // Reschedule transitions from executing back to waiting.
+ const int old_state = atomic_compare_exchange( & m_state , int(TASK_STATE_EXECUTING) , int(TASK_STATE_WAITING) );
+
+ if ( old_state != int(TASK_STATE_EXECUTING) ) {
+
+fprintf( stderr
+ , "reschedule ERROR task[%lx] state(%d)\n"
+ , (unsigned long) this
+ , old_state
+ );
+fflush(stderr);
+
+ }
+}
+
+void Task::schedule()
+{
+ //----------------------------------------
+ // State is either constructing or already waiting.
+ // If constructing then transition to waiting.
+
+ {
+ const int old_state = atomic_compare_exchange( & m_state , int(TASK_STATE_CONSTRUCTING) , int(TASK_STATE_WAITING) );
+ Task * const waitTask = *((Task * volatile const *) & m_wait );
+ Task * const next = *((Task * volatile const *) & m_next );
+
+ if ( s_denied == waitTask || 0 != next ||
+ ( old_state != int(TASK_STATE_CONSTRUCTING) &&
+ old_state != int(TASK_STATE_WAITING) ) ) {
+ fprintf(stderr,"Task::schedule task(0x%lx) STATE ERROR: state(%d) wait(0x%lx) next(0x%lx)\n"
+ , (unsigned long) this
+ , old_state
+ , (unsigned long) waitTask
+ , (unsigned long) next );
+ fflush(stderr);
+ Kokkos::Impl::throw_runtime_exception("Kokkos::Impl::Task spawn or respawn state error");
+ }
+ }
+
+ //----------------------------------------
+ // Insert this task into another dependence that is not complete
+ // Push on to the wait queue, fails if ( s_denied == m_dep[i]->m_wait )
+
+ bool insert_in_ready_queue = true ;
+
+ for ( int i = 0 ; i < m_dep_size && insert_in_ready_queue ; ) {
+
+ Task * const task_dep = m_dep[i] ;
+ Task * const head_value_old = *((Task * volatile *) & task_dep->m_wait );
+
+ if ( s_denied == head_value_old ) {
+ // Wait queue is closed, try again with the next queue
+ ++i ;
+ }
+ else {
+
+ // Wait queue is open and not locked.
+ // If CAS succeeds then have acquired the lock.
+
+ // Have exclusive access to this task.
+      // Assign m_next assuming a successful insertion into the queue.
+ // Fence the memory assignment before attempting the CAS.
+
+ *((Task * volatile *) & m_next ) = head_value_old ;
+
+ memory_fence();
+
+ // Attempt to insert this task into the queue
+
+ Task * const wait_queue_head = atomic_compare_exchange( & task_dep->m_wait , head_value_old , this );
+
+ if ( head_value_old == wait_queue_head ) {
+ insert_in_ready_queue = false ;
+ }
+ }
+ }
+
+ //----------------------------------------
+ // All dependences are complete, insert into the ready list
+
+ if ( insert_in_ready_queue ) {
+
+ // Increment the count of ready tasks.
+ // Count is decremented when task is complete.
+
+ Task * volatile * queue = 0 ;
+
+ if ( m_serial ) {
+ atomic_increment( & s_count_serial );
+ queue = & s_ready_serial ;
+ }
+ else {
+ atomic_increment( & s_count_team );
+ queue = & s_ready_team ;
+ }
+
+ while ( insert_in_ready_queue ) {
+
+ Task * const head_value_old = *queue ;
+
+ if ( s_lock != head_value_old ) {
+        // Read the head of the ready queue; if it still matches the previous value then the CAS locks the queue.
+ // Only access via CAS
+
+ // Have exclusive access to this task, assign to head of queue, assuming successful insert
+ // Fence assignment before attempting insert.
+ *((Task * volatile *) & m_next ) = head_value_old ;
+
+ memory_fence();
+
+ Task * const ready_queue_head = atomic_compare_exchange( queue , head_value_old , this );
+
+ if ( head_value_old == ready_queue_head ) {
+ // Successful insert
+ insert_in_ready_queue = false ; // done
+ }
+ }
+ }
+ }
+}
+
+//----------------------------------------------------------------------------
+
+#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
+
+void Task::assign( Task ** const lhs_ptr , Task * rhs )
+{
+ // Increment rhs reference count.
+ if ( rhs ) { atomic_increment( & rhs->m_ref_count ); }
+
+ // Assign the pointer and retrieve the previous value.
+
+ Task * const old_lhs = atomic_exchange( lhs_ptr , rhs );
+
+ if ( old_lhs ) {
+
+ // Decrement former lhs reference count.
+ // If reference count is zero task must be complete, then delete task.
+ // Task is ready for deletion when wait == s_denied
+
+ int const count = atomic_fetch_add( & (old_lhs->m_ref_count) , -1 ) - 1 ;
+
+ // if 'count != 0' then 'old_lhs' may be deallocated before dereferencing
+ Task * const wait = count == 0 ? *((Task * const volatile *) & old_lhs->m_wait ) : (Task*) 0 ;
+
+ if ( count < 0 || ( count == 0 && wait != s_denied ) ) {
+
+ static const char msg_error_header[] = "Kokkos::Impl::TaskManager<Kokkos::Threads>::assign ERROR deleting" ;
+
+ fprintf( stderr , "%s task(0x%lx) m_ref_count(%d) , m_wait(0x%ld)\n"
+ , msg_error_header
+ , (unsigned long) old_lhs
+ , count
+ , (unsigned long) wait );
+ fflush(stderr);
+
+ Kokkos::Impl::throw_runtime_exception( msg_error_header );
+ }
+
+ if ( count == 0 ) {
+ // When 'count == 0' this thread has exclusive access to 'old_lhs'
+ const Task::function_dealloc_type d = old_lhs->m_dealloc ;
+ (*d)( old_lhs );
+ }
+ }
+}
+
+#endif
+
+//----------------------------------------------------------------------------
+
+Task * Task::get_dependence( int i ) const
+{
+ Task * const t = m_dep[i] ;
+
+ if ( Kokkos::Experimental::TASK_STATE_EXECUTING != m_state || i < 0 || m_dep_size <= i || 0 == t ) {
+
+fprintf( stderr
+ , "TaskMember< Threads >::get_dependence ERROR : task[%lx]{ state(%d) dep_size(%d) dep[%d] = %lx }\n"
+ , (unsigned long) this
+ , m_state
+ , m_dep_size
+ , i
+ , (unsigned long) t
+ );
+fflush( stderr );
+
+ Kokkos::Impl::throw_runtime_exception("TaskMember< Threads >::get_dependence ERROR");
+ }
+
+ return t ;
+}
+
+//----------------------------------------------------------------------------
+
+void Task::add_dependence( Task * before )
+{
+ if ( before != 0 ) {
+
+ int const state = *((volatile const int *) & m_state );
+
+ // Can add dependence during construction or during execution
+
+ if ( ( Kokkos::Experimental::TASK_STATE_CONSTRUCTING == state ||
+ Kokkos::Experimental::TASK_STATE_EXECUTING == state ) &&
+ m_dep_size < m_dep_capacity ) {
+
+ ++m_dep_size ;
+
+ assign( m_dep + (m_dep_size-1) , before );
+
+ memory_fence();
+ }
+ else {
+
+fprintf( stderr
+ , "TaskMember< Threads >::add_dependence ERROR : task[%lx]{ state(%d) dep_size(%d) m_dep_capacity(%d) }\n"
+ , (unsigned long) this
+ , m_state
+ , m_dep_size
+ , m_dep_capacity
+ );
+fflush( stderr );
+
+ Kokkos::Impl::throw_runtime_exception("TaskMember< Threads >::add_dependence ERROR");
+ }
+ }
+}
+
+//----------------------------------------------------------------------------
+
+void Task::clear_dependence()
+{
+ for ( int i = m_dep_size - 1 ; 0 <= i ; --i ) {
+ assign( m_dep + i , 0 );
+ }
+
+ *((volatile int *) & m_dep_size ) = 0 ;
+
+ memory_fence();
+}
+
+//----------------------------------------------------------------------------
+
+Task * Task::pop_ready_task( Task * volatile * const queue )
+{
+ Task * const task_old = *queue ;
+
+ if ( s_lock != task_old && 0 != task_old ) {
+
+ Task * const task = atomic_compare_exchange( queue , task_old , s_lock );
+
+ if ( task_old == task ) {
+
+ // May have acquired the lock and task.
+ // One or more other threads may have acquired this same task and lock
+      // due to a respawning ABA race condition.
+      // Acquisition is only certain after a successful state transition from waiting to executing.
+
+ const int old_state = atomic_compare_exchange( & task->m_state, int(TASK_STATE_WAITING), int(TASK_STATE_EXECUTING) );
+
+ if ( old_state == int(TASK_STATE_WAITING) ) {
+
+ // Transitioned this task from waiting to executing
+ // Update the queue to the next entry and release the lock
+
+ Task * const next_old = *((Task * volatile *) & task->m_next );
+
+ Task * const s = atomic_compare_exchange( queue , s_lock , next_old );
+
+ if ( s != s_lock ) {
+ fprintf(stderr,"Task::pop_ready_task( 0x%lx ) UNLOCK ERROR\n", (unsigned long) queue );
+ fflush(stderr);
+ }
+
+ *((Task * volatile *) & task->m_next ) = 0 ;
+
+ return task ;
+ }
+ else {
+ fprintf(stderr,"Task::pop_ready_task( 0x%lx ) task(0x%lx) state(%d) ERROR\n"
+ , (unsigned long) queue
+ , (unsigned long) task
+ , old_state );
+ fflush(stderr);
+ }
+ }
+ }
+
+ return (Task *) 0 ;
+}
+
+
+void Task::complete_executed_task( Task * task , volatile int * const queue_count )
+{
+ // State is either executing or if respawned then waiting,
+ // try to transition from executing to complete.
+ // Reads the current value.
+
+ const int state_old =
+ atomic_compare_exchange( & task->m_state
+ , int(Kokkos::Experimental::TASK_STATE_EXECUTING)
+ , int(Kokkos::Experimental::TASK_STATE_COMPLETE) );
+
+ if ( Kokkos::Experimental::TASK_STATE_WAITING == state_old ) {
+ task->schedule(); /* Task requested a respawn so reschedule it */
+ }
+ else if ( Kokkos::Experimental::TASK_STATE_EXECUTING != state_old ) {
+ fprintf( stderr
+ , "TaskMember< Threads >::execute_serial completion ERROR : task[%lx]{ state_old(%d) dep_size(%d) }\n"
+ , (unsigned long) & task
+ , state_old
+ , task->m_dep_size
+ );
+ fflush( stderr );
+ }
+ else {
+
+ // Clear dependences of this task before locking wait queue
+
+ task->clear_dependence();
+
+ // Stop other tasks from adding themselves to this task's wait queue.
+ // The wait queue is updated concurrently so guard with an atomic.
+    // Setting the wait queue to 'denied' marks the task as deletable by any thread.
+ // Therefore, once 'denied' the task pointer must be treated as invalid.
+
+ Task * wait_queue = *((Task * volatile *) & task->m_wait );
+ Task * wait_queue_old = 0 ;
+
+ do {
+ wait_queue_old = wait_queue ;
+ wait_queue = atomic_compare_exchange( & task->m_wait , wait_queue_old , s_denied );
+ } while ( wait_queue_old != wait_queue );
+
+ task = 0 ;
+
+ // Pop waiting tasks and schedule them
+ while ( wait_queue ) {
+ Task * const x = wait_queue ; wait_queue = x->m_next ; x->m_next = 0 ;
+ x->schedule();
+ }
+ }
+
+ atomic_decrement( queue_count );
+}
+
+//----------------------------------------------------------------------------
+
+void Task::execute_ready_tasks_driver( Kokkos::Impl::ThreadsExec & exec , const void * arg )
+{
+ typedef Kokkos::Impl::ThreadsExecTeamMember member_type ;
+
+ // Whole pool is calling this function
+
+ // Create the thread team member with shared memory for the given task.
+ const int team_size = reinterpret_cast<long>( arg );
+
+ member_type member( & exec , TeamPolicy< Kokkos::Threads >( 1 , team_size ) , 0 );
+
+ Kokkos::Impl::ThreadsExec & exec_team_base = member.threads_exec_team_base();
+
+ Task * volatile * const task_team_ptr = reinterpret_cast<Task**>( exec_team_base.reduce_memory() );
+
+ if ( member.team_fan_in() ) {
+ *task_team_ptr = 0 ;
+ Kokkos::memory_fence();
+ }
+ member.team_fan_out();
+
+ long int iteration_count = 0 ;
+
+  // Each team must iterate this loop synchronously to ensure team execution of team tasks.
+
+ while ( 0 < s_count_serial || 0 < s_count_team ) {
+
+ if ( member.team_rank() == 0 ) {
+ // Only one team member attempts to pop a team task
+ *task_team_ptr = pop_ready_task( & s_ready_team );
+ }
+
+ // Query if team acquired a team task
+ Task * const task_team = *task_team_ptr ;
+
+ if ( task_team ) {
+ // Set shared memory
+ member.set_league_shmem( 0 , 1 , task_team->m_shmem_size );
+
+ (*task_team->m_team)( task_team , member );
+
+ // Do not proceed until all members have completed the task,
+ // the task has been completed or rescheduled, and
+ // the team task pointer has been cleared.
+ if ( member.team_fan_in() ) {
+ complete_executed_task( task_team , & s_count_team );
+ *task_team_ptr = 0 ;
+ Kokkos::memory_fence();
+ }
+ member.team_fan_out();
+ }
+ else {
+ Task * const task_serial = pop_ready_task( & s_ready_serial );
+
+ if ( task_serial ) {
+ if ( task_serial->m_serial ) (*task_serial->m_serial)( task_serial );
+
+ complete_executed_task( task_serial , & s_count_serial );
+ }
+ }
+
+ ++iteration_count ;
+ }
+
+ exec.fan_in();
+}
+
+} /* namespace Impl */
+} /* namespace Experimental */
+} /* namespace Kokkos */
+
+#endif /* #if defined( KOKKOS_HAVE_PTHREAD ) */
+
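The scheduler above guards its ready and wait queues with a sentinel lock value and compare-and-swap loops. The following standalone sketch, illustrative only and written with std::atomic rather than the Kokkos atomics used in the file, restates the push half of that pattern; the names Node and push_ready are hypothetical and not part of Kokkos.

#include <atomic>

struct Node { Node * next = nullptr ; };

// Publish n->next before the CAS, just as Task::schedule() fences m_next
// before attempting to swing the queue head to the new entry.
void push_ready( std::atomic<Node*> & head , Node * n )
{
  Node * old_head = head.load();
  do {
    n->next = old_head ;
  } while ( ! head.compare_exchange_weak( old_head , n ) );
}

On failure the CAS refreshes old_head with the current head, so the retry re-links n->next before trying again, mirroring the retry loop in Task::schedule().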
diff --git a/lib/kokkos/core/src/Threads/Kokkos_Threads_TaskPolicy.hpp b/lib/kokkos/core/src/Threads/Kokkos_Threads_TaskPolicy.hpp
new file mode 100755
index 000000000..024671324
--- /dev/null
+++ b/lib/kokkos/core/src/Threads/Kokkos_Threads_TaskPolicy.hpp
@@ -0,0 +1,584 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+// Experimental unified task-data parallel manycore LDRD
+
+#ifndef KOKKOS_THREADS_TASKPOLICY_HPP
+#define KOKKOS_THREADS_TASKPOLICY_HPP
+
+
+#include <Kokkos_Threads.hpp>
+#include <Kokkos_TaskPolicy.hpp>
+
+#if defined( KOKKOS_HAVE_PTHREAD )
+
+//----------------------------------------------------------------------------
+
+namespace Kokkos {
+namespace Experimental {
+namespace Impl {
+
+/** \brief Base class for all Kokkos::Threads tasks */
+template<>
+class TaskMember< Kokkos::Threads , void , void > {
+public:
+
+ typedef void (* function_dealloc_type)( TaskMember * );
+ typedef TaskMember * (* function_verify_type) ( TaskMember * );
+ typedef void (* function_single_type) ( TaskMember * );
+ typedef void (* function_team_type) ( TaskMember * , Kokkos::Impl::ThreadsExecTeamMember & );
+
+private:
+
+ // Needed to disambiguate references to base class variables
+ // without triggering a false-positive on Intel compiler warning #955.
+ typedef TaskMember< Kokkos::Threads , void , void > SelfType ;
+
+ function_dealloc_type m_dealloc ; ///< Deallocation
+ function_verify_type m_verify ; ///< Result type verification
+ function_team_type m_team ; ///< Apply function
+ function_single_type m_serial ; ///< Apply function
+ TaskMember ** m_dep ; ///< Dependences
+ TaskMember * m_wait ; ///< Linked list of tasks waiting on this task
+ TaskMember * m_next ; ///< Linked list of tasks waiting on a different task
+ int m_dep_capacity ; ///< Capacity of dependences
+ int m_dep_size ; ///< Actual count of dependences
+ int m_shmem_size ;
+ int m_ref_count ; ///< Reference count
+ int m_state ; ///< State of the task
+
+ // 7 pointers + 5 integers
+
+#if defined( KOKKOS_HAVE_CXX11 )
+ TaskMember( const TaskMember & ) = delete ;
+ TaskMember & operator = ( const TaskMember & ) = delete ;
+#else
+ TaskMember( const TaskMember & );
+ TaskMember & operator = ( const TaskMember & );
+#endif
+
+ static void * allocate( const unsigned arg_size );
+ static void deallocate( void * );
+
+ template< class DerivedTaskType >
+ static
+ void deallocate( TaskMember * t )
+ {
+ DerivedTaskType * ptr = static_cast< DerivedTaskType * >(t);
+ ptr->~DerivedTaskType();
+ deallocate( (void*) ptr );
+ }
+
+ static TaskMember * pop_ready_task( TaskMember * volatile * const queue );
+ static void complete_executed_task( TaskMember * , volatile int * const );
+
+ static void throw_error_verify_type();
+
+protected:
+
+ TaskMember()
+ : m_dealloc(0)
+ , m_verify(0)
+ , m_team(0)
+ , m_serial(0)
+ , m_dep(0)
+ , m_wait(0)
+ , m_next(0)
+ , m_dep_capacity(0)
+ , m_dep_size(0)
+ , m_shmem_size(0)
+ , m_ref_count(0)
+ , m_state(0)
+ {}
+
+public:
+
+ static void execute_ready_tasks_driver( Kokkos::Impl::ThreadsExec & , const void * );
+
+ ~TaskMember();
+
+ template< typename ResultType >
+ KOKKOS_FUNCTION static
+ TaskMember * verify_type( TaskMember * t )
+ {
+ enum { check_type = ! Kokkos::Impl::is_same< ResultType , void >::value };
+
+ if ( check_type && t != 0 ) {
+
+ // Verify that t->m_verify is this function
+ const function_verify_type self = & TaskMember::template verify_type< ResultType > ;
+
+ if ( t->m_verify != self ) {
+ t = 0 ;
+#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
+ throw_error_verify_type();
+#endif
+ }
+ }
+ return t ;
+ }
+
+ //----------------------------------------
+  /* Inheritance requirements on task types:
+ *
+ * class DerivedTaskType
+ * : public TaskMember< Threads , DerivedType::value_type , FunctorType >
+ * { ... };
+ *
+ * class TaskMember< Threads , DerivedType::value_type , FunctorType >
+ * : public TaskMember< Threads , DerivedType::value_type , void >
+ * , public Functor
+ * { ... };
+ *
+ * If value_type != void
+ * class TaskMember< Threads , value_type , void >
+ * : public TaskMember< Threads , void , void >
+ *
+ * Allocate space for DerivedTaskType followed by TaskMember*[ dependence_capacity ]
+ *
+ */
+ //----------------------------------------
+
+ template< class DerivedTaskType , class Tag >
+ KOKKOS_FUNCTION static
+ void apply_single( typename Kokkos::Impl::enable_if< ! Kokkos::Impl::is_same< typename DerivedTaskType::result_type , void >::value
+ , TaskMember * >::type t )
+ {
+ typedef typename DerivedTaskType::functor_type functor_type ;
+ typedef typename DerivedTaskType::result_type result_type ;
+
+ DerivedTaskType & self = * static_cast< DerivedTaskType * >(t);
+
+ Kokkos::Impl::FunctorApply< functor_type , Tag , result_type & >
+ ::apply( (functor_type &) self , & self.m_result );
+ }
+
+ template< class DerivedTaskType , class Tag >
+ KOKKOS_FUNCTION static
+ void apply_single( typename Kokkos::Impl::enable_if< Kokkos::Impl::is_same< typename DerivedTaskType::result_type , void >::value
+ , TaskMember * >::type t )
+ {
+ typedef typename DerivedTaskType::functor_type functor_type ;
+
+ DerivedTaskType & self = * static_cast< DerivedTaskType * >(t);
+
+ Kokkos::Impl::FunctorApply< functor_type , Tag , void >::apply( (functor_type &) self );
+ }
+
+ //----------------------------------------
+
+ template< class DerivedTaskType , class Tag >
+ KOKKOS_FUNCTION static
+ void apply_team( typename Kokkos::Impl::enable_if<(
+ Kokkos::Impl::is_same<Tag,void>::value
+ &&
+ Kokkos::Impl::is_same<typename DerivedTaskType::result_type,void>::value
+ ), TaskMember * >::type t
+ , Kokkos::Impl::ThreadsExecTeamMember & member
+ )
+ {
+ DerivedTaskType & self = * static_cast< DerivedTaskType * >(t);
+
+ self.DerivedTaskType::functor_type::apply( member );
+ }
+
+ /** \brief Allocate and construct a task */
+ template< class DerivedTaskType , class Tag >
+ KOKKOS_FUNCTION static
+ void apply_team( typename Kokkos::Impl::enable_if<(
+ Kokkos::Impl::is_same<Tag,void>::value
+ &&
+ ! Kokkos::Impl::is_same<typename DerivedTaskType::result_type,void>::value
+ ), TaskMember * >::type t
+ , Kokkos::Impl::ThreadsExecTeamMember & member
+ )
+ {
+ DerivedTaskType & self = * static_cast< DerivedTaskType * >(t);
+
+ self.DerivedTaskType::functor_type::apply( member , self.m_result );
+ }
+
+ //----------------------------------------
+
+ /** \brief Allocate and construct a task */
+ template< class DerivedTaskType , class Tag >
+ static
+ TaskMember * create( const typename DerivedTaskType::functor_type & arg_functor
+ , const function_team_type arg_apply_team
+ , const function_single_type arg_apply_single
+ , const unsigned arg_team_shmem
+ , const unsigned arg_dependence_capacity
+ )
+ {
+ enum { padding_size = sizeof(DerivedTaskType) % sizeof(TaskMember*)
+ ? sizeof(TaskMember*) - sizeof(DerivedTaskType) % sizeof(TaskMember*) : 0 };
+ enum { derived_size = sizeof(DerivedTaskType) + padding_size };
+
+ DerivedTaskType * const task =
+ new( allocate( derived_size + sizeof(TaskMember*) * arg_dependence_capacity ) )
+ DerivedTaskType( arg_functor );
+
+ task->SelfType::m_dealloc = & TaskMember::template deallocate< DerivedTaskType > ;
+ task->SelfType::m_verify = & TaskMember::template verify_type< typename DerivedTaskType::value_type > ;
+ task->SelfType::m_team = arg_apply_team ;
+ task->SelfType::m_serial = arg_apply_single ;
+ task->SelfType::m_dep = (TaskMember**)( ((unsigned char *)task) + derived_size );
+ task->SelfType::m_dep_capacity = arg_dependence_capacity ;
+ task->SelfType::m_shmem_size = arg_team_shmem ;
+ task->SelfType::m_state = TASK_STATE_CONSTRUCTING ;
+
+ for ( unsigned i = 0 ; i < arg_dependence_capacity ; ++i ) task->SelfType::m_dep[i] = 0 ;
+
+ return static_cast< TaskMember * >( task );
+ }
+
+ void reschedule();
+ void schedule();
+
+ //----------------------------------------
+
+#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
+ static
+ void assign( TaskMember ** const lhs , TaskMember * const rhs );
+#else
+ KOKKOS_INLINE_FUNCTION static
+ void assign( TaskMember ** const lhs , TaskMember * const rhs ) {}
+#endif
+
+ TaskMember * get_dependence( int i ) const ;
+
+ KOKKOS_INLINE_FUNCTION
+ int get_dependence() const
+ { return m_dep_size ; }
+
+ void clear_dependence();
+ void add_dependence( TaskMember * before );
+
+ //----------------------------------------
+
+ typedef FutureValueTypeIsVoidError get_result_type ;
+
+ KOKKOS_INLINE_FUNCTION
+ get_result_type get() const { return get_result_type() ; }
+
+ KOKKOS_INLINE_FUNCTION
+ Kokkos::Experimental::TaskState get_state() const { return Kokkos::Experimental::TaskState( m_state ); }
+
+};
+
+/** \brief A Future< Kokkos::Threads , ResultType > will cast
+ * from TaskMember< Kokkos::Threads , void , void >
+ * to TaskMember< Kokkos::Threads , ResultType , void >
+ * to query the result.
+ */
+template< class ResultType >
+class TaskMember< Kokkos::Threads , ResultType , void >
+ : public TaskMember< Kokkos::Threads , void , void >
+{
+public:
+
+ typedef ResultType result_type ;
+
+ result_type m_result ;
+
+ typedef const result_type & get_result_type ;
+
+ KOKKOS_INLINE_FUNCTION
+ get_result_type get() const { return m_result ; }
+
+ inline
+ TaskMember() : TaskMember< Kokkos::Threads , void , void >(), m_result() {}
+
+#if defined( KOKKOS_HAVE_CXX11 )
+ TaskMember( const TaskMember & ) = delete ;
+ TaskMember & operator = ( const TaskMember & ) = delete ;
+#else
+private:
+ TaskMember( const TaskMember & );
+ TaskMember & operator = ( const TaskMember & );
+#endif
+};
+
+/** \brief Callback functions will cast
+ * from TaskMember< Kokkos::Threads , void , void >
+ * to TaskMember< Kokkos::Threads , ResultType , FunctorType >
+ * to execute work functions.
+ */
+template< class ResultType , class FunctorType >
+class TaskMember< Kokkos::Threads , ResultType , FunctorType >
+ : public TaskMember< Kokkos::Threads , ResultType , void >
+ , public FunctorType
+{
+public:
+ typedef ResultType result_type ;
+ typedef FunctorType functor_type ;
+
+ inline
+ TaskMember( const functor_type & arg_functor )
+ : TaskMember< Kokkos::Threads , ResultType , void >()
+ , functor_type( arg_functor )
+ {}
+};
+
+} /* namespace Impl */
+} /* namespace Experimental */
+} /* namespace Kokkos */
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+namespace Kokkos {
+namespace Experimental {
+
+void wait( TaskPolicy< Kokkos::Threads > & );
+
+template<>
+class TaskPolicy< Kokkos::Threads >
+{
+public:
+
+ typedef Kokkos::Threads execution_space ;
+ typedef TaskPolicy execution_policy ;
+ typedef Kokkos::Impl::ThreadsExecTeamMember member_type ;
+
+private:
+
+ typedef Impl::TaskMember< Kokkos::Threads , void , void > task_root_type ;
+
+ int m_default_dependence_capacity ;
+ int m_team_size ; ///< Fixed size of a task-team
+
+ template< class FunctorType >
+ static inline
+ const task_root_type * get_task_root( const FunctorType * f )
+ {
+ typedef Impl::TaskMember< execution_space , typename FunctorType::value_type , FunctorType > task_type ;
+ return static_cast< const task_root_type * >( static_cast< const task_type * >(f) );
+ }
+
+ template< class FunctorType >
+ static inline
+ task_root_type * get_task_root( FunctorType * f )
+ {
+ typedef Impl::TaskMember< execution_space , typename FunctorType::value_type , FunctorType > task_type ;
+ return static_cast< task_root_type * >( static_cast< task_type * >(f) );
+ }
+
+public:
+
+ // Valid team sizes are 1,
+ // Threads::pool_size(1) == threads per numa, or
+ // Threads::pool_size(2) == threads per core
+
+ TaskPolicy( const unsigned arg_default_dependence_capacity = 4
+ , const unsigned arg_team_size = 0 /* default from thread pool topology */
+ );
+
+ KOKKOS_INLINE_FUNCTION
+ TaskPolicy( const TaskPolicy & rhs )
+ : m_default_dependence_capacity( rhs.m_default_dependence_capacity )
+ , m_team_size( rhs.m_team_size )
+ {}
+
+ KOKKOS_INLINE_FUNCTION
+ TaskPolicy( const TaskPolicy & rhs
+ , const unsigned arg_default_dependence_capacity )
+ : m_default_dependence_capacity( arg_default_dependence_capacity )
+ , m_team_size( rhs.m_team_size )
+ {}
+
+ TaskPolicy & operator = ( const TaskPolicy &rhs ) {
+ m_default_dependence_capacity = rhs.m_default_dependence_capacity;
+ m_team_size = rhs.m_team_size;
+ return *this;
+ }
+
+ // Create serial-thread task
+
+ template< class FunctorType >
+ KOKKOS_INLINE_FUNCTION
+ Future< typename FunctorType::value_type , execution_space >
+ create( const FunctorType & functor
+ , const unsigned dependence_capacity = ~0u ) const
+ {
+ typedef typename FunctorType::value_type value_type ;
+ typedef Impl::TaskMember< execution_space , value_type , FunctorType > task_type ;
+
+ return Future< value_type , execution_space >(
+#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
+ task_root_type::create< task_type , void >
+ ( functor
+ , task_root_type::function_team_type(0)
+ , & task_root_type::template apply_single< task_type , void >
+ , 0
+ , ( ~0u == dependence_capacity ? m_default_dependence_capacity : dependence_capacity )
+ )
+#endif
+ );
+ }
+
+ // Create thread-team task
+
+ template< class FunctorType >
+ KOKKOS_INLINE_FUNCTION
+ Future< typename FunctorType::value_type , execution_space >
+ create_team( const FunctorType & functor
+ , const unsigned dependence_capacity = ~0u ) const
+ {
+ typedef typename FunctorType::value_type value_type ;
+ typedef Impl::TaskMember< execution_space , value_type , FunctorType > task_type ;
+
+ return Future< value_type , execution_space >(
+#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
+ task_root_type::create< task_type , void >
+ ( functor
+ , & task_root_type::template apply_team< task_type , void >
+ , task_root_type::function_single_type(0)
+ , Kokkos::Impl::FunctorTeamShmemSize< FunctorType >::value( functor , m_team_size )
+ , ( ~0u == dependence_capacity ? m_default_dependence_capacity : dependence_capacity )
+ )
+#endif
+ );
+ }
+
+ template< class A1 , class A2 , class A3 , class A4 >
+ KOKKOS_INLINE_FUNCTION
+ void add_dependence( const Future<A1,A2> & after
+ , const Future<A3,A4> & before
+ , typename Kokkos::Impl::enable_if
+ < Kokkos::Impl::is_same< typename Future<A1,A2>::execution_space , execution_space >::value
+ &&
+ Kokkos::Impl::is_same< typename Future<A3,A4>::execution_space , execution_space >::value
+ >::type * = 0
+ ) const
+ {
+#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
+ after.m_task->add_dependence( before.m_task );
+#endif
+ }
+
+ template< class FunctorType , class A3 , class A4 >
+ KOKKOS_INLINE_FUNCTION
+ void add_dependence( FunctorType * task_functor
+ , const Future<A3,A4> & before
+ , typename Kokkos::Impl::enable_if
+ < Kokkos::Impl::is_same< typename Future<A3,A4>::execution_space , execution_space >::value
+ >::type * = 0
+ ) const
+#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
+ { get_task_root(task_functor)->add_dependence( before.m_task ); }
+#else
+ {}
+#endif
+
+
+ template< class ValueType >
+ const Future< ValueType , execution_space > &
+ spawn( const Future< ValueType , execution_space > & f ) const
+ {
+#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
+ f.m_task->schedule();
+#endif
+ return f ;
+ }
+
+ template< class FunctorType >
+ KOKKOS_INLINE_FUNCTION
+ void respawn( FunctorType * task_functor ) const
+#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
+ { get_task_root(task_functor)->reschedule(); }
+#else
+ {}
+#endif
+
+ //----------------------------------------
+ // Functions for an executing task functor to query dependences,
+ // set new dependences, and respawn itself.
+
+ template< class FunctorType >
+ KOKKOS_INLINE_FUNCTION
+ Future< void , execution_space >
+ get_dependence( const FunctorType * task_functor , int i ) const
+ {
+ return Future<void,execution_space>(
+#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
+ get_task_root(task_functor)->get_dependence(i)
+#endif
+ );
+ }
+
+ template< class FunctorType >
+ KOKKOS_INLINE_FUNCTION
+ int get_dependence( const FunctorType * task_functor ) const
+#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
+ { return get_task_root(task_functor)->get_dependence(); }
+#else
+ { return 0 ; }
+#endif
+
+ template< class FunctorType >
+ KOKKOS_INLINE_FUNCTION
+ void clear_dependence( FunctorType * task_functor ) const
+#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
+ { get_task_root(task_functor)->clear_dependence(); }
+#else
+ {}
+#endif
+
+ //----------------------------------------
+
+ static member_type & member_single();
+
+ friend void wait( TaskPolicy< Kokkos::Threads > & );
+};
+
+} /* namespace Experimental */
+} /* namespace Kokkos */
+
+#endif /* #if defined( KOKKOS_HAVE_PTHREAD ) */
+
+//----------------------------------------------------------------------------
+
+#endif /* #ifndef KOKKOS_THREADS_TASKPOLICY_HPP */
+
+
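TaskMember::create() in the header above sizes each allocation as the derived task object, padded up to a multiple of sizeof(TaskMember*), followed by the dependence-pointer array. The arithmetic is restated below as a minimal sketch; the concrete sizes are hypothetical and chosen only for illustration.

#include <cstdio>

int main()
{
  const unsigned long task_size    = 44 ; // hypothetical sizeof(DerivedTaskType)
  const unsigned long ptr_size     = 8 ;  // sizeof(TaskMember*) on a 64-bit host
  const unsigned long dep_capacity = 4 ;  // default dependence capacity

  // Round the task size up to a multiple of the pointer size so the trailing
  // dependence array is aligned, as in TaskMember::create().
  const unsigned long padding = task_size % ptr_size ? ptr_size - task_size % ptr_size : 0 ;

  std::printf( "allocate %lu bytes: task %lu + pad %lu + deps %lu\n"
             , task_size + padding + ptr_size * dep_capacity
             , task_size , padding , ptr_size * dep_capacity );
  return 0 ;
}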
diff --git a/lib/kokkos/core/src/build.cuda.mac b/lib/kokkos/core/src/build.cuda.mac
deleted file mode 100755
index 8c94550b7..000000000
--- a/lib/kokkos/core/src/build.cuda.mac
+++ /dev/null
@@ -1,28 +0,0 @@
-#!/bin/bash
-
-touch KokkosCore_config.h
-
-#flags="-I../ -I./ -I../../../TPL -c -O3 -arch=sm_30 -Xcompiler -fPIC -DKOKKOS_HAVE_CUDA -DKOKKOS_HAVE_PTHREAD --compiler-bindir=/Users/mhoemme/pkg/gcc-4.7.2/bin"
-flags="-I../ -I./ -I../../../TPL -c -O3 -arch=sm_30 -Xcompiler -fPIC -DKOKKOS_HAVE_CUDA -DKOKKOS_HAVE_PTHREAD"
-CC=nvcc
-cd Cuda
-rm *.o
-$CC $flags Kokkos_Cuda_Impl.cu
-$CC $flags Kokkos_CudaSpace.cu
-cd ..
-cd impl
-rm *.o
-$CC $flags Kokkos_hwloc.cpp
-$CC $flags Kokkos_MemoryTracking.cpp
-$CC $flags Kokkos_Shape.cpp
-$CC $flags Kokkos_Error.cpp
-$CC $flags Kokkos_HostSpace.cpp
-$CC $flags Kokkos_Serial.cpp
-cd ..
-cd Threads
-rm *.o
-$CC $flags Kokkos_ThreadsExec.cpp
-$CC $flags Kokkos_ThreadsExec_base.cpp
-cd ..
-$CC -arch=sm_35 -lib -o libkokkoscore-cuda.a Cuda/*.o impl/*.o Threads/*.o
-
diff --git a/lib/kokkos/core/src/build_common.sh b/lib/kokkos/core/src/build_common.sh
deleted file mode 100755
index e029e5123..000000000
--- a/lib/kokkos/core/src/build_common.sh
+++ /dev/null
@@ -1,281 +0,0 @@
-#!/bin/bash
-
-#-----------------------------------------------------------------------------
-# Shared portion of build script for the base Kokkos functionality
-# Simple build script with options
-#-----------------------------------------------------------------------------
-if [ ! -d "${KOKKOS}" \
- -o ! -d "${KOKKOS}/src" \
- -o ! -d "${KOKKOS}/src/impl" \
- -o ! -d "${KOKKOS}/src/Cuda" \
- -o ! -d "${KOKKOS}/src/OpenMP" \
- -o ! -d "${KOKKOS}/src/Threads" \
- ] ;
-then
-echo "Must set KOKKOS to the kokkos/core directory"
-exit -1
-fi
-
-#-----------------------------------------------------------------------------
-
-INC_PATH="-I${KOKKOS}/src"
-INC_PATH="${INC_PATH} -I${KOKKOS}/../TPL"
-
-#-----------------------------------------------------------------------------
-
-while [ -n "${1}" ] ; do
-
-ARG="${1}"
-shift 1
-
-case ${ARG} in
-#----------- OPTIONS -----------
-OPT | opt | O3 | -O3 ) OPTFLAGS="${OPTFLAGS} -O3" ;;
-#-------------------------------
-DBG | dbg | g | -g ) KOKKOS_EXPRESSION_CHECK=1 ;;
-#-------------------------------
-HWLOC | hwloc ) KOKKOS_HAVE_HWLOC=${1} ; shift 1 ;;
-#-------------------------------
-MPI | mpi )
- KOKKOS_HAVE_MPI=${1} ; shift 1
- CXX="${KOKKOS_HAVE_MPI}/bin/mpicxx"
- LINK="${KOKKOS_HAVE_MPI}/bin/mpicxx"
- INC_PATH="${INC_PATH} -I${KOKKOS_HAVE_MPI}/include"
- ;;
-#-------------------------------
-OMP | omp | OpenMP )
- KOKKOS_HAVE_OPENMP=1
- ;;
-#-------------------------------
-CUDA | Cuda | cuda )
- # CUDA_ARCH options: 20 30 35
- CUDA_ARCH=${1} ; shift 1
- #
- # -x cu : process all files through the Cuda compiler as Cuda code.
- # -lib -o : produce library
- #
- NVCC="nvcc -gencode arch=compute_${CUDA_ARCH},code=sm_${CUDA_ARCH}"
- NVCC="${NVCC} -maxrregcount=64"
- NVCC="${NVCC} -Xcompiler -Wall,-ansi"
- NVCC="${NVCC} -lib -o libCuda.a -x cu"
-
- NVCC_SOURCES="${NVCC_SOURCES} ${KOKKOS}/src/Cuda/*.cu"
- LIB="${LIB} libCuda.a -L/usr/local/cuda/lib64 -lcudart -lcusparse"
- ;;#-------------------------------
-CUDA_OSX | Cuda_OSX | cuda_osx )
- # CUDA_ARCH options: 20 30 35
- CUDA_ARCH=${1} ; shift 1
- #
- # -x cu : process all files through the Cuda compiler as Cuda code.
- # -lib -o : produce library
- #
- NVCC="nvcc -gencode arch=compute_${CUDA_ARCH},code=sm_${CUDA_ARCH}"
- NVCC="${NVCC} -maxrregcount=64"
- NVCC="${NVCC} -Xcompiler -Wall,-ansi -Xcompiler -m64"
- NVCC="${NVCC} -lib -o libCuda.a -x cu"
-
- NVCC_SOURCES="${NVCC_SOURCES} ${KOKKOS}/src/Cuda/*.cu"
- LIB="${LIB} libCuda.a -Xlinker -rpath -Xlinker /Developer/NVIDIA/CUDA-5.5/lib -L /Developer/NVIDIA/CUDA-5.5/lib -lcudart -lcusparse"
- ;;
-#-------------------------------
-GNU | gnu | g++ )
- # Turn on lots of warnings and ansi compliance.
- # The Trilinos build system requires '-pedantic'
- #
- CXX="g++ -Wall -Wextra -ansi -pedantic"
- LINK="g++"
- CXX="${CXX} -rdynamic -DENABLE_TRACEBACK"
- LIB="${LIB} -ldl"
- ;;
-#-------------------------------
-GNU_OSX | gnu_osx | g++_osx )
- # Turn on lots of warnings and ansi compliance.
- # The Trilinos build system requires '-pedantic'
- #
- CXX="g++ -Wall -Wextra -ansi -pedantic -m64"
- LINK="g++"
- CXX="${CXX} -DENABLE_TRACEBACK"
- LIB="${LIB} -ldl"
- ;;
-#-------------------------------
-INTEL | intel | icc | icpc )
- # -xW = use SSE and SSE2 instructions
- CXX="icpc -Wall"
- LINK="icpc"
- LIB="${LIB} -lstdc++"
- ;;
-#-------------------------------
-MPIINTEL | mpiintel | mpiicc | mpiicpc )
- # -xW = use SSE and SSE2 instructions
- CXX="mpiicpc -Wall"
- LINK="mpiicpc"
- LIB="${LIB} -lstdc++"
- KOKKOS_HAVE_MPI=1
-;;
-#-------------------------------
-MIC | mic )
- CXX="icpc -mmic -ansi-alias -Wall"
- LINK="icpc -mmic"
- CXX="${CXX} -mGLOB_default_function_attrs=knc_stream_store_controls=2"
- # CXX="${CXX} -vec-report6"
- # CXX="${CXX} -guide-vec"
- LIB="${LIB} -lstdc++"
- COMPILE_MIC="on"
- ;;
-#-------------------------------
-MPIMIC | mpimic )
- CXX="mpiicpc -mmic -ansi-alias -Wall"
- LINK="mpiicpc -mmic"
- KOKKOS_HAVE_MPI=1
- CXX="${CXX} -mGLOB_default_function_attrs=knc_stream_store_controls=2"
- # CXX="${CXX} -vec-report6"
- # CXX="${CXX} -guide-vec"
- LIB="${LIB} -lstdc++"
- COMPILE_MIC="on"
- ;;
-#-------------------------------
-curie )
- CXX="CC"
- LINK="CC"
- INC_PATH="${INC_PATH} -I/opt/cray/mpt/default/gni/mpich2-cray/74"
- KOKKOS_HAVE_MPI=1
- ;;
-#-------------------------------
-MKL | mkl )
- HAVE_MKL=${1} ; shift 1 ;
- CXX_FLAGS="${CXX_FLAGS} -DKOKKOS_USE_MKL -I${HAVE_MKL}/include/"
- ARCH="intel64"
- if [ -n "${COMPILE_MIC}" ] ;
- then
- ARCH="mic"
- fi
- LIB="${LIB} -L${HAVE_MKL}/lib/${ARCH}/ -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core"
- NVCC_FLAGS="${NVCC_FLAGS} -DKOKKOS_USE_MKL"
-;;
-#-------------------------------
-CUSPARSE | cusparse )
- CXX_FLAGS="${CXX_FLAGS} -DKOKKOS_USE_CUSPARSE"
- NVCC_FLAGS="${NVCC_FLAGS} -DKOKKOS_USE_CUSPARSE"
- LIB="${LIB} -lcusparse"
-;;
-#-------------------------------
-AVX | avx )
- CXX_FLAGS="${CXX_FLAGS} -mavx"
-;;
-#-------------------------------
-*) echo 'unknown option: ' ${ARG} ; exit -1 ;;
-esac
-done
-
-#-----------------------------------------------------------------------------
-
-if [ -z "${CXX}" ] ;
-then
- echo "No C++ compiler selected"
- exit -1
-fi
-
-if [ -n "${KOKKOS_HAVE_OPENMP}" ]
-then
-CXX="${CXX} -fopenmp"
-CXX_SOURCES="${CXX_SOURCES} ${KOKKOS}/src/OpenMP/*.cpp"
-fi
-
-#-----------------------------------------------------------------------------
-# Option for PTHREAD or WINTHREAD eventually
-
-KOKKOS_HAVE_PTHREAD=1
-
-if [ -n "${KOKKOS_HAVE_PTHREAD}" ] ;
-then
- LIB="${LIB} -lpthread"
-fi
-
-#-----------------------------------------------------------------------------
-# Option for enabling the Serial device
-
-KOKKOS_HAVE_SERIAL=1
-
-#-----------------------------------------------------------------------------
-# Attach options to compile lines
-
-CXX="${CXX} ${OPTFLAGS}"
-
-if [ -n "${NVCC}" ] ;
-then
- NVCC="${NVCC} ${OPTFLAGS}"
-fi
-
-#-----------------------------------------------------------------------------
-
-CXX_SOURCES="${CXX_SOURCES} ${KOKKOS}/src/impl/*.cpp"
-CXX_SOURCES="${CXX_SOURCES} ${KOKKOS}/src/Threads/*.cpp"
-
-#-----------------------------------------------------------------------------
-#
-
-if [ -n "${KOKKOS_HAVE_HWLOC}" ] ;
-then
-
- if [ ! -d ${KOKKOS_HAVE_HWLOC} ] ;
- then
- echo "${KOKKOS_HAVE_HWLOC} does not exist"
- exit 1
- fi
-
- echo "LD_LIBRARY_PATH must include ${KOKKOS_HAVE_HWLOC}/lib"
-
- LIB="${LIB} -L${KOKKOS_HAVE_HWLOC}/lib -lhwloc"
- INC_PATH="${INC_PATH} -I${KOKKOS_HAVE_HWLOC}/include"
-fi
-
-#-----------------------------------------------------------------------------
-
-INC_PATH="${INC_PATH} -I."
-
-CONFIG="KokkosCore_config.h"
-
-rm -f ${CONFIG}
-
-echo "#ifndef KOKKOS_CORE_CONFIG_H" >> ${CONFIG}
-echo "#define KOKKOS_CORE_CONFIG_H" >> ${CONFIG}
-
-if [ -n "${KOKKOS_HAVE_MPI}" ] ;
-then
- echo "#define KOKKOS_HAVE_MPI" >> ${CONFIG}
-fi
-
-if [ -n "${NVCC}" ] ;
-then
- echo "#define KOKKOS_HAVE_CUDA" >> ${CONFIG}
-fi
-
-if [ -n "${KOKKOS_HAVE_PTHREAD}" ] ;
-then
- echo "#define KOKKOS_HAVE_PTHREAD" >> ${CONFIG}
-fi
-
-if [ -n "${KOKKOS_HAVE_SERIAL}" ] ;
-then
- echo "#define KOKKOS_HAVE_SERIAL" >> ${CONFIG}
-fi
-
-if [ -n "${KOKKOS_HAVE_HWLOC}" ] ;
-then
- echo "#define KOKKOS_HAVE_HWLOC" >> ${CONFIG}
-fi
-
-if [ -n "${KOKKOS_HAVE_OPENMP}" ] ;
-then
- echo "#define KOKKOS_HAVE_OPENMP" >> ${CONFIG}
-fi
-
-if [ -n "${KOKKOS_EXPRESSION_CHECK}" ] ;
-then
- echo "#define KOKKOS_EXPRESSION_CHECK" >> ${CONFIG}
-fi
-
-echo "#endif" >> ${CONFIG}
-
-#-----------------------------------------------------------------------------
-
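For reference, the removed build_common.sh emitted a small KokkosCore_config.h selecting the enabled back-ends. A pthread-plus-hwloc build, for example, would have produced a header of roughly this shape (a sketch reconstructed from the echo statements above, not a file shipped with the library):

#ifndef KOKKOS_CORE_CONFIG_H
#define KOKKOS_CORE_CONFIG_H
#define KOKKOS_HAVE_PTHREAD
#define KOKKOS_HAVE_SERIAL
#define KOKKOS_HAVE_HWLOC
#endif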
diff --git a/lib/kokkos/core/src/impl/KokkosExp_SharedAlloc.cpp b/lib/kokkos/core/src/impl/KokkosExp_SharedAlloc.cpp
new file mode 100755
index 000000000..50168fe3c
--- /dev/null
+++ b/lib/kokkos/core/src/impl/KokkosExp_SharedAlloc.cpp
@@ -0,0 +1,275 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <Kokkos_Core.hpp>
+
+namespace Kokkos {
+namespace Experimental {
+namespace Impl {
+
+bool
+SharedAllocationRecord< void , void >::
+is_sane( SharedAllocationRecord< void , void > * arg_record )
+{
+ constexpr static SharedAllocationRecord * zero = 0 ;
+
+ SharedAllocationRecord * const root = arg_record ? arg_record->m_root : 0 ;
+
+ bool ok = root != 0 && root->m_count == 0 ;
+
+ if ( ok ) {
+ SharedAllocationRecord * root_next = 0 ;
+
+ // Lock the list:
+ while ( ( root_next = Kokkos::atomic_exchange( & root->m_next , zero ) ) == 0 );
+
+ for ( SharedAllocationRecord * rec = root_next ; ok && rec != root ; rec = rec->m_next ) {
+ const bool ok_non_null = rec && rec->m_prev && ( rec == root || rec->m_next );
+ const bool ok_root = ok_non_null && rec->m_root == root ;
+ const bool ok_prev_next = ok_non_null && ( rec->m_prev != root ? rec->m_prev->m_next == rec : root_next == rec );
+ const bool ok_next_prev = ok_non_null && rec->m_next->m_prev == rec ;
+ const bool ok_count = ok_non_null && 0 <= rec->m_count ;
+
+ ok = ok_root && ok_prev_next && ok_next_prev && ok_count ;
+
+if ( ! ok ) {
+ fprintf(stderr,"Kokkos::Experimental::Impl::SharedAllocationRecord failed is_sane: rec(0x%.12lx){ m_count(%d) m_root(0x%.12lx) m_next(0x%.12lx) m_prev(0x%.12lx) m_next->m_prev(0x%.12lx) m_prev->m_next(0x%.12lx) }\n"
+ , reinterpret_cast< unsigned long >( rec )
+ , rec->m_count
+ , reinterpret_cast< unsigned long >( rec->m_root )
+ , reinterpret_cast< unsigned long >( rec->m_next )
+ , reinterpret_cast< unsigned long >( rec->m_prev )
+ , reinterpret_cast< unsigned long >( rec->m_next->m_prev )
+ , reinterpret_cast< unsigned long >( rec->m_prev != rec->m_root ? rec->m_prev->m_next : root_next )
+ );
+}
+
+ }
+
+ if ( zero != Kokkos::atomic_exchange( & root->m_next , root_next ) ) {
+ Kokkos::Impl::throw_runtime_exception("Kokkos::Experimental::Impl::SharedAllocationRecord failed is_sane unlocking");
+ }
+ }
+
+ return ok ;
+}
+
+SharedAllocationRecord<void,void> *
+SharedAllocationRecord<void,void>::find( SharedAllocationRecord<void,void> * const arg_root , void * const arg_data_ptr )
+{
+ constexpr static SharedAllocationRecord * zero = 0 ;
+
+ SharedAllocationRecord * root_next = 0 ;
+
+ // Lock the list:
+ while ( ( root_next = Kokkos::atomic_exchange( & arg_root->m_next , 0 ) ) == 0 );
+
+ // Iterate searching for the record with this data pointer
+
+ SharedAllocationRecord * r = root_next ;
+
+ while ( ( r != arg_root ) && ( r->data() != arg_data_ptr ) ) { r = r->m_next ; }
+
+ if ( r == arg_root ) { r = 0 ; }
+
+ if ( zero != Kokkos::atomic_exchange( & arg_root->m_next , root_next ) ) {
+ Kokkos::Impl::throw_runtime_exception("Kokkos::Experimental::Impl::SharedAllocationRecord failed locking/unlocking");
+ }
+
+ return r ;
+}
+
+
+/**\brief Construct and insert into 'arg_root' tracking set.
+ * use_count is zero.
+ */
+SharedAllocationRecord< void , void >::
+SharedAllocationRecord( SharedAllocationRecord<void,void> * arg_root
+ , SharedAllocationHeader * arg_alloc_ptr
+ , size_t arg_alloc_size
+ , SharedAllocationRecord< void , void >::function_type arg_dealloc
+ )
+ : m_alloc_ptr( arg_alloc_ptr )
+ , m_alloc_size( arg_alloc_size )
+ , m_dealloc( arg_dealloc )
+ , m_root( arg_root )
+ , m_prev( 0 )
+ , m_next( 0 )
+ , m_count( 0 )
+{
+ constexpr static SharedAllocationRecord * zero = 0 ;
+
+ // Insert into the root double-linked list for tracking
+ //
+ // before: arg_root->m_next == next ; next->m_prev == arg_root
+ // after: arg_root->m_next == this ; this->m_prev == arg_root ;
+ // this->m_next == next ; next->m_prev == this
+
+ m_prev = m_root ;
+
+ // Read root->m_next and lock by setting to zero
+ while ( ( m_next = Kokkos::atomic_exchange( & m_root->m_next , zero ) ) == 0 );
+
+ m_next->m_prev = this ;
+
+ if ( zero != Kokkos::atomic_exchange( & m_root->m_next , this ) ) {
+ Kokkos::Impl::throw_runtime_exception("Kokkos::Experimental::Impl::SharedAllocationRecord failed locking/unlocking");
+ }
+}
+
+void
+SharedAllocationRecord< void , void >::
+increment( SharedAllocationRecord< void , void > * arg_record )
+{
+ const int old_count = Kokkos::atomic_fetch_add( & arg_record->m_count , 1 );
+
+ if ( old_count < 0 ) { // Error
+ Kokkos::Impl::throw_runtime_exception("Kokkos::Experimental::Impl::SharedAllocationRecord failed increment");
+ }
+}
+
+SharedAllocationRecord< void , void > *
+SharedAllocationRecord< void , void >::
+decrement( SharedAllocationRecord< void , void > * arg_record )
+{
+ constexpr static SharedAllocationRecord * zero = 0 ;
+
+ const int old_count = Kokkos::atomic_fetch_add( & arg_record->m_count , -1 );
+
+ if ( old_count == 1 ) {
+
+ // before: arg_record->m_prev->m_next == arg_record &&
+ // arg_record->m_next->m_prev == arg_record
+ //
+ // after: arg_record->m_prev->m_next == arg_record->m_next &&
+ // arg_record->m_next->m_prev == arg_record->m_prev
+
+ SharedAllocationRecord * root_next = 0 ;
+
+ // Lock the list:
+ while ( ( root_next = Kokkos::atomic_exchange( & arg_record->m_root->m_next , 0 ) ) == 0 );
+
+ arg_record->m_next->m_prev = arg_record->m_prev ;
+
+ if ( root_next != arg_record ) {
+ arg_record->m_prev->m_next = arg_record->m_next ;
+ }
+ else {
+ // before: arg_record->m_root == arg_record->m_prev
+ // after: arg_record->m_root == arg_record->m_next
+ root_next = arg_record->m_next ;
+ }
+
+ // Unlock the list:
+ if ( zero != Kokkos::atomic_exchange( & arg_record->m_root->m_next , root_next ) ) {
+ Kokkos::Impl::throw_runtime_exception("Kokkos::Experimental::Impl::SharedAllocationRecord failed decrement unlocking");
+ }
+
+ arg_record->m_next = 0 ;
+ arg_record->m_prev = 0 ;
+
+ function_type d = arg_record->m_dealloc ;
+ (*d)( arg_record );
+ arg_record = 0 ;
+ }
+ else if ( old_count < 1 ) { // Error
+ Kokkos::Impl::throw_runtime_exception("Kokkos::Experimental::Impl::SharedAllocationRecord failed decrement count");
+ }
+
+ return arg_record ;
+}
+
+void
+SharedAllocationRecord< void , void >::
+print_host_accessible_records( std::ostream & s
+ , const char * const space_name
+ , const SharedAllocationRecord * const root
+ , const bool detail )
+{
+ const SharedAllocationRecord< void , void > * r = root ;
+
+ char buffer[256] ;
+
+ if ( detail ) {
+ do {
+
+ snprintf( buffer , 256 , "%s addr( 0x%.12lx ) list( 0x%.12lx 0x%.12lx ) extent[ 0x%.12lx + %.8ld ] count(%d) dealloc(0x%.12lx) %s\n"
+ , space_name
+ , reinterpret_cast<unsigned long>( r )
+ , reinterpret_cast<unsigned long>( r->m_prev )
+ , reinterpret_cast<unsigned long>( r->m_next )
+ , reinterpret_cast<unsigned long>( r->m_alloc_ptr )
+ , r->m_alloc_size
+ , r->m_count
+ , reinterpret_cast<unsigned long>( r->m_dealloc )
+ , r->m_alloc_ptr->m_label
+ );
+ std::cout << buffer ;
+ r = r->m_next ;
+ } while ( r != root );
+ }
+ else {
+ do {
+ if ( r->m_alloc_ptr ) {
+
+ snprintf( buffer , 256 , "%s [ 0x%.12lx + %ld ] %s\n"
+ , space_name
+ , reinterpret_cast< unsigned long >( r->data() )
+ , r->size()
+ , r->m_alloc_ptr->m_label
+ );
+ }
+ else {
+ snprintf( buffer , 256 , "%s [ 0 + 0 ]\n" , space_name );
+ }
+ std::cout << buffer ;
+ r = r->m_next ;
+ } while ( r != root );
+ }
+}
+
+} /* namespace Impl */
+} /* namespace Experimental */
+} /* namespace Kokkos */
+
+
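increment() and decrement() above implement a simple atomic use-count protocol: the value returned by the fetch-and-add tells each thread whether it performed the 1 -> 0 transition and therefore owns deallocation. A reduced sketch with std::atomic follows; MiniRecord and its dealloc hook are hypothetical stand-ins, not Kokkos types.

#include <atomic>
#include <stdexcept>

struct MiniRecord {
  std::atomic<int> count { 0 };
  void (* dealloc)( MiniRecord * ) = nullptr ;
};

void increment( MiniRecord * r )
{
  // fetch_add returns the previous count; a negative value indicates corruption.
  if ( r->count.fetch_add(1) < 0 ) throw std::runtime_error("corrupt use count");
}

MiniRecord * decrement( MiniRecord * r )
{
  const int old_count = r->count.fetch_add(-1);
  if ( old_count == 1 ) { (*r->dealloc)( r ); return nullptr ; } // last reference: deallocate
  if ( old_count <  1 ) throw std::runtime_error("decrement below zero");
  return r ;
}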
diff --git a/lib/kokkos/core/src/impl/KokkosExp_SharedAlloc.hpp b/lib/kokkos/core/src/impl/KokkosExp_SharedAlloc.hpp
new file mode 100755
index 000000000..d9491b553
--- /dev/null
+++ b/lib/kokkos/core/src/impl/KokkosExp_SharedAlloc.hpp
@@ -0,0 +1,287 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+namespace Kokkos {
+namespace Experimental {
+namespace Impl {
+
+template< class MemorySpace = void , class DestroyFunctor = void >
+class SharedAllocationRecord ;
+
+class SharedAllocationHeader {
+private:
+
+ typedef SharedAllocationRecord<void,void> Record ;
+
+ static constexpr unsigned maximum_label_length = ( 1u << 7 /* 128 */ ) - sizeof(Record*);
+
+ template< class , class > friend class SharedAllocationRecord ;
+
+ Record * m_record ;
+ char m_label[ maximum_label_length ];
+
+public:
+
+ /* Given user memory get pointer to the header */
+ KOKKOS_INLINE_FUNCTION static
+ const SharedAllocationHeader * get_header( void * alloc_ptr )
+ { return reinterpret_cast<SharedAllocationHeader*>( reinterpret_cast<char*>(alloc_ptr) - sizeof(SharedAllocationHeader) ); }
+};
+
+template<>
+class SharedAllocationRecord< void , void > {
+protected:
+
+ static_assert( sizeof(SharedAllocationHeader) == ( 1u << 7 /* 128 */ ) , "sizeof(SharedAllocationHeader) != 128" );
+
+ template< class , class > friend class SharedAllocationRecord ;
+
+ typedef void (* function_type )( SharedAllocationRecord<void,void> * );
+
+ SharedAllocationHeader * const m_alloc_ptr ;
+ size_t const m_alloc_size ;
+ function_type const m_dealloc ;
+ SharedAllocationRecord * const m_root ;
+ SharedAllocationRecord * m_prev ;
+ SharedAllocationRecord * m_next ;
+ int m_count ;
+
+ SharedAllocationRecord( const SharedAllocationRecord & ) = delete ;
+ SharedAllocationRecord & operator = ( const SharedAllocationRecord & ) = delete ;
+
+ /**\brief Construct and insert into 'arg_root' tracking set.
+ * use_count is zero.
+ */
+ SharedAllocationRecord( SharedAllocationRecord * arg_root
+ , SharedAllocationHeader * arg_alloc_ptr
+ , size_t arg_alloc_size
+ , function_type arg_dealloc
+ );
+
+public:
+
+ ~SharedAllocationRecord() = default ;
+
+ constexpr SharedAllocationRecord()
+ : m_alloc_ptr( 0 )
+ , m_alloc_size( 0 )
+ , m_dealloc( 0 )
+ , m_root( this )
+ , m_prev( this )
+ , m_next( this )
+ , m_count( 0 )
+ {}
+
+ static constexpr unsigned maximum_label_length = SharedAllocationHeader::maximum_label_length ;
+
+ KOKKOS_INLINE_FUNCTION
+ const SharedAllocationHeader * head() const { return m_alloc_ptr ; }
+
+ /* User's memory begins at the end of the header */
+ KOKKOS_INLINE_FUNCTION
+ void * data() const { return reinterpret_cast<void*>( m_alloc_ptr + 1 ); }
+
+ /* User's memory begins at the end of the header */
+ constexpr size_t size() const { return m_alloc_size - sizeof(SharedAllocationHeader) ; }
+
+ /* Cannot be 'constexpr' because 'm_count' is volatile */
+ int use_count() const { return m_count ; }
+
+ /* Increment use count */
+ static void increment( SharedAllocationRecord * );
+
+ /* Decrement use count. If 1->0 then remove from the tracking list and invoke m_dealloc */
+ static SharedAllocationRecord * decrement( SharedAllocationRecord * );
+
+ /* Given a root record and data pointer find the record */
+ static SharedAllocationRecord * find( SharedAllocationRecord * const , void * const );
+
+ /* Sanity check for the whole set of records to which the input record belongs.
+ * Locks the set's insert/erase operations until the sanity check is complete.
+ */
+ static bool is_sane( SharedAllocationRecord * );
+
+ /* Print host-accessible records */
+ static void print_host_accessible_records( std::ostream &
+ , const char * const space_name
+ , const SharedAllocationRecord * const root
+ , const bool detail );
+};
+
+/*
+ * Memory space specialization of SharedAllocationRecord< Space , void > requires :
+ *
+ * SharedAllocationRecord< Space , void > : public SharedAllocationRecord< void , void >
+ * {
+ * // delete allocated user memory via static_cast to this type.
+ * static void deallocate( const SharedAllocationRecord<void,void> * );
+ * Space m_space ;
+ * }
+ */
+
+template< class MemorySpace , class DestroyFunctor >
+class SharedAllocationRecord : public SharedAllocationRecord< MemorySpace , void >
+{
+private:
+
+ static void deallocate( SharedAllocationRecord<void,void> * record_ptr )
+ { delete static_cast<SharedAllocationRecord<MemorySpace,DestroyFunctor>*>(record_ptr); }
+
+ SharedAllocationRecord( const MemorySpace & arg_space
+ , const std::string & arg_label
+ , const size_t arg_alloc
+ )
+ /* Allocate user memory as [ SharedAllocationHeader , user_memory ] */
+ : SharedAllocationRecord< MemorySpace , void >( arg_space , arg_label , arg_alloc , & deallocate )
+ , m_destroy()
+ {}
+
+ ~SharedAllocationRecord() { m_destroy.destroy_shared_allocation(); }
+
+public:
+
+ DestroyFunctor m_destroy ;
+
+ // Allocate with a zero use count. Incrementing the use count from zero to one
+ // inserts the record into the tracking list. Decrementing the count from one to zero
+  // removes it from the tracking list and deallocates.
+ KOKKOS_INLINE_FUNCTION static
+ SharedAllocationRecord * allocate( const MemorySpace & arg_space
+ , const std::string & arg_label
+ , const size_t arg_alloc
+ )
+ {
+#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
+ return new SharedAllocationRecord( arg_space , arg_label , arg_alloc );
+#else
+ return (SharedAllocationRecord *) 0 ;
+#endif
+ }
+};
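As a reading aid only: the record above stores an arbitrary DestroyFunctor and invokes its destroy_shared_allocation() member when the record itself is destroyed. A minimal sketch of such a functor might look like the following (CloseFileOnDestroy is a hypothetical name, not part of Kokkos).

#include <cstdio>

struct CloseFileOnDestroy {
  std::FILE * file = nullptr ;

  // Invoked by the record's destructor when the allocation is released.
  void destroy_shared_allocation()
  { if ( file ) { std::fclose( file ) ; file = nullptr ; } }
};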
+
+union SharedAllocationTracker {
+private:
+
+ typedef SharedAllocationRecord<void,void> Record ;
+
+ enum : unsigned long {
+ DO_NOT_DEREF_FLAG = 0x01ul
+ };
+
+ // The allocation record resides in Host memory space
+ Record * m_record ;
+ unsigned long m_record_bits;
+
+ KOKKOS_INLINE_FUNCTION
+ static Record * disable( Record * rec )
+ { return reinterpret_cast<Record*>( reinterpret_cast<unsigned long>( rec ) & DO_NOT_DEREF_FLAG ); }
+
+ KOKKOS_INLINE_FUNCTION
+ void increment() const
+ {
+#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
+ if ( ! ( m_record_bits & DO_NOT_DEREF_FLAG ) ) Record::increment( m_record );
+#endif
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void decrement() const
+ {
+#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
+ if ( ! ( m_record_bits & DO_NOT_DEREF_FLAG ) ) Record::decrement( m_record );
+#endif
+ }
+
+public:
+
+ KOKKOS_INLINE_FUNCTION
+ constexpr SharedAllocationTracker() : m_record_bits( DO_NOT_DEREF_FLAG ) {}
+
+ template< class MemorySpace >
+ constexpr
+ SharedAllocationRecord< MemorySpace , void > & get_record() const
+ { return * static_cast< SharedAllocationRecord< MemorySpace , void > * >( m_record ); }
+
+ template< class MemorySpace >
+ std::string get_label() const
+ { return static_cast< SharedAllocationRecord< MemorySpace , void > * >( m_record )->get_label(); }
+
+ KOKKOS_INLINE_FUNCTION
+ SharedAllocationTracker( Record * arg_record )
+ : m_record( arg_record ) { increment(); }
+
+ KOKKOS_INLINE_FUNCTION
+ ~SharedAllocationTracker() { decrement(); }
+
+ KOKKOS_INLINE_FUNCTION
+ SharedAllocationTracker( const SharedAllocationTracker & rhs )
+ : m_record( rhs.m_record ) { increment(); }
+
+ KOKKOS_INLINE_FUNCTION
+ SharedAllocationTracker( SharedAllocationTracker && rhs )
+ : m_record( rhs.m_record ) { rhs.m_record_bits = DO_NOT_DEREF_FLAG ; }
+
+ KOKKOS_INLINE_FUNCTION
+ SharedAllocationTracker & operator = ( const SharedAllocationTracker & rhs )
+ {
+ decrement();
+ m_record = rhs.m_record ;
+ increment();
+ return *this ;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ SharedAllocationTracker & operator = ( SharedAllocationTracker && rhs )
+ {
+ m_record = rhs.m_record ;
+ rhs.m_record_bits = DO_NOT_DEREF_FLAG ;
+ return *this ;
+ }
+};
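The union above implements the usual intrusive reference-count protocol: copy construction and copy assignment increment the shared use count, move operations steal the reference without touching the count, and destruction decrements it. A stand-alone analogue of that protocol (toy_tracker is an illustrative name, not Kokkos code):

#include <cassert>

struct toy_tracker {
  int * count ;                               // shared use count

  explicit toy_tracker( int * c ) : count( c ) { ++*count ; }
  toy_tracker( const toy_tracker & rhs ) : count( rhs.count ) { ++*count ; }
  toy_tracker( toy_tracker && rhs ) : count( rhs.count ) { rhs.count = nullptr ; }
  ~toy_tracker() { if ( count ) { --*count ; } }
};

int main()
{
  int uses = 0 ;
  {
    toy_tracker a( & uses );                              // uses == 1
    toy_tracker b( a );                                   // copy:  uses == 2
    toy_tracker c( static_cast<toy_tracker &&>( b ) );    // move:  uses == 2
    assert( uses == 2 );
  }
  assert( uses == 0 );                                    // all references released
  return 0 ;
}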
+
+
+} /* namespace Impl */
+} /* namespace Experimental */
+} /* namespace Kokkos */
+
+
diff --git a/lib/kokkos/core/src/impl/KokkosExp_ViewAllocProp.hpp b/lib/kokkos/core/src/impl/KokkosExp_ViewAllocProp.hpp
new file mode 100755
index 000000000..348ccaf5e
--- /dev/null
+++ b/lib/kokkos/core/src/impl/KokkosExp_ViewAllocProp.hpp
@@ -0,0 +1,416 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#ifndef KOKKOS_EXPERIMENTAL_IMPL_VIEW_ALLOC_PROP_HPP
+#define KOKKOS_EXPERIMENTAL_IMPL_VIEW_ALLOC_PROP_HPP
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+namespace Kokkos {
+namespace Experimental {
+namespace Impl {
+
+struct WithoutInitializing_t {};
+struct AllowPadding_t {};
+
+template< class ... Parameters >
+struct ViewAllocProp ;
+
+template<>
+struct ViewAllocProp<> {
+
+ struct NullSpace {};
+
+ typedef std::false_type allow_padding_t ;
+ typedef std::true_type initialize_t ;
+ typedef NullSpace memory_space ;
+ typedef NullSpace execution_space ;
+
+ const std::string label ;
+ const memory_space memory ;
+ const execution_space execution ;
+ const allow_padding_t allow_padding ;
+ const initialize_t initialize ;
+
+ ViewAllocProp()
+ : label()
+ , memory()
+ , execution()
+ , allow_padding()
+ , initialize()
+ {}
+
+ ViewAllocProp( const std::string & arg_label )
+ : label( arg_label )
+ , memory()
+ , execution()
+ , allow_padding()
+ , initialize()
+ {}
+};
+
+template< class ... Parameters >
+struct ViewAllocProp< const char * , Parameters ... >
+{
+ typedef ViewAllocProp< Parameters ... > base_prop_type ;
+
+ typedef typename base_prop_type::allow_padding_t allow_padding_t ;
+ typedef typename base_prop_type::initialize_t initialize_t ;
+ typedef typename base_prop_type::memory_space memory_space ;
+ typedef typename base_prop_type::execution_space execution_space ;
+
+ const std::string label ;
+ const memory_space memory ;
+ const execution_space execution ;
+ const allow_padding_t allow_padding ;
+ const initialize_t initialize ;
+
+ ViewAllocProp( const char * const arg_label , Parameters ... arg_param )
+ : label( arg_label )
+ , memory( base_prop_type( arg_param ... ).memory )
+ , execution( base_prop_type( arg_param ... ).execution )
+ , allow_padding()
+ , initialize()
+ {}
+};
+
+template< class ... Parameters >
+struct ViewAllocProp< std::string , Parameters ... >
+{
+ typedef ViewAllocProp< Parameters ... > base_prop_type ;
+
+ typedef typename base_prop_type::allow_padding_t allow_padding_t ;
+ typedef typename base_prop_type::initialize_t initialize_t ;
+ typedef typename base_prop_type::memory_space memory_space ;
+ typedef typename base_prop_type::execution_space execution_space ;
+
+ const std::string label ;
+ const memory_space memory ;
+ const execution_space execution ;
+ const allow_padding_t allow_padding ;
+ const initialize_t initialize ;
+
+ ViewAllocProp( const std::string & arg_label , Parameters ... arg_param )
+ : label( arg_label )
+ , memory( base_prop_type( arg_param ... ).memory )
+ , execution( base_prop_type( arg_param ... ).execution )
+ , allow_padding()
+ , initialize()
+ {}
+};
+
+template< class ... Parameters >
+struct ViewAllocProp< WithoutInitializing_t , Parameters ... >
+{
+ typedef ViewAllocProp< Parameters ... > base_prop_type ;
+
+ typedef typename base_prop_type::allow_padding_t allow_padding_t ;
+ typedef std::false_type initialize_t ;
+ typedef typename base_prop_type::memory_space memory_space ;
+ typedef typename base_prop_type::execution_space execution_space ;
+
+ const std::string label ;
+ const memory_space memory ;
+ const execution_space execution ;
+ const allow_padding_t allow_padding ;
+ const initialize_t initialize ;
+
+ ViewAllocProp( const WithoutInitializing_t & , Parameters ... arg_param )
+ : label( base_prop_type( arg_param ... ).label )
+ , memory( base_prop_type( arg_param ... ).memory )
+ , execution( base_prop_type( arg_param ... ).execution )
+ , allow_padding()
+ , initialize()
+ {}
+};
+
+template< class ... Parameters >
+struct ViewAllocProp< AllowPadding_t , Parameters ... >
+{
+ typedef ViewAllocProp< Parameters ... > base_prop_type ;
+
+ typedef std::true_type allow_padding_t ;
+ typedef typename base_prop_type::initialize_t initialize_t ;
+ typedef typename base_prop_type::memory_space memory_space ;
+ typedef typename base_prop_type::execution_space execution_space ;
+
+ const std::string label ;
+ const memory_space memory ;
+ const execution_space execution ;
+ const allow_padding_t allow_padding ;
+ const initialize_t initialize ;
+
+ ViewAllocProp( const AllowPadding_t & , Parameters ... arg_param )
+ : label( base_prop_type( arg_param ... ).label )
+ , memory( base_prop_type( arg_param ... ).memory )
+ , execution( base_prop_type( arg_param ... ).execution )
+ , allow_padding()
+ , initialize()
+ {}
+};
+
+template< class Space , class ... Parameters >
+struct ViewAllocProp< Space , Parameters ... >
+{
+ enum { is_exec = Kokkos::Impl::is_execution_space< Space >::value };
+ enum { is_mem = Kokkos::Impl::is_memory_space< Space >::value };
+
+ static_assert( is_exec || is_mem , "View allocation given unknown parameter" );
+
+ typedef ViewAllocProp< Parameters ... > base_prop_type ;
+
+ typedef typename base_prop_type::allow_padding_t allow_padding_t ;
+ typedef typename base_prop_type::initialize_t initialize_t ;
+ typedef typename std::conditional< is_mem , Space , typename base_prop_type::memory_space >::type memory_space ;
+ typedef typename std::conditional< is_exec , Space , typename base_prop_type::execution_space >::type execution_space ;
+
+ const std::string label ;
+ const memory_space memory ;
+ const execution_space execution ;
+ const allow_padding_t allow_padding ;
+ const initialize_t initialize ;
+
+ // Templated so that 'base_prop_type( args ... ).execution'
+ // is not used unless arg_space == memory_space.
+ template< class ... Args >
+ ViewAllocProp( const memory_space & arg_space , Args ... args )
+ : label( base_prop_type( args ... ).label )
+ , memory( arg_space )
+ , execution( base_prop_type( args ... ).execution )
+ , allow_padding()
+ , initialize()
+ {}
+
+ // Templated so that 'base_prop_type( args ... ).memory'
+ // is not used unless arg_space == execution_space.
+ template< class ... Args >
+ ViewAllocProp( const execution_space & arg_space , Args ... args )
+ : label( base_prop_type( args ... ).label )
+ , memory( base_prop_type( args ... ).memory )
+ , execution( arg_space )
+ , allow_padding()
+ , initialize()
+ {}
+};
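To illustrate how the specializations above peel properties off the parameter pack one at a time, here is a type-level sketch only; it assumes this header is included and that a host memory space such as Kokkos::HostSpace is available in the build.

#include <string>
#include <type_traits>

namespace KEI = Kokkos::Experimental::Impl ;

typedef KEI::ViewAllocProp< std::string , KEI::WithoutInitializing_t , Kokkos::HostSpace > prop_type ;

static_assert( std::is_same< prop_type::initialize_t , std::false_type >::value
             , "WithoutInitializing_t switches initialization off" );
static_assert( std::is_same< prop_type::memory_space , Kokkos::HostSpace >::value
             , "the memory space is extracted from the pack" );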
+
+template< class ExecSpace , class MemSpace >
+struct ViewAllocProp< Kokkos::Device< ExecSpace , MemSpace > , std::string >
+{
+ typedef ViewAllocProp<> base_prop_type ;
+
+ typedef typename base_prop_type::allow_padding_t allow_padding_t ;
+ typedef typename base_prop_type::initialize_t initialize_t ;
+ typedef MemSpace memory_space ;
+ typedef ExecSpace execution_space ;
+
+ const std::string label ;
+ const memory_space memory ;
+ const execution_space execution ;
+ const allow_padding_t allow_padding ;
+ const initialize_t initialize ;
+
+ ViewAllocProp( const std::string & arg_label )
+ : label( arg_label )
+ , memory()
+ , execution()
+ , allow_padding()
+ , initialize()
+ {}
+};
+
+template< class ExecSpace , class MemSpace , unsigned N >
+struct ViewAllocProp< Kokkos::Device< ExecSpace , MemSpace > , char[N] >
+{
+ typedef ViewAllocProp<> base_prop_type ;
+
+ typedef typename base_prop_type::allow_padding_t allow_padding_t ;
+ typedef typename base_prop_type::initialize_t initialize_t ;
+ typedef MemSpace memory_space ;
+ typedef ExecSpace execution_space ;
+
+ const std::string label ;
+ const memory_space memory ;
+ const execution_space execution ;
+ const allow_padding_t allow_padding ;
+ const initialize_t initialize ;
+
+ ViewAllocProp( const char * const arg_label )
+ : label( arg_label )
+ , memory()
+ , execution()
+ , allow_padding()
+ , initialize()
+ {}
+};
+
+
+// Deprecate in favor of view_alloc( Kokkos::WithoutInitializing )
+template< class ExecSpace , class MemSpace >
+struct ViewAllocProp< Kokkos::Device< ExecSpace , MemSpace >
+ , Kokkos::ViewAllocateWithoutInitializing
+ >
+{
+ typedef ViewAllocProp<> base_prop_type ;
+
+ typedef typename base_prop_type::allow_padding_t allow_padding_t ;
+ typedef std::false_type initialize_t ;
+ typedef MemSpace memory_space ;
+ typedef ExecSpace execution_space ;
+
+ const std::string label ;
+ const memory_space memory ;
+ const execution_space execution ;
+ const allow_padding_t allow_padding ;
+ const initialize_t initialize ;
+
+ ViewAllocProp( const Kokkos::ViewAllocateWithoutInitializing & arg )
+ : label( arg.label )
+ , memory()
+ , execution()
+ , allow_padding()
+ , initialize()
+ {}
+};
+
+template< class ExecSpace , class MemSpace , class ... Parameters >
+struct ViewAllocProp< Kokkos::Device< ExecSpace , MemSpace >
+ , ViewAllocProp< Parameters ... >
+ >
+{
+ typedef ViewAllocProp< Parameters ... > base_prop_type ;
+
+ typedef typename base_prop_type::allow_padding_t allow_padding_t ;
+ typedef typename base_prop_type::initialize_t initialize_t ;
+ typedef MemSpace memory_space ;
+
+ typedef
+ typename std::conditional
+ < Kokkos::Impl::is_execution_space< typename base_prop_type::execution_space >::value
+ , typename base_prop_type::execution_space
+ , ExecSpace
+ >::type execution_space ;
+
+ static_assert( std::is_same< typename base_prop_type::memory_space , ViewAllocProp<>::NullSpace >::value ||
+ std::is_same< typename base_prop_type::memory_space , memory_space >::value
+ , "View allocation given incompatible memory space" );
+
+ static_assert( Kokkos::Impl::VerifyExecutionCanAccessMemorySpace< typename execution_space::memory_space
+ , memory_space >::value
+ , "View allocation given incompatible execution space" );
+
+ const std::string label ;
+ const memory_space memory ;
+ const execution_space execution ;
+ const allow_padding_t allow_padding ;
+ const initialize_t initialize ;
+
+ // If the input properties have a memory or execution space then copy construct those spaces
+ // otherwise default construct those spaces.
+
+ template< class P >
+ ViewAllocProp( const P & arg_prop
+ , typename std::enable_if
+ < std::is_same< P , base_prop_type >::value &&
+ Kokkos::Impl::is_memory_space< typename P::memory_space >::value &&
+                       Kokkos::Impl::is_execution_space< typename P::execution_space >::value
+ >::type * = 0 )
+ : label( arg_prop.label )
+ , memory( arg_prop.memory )
+ , execution( arg_prop.execution )
+ , allow_padding()
+ , initialize()
+ {}
+
+ template< class P >
+ ViewAllocProp( const P & arg_prop
+ , typename std::enable_if
+ < std::is_same< P , base_prop_type >::value &&
+ Kokkos::Impl::is_memory_space< typename P::memory_space >::value &&
+ ! Kokkos::Impl::is_execution_space< typename P::execution_space >::value
+ >::type * = 0 )
+ : label( arg_prop.label )
+ , memory( arg_prop.memory )
+ , execution()
+ , allow_padding()
+ , initialize()
+ {}
+
+ template< class P >
+ ViewAllocProp( const P & arg_prop
+ , typename std::enable_if
+ < std::is_same< P , base_prop_type >::value &&
+ ! Kokkos::Impl::is_memory_space< typename P::memory_space >::value &&
+ Kokkos::Impl::is_execution_space< typename P::execution_space >::value
+ >::type * = 0 )
+ : label( arg_prop.label )
+ , memory()
+ , execution( arg_prop.execution )
+ , allow_padding()
+ , initialize()
+ {}
+
+ template< class P >
+ ViewAllocProp( const P & arg_prop
+ , typename std::enable_if
+ < std::is_same< P , base_prop_type >::value &&
+ ! Kokkos::Impl::is_memory_space< typename P::memory_space >::value &&
+ ! Kokkos::Impl::is_execution_space< typename P::execution_space >::value
+ >::type * = 0 )
+ : label( arg_prop.label )
+ , memory()
+ , execution()
+ , allow_padding()
+ , initialize()
+ {}
+};
+
+} /* namespace Impl */
+} /* namespace Experimental */
+} /* namespace Kokkos */
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+#endif
+
diff --git a/lib/kokkos/core/src/impl/KokkosExp_ViewMapping.hpp b/lib/kokkos/core/src/impl/KokkosExp_ViewMapping.hpp
new file mode 100755
index 000000000..bd2b4c675
--- /dev/null
+++ b/lib/kokkos/core/src/impl/KokkosExp_ViewMapping.hpp
@@ -0,0 +1,2683 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#ifndef KOKKOS_EXPERIMENTAL_VIEW_MAPPING_HPP
+#define KOKKOS_EXPERIMENTAL_VIEW_MAPPING_HPP
+
+#include <type_traits>
+#include <initializer_list>
+
+#include <Kokkos_Pair.hpp>
+#include <Kokkos_Layout.hpp>
+#include <impl/Kokkos_Traits.hpp>
+#include <impl/Kokkos_Atomic_View.hpp>
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+namespace Kokkos {
+namespace Impl {
+
+template< class FunctorType , class ExecPolicy > class ParallelFor ;
+
+}} /* namespace Kokkos::Impl */
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+namespace Kokkos {
+namespace Experimental {
+namespace Impl {
+
+template< long sN0 = -1
+ , long sN1 = -1
+ , long sN2 = -1
+ , long sN3 = -1
+ , long sN4 = -1
+ , long sN5 = -1
+ , long sN6 = -1
+ , long sN7 = -1
+ >
+struct ViewDimension {
+
+ enum { rank = ( sN0 < 0 ? 0 :
+ ( sN1 < 0 ? 1 :
+ ( sN2 < 0 ? 2 :
+ ( sN3 < 0 ? 3 :
+ ( sN4 < 0 ? 4 :
+ ( sN5 < 0 ? 5 :
+ ( sN6 < 0 ? 6 :
+ ( sN7 < 0 ? 7 : 8 )))))))) };
+ enum { rank_dynamic = 0 };
+
+ enum { N0 = 0 < sN0 ? sN0 : 1 };
+ enum { N1 = 0 < sN1 ? sN1 : 1 };
+ enum { N2 = 0 < sN2 ? sN2 : 1 };
+ enum { N3 = 0 < sN3 ? sN3 : 1 };
+ enum { N4 = 0 < sN4 ? sN4 : 1 };
+ enum { N5 = 0 < sN5 ? sN5 : 1 };
+ enum { N6 = 0 < sN6 ? sN6 : 1 };
+ enum { N7 = 0 < sN7 ? sN7 : 1 };
+
+ ViewDimension() = default ;
+ ViewDimension( const ViewDimension & ) = default ;
+ ViewDimension & operator = ( const ViewDimension & ) = default ;
+
+ KOKKOS_INLINE_FUNCTION
+ constexpr ViewDimension( size_t , unsigned , unsigned , unsigned
+ , unsigned , unsigned , unsigned , unsigned ) {}
+};
+
+template< long sN1
+ , long sN2
+ , long sN3
+ , long sN4
+ , long sN5
+ , long sN6
+ , long sN7
+ >
+struct ViewDimension< 0, sN1, sN2, sN3, sN4, sN5, sN6, sN7 > {
+
+ enum { rank = ( sN1 < 0 ? 1 :
+ ( sN2 < 0 ? 2 :
+ ( sN3 < 0 ? 3 :
+ ( sN4 < 0 ? 4 :
+ ( sN5 < 0 ? 5 :
+ ( sN6 < 0 ? 6 :
+ ( sN7 < 0 ? 7 : 8 ))))))) };
+ enum { rank_dynamic = 1 };
+
+ size_t N0 ; /* When 1 == rank_dynamic allow N0 >= 2^32 */
+ enum { N1 = 0 < sN1 ? sN1 : 1 };
+ enum { N2 = 0 < sN2 ? sN2 : 1 };
+ enum { N3 = 0 < sN3 ? sN3 : 1 };
+ enum { N4 = 0 < sN4 ? sN4 : 1 };
+ enum { N5 = 0 < sN5 ? sN5 : 1 };
+ enum { N6 = 0 < sN6 ? sN6 : 1 };
+ enum { N7 = 0 < sN7 ? sN7 : 1 };
+
+ ViewDimension() = default ;
+ ViewDimension( const ViewDimension & ) = default ;
+ ViewDimension & operator = ( const ViewDimension & ) = default ;
+
+ KOKKOS_INLINE_FUNCTION
+ constexpr ViewDimension( size_t aN0 , unsigned , unsigned , unsigned
+ , unsigned , unsigned , unsigned , unsigned )
+ : N0( aN0 ) {}
+};
+
+template< long sN2
+ , long sN3
+ , long sN4
+ , long sN5
+ , long sN6
+ , long sN7
+ >
+struct ViewDimension< 0, 0, sN2, sN3, sN4, sN5, sN6, sN7 > {
+
+ enum { rank = ( sN2 < 0 ? 2 :
+ ( sN3 < 0 ? 3 :
+ ( sN4 < 0 ? 4 :
+ ( sN5 < 0 ? 5 :
+ ( sN6 < 0 ? 6 :
+ ( sN7 < 0 ? 7 : 8 )))))) };
+ enum { rank_dynamic = 2 };
+
+ size_t N0 ; /* When 2 == rank_dynamic allow N0 >= 2^32 */
+ size_t N1 ; /* When 2 == rank_dynamic allow N1 >= 2^32 */
+ enum { N2 = 0 < sN2 ? sN2 : 1 };
+ enum { N3 = 0 < sN3 ? sN3 : 1 };
+ enum { N4 = 0 < sN4 ? sN4 : 1 };
+ enum { N5 = 0 < sN5 ? sN5 : 1 };
+ enum { N6 = 0 < sN6 ? sN6 : 1 };
+ enum { N7 = 0 < sN7 ? sN7 : 1 };
+
+ ViewDimension() = default ;
+ ViewDimension( const ViewDimension & ) = default ;
+ ViewDimension & operator = ( const ViewDimension & ) = default ;
+
+ KOKKOS_INLINE_FUNCTION
+ constexpr ViewDimension( size_t aN0 , unsigned aN1 , unsigned , unsigned
+ , unsigned , unsigned , unsigned , unsigned )
+ : N0( aN0 ) , N1( aN1 ) {}
+};
+
+template< long sN3
+ , long sN4
+ , long sN5
+ , long sN6
+ , long sN7
+ >
+struct ViewDimension< 0, 0, 0, sN3, sN4, sN5, sN6, sN7 > {
+
+ enum { rank = ( sN3 < 0 ? 3 :
+ ( sN4 < 0 ? 4 :
+ ( sN5 < 0 ? 5 :
+ ( sN6 < 0 ? 6 :
+ ( sN7 < 0 ? 7 : 8 ))))) };
+ enum { rank_dynamic = 3 };
+
+ unsigned N0 ;
+ unsigned N1 ;
+ unsigned N2 ;
+ enum { N3 = 0 < sN3 ? sN3 : 1 };
+ enum { N4 = 0 < sN4 ? sN4 : 1 };
+ enum { N5 = 0 < sN5 ? sN5 : 1 };
+ enum { N6 = 0 < sN6 ? sN6 : 1 };
+ enum { N7 = 0 < sN7 ? sN7 : 1 };
+
+ ViewDimension() = default ;
+ ViewDimension( const ViewDimension & ) = default ;
+ ViewDimension & operator = ( const ViewDimension & ) = default ;
+
+ KOKKOS_INLINE_FUNCTION
+ constexpr ViewDimension( size_t aN0 , unsigned aN1 , unsigned aN2 , unsigned
+ , unsigned , unsigned , unsigned , unsigned )
+ : N0( aN0 ) , N1( aN1 ) , N2( aN2 ) {}
+};
+
+template< long sN4
+ , long sN5
+ , long sN6
+ , long sN7
+ >
+struct ViewDimension< 0, 0, 0, 0, sN4, sN5, sN6, sN7 > {
+
+ enum { rank = ( sN4 < 0 ? 4 :
+ ( sN5 < 0 ? 5 :
+ ( sN6 < 0 ? 6 :
+ ( sN7 < 0 ? 7 : 8 )))) };
+ enum { rank_dynamic = 4 };
+
+ unsigned N0 ;
+ unsigned N1 ;
+ unsigned N2 ;
+ unsigned N3 ;
+ enum { N4 = 0 < sN4 ? sN4 : 1 };
+ enum { N5 = 0 < sN5 ? sN5 : 1 };
+ enum { N6 = 0 < sN6 ? sN6 : 1 };
+ enum { N7 = 0 < sN7 ? sN7 : 1 };
+
+ ViewDimension() = default ;
+ ViewDimension( const ViewDimension & ) = default ;
+ ViewDimension & operator = ( const ViewDimension & ) = default ;
+
+ KOKKOS_INLINE_FUNCTION
+ constexpr ViewDimension( size_t aN0 , unsigned aN1 , unsigned aN2 , unsigned aN3
+ , unsigned , unsigned , unsigned , unsigned )
+ : N0( aN0 ) , N1( aN1 ) , N2( aN2 ) , N3( aN3 ) {}
+};
+
+template< long sN5
+ , long sN6
+ , long sN7
+ >
+struct ViewDimension< 0, 0, 0, 0, 0, sN5, sN6, sN7 > {
+
+ enum { rank = ( sN5 < 0 ? 5 :
+ ( sN6 < 0 ? 6 :
+ ( sN7 < 0 ? 7 : 8 ))) };
+ enum { rank_dynamic = 5 };
+
+ unsigned N0 ;
+ unsigned N1 ;
+ unsigned N2 ;
+ unsigned N3 ;
+ unsigned N4 ;
+ enum { N5 = 0 < sN5 ? sN5 : 1 };
+ enum { N6 = 0 < sN6 ? sN6 : 1 };
+ enum { N7 = 0 < sN7 ? sN7 : 1 };
+
+ ViewDimension() = default ;
+ ViewDimension( const ViewDimension & ) = default ;
+ ViewDimension & operator = ( const ViewDimension & ) = default ;
+
+ KOKKOS_INLINE_FUNCTION
+ constexpr ViewDimension( size_t aN0 , unsigned aN1 , unsigned aN2 , unsigned aN3
+ , unsigned aN4 , unsigned , unsigned , unsigned )
+ : N0( aN0 ) , N1( aN1 ) , N2( aN2 ) , N3( aN3 ) , N4( aN4 ) {}
+};
+
+template< long sN6
+ , long sN7
+ >
+struct ViewDimension< 0, 0, 0, 0, 0, 0, sN6, sN7 > {
+
+ enum { rank = ( sN6 < 0 ? 6 :
+ ( sN7 < 0 ? 7 : 8 )) };
+ enum { rank_dynamic = 6 };
+
+ unsigned N0 ;
+ unsigned N1 ;
+ unsigned N2 ;
+ unsigned N3 ;
+ unsigned N4 ;
+ unsigned N5 ;
+ enum { N6 = 0 < sN6 ? sN6 : 1 };
+ enum { N7 = 0 < sN7 ? sN7 : 1 };
+
+ ViewDimension() = default ;
+ ViewDimension( const ViewDimension & ) = default ;
+ ViewDimension & operator = ( const ViewDimension & ) = default ;
+
+ KOKKOS_INLINE_FUNCTION
+ constexpr ViewDimension( size_t aN0 , unsigned aN1 , unsigned aN2 , unsigned aN3
+ , unsigned aN4 , unsigned aN5 , unsigned , unsigned )
+ : N0( aN0 ) , N1( aN1 ) , N2( aN2 ) , N3( aN3 ) , N4( aN4 ) , N5( aN5 ) {}
+};
+
+template< long sN7 >
+struct ViewDimension< 0, 0, 0, 0, 0, 0, 0, sN7 > {
+
+ enum { rank = ( sN7 < 0 ? 7 : 8 ) };
+ enum { rank_dynamic = 7 };
+
+ unsigned N0 ;
+ unsigned N1 ;
+ unsigned N2 ;
+ unsigned N3 ;
+ unsigned N4 ;
+ unsigned N5 ;
+ unsigned N6 ;
+ enum { N7 = 0 < sN7 ? sN7 : 1 };
+
+ ViewDimension() = default ;
+ ViewDimension( const ViewDimension & ) = default ;
+ ViewDimension & operator = ( const ViewDimension & ) = default ;
+
+ KOKKOS_INLINE_FUNCTION
+ constexpr ViewDimension( size_t aN0 , unsigned aN1 , unsigned aN2 , unsigned aN3
+ , unsigned aN4 , unsigned aN5 , unsigned aN6 , unsigned )
+ : N0( aN0 ) , N1( aN1 ) , N2( aN2 ) , N3( aN3 ) , N4( aN4 ) , N5( aN5 ) , N6( aN6 ) {}
+};
+
+template<>
+struct ViewDimension< 0, 0, 0, 0, 0, 0, 0, 0 > {
+
+ enum { rank = 8 };
+ enum { rank_dynamic = 8 };
+
+ unsigned N0 ;
+ unsigned N1 ;
+ unsigned N2 ;
+ unsigned N3 ;
+ unsigned N4 ;
+ unsigned N5 ;
+ unsigned N6 ;
+ unsigned N7 ;
+
+ ViewDimension() = default ;
+ ViewDimension( const ViewDimension & ) = default ;
+ ViewDimension & operator = ( const ViewDimension & ) = default ;
+
+ KOKKOS_INLINE_FUNCTION
+ constexpr ViewDimension( size_t aN0 , unsigned aN1 , unsigned aN2 , unsigned aN3
+ , unsigned aN4 , unsigned aN5 , unsigned aN6 , unsigned aN7 )
+ : N0( aN0 ) , N1( aN1 ) , N2( aN2 ) , N3( aN3 ) , N4( aN4 ) , N5( aN5 ) , N6( aN6 ) , N7( aN7 ) {}
+};
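A short type-level sketch of the encoding above (assuming this header is included): a leading 0 marks a runtime extent, a positive value a compile-time extent, and the defaulted -1 entries terminate the rank.

typedef Kokkos::Experimental::Impl::ViewDimension< 0 , 3 , 4 > dim_type ;

static_assert( dim_type::rank         == 3 , "one runtime plus two static extents" );
static_assert( dim_type::rank_dynamic == 1 , "only the leading extent is runtime sized" );
static_assert( dim_type::N1 == 3 && dim_type::N2 == 4 , "static extents are carried by the type" );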
+
+//----------------------------------------------------------------------------
+
+template< class DstDim , class SrcDim >
+struct ViewDimensionAssignable ;
+
+template< long dN0 , long dN1 , long dN2 , long dN3 , long dN4 , long dN5 , long dN6 , long dN7
+ , long sN0 , long sN1 , long sN2 , long sN3 , long sN4 , long sN5 , long sN6 , long sN7 >
+struct ViewDimensionAssignable< ViewDimension<dN0,dN1,dN2,dN3,dN4,dN5,dN6,dN7>
+ , ViewDimension<sN0,sN1,sN2,sN3,sN4,sN5,sN6,sN7> >
+{
+ typedef ViewDimension<dN0,dN1,dN2,dN3,dN4,dN5,dN6,dN7> dst ;
+ typedef ViewDimension<sN0,sN1,sN2,sN3,sN4,sN5,sN6,sN7> src ;
+
+ enum { value = dst::rank == src::rank &&
+ dst::rank_dynamic >= src::rank_dynamic &&
+ ( 0 < dst::rank_dynamic || dN0 == sN0 ) &&
+ ( 1 < dst::rank_dynamic || dN1 == sN1 ) &&
+ ( 2 < dst::rank_dynamic || dN2 == sN2 ) &&
+ ( 3 < dst::rank_dynamic || dN3 == sN3 ) &&
+ ( 4 < dst::rank_dynamic || dN4 == sN4 ) &&
+ ( 5 < dst::rank_dynamic || dN5 == sN5 ) &&
+ ( 6 < dst::rank_dynamic || dN6 == sN6 ) &&
+ ( 7 < dst::rank_dynamic || dN7 == sN7 ) };
+};
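Continuing the sketch above: a destination dimension may absorb a source extent into a dynamic slot, but its remaining static extents must match exactly.

namespace KEI = Kokkos::Experimental::Impl ;

static_assert(   KEI::ViewDimensionAssignable< KEI::ViewDimension<0,3> , KEI::ViewDimension<5,3> >::value
             , "a dynamic N0 can absorb the static extent 5" );
static_assert( ! KEI::ViewDimensionAssignable< KEI::ViewDimension<4,3> , KEI::ViewDimension<5,3> >::value
             , "static extents must match exactly" );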
+
+//----------------------------------------------------------------------------
+
+template< class Dim , unsigned N , unsigned R = Dim::rank_dynamic >
+struct ViewDimensionInsert ;
+
+template< class Dim , unsigned N >
+struct ViewDimensionInsert< Dim , N , 0 >
+{
+ typedef ViewDimension< N
+ , 0 < Dim::rank ? Dim::N0 : -1
+ , 1 < Dim::rank ? Dim::N1 : -1
+ , 2 < Dim::rank ? Dim::N2 : -1
+ , 3 < Dim::rank ? Dim::N3 : -1
+ , 4 < Dim::rank ? Dim::N4 : -1
+ , 5 < Dim::rank ? Dim::N5 : -1
+ , 6 < Dim::rank ? Dim::N6 : -1
+ > type ;
+};
+
+template< class Dim , unsigned N >
+struct ViewDimensionInsert< Dim , N , 1 >
+{
+ typedef ViewDimension< 0 , N
+ , 1 < Dim::rank ? Dim::N1 : -1
+ , 2 < Dim::rank ? Dim::N2 : -1
+ , 3 < Dim::rank ? Dim::N3 : -1
+ , 4 < Dim::rank ? Dim::N4 : -1
+ , 5 < Dim::rank ? Dim::N5 : -1
+ , 6 < Dim::rank ? Dim::N6 : -1
+ > type ;
+};
+
+template< class Dim , unsigned N >
+struct ViewDimensionInsert< Dim , N , 2 >
+{
+ typedef ViewDimension< 0 , 0 , N
+ , 2 < Dim::rank ? Dim::N2 : -1
+ , 3 < Dim::rank ? Dim::N3 : -1
+ , 4 < Dim::rank ? Dim::N4 : -1
+ , 5 < Dim::rank ? Dim::N5 : -1
+ , 6 < Dim::rank ? Dim::N6 : -1
+ > type ;
+};
+
+template< class Dim , unsigned N >
+struct ViewDimensionInsert< Dim , N , 3 >
+{
+ typedef ViewDimension< 0 , 0 , 0 , N
+ , 3 < Dim::rank ? Dim::N3 : -1
+ , 4 < Dim::rank ? Dim::N4 : -1
+ , 5 < Dim::rank ? Dim::N5 : -1
+ , 6 < Dim::rank ? Dim::N6 : -1
+ > type ;
+};
+
+template< class Dim , unsigned N >
+struct ViewDimensionInsert< Dim , N , 4 >
+{
+ typedef ViewDimension< 0 , 0 , 0 , 0 , N
+ , 4 < Dim::rank ? Dim::N4 : -1
+ , 5 < Dim::rank ? Dim::N5 : -1
+ , 6 < Dim::rank ? Dim::N6 : -1
+ > type ;
+};
+
+template< class Dim , unsigned N >
+struct ViewDimensionInsert< Dim , N , 5 >
+{
+ typedef ViewDimension< 0 , 0 , 0 , 0 , 0 , N
+ , 5 < Dim::rank ? Dim::N5 : -1
+ , 6 < Dim::rank ? Dim::N6 : -1
+ > type ;
+};
+
+template< class Dim , unsigned N >
+struct ViewDimensionInsert< Dim , N , 6 >
+{
+ typedef ViewDimension< 0 , 0 , 0 , 0 , 0 , 0 , N
+ , 6 < Dim::rank ? Dim::N6 : -1
+ > type ;
+};
+
+template< class Dim , unsigned N >
+struct ViewDimensionInsert< Dim , N , 7 >
+{
+ typedef ViewDimension< 0 , 0 , 0 , 0 , 0 , 0 , 0 , N > type ;
+};
+
+}}} // namespace Kokkos::Experimental::Impl
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+namespace Kokkos {
+namespace Experimental {
+namespace Impl {
+
+/** \brief Analyze the array dimensions defined by a Kokkos::View data type.
+ *
+ * It is presumed that the data type can be mapped down to a multidimensional
+ * array of an intrinsic scalar numerical type (double, float, int, ... ).
+ * The 'value_type' of an array may be an embedded aggregate type such
+ * as a fixed length array 'Array<T,N>'.
+ * In this case the 'array_intrinsic_type' represents the
+ * underlying array of intrinsic scalar numerical type.
+ *
+ * The embedded aggregate type must have an AnalyzeShape specialization
+ * to map it down to a shape and intrinsic scalar numerical type.
+ */
+template< class T >
+struct ViewDataAnalysis
+{
+ typedef void specialize ; // No specialization
+
+ typedef ViewDimension<> dimension ;
+
+ typedef T type ;
+ typedef T value_type ;
+ typedef T array_scalar_type ;
+
+ typedef typename std::add_const< T >::type const_type ;
+ typedef typename std::add_const< T >::type const_value_type ;
+ typedef typename std::add_const< T >::type const_array_scalar_type ;
+
+ typedef typename std::remove_const< T >::type non_const_type ;
+ typedef typename std::remove_const< T >::type non_const_value_type ;
+ typedef typename std::remove_const< T >::type non_const_array_scalar_type ;
+};
+
+template< class T >
+struct ViewDataAnalysis< T * >
+{
+private:
+
+ typedef ViewDataAnalysis< T > nested ;
+
+public:
+
+ typedef typename nested::specialize specialize ;
+
+ typedef typename ViewDimensionInsert< typename nested::dimension , 0 >::type dimension ;
+
+ typedef typename nested::type * type ;
+ typedef typename nested::value_type value_type ;
+ typedef typename nested::array_scalar_type * array_scalar_type ;
+
+ typedef typename nested::const_type * const_type ;
+ typedef typename nested::const_value_type const_value_type ;
+ typedef typename nested::const_array_scalar_type * const_array_scalar_type ;
+
+ typedef typename nested::non_const_type * non_const_type ;
+ typedef typename nested::non_const_value_type non_const_value_type ;
+ typedef typename nested::non_const_array_scalar_type * non_const_array_scalar_type ;
+};
+
+template< class T >
+struct ViewDataAnalysis< T [] >
+{
+private:
+
+ typedef ViewDataAnalysis< T > nested ;
+
+public:
+
+ typedef typename nested::specialize specialize ;
+
+ typedef typename ViewDimensionInsert< typename nested::dimension , 0 >::type dimension ;
+
+ typedef typename nested::type type [] ;
+ typedef typename nested::value_type value_type ;
+ typedef typename nested::array_scalar_type array_scalar_type [] ;
+
+ typedef typename nested::const_type const_type [] ;
+ typedef typename nested::const_value_type const_value_type ;
+ typedef typename nested::const_array_scalar_type const_array_scalar_type [] ;
+
+ typedef typename nested::non_const_type non_const_type [] ;
+ typedef typename nested::non_const_value_type non_const_value_type ;
+ typedef typename nested::non_const_array_scalar_type non_const_array_scalar_type [] ;
+};
+
+template< class T , unsigned N >
+struct ViewDataAnalysis< T[N] >
+{
+private:
+
+ typedef ViewDataAnalysis< T > nested ;
+
+public:
+
+ typedef typename nested::specialize specialize ;
+
+ typedef typename ViewDimensionInsert< typename nested::dimension , N >::type dimension ;
+
+ typedef typename nested::type type [N] ;
+ typedef typename nested::value_type value_type ;
+ typedef typename nested::array_scalar_type array_scalar_type [N] ;
+
+ typedef typename nested::const_type const_type [N] ;
+ typedef typename nested::const_value_type const_value_type ;
+ typedef typename nested::const_array_scalar_type const_array_scalar_type [N] ;
+
+ typedef typename nested::non_const_type non_const_type [N] ;
+ typedef typename nested::non_const_value_type non_const_value_type ;
+ typedef typename nested::non_const_array_scalar_type non_const_array_scalar_type [N] ;
+};
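A sketch of the recursion above (assuming this header is included): the view data type double*[3] means one runtime extent followed by a static extent of 3, and peels down to the scalar value type double.

#include <type_traits>

typedef Kokkos::Experimental::Impl::ViewDataAnalysis< double*[3] > analysis ;

static_assert( analysis::dimension::rank         == 2 , "double*[3] is a rank-2 view data type" );
static_assert( analysis::dimension::rank_dynamic == 1 , "only the leading extent is runtime sized" );
static_assert( std::is_same< analysis::value_type , double >::value , "the underlying scalar is double" );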
+
+}}} // namespace Kokkos::Experimental::Impl
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+namespace Kokkos {
+namespace Experimental {
+namespace Impl {
+
+template < class Dimension , class Layout , typename Enable = void >
+struct ViewOffset ;
+
+//----------------------------------------------------------------------------
+// LayoutLeft AND ( 1 >= rank OR 0 == rank_dynamic ) : no padding / striding
+template < class Dimension >
+struct ViewOffset< Dimension , Kokkos::LayoutLeft
+ , typename std::enable_if<( 1 >= Dimension::rank
+ ||
+ 0 == Dimension::rank_dynamic
+ )>::type >
+{
+ typedef size_t size_type ;
+ typedef Dimension dimension_type ;
+ typedef Kokkos::LayoutLeft array_layout ;
+
+ dimension_type m_dim ;
+
+ //----------------------------------------
+
+ // rank 1
+ template< typename I0 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0 ) const { return i0 ; }
+
+ // rank 2
+ template < typename I0 , typename I1 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0 , I1 const & i1 ) const
+ { return i0 + m_dim.N0 * i1 ; }
+
+ //rank 3
+ template < typename I0, typename I1, typename I2 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0, I1 const & i1, I2 const & i2 ) const
+ {
+ return i0 + m_dim.N0 * ( i1 + m_dim.N1 * i2 );
+ }
+
+ //rank 4
+ template < typename I0, typename I1, typename I2, typename I3 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0, I1 const & i1, I2 const & i2, I3 const & i3 ) const
+ {
+ return i0 + m_dim.N0 * (
+ i1 + m_dim.N1 * (
+ i2 + m_dim.N2 * i3 ));
+ }
+
+ //rank 5
+ template < typename I0, typename I1, typename I2, typename I3
+ , typename I4 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0, I1 const & i1, I2 const & i2, I3 const & i3
+ , I4 const & i4 ) const
+ {
+ return i0 + m_dim.N0 * (
+ i1 + m_dim.N1 * (
+ i2 + m_dim.N2 * (
+ i3 + m_dim.N3 * i4 )));
+ }
+
+ //rank 6
+ template < typename I0, typename I1, typename I2, typename I3
+ , typename I4, typename I5 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0, I1 const & i1, I2 const & i2, I3 const & i3
+ , I4 const & i4, I5 const & i5 ) const
+ {
+ return i0 + m_dim.N0 * (
+ i1 + m_dim.N1 * (
+ i2 + m_dim.N2 * (
+ i3 + m_dim.N3 * (
+ i4 + m_dim.N4 * i5 ))));
+ }
+
+ //rank 7
+ template < typename I0, typename I1, typename I2, typename I3
+ , typename I4, typename I5, typename I6 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0, I1 const & i1, I2 const & i2, I3 const & i3
+ , I4 const & i4, I5 const & i5, I6 const & i6 ) const
+ {
+ return i0 + m_dim.N0 * (
+ i1 + m_dim.N1 * (
+ i2 + m_dim.N2 * (
+ i3 + m_dim.N3 * (
+ i4 + m_dim.N4 * (
+ i5 + m_dim.N5 * i6 )))));
+ }
+
+ //rank 8
+ template < typename I0, typename I1, typename I2, typename I3
+ , typename I4, typename I5, typename I6, typename I7 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0, I1 const & i1, I2 const & i2, I3 const & i3
+ , I4 const & i4, I5 const & i5, I6 const & i6, I7 const & i7 ) const
+ {
+ return i0 + m_dim.N0 * (
+ i1 + m_dim.N1 * (
+ i2 + m_dim.N2 * (
+ i3 + m_dim.N3 * (
+ i4 + m_dim.N4 * (
+ i5 + m_dim.N5 * (
+ i6 + m_dim.N6 * i7 ))))));
+ }
+
+ //----------------------------------------
+
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_0() const { return m_dim.N0 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_1() const { return m_dim.N1 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_2() const { return m_dim.N2 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_3() const { return m_dim.N3 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_4() const { return m_dim.N4 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_5() const { return m_dim.N5 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_6() const { return m_dim.N6 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_7() const { return m_dim.N7 ; }
+
+ /* Cardinality of the domain index space */
+ KOKKOS_INLINE_FUNCTION
+ constexpr size_type size() const
+ { return m_dim.N0 * m_dim.N1 * m_dim.N2 * m_dim.N3 * m_dim.N4 * m_dim.N5 * m_dim.N6 * m_dim.N7 ; }
+
+ /* Span of the range space */
+ KOKKOS_INLINE_FUNCTION
+ constexpr size_type span() const
+ { return m_dim.N0 * m_dim.N1 * m_dim.N2 * m_dim.N3 * m_dim.N4 * m_dim.N5 * m_dim.N6 * m_dim.N7 ; }
+
+ KOKKOS_INLINE_FUNCTION constexpr bool span_is_contiguous() const { return true ; }
+
+ /* Strides of dimensions */
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_0() const { return 1 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_1() const { return m_dim.N0 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_2() const { return m_dim.N0 * m_dim.N1 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_3() const { return m_dim.N0 * m_dim.N1 * m_dim.N2 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_4() const { return m_dim.N0 * m_dim.N1 * m_dim.N2 * m_dim.N3 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_5() const { return m_dim.N0 * m_dim.N1 * m_dim.N2 * m_dim.N3 * m_dim.N4 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_6() const { return m_dim.N0 * m_dim.N1 * m_dim.N2 * m_dim.N3 * m_dim.N4 * m_dim.N5 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_7() const { return m_dim.N0 * m_dim.N1 * m_dim.N2 * m_dim.N3 * m_dim.N4 * m_dim.N5 * m_dim.N6 ; }
+
+  // Fill the stride array; the entry at index [ rank ] is the total length
+ template< typename iType >
+ KOKKOS_INLINE_FUNCTION
+ void stride( iType * const s ) const
+ {
+ s[0] = 1 ;
+ if ( 0 < dimension_type::rank ) { s[1] = m_dim.N0 ; }
+ if ( 1 < dimension_type::rank ) { s[2] = s[1] * m_dim.N1 ; }
+ if ( 2 < dimension_type::rank ) { s[3] = s[2] * m_dim.N2 ; }
+ if ( 3 < dimension_type::rank ) { s[4] = s[3] * m_dim.N3 ; }
+ if ( 4 < dimension_type::rank ) { s[5] = s[4] * m_dim.N4 ; }
+ if ( 5 < dimension_type::rank ) { s[6] = s[5] * m_dim.N5 ; }
+ if ( 6 < dimension_type::rank ) { s[7] = s[6] * m_dim.N6 ; }
+ if ( 7 < dimension_type::rank ) { s[8] = s[7] * m_dim.N7 ; }
+ }
+
+ //----------------------------------------
+
+ ViewOffset() = default ;
+ ViewOffset( const ViewOffset & ) = default ;
+ ViewOffset & operator = ( const ViewOffset & ) = default ;
+
+ template< unsigned TrivialScalarSize >
+ KOKKOS_INLINE_FUNCTION
+ constexpr ViewOffset( std::integral_constant<unsigned,TrivialScalarSize> const &
+ , size_t aN0 , unsigned aN1 , unsigned aN2 , unsigned aN3
+ , unsigned aN4 , unsigned aN5 , unsigned aN6 , unsigned aN7 )
+ : m_dim( aN0, aN1, aN2, aN3, aN4, aN5, aN6, aN7 )
+ {}
+
+ template< class DimRHS >
+ KOKKOS_INLINE_FUNCTION
+ constexpr ViewOffset( const ViewOffset< DimRHS , Kokkos::LayoutLeft , void > & rhs )
+ : m_dim( rhs.m_dim.N0 , rhs.m_dim.N1 , rhs.m_dim.N2 , rhs.m_dim.N3
+ , rhs.m_dim.N4 , rhs.m_dim.N5 , rhs.m_dim.N6 , rhs.m_dim.N7 )
+ {
+ static_assert( int(DimRHS::rank) == int(dimension_type::rank) , "ViewOffset assignment requires equal rank" );
+ // Also requires equal static dimensions ...
+ }
+
+ template< class DimRHS >
+ KOKKOS_INLINE_FUNCTION
+ constexpr ViewOffset( const ViewOffset< DimRHS , Kokkos::LayoutRight , void > & rhs )
+ : m_dim( rhs.m_dim.N0, 0, 0, 0, 0, 0, 0, 0 )
+ {
+ static_assert( DimRHS::rank == 1 && dimension_type::rank == 1 && dimension_type::rank_dynamic == 1
+ , "ViewOffset LayoutLeft and LayoutRight are only compatible when rank == 1" );
+ }
+
+ template< class DimRHS >
+ KOKKOS_INLINE_FUNCTION
+ ViewOffset( const ViewOffset< DimRHS , Kokkos::LayoutStride , void > & rhs )
+ : m_dim( rhs.m_dim.N0, 0, 0, 0, 0, 0, 0, 0 )
+ {
+ static_assert( DimRHS::rank == 1 && dimension_type::rank == 1 && dimension_type::rank_dynamic == 1
+ , "ViewOffset LayoutLeft and LayoutStride are only compatible when rank == 1" );
+ if ( rhs.m_stride.S0 != 1 ) {
+ Kokkos::abort("Kokkos::Experimental::ViewOffset assignment of LayoutLeft from LayoutStride requires stride == 1" );
+ }
+ }
+
+ //----------------------------------------
+ // Subview construction
+
+ template< class DimRHS >
+ KOKKOS_INLINE_FUNCTION
+ constexpr ViewOffset( const ViewOffset< DimRHS , Kokkos::LayoutLeft , void > & rhs
+ , const size_t n0
+ , const size_t
+ , const size_t
+ , const size_t
+ , const size_t
+ , const size_t
+ , const size_t
+ , const size_t
+ )
+ : m_dim( n0, 0, 0, 0, 0, 0, 0, 0 )
+ {
+ static_assert( ( 0 == dimension_type::rank ) ||
+ ( 1 == dimension_type::rank && 1 == dimension_type::rank_dynamic && 1 <= DimRHS::rank )
+ , "ViewOffset subview construction requires compatible rank" );
+ }
+};
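As a stand-alone numeric check of the column-major formula used above (plain C++, not Kokkos code; layout_left_offset_rank3 is an illustrative helper): element (i0,i1,i2) lives at offset i0 + N0*(i1 + N1*i2).

#include <cstddef>

constexpr std::size_t layout_left_offset_rank3( std::size_t N0 , std::size_t N1
                                              , std::size_t i0 , std::size_t i1 , std::size_t i2 )
{ return i0 + N0 * ( i1 + N1 * i2 ); }

// In a 4 x 3 x N2 array, element (1,2,0) is the 10th element in memory.
static_assert( layout_left_offset_rank3( 4 , 3 , 1 , 2 , 0 ) == 9 , "offset 9, i.e. the 10th element" );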
+
+//----------------------------------------------------------------------------
+// LayoutLeft AND ( 1 < rank AND 0 < rank_dynamic ) : has padding / striding
+template < class Dimension >
+struct ViewOffset< Dimension , Kokkos::LayoutLeft
+ , typename std::enable_if<( 1 < Dimension::rank
+ &&
+ 0 < Dimension::rank_dynamic
+ )>::type >
+{
+ typedef size_t size_type ;
+ typedef Dimension dimension_type ;
+ typedef Kokkos::LayoutLeft array_layout ;
+
+ dimension_type m_dim ;
+ size_type m_stride ;
+
+ //----------------------------------------
+
+ // rank 1
+ template< typename I0 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0 ) const { return i0 ; }
+
+ // rank 2
+ template < typename I0 , typename I1 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0 , I1 const & i1 ) const
+ { return i0 + m_stride * i1 ; }
+
+ //rank 3
+ template < typename I0, typename I1, typename I2 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0, I1 const & i1, I2 const & i2 ) const
+ {
+ return i0 + m_stride * ( i1 + m_dim.N1 * i2 );
+ }
+
+ //rank 4
+ template < typename I0, typename I1, typename I2, typename I3 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0, I1 const & i1, I2 const & i2, I3 const & i3 ) const
+ {
+ return i0 + m_stride * (
+ i1 + m_dim.N1 * (
+ i2 + m_dim.N2 * i3 ));
+ }
+
+ //rank 5
+ template < typename I0, typename I1, typename I2, typename I3
+ , typename I4 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0, I1 const & i1, I2 const & i2, I3 const & i3
+ , I4 const & i4 ) const
+ {
+ return i0 + m_stride * (
+ i1 + m_dim.N1 * (
+ i2 + m_dim.N2 * (
+ i3 + m_dim.N3 * i4 )));
+ }
+
+ //rank 6
+ template < typename I0, typename I1, typename I2, typename I3
+ , typename I4, typename I5 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0, I1 const & i1, I2 const & i2, I3 const & i3
+ , I4 const & i4, I5 const & i5 ) const
+ {
+ return i0 + m_stride * (
+ i1 + m_dim.N1 * (
+ i2 + m_dim.N2 * (
+ i3 + m_dim.N3 * (
+ i4 + m_dim.N4 * i5 ))));
+ }
+
+ //rank 7
+ template < typename I0, typename I1, typename I2, typename I3
+ , typename I4, typename I5, typename I6 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0, I1 const & i1, I2 const & i2, I3 const & i3
+ , I4 const & i4, I5 const & i5, I6 const & i6 ) const
+ {
+ return i0 + m_stride * (
+ i1 + m_dim.N1 * (
+ i2 + m_dim.N2 * (
+ i3 + m_dim.N3 * (
+ i4 + m_dim.N4 * (
+ i5 + m_dim.N5 * i6 )))));
+ }
+
+ //rank 8
+ template < typename I0, typename I1, typename I2, typename I3
+ , typename I4, typename I5, typename I6, typename I7 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0, I1 const & i1, I2 const & i2, I3 const & i3
+ , I4 const & i4, I5 const & i5, I6 const & i6, I7 const & i7 ) const
+ {
+ return i0 + m_stride * (
+ i1 + m_dim.N1 * (
+ i2 + m_dim.N2 * (
+ i3 + m_dim.N3 * (
+ i4 + m_dim.N4 * (
+ i5 + m_dim.N5 * (
+ i6 + m_dim.N6 * i7 ))))));
+ }
+
+ //----------------------------------------
+
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_0() const { return m_dim.N0 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_1() const { return m_dim.N1 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_2() const { return m_dim.N2 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_3() const { return m_dim.N3 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_4() const { return m_dim.N4 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_5() const { return m_dim.N5 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_6() const { return m_dim.N6 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_7() const { return m_dim.N7 ; }
+
+ /* Cardinality of the domain index space */
+ KOKKOS_INLINE_FUNCTION
+ constexpr size_type size() const
+ { return m_dim.N0 * m_dim.N1 * m_dim.N2 * m_dim.N3 * m_dim.N4 * m_dim.N5 * m_dim.N6 * m_dim.N7 ; }
+
+ /* Span of the range space */
+ KOKKOS_INLINE_FUNCTION
+ constexpr size_type span() const
+ { return m_stride * m_dim.N1 * m_dim.N2 * m_dim.N3 * m_dim.N4 * m_dim.N5 * m_dim.N6 * m_dim.N7 ; }
+
+ KOKKOS_INLINE_FUNCTION constexpr bool span_is_contiguous() const { return m_stride == m_dim.N0 ; }
+
+ /* Strides of dimensions */
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_0() const { return 1 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_1() const { return m_stride ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_2() const { return m_stride * m_dim.N1 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_3() const { return m_stride * m_dim.N1 * m_dim.N2 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_4() const { return m_stride * m_dim.N1 * m_dim.N2 * m_dim.N3 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_5() const { return m_stride * m_dim.N1 * m_dim.N2 * m_dim.N3 * m_dim.N4 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_6() const { return m_stride * m_dim.N1 * m_dim.N2 * m_dim.N3 * m_dim.N4 * m_dim.N5 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_7() const { return m_stride * m_dim.N1 * m_dim.N2 * m_dim.N3 * m_dim.N4 * m_dim.N5 * m_dim.N6 ; }
+
+  // Fill the stride array; the entry at index [ rank ] is the total length
+ template< typename iType >
+ KOKKOS_INLINE_FUNCTION
+ void stride( iType * const s ) const
+ {
+ s[0] = 1 ;
+ if ( 0 < dimension_type::rank ) { s[1] = m_stride ; }
+ if ( 1 < dimension_type::rank ) { s[2] = s[1] * m_dim.N1 ; }
+ if ( 2 < dimension_type::rank ) { s[3] = s[2] * m_dim.N2 ; }
+ if ( 3 < dimension_type::rank ) { s[4] = s[3] * m_dim.N3 ; }
+ if ( 4 < dimension_type::rank ) { s[5] = s[4] * m_dim.N4 ; }
+ if ( 5 < dimension_type::rank ) { s[6] = s[5] * m_dim.N5 ; }
+ if ( 6 < dimension_type::rank ) { s[7] = s[6] * m_dim.N6 ; }
+ if ( 7 < dimension_type::rank ) { s[8] = s[7] * m_dim.N7 ; }
+ }
+
+ //----------------------------------------
+
+private:
+
+ template< unsigned TrivialScalarSize >
+ struct Padding {
+ enum { div = TrivialScalarSize == 0 ? 0 : Kokkos::Impl::MEMORY_ALIGNMENT / ( TrivialScalarSize ? TrivialScalarSize : 1 ) };
+ enum { mod = TrivialScalarSize == 0 ? 0 : Kokkos::Impl::MEMORY_ALIGNMENT % ( TrivialScalarSize ? TrivialScalarSize : 1 ) };
+
+ // If memory alignment is a multiple of the trivial scalar size then attempt to align.
+ enum { align = 0 != TrivialScalarSize && 0 == mod ? div : 0 };
+    enum { div_ok = div ? div : 1 }; // To avoid modulo by zero in constexpr
+
+ KOKKOS_INLINE_FUNCTION
+ static constexpr size_t stride( size_t const N )
+ {
+ return ( align && ( Kokkos::Impl::MEMORY_ALIGNMENT_THRESHOLD * align < N ) && ( N % div_ok ) )
+ ? N + align - ( N % div_ok ) : N ;
+ }
+ };
+
+public:
+
+ ViewOffset() = default ;
+ ViewOffset( const ViewOffset & ) = default ;
+ ViewOffset & operator = ( const ViewOffset & ) = default ;
+
+ /* Enable padding for trivial scalar types with non-zero trivial scalar size */
+ template< unsigned TrivialScalarSize >
+ KOKKOS_INLINE_FUNCTION
+ constexpr ViewOffset( std::integral_constant<unsigned,TrivialScalarSize> const & padding_type_size
+ , size_t aN0 , unsigned aN1 , unsigned aN2 , unsigned aN3
+ , unsigned aN4 , unsigned aN5 , unsigned aN6 , unsigned aN7 )
+ : m_dim( aN0, aN1, aN2, aN3, aN4, aN5, aN6, aN7 )
+ , m_stride( Padding<TrivialScalarSize>::stride( aN0 ) )
+ {}
+
+ template< class DimRHS >
+ KOKKOS_INLINE_FUNCTION
+ constexpr ViewOffset( const ViewOffset< DimRHS , Kokkos::LayoutLeft , void > & rhs )
+ : m_dim( rhs.m_dim.N0 , rhs.m_dim.N1 , rhs.m_dim.N2 , rhs.m_dim.N3
+ , rhs.m_dim.N4 , rhs.m_dim.N5 , rhs.m_dim.N6 , rhs.m_dim.N7 )
+ , m_stride( rhs.stride_1() )
+ {
+ static_assert( int(DimRHS::rank) == int(dimension_type::rank) , "ViewOffset assignment requires equal rank" );
+ // Also requires equal static dimensions ...
+ }
+
+ //----------------------------------------
+ // Subview construction
+
+ template< class DimRHS >
+ KOKKOS_INLINE_FUNCTION
+ constexpr ViewOffset( const ViewOffset< DimRHS , Kokkos::LayoutLeft , void > & rhs
+ , const size_t aN0
+ , const size_t aN1
+ , const size_t aN2
+ , const size_t aN3
+ , const size_t aN4
+ , const size_t aN5
+ , const size_t aN6
+ , const size_t aN7
+ )
+ : m_dim( aN0
+ , ( 1 < DimRHS::rank && aN1 ? aN1 :
+ ( 2 < DimRHS::rank && aN2 ? aN2 :
+ ( 3 < DimRHS::rank && aN3 ? aN3 :
+ ( 4 < DimRHS::rank && aN4 ? aN4 :
+ ( 5 < DimRHS::rank && aN5 ? aN5 :
+ ( 6 < DimRHS::rank && aN6 ? aN6 :
+ ( 7 < DimRHS::rank && aN7 ? aN7 : 0 )))))))
+ , 0, 0, 0, 0, 0, 0 )
+ , m_stride( ( 1 < DimRHS::rank && aN1 ? rhs.stride_1() :
+ ( 2 < DimRHS::rank && aN2 ? rhs.stride_2() :
+ ( 3 < DimRHS::rank && aN3 ? rhs.stride_3() :
+ ( 4 < DimRHS::rank && aN4 ? rhs.stride_4() :
+ ( 5 < DimRHS::rank && aN5 ? rhs.stride_5() :
+ ( 6 < DimRHS::rank && aN6 ? rhs.stride_6() :
+ ( 7 < DimRHS::rank && aN7 ? rhs.stride_7() : 0 ))))))) )
+ {
+    // This subview must have 2 == rank and 2 == rank_dynamic
+    // because only stride #0 is carried.
+    // The source dimension #0 must be non-zero for a stride-one leading dimension.
+    // At most one subsequent dimension can be non-zero.
+
+ static_assert( ( 2 == dimension_type::rank ) &&
+ ( 2 == dimension_type::rank_dynamic ) &&
+ ( 2 <= DimRHS::rank )
+ , "ViewOffset subview construction requires compatible rank" );
+ }
+};
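The Padding helper above rounds the leading extent up to the next alignment boundary once it exceeds a threshold. A stand-alone sketch of that rule (illustrative numbers: an alignment of 8 elements and a threshold factor of 1, which need not match the library's actual MEMORY_ALIGNMENT settings):

#include <cstddef>

constexpr std::size_t padded_stride( std::size_t N , std::size_t align , std::size_t threshold )
{ return ( align && ( threshold * align < N ) && ( N % align ) ) ? N + align - ( N % align ) : N ; }

static_assert( padded_stride( 1000 , 8 , 1 ) == 1000 , "already a multiple of 8: no padding" );
static_assert( padded_stride( 1001 , 8 , 1 ) == 1008 , "rounded up to the next multiple of 8" );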
+
+//----------------------------------------------------------------------------
+// LayoutRight AND ( 1 >= rank OR 0 == rank_dynamic ) : no padding / striding
+template < class Dimension >
+struct ViewOffset< Dimension , Kokkos::LayoutRight
+ , typename std::enable_if<( 1 >= Dimension::rank
+ ||
+ 0 == Dimension::rank_dynamic
+ )>::type >
+{
+ typedef size_t size_type ;
+ typedef Dimension dimension_type ;
+ typedef Kokkos::LayoutRight array_layout ;
+
+ dimension_type m_dim ;
+
+ //----------------------------------------
+
+ // rank 1
+ template< typename I0 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0 ) const { return i0 ; }
+
+ // rank 2
+ template < typename I0 , typename I1 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0 , I1 const & i1 ) const
+ { return i1 + m_dim.N1 * i0 ; }
+
+ //rank 3
+ template < typename I0, typename I1, typename I2 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0, I1 const & i1, I2 const & i2 ) const
+ {
+ return i2 + m_dim.N2 * ( i1 + m_dim.N1 * ( i0 ));
+ }
+
+ //rank 4
+ template < typename I0, typename I1, typename I2, typename I3 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0, I1 const & i1, I2 const & i2, I3 const & i3 ) const
+ {
+ return i3 + m_dim.N3 * (
+ i2 + m_dim.N2 * (
+ i1 + m_dim.N1 * ( i0 )));
+ }
+
+ //rank 5
+ template < typename I0, typename I1, typename I2, typename I3
+ , typename I4 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0, I1 const & i1, I2 const & i2, I3 const & i3
+ , I4 const & i4 ) const
+ {
+ return i4 + m_dim.N4 * (
+ i3 + m_dim.N3 * (
+ i2 + m_dim.N2 * (
+ i1 + m_dim.N1 * ( i0 ))));
+ }
+
+ //rank 6
+ template < typename I0, typename I1, typename I2, typename I3
+ , typename I4, typename I5 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0, I1 const & i1, I2 const & i2, I3 const & i3
+ , I4 const & i4, I5 const & i5 ) const
+ {
+ return i5 + m_dim.N5 * (
+ i4 + m_dim.N4 * (
+ i3 + m_dim.N3 * (
+ i2 + m_dim.N2 * (
+ i1 + m_dim.N1 * ( i0 )))));
+ }
+
+ //rank 7
+ template < typename I0, typename I1, typename I2, typename I3
+ , typename I4, typename I5, typename I6 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0, I1 const & i1, I2 const & i2, I3 const & i3
+ , I4 const & i4, I5 const & i5, I6 const & i6 ) const
+ {
+ return i6 + m_dim.N6 * (
+ i5 + m_dim.N5 * (
+ i4 + m_dim.N4 * (
+ i3 + m_dim.N3 * (
+ i2 + m_dim.N2 * (
+ i1 + m_dim.N1 * ( i0 ))))));
+ }
+
+ //rank 8
+ template < typename I0, typename I1, typename I2, typename I3
+ , typename I4, typename I5, typename I6, typename I7 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0, I1 const & i1, I2 const & i2, I3 const & i3
+ , I4 const & i4, I5 const & i5, I6 const & i6, I7 const & i7 ) const
+ {
+ return i7 + m_dim.N7 * (
+ i6 + m_dim.N6 * (
+ i5 + m_dim.N5 * (
+ i4 + m_dim.N4 * (
+ i3 + m_dim.N3 * (
+ i2 + m_dim.N2 * (
+ i1 + m_dim.N1 * ( i0 )))))));
+ }
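+ // Illustrative example (hypothetical values; assumes the unused trailing
+ // extents of a rank-3 dimension default to 1, as elsewhere in this mapping):
+ // for a LayoutRight view with N0=4, N1=3, N2=5 the offset of (1,2,4) is
+ //   4 + 5 * ( 2 + 3 * 1 ) == 29 ,
+ // stride_2() == 1, stride_1() == 5, stride_0() == 15, and span() == 60.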
+
+ //----------------------------------------
+
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_0() const { return m_dim.N0 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_1() const { return m_dim.N1 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_2() const { return m_dim.N2 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_3() const { return m_dim.N3 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_4() const { return m_dim.N4 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_5() const { return m_dim.N5 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_6() const { return m_dim.N6 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_7() const { return m_dim.N7 ; }
+
+ /* Cardinality of the domain index space */
+ KOKKOS_INLINE_FUNCTION
+ constexpr size_type size() const
+ { return m_dim.N0 * m_dim.N1 * m_dim.N2 * m_dim.N3 * m_dim.N4 * m_dim.N5 * m_dim.N6 * m_dim.N7 ; }
+
+ /* Span of the range space */
+ KOKKOS_INLINE_FUNCTION
+ constexpr size_type span() const
+ { return m_dim.N0 * m_dim.N1 * m_dim.N2 * m_dim.N3 * m_dim.N4 * m_dim.N5 * m_dim.N6 * m_dim.N7 ; }
+
+ KOKKOS_INLINE_FUNCTION constexpr bool span_is_contiguous() const { return true ; }
+
+ /* Strides of dimensions */
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_7() const { return 1 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_6() const { return m_dim.N7 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_5() const { return m_dim.N7 * m_dim.N6 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_4() const { return m_dim.N7 * m_dim.N6 * m_dim.N5 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_3() const { return m_dim.N7 * m_dim.N6 * m_dim.N5 * m_dim.N4 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_2() const { return m_dim.N7 * m_dim.N6 * m_dim.N5 * m_dim.N4 * m_dim.N3 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_1() const { return m_dim.N7 * m_dim.N6 * m_dim.N5 * m_dim.N4 * m_dim.N3 * m_dim.N2 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_0() const { return m_dim.N7 * m_dim.N6 * m_dim.N5 * m_dim.N4 * m_dim.N3 * m_dim.N2 * m_dim.N1 ; }
+
+ // Fill the stride array; the entry at index [ rank ] is the total length (span)
+ template< typename iType >
+ KOKKOS_INLINE_FUNCTION
+ void stride( iType * const s ) const
+ {
+ size_type n = 1 ;
+ if ( 7 < dimension_type::rank ) { s[7] = n ; n *= m_dim.N7 ; }
+ if ( 6 < dimension_type::rank ) { s[6] = n ; n *= m_dim.N6 ; }
+ if ( 5 < dimension_type::rank ) { s[5] = n ; n *= m_dim.N5 ; }
+ if ( 4 < dimension_type::rank ) { s[4] = n ; n *= m_dim.N4 ; }
+ if ( 3 < dimension_type::rank ) { s[3] = n ; n *= m_dim.N3 ; }
+ if ( 2 < dimension_type::rank ) { s[2] = n ; n *= m_dim.N2 ; }
+ if ( 1 < dimension_type::rank ) { s[1] = n ; n *= m_dim.N1 ; }
+ if ( 0 < dimension_type::rank ) { s[0] = n ; }
+ s[dimension_type::rank] = n * m_dim.N0 ;
+ }
+
+ //----------------------------------------
+
+ ViewOffset() = default ;
+ ViewOffset( const ViewOffset & ) = default ;
+ ViewOffset & operator = ( const ViewOffset & ) = default ;
+
+ template< unsigned TrivialScalarSize >
+ KOKKOS_INLINE_FUNCTION
+ constexpr ViewOffset( std::integral_constant<unsigned,TrivialScalarSize> const &
+ , size_t aN0 , unsigned aN1 , unsigned aN2 , unsigned aN3
+ , unsigned aN4 , unsigned aN5 , unsigned aN6 , unsigned aN7 )
+ : m_dim( aN0, aN1, aN2, aN3, aN4, aN5, aN6, aN7 )
+ {}
+
+ template< class DimRHS >
+ KOKKOS_INLINE_FUNCTION
+ constexpr ViewOffset( const ViewOffset< DimRHS , Kokkos::LayoutRight , void > & rhs )
+ : m_dim( rhs.m_dim.N0 , rhs.m_dim.N1 , rhs.m_dim.N2 , rhs.m_dim.N3
+ , rhs.m_dim.N4 , rhs.m_dim.N5 , rhs.m_dim.N6 , rhs.m_dim.N7 )
+ {
+ static_assert( int(DimRHS::rank) == int(dimension_type::rank) , "ViewOffset assignment requires equal rank" );
+ // Also requires equal static dimensions ...
+ }
+
+ template< class DimRHS >
+ KOKKOS_INLINE_FUNCTION
+ constexpr ViewOffset( const ViewOffset< DimRHS , Kokkos::LayoutLeft , void > & rhs )
+ : m_dim( rhs.m_dim.N0, 0, 0, 0, 0, 0, 0, 0 )
+ {
+ static_assert( DimRHS::rank == 1 && dimension_type::rank == 1 && dimension_type::rank_dynamic == 1
+ , "ViewOffset LayoutRight and LayoutLeft are only compatible when rank == 1" );
+ }
+
+ template< class DimRHS >
+ KOKKOS_INLINE_FUNCTION
+ ViewOffset( const ViewOffset< DimRHS , Kokkos::LayoutStride , void > & rhs )
+ : m_dim( rhs.m_dim.N0, 0, 0, 0, 0, 0, 0, 0 )
+ {
+ static_assert( DimRHS::rank == 1 && dimension_type::rank == 1 && dimension_type::rank_dynamic == 1
+ , "ViewOffset LayoutRight and LayoutStride are only compatible when rank == 1" );
+ if ( rhs.m_stride.S0 != 1 ) {
+ Kokkos::abort("Kokkos::Experimental::ViewOffset assignment of LayoutRight from LayoutStride requires stride == 1" );
+ }
+ }
+
+ //----------------------------------------
+ // Subview construction
+
+ template< class DimRHS >
+ KOKKOS_INLINE_FUNCTION
+ constexpr ViewOffset( const ViewOffset< DimRHS , Kokkos::LayoutRight , void > & rhs
+ , const size_t n0
+ , const size_t
+ , const size_t
+ , const size_t
+ , const size_t
+ , const size_t
+ , const size_t
+ , const size_t
+ )
+ : m_dim( n0, 0, 0, 0, 0, 0, 0, 0 )
+ {
+ static_assert( ( 0 == dimension_type::rank ) ||
+ ( 1 == dimension_type::rank && 1 == dimension_type::rank_dynamic && 1 <= DimRHS::rank )
+ , "ViewOffset subview construction requires compatible rank" );
+ }
+};
+
+//----------------------------------------------------------------------------
+// LayoutRight AND ( 1 < rank AND 0 < rank_dynamic ) : has padding / striding
+template < class Dimension >
+struct ViewOffset< Dimension , Kokkos::LayoutRight
+ , typename std::enable_if<( 1 < Dimension::rank
+ &&
+ 0 < Dimension::rank_dynamic
+ )>::type >
+{
+ typedef size_t size_type ;
+ typedef Dimension dimension_type ;
+ typedef Kokkos::LayoutRight array_layout ;
+
+ dimension_type m_dim ;
+ size_type m_stride ;
+
+ //----------------------------------------
+
+ // rank 1
+ template< typename I0 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0 ) const { return i0 ; }
+
+ // rank 2
+ template < typename I0 , typename I1 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0 , I1 const & i1 ) const
+ { return i1 + i0 * m_stride ; }
+
+ //rank 3
+ template < typename I0, typename I1, typename I2 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0, I1 const & i1, I2 const & i2 ) const
+ { return i2 + m_dim.N2 * ( i1 ) + i0 * m_stride ; }
+
+ //rank 4
+ template < typename I0, typename I1, typename I2, typename I3 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0, I1 const & i1, I2 const & i2, I3 const & i3 ) const
+ {
+ return i3 + m_dim.N3 * (
+ i2 + m_dim.N2 * ( i1 )) +
+ i0 * m_stride ;
+ }
+
+ //rank 5
+ template < typename I0, typename I1, typename I2, typename I3
+ , typename I4 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0, I1 const & i1, I2 const & i2, I3 const & i3
+ , I4 const & i4 ) const
+ {
+ return i4 + m_dim.N4 * (
+ i3 + m_dim.N3 * (
+ i2 + m_dim.N2 * ( i1 ))) +
+ i0 * m_stride ;
+ }
+
+ //rank 6
+ template < typename I0, typename I1, typename I2, typename I3
+ , typename I4, typename I5 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0, I1 const & i1, I2 const & i2, I3 const & i3
+ , I4 const & i4, I5 const & i5 ) const
+ {
+ return i5 + m_dim.N5 * (
+ i4 + m_dim.N4 * (
+ i3 + m_dim.N3 * (
+ i2 + m_dim.N2 * ( i1 )))) +
+ i0 * m_stride ;
+ }
+
+ //rank 7
+ template < typename I0, typename I1, typename I2, typename I3
+ , typename I4, typename I5, typename I6 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0, I1 const & i1, I2 const & i2, I3 const & i3
+ , I4 const & i4, I5 const & i5, I6 const & i6 ) const
+ {
+ return i6 + m_dim.N6 * (
+ i5 + m_dim.N5 * (
+ i4 + m_dim.N4 * (
+ i3 + m_dim.N3 * (
+ i2 + m_dim.N2 * ( i1 ))))) +
+ i0 * m_stride ;
+ }
+
+ //rank 8
+ template < typename I0, typename I1, typename I2, typename I3
+ , typename I4, typename I5, typename I6, typename I7 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0, I1 const & i1, I2 const & i2, I3 const & i3
+ , I4 const & i4, I5 const & i5, I6 const & i6, I7 const & i7 ) const
+ {
+ return i7 + m_dim.N7 * (
+ i6 + m_dim.N6 * (
+ i5 + m_dim.N5 * (
+ i4 + m_dim.N4 * (
+ i3 + m_dim.N3 * (
+ i2 + m_dim.N2 * ( i1 )))))) +
+ i0 * m_stride ;
+ }
+
+ //----------------------------------------
+
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_0() const { return m_dim.N0 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_1() const { return m_dim.N1 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_2() const { return m_dim.N2 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_3() const { return m_dim.N3 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_4() const { return m_dim.N4 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_5() const { return m_dim.N5 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_6() const { return m_dim.N6 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_7() const { return m_dim.N7 ; }
+
+ /* Cardinality of the domain index space */
+ KOKKOS_INLINE_FUNCTION
+ constexpr size_type size() const
+ { return m_dim.N0 * m_dim.N1 * m_dim.N2 * m_dim.N3 * m_dim.N4 * m_dim.N5 * m_dim.N6 * m_dim.N7 ; }
+
+ /* Span of the range space */
+ KOKKOS_INLINE_FUNCTION
+ constexpr size_type span() const
+ { return m_dim.N0 * m_stride ; }
+
+ KOKKOS_INLINE_FUNCTION constexpr bool span_is_contiguous() const
+ { return m_stride == m_dim.N7 * m_dim.N6 * m_dim.N5 * m_dim.N4 * m_dim.N3 * m_dim.N2 * m_dim.N1 ; }
+
+ /* Strides of dimensions */
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_7() const { return 1 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_6() const { return m_dim.N7 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_5() const { return m_dim.N7 * m_dim.N6 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_4() const { return m_dim.N7 * m_dim.N6 * m_dim.N5 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_3() const { return m_dim.N7 * m_dim.N6 * m_dim.N5 * m_dim.N4 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_2() const { return m_dim.N7 * m_dim.N6 * m_dim.N5 * m_dim.N4 * m_dim.N3 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_1() const { return m_dim.N7 * m_dim.N6 * m_dim.N5 * m_dim.N4 * m_dim.N3 * m_dim.N2 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_0() const { return m_stride ; }
+
+ // Fill the stride array; the entry at index [ rank ] is the total length (span)
+ template< typename iType >
+ KOKKOS_INLINE_FUNCTION
+ void stride( iType * const s ) const
+ {
+ size_type n = 1 ;
+ if ( 7 < dimension_type::rank ) { s[7] = n ; n *= m_dim.N7 ; }
+ if ( 6 < dimension_type::rank ) { s[6] = n ; n *= m_dim.N6 ; }
+ if ( 5 < dimension_type::rank ) { s[5] = n ; n *= m_dim.N5 ; }
+ if ( 4 < dimension_type::rank ) { s[4] = n ; n *= m_dim.N4 ; }
+ if ( 3 < dimension_type::rank ) { s[3] = n ; n *= m_dim.N3 ; }
+ if ( 2 < dimension_type::rank ) { s[2] = n ; n *= m_dim.N2 ; }
+ if ( 1 < dimension_type::rank ) { s[1] = n ; }
+ if ( 0 < dimension_type::rank ) { s[0] = m_stride ; }
+ s[dimension_type::rank] = m_stride * m_dim.N0 ;
+ }
+
+ //----------------------------------------
+
+private:
+
+ template< unsigned TrivialScalarSize >
+ struct Padding {
+ enum { div = TrivialScalarSize == 0 ? 0 : Kokkos::Impl::MEMORY_ALIGNMENT / ( TrivialScalarSize ? TrivialScalarSize : 1 ) };
+ enum { mod = TrivialScalarSize == 0 ? 0 : Kokkos::Impl::MEMORY_ALIGNMENT % ( TrivialScalarSize ? TrivialScalarSize : 1 ) };
+
+ // If memory alignment is a multiple of the trivial scalar size then attempt to align.
+ enum { align = 0 != TrivialScalarSize && 0 == mod ? div : 0 };
+ enum { div_ok = div ? div : 1 }; // To avoid modulo by zero in the constexpr expression
+
+ KOKKOS_INLINE_FUNCTION
+ static constexpr size_t stride( size_t const N )
+ {
+ return ( align && ( Kokkos::Impl::MEMORY_ALIGNMENT_THRESHOLD * align < N ) && ( N % div_ok ) )
+ ? N + align - ( N % div_ok ) : N ;
+ }
+ };
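+
+ // Illustrative example (alignment values are assumptions, not taken from this
+ // patch): with an 8-byte scalar and a 64-byte Kokkos::Impl::MEMORY_ALIGNMENT,
+ // div == 8 and align == 8, so Padding<8>::stride(1003) == 1003 + 8 - (1003 % 8) == 1008,
+ // i.e. the extent is rounded up to a multiple of 8 elements (64 bytes) once it
+ // is large enough (above MEMORY_ALIGNMENT_THRESHOLD * align).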
+
+public:
+
+ ViewOffset() = default ;
+ ViewOffset( const ViewOffset & ) = default ;
+ ViewOffset & operator = ( const ViewOffset & ) = default ;
+
+ /* Enable padding for trivial scalar types with non-zero trivial scalar size. */
+ template< unsigned TrivialScalarSize >
+ KOKKOS_INLINE_FUNCTION
+ constexpr ViewOffset( std::integral_constant<unsigned,TrivialScalarSize> const & padding_type_size
+ , size_t aN0 , unsigned aN1 , unsigned aN2 , unsigned aN3
+ , unsigned aN4 , unsigned aN5 , unsigned aN6 , unsigned aN7 )
+ : m_dim( aN0, aN1, aN2, aN3, aN4, aN5, aN6, aN7 )
+ , m_stride( Padding<TrivialScalarSize>::
+ stride( /* 2 <= rank */
+ m_dim.N1 * ( dimension_type::rank == 2 ? 1 :
+ m_dim.N2 * ( dimension_type::rank == 3 ? 1 :
+ m_dim.N3 * ( dimension_type::rank == 4 ? 1 :
+ m_dim.N4 * ( dimension_type::rank == 5 ? 1 :
+ m_dim.N5 * ( dimension_type::rank == 6 ? 1 :
+ m_dim.N6 * ( dimension_type::rank == 7 ? 1 : m_dim.N7 )))))) ))
+ {}
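+
+ // Illustrative example (hypothetical values, same assumed 64-byte alignment as
+ // above): for a rank-2 LayoutRight view of double with N0 == 100 and N1 == 1003,
+ // m_stride == Padding<8>::stride( 1003 ) == 1008, so row i0 begins at element
+ // i0 * 1008, span() == 100 * 1008 == 100800 while size() == 100300, and
+ // span_is_contiguous() is false because 1008 != 1003.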
+
+ template< class DimRHS >
+ KOKKOS_INLINE_FUNCTION
+ constexpr ViewOffset( const ViewOffset< DimRHS , Kokkos::LayoutLeft , void > & rhs )
+ : m_dim( rhs.m_dim.N0 , rhs.m_dim.N1 , rhs.m_dim.N2 , rhs.m_dim.N3
+ , rhs.m_dim.N4 , rhs.m_dim.N5 , rhs.m_dim.N6 , rhs.m_dim.N7 )
+ , m_stride( rhs.stride_0() )
+ {
+ static_assert( int(DimRHS::rank) == int(dimension_type::rank) , "ViewOffset assignment requires equal rank" );
+ // Also requires equal static dimensions ...
+ }
+
+ //----------------------------------------
+ // Subview construction
+ // Last dimension must be non-zero
+
+ template< class DimRHS >
+ KOKKOS_INLINE_FUNCTION
+ constexpr ViewOffset( const ViewOffset< DimRHS , Kokkos::LayoutRight , void > & rhs
+ , const size_t aN0
+ , const size_t aN1
+ , const size_t aN2
+ , const size_t aN3
+ , const size_t aN4
+ , const size_t aN5
+ , const size_t aN6
+ , const size_t aN7
+ )
+ : m_dim( // N0 == First non-zero dimension before the last dimension.
+ ( 1 < DimRHS::rank && aN0 ? aN0 :
+ ( 2 < DimRHS::rank && aN1 ? aN1 :
+ ( 3 < DimRHS::rank && aN2 ? aN2 :
+ ( 4 < DimRHS::rank && aN3 ? aN3 :
+ ( 5 < DimRHS::rank && aN4 ? aN4 :
+ ( 6 < DimRHS::rank && aN5 ? aN5 :
+ ( 7 < DimRHS::rank && aN6 ? aN6 : 0 )))))))
+ , // N1 == Last dimension.
+ ( 2 == DimRHS::rank ? aN1 :
+ ( 3 == DimRHS::rank ? aN2 :
+ ( 4 == DimRHS::rank ? aN3 :
+ ( 5 == DimRHS::rank ? aN4 :
+ ( 6 == DimRHS::rank ? aN5 :
+ ( 7 == DimRHS::rank ? aN6 : aN7 ))))))
+ , 0, 0, 0, 0, 0, 0 )
+ , m_stride( ( 1 < DimRHS::rank && aN0 ? rhs.stride_0() :
+ ( 2 < DimRHS::rank && aN1 ? rhs.stride_1() :
+ ( 3 < DimRHS::rank && aN2 ? rhs.stride_2() :
+ ( 4 < DimRHS::rank && aN3 ? rhs.stride_3() :
+ ( 5 < DimRHS::rank && aN4 ? rhs.stride_4() :
+ ( 6 < DimRHS::rank && aN5 ? rhs.stride_5() :
+ ( 7 < DimRHS::rank && aN6 ? rhs.stride_6() : 0 ))))))) )
+ {
+ // This subview must have 2 == rank and 2 == rank_dynamic
+ // because only stride #0 is stored.
+ // The source dimension #0 must be non-zero for a stride-one leading dimension.
+ // At most one subsequent dimension can be non-zero.
+
+ static_assert( ( 2 == dimension_type::rank ) &&
+ ( 2 == dimension_type::rank_dynamic ) &&
+ ( 2 <= DimRHS::rank )
+ , "ViewOffset subview construction requires compatible rank" );
+ }
+};
+
+//----------------------------------------------------------------------------
+/* Strided array layout only makes sense for 0 < rank */
+
+template< unsigned Rank >
+struct ViewStride ;
+
+template<>
+struct ViewStride<1> {
+ size_t S0 ;
+ enum { S1 = 0 , S2 = 0 , S3 = 0 , S4 = 0 , S5 = 0 , S6 = 0 , S7 = 0 };
+
+ ViewStride() = default ;
+ ViewStride( const ViewStride & ) = default ;
+ ViewStride & operator = ( const ViewStride & ) = default ;
+
+ KOKKOS_INLINE_FUNCTION
+ constexpr ViewStride( size_t aS0 , size_t , size_t , size_t
+ , size_t , size_t , size_t , size_t )
+ : S0( aS0 )
+ {}
+};
+
+template<>
+struct ViewStride<2> {
+ size_t S0 , S1 ;
+ enum { S2 = 0 , S3 = 0 , S4 = 0 , S5 = 0 , S6 = 0 , S7 = 0 };
+
+ ViewStride() = default ;
+ ViewStride( const ViewStride & ) = default ;
+ ViewStride & operator = ( const ViewStride & ) = default ;
+
+ KOKKOS_INLINE_FUNCTION
+ constexpr ViewStride( size_t aS0 , size_t aS1 , size_t , size_t
+ , size_t , size_t , size_t , size_t )
+ : S0( aS0 ) , S1( aS1 )
+ {}
+};
+
+template<>
+struct ViewStride<3> {
+ size_t S0 , S1 , S2 ;
+ enum { S3 = 0 , S4 = 0 , S5 = 0 , S6 = 0 , S7 = 0 };
+
+ ViewStride() = default ;
+ ViewStride( const ViewStride & ) = default ;
+ ViewStride & operator = ( const ViewStride & ) = default ;
+
+ KOKKOS_INLINE_FUNCTION
+ constexpr ViewStride( size_t aS0 , size_t aS1 , size_t aS2 , size_t
+ , size_t , size_t , size_t , size_t )
+ : S0( aS0 ) , S1( aS1 ) , S2( aS2 )
+ {}
+};
+
+template<>
+struct ViewStride<4> {
+ size_t S0 , S1 , S2 , S3 ;
+ enum { S4 = 0 , S5 = 0 , S6 = 0 , S7 = 0 };
+
+ ViewStride() = default ;
+ ViewStride( const ViewStride & ) = default ;
+ ViewStride & operator = ( const ViewStride & ) = default ;
+
+ KOKKOS_INLINE_FUNCTION
+ constexpr ViewStride( size_t aS0 , size_t aS1 , size_t aS2 , size_t aS3
+ , size_t , size_t , size_t , size_t )
+ : S0( aS0 ) , S1( aS1 ) , S2( aS2 ) , S3( aS3 )
+ {}
+};
+
+template<>
+struct ViewStride<5> {
+ size_t S0 , S1 , S2 , S3 , S4 ;
+ enum { S5 = 0 , S6 = 0 , S7 = 0 };
+
+ ViewStride() = default ;
+ ViewStride( const ViewStride & ) = default ;
+ ViewStride & operator = ( const ViewStride & ) = default ;
+
+ KOKKOS_INLINE_FUNCTION
+ constexpr ViewStride( size_t aS0 , size_t aS1 , size_t aS2 , size_t aS3
+ , size_t aS4 , size_t , size_t , size_t )
+ : S0( aS0 ) , S1( aS1 ) , S2( aS2 ) , S3( aS3 )
+ , S4( aS4 )
+ {}
+};
+
+template<>
+struct ViewStride<6> {
+ size_t S0 , S1 , S2 , S3 , S4 , S5 ;
+ enum { S6 = 0 , S7 = 0 };
+
+ ViewStride() = default ;
+ ViewStride( const ViewStride & ) = default ;
+ ViewStride & operator = ( const ViewStride & ) = default ;
+
+ KOKKOS_INLINE_FUNCTION
+ constexpr ViewStride( size_t aS0 , size_t aS1 , size_t aS2 , size_t aS3
+ , size_t aS4 , size_t aS5 , size_t , size_t )
+ : S0( aS0 ) , S1( aS1 ) , S2( aS2 ) , S3( aS3 )
+ , S4( aS4 ) , S5( aS5 )
+ {}
+};
+
+template<>
+struct ViewStride<7> {
+ size_t S0 , S1 , S2 , S3 , S4 , S5 , S6 ;
+ enum { S7 = 0 };
+
+ ViewStride() = default ;
+ ViewStride( const ViewStride & ) = default ;
+ ViewStride & operator = ( const ViewStride & ) = default ;
+
+ KOKKOS_INLINE_FUNCTION
+ constexpr ViewStride( size_t aS0 , size_t aS1 , size_t aS2 , size_t aS3
+ , size_t aS4 , size_t aS5 , size_t aS6 , size_t )
+ : S0( aS0 ) , S1( aS1 ) , S2( aS2 ) , S3( aS3 )
+ , S4( aS4 ) , S5( aS5 ) , S6( aS6 )
+ {}
+};
+
+template<>
+struct ViewStride<8> {
+ size_t S0 , S1 , S2 , S3 , S4 , S5 , S6 , S7 ;
+
+ ViewStride() = default ;
+ ViewStride( const ViewStride & ) = default ;
+ ViewStride & operator = ( const ViewStride & ) = default ;
+
+ KOKKOS_INLINE_FUNCTION
+ constexpr ViewStride( size_t aS0 , size_t aS1 , size_t aS2 , size_t aS3
+ , size_t aS4 , size_t aS5 , size_t aS6 , size_t aS7 )
+ : S0( aS0 ) , S1( aS1 ) , S2( aS2 ) , S3( aS3 )
+ , S4( aS4 ) , S5( aS5 ) , S6( aS6 ) , S7( aS7 )
+ {}
+};
+
+template < class Dimension >
+struct ViewOffset< Dimension , Kokkos::LayoutStride
+ , typename std::enable_if<( 0 < Dimension::rank )>::type >
+{
+private:
+ typedef ViewStride< Dimension::rank > stride_type ;
+public:
+
+ typedef size_t size_type ;
+ typedef Dimension dimension_type ;
+ typedef Kokkos::LayoutStride array_layout ;
+
+ dimension_type m_dim ;
+ stride_type m_stride ;
+
+ //----------------------------------------
+
+ // rank 1
+ template< typename I0 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0 ) const
+ {
+ return i0 * m_stride.S0 ;
+ }
+
+ // rank 2
+ template < typename I0 , typename I1 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0 , I1 const & i1 ) const
+ {
+ return i0 * m_stride.S0 +
+ i1 * m_stride.S1 ;
+ }
+
+ //rank 3
+ template < typename I0, typename I1, typename I2 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0, I1 const & i1, I2 const & i2 ) const
+ {
+ return i0 * m_stride.S0 +
+ i1 * m_stride.S1 +
+ i2 * m_stride.S2 ;
+ }
+
+ //rank 4
+ template < typename I0, typename I1, typename I2, typename I3 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0, I1 const & i1, I2 const & i2, I3 const & i3 ) const
+ {
+ return i0 * m_stride.S0 +
+ i1 * m_stride.S1 +
+ i2 * m_stride.S2 +
+ i3 * m_stride.S3 ;
+ }
+
+ //rank 5
+ template < typename I0, typename I1, typename I2, typename I3
+ , typename I4 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0, I1 const & i1, I2 const & i2, I3 const & i3
+ , I4 const & i4 ) const
+ {
+ return i0 * m_stride.S0 +
+ i1 * m_stride.S1 +
+ i2 * m_stride.S2 +
+ i3 * m_stride.S3 +
+ i4 * m_stride.S4 ;
+ }
+
+ //rank 6
+ template < typename I0, typename I1, typename I2, typename I3
+ , typename I4, typename I5 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0, I1 const & i1, I2 const & i2, I3 const & i3
+ , I4 const & i4, I5 const & i5 ) const
+ {
+ return i0 * m_stride.S0 +
+ i1 * m_stride.S1 +
+ i2 * m_stride.S2 +
+ i3 * m_stride.S3 +
+ i4 * m_stride.S4 +
+ i5 * m_stride.S5 ;
+ }
+
+ //rank 7
+ template < typename I0, typename I1, typename I2, typename I3
+ , typename I4, typename I5, typename I6 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0, I1 const & i1, I2 const & i2, I3 const & i3
+ , I4 const & i4, I5 const & i5, I6 const & i6 ) const
+ {
+ return i0 * m_stride.S0 +
+ i1 * m_stride.S1 +
+ i2 * m_stride.S2 +
+ i3 * m_stride.S3 +
+ i4 * m_stride.S4 +
+ i5 * m_stride.S5 +
+ i6 * m_stride.S6 ;
+ }
+
+ //rank 8
+ template < typename I0, typename I1, typename I2, typename I3
+ , typename I4, typename I5, typename I6, typename I7 >
+ KOKKOS_INLINE_FUNCTION constexpr
+ size_type operator()( I0 const & i0, I1 const & i1, I2 const & i2, I3 const & i3
+ , I4 const & i4, I5 const & i5, I6 const & i6, I7 const & i7 ) const
+ {
+ return i0 * m_stride.S0 +
+ i1 * m_stride.S1 +
+ i2 * m_stride.S2 +
+ i3 * m_stride.S3 +
+ i4 * m_stride.S4 +
+ i5 * m_stride.S5 +
+ i6 * m_stride.S6 +
+ i7 * m_stride.S7 ;
+ }
+
+ //----------------------------------------
+
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_0() const { return m_dim.N0 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_1() const { return m_dim.N1 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_2() const { return m_dim.N2 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_3() const { return m_dim.N3 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_4() const { return m_dim.N4 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_5() const { return m_dim.N5 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_6() const { return m_dim.N6 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type dimension_7() const { return m_dim.N7 ; }
+
+ /* Cardinality of the domain index space */
+ KOKKOS_INLINE_FUNCTION
+ constexpr size_type size() const
+ { return m_dim.N0 * m_dim.N1 * m_dim.N2 * m_dim.N3 * m_dim.N4 * m_dim.N5 * m_dim.N6 * m_dim.N7 ; }
+
+private:
+
+ KOKKOS_INLINE_FUNCTION
+ static constexpr size_type Max( size_type lhs , size_type rhs )
+ { return lhs < rhs ? rhs : lhs ; }
+
+public:
+
+ /* Span of the range space, largest stride * dimension */
+ KOKKOS_INLINE_FUNCTION
+ constexpr size_type span() const
+ {
+ return Max( m_dim.N0 * m_stride.S0 ,
+ Max( m_dim.N1 * m_stride.S1 ,
+ Max( m_dim.N2 * m_stride.S2 ,
+ Max( m_dim.N3 * m_stride.S3 ,
+ Max( m_dim.N4 * m_stride.S4 ,
+ Max( m_dim.N5 * m_stride.S5 ,
+ Max( m_dim.N6 * m_stride.S6 ,
+ m_dim.N7 * m_stride.S7 )))))));
+ }
+
+ KOKKOS_INLINE_FUNCTION constexpr bool span_is_contiguous() const { return span() == size(); }
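+
+ // Illustrative example (hypothetical values): for a rank-2 strided view with
+ // extents {3,4} and strides {4,1} the offset of (i0,i1) is i0*4 + i1, and
+ // span() == max(3*4, 4*1) == 12 == size(), so the span is contiguous; with a
+ // padded stride set {8,1} the span becomes max(3*8, 4*1) == 24 != 12 and
+ // span_is_contiguous() returns false.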
+
+ /* Strides of dimensions */
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_0() const { return m_stride.S0 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_1() const { return m_stride.S1 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_2() const { return m_stride.S2 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_3() const { return m_stride.S3 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_4() const { return m_stride.S4 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_5() const { return m_stride.S5 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_6() const { return m_stride.S6 ; }
+ KOKKOS_INLINE_FUNCTION constexpr size_type stride_7() const { return m_stride.S7 ; }
+
+ // Fill the stride array; the entry at index [ rank ] is the total length (span)
+ template< typename iType >
+ KOKKOS_INLINE_FUNCTION
+ void stride( iType * const s ) const
+ {
+ if ( 0 < dimension_type::rank ) { s[0] = m_stride.S0 ; }
+ if ( 1 < dimension_type::rank ) { s[1] = m_stride.S1 ; }
+ if ( 2 < dimension_type::rank ) { s[2] = m_stride.S2 ; }
+ if ( 3 < dimension_type::rank ) { s[3] = m_stride.S3 ; }
+ if ( 4 < dimension_type::rank ) { s[4] = m_stride.S4 ; }
+ if ( 5 < dimension_type::rank ) { s[5] = m_stride.S5 ; }
+ if ( 6 < dimension_type::rank ) { s[6] = m_stride.S6 ; }
+ if ( 7 < dimension_type::rank ) { s[7] = m_stride.S7 ; }
+ s[dimension_type::rank] = span();
+ }
+
+ //----------------------------------------
+
+ ViewOffset() = default ;
+ ViewOffset( const ViewOffset & ) = default ;
+ ViewOffset & operator = ( const ViewOffset & ) = default ;
+
+ KOKKOS_INLINE_FUNCTION
+ ViewOffset( const Kokkos::LayoutStride & rhs )
+ : m_dim( rhs.dimension[0] , rhs.dimension[1] , rhs.dimension[2] , rhs.dimension[3]
+ , rhs.dimension[4] , rhs.dimension[5] , rhs.dimension[6] , rhs.dimension[7] )
+ , m_stride( rhs.stride[0] , rhs.stride[1] , rhs.stride[2] , rhs.stride[3]
+ , rhs.stride[4] , rhs.stride[5] , rhs.stride[6] , rhs.stride[7] )
+ {}
+
+ template< class DimRHS , class LayoutRHS >
+ KOKKOS_INLINE_FUNCTION
+ constexpr ViewOffset( const ViewOffset< DimRHS , LayoutRHS , void > & rhs )
+ : m_dim( rhs.m_dim.N0 , rhs.m_dim.N1 , rhs.m_dim.N2 , rhs.m_dim.N3
+ , rhs.m_dim.N4 , rhs.m_dim.N5 , rhs.m_dim.N6 , rhs.m_dim.N7 )
+ , m_stride( rhs.stride_0() , rhs.stride_1() , rhs.stride_2() , rhs.stride_3()
+ , rhs.stride_4() , rhs.stride_5() , rhs.stride_6() , rhs.stride_7() )
+ {
+ static_assert( int(DimRHS::rank) == int(dimension_type::rank) , "ViewOffset assignment requires equal rank" );
+ // Also requires equal static dimensions ...
+ }
+
+ //----------------------------------------
+ // Subview construction
+
+private:
+
+ KOKKOS_INLINE_FUNCTION
+ static constexpr unsigned
+ count_non_zero( const size_t aN0 = 0
+ , const size_t aN1 = 0
+ , const size_t aN2 = 0
+ , const size_t aN3 = 0
+ , const size_t aN4 = 0
+ , const size_t aN5 = 0
+ , const size_t aN6 = 0
+ , const size_t aN7 = 0
+ )
+ {
+ return ( aN0 ? 1 : 0 ) +
+ ( aN1 ? 1 : 0 ) +
+ ( aN2 ? 1 : 0 ) +
+ ( aN3 ? 1 : 0 ) +
+ ( aN4 ? 1 : 0 ) +
+ ( aN5 ? 1 : 0 ) +
+ ( aN6 ? 1 : 0 ) +
+ ( aN7 ? 1 : 0 );
+ }
+
+ template< unsigned Rank , unsigned I >
+ KOKKOS_INLINE_FUNCTION
+ static constexpr size_t
+ get_non_zero( const size_t aN0
+ , const size_t aN1
+ , const size_t aN2
+ , const size_t aN3
+ , const size_t aN4
+ , const size_t aN5
+ , const size_t aN6
+ , const size_t aN7
+ )
+ {
+ return ( 0 < Rank && I < 1 && aN0 ? aN0 :
+ ( 1 < Rank && I < 2 && I == count_non_zero(aN0) && aN1 ? aN1 :
+ ( 2 < Rank && I < 3 && I == count_non_zero(aN0,aN1) && aN2 ? aN2 :
+ ( 3 < Rank && I < 4 && I == count_non_zero(aN0,aN1,aN2) && aN3 ? aN3 :
+ ( 4 < Rank && I < 5 && I == count_non_zero(aN0,aN1,aN2,aN3) && aN4 ? aN4 :
+ ( 5 < Rank && I < 6 && I == count_non_zero(aN0,aN1,aN2,aN3,aN4) && aN5 ? aN5 :
+ ( 6 < Rank && I < 7 && I == count_non_zero(aN0,aN1,aN2,aN3,aN4,aN5) && aN6 ? aN6 :
+ ( 7 < Rank && I < 8 && I == count_non_zero(aN0,aN1,aN2,aN3,aN4,aN5,aN6) && aN7 ? aN7 : 0 ))))))));
+ }
+
+ template< unsigned Rank , unsigned I , class DimRHS , class LayoutRHS >
+ KOKKOS_INLINE_FUNCTION
+ static constexpr size_t
+ get_non_zero( const size_t aN0 , const size_t aN1 , const size_t aN2 , const size_t aN3
+ , const size_t aN4 , const size_t aN5 , const size_t aN6 , const size_t aN7
+ , const ViewOffset< DimRHS , LayoutRHS , void > & rhs )
+ {
+ return ( 0 < Rank && I < 1 && aN0 ? rhs.stride_0() :
+ ( 1 < Rank && I < 2 && I == count_non_zero(aN0) && aN1 ? rhs.stride_1() :
+ ( 2 < Rank && I < 3 && I == count_non_zero(aN0,aN1) && aN2 ? rhs.stride_2() :
+ ( 3 < Rank && I < 4 && I == count_non_zero(aN0,aN1,aN2) && aN3 ? rhs.stride_3() :
+ ( 4 < Rank && I < 5 && I == count_non_zero(aN0,aN1,aN2,aN3) && aN4 ? rhs.stride_4() :
+ ( 5 < Rank && I < 6 && I == count_non_zero(aN0,aN1,aN2,aN3,aN4) && aN5 ? rhs.stride_5() :
+ ( 6 < Rank && I < 7 && I == count_non_zero(aN0,aN1,aN2,aN3,aN4,aN5) && aN6 ? rhs.stride_6() :
+ ( 7 < Rank && I < 8 && I == count_non_zero(aN0,aN1,aN2,aN3,aN4,aN5,aN6) && aN7 ? rhs.stride_7() : 0 ))))))));
+ }
+
+
+public:
+
+ template< class DimRHS , class LayoutRHS >
+ KOKKOS_INLINE_FUNCTION
+ constexpr ViewOffset( const ViewOffset< DimRHS , LayoutRHS , void > & rhs
+ , const size_t aN0
+ , const size_t aN1
+ , const size_t aN2
+ , const size_t aN3
+ , const size_t aN4
+ , const size_t aN5
+ , const size_t aN6
+ , const size_t aN7
+ )
+ // Contract the non-zero dimensions
+ : m_dim( ViewOffset::template get_non_zero<DimRHS::rank,0>( aN0, aN1, aN2, aN3, aN4, aN5, aN6, aN7 )
+ , ViewOffset::template get_non_zero<DimRHS::rank,1>( aN0, aN1, aN2, aN3, aN4, aN5, aN6, aN7 )
+ , ViewOffset::template get_non_zero<DimRHS::rank,2>( aN0, aN1, aN2, aN3, aN4, aN5, aN6, aN7 )
+ , ViewOffset::template get_non_zero<DimRHS::rank,3>( aN0, aN1, aN2, aN3, aN4, aN5, aN6, aN7 )
+ , ViewOffset::template get_non_zero<DimRHS::rank,4>( aN0, aN1, aN2, aN3, aN4, aN5, aN6, aN7 )
+ , ViewOffset::template get_non_zero<DimRHS::rank,5>( aN0, aN1, aN2, aN3, aN4, aN5, aN6, aN7 )
+ , ViewOffset::template get_non_zero<DimRHS::rank,6>( aN0, aN1, aN2, aN3, aN4, aN5, aN6, aN7 )
+ , ViewOffset::template get_non_zero<DimRHS::rank,7>( aN0, aN1, aN2, aN3, aN4, aN5, aN6, aN7 )
+ )
+ , m_stride( ViewOffset::template get_non_zero<DimRHS::rank,0>( aN0, aN1, aN2, aN3, aN4, aN5, aN6, aN7, rhs )
+ , ViewOffset::template get_non_zero<DimRHS::rank,1>( aN0, aN1, aN2, aN3, aN4, aN5, aN6, aN7, rhs )
+ , ViewOffset::template get_non_zero<DimRHS::rank,2>( aN0, aN1, aN2, aN3, aN4, aN5, aN6, aN7, rhs )
+ , ViewOffset::template get_non_zero<DimRHS::rank,3>( aN0, aN1, aN2, aN3, aN4, aN5, aN6, aN7, rhs )
+ , ViewOffset::template get_non_zero<DimRHS::rank,4>( aN0, aN1, aN2, aN3, aN4, aN5, aN6, aN7, rhs )
+ , ViewOffset::template get_non_zero<DimRHS::rank,5>( aN0, aN1, aN2, aN3, aN4, aN5, aN6, aN7, rhs )
+ , ViewOffset::template get_non_zero<DimRHS::rank,6>( aN0, aN1, aN2, aN3, aN4, aN5, aN6, aN7, rhs )
+ , ViewOffset::template get_non_zero<DimRHS::rank,7>( aN0, aN1, aN2, aN3, aN4, aN5, aN6, aN7, rhs )
+ )
+ {
+ }
+
+ //----------------------------------------
+};
+
+}}} // namespace Kokkos::Experimental::Impl
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+namespace Kokkos {
+namespace Experimental {
+namespace Impl {
+
+struct ALL_t {};
+
+template< class T >
+struct ViewOffsetRange {
+
+ static_assert( std::is_integral<T>::value , "Non-range must be an integral type" );
+
+ enum { is_range = false };
+
+ KOKKOS_INLINE_FUNCTION static
+ size_t dimension( size_t const , T const & ) { return 0 ; }
+
+ KOKKOS_INLINE_FUNCTION static
+ size_t begin( T const & i ) { return size_t(i) ; }
+};
+
+template<>
+struct ViewOffsetRange<void> {
+ enum { is_range = false };
+};
+
+template<>
+struct ViewOffsetRange< Kokkos::Experimental::Impl::ALL_t > {
+ enum { is_range = true };
+
+ KOKKOS_INLINE_FUNCTION static
+ size_t dimension( size_t const n , Experimental::Impl::ALL_t const & ) { return n ; }
+
+ KOKKOS_INLINE_FUNCTION static
+ size_t begin( Experimental::Impl::ALL_t const & ) { return 0 ; }
+};
+
+template< typename iType >
+struct ViewOffsetRange< std::pair<iType,iType> > {
+
+ static_assert( std::is_integral<iType>::value , "Range bounds must be an integral type" );
+
+ enum { is_range = true };
+
+ KOKKOS_INLINE_FUNCTION static
+ size_t dimension( size_t const n , std::pair<iType,iType> const & r )
+ { return ( size_t(r.first) < size_t(r.second) && size_t(r.second) <= n ) ? size_t(r.second) - size_t(r.first) : 0 ; }
+
+ KOKKOS_INLINE_FUNCTION static
+ size_t begin( std::pair<iType,iType> const & r ) { return size_t(r.first) ; }
+};
+
+template< typename iType >
+struct ViewOffsetRange< Kokkos::pair<iType,iType> > {
+
+ static_assert( std::is_integral<iType>::value , "Range bounds must be an integral type" );
+
+ enum { is_range = true };
+
+ KOKKOS_INLINE_FUNCTION static
+ size_t dimension( size_t const n , Kokkos::pair<iType,iType> const & r )
+ { return ( size_t(r.first) < size_t(r.second) && size_t(r.second) <= n ) ? size_t(r.second) - size_t(r.first) : 0 ; }
+
+ KOKKOS_INLINE_FUNCTION static
+ size_t begin( Kokkos::pair<iType,iType> const & r ) { return size_t(r.first) ; }
+};
+
+template< typename iType >
+struct ViewOffsetRange< std::initializer_list< iType > > {
+
+ static_assert( std::is_integral<iType>::value , "Range bounds must be an integral type" );
+
+ enum { is_range = true };
+
+ KOKKOS_INLINE_FUNCTION static
+ size_t dimension( size_t const n , std::initializer_list< iType > const & r )
+ {
+ return ( size_t(r.begin()[0]) < size_t(r.begin()[1]) && size_t(r.begin()[1]) <= n )
+ ? size_t(r.begin()[1]) - size_t(r.begin()[0]) : 0 ;
+ }
+
+ KOKKOS_INLINE_FUNCTION static
+ size_t begin( std::initializer_list< iType > const & r ) { return size_t(r.begin()[0]) ; }
+};
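+
+// Illustrative example (argument values hypothetical): for a source extent
+// n == 10, an integral argument such as 3 is not a range (is_range == false,
+// begin() == 3, dimension() == 0); ALL_t keeps the full extent
+// (dimension(10, ALL_t()) == 10, begin() == 0); and a pair such as
+// std::pair<int,int>(2,7) selects extent 7 - 2 == 5 starting at index 2.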
+
+}}} // namespace Kokkos::Experimental::Impl
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+namespace Kokkos {
+namespace Experimental {
+namespace Impl {
+
+/** \brief  ViewDataHandle provides the type of the 'data handle' which the view
+ *          uses to access data with the [] operator. It also provides
+ *          an allocate function and a function to extract a raw pointer from the
+ *          data handle. ViewDataHandle also defines an enum ReferenceAble which
+ *          specifies whether references/pointers to elements can be taken, and a
+ *          'return_type' which is what the view access operators return.
+ *          Specialization of this struct allows three things, depending
+ *          on ViewTraits and compiler options:
+ *          (i)   Use a special allocator (e.g. huge pages/small pages and pinned memory)
+ *          (ii)  Use a special data handle type (e.g. add a Cuda Texture Object)
+ *          (iii) Use special access intrinsics (e.g. texture fetch and non-caching loads)
+ */
+template< class Traits , class Enable = void >
+struct ViewDataHandle {
+
+ typedef typename Traits::value_type value_type ;
+ typedef typename Traits::value_type * handle_type ;
+ typedef typename Traits::value_type & return_type ;
+ typedef Kokkos::Experimental::Impl::SharedAllocationTracker track_type ;
+
+ KOKKOS_INLINE_FUNCTION
+ static handle_type assign( value_type * arg_data_ptr
+ , track_type const & /*arg_tracker*/ )
+ {
+ return handle_type( arg_data_ptr );
+ }
+};
+
+template< class Traits >
+struct ViewDataHandle< Traits ,
+ typename std::enable_if<( std::is_same< typename Traits::non_const_value_type
+ , typename Traits::value_type >::value
+ &&
+ Traits::memory_traits::Atomic
+ )>::type >
+{
+ typedef typename Traits::value_type value_type ;
+ typedef typename Kokkos::Impl::AtomicViewDataHandle< Traits > handle_type ;
+ typedef typename Kokkos::Impl::AtomicDataElement< Traits > return_type ;
+ typedef Kokkos::Experimental::Impl::SharedAllocationTracker track_type ;
+
+ KOKKOS_INLINE_FUNCTION
+ static handle_type assign( value_type * arg_data_ptr
+ , track_type const & /*arg_tracker*/ )
+ {
+ return handle_type( arg_data_ptr );
+ }
+};
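+
+// Illustrative sketch (names 'raw_ptr' and 'tracker' are hypothetical, not call
+// sites in this patch): a mapping obtains its handle via
+//   handle_type h = ViewDataHandle< Traits >::assign( raw_ptr , tracker );
+// and then accesses element k as h[ k ].  With the default specialization
+// h[ k ] is a plain value_type & ; with the Atomic memory trait it is an
+// AtomicDataElement proxy whose loads and stores go through atomic operations.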
+
+}}} // namespace Kokkos::Experimental::Impl
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+namespace Kokkos {
+namespace Experimental {
+namespace Impl {
+
+template< class Traits
+ , bool R0 = false
+ , bool R1 = false
+ , bool R2 = false
+ , bool R3 = false
+ , bool R4 = false
+ , bool R5 = false
+ , bool R6 = false
+ , bool R7 = false
+ , typename Enable = void >
+struct SubviewMapping ;
+
+/** \brief View mapping for non-specialized data type and standard layout */
+template< class Traits >
+class ViewMapping< Traits , void ,
+ typename std::enable_if<(
+ std::is_same< typename Traits::specialize , void >::value
+ &&
+ (
+ std::is_same< typename Traits::array_layout , Kokkos::LayoutLeft >::value ||
+ std::is_same< typename Traits::array_layout , Kokkos::LayoutRight >::value ||
+ std::is_same< typename Traits::array_layout , Kokkos::LayoutStride >::value
+ )
+ )>::type >
+{
+private:
+
+ template< class , class , typename > friend class ViewMapping ;
+ template< class , bool , bool , bool , bool , bool , bool , bool , bool , class > friend struct SubviewMapping ;
+ template< class , class , class , class > friend class Kokkos::Experimental::View ;
+
+ typedef ViewOffset< typename Traits::dimension
+ , typename Traits::array_layout
+ , void
+ > offset_type ;
+
+ typedef typename ViewDataHandle< Traits >::handle_type handle_type ;
+
+ handle_type m_handle ;
+ offset_type m_offset ;
+
+public:
+
+ //----------------------------------------
+ // Domain dimensions
+
+ enum { Rank = Traits::dimension::rank };
+
+ KOKKOS_INLINE_FUNCTION constexpr size_t dimension_0() const { return m_offset.dimension_0(); }
+ KOKKOS_INLINE_FUNCTION constexpr size_t dimension_1() const { return m_offset.dimension_1(); }
+ KOKKOS_INLINE_FUNCTION constexpr size_t dimension_2() const { return m_offset.dimension_2(); }
+ KOKKOS_INLINE_FUNCTION constexpr size_t dimension_3() const { return m_offset.dimension_3(); }
+ KOKKOS_INLINE_FUNCTION constexpr size_t dimension_4() const { return m_offset.dimension_4(); }
+ KOKKOS_INLINE_FUNCTION constexpr size_t dimension_5() const { return m_offset.dimension_5(); }
+ KOKKOS_INLINE_FUNCTION constexpr size_t dimension_6() const { return m_offset.dimension_6(); }
+ KOKKOS_INLINE_FUNCTION constexpr size_t dimension_7() const { return m_offset.dimension_7(); }
+
+ KOKKOS_INLINE_FUNCTION constexpr size_t stride_0() const { return m_offset.stride_0(); }
+ KOKKOS_INLINE_FUNCTION constexpr size_t stride_1() const { return m_offset.stride_1(); }
+ KOKKOS_INLINE_FUNCTION constexpr size_t stride_2() const { return m_offset.stride_2(); }
+ KOKKOS_INLINE_FUNCTION constexpr size_t stride_3() const { return m_offset.stride_3(); }
+ KOKKOS_INLINE_FUNCTION constexpr size_t stride_4() const { return m_offset.stride_4(); }
+ KOKKOS_INLINE_FUNCTION constexpr size_t stride_5() const { return m_offset.stride_5(); }
+ KOKKOS_INLINE_FUNCTION constexpr size_t stride_6() const { return m_offset.stride_6(); }
+ KOKKOS_INLINE_FUNCTION constexpr size_t stride_7() const { return m_offset.stride_7(); }
+
+ /*
+ KOKKOS_INLINE_FUNCTION
+ Kokkos::Array<size_t,Rank> dimension() const
+ { return Kokkos::Experimental::Impl::dimension( m_offset.m_dim ); }
+ */
+
+ //----------------------------------------
+ // Range span
+
+ /** \brief Span of the mapped range */
+ KOKKOS_INLINE_FUNCTION constexpr size_t span() const { return m_offset.span(); }
+
+ /** \brief Is the mapped range span contiguous */
+ KOKKOS_INLINE_FUNCTION constexpr bool span_is_contiguous() const { return m_offset.span_is_contiguous(); }
+
+ typedef typename ViewDataHandle< Traits >::return_type reference_type ;
+
+ /** \brief  If data references are lvalue references then the pointer to memory can be queried */
+ KOKKOS_INLINE_FUNCTION constexpr typename Traits::value_type * data() const
+ {
+ typedef typename Traits::value_type * ptr_type ;
+
+ return std::is_lvalue_reference< reference_type >::value
+ ? (ptr_type) m_handle
+ : (ptr_type) 0 ;
+ }
+
+ //----------------------------------------
+ // The View class performs all rank and bounds checking before
+ // calling these element reference methods.
+
+ KOKKOS_FORCEINLINE_FUNCTION
+ reference_type reference() const { return m_handle[0]; }
+
+ template< typename I0 >
+ KOKKOS_FORCEINLINE_FUNCTION
+ reference_type reference( const I0 & i0 ) const { return m_handle[i0]; }
+
+ template< typename I0 , typename I1 >
+ KOKKOS_FORCEINLINE_FUNCTION
+ reference_type reference( const I0 & i0 , const I1 & i1 ) const
+ { return m_handle[ m_offset(i0,i1) ]; }
+
+ template< typename I0 , typename I1 , typename I2 >
+ KOKKOS_FORCEINLINE_FUNCTION
+ reference_type reference( const I0 & i0 , const I1 & i1 , const I2 & i2 ) const
+ { return m_handle[ m_offset(i0,i1,i2) ]; }
+
+ template< typename I0 , typename I1 , typename I2 , typename I3 >
+ KOKKOS_FORCEINLINE_FUNCTION
+ reference_type reference( const I0 & i0 , const I1 & i1 , const I2 & i2 , const I3 & i3 ) const
+ { return m_handle[ m_offset(i0,i1,i2,i3) ]; }
+
+ template< typename I0 , typename I1 , typename I2 , typename I3
+ , typename I4 >
+ KOKKOS_FORCEINLINE_FUNCTION
+ reference_type reference( const I0 & i0 , const I1 & i1 , const I2 & i2 , const I3 & i3
+ , const I4 & i4 ) const
+ { return m_handle[ m_offset(i0,i1,i2,i3,i4) ]; }
+
+ template< typename I0 , typename I1 , typename I2 , typename I3
+ , typename I4 , typename I5 >
+ KOKKOS_FORCEINLINE_FUNCTION
+ reference_type reference( const I0 & i0 , const I1 & i1 , const I2 & i2 , const I3 & i3
+ , const I4 & i4 , const I5 & i5 ) const
+ { return m_handle[ m_offset(i0,i1,i2,i3,i4,i5) ]; }
+
+ template< typename I0 , typename I1 , typename I2 , typename I3
+ , typename I4 , typename I5 , typename I6 >
+ KOKKOS_FORCEINLINE_FUNCTION
+ reference_type reference( const I0 & i0 , const I1 & i1 , const I2 & i2 , const I3 & i3
+ , const I4 & i4 , const I5 & i5 , const I6 & i6 ) const
+ { return m_handle[ m_offset(i0,i1,i2,i3,i4,i5,i6) ]; }
+
+ template< typename I0 , typename I1 , typename I2 , typename I3
+ , typename I4 , typename I5 , typename I6 , typename I7 >
+ KOKKOS_FORCEINLINE_FUNCTION
+ reference_type reference( const I0 & i0 , const I1 & i1 , const I2 & i2 , const I3 & i3
+ , const I4 & i4 , const I5 & i5 , const I6 & i6 , const I7 & i7 ) const
+ { return m_handle[ m_offset(i0,i1,i2,i3,i4,i5,i6,i7) ]; }
+
+ //----------------------------------------
+
+private:
+
+ enum { MemorySpanMask = 8 - 1 /* Force alignment on 8 byte boundary */ };
+ enum { MemorySpanSize = sizeof(typename Traits::value_type) };
+
+public:
+
+ /** \brief Span, in bytes, of the referenced memory */
+ KOKKOS_INLINE_FUNCTION constexpr size_t memory_span() const
+ {
+ return ( m_offset.span() * sizeof(typename Traits::value_type) + MemorySpanMask ) & ~size_t(MemorySpanMask);
+ }
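+
+ // Illustrative example (hypothetical values): for value_type float (4 bytes)
+ // and m_offset.span() == 1003, the raw byte count is 4012, and
+ // ( 4012 + 7 ) & ~size_t(7) rounds it up to 4016, the next multiple of the
+ // 8-byte boundary enforced by MemorySpanMask.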
+
+ /** \brief Span, in bytes, of the required memory */
+ template< bool AllowPadding >
+ KOKKOS_INLINE_FUNCTION
+ static constexpr size_t memory_span( const std::integral_constant<bool,AllowPadding> &
+ , const size_t N0 , const size_t N1 , const size_t N2 , const size_t N3
+ , const size_t N4 , const size_t N5 , const size_t N6 , const size_t N7 )
+ {
+ typedef std::integral_constant< unsigned , AllowPadding ? MemorySpanSize : 0 > padding ;
+ return ( offset_type( padding(), N0, N1, N2, N3, N4, N5, N6, N7 ).span() * MemorySpanSize + MemorySpanMask ) & ~size_t(MemorySpanMask);
+ }
+
+ /** \brief Span, in bytes, of the required memory */
+ template< bool AllowPadding >
+ KOKKOS_INLINE_FUNCTION
+ static constexpr size_t memory_span( const std::integral_constant<bool,AllowPadding> &
+ , const typename Traits::array_layout & layout )
+ {
+ return ( offset_type( layout ).span() * MemorySpanSize + MemorySpanMask ) & ~size_t(MemorySpanMask);
+ }
+
+ //----------------------------------------
+
+ KOKKOS_INLINE_FUNCTION ~ViewMapping() {}
+ KOKKOS_INLINE_FUNCTION ViewMapping() : m_handle(), m_offset() {}
+ KOKKOS_INLINE_FUNCTION ViewMapping( const ViewMapping & rhs )
+ : m_handle( rhs.m_handle ), m_offset( rhs.m_offset ) {}
+ KOKKOS_INLINE_FUNCTION ViewMapping & operator = ( const ViewMapping & rhs )
+ { m_handle = rhs.m_handle ; m_offset = rhs.m_offset ; return *this ; }
+
+ KOKKOS_INLINE_FUNCTION ViewMapping( ViewMapping && rhs )
+ : m_handle( rhs.m_handle ), m_offset( rhs.m_offset ) {}
+ KOKKOS_INLINE_FUNCTION ViewMapping & operator = ( ViewMapping && rhs )
+ { m_handle = rhs.m_handle ; m_offset = rhs.m_offset ; return *this ; }
+
+ template< bool AllowPadding >
+ KOKKOS_INLINE_FUNCTION
+ ViewMapping( void * ptr
+ , const std::integral_constant<bool,AllowPadding> &
+ , const size_t N0 , const size_t N1 , const size_t N2 , const size_t N3
+ , const size_t N4 , const size_t N5 , const size_t N6 , const size_t N7 )
+ : m_handle( reinterpret_cast< handle_type >( ptr ) )
+ , m_offset( std::integral_constant< unsigned , AllowPadding ? sizeof(typename Traits::value_type) : 0 >()
+ , N0, N1, N2, N3, N4, N5, N6, N7 )
+ {}
+
+ template< bool AllowPadding >
+ KOKKOS_INLINE_FUNCTION
+ ViewMapping( void * ptr
+ , const std::integral_constant<bool,AllowPadding> &
+ , const typename Traits::array_layout & layout )
+ : m_handle( reinterpret_cast< handle_type >( ptr ) )
+ , m_offset( layout )
+ {}
+
+ //----------------------------------------
+ // Functors used when the View must construct or destroy its elements.
+
+ struct FunctorTagConstructScalar {};
+ struct FunctorTagConstructNonScalar {};
+ struct FunctorTagDestructNonScalar {};
+
+ KOKKOS_FORCEINLINE_FUNCTION
+ void operator()( const FunctorTagConstructScalar & , const size_t i ) const
+ { m_handle[i] = 0 ; }
+
+ KOKKOS_FORCEINLINE_FUNCTION
+ void operator()( const FunctorTagConstructNonScalar & , const size_t i ) const
+ {
+ typedef typename Traits::value_type value_type ;
+ new( & m_handle[i] ) value_type();
+ }
+
+ KOKKOS_FORCEINLINE_FUNCTION
+ void operator()( const FunctorTagDestructNonScalar & , const size_t i ) const
+ {
+ typedef typename Traits::value_type value_type ;
+ ( & (m_handle[i]) )->~value_type();
+ }
+
+ template< class ExecSpace >
+ typename std::enable_if< Kokkos::Impl::is_execution_space<ExecSpace>::value &&
+ std::is_scalar< typename Traits::value_type >::value >::type
+ construct( const ExecSpace & space ) const
+ {
+ typedef Kokkos::RangePolicy< ExecSpace , FunctorTagConstructScalar , size_t > Policy ;
+
+ (void) Kokkos::Impl::ParallelFor< ViewMapping , Policy >( *this , Policy( 0 , m_offset.span() ) );
+ }
+
+ template< class ExecSpace >
+ typename std::enable_if< Kokkos::Impl::is_execution_space<ExecSpace>::value &&
+ ! std::is_scalar< typename Traits::value_type >::value >::type
+ construct( const ExecSpace & space ) const
+ {
+ typedef Kokkos::RangePolicy< ExecSpace , FunctorTagConstructNonScalar , size_t > Policy ;
+
+ (void) Kokkos::Impl::ParallelFor< ViewMapping , Policy >( *this , Policy( 0 , m_offset.span() ) );
+ }
+
+ template< class ExecSpace >
+ typename std::enable_if< Kokkos::Impl::is_execution_space<ExecSpace>::value &&
+ std::is_scalar< typename Traits::value_type >::value >::type
+ destroy( const ExecSpace & ) const {}
+
+ template< class ExecSpace >
+ typename std::enable_if< Kokkos::Impl::is_execution_space<ExecSpace>::value &&
+ ! std::is_scalar< typename Traits::value_type >::value >::type
+ destroy( const ExecSpace & space ) const
+ {
+ typedef Kokkos::RangePolicy< ExecSpace , FunctorTagDestructNonScalar , size_t > Policy ;
+
+ (void) Kokkos::Impl::ParallelFor< ViewMapping , Policy >( *this , Policy( 0 , m_offset.span() ) );
+ }
+};
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+/** \brief Assign compatible default mappings */
+
+template< class DstTraits , class SrcTraits >
+class ViewMapping< DstTraits , SrcTraits ,
+ typename std::enable_if<(
+ std::is_same< typename DstTraits::memory_space , typename SrcTraits::memory_space >::value
+ &&
+ std::is_same< typename DstTraits::specialize , void >::value
+ &&
+ (
+ std::is_same< typename DstTraits::array_layout , Kokkos::LayoutLeft >::value ||
+ std::is_same< typename DstTraits::array_layout , Kokkos::LayoutRight >::value ||
+ std::is_same< typename DstTraits::array_layout , Kokkos::LayoutStride >::value
+ )
+ &&
+ std::is_same< typename SrcTraits::specialize , void >::value
+ &&
+ (
+ std::is_same< typename SrcTraits::array_layout , Kokkos::LayoutLeft >::value ||
+ std::is_same< typename SrcTraits::array_layout , Kokkos::LayoutRight >::value ||
+ std::is_same< typename SrcTraits::array_layout , Kokkos::LayoutStride >::value
+ )
+ )>::type >
+{
+public:
+
+ enum { is_assignable = true };
+
+ typedef Kokkos::Experimental::Impl::SharedAllocationTracker TrackType ;
+ typedef ViewMapping< DstTraits , void , void > DstType ;
+ typedef ViewMapping< SrcTraits , void , void > SrcType ;
+
+ KOKKOS_INLINE_FUNCTION
+ static void assign( DstType & dst , const SrcType & src , const TrackType & src_track )
+ {
+ static_assert( std::is_same< typename DstTraits::value_type , typename SrcTraits::value_type >::value ||
+ std::is_same< typename DstTraits::value_type , typename SrcTraits::const_value_type >::value
+ , "View assignment must have same value type or const = non-const" );
+
+ static_assert( ViewDimensionAssignable< typename DstTraits::dimension , typename SrcTraits::dimension >::value
+ , "View assignment must have compatible dimensions" );
+
+ static_assert( std::is_same< typename DstTraits::array_layout , typename SrcTraits::array_layout >::value ||
+ std::is_same< typename DstTraits::array_layout , Kokkos::LayoutStride >::value ||
+ ( DstTraits::dimension::rank == 0 ) ||
+ ( DstTraits::dimension::rank == 1 && DstTraits::dimension::rank_dynamic == 1 )
+ , "View assignment must have compatible layout or have rank <= 1" );
+
+ typedef typename DstType::offset_type dst_offset_type ;
+
+ dst.m_offset = dst_offset_type( src.m_offset );
+ dst.m_handle = Kokkos::Experimental::Impl::ViewDataHandle< DstTraits >::assign( src.m_handle , src_track );
+ }
+};
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+/** \brief View mapping for non-specialized data type and standard layout */
+template< class Traits , bool R0 , bool R1 , bool R2 , bool R3 , bool R4 , bool R5 , bool R6 , bool R7 >
+struct SubviewMapping< Traits, R0, R1, R2, R3, R4, R5, R6, R7 ,
+ typename std::enable_if<(
+ std::is_same< typename Traits::specialize , void >::value
+ &&
+ (
+ std::is_same< typename Traits::array_layout , Kokkos::LayoutLeft >::value ||
+ std::is_same< typename Traits::array_layout , Kokkos::LayoutRight >::value ||
+ std::is_same< typename Traits::array_layout , Kokkos::LayoutStride >::value
+ )
+ )>::type >
+{
+private:
+
+ // Subview's rank
+ enum { rank = unsigned(R0) + unsigned(R1) + unsigned(R2) + unsigned(R3)
+ + unsigned(R4) + unsigned(R5) + unsigned(R6) + unsigned(R7) };
+
+ // Whether the right-most index is a range.
+ enum { R0_rev = 0 == Traits::rank ? false : (
+ 1 == Traits::rank ? R0 : (
+ 2 == Traits::rank ? R1 : (
+ 3 == Traits::rank ? R2 : (
+ 4 == Traits::rank ? R3 : (
+ 5 == Traits::rank ? R4 : (
+ 6 == Traits::rank ? R5 : (
+ 7 == Traits::rank ? R6 : R7 ))))))) };
+
+ // Subview's layout
+ typedef typename std::conditional<
+ ( /* Same array layout IF */
+ ( rank == 0 ) /* output rank zero */
+ ||
+ // OutputRank 1 or 2, InputLayout Left, Interval 0
+ // because either the single index has stride one or the second index carries the stride.
+ ( rank <= 2 && R0 && std::is_same< typename Traits::array_layout , Kokkos::LayoutLeft >::value )
+ ||
+ // OutputRank 1 or 2, InputLayout Right, Interval [InputRank-1]
+ // because either the single index has stride one or the second index carries the stride.
+ ( rank <= 2 && R0_rev && std::is_same< typename Traits::array_layout , Kokkos::LayoutRight >::value )
+ ), typename Traits::array_layout , Kokkos::LayoutStride
+ >::type array_layout ;
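+
+ // Illustrative example (cases hypothetical): for a rank-2 LayoutRight source,
+ // subview arguments ( pair , ALL ) give rank == 2 with R0_rev == true, so
+ // array_layout stays LayoutRight; arguments ( ALL , 3 ) give rank == 1 with
+ // R0_rev == false, so the column slice falls back to LayoutStride because its
+ // elements are separated by the row stride.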
+
+ typedef typename Traits::value_type value_type ;
+
+ typedef typename std::conditional< rank == 0 , value_type ,
+ typename std::conditional< rank == 1 , value_type * ,
+ typename std::conditional< rank == 2 , value_type ** ,
+ typename std::conditional< rank == 3 , value_type *** ,
+ typename std::conditional< rank == 4 , value_type **** ,
+ typename std::conditional< rank == 5 , value_type ***** ,
+ typename std::conditional< rank == 6 , value_type ****** ,
+ typename std::conditional< rank == 7 , value_type ******* ,
+ value_type ********
+ >::type >::type >::type >::type >::type >::type >::type >::type
+ data_type ;
+
+public:
+
+ typedef
+ Kokkos::Experimental::ViewTraits< data_type , array_layout
+ , typename Traits::device_type
+ , typename Traits::memory_traits > traits_type ;
+
+ typedef Kokkos::Experimental::View< data_type
+ , array_layout
+ , typename Traits::device_type
+ , typename Traits::memory_traits > type ;
+
+ template< class T0 , class T1 , class T2 , class T3
+ , class T4 , class T5 , class T6 , class T7 >
+ KOKKOS_INLINE_FUNCTION
+ static void assign( ViewMapping< traits_type , void , void > & dst
+ , ViewMapping< Traits , void , void > const & src
+ , T0 const & arg0
+ , T1 const & arg1
+ , T2 const & arg2
+ , T3 const & arg3
+ , T4 const & arg4
+ , T5 const & arg5
+ , T6 const & arg6
+ , T7 const & arg7
+ )
+ {
+ typedef ViewMapping< traits_type , void , void > DstType ;
+
+ typedef typename DstType::offset_type dst_offset_type ;
+ typedef typename DstType::handle_type dst_handle_type ;
+
+ typedef Kokkos::Experimental::Impl::ViewOffsetRange<T0> V0 ;
+ typedef Kokkos::Experimental::Impl::ViewOffsetRange<T1> V1 ;
+ typedef Kokkos::Experimental::Impl::ViewOffsetRange<T2> V2 ;
+ typedef Kokkos::Experimental::Impl::ViewOffsetRange<T3> V3 ;
+ typedef Kokkos::Experimental::Impl::ViewOffsetRange<T4> V4 ;
+ typedef Kokkos::Experimental::Impl::ViewOffsetRange<T5> V5 ;
+ typedef Kokkos::Experimental::Impl::ViewOffsetRange<T6> V6 ;
+ typedef Kokkos::Experimental::Impl::ViewOffsetRange<T7> V7 ;
+
+ dst.m_offset = dst_offset_type
+ ( src.m_offset
+ , V0::dimension( src.m_offset.dimension_0() , arg0 )
+ , V1::dimension( src.m_offset.dimension_1() , arg1 )
+ , V2::dimension( src.m_offset.dimension_2() , arg2 )
+ , V3::dimension( src.m_offset.dimension_3() , arg3 )
+ , V4::dimension( src.m_offset.dimension_4() , arg4 )
+ , V5::dimension( src.m_offset.dimension_5() , arg5 )
+ , V6::dimension( src.m_offset.dimension_6() , arg6 )
+ , V7::dimension( src.m_offset.dimension_7() , arg7 )
+ );
+
+ dst.m_handle = dst_handle_type( src.m_handle +
+ src.m_offset( V0::begin( arg0 )
+ , V1::begin( arg1 )
+ , V2::begin( arg2 )
+ , V3::begin( arg3 )
+ , V4::begin( arg4 )
+ , V5::begin( arg5 )
+ , V6::begin( arg6 )
+ , V7::begin( arg7 )
+ ) );
+ }
+};
+
+}}} // namespace Kokkos::Experimental::Impl
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+namespace Kokkos {
+namespace Experimental {
+namespace Impl {
+
+template< class V
+ , bool R0 = false , bool R1 = false , bool R2 = false , bool R3 = false
+ , bool R4 = false , bool R5 = false , bool R6 = false , bool R7 = false >
+struct SubviewType ;
+
+template< class D , class A1, class A2, class A3
+ , bool R0 , bool R1 , bool R2 , bool R3
+ , bool R4 , bool R5 , bool R6 , bool R7 >
+struct SubviewType< Kokkos::Experimental::View< D , A1, A2, A3 > , R0 , R1 , R2 , R3 , R4 , R5 , R6 , R7 >
+{
+private:
+ typedef Kokkos::Experimental::ViewTraits< D , A1 , A2 , A3 > traits ;
+ typedef Kokkos::Experimental::Impl::SubviewMapping< traits , R0 , R1 , R2 , R3 , R4 , R5 , R6 , R7 > mapping ;
+public:
+ typedef typename mapping::type type ;
+};
+
+}}} // namespace Kokkos::Experimental::Impl
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+namespace Kokkos {
+namespace Experimental {
+namespace Impl {
+
+class Error_view_scalar_reference_to_non_scalar_view ;
+
+} /* namespace Impl */
+} /* namespace Experimental */
+} /* namespace Kokkos */
+
+#if defined( KOKKOS_EXPRESSION_CHECK )
+
+#define KOKKOS_ASSERT_VIEW_MAPPING_ACCESS( SPACE , MAP , RANK , I0 , I1 , I2 , I3 , I4 , I5 , I6 , I7 ) \
+ Kokkos::Impl::VerifyExecutionCanAccessMemorySpace< \
+ Kokkos::Impl::ActiveExecutionMemorySpace , SPACE >::verify( MAP.data() ); \
+ /* array bounds checking */
+
+#else
+
+#define KOKKOS_ASSERT_VIEW_MAPPING_ACCESS( SPACE , MAP , RANK , I0 , I1 , I2 , I3 , I4 , I5 , I6 , I7 ) \
+ Kokkos::Impl::VerifyExecutionCanAccessMemorySpace< \
+ Kokkos::Impl::ActiveExecutionMemorySpace , SPACE >::verify( MAP.data() )
+
+#endif
+
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+
+#endif /* #ifndef KOKKOS_EXPERIMENTAL_VIEW_MAPPING_HPP */
+
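
The SubviewMapping::assign() above derives the subview extents from the length of each index range and the new data handle from the parent offset at each range's begin. A minimal 2D sketch of that bookkeeping, assuming a hypothetical View2D struct with an explicit row stride rather than the Kokkos layout machinery (not part of the patch):

#include <cstdio>
#include <utility>

// Toy model of what assign() computes: the new extents are the range lengths,
// and the new base pointer is the old pointer advanced to each range's begin.
struct View2D {
  double * ptr;   // data handle
  int n0, n1;     // extents
  int stride;     // elements between consecutive rows (row-major)
  double & operator()(int i, int j) const { return ptr[i * stride + j]; }
};

View2D subview(const View2D & src, std::pair<int,int> r0, std::pair<int,int> r1) {
  View2D dst;
  dst.n0     = r0.second - r0.first;      // V0::dimension( dim0 , arg0 )
  dst.n1     = r1.second - r1.first;      // V1::dimension( dim1 , arg1 )
  dst.stride = src.stride;                // layout inherited from the parent
  dst.ptr    = &src(r0.first, r1.first);  // handle + offset( begin0 , begin1 )
  return dst;
}

int main() {
  double data[4 * 6] = {0};
  View2D a = { data, 4, 6, 6 };
  View2D s = subview(a, std::make_pair(1, 3), std::make_pair(2, 5)); // 2 x 3 window
  s(1, 2) = 7.0;
  std::printf("a(2,4) = %g\n", a(2, 4));  // aliases the same element: prints 7
  return 0;
}
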
diff --git a/lib/kokkos/core/src/impl/Kokkos_AllocationTracker.cpp b/lib/kokkos/core/src/impl/Kokkos_AllocationTracker.cpp
new file mode 100755
index 000000000..7fb33853d
--- /dev/null
+++ b/lib/kokkos/core/src/impl/Kokkos_AllocationTracker.cpp
@@ -0,0 +1,844 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <Kokkos_Core_fwd.hpp>
+
+#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
+
+#include <Kokkos_Atomic.hpp>
+
+#include <impl/Kokkos_Singleton.hpp>
+#include <impl/Kokkos_AllocationTracker.hpp>
+#include <impl/Kokkos_Error.hpp>
+
+
+#include <string>
+#include <vector>
+#include <sstream>
+#include <algorithm>
+#include <utility>
+#include <cstdlib>
+#include <cstring>
+#include <iostream>
+#include <iomanip>
+
+/* Enable clean up of memory leaks */
+#define CLEAN_UP_MEMORY_LEAKS 0
+
+namespace Kokkos { namespace Impl {
+
+namespace {
+
+
+//-----------------------------------------------------------------------------
+// AllocationRecord
+//-----------------------------------------------------------------------------
+//
+// Used to track details about an allocation and provide a ref count
+// sizeof(AllocationRecord) == 128
+struct AllocationRecord
+{
+ enum {
+ OFFSET = sizeof(AllocatorBase*) // allocator
+ + sizeof(void*) // alloc_ptr
+ + sizeof(uint64_t) // alloc_size
+ + sizeof(AllocatorAttributeBase*) // attribute
+ + sizeof(uint32_t) // node_index
+ + sizeof(uint32_t) // ref_count
+ , LABEL_LENGTH = 128 - OFFSET
+ };
+
+ AllocatorBase * const allocator;
+ void * const alloc_ptr;
+ const uint64_t alloc_size;
+ AllocatorAttributeBase * const attribute;
+ const int32_t node_index;
+ volatile uint32_t ref_count;
+ const char label[LABEL_LENGTH];
+
+
+ AllocationRecord( AllocatorBase * const arg_allocator
+ , void * arg_alloc_ptr
+ , uint64_t arg_alloc_size
+ , int32_t arg_node_index
+ , const std::string & arg_label
+ )
+ : allocator(arg_allocator)
+ , alloc_ptr(arg_alloc_ptr)
+ , alloc_size(arg_alloc_size)
+ , attribute(NULL)
+ , node_index(arg_node_index)
+ , ref_count(1)
+ , label() // zero fill
+ {
+ const size_t length = static_cast<size_t>(LABEL_LENGTH-1u) < arg_label.size() ? static_cast<size_t>(LABEL_LENGTH-1u) : arg_label.size();
+ strncpy( const_cast<char *>(label), arg_label.c_str(), length );
+ }
+
+ ~AllocationRecord()
+ {
+ if (attribute) {
+ delete attribute;
+ }
+ }
+
+ uint32_t increment_ref_count()
+ {
+ uint32_t old_value = atomic_fetch_add( &ref_count, static_cast<uint32_t>(1) );
+ return old_value + 1u;
+ }
+
+ uint32_t decrement_ref_count()
+ {
+ uint32_t old_value = atomic_fetch_sub( &ref_count, static_cast<uint32_t>(1) );
+ return old_value - 1u;
+ }
+
+ void print( std::ostream & oss ) const
+ {
+ oss << "{ " << allocator->name()
+ << " } : \"" << label
+ << "\" ref_count(" << ref_count
+ << ") memory[ " << alloc_ptr
+ << " + " << alloc_size
+ << " ]" ;
+ }
+
+ bool set_attribute( AllocatorAttributeBase * attr )
+ {
+ bool result = false;
+ if (attribute == NULL) {
+ result = NULL == atomic_compare_exchange( const_cast<AllocatorAttributeBase **>(&attribute)
+ , reinterpret_cast<AllocatorAttributeBase *>(NULL)
+ , attr );
+ }
+
+ return result;
+ }
+
+ // disallow copy and assignment
+ AllocationRecord( const AllocationRecord & );
+ AllocationRecord & operator=(const AllocationRecord &);
+};
+
+template <int NumBlocks>
+struct Bitset
+{
+ enum { blocks = NumBlocks };
+ enum { size = blocks * 64 };
+ enum { block_mask = 63u };
+ enum { block_shift = 6 };
+
+ // used to find free bits in a bitset
+ static int count_trailing_zeros(uint64_t x)
+ {
+ #if defined( KOKKOS_COMPILER_GNU ) || defined( KOKKOS_COMPILER_CLANG ) || defined( KOKKOS_COMPILER_APPLECC )
+ return x ? __builtin_ctzll(x) : 64;
+ #elif defined( KOKKOS_COMPILER_INTEL )
+ enum { shift = 32 };
+ enum { mask = (static_cast<uint64_t>(1) << shift) - 1u };
+ return (x & mask) ? _bit_scan_forward(static_cast<int>(x & mask)) :
+ (x >> shift) ? shift + _bit_scan_forward(static_cast<int>(x >> shift)) :
+ 64 ;
+ #elif defined( KOKKOS_COMPILER_IBM )
+ return x ? __cnttz8(x) : 64;
+ #else
+ int i = 0;
+ for (; ((x & (static_cast<uint64_t>(1) << i)) == 0u) && i < 64; ++i ) {}
+ return i;
+ #endif
+ }
+
+ Bitset()
+ : m_bits()
+ {
+ for (int i=0; i < blocks; ++i) {
+ m_bits[i] = 0u;
+ }
+ }
+
+ bool set( int i )
+ {
+ const uint64_t bit = static_cast<uint64_t>(1) << ( i & block_mask );
+ return !( atomic_fetch_or( m_bits + (i >> block_shift), bit ) & bit );
+ }
+
+ bool reset( int i )
+ {
+ const uint64_t bit = static_cast<uint64_t>(1) << ( i & block_mask );
+ return atomic_fetch_and( m_bits + (i >> block_shift), ~bit ) & bit;
+ }
+
+ bool test( int i )
+ {
+ const uint64_t block = m_bits[ i >> block_shift ];
+ const uint64_t bit = static_cast<uint64_t>(1) << ( i & block_mask );
+ return block & bit;
+ }
+
+ int find_first_unset() const
+ {
+ for (int i=0; i < blocks; ++i) {
+ const uint64_t block = m_bits[i];
+ int b = count_trailing_zeros( ~block );
+
+ if ( b < 64 ) {
+ return (i << block_shift) + b;
+ }
+ }
+ return size;
+ }
+
+ volatile uint64_t m_bits[blocks];
+};
+
+//-----------------------------------------------------------------------------
+// AllocationRecordPool -- singleton class
+//
+// global_alloc_rec_pool is the ONLY instance of this class
+//
+//-----------------------------------------------------------------------------
+// Record AllocationRecords in a lock-free circular list.
+// Each node in the list has a buffer with space for 959 ((15*64)-1) records
+// managed by a bitset. Atomics are used to set and reset bits in the bit set.
+// The head of the list is atomically updated to the last node found with
+// unused space.
+//
+// Cost to create an allocation record: amortized O(1), worst case O(num nodes)
+// Cost to destroy an allocation record: O(1)
+//
+// Singleton allocations are pushed onto a lock-free stack that is destroyed
+// after the circular list of allocation records.
+struct AllocationRecordPool
+{
+ enum { BITSET_BLOCKS = 15 };
+
+ typedef Bitset<BITSET_BLOCKS> bitset_type;
+
+ enum { BUFFER_SIZE = (bitset_type::size - 1) * sizeof(AllocationRecord) };
+
+ struct AllocationNode
+ {
+ AllocationNode()
+ : next()
+ , bitset()
+ , buffer()
+ {
+ // set the first bit to used
+ bitset.set(0);
+ }
+
+ void * get_buffer( int32_t node_index )
+ {
+ return buffer + (node_index-1) * sizeof(AllocationRecord);
+ }
+
+ // return 0 if no space is available in the node
+ int32_t get_node_index()
+ {
+ int32_t node_index = 0;
+ do {
+ node_index = bitset.find_first_unset();
+
+ // successfully claimed a bit
+ if ( node_index != bitset.size && bitset.set(node_index) )
+ {
+ return node_index;
+ }
+ } while ( node_index != bitset.size );
+ return 0;
+ }
+
+ void clear_node_index( int32_t node_index )
+ {
+ bitset.reset(node_index);
+ }
+
+ AllocationNode * next;
+ bitset_type bitset;
+ char buffer[BUFFER_SIZE];
+ };
+
+ struct SingletonNode
+ {
+ void * buffer;
+ SingletonNode * next;
+ Impl::singleton_destroy_function_type destroy;
+
+ SingletonNode( size_t size, Impl::singleton_create_function_type create_func, Impl::singleton_destroy_function_type destroy_func )
+ : buffer(NULL)
+ , next(NULL)
+ , destroy(destroy_func)
+ {
+ if (size) {
+ buffer = malloc(size);
+ create_func(buffer);
+ }
+ }
+
+ ~SingletonNode()
+ {
+ if (buffer) {
+ try {
+ destroy(buffer);
+ } catch(...) {}
+ free(buffer);
+ }
+ }
+ };
+
+ AllocationRecordPool()
+ : head( new AllocationNode() )
+ , singleton_head(NULL)
+ {
+ // setup ring
+ head->next = head;
+ }
+
+ ~AllocationRecordPool()
+ {
+ // delete allocation records
+ {
+ AllocationNode * start = head;
+
+ AllocationNode * curr = start;
+
+ std::vector< std::string > string_vec;
+
+ do {
+ AllocationNode * next = curr->next;
+
+ #if defined( KOKKOS_DEBUG_PRINT_ALLOCATION_BITSET )
+ // print node bitset
+ for (int i=0; i < bitset_type::blocks; ++i ) {
+ std::cout << std::hex << std::showbase << curr->bitset.m_bits[i] << " ";
+ }
+ std::cout << std::endl;
+ #endif
+
+ // bit zero does not map to an AllocationRecord
+ for ( int32_t i=1; i < bitset_type::size; ++i )
+ {
+ if (curr->bitset.test(i)) {
+ AllocationRecord * alloc_rec = reinterpret_cast<AllocationRecord *>( curr->get_buffer(i) );
+
+ std::ostringstream oss;
+ alloc_rec->print( oss );
+ string_vec.push_back( oss.str() );
+
+#if CLEAN_UP_MEMORY_LEAKS
+/* Cleaning up memory leaks prevents memory error detection tools
+ * from reporting the original source of allocation, which can
+ * impede debugging with such tools.
+ */
+ try {
+ destroy(alloc_rec);
+ }
+ catch(...) {}
+#endif
+ }
+ }
+
+ curr->next = NULL;
+
+ delete curr;
+
+ curr = next;
+ } while ( curr != start );
+
+ if ( !string_vec.empty() ) {
+ std::sort( string_vec.begin(), string_vec.end() );
+
+ std::ostringstream oss;
+ oss << "Error: Allocation pool destroyed with the following memory leak(s):\n";
+ for (size_t i=0; i< string_vec.size(); ++i)
+ {
+ oss << " " << string_vec[i] << std::endl;
+ }
+
+ std::cerr << oss.str() << std::endl;
+ }
+ }
+
+ // delete singletons
+ {
+ SingletonNode * curr = singleton_head;
+
+ while (curr) {
+ SingletonNode * next = curr->next;
+ delete curr;
+ curr = next;
+ }
+ }
+ }
+
+ AllocationRecord * create( AllocatorBase * arg_allocator
+ , void * arg_alloc_ptr
+ , size_t arg_alloc_size
+ , const std::string & arg_label
+ )
+ {
+ AllocationNode * start = volatile_load(&head);
+
+ AllocationNode * curr = start;
+
+
+ int32_t node_index = curr->get_node_index();
+
+ if (node_index == 0) {
+ curr = volatile_load(&curr->next);
+ }
+
+ while (node_index == 0 && curr != start)
+ {
+ node_index = curr->get_node_index();
+ if (node_index == 0) {
+ curr = volatile_load(&curr->next);
+ }
+ }
+
+ // Need to allocate and insert a new node
+ if (node_index == 0 && curr == start)
+ {
+ AllocationNode * new_node = new AllocationNode();
+
+ node_index = new_node->get_node_index();
+
+ AllocationNode * next = NULL;
+ do {
+ next = volatile_load(&curr->next);
+ new_node->next = next;
+ memory_fence();
+ } while ( next != atomic_compare_exchange( &(curr->next), next, new_node ) );
+
+ curr = new_node;
+ }
+
+ void * buffer = curr->get_buffer(node_index);
+
+ // try to set head to curr
+ if ( start != curr )
+ {
+ atomic_compare_exchange( & head, start, curr );
+ }
+
+ return new (buffer) AllocationRecord( arg_allocator
+ , arg_alloc_ptr
+ , arg_alloc_size
+ , node_index
+ , arg_label
+ );
+ }
+
+ void destroy( AllocationRecord * alloc_rec )
+ {
+ if (alloc_rec) {
+ const int32_t node_index = alloc_rec->node_index;
+ AllocationNode * node = get_node( alloc_rec );
+
+ // deallocate memory
+ alloc_rec->allocator->deallocate( alloc_rec->alloc_ptr, alloc_rec->alloc_size );
+
+ // call destructor
+ alloc_rec->~AllocationRecord();
+
+ // wait for writes to complete
+ memory_fence();
+
+ // clear node index
+ node->clear_node_index( node_index );
+ }
+ }
+
+ void * create_singleton( size_t size, Impl::singleton_create_function_type create_func, Impl::singleton_destroy_function_type destroy_func )
+ {
+ SingletonNode * node = new SingletonNode( size, create_func, destroy_func );
+ SingletonNode * next;
+
+ // insert new node at the head of the list
+ do {
+ next = volatile_load(&singleton_head);
+ node->next = next;
+ } while ( next != atomic_compare_exchange( &singleton_head, next, node ) );
+
+ return node->buffer;
+ }
+
+ void print_memory( std::ostream & out ) const
+ {
+ AllocationNode * start = head;
+
+ AllocationNode * curr = start;
+
+ std::vector< std::string > string_vec;
+
+ do {
+ AllocationNode * next = curr->next;
+
+ // bit zero does not map to an AllocationRecord
+ for ( int32_t i=1; i < bitset_type::size; ++i )
+ {
+ if (curr->bitset.test(i)) {
+ AllocationRecord * alloc_rec = reinterpret_cast<AllocationRecord *>( curr->get_buffer(i) );
+
+ std::ostringstream oss;
+ alloc_rec->print( oss );
+ string_vec.push_back( oss.str() );
+ }
+ }
+ curr = next;
+ } while ( curr != start );
+
+ if ( !string_vec.empty() ) {
+ std::sort( string_vec.begin(), string_vec.end() );
+
+ std::ostringstream oss;
+ oss << "Tracked Memory:" << std::endl;
+ for (size_t i=0; i< string_vec.size(); ++i)
+ {
+ oss << " " << string_vec[i] << std::endl;
+ }
+ out << oss.str() << std::endl;
+ }
+ else {
+ out << "No Tracked Memory" << std::endl;
+ }
+ }
+
+ // find an AllocationRecord such that
+ // alloc_ptr <= ptr < alloc_ptr + alloc_size
+ // otherwise return NULL
+ AllocationRecord * find( void const * ptr, AllocatorBase const * allocator ) const
+ {
+ AllocationNode * start = head;
+
+ AllocationNode * curr = start;
+
+ char const * const char_ptr = reinterpret_cast<const char *>(ptr);
+
+ do {
+ AllocationNode * next = curr->next;
+
+ // bit zero does not map to an AllocationRecord
+ for ( int32_t i=1; i < bitset_type::size; ++i )
+ {
+ if (curr->bitset.test(i)) {
+ AllocationRecord * alloc_rec = reinterpret_cast<AllocationRecord *>( curr->get_buffer(i) );
+
+ char const * const alloc_ptr = reinterpret_cast<char const *>(alloc_rec->alloc_ptr);
+
+ if ( (allocator == alloc_rec->allocator)
+ && (alloc_ptr <= char_ptr)
+ && (char_ptr < (alloc_ptr + alloc_rec->alloc_size)) )
+ {
+ return alloc_rec;
+ }
+ }
+ }
+ curr = next;
+ } while ( curr != start );
+
+ return NULL;
+ }
+
+private:
+
+ AllocationNode * get_node( AllocationRecord * alloc_rec )
+ {
+ return reinterpret_cast<AllocationNode *>( alloc_rec - alloc_rec->node_index);
+ }
+
+ AllocationNode * head;
+ SingletonNode * singleton_head;
+};
+
+// create the global pool for allocation records
+AllocationRecordPool global_alloc_rec_pool;
+
+
+
+// convert a uintptr_t to an AllocationRecord pointer
+inline
+AllocationRecord * to_alloc_rec( uintptr_t alloc_rec )
+{
+ return reinterpret_cast<AllocationRecord *>( alloc_rec & ~static_cast<uintptr_t>(1) );
+}
+
+} // unnamed namespace
+
+//-----------------------------------------------------------------------------
+// Allocation Tracker methods
+//-----------------------------------------------------------------------------
+
+// Create a reference counted AllocationTracker
+void AllocationTracker::initalize( AllocatorBase * arg_allocator
+ , void * arg_alloc_ptr
+ , size_t arg_alloc_size
+ , const std::string & arg_label
+ )
+{
+ if ( arg_allocator && arg_alloc_ptr && arg_alloc_size) {
+ // create record
+ AllocationRecord * alloc_rec = global_alloc_rec_pool.create( arg_allocator
+ , arg_alloc_ptr
+ , arg_alloc_size
+ , arg_label
+ );
+
+ m_alloc_rec = reinterpret_cast<uintptr_t>(alloc_rec) | REF_COUNT_BIT;
+ }
+}
+
+void AllocationTracker::reallocate( size_t size ) const
+{
+ AllocationRecord * rec = to_alloc_rec( m_alloc_rec );
+
+ void * the_alloc_ptr = rec->allocator->reallocate( rec->alloc_ptr, rec->alloc_size, size );
+
+ if ( NULL != the_alloc_ptr )
+ {
+ *const_cast<void **>(&rec->alloc_ptr) = the_alloc_ptr;
+ *const_cast<uint64_t *>(&rec->alloc_size) = size;
+ }
+ else {
+ Impl::throw_runtime_exception( "Error: unable to reallocate allocation tracker");
+ }
+}
+
+
+void AllocationTracker::increment_ref_count() const
+{
+ to_alloc_rec( m_alloc_rec )->increment_ref_count();
+}
+
+
+void AllocationTracker::decrement_ref_count() const
+{
+ AllocationRecord * alloc_rec = to_alloc_rec( m_alloc_rec );
+ uint32_t the_ref_count = alloc_rec->decrement_ref_count();
+ if (the_ref_count == 0u) {
+ try {
+ global_alloc_rec_pool.destroy( alloc_rec );
+ }
+ catch(...) {}
+ }
+}
+
+namespace {
+
+struct NullAllocator { static const char * name() { return "Null Allocator"; } };
+
+}
+
+AllocatorBase * AllocationTracker::allocator() const
+{
+ if (m_alloc_rec & REF_COUNT_MASK) {
+ return to_alloc_rec(m_alloc_rec)->allocator;
+ }
+ return Allocator<NullAllocator>::singleton();
+}
+
+void * AllocationTracker::alloc_ptr() const
+{
+ if (m_alloc_rec & REF_COUNT_MASK) {
+ return to_alloc_rec(m_alloc_rec)->alloc_ptr;
+ }
+ return NULL;
+}
+
+size_t AllocationTracker::alloc_size() const
+{
+ if (m_alloc_rec & REF_COUNT_MASK) {
+ return to_alloc_rec(m_alloc_rec)->alloc_size;
+ }
+ return 0u;
+}
+
+size_t AllocationTracker::ref_count() const
+{
+ if (m_alloc_rec & REF_COUNT_MASK) {
+ return to_alloc_rec(m_alloc_rec)->ref_count;
+ }
+ return 0u;
+}
+
+char const * AllocationTracker::label() const
+{
+ if (m_alloc_rec & REF_COUNT_MASK) {
+ return to_alloc_rec(m_alloc_rec)->label;
+ }
+ return "[Empty Allocation Tracker]";
+}
+
+void AllocationTracker::print( std::ostream & oss) const
+{
+ if (m_alloc_rec & REF_COUNT_MASK) {
+ to_alloc_rec(m_alloc_rec)->print(oss);
+ }
+ else {
+ oss << label();
+ }
+}
+
+bool AllocationTracker::set_attribute( AllocatorAttributeBase * attr ) const
+{
+ bool result = false;
+ if (m_alloc_rec & REF_COUNT_MASK) {
+ result = to_alloc_rec(m_alloc_rec)->set_attribute(attr);
+ }
+ return result;
+}
+
+AllocatorAttributeBase * AllocationTracker::attribute() const
+{
+ if (m_alloc_rec & REF_COUNT_MASK) {
+ return to_alloc_rec(m_alloc_rec)->attribute;
+ }
+ return NULL;
+}
+
+void AllocationTracker::print_tracked_memory( std::ostream & out )
+{
+ global_alloc_rec_pool.print_memory( out );
+}
+
+
+AllocationTracker AllocationTracker::find( void const * ptr, AllocatorBase const * arg_allocator )
+{
+ AllocationRecord * alloc_rec = global_alloc_rec_pool.find(ptr, arg_allocator);
+
+ AllocationTracker tracker;
+
+ if ( alloc_rec != NULL )
+ {
+ if ( tracking_enabled() ) {
+ alloc_rec->increment_ref_count();
+ tracker.m_alloc_rec = reinterpret_cast<uintptr_t>(alloc_rec) | REF_COUNT_BIT;
+ }
+ else {
+ tracker.m_alloc_rec = reinterpret_cast<uintptr_t>(alloc_rec);
+ }
+ }
+
+ return tracker ;
+}
+
+
+
+//-----------------------------------------------------------------------------
+// static AllocationTracker
+//-----------------------------------------------------------------------------
+#if defined( KOKKOS_USE_DECENTRALIZED_HOST )
+namespace {
+
+ // TODO : Detect compiler support for thread local variables
+ #if defined( KOKKOS_HAVE_DEFAULT_DEVICE_TYPE_OPENMP )
+ bool g_thread_local_tracking_enabled = true;
+ #pragma omp threadprivate(g_thread_local_tracking_enabled)
+ #elif defined( KOKKOS_HAVE_DEFAULT_DEVICE_TYPE_THREADS )
+ __thread bool g_thread_local_tracking_enabled = true;
+ #elif defined( KOKKOS_HAVE_OPENMP )
+ bool g_thread_local_tracking_enabled = true;
+ #pragma omp threadprivate(g_thread_local_tracking_enabled)
+ #elif defined( KOKKOS_HAVE_PTHREAD )
+ __thread bool g_thread_local_tracking_enabled = true;
+ #elif defined( KOKKOS_HAVE_SERIAL )
+ bool g_thread_local_tracking_enabled = true;
+ #endif
+} // unnamed namespace
+
+void AllocationTracker::disable_tracking()
+{
+ g_thread_local_tracking_enabled = false;
+}
+
+void AllocationTracker::enable_tracking()
+{
+ g_thread_local_tracking_enabled = true;
+}
+
+bool AllocationTracker::tracking_enabled()
+{
+ return g_thread_local_tracking_enabled;
+}
+#else
+namespace {
+enum TrackingEnum { TRACKING_ENABLED, TRACKING_DISABLED };
+volatile TrackingEnum g_tracking_enabled = TRACKING_ENABLED;
+}
+
+void AllocationTracker::disable_tracking()
+{
+ if ( TRACKING_ENABLED != atomic_compare_exchange( &g_tracking_enabled, TRACKING_ENABLED, TRACKING_DISABLED ) ) {
+ Impl::throw_runtime_exception("Error: Tracking already disabled");
+ }
+}
+
+void AllocationTracker::enable_tracking()
+{
+ if ( TRACKING_DISABLED != atomic_compare_exchange( &g_tracking_enabled, TRACKING_DISABLED, TRACKING_ENABLED ) ) {
+ Impl::throw_runtime_exception("Error: Tracking already enabled");
+ }
+}
+
+bool AllocationTracker::tracking_enabled()
+{
+ return g_tracking_enabled == TRACKING_ENABLED;
+}
+#endif
+
+
+//-----------------------------------------------------------------------------
+// create singleton free function
+//-----------------------------------------------------------------------------
+void * create_singleton( size_t size
+ , Impl::singleton_create_function_type create_func
+ , Impl::singleton_destroy_function_type destroy_func )
+{
+ return global_alloc_rec_pool.create_singleton( size, create_func, destroy_func );
+}
+
+}} // namespace Kokkos::Impl
+
+#endif /* #if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST ) */
+
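
The pool above claims a record slot by scanning a node's bitset for a clear bit and atomically setting it, retrying when another thread wins the race (Bitset::set / AllocationNode::get_node_index). A single-block sketch of that claim loop, assuming std::atomic in place of Kokkos' atomic_fetch_or; MiniBitset and claim_slot are illustrative names, not part of the patch:

#include <atomic>
#include <cstdint>
#include <cstdio>

struct MiniBitset {
  std::atomic<uint64_t> bits;             // 64 slots in this toy version

  MiniBitset() : bits(0) {}

  int find_first_unset() const {
    const uint64_t b = bits.load();
    for (int i = 0; i < 64; ++i)
      if (!(b & (uint64_t(1) << i))) return i;
    return 64;                            // full
  }
  bool set(int i) {
    const uint64_t bit = uint64_t(1) << i;
    return !(bits.fetch_or(bit) & bit);   // true only for the thread that flipped it
  }
};

int claim_slot(MiniBitset & bs) {
  for (;;) {
    const int i = bs.find_first_unset();
    if (i == 64) return -1;               // no space: the pool would add a new node
    if (bs.set(i)) return i;              // claimed; otherwise rescan and retry
  }
}

int main() {
  MiniBitset bs;
  std::printf("claimed slots %d and %d\n", claim_slot(bs), claim_slot(bs));
  return 0;
}
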
diff --git a/lib/kokkos/core/src/impl/Kokkos_AllocationTracker.hpp b/lib/kokkos/core/src/impl/Kokkos_AllocationTracker.hpp
new file mode 100755
index 000000000..331c4e8fa
--- /dev/null
+++ b/lib/kokkos/core/src/impl/Kokkos_AllocationTracker.hpp
@@ -0,0 +1,586 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#ifndef KOKKOS_ALLOCATION_TRACKER_HPP
+#define KOKKOS_ALLOCATION_TRACKER_HPP
+
+#include <Kokkos_Macros.hpp>
+
+#include <impl/Kokkos_Traits.hpp>
+#include <impl/Kokkos_Error.hpp>
+
+#include <stdint.h>
+#include <cstdlib>
+#include <string>
+#include <iosfwd>
+
+namespace Kokkos { namespace Impl {
+
+//-----------------------------------------------------------------------------
+// Create Singleton objects
+//-----------------------------------------------------------------------------
+
+typedef void * (*singleton_create_function_type)(void * buffer);
+typedef void (*singleton_destroy_function_type)(void *);
+
+void * create_singleton( size_t size
+ , singleton_create_function_type create_func
+ , singleton_destroy_function_type destroy_func
+ );
+
+
+
+/// class Singleton
+///
+/// Default construct a singleton type. This method is used to circumvent
+/// order of construction issues. Singleton objects are destroyed after all
+/// other allocations in the reverse order of their creation.
+template <typename Type>
+class Singleton
+{
+public:
+ /// Get a pointer to the Singleton. Default construct the singleton if it does not already exist
+ static Type * get()
+ {
+ static Type * singleton = NULL;
+ if (singleton == NULL) {
+ Impl::singleton_create_function_type create_func = &create;
+ Impl::singleton_destroy_function_type destroy_func = &destroy;
+ singleton = reinterpret_cast<Type*>( Impl::create_singleton( sizeof(Type), create_func, destroy_func ) );
+ }
+ return singleton;
+ }
+
+private:
+
+ /// Call the Type destructor
+ static void destroy(void * ptr)
+ {
+ reinterpret_cast<Type*>(ptr)->~Type();
+ }
+
+ /// placement new the Type in buffer
+ static void * create(void * buffer)
+ {
+ return new (buffer) Type();
+ }
+};
+
+
+//-----------------------------------------------------------------------------
+// AllocatorBase
+//-----------------------------------------------------------------------------
+
+/// class AllocatorBase
+///
+/// Abstract base class for all Allocators.
+/// Allocators should be singleton objects; use Singleton<Allocator>::get to create
+/// them and avoid order-of-destruction issues.
+class AllocatorBase
+{
+public:
+ /// name of the allocator
+ /// used to report memory leaks
+ virtual const char * name() const = 0;
+
+ /// Allocate a buffer of size number of bytes
+ virtual void* allocate(size_t size) const = 0;
+
+ /// Deallocate a buffer with size number of bytes
+ /// The pointer must have been allocated with a call to the corresponding allocate
+ virtual void deallocate(void * ptr, size_t size) const = 0;
+
+ /// Changes the size of the memory block pointed to by ptr.
+ /// Ptr must have been allocated with the corresponding allocate call
+ /// The function may move the memory block to a new location
+ /// (whose address is returned by the function).
+ ///
+ /// The content of the memory block is preserved up to the lesser of the new and
+ /// old sizes, even if the block is moved to a new location. If the new size is larger,
+ /// the value of the newly allocated portion is indeterminate.
+ ///
+ /// In case that ptr is a null pointer, the function behaves like allocate, assigning a
+ /// new block of size bytes and returning a pointer to its beginning.
+ virtual void * reallocate(void * old_ptr, size_t old_size, size_t new_size) const = 0;
+
+ /// can a texture object be bound to the allocated memory
+ virtual bool support_texture_binding() const = 0;
+
+ /// virtual destructor
+ virtual ~AllocatorBase() {}
+};
+
+/// class AllocatorAttributeBase
+class AllocatorAttributeBase
+{
+public:
+ virtual ~AllocatorAttributeBase() {}
+};
+
+//-----------------------------------------------------------------------------
+// Allocator< StaticAllocator > : public AllocatorBase
+//-----------------------------------------------------------------------------
+
+// HasStaticName
+template<typename T>
+class HasStaticName
+{
+ typedef const char * (*static_method)();
+ template<typename U, static_method> struct SFINAE {};
+ template<typename U> static char Test(SFINAE<U, &U::name>*);
+ template<typename U> static int Test(...);
+public:
+ enum { value = sizeof(Test<T>(0)) == sizeof(char) };
+};
+
+
+template <typename T>
+inline
+typename enable_if<HasStaticName<T>::value, const char *>::type
+allocator_name()
+{
+ return T::name();
+}
+
+template <typename T>
+inline
+typename enable_if<!HasStaticName<T>::value, const char *>::type
+allocator_name()
+{
+ return "Unnamed Allocator";
+}
+
+
+// HasStaticAllocate
+template<typename T>
+class HasStaticAllocate
+{
+ typedef void * (*static_method)(size_t);
+ template<typename U, static_method> struct SFINAE {};
+ template<typename U> static char Test(SFINAE<U, &U::allocate>*);
+ template<typename U> static int Test(...);
+public:
+ enum { value = sizeof(Test<T>(0)) == sizeof(char) };
+};
+
+template <typename T>
+inline
+typename enable_if<HasStaticAllocate<T>::value, void *>::type
+allocator_allocate(size_t size)
+{
+ return T::allocate(size);
+}
+
+template <typename T>
+inline
+typename enable_if<!HasStaticAllocate<T>::value, void *>::type
+allocator_allocate(size_t)
+{
+ throw_runtime_exception( std::string("Error: ")
+ + std::string(allocator_name<T>())
+ + std::string(" cannot allocate memory!") );
+ return NULL;
+}
+
+// HasStaticDeallocate
+template<typename T>
+class HasStaticDeallocate
+{
+ typedef void (*static_method)(void *, size_t);
+ template<typename U, static_method> struct SFINAE {};
+ template<typename U> static char Test(SFINAE<U, &U::deallocate>*);
+ template<typename U> static int Test(...);
+public:
+ enum { value = sizeof(Test<T>(0)) == sizeof(char) };
+};
+
+template <typename T>
+inline
+typename enable_if<HasStaticDeallocate<T>::value, void>::type
+allocator_deallocate(void * ptr, size_t size)
+{
+ T::deallocate(ptr,size);
+}
+
+template <typename T>
+inline
+typename enable_if<!HasStaticDeallocate<T>::value, void>::type
+allocator_deallocate(void *, size_t)
+{
+ throw_runtime_exception( std::string("Error: ")
+ + std::string(allocator_name<T>())
+ + std::string(" cannot deallocate memory!") );
+}
+
+// HasStaticReallocate
+template<typename T>
+class HasStaticReallocate
+{
+ typedef void * (*static_method)(void *, size_t, size_t);
+ template<typename U, static_method> struct SFINAE {};
+ template<typename U> static char Test(SFINAE<U, &U::reallocate>*);
+ template<typename U> static int Test(...);
+public:
+ enum { value = sizeof(Test<T>(0)) == sizeof(char) };
+};
+
+template <typename T>
+inline
+typename enable_if<HasStaticReallocate<T>::value, void *>::type
+allocator_reallocate(void * old_ptr, size_t old_size, size_t new_size)
+{
+ return T::reallocate(old_ptr, old_size, new_size);
+}
+
+template <typename T>
+inline
+typename enable_if<!HasStaticReallocate<T>::value, void *>::type
+allocator_reallocate(void *, size_t, size_t)
+{
+ throw_runtime_exception( std::string("Error: ")
+ + std::string(allocator_name<T>())
+ + std::string(" cannot reallocate memory!") );
+ return NULL;
+}
+
+// HasStaticSupportTextureBinding
+template<typename T>
+class HasStaticSupportTextureBinding
+{
+ typedef bool (*static_method)();
+ template<typename U, static_method> struct SFINAE {};
+ template<typename U> static char Test(SFINAE<U, &U::support_texture_binding>*);
+ template<typename U> static int Test(...);
+public:
+ enum { value = sizeof(Test<T>(0)) == sizeof(char) };
+};
+
+template <typename T>
+inline
+typename enable_if<HasStaticSupportTextureBinding<T>::value, bool>::type
+allocator_support_texture_binding()
+{
+ return T::support_texture_binding();
+}
+
+template <typename T>
+inline
+typename enable_if<!HasStaticSupportTextureBinding<T>::value, bool>::type
+allocator_support_texture_binding()
+{
+ return false;
+}
+
+template <typename T>
+class Allocator : public AllocatorBase
+{
+public:
+ virtual const char * name() const
+ {
+ return allocator_name<T>();
+ }
+
+ virtual void* allocate(size_t size) const
+ {
+ return allocator_allocate<T>(size);
+ }
+
+ virtual void deallocate(void * ptr, size_t size) const
+ {
+ allocator_deallocate<T>(ptr,size);
+ }
+
+ virtual void * reallocate(void * old_ptr, size_t old_size, size_t new_size) const
+ {
+ return allocator_reallocate<T>(old_ptr, old_size, new_size);
+ }
+
+ virtual bool support_texture_binding() const
+ {
+ return allocator_support_texture_binding<T>();
+ }
+
+ static AllocatorBase * singleton()
+ {
+ return Singleton< Allocator<T> >::get();
+ }
+};
+
+//-----------------------------------------------------------------------------
+// AllocationTracker
+//-----------------------------------------------------------------------------
+
+// forward declaration for friend classes
+struct CopyWithoutTracking;
+struct MallocHelper;
+
+/// class AllocationTracker
+/// Will call deallocate from the AllocatorBase when the reference count reaches 0.
+/// Reference counting is disabled when the host is in parallel.
+class AllocationTracker
+{
+ // use the least significant bit of the AllocationRecord pointer to indicate if the
+ // AllocationTracker should reference count
+ enum {
+ REF_COUNT_BIT = static_cast<uintptr_t>(1)
+ , REF_COUNT_MASK = ~static_cast<uintptr_t>(1)
+ };
+
+public:
+
+ /// Find an AllocationTracker such that
+ /// alloc_ptr <= ptr < alloc_ptr + alloc_size
+ /// O(n) where n is the number of tracked allocations.
+ template <typename StaticAllocator>
+ static AllocationTracker find( void const * ptr )
+ {
+ return find( ptr, Allocator<StaticAllocator>::singleton() );
+ }
+
+
+ /// Pretty print all the currently tracked memory
+ static void print_tracked_memory( std::ostream & out );
+
+ /// Default constructor
+ KOKKOS_INLINE_FUNCTION
+ AllocationTracker()
+ : m_alloc_rec(0)
+ {}
+
+ /// Create an AllocationTracker
+ ///
+ /// Start reference counting the alloc_ptr.
+ /// When the reference count reaches 0 the allocator deallocate method
+ /// will be called with the given size. The alloc_ptr should have been
+ /// allocated with the allocator's allocate method.
+ ///
+ /// If arg_allocator == NULL OR arg_alloc_ptr == NULL OR size == 0
+ /// do nothing
+ template <typename StaticAllocator>
+ AllocationTracker( StaticAllocator const &
+ , void * arg_alloc_ptr
+ , size_t arg_alloc_size
+ , const std::string & arg_label = std::string("") )
+ : m_alloc_rec(0)
+ {
+ AllocatorBase * arg_allocator = Allocator<StaticAllocator>::singleton();
+ initalize( arg_allocator, arg_alloc_ptr, arg_alloc_size, arg_label);
+ }
+
+ /// Create an AllocationTracker
+ ///
+ /// Start reference counting the alloc_ptr.
+ /// When the reference count reaches 0 the allocator deallocate method
+ /// will be called with the given size. The alloc_ptr should have been
+ /// allocated with the allocator's allocate method.
+ ///
+ /// If arg_allocator == NULL OR arg_alloc_ptr == NULL OR size == 0
+ /// do nothing
+ template <typename StaticAllocator>
+ AllocationTracker( StaticAllocator const &
+ , size_t arg_alloc_size
+ , const std::string & arg_label = std::string("")
+ )
+ : m_alloc_rec(0)
+ {
+ AllocatorBase * arg_allocator = Allocator<StaticAllocator>::singleton();
+ void * arg_alloc_ptr = arg_allocator->allocate( arg_alloc_size );
+
+ initalize( arg_allocator, arg_alloc_ptr, arg_alloc_size, arg_label);
+ }
+
+ /// Copy an AllocationTracker
+ KOKKOS_INLINE_FUNCTION
+ AllocationTracker( const AllocationTracker & rhs )
+ : m_alloc_rec( rhs.m_alloc_rec)
+ {
+#if !defined( __CUDA_ARCH__ )
+ if ( rhs.ref_counting() && tracking_enabled() ) {
+ increment_ref_count();
+ }
+ else {
+ m_alloc_rec = m_alloc_rec & REF_COUNT_MASK;
+ }
+#else
+ m_alloc_rec = m_alloc_rec & REF_COUNT_MASK;
+#endif
+ }
+
+ /// Copy-assign an AllocationTracker
+ /// Decrement the reference count of the current tracker if necessary
+ KOKKOS_INLINE_FUNCTION
+ AllocationTracker & operator=( const AllocationTracker & rhs )
+ {
+ if (this != &rhs) {
+#if !defined( __CUDA_ARCH__ )
+ if ( ref_counting() ) {
+ decrement_ref_count();
+ }
+
+ m_alloc_rec = rhs.m_alloc_rec;
+
+ if ( rhs.ref_counting() && tracking_enabled() ) {
+ increment_ref_count();
+ }
+ else {
+ m_alloc_rec = m_alloc_rec & REF_COUNT_MASK;
+ }
+#else
+ m_alloc_rec = rhs.m_alloc_rec & REF_COUNT_MASK;
+#endif
+ }
+
+ return * this;
+ }
+
+ /// Destructor
+ /// Decrement the reference count if necessary
+ KOKKOS_INLINE_FUNCTION
+ ~AllocationTracker()
+ {
+#if !defined( __CUDA_ARCH__ )
+ if ( ref_counting() ) {
+ decrement_ref_count();
+ }
+#endif
+ }
+
+ /// Is the tracker valid?
+ KOKKOS_INLINE_FUNCTION
+ bool is_valid() const
+ {
+ return (m_alloc_rec & REF_COUNT_MASK);
+ }
+
+
+
+ /// clear the tracker
+ KOKKOS_INLINE_FUNCTION
+ void clear()
+ {
+#if !defined( __CUDA_ARCH__ )
+ if ( ref_counting() ) {
+ decrement_ref_count();
+ }
+#endif
+ m_alloc_rec = 0;
+ }
+
+ /// is this tracker currently counting allocations?
+ KOKKOS_INLINE_FUNCTION
+ bool ref_counting() const
+ {
+ return (m_alloc_rec & REF_COUNT_BIT);
+ }
+
+ AllocatorBase * allocator() const;
+
+ /// pointer to the allocated memory
+ void * alloc_ptr() const;
+
+ /// size in bytes of the allocated memory
+ size_t alloc_size() const;
+
+ /// the current reference count
+ size_t ref_count() const;
+
+ /// the label given to the allocation
+ char const * label() const;
+
+ /// pretty print all the tracker's information to the std::ostream
+ void print( std::ostream & oss) const;
+
+
+ /// set an attribute ptr on the allocation record
+ /// the arg_attribute pointer will be deleted when the record is destroyed
+ /// the attribute ptr can only be set once
+ bool set_attribute( AllocatorAttributeBase * arg_attribute) const;
+
+ /// get the attribute ptr from the allocation record
+ AllocatorAttributeBase * attribute() const;
+
+
+ /// reallocate the memory tracked by this allocation
+ /// NOT thread-safe
+ void reallocate( size_t size ) const;
+
+private:
+
+ static AllocationTracker find( void const * ptr, AllocatorBase const * arg_allocator );
+
+ void initalize( AllocatorBase * arg_allocator
+ , void * arg_alloc_ptr
+ , size_t arg_alloc_size
+ , std::string const & label );
+
+ void increment_ref_count() const;
+ void decrement_ref_count() const;
+
+ static void disable_tracking();
+ static void enable_tracking();
+ static bool tracking_enabled();
+
+ friend struct Impl::CopyWithoutTracking;
+ friend struct Impl::MallocHelper;
+
+ uintptr_t m_alloc_rec;
+};
+
+
+
+/// Make a copy of the functor with reference counting disabled
+struct CopyWithoutTracking
+{
+ template <typename Functor>
+ static Functor apply( const Functor & f )
+ {
+ AllocationTracker::disable_tracking();
+ Functor func(f);
+ AllocationTracker::enable_tracking();
+ return func;
+ }
+};
+
+}} // namespace Kokkos::Impl
+
+#endif //KOKKOS_ALLOCATION_TRACKER_HPP
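
AllocationTracker packs the AllocationRecord pointer and the "reference counting enabled" flag into a single uintptr_t: records are at least pointer-aligned, so bit 0 of the address is always free (REF_COUNT_BIT / REF_COUNT_MASK above). A small stand-alone sketch of that tag-bit scheme; Record here is a placeholder, not the real AllocationRecord:

#include <cassert>
#include <cstdint>

enum { REF_COUNT_BIT = 1 };
static const uintptr_t REF_COUNT_MASK = ~static_cast<uintptr_t>(1);

struct Record { int dummy; };              // any type with alignment > 1

int main() {
  Record rec;
  // store the pointer together with the flag in the low bit
  const uintptr_t tagged = reinterpret_cast<uintptr_t>(&rec) | REF_COUNT_BIT;

  const bool ref_counting = tagged & REF_COUNT_BIT;                          // flag
  Record * const ptr = reinterpret_cast<Record *>(tagged & REF_COUNT_MASK);  // pointer

  assert(ref_counting && ptr == &rec);
  return 0;
}
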
diff --git a/lib/kokkos/core/src/impl/Kokkos_AnalyzeShape.hpp b/lib/kokkos/core/src/impl/Kokkos_AnalyzeShape.hpp
index b2330248c..2de9df008 100755
--- a/lib/kokkos/core/src/impl/Kokkos_AnalyzeShape.hpp
+++ b/lib/kokkos/core/src/impl/Kokkos_AnalyzeShape.hpp
@@ -1,260 +1,260 @@
/*
//@HEADER
// ************************************************************************
//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_ANALYZESHAPE_HPP
#define KOKKOS_ANALYZESHAPE_HPP
#include <impl/Kokkos_Shape.hpp>
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
//----------------------------------------------------------------------------
/** \brief Analyze the array shape defined by a Kokkos::View data type.
*
* It is presumed that the data type can be mapped down to a multidimensional
* array of an intrinsic scalar numerical type (double, float, int, ... ).
* The 'value_type' of an array may be an embedded aggregate type such
* as a fixed length array 'Array<T,N>'.
* In this case the 'array_intrinsic_type' represents the
* underlying array of intrinsic scalar numerical type.
*
* The embedded aggregate type must have an AnalyzeShape specialization
* to map it down to a shape and intrinsic scalar numerical type.
*/
template< class T >
struct AnalyzeShape : public Shape< sizeof(T) , 0 >
{
typedef void specialize ;
typedef Shape< sizeof(T), 0 > shape ;
typedef T array_intrinsic_type ;
typedef T value_type ;
typedef T type ;
typedef const T const_array_intrinsic_type ;
typedef const T const_value_type ;
typedef const T const_type ;
typedef T non_const_array_intrinsic_type ;
typedef T non_const_value_type ;
typedef T non_const_type ;
};
template<>
struct AnalyzeShape<void> : public Shape< 0 , 0 >
{
typedef void specialize ;
typedef Shape< 0 , 0 > shape ;
typedef void array_intrinsic_type ;
typedef void value_type ;
typedef void type ;
typedef const void const_array_intrinsic_type ;
typedef const void const_value_type ;
typedef const void const_type ;
typedef void non_const_array_intrinsic_type ;
typedef void non_const_value_type ;
typedef void non_const_type ;
};
template< class T >
struct AnalyzeShape< const T > : public AnalyzeShape<T>::shape
{
private:
typedef AnalyzeShape<T> nested ;
public:
typedef typename nested::specialize specialize ;
typedef typename nested::shape shape ;
typedef typename nested::const_array_intrinsic_type array_intrinsic_type ;
typedef typename nested::const_value_type value_type ;
typedef typename nested::const_type type ;
typedef typename nested::const_array_intrinsic_type const_array_intrinsic_type ;
typedef typename nested::const_value_type const_value_type ;
typedef typename nested::const_type const_type ;
typedef typename nested::non_const_array_intrinsic_type non_const_array_intrinsic_type ;
typedef typename nested::non_const_value_type non_const_value_type ;
typedef typename nested::non_const_type non_const_type ;
};
template< class T >
struct AnalyzeShape< T * >
: public ShapeInsert< typename AnalyzeShape<T>::shape , 0 >::type
{
private:
typedef AnalyzeShape<T> nested ;
public:
typedef typename nested::specialize specialize ;
typedef typename ShapeInsert< typename nested::shape , 0 >::type shape ;
typedef typename nested::array_intrinsic_type * array_intrinsic_type ;
typedef typename nested::value_type value_type ;
typedef typename nested::type * type ;
typedef typename nested::const_array_intrinsic_type * const_array_intrinsic_type ;
typedef typename nested::const_value_type const_value_type ;
typedef typename nested::const_type * const_type ;
typedef typename nested::non_const_array_intrinsic_type * non_const_array_intrinsic_type ;
typedef typename nested::non_const_value_type non_const_value_type ;
typedef typename nested::non_const_type * non_const_type ;
};
template< class T >
struct AnalyzeShape< T[] >
: public ShapeInsert< typename AnalyzeShape<T>::shape , 0 >::type
{
private:
typedef AnalyzeShape<T> nested ;
public:
typedef typename nested::specialize specialize ;
typedef typename ShapeInsert< typename nested::shape , 0 >::type shape ;
typedef typename nested::array_intrinsic_type array_intrinsic_type [] ;
typedef typename nested::value_type value_type ;
typedef typename nested::type type [] ;
typedef typename nested::const_array_intrinsic_type const_array_intrinsic_type [] ;
typedef typename nested::const_value_type const_value_type ;
typedef typename nested::const_type const_type [] ;
typedef typename nested::non_const_array_intrinsic_type non_const_array_intrinsic_type [] ;
typedef typename nested::non_const_value_type non_const_value_type ;
typedef typename nested::non_const_type non_const_type [] ;
};
template< class T >
struct AnalyzeShape< const T[] >
: public ShapeInsert< typename AnalyzeShape< const T >::shape , 0 >::type
{
private:
typedef AnalyzeShape< const T > nested ;
public:
typedef typename nested::specialize specialize ;
typedef typename ShapeInsert< typename nested::shape , 0 >::type shape ;
typedef typename nested::array_intrinsic_type array_intrinsic_type [] ;
typedef typename nested::value_type value_type ;
typedef typename nested::type type [] ;
typedef typename nested::const_array_intrinsic_type const_array_intrinsic_type [] ;
typedef typename nested::const_value_type const_value_type ;
typedef typename nested::const_type const_type [] ;
typedef typename nested::non_const_array_intrinsic_type non_const_array_intrinsic_type [] ;
typedef typename nested::non_const_value_type non_const_value_type ;
typedef typename nested::non_const_type non_const_type [] ;
};
template< class T , unsigned N >
struct AnalyzeShape< T[N] >
: public ShapeInsert< typename AnalyzeShape<T>::shape , N >::type
{
private:
typedef AnalyzeShape<T> nested ;
public:
typedef typename nested::specialize specialize ;
typedef typename ShapeInsert< typename nested::shape , N >::type shape ;
typedef typename nested::array_intrinsic_type array_intrinsic_type [N] ;
typedef typename nested::value_type value_type ;
typedef typename nested::type type [N] ;
typedef typename nested::const_array_intrinsic_type const_array_intrinsic_type [N] ;
typedef typename nested::const_value_type const_value_type ;
typedef typename nested::const_type const_type [N] ;
typedef typename nested::non_const_array_intrinsic_type non_const_array_intrinsic_type [N] ;
typedef typename nested::non_const_value_type non_const_value_type ;
typedef typename nested::non_const_type non_const_type [N] ;
};
template< class T , unsigned N >
struct AnalyzeShape< const T[N] >
: public ShapeInsert< typename AnalyzeShape< const T >::shape , N >::type
{
private:
typedef AnalyzeShape< const T > nested ;
public:
typedef typename nested::specialize specialize ;
typedef typename ShapeInsert< typename nested::shape , N >::type shape ;
typedef typename nested::array_intrinsic_type array_intrinsic_type [N] ;
typedef typename nested::value_type value_type ;
typedef typename nested::type type [N] ;
typedef typename nested::const_array_intrinsic_type const_array_intrinsic_type [N] ;
typedef typename nested::const_value_type const_value_type ;
typedef typename nested::const_type const_type [N] ;
typedef typename nested::non_const_array_intrinsic_type non_const_array_intrinsic_type [N] ;
typedef typename nested::non_const_value_type non_const_value_type ;
typedef typename nested::non_const_type non_const_type [N] ;
};
} // namespace Impl
} // namespace Kokkos
#endif /* #ifndef KOKKOS_ANALYZESHAPE_HPP */
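
AnalyzeShape works by peeling one dimension per '*' (runtime extent) or '[N]' (compile-time extent) off the View data type until only the scalar value_type remains. A stripped-down sketch of that recursion, counting ranks instead of building the Shape / ShapeInsert types; Analyze is an illustrative name, not part of the header:

#include <cstdio>

template< class T >
struct Analyze {                              // terminal case: the scalar itself
  typedef T value_type;
  enum { rank = 0, rank_dynamic = 0 };
};

template< class T >
struct Analyze< T * > {                       // leading '*' adds a dynamic dimension
  typedef typename Analyze<T>::value_type value_type;
  enum { rank = Analyze<T>::rank + 1, rank_dynamic = Analyze<T>::rank_dynamic + 1 };
};

template< class T , unsigned N >
struct Analyze< T[N] > {                      // '[N]' adds a static dimension
  typedef typename Analyze<T>::value_type value_type;
  enum { rank = Analyze<T>::rank + 1, rank_dynamic = Analyze<T>::rank_dynamic };
};

int main() {
  // double*[3] : one runtime dimension, then a compile-time extent of 3
  std::printf("rank=%d rank_dynamic=%d\n",
              int(Analyze< double * [3] >::rank),
              int(Analyze< double * [3] >::rank_dynamic));   // rank=2 rank_dynamic=1
  return 0;
}
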
diff --git a/lib/kokkos/core/src/impl/Kokkos_Atomic_Assembly_X86.hpp b/lib/kokkos/core/src/impl/Kokkos_Atomic_Assembly_X86.hpp
index b1ce1bd44..e9c7a16d5 100755
--- a/lib/kokkos/core/src/impl/Kokkos_Atomic_Assembly_X86.hpp
+++ b/lib/kokkos/core/src/impl/Kokkos_Atomic_Assembly_X86.hpp
@@ -1,176 +1,214 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos
-// Manycore Performance-Portable Multidimensional Arrays
-//
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
#if defined( KOKKOS_ATOMIC_HPP ) && ! defined( KOKKOS_ATOMIC_ASSEMBLY_X86_HPP )
#define KOKKOS_ATOMIC_ASSEMBLY_X86_HPP
namespace Kokkos {
+#ifdef KOKKOS_ENABLE_ASM
#ifndef __CUDA_ARCH__
template<>
KOKKOS_INLINE_FUNCTION
void atomic_increment<char>(volatile char* a) {
__asm__ __volatile__(
"lock incb %0"
: /* no output registers */
: "m" (a[0])
: "memory"
);
}
template<>
KOKKOS_INLINE_FUNCTION
void atomic_increment<short>(volatile short* a) {
__asm__ __volatile__(
"lock incw %0"
: /* no output registers */
: "m" (a[0])
: "memory"
);
}
template<>
KOKKOS_INLINE_FUNCTION
void atomic_increment<int>(volatile int* a) {
__asm__ __volatile__(
"lock incl %0"
: /* no output registers */
: "m" (a[0])
: "memory"
);
}
template<>
KOKKOS_INLINE_FUNCTION
void atomic_increment<long long int>(volatile long long int* a) {
__asm__ __volatile__(
"lock incq %0"
: /* no output registers */
: "m" (a[0])
: "memory"
);
}
template<>
KOKKOS_INLINE_FUNCTION
void atomic_decrement<char>(volatile char* a) {
__asm__ __volatile__(
"lock decb %0"
: /* no output registers */
: "m" (a[0])
: "memory"
);
}
template<>
KOKKOS_INLINE_FUNCTION
void atomic_decrement<short>(volatile short* a) {
__asm__ __volatile__(
"lock decw %0"
: /* no output registers */
: "m" (a[0])
: "memory"
);
}
template<>
KOKKOS_INLINE_FUNCTION
void atomic_decrement<int>(volatile int* a) {
__asm__ __volatile__(
"lock decl %0"
: /* no output registers */
: "m" (a[0])
: "memory"
);
}
template<>
KOKKOS_INLINE_FUNCTION
void atomic_decrement<long long int>(volatile long long int* a) {
__asm__ __volatile__(
"lock decq %0"
: /* no output registers */
: "m" (a[0])
: "memory"
);
}
#endif
+#endif
namespace Impl {
struct cas128_t
{
uint64_t lower;
uint64_t upper;
+
+ KOKKOS_INLINE_FUNCTION
+ cas128_t () {
+ lower = 0;
+ upper = 0;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ cas128_t (const cas128_t& a) {
+ lower = a.lower;
+ upper = a.upper;
+ }
+ KOKKOS_INLINE_FUNCTION
+ cas128_t (volatile cas128_t* a) {
+ lower = a->lower;
+ upper = a->upper;
+ }
+
KOKKOS_INLINE_FUNCTION
bool operator != (const cas128_t& a) const {
return (lower != a.lower) || upper!=a.upper;
}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator = (const cas128_t& a) {
+ lower = a.lower;
+ upper = a.upper;
+ }
+ KOKKOS_INLINE_FUNCTION
+ void operator = (const cas128_t& a) volatile {
+ lower = a.lower;
+ upper = a.upper;
+ }
}
__attribute__ (( __aligned__( 16 ) ));
inline cas128_t cas128( volatile cas128_t * ptr, cas128_t cmp, cas128_t swap )
{
- bool swapped;
+ #ifdef KOKKOS_ENABLE_ASM
+ bool swapped = false;
__asm__ __volatile__
(
"lock cmpxchg16b %1\n\t"
"setz %0"
: "=q" ( swapped )
, "+m" ( *ptr )
, "+d" ( cmp.upper )
, "+a" ( cmp.lower )
: "c" ( swap.upper )
, "b" ( swap.lower )
- : "cc"
+ , "q" ( swapped )
);
- (void) swapped;
return cmp;
+ #else
+ cas128_t tmp(ptr);
+ if(tmp != cmp) {
+ return tmp;
+ } else {
+ *ptr = swap;
+ return swap;
+ }
+ #endif
}
}
}
#endif
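
Whether implemented with cmpxchg16b, the __sync builtins, or atomicCAS, these compare-exchange primitives return the value observed at the destination, and the swap took effect only if that value equals the expected one; callers therefore wrap them in a retry loop. A minimal std::atomic analogue of the push-onto-list-head pattern used elsewhere in this patch (Node and push_front are illustrative names):

#include <atomic>
#include <cstdio>

struct Node { int value; Node * next; };

// Link the new node against a snapshot of the head, then retry the CAS until
// no other thread has changed the head in between.
void push_front( std::atomic<Node *> & head , Node * node )
{
  Node * expected = head.load();
  do {
    node->next = expected;                                     // link against the snapshot
  } while ( ! head.compare_exchange_weak( expected , node ) ); // 'expected' is reloaded on failure
}

int main()
{
  std::atomic<Node *> head( nullptr );
  Node a = { 1, nullptr };
  Node b = { 2, nullptr };
  push_front( head, &a );
  push_front( head, &b );
  for ( Node * n = head.load(); n; n = n->next ) std::printf( "%d ", n->value );
  std::printf( "\n" );                                         // prints: 2 1
  return 0;
}
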
diff --git a/lib/kokkos/core/src/impl/Kokkos_Atomic_Compare_Exchange_Strong.hpp b/lib/kokkos/core/src/impl/Kokkos_Atomic_Compare_Exchange_Strong.hpp
index a1c35d9f9..524cd7327 100755
--- a/lib/kokkos/core/src/impl/Kokkos_Atomic_Compare_Exchange_Strong.hpp
+++ b/lib/kokkos/core/src/impl/Kokkos_Atomic_Compare_Exchange_Strong.hpp
@@ -1,231 +1,259 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
#if defined( KOKKOS_ATOMIC_HPP ) && ! defined( KOKKOS_ATOMIC_COMPARE_EXCHANGE_STRONG_HPP )
#define KOKKOS_ATOMIC_COMPARE_EXCHANGE_STRONG_HPP
namespace Kokkos {
//----------------------------------------------------------------------------
// Cuda native CAS supports int, unsigned int, and unsigned long long int (non-standard type).
// Must cast-away 'volatile' for the CAS call.
#if defined( KOKKOS_ATOMICS_USE_CUDA )
__inline__ __device__
int atomic_compare_exchange( volatile int * const dest, const int compare, const int val)
{ return atomicCAS((int*)dest,compare,val); }
__inline__ __device__
unsigned int atomic_compare_exchange( volatile unsigned int * const dest, const unsigned int compare, const unsigned int val)
{ return atomicCAS((unsigned int*)dest,compare,val); }
__inline__ __device__
unsigned long long int atomic_compare_exchange( volatile unsigned long long int * const dest ,
const unsigned long long int compare ,
const unsigned long long int val )
{ return atomicCAS((unsigned long long int*)dest,compare,val); }
template < typename T >
__inline__ __device__
T atomic_compare_exchange( volatile T * const dest , const T & compare ,
typename Kokkos::Impl::enable_if< sizeof(T) == sizeof(int) , const T & >::type val )
{
const int tmp = atomicCAS( (int*) dest , *((int*)&compare) , *((int*)&val) );
return *((T*)&tmp);
}
template < typename T >
__inline__ __device__
T atomic_compare_exchange( volatile T * const dest , const T & compare ,
typename Kokkos::Impl::enable_if< sizeof(T) != sizeof(int) &&
sizeof(T) == sizeof(unsigned long long int) , const T & >::type val )
{
typedef unsigned long long int type ;
const type tmp = atomicCAS( (type*) dest , *((type*)&compare) , *((type*)&val) );
return *((T*)&tmp);
}
template < typename T >
__inline__ __device__
T atomic_compare_exchange( volatile T * const dest , const T & compare ,
- typename Kokkos::Impl::enable_if< sizeof(T) != sizeof(int) &&
- sizeof(T) != sizeof(unsigned long long int) &&
- sizeof(T) == sizeof(Impl::cas128_t) , const T & >::type val )
+ typename ::Kokkos::Impl::enable_if<
+ ( sizeof(T) != 4 )
+ && ( sizeof(T) != 8 )
+ , const T >::type& val )
{
- Kokkos::abort("Error: calling atomic_compare_exchange with 128bit type is not supported on CUDA execution space.");
- return T();
+ T return_val;
+ // This is a way to (hopefully) avoid deadlock in a warp
+ bool done = false;
+ while (! done ) {
+ if( Impl::lock_address_cuda_space( (void*) dest ) ) {
+ return_val = *dest;
+ if( return_val == compare )
+ *dest = val;
+ Impl::unlock_address_cuda_space( (void*) dest );
+ done = true;
+ }
+ }
+ return return_val;
}
//----------------------------------------------------------------------------
// GCC native CAS supports int, long, unsigned int, unsigned long.
// Intel native CAS support int and long with the same interface as GCC.
#elif defined(KOKKOS_ATOMICS_USE_GCC) || defined(KOKKOS_ATOMICS_USE_INTEL)
KOKKOS_INLINE_FUNCTION
int atomic_compare_exchange( volatile int * const dest, const int compare, const int val)
{ return __sync_val_compare_and_swap(dest,compare,val); }
KOKKOS_INLINE_FUNCTION
long atomic_compare_exchange( volatile long * const dest, const long compare, const long val )
{ return __sync_val_compare_and_swap(dest,compare,val); }
#if defined( KOKKOS_ATOMICS_USE_GCC )
// GCC supports unsigned
KOKKOS_INLINE_FUNCTION
unsigned int atomic_compare_exchange( volatile unsigned int * const dest, const unsigned int compare, const unsigned int val )
{ return __sync_val_compare_and_swap(dest,compare,val); }
KOKKOS_INLINE_FUNCTION
unsigned long atomic_compare_exchange( volatile unsigned long * const dest ,
const unsigned long compare ,
const unsigned long val )
{ return __sync_val_compare_and_swap(dest,compare,val); }
#endif
template < typename T >
KOKKOS_INLINE_FUNCTION
T atomic_compare_exchange( volatile T * const dest, const T & compare,
typename Kokkos::Impl::enable_if< sizeof(T) == sizeof(int) , const T & >::type val )
{
#ifdef KOKKOS_HAVE_CXX11
union U {
int i ;
T t ;
KOKKOS_INLINE_FUNCTION U() {};
} tmp ;
#else
union U {
int i ;
T t ;
} tmp ;
#endif
tmp.i = __sync_val_compare_and_swap( (int*) dest , *((int*)&compare) , *((int*)&val) );
return tmp.t ;
}
template < typename T >
KOKKOS_INLINE_FUNCTION
T atomic_compare_exchange( volatile T * const dest, const T & compare,
typename Kokkos::Impl::enable_if< sizeof(T) != sizeof(int) &&
sizeof(T) == sizeof(long) , const T & >::type val )
{
#ifdef KOKKOS_HAVE_CXX11
union U {
long i ;
T t ;
KOKKOS_INLINE_FUNCTION U() {};
} tmp ;
#else
union U {
long i ;
T t ;
} tmp ;
#endif
tmp.i = __sync_val_compare_and_swap( (long*) dest , *((long*)&compare) , *((long*)&val) );
return tmp.t ;
}
+#ifdef KOKKOS_ENABLE_ASM
template < typename T >
KOKKOS_INLINE_FUNCTION
T atomic_compare_exchange( volatile T * const dest, const T & compare,
typename Kokkos::Impl::enable_if< sizeof(T) != sizeof(int) &&
sizeof(T) != sizeof(long) &&
sizeof(T) == sizeof(Impl::cas128_t), const T & >::type val )
{
-#ifdef KOKKOS_HAVE_CXX11
union U {
Impl::cas128_t i ;
T t ;
KOKKOS_INLINE_FUNCTION U() {};
} tmp ;
-#else
- union U {
- Impl::cas128_t i ;
- T t ;
- } tmp ;
-#endif
tmp.i = Impl::cas128( (Impl::cas128_t*) dest , *((Impl::cas128_t*)&compare) , *((Impl::cas128_t*)&val) );
return tmp.t ;
}
+#endif
+
+template < typename T >
+inline
+T atomic_compare_exchange( volatile T * const dest , const T compare ,
+ typename ::Kokkos::Impl::enable_if<
+ ( sizeof(T) != 4 )
+ && ( sizeof(T) != 8 )
+ #if defined(KOKKOS_ENABLE_ASM)
+ && ( sizeof(T) != 16 )
+ #endif
+ , const T >::type& val )
+{
+ while( !Impl::lock_address_host_space( (void*) dest ) );
+ T return_val = *dest;
+ if( return_val == compare ) {
+ const T tmp = *dest = val;
+ #ifndef KOKKOS_COMPILER_CLANG
+ (void) tmp;
+ #endif
+ }
+ Impl::unlock_address_host_space( (void*) dest );
+ return return_val;
+}
//----------------------------------------------------------------------------
#elif defined( KOKKOS_ATOMICS_USE_OMP31 )
template< typename T >
KOKKOS_INLINE_FUNCTION
T atomic_compare_exchange( volatile T * const dest, const T compare, const T val )
{
T retval;
#pragma omp critical
{
retval = dest[0];
if ( retval == compare )
dest[0] = val;
}
return retval;
}
#endif
-
template <typename T>
KOKKOS_INLINE_FUNCTION
bool atomic_compare_exchange_strong(volatile T* const dest, const T compare, const T val)
{
return compare == atomic_compare_exchange(dest, compare, val);
}
//----------------------------------------------------------------------------
} // namespace Kokkos
#endif
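
The value-returning atomic_compare_exchange above is the building block for retry loops: read the current value, compute an update, attempt the swap, and try again if another thread got there first. The following standalone sketch -- illustrative only, not part of the patch -- builds an "atomic max" this way on the same GCC builtin the header uses for the int/long cases.

#include <cstdio>

long atomic_max( volatile long * const dest , const long val )
{
  long old = *dest;
  while ( old < val ) {
    // returns the value seen in *dest; equal to 'old' means our swap happened
    const long seen = __sync_val_compare_and_swap( dest , old , val );
    if ( seen == old ) break;   // we installed 'val'
    old = seen;                 // another thread won; retry against its value
  }
  return old;
}

int main()
{
  volatile long x = 5;
  atomic_max( &x , 9 );
  atomic_max( &x , 3 );         // no effect, 9 is already larger
  std::printf( "x = %ld\n" , (long) x );
  return 0;
}
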
diff --git a/lib/kokkos/core/src/impl/Kokkos_Atomic_Exchange.hpp b/lib/kokkos/core/src/impl/Kokkos_Atomic_Exchange.hpp
index b39adf2a9..1bdbdbc7f 100755
--- a/lib/kokkos/core/src/impl/Kokkos_Atomic_Exchange.hpp
+++ b/lib/kokkos/core/src/impl/Kokkos_Atomic_Exchange.hpp
@@ -1,305 +1,340 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
#if defined( KOKKOS_ATOMIC_HPP ) && ! defined( KOKKOS_ATOMIC_EXCHANGE_HPP )
#define KOKKOS_ATOMIC_EXCHANGE_HPP
namespace Kokkos {
//----------------------------------------------------------------------------
#if defined( KOKKOS_ATOMICS_USE_CUDA )
__inline__ __device__
int atomic_exchange( volatile int * const dest , const int val )
{
// return __iAtomicExch( (int*) dest , val );
return atomicExch( (int*) dest , val );
}
__inline__ __device__
unsigned int atomic_exchange( volatile unsigned int * const dest , const unsigned int val )
{
// return __uAtomicExch( (unsigned int*) dest , val );
return atomicExch( (unsigned int*) dest , val );
}
__inline__ __device__
unsigned long long int atomic_exchange( volatile unsigned long long int * const dest , const unsigned long long int val )
{
// return __ullAtomicExch( (unsigned long long*) dest , val );
return atomicExch( (unsigned long long*) dest , val );
}
/** \brief Atomic exchange for any type with compatible size */
template< typename T >
__inline__ __device__
T atomic_exchange(
volatile T * const dest ,
typename Kokkos::Impl::enable_if< sizeof(T) == sizeof(int) , const T & >::type val )
{
// int tmp = __ullAtomicExch( (int*) dest , *((int*)&val) );
int tmp = atomicExch( ((int*)dest) , *((int*)&val) );
return *((T*)&tmp);
}
template< typename T >
__inline__ __device__
T atomic_exchange(
volatile T * const dest ,
typename Kokkos::Impl::enable_if< sizeof(T) != sizeof(int) &&
sizeof(T) == sizeof(unsigned long long int) , const T & >::type val )
{
typedef unsigned long long int type ;
// type tmp = __ullAtomicExch( (type*) dest , *((type*)&val) );
type tmp = atomicExch( ((type*)dest) , *((type*)&val) );
return *((T*)&tmp);
}
-template< typename T >
+template < typename T >
__inline__ __device__
-T atomic_exchange(
- volatile T * const dest ,
- typename Kokkos::Impl::enable_if< sizeof(T) != sizeof(int) &&
- sizeof(T) != sizeof(unsigned long long int) &&
- sizeof(T) == sizeof(Impl::cas128_t) , const T & >::type val )
+T atomic_exchange( volatile T * const dest ,
+ typename ::Kokkos::Impl::enable_if<
+ ( sizeof(T) != 4 )
+ && ( sizeof(T) != 8 )
+ , const T >::type& val )
{
- Kokkos::abort("Error: calling atomic_exchange with 128bit type is not supported on CUDA execution space.");
- return T();
+ T return_val;
+ // This is a way to (hopefully) avoid deadlock in a warp
+ bool done = false;
+ while (! done ) {
+ if( Impl::lock_address_cuda_space( (void*) dest ) ) {
+ return_val = *dest;
+ *dest = val;
+ Impl::unlock_address_cuda_space( (void*) dest );
+ done = true;
+ }
+ }
+ return return_val;
}
-
/** \brief Atomic exchange for any type with compatible size */
template< typename T >
__inline__ __device__
void atomic_assign(
volatile T * const dest ,
typename Kokkos::Impl::enable_if< sizeof(T) == sizeof(int) , const T & >::type val )
{
// (void) __ullAtomicExch( (int*) dest , *((int*)&val) );
(void) atomicExch( ((int*)dest) , *((int*)&val) );
}
template< typename T >
__inline__ __device__
void atomic_assign(
volatile T * const dest ,
typename Kokkos::Impl::enable_if< sizeof(T) != sizeof(int) &&
sizeof(T) == sizeof(unsigned long long int) , const T & >::type val )
{
typedef unsigned long long int type ;
// (void) __ullAtomicExch( (type*) dest , *((type*)&val) );
(void) atomicExch( ((type*)dest) , *((type*)&val) );
}
template< typename T >
__inline__ __device__
void atomic_assign(
volatile T * const dest ,
typename Kokkos::Impl::enable_if< sizeof(T) != sizeof(int) &&
- sizeof(T) != sizeof(unsigned long long int) &&
- sizeof(T) == sizeof(Impl::cas128_t) , const T & >::type val )
+ sizeof(T) != sizeof(unsigned long long int)
+ , const T & >::type val )
{
- Kokkos::abort("Error: calling atomic_assign with 128bit type is not supported on CUDA execution space.");
+ (void) atomic_exchange(dest,val);
}
//----------------------------------------------------------------------------
#elif defined(KOKKOS_ATOMICS_USE_GCC) || defined(KOKKOS_ATOMICS_USE_INTEL)
template< typename T >
KOKKOS_INLINE_FUNCTION
T atomic_exchange( volatile T * const dest ,
typename Kokkos::Impl::enable_if< sizeof(T) == sizeof(int) || sizeof(T) == sizeof(long)
, const T & >::type val )
{
typedef typename Kokkos::Impl::if_c< sizeof(T) == sizeof(int) , int , long >::type type ;
const type v = *((type*)&val); // Extract to be sure the value doesn't change
type assumed ;
#ifdef KOKKOS_HAVE_CXX11
union U {
T val_T ;
type val_type ;
KOKKOS_INLINE_FUNCTION U() {};
} old ;
#else
union { T val_T ; type val_type ; } old ;
#endif
old.val_T = *dest ;
do {
assumed = old.val_type ;
old.val_type = __sync_val_compare_and_swap( (volatile type *) dest , assumed , v );
} while ( assumed != old.val_type );
return old.val_T ;
}
+#if defined(KOKKOS_ENABLE_ASM)
template< typename T >
KOKKOS_INLINE_FUNCTION
T atomic_exchange( volatile T * const dest ,
typename Kokkos::Impl::enable_if< sizeof(T) == sizeof(Impl::cas128_t)
, const T & >::type val )
{
-#ifdef KOKKOS_HAVE_CXX11
union U {
Impl::cas128_t i ;
T t ;
KOKKOS_INLINE_FUNCTION U() {};
} assume , oldval , newval ;
-#else
- union U {
- Impl::cas128_t i ;
- T t ;
- } assume , oldval , newval ;
-#endif
oldval.t = *dest ;
newval.t = val;
do {
assume.i = oldval.i ;
oldval.i = Impl::cas128( (volatile Impl::cas128_t*) dest , assume.i , newval.i );
} while ( assume.i != oldval.i );
return oldval.t ;
}
+#endif
+
+//----------------------------------------------------------------------------
+
+template < typename T >
+inline
+T atomic_exchange( volatile T * const dest ,
+ typename ::Kokkos::Impl::enable_if<
+ ( sizeof(T) != 4 )
+ && ( sizeof(T) != 8 )
+ #if defined(KOKKOS_ENABLE_ASM)
+ && ( sizeof(T) != 16 )
+ #endif
+ , const T >::type& val )
+{
+ while( !Impl::lock_address_host_space( (void*) dest ) );
+ T return_val = *dest;
+ const T tmp = *dest = val;
+ #ifndef KOKKOS_COMPILER_CLANG
+ (void) tmp;
+ #endif
+ Impl::unlock_address_host_space( (void*) dest );
+ return return_val;
+}
template< typename T >
KOKKOS_INLINE_FUNCTION
void atomic_assign( volatile T * const dest ,
typename Kokkos::Impl::enable_if< sizeof(T) == sizeof(int) || sizeof(T) == sizeof(long)
, const T & >::type val )
{
typedef typename Kokkos::Impl::if_c< sizeof(T) == sizeof(int) , int , long >::type type ;
const type v = *((type*)&val); // Extract to be sure the value doesn't change
type assumed ;
#ifdef KOKKOS_HAVE_CXX11
union U {
T val_T ;
type val_type ;
KOKKOS_INLINE_FUNCTION U() {};
} old ;
#else
union { T val_T ; type val_type ; } old ;
#endif
old.val_T = *dest ;
do {
assumed = old.val_type ;
old.val_type = __sync_val_compare_and_swap( (volatile type *) dest , assumed , v );
} while ( assumed != old.val_type );
}
+#ifdef KOKKOS_ENABLE_ASM
template< typename T >
KOKKOS_INLINE_FUNCTION
void atomic_assign( volatile T * const dest ,
typename Kokkos::Impl::enable_if< sizeof(T) == sizeof(Impl::cas128_t)
, const T & >::type val )
{
-#ifdef KOKKOS_HAVE_CXX11
union U {
Impl::cas128_t i ;
T t ;
KOKKOS_INLINE_FUNCTION U() {};
} assume , oldval , newval ;
-#else
- union U {
- Impl::cas128_t i ;
- T t ;
- } assume , oldval , newval ;
-#endif
oldval.t = *dest ;
newval.t = val;
do {
assume.i = oldval.i ;
oldval.i = Impl::cas128( (volatile Impl::cas128_t*) dest , assume.i , newval.i);
} while ( assume.i != oldval.i );
}
+#endif
+
+template < typename T >
+inline
+void atomic_assign( volatile T * const dest ,
+ typename ::Kokkos::Impl::enable_if<
+ ( sizeof(T) != 4 )
+ && ( sizeof(T) != 8 )
+ #if defined(KOKKOS_ENABLE_ASM)
+ && ( sizeof(T) != 16 )
+ #endif
+ , const T >::type& val )
+{
+ while( !Impl::lock_address_host_space( (void*) dest ) );
+ *dest = val;
+ Impl::unlock_address_host_space( (void*) dest );
+}
//----------------------------------------------------------------------------
#elif defined( KOKKOS_ATOMICS_USE_OMP31 )
template < typename T >
KOKKOS_INLINE_FUNCTION
T atomic_exchange( volatile T * const dest , const T val )
{
T retval;
//#pragma omp atomic capture
#pragma omp critical
{
retval = dest[0];
dest[0] = val;
}
return retval;
}
template < typename T >
KOKKOS_INLINE_FUNCTION
void atomic_assign( volatile T * const dest , const T val )
{
//#pragma omp atomic
#pragma omp critical
{
dest[0] = val;
}
}
#endif
-//----------------------------------------------------------------------------
-
} // namespace Kokkos
#endif
//----------------------------------------------------------------------------
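
For types whose size matches no native CAS width (neither 4, 8, nor -- with KOKKOS_ENABLE_ASM -- 16 bytes), the exchange/assign paths above fall back to a per-address lock: spin until the lock guarding dest is taken, do the read-modify-write, then release. lock_address_host_space/unlock_address_host_space are Kokkos internals; the standalone sketch below only approximates them with a small table of address-hashed spinlocks, purely to show the shape of that fallback.

#include <atomic>
#include <cstdint>
#include <cstdio>

static std::atomic<bool> g_locks[256];   // stand-in for Kokkos' internal lock table

inline bool try_lock_address( const void * ptr )
{ return ! g_locks[ ( reinterpret_cast<std::uintptr_t>(ptr) >> 4 ) & 255 ].exchange( true , std::memory_order_acquire ); }

inline void unlock_address( const void * ptr )
{ g_locks[ ( reinterpret_cast<std::uintptr_t>(ptr) >> 4 ) & 255 ].store( false , std::memory_order_release ); }

struct big_t { double a , b , c ; };     // 24 bytes: no native CAS width

big_t atomic_exchange_fallback( volatile big_t * const dest , const big_t val )
{
  while ( ! try_lock_address( (const void *) dest ) );   // spin, as the header does
  big_t * const p = const_cast<big_t *>( dest );
  const big_t return_val = *p;
  *p = val;
  unlock_address( (const void *) dest );
  return return_val;
}

int main()
{
  big_t x = { 1.0 , 2.0 , 3.0 };
  const big_t old = atomic_exchange_fallback( &x , big_t{ 4.0 , 5.0 , 6.0 } );
  std::printf( "old.a = %g  new.a = %g\n" , old.a , x.a );
  return 0;
}
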
diff --git a/lib/kokkos/core/src/impl/Kokkos_Atomic_Fetch_Add.hpp b/lib/kokkos/core/src/impl/Kokkos_Atomic_Fetch_Add.hpp
index ce8b9a093..b06a5b424 100755
--- a/lib/kokkos/core/src/impl/Kokkos_Atomic_Fetch_Add.hpp
+++ b/lib/kokkos/core/src/impl/Kokkos_Atomic_Fetch_Add.hpp
@@ -1,297 +1,326 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
#if defined( KOKKOS_ATOMIC_HPP ) && ! defined( KOKKOS_ATOMIC_FETCH_ADD_HPP )
#define KOKKOS_ATOMIC_FETCH_ADD_HPP
namespace Kokkos {
//----------------------------------------------------------------------------
#if defined( KOKKOS_ATOMICS_USE_CUDA )
// Support for int, unsigned int, unsigned long long int, and float
__inline__ __device__
int atomic_fetch_add( volatile int * const dest , const int val )
{ return atomicAdd((int*)dest,val); }
__inline__ __device__
unsigned int atomic_fetch_add( volatile unsigned int * const dest , const unsigned int val )
{ return atomicAdd((unsigned int*)dest,val); }
__inline__ __device__
unsigned long long int atomic_fetch_add( volatile unsigned long long int * const dest ,
const unsigned long long int val )
{ return atomicAdd((unsigned long long int*)dest,val); }
__inline__ __device__
float atomic_fetch_add( volatile float * const dest , const float val )
{ return atomicAdd((float*)dest,val); }
template < typename T >
__inline__ __device__
T atomic_fetch_add( volatile T * const dest ,
typename Kokkos::Impl::enable_if< sizeof(T) == sizeof(int) , const T >::type val )
{
#ifdef KOKKOS_HAVE_CXX11
union U {
int i ;
T t ;
KOKKOS_INLINE_FUNCTION U() {};
} assume , oldval , newval ;
#else
union U {
int i ;
T t ;
} assume , oldval , newval ;
#endif
oldval.t = *dest ;
do {
assume.i = oldval.i ;
newval.t = assume.t + val ;
oldval.i = atomicCAS( (int*)dest , assume.i , newval.i );
} while ( assume.i != oldval.i );
return oldval.t ;
}
template < typename T >
__inline__ __device__
T atomic_fetch_add( volatile T * const dest ,
typename Kokkos::Impl::enable_if< sizeof(T) != sizeof(int) &&
sizeof(T) == sizeof(unsigned long long int) , const T >::type val )
{
#ifdef KOKKOS_HAVE_CXX11
union U {
unsigned long long int i ;
T t ;
KOKKOS_INLINE_FUNCTION U() {};
} assume , oldval , newval ;
#else
union U {
unsigned long long int i ;
T t ;
} assume , oldval , newval ;
#endif
oldval.t = *dest ;
do {
assume.i = oldval.i ;
newval.t = assume.t + val ;
oldval.i = atomicCAS( (unsigned long long int*)dest , assume.i , newval.i );
} while ( assume.i != oldval.i );
return oldval.t ;
}
+//----------------------------------------------------------------------------
+
template < typename T >
__inline__ __device__
T atomic_fetch_add( volatile T * const dest ,
- typename Kokkos::Impl::enable_if< sizeof(T) != sizeof(int) &&
- sizeof(T) != sizeof(unsigned long long int) &&
- sizeof(T) == sizeof(Impl::cas128_t), const T >::type val )
+ typename ::Kokkos::Impl::enable_if<
+ ( sizeof(T) != 4 )
+ && ( sizeof(T) != 8 )
+ , const T >::type& val )
{
- Kokkos::abort("Error: calling atomic_fetch_add with 128bit type is not supported on CUDA execution space.");
- return T();
+ T return_val;
+ // This is a way to (hopefully) avoid deadlock in a warp
+ bool done = false;
+ while (! done ) {
+ if( Impl::lock_address_cuda_space( (void*) dest ) ) {
+ return_val = *dest;
+ *dest = return_val + val;
+ Impl::unlock_address_cuda_space( (void*) dest );
+ done = true;
+ }
+ }
+ return return_val;
}
-
//----------------------------------------------------------------------------
#elif defined(KOKKOS_ATOMICS_USE_GCC) || defined(KOKKOS_ATOMICS_USE_INTEL)
KOKKOS_INLINE_FUNCTION
int atomic_fetch_add( volatile int * const dest , const int val )
{ return __sync_fetch_and_add(dest,val); }
KOKKOS_INLINE_FUNCTION
long int atomic_fetch_add( volatile long int * const dest , const long int val )
{ return __sync_fetch_and_add(dest,val); }
#if defined( KOKKOS_ATOMICS_USE_GCC )
KOKKOS_INLINE_FUNCTION
unsigned int atomic_fetch_add( volatile unsigned int * const dest , const unsigned int val )
{ return __sync_fetch_and_add(dest,val); }
KOKKOS_INLINE_FUNCTION
unsigned long int atomic_fetch_add( volatile unsigned long int * const dest , const unsigned long int val )
{ return __sync_fetch_and_add(dest,val); }
#endif
template < typename T >
KOKKOS_INLINE_FUNCTION
T atomic_fetch_add( volatile T * const dest ,
typename Kokkos::Impl::enable_if< sizeof(T) == sizeof(int) , const T >::type val )
{
#ifdef KOKKOS_HAVE_CXX11
union U {
int i ;
T t ;
KOKKOS_INLINE_FUNCTION U() {};
} assume , oldval , newval ;
#else
union U {
int i ;
T t ;
} assume , oldval , newval ;
#endif
oldval.t = *dest ;
do {
assume.i = oldval.i ;
newval.t = assume.t + val ;
oldval.i = __sync_val_compare_and_swap( (int*) dest , assume.i , newval.i );
} while ( assume.i != oldval.i );
return oldval.t ;
}
template < typename T >
KOKKOS_INLINE_FUNCTION
T atomic_fetch_add( volatile T * const dest ,
typename Kokkos::Impl::enable_if< sizeof(T) != sizeof(int) &&
sizeof(T) == sizeof(long) , const T >::type val )
{
#ifdef KOKKOS_HAVE_CXX11
union U {
long i ;
T t ;
KOKKOS_INLINE_FUNCTION U() {};
} assume , oldval , newval ;
#else
union U {
long i ;
T t ;
} assume , oldval , newval ;
#endif
oldval.t = *dest ;
do {
assume.i = oldval.i ;
newval.t = assume.t + val ;
oldval.i = __sync_val_compare_and_swap( (long*) dest , assume.i , newval.i );
} while ( assume.i != oldval.i );
return oldval.t ;
}
+#ifdef KOKKOS_ENABLE_ASM
template < typename T >
KOKKOS_INLINE_FUNCTION
T atomic_fetch_add( volatile T * const dest ,
typename Kokkos::Impl::enable_if< sizeof(T) != sizeof(int) &&
sizeof(T) != sizeof(long) &&
sizeof(T) == sizeof(Impl::cas128_t) , const T >::type val )
{
-#ifdef KOKKOS_HAVE_CXX11
union U {
Impl::cas128_t i ;
T t ;
KOKKOS_INLINE_FUNCTION U() {};
} assume , oldval , newval ;
-#else
- union U {
- Impl::cas128_t i ;
- T t ;
- } assume , oldval , newval ;
-#endif
oldval.t = *dest ;
do {
assume.i = oldval.i ;
newval.t = assume.t + val ;
oldval.i = Impl::cas128( (volatile Impl::cas128_t*) dest , assume.i , newval.i );
} while ( assume.i != oldval.i );
return oldval.t ;
}
+#endif
+
+//----------------------------------------------------------------------------
+
+template < typename T >
+inline
+T atomic_fetch_add( volatile T * const dest ,
+ typename ::Kokkos::Impl::enable_if<
+ ( sizeof(T) != 4 )
+ && ( sizeof(T) != 8 )
+ #if defined(KOKKOS_ENABLE_ASM)
+ && ( sizeof(T) != 16 )
+ #endif
+ , const T >::type& val )
+{
+ while( !Impl::lock_address_host_space( (void*) dest ) );
+ T return_val = *dest;
+ const T tmp = *dest = return_val + val;
+ #ifndef KOKKOS_COMPILER_CLANG
+ (void) tmp;
+ #endif
+ Impl::unlock_address_host_space( (void*) dest );
+ return return_val;
+}
//----------------------------------------------------------------------------
#elif defined( KOKKOS_ATOMICS_USE_OMP31 )
template< typename T >
T atomic_fetch_add( volatile T * const dest , const T val )
{
T retval;
#pragma omp atomic capture
{
retval = dest[0];
dest[0] += val;
}
return retval;
}
#endif
//----------------------------------------------------------------------------
// Simpler version of atomic_fetch_add without the fetch
template <typename T>
KOKKOS_INLINE_FUNCTION
void atomic_add(volatile T * const dest, const T src) {
atomic_fetch_add(dest,src);
}
// Atomic increment
template<typename T>
KOKKOS_INLINE_FUNCTION
void atomic_increment(volatile T* a) {
Kokkos::atomic_fetch_add(a,1);
}
template<typename T>
KOKKOS_INLINE_FUNCTION
void atomic_decrement(volatile T* a) {
Kokkos::atomic_fetch_add(a,-1);
}
}
#endif
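
For element types without a native fetch-and-add (double being the typical case), the header builds atomic_fetch_add from a compare-and-swap loop: reinterpret the value through a union, add, and retry until the CAS succeeds. The standalone sketch below mirrors that pattern for double, assuming an LP64 platform where sizeof(long) == sizeof(double) == 8; it is illustrative only, not code from the patch.

#include <cstdio>

double atomic_fetch_add_double( volatile double * const dest , const double val )
{
  union U { long i ; double t ; } assume , oldval , newval ;
  oldval.t = *dest ;
  do {
    assume.i = oldval.i ;
    newval.t = assume.t + val ;
    oldval.i = __sync_val_compare_and_swap( (volatile long *) dest , assume.i , newval.i );
  } while ( assume.i != oldval.i );
  return oldval.t ;              // value observed before the addition
}

int main()
{
  volatile double sum = 1.0 ;
  const double before = atomic_fetch_add_double( &sum , 2.5 );
  std::printf( "before = %g  after = %g\n" , before , (double) sum );
  return 0;
}
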
diff --git a/lib/kokkos/core/src/impl/Kokkos_Atomic_Fetch_And.hpp b/lib/kokkos/core/src/impl/Kokkos_Atomic_Fetch_And.hpp
index 9e62fd65d..9b7ebae4a 100755
--- a/lib/kokkos/core/src/impl/Kokkos_Atomic_Fetch_And.hpp
+++ b/lib/kokkos/core/src/impl/Kokkos_Atomic_Fetch_And.hpp
@@ -1,125 +1,125 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
#if defined( KOKKOS_ATOMIC_HPP ) && ! defined( KOKKOS_ATOMIC_FETCH_AND_HPP )
#define KOKKOS_ATOMIC_FETCH_AND_HPP
namespace Kokkos {
//----------------------------------------------------------------------------
#if defined( KOKKOS_ATOMICS_USE_CUDA )
// Support for int, unsigned int, unsigned long long int, and float
__inline__ __device__
int atomic_fetch_and( volatile int * const dest , const int val )
{ return atomicAnd((int*)dest,val); }
__inline__ __device__
unsigned int atomic_fetch_and( volatile unsigned int * const dest , const unsigned int val )
{ return atomicAnd((unsigned int*)dest,val); }
#if defined( __CUDA_ARCH__ ) && ( 350 <= __CUDA_ARCH__ )
__inline__ __device__
unsigned long long int atomic_fetch_and( volatile unsigned long long int * const dest ,
const unsigned long long int val )
{ return atomicAnd((unsigned long long int*)dest,val); }
#endif
//----------------------------------------------------------------------------
#elif defined(KOKKOS_ATOMICS_USE_GCC) || defined(KOKKOS_ATOMICS_USE_INTEL)
KOKKOS_INLINE_FUNCTION
int atomic_fetch_and( volatile int * const dest , const int val )
{ return __sync_fetch_and_and(dest,val); }
KOKKOS_INLINE_FUNCTION
long int atomic_fetch_and( volatile long int * const dest , const long int val )
{ return __sync_fetch_and_and(dest,val); }
#if defined( KOKKOS_ATOMICS_USE_GCC )
KOKKOS_INLINE_FUNCTION
unsigned int atomic_fetch_and( volatile unsigned int * const dest , const unsigned int val )
{ return __sync_fetch_and_and(dest,val); }
KOKKOS_INLINE_FUNCTION
unsigned long int atomic_fetch_and( volatile unsigned long int * const dest , const unsigned long int val )
{ return __sync_fetch_and_and(dest,val); }
#endif
//----------------------------------------------------------------------------
#elif defined( KOKKOS_ATOMICS_USE_OMP31 )
template< typename T >
T atomic_fetch_and( volatile T * const dest , const T val )
{
T retval;
#pragma omp atomic capture
{
retval = dest[0];
dest[0] &= val;
}
return retval;
}
#endif
//----------------------------------------------------------------------------
// Simpler version of atomic_fetch_and without the fetch
template <typename T>
KOKKOS_INLINE_FUNCTION
void atomic_and(volatile T * const dest, const T src) {
(void)atomic_fetch_and(dest,src);
}
}
#endif
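
A fetch-and-AND is the usual primitive for clearing bits in a shared mask: the returned value is the mask before the bits were cleared, so the caller can tell whether it was the one that cleared them. A minimal illustration on the same GCC builtin the header wraps (not part of the patch):

#include <cstdio>

int main()
{
  volatile unsigned int flags = 0xFFu ;
  const unsigned int before = __sync_fetch_and_and( &flags , ~0x04u );  // clear bit 2
  std::printf( "bit 2 was %s, flags now 0x%X\n" ,
               ( before & 0x04u ) ? "set" : "clear" , (unsigned) flags );
  return 0;
}
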
diff --git a/lib/kokkos/core/src/impl/Kokkos_Atomic_Fetch_Or.hpp b/lib/kokkos/core/src/impl/Kokkos_Atomic_Fetch_Or.hpp
index 22a4a7866..f15e61a3a 100755
--- a/lib/kokkos/core/src/impl/Kokkos_Atomic_Fetch_Or.hpp
+++ b/lib/kokkos/core/src/impl/Kokkos_Atomic_Fetch_Or.hpp
@@ -1,125 +1,125 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
#if defined( KOKKOS_ATOMIC_HPP ) && ! defined( KOKKOS_ATOMIC_FETCH_OR_HPP )
#define KOKKOS_ATOMIC_FETCH_OR_HPP
namespace Kokkos {
//----------------------------------------------------------------------------
#if defined( KOKKOS_ATOMICS_USE_CUDA )
// Support for int, unsigned int, unsigned long long int, and float
__inline__ __device__
int atomic_fetch_or( volatile int * const dest , const int val )
{ return atomicOr((int*)dest,val); }
__inline__ __device__
unsigned int atomic_fetch_or( volatile unsigned int * const dest , const unsigned int val )
{ return atomicOr((unsigned int*)dest,val); }
#if defined( __CUDA_ARCH__ ) && ( 350 <= __CUDA_ARCH__ )
__inline__ __device__
unsigned long long int atomic_fetch_or( volatile unsigned long long int * const dest ,
const unsigned long long int val )
{ return atomicOr((unsigned long long int*)dest,val); }
#endif
//----------------------------------------------------------------------------
#elif defined(KOKKOS_ATOMICS_USE_GCC) || defined(KOKKOS_ATOMICS_USE_INTEL)
KOKKOS_INLINE_FUNCTION
int atomic_fetch_or( volatile int * const dest , const int val )
{ return __sync_fetch_and_or(dest,val); }
KOKKOS_INLINE_FUNCTION
long int atomic_fetch_or( volatile long int * const dest , const long int val )
{ return __sync_fetch_and_or(dest,val); }
#if defined( KOKKOS_ATOMICS_USE_GCC )
KOKKOS_INLINE_FUNCTION
unsigned int atomic_fetch_or( volatile unsigned int * const dest , const unsigned int val )
{ return __sync_fetch_and_or(dest,val); }
KOKKOS_INLINE_FUNCTION
unsigned long int atomic_fetch_or( volatile unsigned long int * const dest , const unsigned long int val )
{ return __sync_fetch_and_or(dest,val); }
#endif
//----------------------------------------------------------------------------
#elif defined( KOKKOS_ATOMICS_USE_OMP31 )
template< typename T >
T atomic_fetch_or( volatile T * const dest , const T val )
{
T retval;
#pragma omp atomic capture
{
retval = dest[0];
dest[0] |= val;
}
return retval;
}
#endif
//----------------------------------------------------------------------------
// Simpler version of atomic_fetch_or without the fetch
template <typename T>
KOKKOS_INLINE_FUNCTION
void atomic_or(volatile T * const dest, const T src) {
(void)atomic_fetch_or(dest,src);
}
}
#endif
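
The matching fetch-and-OR sets bits and reports the previous contents, which gives a cheap "did I claim it first?" test. The standalone sketch below -- illustrative only, not part of the patch -- uses it so that exactly one of several threads performs a piece of work.

#include <cstdio>
#include <thread>
#include <vector>

int main()
{
  volatile unsigned int claimed = 0u ;
  int winner = -1 ;
  std::vector<std::thread> pool ;
  for ( int t = 0 ; t < 4 ; ++t )
    pool.emplace_back( [&claimed, &winner, t]() {
      // only the thread that observes the bit still clear does the work
      if ( ( __sync_fetch_and_or( &claimed , 1u ) & 1u ) == 0u ) winner = t ;
    } );
  for ( auto & th : pool ) th.join() ;
  std::printf( "work claimed exactly once, by thread %d\n" , winner );
  return 0;
}
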
diff --git a/lib/kokkos/core/src/impl/Kokkos_Atomic_Fetch_Add.hpp b/lib/kokkos/core/src/impl/Kokkos_Atomic_Fetch_Sub.hpp
similarity index 52%
copy from lib/kokkos/core/src/impl/Kokkos_Atomic_Fetch_Add.hpp
copy to lib/kokkos/core/src/impl/Kokkos_Atomic_Fetch_Sub.hpp
index ce8b9a093..259cba794 100755
--- a/lib/kokkos/core/src/impl/Kokkos_Atomic_Fetch_Add.hpp
+++ b/lib/kokkos/core/src/impl/Kokkos_Atomic_Fetch_Sub.hpp
@@ -1,297 +1,233 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
-#if defined( KOKKOS_ATOMIC_HPP ) && ! defined( KOKKOS_ATOMIC_FETCH_ADD_HPP )
-#define KOKKOS_ATOMIC_FETCH_ADD_HPP
+#if defined( KOKKOS_ATOMIC_HPP ) && ! defined( KOKKOS_ATOMIC_FETCH_SUB_HPP )
+#define KOKKOS_ATOMIC_FETCH_SUB_HPP
namespace Kokkos {
//----------------------------------------------------------------------------
#if defined( KOKKOS_ATOMICS_USE_CUDA )
// Support for int, unsigned int, unsigned long long int, and float
__inline__ __device__
-int atomic_fetch_add( volatile int * const dest , const int val )
-{ return atomicAdd((int*)dest,val); }
-
-__inline__ __device__
-unsigned int atomic_fetch_add( volatile unsigned int * const dest , const unsigned int val )
-{ return atomicAdd((unsigned int*)dest,val); }
-
-__inline__ __device__
-unsigned long long int atomic_fetch_add( volatile unsigned long long int * const dest ,
- const unsigned long long int val )
-{ return atomicAdd((unsigned long long int*)dest,val); }
+int atomic_fetch_sub( volatile int * const dest , const int val )
+{ return atomicSub((int*)dest,val); }
__inline__ __device__
-float atomic_fetch_add( volatile float * const dest , const float val )
-{ return atomicAdd((float*)dest,val); }
+unsigned int atomic_fetch_sub( volatile unsigned int * const dest , const unsigned int val )
+{ return atomicSub((unsigned int*)dest,val); }
template < typename T >
__inline__ __device__
-T atomic_fetch_add( volatile T * const dest ,
+T atomic_fetch_sub( volatile T * const dest ,
typename Kokkos::Impl::enable_if< sizeof(T) == sizeof(int) , const T >::type val )
{
-#ifdef KOKKOS_HAVE_CXX11
- union U {
- int i ;
- T t ;
- KOKKOS_INLINE_FUNCTION U() {};
- } assume , oldval , newval ;
-#else
- union U {
- int i ;
- T t ;
- } assume , oldval , newval ;
-#endif
+ union { int i ; T t ; } oldval , assume , newval ;
oldval.t = *dest ;
do {
assume.i = oldval.i ;
- newval.t = assume.t + val ;
+ newval.t = assume.t - val ;
oldval.i = atomicCAS( (int*)dest , assume.i , newval.i );
} while ( assume.i != oldval.i );
return oldval.t ;
}
template < typename T >
__inline__ __device__
-T atomic_fetch_add( volatile T * const dest ,
+T atomic_fetch_sub( volatile T * const dest ,
typename Kokkos::Impl::enable_if< sizeof(T) != sizeof(int) &&
sizeof(T) == sizeof(unsigned long long int) , const T >::type val )
{
-#ifdef KOKKOS_HAVE_CXX11
- union U {
- unsigned long long int i ;
- T t ;
- KOKKOS_INLINE_FUNCTION U() {};
- } assume , oldval , newval ;
-#else
- union U {
- unsigned long long int i ;
- T t ;
- } assume , oldval , newval ;
-#endif
+ union { unsigned long long int i ; T t ; } oldval , assume , newval ;
oldval.t = *dest ;
do {
assume.i = oldval.i ;
- newval.t = assume.t + val ;
+ newval.t = assume.t - val ;
oldval.i = atomicCAS( (unsigned long long int*)dest , assume.i , newval.i );
} while ( assume.i != oldval.i );
return oldval.t ;
}
+
+//----------------------------------------------------------------------------
+
template < typename T >
__inline__ __device__
-T atomic_fetch_add( volatile T * const dest ,
- typename Kokkos::Impl::enable_if< sizeof(T) != sizeof(int) &&
- sizeof(T) != sizeof(unsigned long long int) &&
- sizeof(T) == sizeof(Impl::cas128_t), const T >::type val )
+T atomic_fetch_sub( volatile T * const dest ,
+ typename ::Kokkos::Impl::enable_if<
+ ( sizeof(T) != 4 )
+ && ( sizeof(T) != 8 )
+ , const T >::type& val )
{
- Kokkos::abort("Error: calling atomic_fetch_add with 128bit type is not supported on CUDA execution space.");
- return T();
+ T return_val;
+ // This is a way to (hopefully) avoid deadlock in a warp
+ bool done = false;
+ while (! done ) {
+ if( Impl::lock_address_cuda_space( (void*) dest ) ) {
+ return_val = *dest;
+ *dest = return_val - val;
+ Impl::unlock_address_cuda_space( (void*) dest );
+ done = true;
+ }
+ }
+ return return_val;
}
//----------------------------------------------------------------------------
#elif defined(KOKKOS_ATOMICS_USE_GCC) || defined(KOKKOS_ATOMICS_USE_INTEL)
KOKKOS_INLINE_FUNCTION
-int atomic_fetch_add( volatile int * const dest , const int val )
-{ return __sync_fetch_and_add(dest,val); }
+int atomic_fetch_sub( volatile int * const dest , const int val )
+{ return __sync_fetch_and_sub(dest,val); }
KOKKOS_INLINE_FUNCTION
-long int atomic_fetch_add( volatile long int * const dest , const long int val )
-{ return __sync_fetch_and_add(dest,val); }
+long int atomic_fetch_sub( volatile long int * const dest , const long int val )
+{ return __sync_fetch_and_sub(dest,val); }
#if defined( KOKKOS_ATOMICS_USE_GCC )
KOKKOS_INLINE_FUNCTION
-unsigned int atomic_fetch_add( volatile unsigned int * const dest , const unsigned int val )
-{ return __sync_fetch_and_add(dest,val); }
+unsigned int atomic_fetch_sub( volatile unsigned int * const dest , const unsigned int val )
+{ return __sync_fetch_and_sub(dest,val); }
KOKKOS_INLINE_FUNCTION
-unsigned long int atomic_fetch_add( volatile unsigned long int * const dest , const unsigned long int val )
-{ return __sync_fetch_and_add(dest,val); }
+unsigned long int atomic_fetch_sub( volatile unsigned long int * const dest , const unsigned long int val )
+{ return __sync_fetch_and_sub(dest,val); }
#endif
template < typename T >
KOKKOS_INLINE_FUNCTION
-T atomic_fetch_add( volatile T * const dest ,
+T atomic_fetch_sub( volatile T * const dest ,
typename Kokkos::Impl::enable_if< sizeof(T) == sizeof(int) , const T >::type val )
{
-#ifdef KOKKOS_HAVE_CXX11
- union U {
- int i ;
- T t ;
- KOKKOS_INLINE_FUNCTION U() {};
- } assume , oldval , newval ;
-#else
- union U {
- int i ;
- T t ;
- } assume , oldval , newval ;
-#endif
+ union { int i ; T t ; } assume , oldval , newval ;
oldval.t = *dest ;
do {
assume.i = oldval.i ;
- newval.t = assume.t + val ;
+ newval.t = assume.t - val ;
oldval.i = __sync_val_compare_and_swap( (int*) dest , assume.i , newval.i );
} while ( assume.i != oldval.i );
return oldval.t ;
}
template < typename T >
KOKKOS_INLINE_FUNCTION
-T atomic_fetch_add( volatile T * const dest ,
+T atomic_fetch_sub( volatile T * const dest ,
typename Kokkos::Impl::enable_if< sizeof(T) != sizeof(int) &&
sizeof(T) == sizeof(long) , const T >::type val )
{
-#ifdef KOKKOS_HAVE_CXX11
- union U {
- long i ;
- T t ;
- KOKKOS_INLINE_FUNCTION U() {};
- } assume , oldval , newval ;
-#else
- union U {
- long i ;
- T t ;
- } assume , oldval , newval ;
-#endif
+ union { long i ; T t ; } assume , oldval , newval ;
oldval.t = *dest ;
do {
assume.i = oldval.i ;
- newval.t = assume.t + val ;
+ newval.t = assume.t - val ;
oldval.i = __sync_val_compare_and_swap( (long*) dest , assume.i , newval.i );
} while ( assume.i != oldval.i );
return oldval.t ;
}
-template < typename T >
-KOKKOS_INLINE_FUNCTION
-T atomic_fetch_add( volatile T * const dest ,
- typename Kokkos::Impl::enable_if< sizeof(T) != sizeof(int) &&
- sizeof(T) != sizeof(long) &&
- sizeof(T) == sizeof(Impl::cas128_t) , const T >::type val )
-{
-#ifdef KOKKOS_HAVE_CXX11
- union U {
- Impl::cas128_t i ;
- T t ;
- KOKKOS_INLINE_FUNCTION U() {};
- } assume , oldval , newval ;
-#else
- union U {
- Impl::cas128_t i ;
- T t ;
- } assume , oldval , newval ;
-#endif
-
- oldval.t = *dest ;
- do {
- assume.i = oldval.i ;
- newval.t = assume.t + val ;
- oldval.i = Impl::cas128( (volatile Impl::cas128_t*) dest , assume.i , newval.i );
- } while ( assume.i != oldval.i );
+//----------------------------------------------------------------------------
- return oldval.t ;
+template < typename T >
+inline
+T atomic_fetch_sub( volatile T * const dest ,
+ typename ::Kokkos::Impl::enable_if<
+ ( sizeof(T) != 4 )
+ && ( sizeof(T) != 8 )
+ , const T >::type& val )
+{
+ while( !Impl::lock_address_host_space( (void*) dest ) );
+ T return_val = *dest;
+ *dest = return_val - val;
+ Impl::unlock_address_host_space( (void*) dest );
+ return return_val;
}
+
//----------------------------------------------------------------------------
#elif defined( KOKKOS_ATOMICS_USE_OMP31 )
template< typename T >
-T atomic_fetch_add( volatile T * const dest , const T val )
+T atomic_fetch_sub( volatile T * const dest , const T val )
{
T retval;
#pragma omp atomic capture
{
retval = dest[0];
- dest[0] += val;
+ dest[0] -= val;
}
return retval;
}
#endif
-//----------------------------------------------------------------------------
-
-// Simpler version of atomic_fetch_add without the fetch
+// Simpler version of atomic_fetch_sub without the fetch
template <typename T>
KOKKOS_INLINE_FUNCTION
-void atomic_add(volatile T * const dest, const T src) {
- atomic_fetch_add(dest,src);
-}
-
-// Atomic increment
-template<typename T>
-KOKKOS_INLINE_FUNCTION
-void atomic_increment(volatile T* a) {
- Kokkos::atomic_fetch_add(a,1);
+void atomic_sub(volatile T * const dest, const T src) {
+ atomic_fetch_sub(dest,src);
}
-template<typename T>
-KOKKOS_INLINE_FUNCTION
-void atomic_decrement(volatile T* a) {
- Kokkos::atomic_fetch_add(a,-1);
}
-}
+#include<impl/Kokkos_Atomic_Assembly_X86.hpp>
#endif
+
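
A common use of the new fetch-and-subtract is a shared countdown or reference count: the caller that sees the value drop from one to zero knows it released the last reference. A minimal illustration on the GCC builtin the header wraps (not part of the patch):

#include <cstdio>

int main()
{
  volatile int refcount = 3 ;
  for ( int user = 0 ; user < 3 ; ++user ) {
    const int before = __sync_fetch_and_sub( &refcount , 1 );
    if ( before == 1 )
      std::printf( "user %d released the last reference\n" , user );
  }
  return 0;
}
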
diff --git a/lib/kokkos/core/src/impl/Kokkos_Atomic_Generic.hpp b/lib/kokkos/core/src/impl/Kokkos_Atomic_Generic.hpp
index 125142825..bd968633b 100755
--- a/lib/kokkos/core/src/impl/Kokkos_Atomic_Generic.hpp
+++ b/lib/kokkos/core/src/impl/Kokkos_Atomic_Generic.hpp
@@ -1,383 +1,375 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
#if defined( KOKKOS_ATOMIC_HPP ) && ! defined( KOKKOS_ATOMIC_GENERIC_HPP )
#define KOKKOS_ATOMIC_GENERIC_HPP
#include <Kokkos_Macros.hpp>
// Combination operators to be used in a compare-and-exchange based atomic operation
namespace Kokkos {
namespace Impl {
template<class Scalar1, class Scalar2>
struct AddOper {
KOKKOS_FORCEINLINE_FUNCTION
static Scalar1 apply(const Scalar1& val1, const Scalar2& val2) {
return val1+val2;
}
};
template<class Scalar1, class Scalar2>
struct SubOper {
KOKKOS_FORCEINLINE_FUNCTION
static Scalar1 apply(const Scalar1& val1, const Scalar2& val2) {
return val1-val2;
}
};
template<class Scalar1, class Scalar2>
struct MulOper {
KOKKOS_FORCEINLINE_FUNCTION
static Scalar1 apply(const Scalar1& val1, const Scalar2& val2) {
return val1*val2;
}
};
template<class Scalar1, class Scalar2>
struct DivOper {
KOKKOS_FORCEINLINE_FUNCTION
static Scalar1 apply(const Scalar1& val1, const Scalar2& val2) {
return val1/val2;
}
};
template<class Scalar1, class Scalar2>
struct ModOper {
KOKKOS_FORCEINLINE_FUNCTION
static Scalar1 apply(const Scalar1& val1, const Scalar2& val2) {
return val1%val2;
}
};
template<class Scalar1, class Scalar2>
struct AndOper {
KOKKOS_FORCEINLINE_FUNCTION
static Scalar1 apply(const Scalar1& val1, const Scalar2& val2) {
return val1&val2;
}
};
template<class Scalar1, class Scalar2>
struct OrOper {
KOKKOS_FORCEINLINE_FUNCTION
static Scalar1 apply(const Scalar1& val1, const Scalar2& val2) {
return val1|val2;
}
};
template<class Scalar1, class Scalar2>
struct XorOper {
KOKKOS_FORCEINLINE_FUNCTION
static Scalar1 apply(const Scalar1& val1, const Scalar2& val2) {
return val1^val2;
}
};
template<class Scalar1, class Scalar2>
struct LShiftOper {
KOKKOS_FORCEINLINE_FUNCTION
static Scalar1 apply(const Scalar1& val1, const Scalar2& val2) {
return val1<<val2;
}
};
template<class Scalar1, class Scalar2>
struct RShiftOper {
KOKKOS_FORCEINLINE_FUNCTION
static Scalar1 apply(const Scalar1& val1, const Scalar2& val2) {
return val1>>val2;
}
};
template < class Oper, typename T >
KOKKOS_INLINE_FUNCTION
T atomic_fetch_oper( const Oper& op, volatile T * const dest ,
typename ::Kokkos::Impl::enable_if< sizeof(T) != sizeof(int) &&
sizeof(T) == sizeof(unsigned long long int) , const T >::type val )
{
union { unsigned long long int i ; T t ; } oldval , assume , newval ;
oldval.t = *dest ;
do {
assume.i = oldval.i ;
newval.t = Oper::apply(assume.t, val) ;
oldval.i = ::Kokkos::atomic_compare_exchange( (unsigned long long int*)dest , assume.i , newval.i );
} while ( assume.i != oldval.i );
return oldval.t ;
}
template < class Oper, typename T >
KOKKOS_INLINE_FUNCTION
T atomic_oper_fetch( const Oper& op, volatile T * const dest ,
typename ::Kokkos::Impl::enable_if< sizeof(T) != sizeof(int) &&
sizeof(T) == sizeof(unsigned long long int) , const T >::type val )
{
union { unsigned long long int i ; T t ; } oldval , assume , newval ;
oldval.t = *dest ;
do {
assume.i = oldval.i ;
newval.t = Oper::apply(assume.t, val) ;
oldval.i = ::Kokkos::atomic_compare_exchange( (unsigned long long int*)dest , assume.i , newval.i );
} while ( assume.i != oldval.i );
return newval.t ;
}
template < class Oper, typename T >
KOKKOS_INLINE_FUNCTION
T atomic_fetch_oper( const Oper& op, volatile T * const dest ,
typename ::Kokkos::Impl::enable_if< sizeof(T) == sizeof(int) , const T >::type val )
{
union { int i ; T t ; } oldval , assume , newval ;
oldval.t = *dest ;
do {
assume.i = oldval.i ;
newval.t = Oper::apply(assume.t, val) ;
oldval.i = ::Kokkos::atomic_compare_exchange( (int*)dest , assume.i , newval.i );
} while ( assume.i != oldval.i );
return oldval.t ;
}
template < class Oper, typename T >
KOKKOS_INLINE_FUNCTION
T atomic_oper_fetch( const Oper& op, volatile T * const dest ,
typename ::Kokkos::Impl::enable_if< sizeof(T) == sizeof(int), const T >::type val )
{
union { int i ; T t ; } oldval , assume , newval ;
oldval.t = *dest ;
do {
assume.i = oldval.i ;
newval.t = Oper::apply(assume.t, val) ;
oldval.i = ::Kokkos::atomic_compare_exchange( (int*)dest , assume.i , newval.i );
} while ( assume.i != oldval.i );
return newval.t ;
}
-/*template < class Oper, typename T >
-KOKKOS_INLINE_FUNCTION
-T atomic_fetch_oper( const Oper& op, volatile T * const dest ,
- typename ::Kokkos::Impl::enable_if< sizeof(T) == sizeof(short) , const T >::type val )
-{
- union { short i ; T t ; } oldval , assume , newval ;
-
- oldval.t = *dest ;
-
- do {
- assume.i = oldval.i ;
- newval.t = Oper::apply(assume.t, val) ;
- oldval.i = ::Kokkos::atomic_compare_exchange( (short*)dest , assume.i , newval.i );
- } while ( assume.i != oldval.i );
-
- return oldval.t ;
-}
-
-template < class Oper, typename T >
-KOKKOS_INLINE_FUNCTION
-T atomic_oper_fetch( const Oper& op, volatile T * const dest ,
- typename ::Kokkos::Impl::enable_if< sizeof(T) == sizeof(short), const T >::type val )
-{
- union { short i ; T t ; } oldval , assume , newval ;
-
- oldval.t = *dest ;
-
- do {
- assume.i = oldval.i ;
- newval.t = Oper::apply(assume.t, val) ;
- oldval.i = ::Kokkos::atomic_compare_exchange( (short*)dest , assume.i , newval.i );
- } while ( assume.i != oldval.i );
-
- return newval.t ;
-}
-
template < class Oper, typename T >
KOKKOS_INLINE_FUNCTION
T atomic_fetch_oper( const Oper& op, volatile T * const dest ,
- typename ::Kokkos::Impl::enable_if< sizeof(T) == sizeof(char) , const T >::type val )
+ typename ::Kokkos::Impl::enable_if<
+ ( sizeof(T) != 4 )
+ && ( sizeof(T) != 8 )
+ #if defined(KOKKOS_ENABLE_ASM) && defined(KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST)
+ && ( sizeof(T) != 16 )
+ #endif
+ , const T >::type val )
{
- union { char i ; T t ; } oldval , assume , newval ;
-
- oldval.t = *dest ;
- do {
- assume.i = oldval.i ;
- newval.t = Oper::apply(assume.t, val) ;
- oldval.i = ::Kokkos::atomic_compare_exchange( (char*)dest , assume.i , newval.i );
- } while ( assume.i != oldval.i );
-
- return oldval.t ;
+#ifdef KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST
+ while( !Impl::lock_address_host_space( (void*) dest ) );
+ T return_val = *dest;
+ *dest = Oper::apply(return_val, val);
+ Impl::unlock_address_host_space( (void*) dest );
+ return return_val;
+#else
+ // This is a way to (hopefully) avoid deadlock in a warp
+ T return_val;
+ bool done = false;
+ while (! done ) {
+ if( Impl::lock_address_cuda_space( (void*) dest ) ) {
+ return_val = *dest;
+ *dest = Oper::apply(return_val, val);
+ Impl::unlock_address_cuda_space( (void*) dest );
+ done = true;
+ }
+ }
+ return return_val;
+#endif
}
template < class Oper, typename T >
KOKKOS_INLINE_FUNCTION
T atomic_oper_fetch( const Oper& op, volatile T * const dest ,
- typename ::Kokkos::Impl::enable_if< sizeof(T) == sizeof(char), const T >::type val )
+ typename ::Kokkos::Impl::enable_if<
+ ( sizeof(T) != 4 )
+ && ( sizeof(T) != 8 )
+ #if defined(KOKKOS_ENABLE_ASM) && defined(KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST)
+ && ( sizeof(T) != 16 )
+ #endif
+ , const T >::type& val )
{
- union { char i ; T t ; } oldval , assume , newval ;
-
- oldval.t = *dest ;
- do {
- assume.i = oldval.i ;
- newval.t = Oper::apply(assume.t, val) ;
- oldval.i = ::Kokkos::atomic_compare_exchange( (char*)dest , assume.i , newval.i );
- } while ( assume.i != oldval.i );
-
- return newval.t ;
-}*/
+#ifdef KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST
+ while( !Impl::lock_address_host_space( (void*) dest ) );
+ T return_val = Oper::apply(*dest, val);
+ *dest = return_val;
+ Impl::unlock_address_host_space( (void*) dest );
+ return return_val;
+#else
+  // Spin until this thread acquires the per-address lock; doing it this way
+  // (hopefully) avoids deadlock within a warp.
+  T return_val;
+  bool done = false;
+  while (! done ) {
+    if( Impl::lock_address_cuda_space( (void*) dest ) ) {
+      return_val = Oper::apply(*dest, val);
+      *dest = return_val;
+      Impl::unlock_address_cuda_space( (void*) dest );
+      done = true;
+    }
+  }
+  return return_val;
+#endif
+}
}
}
namespace Kokkos {
// Fetch_Oper atomics: return value before operation
template < typename T >
KOKKOS_INLINE_FUNCTION
T atomic_fetch_mul(volatile T * const dest, const T val) {
return Impl::atomic_fetch_oper(Impl::MulOper<T,const T>(),dest,val);
}
template < typename T >
KOKKOS_INLINE_FUNCTION
T atomic_fetch_div(volatile T * const dest, const T val) {
return Impl::atomic_fetch_oper(Impl::DivOper<T,const T>(),dest,val);
}
template < typename T >
KOKKOS_INLINE_FUNCTION
T atomic_fetch_mod(volatile T * const dest, const T val) {
return Impl::atomic_fetch_oper(Impl::ModOper<T,const T>(),dest,val);
}
template < typename T >
KOKKOS_INLINE_FUNCTION
T atomic_fetch_and(volatile T * const dest, const T val) {
return Impl::atomic_fetch_oper(Impl::AndOper<T,const T>(),dest,val);
}
template < typename T >
KOKKOS_INLINE_FUNCTION
T atomic_fetch_or(volatile T * const dest, const T val) {
return Impl::atomic_fetch_oper(Impl::OrOper<T,const T>(),dest,val);
}
template < typename T >
KOKKOS_INLINE_FUNCTION
T atomic_fetch_xor(volatile T * const dest, const T val) {
return Impl::atomic_fetch_oper(Impl::XorOper<T,const T>(),dest,val);
}
template < typename T >
KOKKOS_INLINE_FUNCTION
T atomic_fetch_lshift(volatile T * const dest, const unsigned int val) {
return Impl::atomic_fetch_oper(Impl::LShiftOper<T,const unsigned int>(),dest,val);
}
template < typename T >
KOKKOS_INLINE_FUNCTION
T atomic_fetch_rshift(volatile T * const dest, const unsigned int val) {
return Impl::atomic_fetch_oper(Impl::RShiftOper<T,const unsigned int>(),dest,val);
}
// Oper Fetch atomics: return value after operation
template < typename T >
KOKKOS_INLINE_FUNCTION
T atomic_mul_fetch(volatile T * const dest, const T val) {
return Impl::atomic_oper_fetch(Impl::MulOper<T,const T>(),dest,val);
}
template < typename T >
KOKKOS_INLINE_FUNCTION
T atomic_div_fetch(volatile T * const dest, const T val) {
return Impl::atomic_oper_fetch(Impl::DivOper<T,const T>(),dest,val);
}
template < typename T >
KOKKOS_INLINE_FUNCTION
T atomic_mod_fetch(volatile T * const dest, const T val) {
return Impl::atomic_oper_fetch(Impl::ModOper<T,const T>(),dest,val);
}
template < typename T >
KOKKOS_INLINE_FUNCTION
T atomic_and_fetch(volatile T * const dest, const T val) {
return Impl::atomic_oper_fetch(Impl::AndOper<T,const T>(),dest,val);
}
template < typename T >
KOKKOS_INLINE_FUNCTION
T atomic_or_fetch(volatile T * const dest, const T val) {
return Impl::atomic_oper_fetch(Impl::OrOper<T,const T>(),dest,val);
}
template < typename T >
KOKKOS_INLINE_FUNCTION
T atomic_xor_fetch(volatile T * const dest, const T val) {
return Impl::atomic_oper_fetch(Impl::XorOper<T,const T>(),dest,val);
}
template < typename T >
KOKKOS_INLINE_FUNCTION
T atomic_lshift_fetch(volatile T * const dest, const unsigned int val) {
return Impl::atomic_oper_fetch(Impl::LShiftOper<T,const unsigned int>(),dest,val);
}
template < typename T >
KOKKOS_INLINE_FUNCTION
T atomic_rshift_fetch(volatile T * const dest, const unsigned int val) {
return Impl::atomic_oper_fetch(Impl::RShiftOper<T,const unsigned int>(),dest,val);
}
}
#endif
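// Illustrative sketch (not part of this patch): every atomic_fetch_* /
// atomic_*_fetch wrapper above reduces to the same compare-and-swap retry
// loop. Written out for a plain int with a user-supplied functor it looks
// like this; fetch_oper_sketch and MultiplyBy are illustrative names only.
#include <Kokkos_Atomic.hpp>

struct MultiplyBy {
  int factor;
  KOKKOS_INLINE_FUNCTION int operator()(int x) const { return x * factor; }
};

template < class Op >
KOKKOS_INLINE_FUNCTION
int fetch_oper_sketch( volatile int * const dest , Op op ) {
  int oldval = *dest ;
  int assume ;
  do {
    assume = oldval ;                 // value we expect to find at dest
    const int newval = op(assume) ;   // value we want to install
    oldval = Kokkos::atomic_compare_exchange( dest , assume , newval );
  } while ( assume != oldval );       // another thread won the race: retry
  return oldval ;                     // "fetch" semantics: value before the op
}
// e.g. fetch_oper_sketch(&counter, MultiplyBy{3}) behaves like
// Kokkos::atomic_fetch_mul(&counter, 3).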
diff --git a/lib/kokkos/core/src/impl/Kokkos_Atomic_View.hpp b/lib/kokkos/core/src/impl/Kokkos_Atomic_View.hpp
index 6bb33f6bf..f95ed67da 100755
--- a/lib/kokkos/core/src/impl/Kokkos_Atomic_View.hpp
+++ b/lib/kokkos/core/src/impl/Kokkos_Atomic_View.hpp
@@ -1,448 +1,462 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_ATOMIC_VIEW_HPP
#define KOKKOS_ATOMIC_VIEW_HPP
#include <Kokkos_Macros.hpp>
#include <Kokkos_Atomic.hpp>
-namespace Kokkos {
-namespace Impl {
+
+namespace Kokkos { namespace Impl {
+
+class AllocationTracker;
//The following tag is used to prevent an implicit call of the constructor when trying
//to assign a literal 0 int ( = 0 );
struct AtomicViewConstTag {};
template<class ViewTraits>
class AtomicDataElement {
public:
typedef typename ViewTraits::value_type value_type;
typedef typename ViewTraits::const_value_type const_value_type;
typedef typename ViewTraits::non_const_value_type non_const_value_type;
volatile value_type* const ptr;
KOKKOS_INLINE_FUNCTION
AtomicDataElement(value_type* ptr_, AtomicViewConstTag ):ptr(ptr_){}
KOKKOS_INLINE_FUNCTION
const_value_type operator = (const_value_type& val) const {
*ptr = val;
return val;
}
KOKKOS_INLINE_FUNCTION
const_value_type operator = (volatile const_value_type& val) const {
*ptr = val;
return val;
}
KOKKOS_INLINE_FUNCTION
void inc() const {
Kokkos::atomic_increment(ptr);
}
KOKKOS_INLINE_FUNCTION
void dec() const {
Kokkos::atomic_decrement(ptr);
}
KOKKOS_INLINE_FUNCTION
const_value_type operator ++ () const {
const_value_type tmp = Kokkos::atomic_fetch_add(ptr,1);
return tmp+1;
}
KOKKOS_INLINE_FUNCTION
const_value_type operator -- () const {
const_value_type tmp = Kokkos::atomic_fetch_add(ptr,-1);
return tmp-1;
}
KOKKOS_INLINE_FUNCTION
const_value_type operator ++ (int) const {
return Kokkos::atomic_fetch_add(ptr,1);
}
KOKKOS_INLINE_FUNCTION
const_value_type operator -- (int) const {
return Kokkos::atomic_fetch_add(ptr,-1);
}
KOKKOS_INLINE_FUNCTION
const_value_type operator += (const_value_type& val) const {
const_value_type tmp = Kokkos::atomic_fetch_add(ptr,val);
return tmp+val;
}
KOKKOS_INLINE_FUNCTION
const_value_type operator += (volatile const_value_type& val) const {
const_value_type tmp = Kokkos::atomic_fetch_add(ptr,val);
return tmp+val;
}
KOKKOS_INLINE_FUNCTION
const_value_type operator -= (const_value_type& val) const {
const_value_type tmp = Kokkos::atomic_fetch_add(ptr,-val);
return tmp-val;
}
KOKKOS_INLINE_FUNCTION
const_value_type operator -= (volatile const_value_type& val) const {
const_value_type tmp = Kokkos::atomic_fetch_add(ptr,-val);
return tmp-val;
}
KOKKOS_INLINE_FUNCTION
const_value_type operator *= (const_value_type& val) const {
return Kokkos::atomic_mul_fetch(ptr,val);
}
KOKKOS_INLINE_FUNCTION
const_value_type operator *= (volatile const_value_type& val) const {
return Kokkos::atomic_mul_fetch(ptr,val);
}
KOKKOS_INLINE_FUNCTION
const_value_type operator /= (const_value_type& val) const {
return Kokkos::atomic_div_fetch(ptr,val);
}
KOKKOS_INLINE_FUNCTION
const_value_type operator /= (volatile const_value_type& val) const {
return Kokkos::atomic_div_fetch(ptr,val);
}
KOKKOS_INLINE_FUNCTION
const_value_type operator %= (const_value_type& val) const {
return Kokkos::atomic_mod_fetch(ptr,val);
}
KOKKOS_INLINE_FUNCTION
const_value_type operator %= (volatile const_value_type& val) const {
return Kokkos::atomic_mod_fetch(ptr,val);
}
KOKKOS_INLINE_FUNCTION
const_value_type operator &= (const_value_type& val) const {
return Kokkos::atomic_and_fetch(ptr,val);
}
KOKKOS_INLINE_FUNCTION
const_value_type operator &= (volatile const_value_type& val) const {
return Kokkos::atomic_and_fetch(ptr,val);
}
KOKKOS_INLINE_FUNCTION
const_value_type operator ^= (const_value_type& val) const {
return Kokkos::atomic_xor_fetch(ptr,val);
}
KOKKOS_INLINE_FUNCTION
const_value_type operator ^= (volatile const_value_type& val) const {
return Kokkos::atomic_xor_fetch(ptr,val);
}
KOKKOS_INLINE_FUNCTION
const_value_type operator |= (const_value_type& val) const {
return Kokkos::atomic_or_fetch(ptr,val);
}
KOKKOS_INLINE_FUNCTION
const_value_type operator |= (volatile const_value_type& val) const {
return Kokkos::atomic_or_fetch(ptr,val);
}
KOKKOS_INLINE_FUNCTION
const_value_type operator <<= (const_value_type& val) const {
return Kokkos::atomic_lshift_fetch(ptr,val);
}
KOKKOS_INLINE_FUNCTION
const_value_type operator <<= (volatile const_value_type& val) const {
return Kokkos::atomic_lshift_fetch(ptr,val);
}
KOKKOS_INLINE_FUNCTION
const_value_type operator >>= (const_value_type& val) const {
return Kokkos::atomic_rshift_fetch(ptr,val);
}
KOKKOS_INLINE_FUNCTION
const_value_type operator >>= (volatile const_value_type& val) const {
return Kokkos::atomic_rshift_fetch(ptr,val);
}
KOKKOS_INLINE_FUNCTION
const_value_type operator + (const_value_type& val) const {
return *ptr+val;
}
KOKKOS_INLINE_FUNCTION
const_value_type operator + (volatile const_value_type& val) const {
return *ptr+val;
}
KOKKOS_INLINE_FUNCTION
const_value_type operator - (const_value_type& val) const {
return *ptr-val;
}
KOKKOS_INLINE_FUNCTION
const_value_type operator - (volatile const_value_type& val) const {
return *ptr-val;
}
KOKKOS_INLINE_FUNCTION
const_value_type operator * (const_value_type& val) const {
return *ptr*val;
}
KOKKOS_INLINE_FUNCTION
const_value_type operator * (volatile const_value_type& val) const {
return *ptr*val;
}
KOKKOS_INLINE_FUNCTION
const_value_type operator / (const_value_type& val) const {
return *ptr/val;
}
KOKKOS_INLINE_FUNCTION
const_value_type operator / (volatile const_value_type& val) const {
return *ptr/val;
}
KOKKOS_INLINE_FUNCTION
const_value_type operator % (const_value_type& val) const {
return *ptr%val;
}
KOKKOS_INLINE_FUNCTION
const_value_type operator % (volatile const_value_type& val) const {
return *ptr%val;
}
KOKKOS_INLINE_FUNCTION
const_value_type operator ! () const {
return !*ptr;
}
KOKKOS_INLINE_FUNCTION
const_value_type operator && (const_value_type& val) const {
return *ptr&&val;
}
KOKKOS_INLINE_FUNCTION
const_value_type operator && (volatile const_value_type& val) const {
return *ptr&&val;
}
KOKKOS_INLINE_FUNCTION
const_value_type operator || (const_value_type& val) const {
return *ptr||val;
}
KOKKOS_INLINE_FUNCTION
const_value_type operator || (volatile const_value_type& val) const {
return *ptr||val;
}
KOKKOS_INLINE_FUNCTION
const_value_type operator & (const_value_type& val) const {
return *ptr&val;
}
KOKKOS_INLINE_FUNCTION
const_value_type operator & (volatile const_value_type& val) const {
return *ptr&val;
}
KOKKOS_INLINE_FUNCTION
const_value_type operator | (const_value_type& val) const {
return *ptr|val;
}
KOKKOS_INLINE_FUNCTION
const_value_type operator | (volatile const_value_type& val) const {
return *ptr|val;
}
KOKKOS_INLINE_FUNCTION
const_value_type operator ^ (const_value_type& val) const {
return *ptr^val;
}
KOKKOS_INLINE_FUNCTION
const_value_type operator ^ (volatile const_value_type& val) const {
return *ptr^val;
}
KOKKOS_INLINE_FUNCTION
const_value_type operator ~ () const {
return ~*ptr;
}
KOKKOS_INLINE_FUNCTION
const_value_type operator << (const unsigned int& val) const {
return *ptr<<val;
}
KOKKOS_INLINE_FUNCTION
const_value_type operator << (volatile const unsigned int& val) const {
return *ptr<<val;
}
KOKKOS_INLINE_FUNCTION
const_value_type operator >> (const unsigned int& val) const {
return *ptr>>val;
}
KOKKOS_INLINE_FUNCTION
const_value_type operator >> (volatile const unsigned int& val) const {
return *ptr>>val;
}
KOKKOS_INLINE_FUNCTION
bool operator == (const_value_type& val) const {
return *ptr == val;
}
KOKKOS_INLINE_FUNCTION
bool operator == (volatile const_value_type& val) const {
return *ptr == val;
}
KOKKOS_INLINE_FUNCTION
bool operator != (const_value_type& val) const {
return *ptr != val;
}
KOKKOS_INLINE_FUNCTION
bool operator != (volatile const_value_type& val) const {
return *ptr != val;
}
KOKKOS_INLINE_FUNCTION
bool operator >= (const_value_type& val) const {
return *ptr >= val;
}
KOKKOS_INLINE_FUNCTION
bool operator >= (volatile const_value_type& val) const {
return *ptr >= val;
}
KOKKOS_INLINE_FUNCTION
bool operator <= (const_value_type& val) const {
return *ptr <= val;
}
KOKKOS_INLINE_FUNCTION
bool operator <= (volatile const_value_type& val) const {
return *ptr <= val;
}
KOKKOS_INLINE_FUNCTION
bool operator < (const_value_type& val) const {
return *ptr < val;
}
KOKKOS_INLINE_FUNCTION
bool operator < (volatile const_value_type& val) const {
return *ptr < val;
}
KOKKOS_INLINE_FUNCTION
bool operator > (const_value_type& val) const {
return *ptr > val;
}
KOKKOS_INLINE_FUNCTION
bool operator > (volatile const_value_type& val) const {
return *ptr > val;
}
KOKKOS_INLINE_FUNCTION
operator const_value_type () const {
//return Kokkos::atomic_load(ptr);
return *ptr;
}
KOKKOS_INLINE_FUNCTION
operator volatile non_const_value_type () volatile const {
//return Kokkos::atomic_load(ptr);
return *ptr;
}
};
template<class ViewTraits>
class AtomicViewDataHandle {
public:
typename ViewTraits::value_type* ptr;
KOKKOS_INLINE_FUNCTION
- AtomicViewDataHandle(typename ViewTraits::value_type* ptr_):ptr(ptr_){}
+ AtomicViewDataHandle()
+ : ptr(NULL)
+ {}
+
+ KOKKOS_INLINE_FUNCTION
+ AtomicViewDataHandle(typename ViewTraits::value_type* ptr_)
+ :ptr(ptr_)
+ {}
template<class iType>
KOKKOS_INLINE_FUNCTION
AtomicDataElement<ViewTraits> operator[] (const iType& i) const {
return AtomicDataElement<ViewTraits>(ptr+i,AtomicViewConstTag());
}
KOKKOS_INLINE_FUNCTION
operator typename ViewTraits::value_type * () const { return ptr ; }
};
template<unsigned Size>
struct Kokkos_Atomic_is_only_allowed_with_32bit_and_64bit_scalars;
template<>
struct Kokkos_Atomic_is_only_allowed_with_32bit_and_64bit_scalars<4> {
typedef int type;
};
template<>
struct Kokkos_Atomic_is_only_allowed_with_32bit_and_64bit_scalars<8> {
typedef int64_t type;
};
// Must be non-const, atomic access trait, and 32 or 64 bit type for true atomics.
template<class ViewTraits>
class ViewDataHandle<
ViewTraits ,
typename enable_if<
( ! is_same<typename ViewTraits::const_value_type,typename ViewTraits::value_type>::value) &&
( ViewTraits::memory_traits::Atomic )
>::type >
{
private:
-// typedef typename if_c<(sizeof(typename ViewTraits::const_value_type)==4) ||
-// (sizeof(typename ViewTraits::const_value_type)==8),
-// int, Kokkos_Atomic_is_only_allowed_with_32bit_and_64bit_scalars >::type
-// atomic_view_possible;
+// typedef typename if_c<(sizeof(typename ViewTraits::const_value_type)==4) ||
+// (sizeof(typename ViewTraits::const_value_type)==8),
+// int, Kokkos_Atomic_is_only_allowed_with_32bit_and_64bit_scalars >::type
+// atomic_view_possible;
typedef typename Kokkos_Atomic_is_only_allowed_with_32bit_and_64bit_scalars<sizeof(typename ViewTraits::const_value_type)>::type enable_atomic_type;
typedef ViewDataHandle self_type;
public:
enum { ReturnTypeIsReference = false };
typedef Impl::AtomicViewDataHandle<ViewTraits> handle_type;
typedef Impl::AtomicDataElement<ViewTraits> return_type;
+
+ KOKKOS_INLINE_FUNCTION
+ static handle_type create_handle( typename ViewTraits::value_type * arg_data_ptr, AllocationTracker const & /*arg_tracker*/ )
+ {
+ return handle_type(arg_data_ptr);
+ }
};
-}
-}
+}} // namespace Kokkos::Impl
#endif
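// Illustrative sketch (not part of this patch): the AtomicViewDataHandle above
// is what a View declared with the Atomic memory trait hands back on element
// access, so each access goes through an AtomicDataElement proxy and its
// atomic operators. A hedged usage example (all names below are illustrative):
#include <Kokkos_Core.hpp>

void histogram_sketch( Kokkos::View<const int*> bin_of , int nsamples , int nbins ) {
  // Concurrent threads may hit the same bin; the Atomic trait turns the
  // "+= 1" below into an atomic update instead of a data race.
  Kokkos::View< int* , Kokkos::MemoryTraits<Kokkos::Atomic> > hist( "hist" , nbins );
  Kokkos::parallel_for( nsamples , KOKKOS_LAMBDA( const int i ) {
    hist( bin_of(i) ) += 1 ;
  });
}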
diff --git a/lib/kokkos/core/src/impl/Kokkos_Atomic_Windows.hpp b/lib/kokkos/core/src/impl/Kokkos_Atomic_Windows.hpp
new file mode 100755
index 000000000..62581569f
--- /dev/null
+++ b/lib/kokkos/core/src/impl/Kokkos_Atomic_Windows.hpp
@@ -0,0 +1,211 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+#ifndef KOKKOS_ATOMIC_WINDOWS_HPP
+#define KOKKOS_ATOMIC_WINDOWS_HPP
+#ifdef _WIN32
+
+#define NOMINMAX
+#include <Windows.h>
+
+namespace Kokkos {
+ namespace Impl {
+    __declspec(align(16))
+ struct cas128_t
+ {
+ LONGLONG lower;
+ LONGLONG upper;
+ KOKKOS_INLINE_FUNCTION
+ bool operator != (const cas128_t& a) const {
+ return (lower != a.lower) || upper != a.upper;
+ }
+ };
+ }
+
+#ifdef KOKKOS_HAVE_CXX11
+ template < typename T >
+ KOKKOS_INLINE_FUNCTION
+ T atomic_compare_exchange(volatile T * const dest, const T & compare,
+ typename Kokkos::Impl::enable_if< sizeof(T) == sizeof(LONG), const T & >::type val)
+ {
+ union U {
+ LONG i;
+ T t;
+ KOKKOS_INLINE_FUNCTION U() {};
+ } tmp;
+
+ tmp.i = _InterlockedCompareExchange((LONG*)dest, *((LONG*)&val), *((LONG*)&compare));
+ return tmp.t;
+ }
+
+ template < typename T >
+ KOKKOS_INLINE_FUNCTION
+ T atomic_compare_exchange(volatile T * const dest, const T & compare,
+ typename Kokkos::Impl::enable_if< sizeof(T) == sizeof(LONGLONG), const T & >::type val)
+ {
+ union U {
+ LONGLONG i;
+ T t;
+ KOKKOS_INLINE_FUNCTION U() {};
+ } tmp;
+
+ tmp.i = _InterlockedCompareExchange64((LONGLONG*)dest, *((LONGLONG*)&val), *((LONGLONG*)&compare));
+ return tmp.t;
+ }
+
+ template < typename T >
+ KOKKOS_INLINE_FUNCTION
+ T atomic_compare_exchange(volatile T * const dest, const T & compare,
+ typename Kokkos::Impl::enable_if< sizeof(T) == sizeof(Impl::cas128_t), const T & >::type val)
+ {
+ union U {
+ Impl::cas128_t i;
+ T t;
+ KOKKOS_INLINE_FUNCTION U() {};
+ } tmp, newval;
+ newval.t = val;
+ tmp.i = _InterlockedCompareExchange128((LONGLONG*)dest, newval.i.upper, newval.i.lower, *((LONGLONG*)&compare));
+ return tmp.t;
+ }
+
+ template< typename T >
+ T atomic_fetch_or(volatile T * const dest, const T val) {
+ T oldval = *dest;
+ T assume;
+ do {
+ assume = oldval;
+ T newval = val | oldval;
+ oldval = atomic_compare_exchange(dest, assume, newval);
+ } while (assume != oldval);
+
+ return oldval;
+ }
+
+ template< typename T >
+ T atomic_fetch_and(volatile T * const dest, const T val) {
+ T oldval = *dest;
+ T assume;
+ do {
+ assume = oldval;
+ T newval = val & oldval;
+ oldval = atomic_compare_exchange(dest, assume, newval);
+ } while (assume != oldval);
+
+ return oldval;
+ }
+
+ template< typename T >
+ T atomic_fetch_add(volatile T * const dest, const T val) {
+ T oldval = *dest;
+ T assume;
+ do {
+ assume = oldval;
+ T newval = val + oldval;
+ oldval = atomic_compare_exchange(dest, assume, newval);
+ } while (assume != oldval);
+
+ return oldval;
+ }
+
+ template< typename T >
+ T atomic_fetch_exchange(volatile T * const dest, const T val) {
+ T oldval = *dest;
+ T assume;
+ do {
+ assume = oldval;
+ oldval = atomic_compare_exchange(dest, assume, val);
+ } while (assume != oldval);
+
+ return oldval;
+ }
+
+ template< typename T >
+ void atomic_or(volatile T * const dest, const T val) {
+ atomic_fetch_or(dest, val);
+ }
+
+ template< typename T >
+ void atomic_and(volatile T * const dest, const T val) {
+ atomic_fetch_and(dest, val);
+ }
+
+ template< typename T >
+ void atomic_add(volatile T * const dest, const T val) {
+ atomic_fetch_add(dest, val);
+ }
+
+ template< typename T >
+ void atomic_exchange(volatile T * const dest, const T val) {
+ atomic_fetch_exchange(dest, val);
+ }
+
+ template< typename T >
+ void atomic_assign(volatile T * const dest, const T val) {
+ atomic_fetch_exchange(dest, val);
+ }
+
+ template< typename T >
+  void atomic_increment(volatile T * const dest) {
+    // CAS retry loop; no value is returned.
+    T oldval = *dest;
+    T assume;
+    do {
+      assume = oldval;
+      T newval = assume + 1;
+      oldval = atomic_compare_exchange(dest, assume, newval);
+    } while (assume != oldval);
+  }
+
+ template< typename T >
+  void atomic_decrement(volatile T * const dest) {
+    // CAS retry loop; no value is returned.
+    T oldval = *dest;
+    T assume;
+    do {
+      assume = oldval;
+      T newval = assume - 1;
+      oldval = atomic_compare_exchange(dest, assume, newval);
+    } while (assume != oldval);
+  }
+
+}
+#endif
+#endif
+#endif
\ No newline at end of file
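// Illustrative sketch (not part of this patch): the Windows header above layers
// every atomic on the _InterlockedCompareExchange* intrinsics. The retry idiom
// it uses, written against the raw 32-bit intrinsic, is essentially:
#ifdef _WIN32
#include <Windows.h>

inline LONG fetch_add_sketch( volatile LONG * const dest , const LONG val ) {
  LONG assume , oldval = *dest ;
  do {
    assume = oldval ;
    // Install assume+val only if *dest still holds assume; returns prior value.
    oldval = _InterlockedCompareExchange( dest , assume + val , assume );
  } while ( assume != oldval );
  return oldval ;   // value before the addition
}
#endif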
diff --git a/lib/kokkos/core/src/impl/Kokkos_BasicAllocators.cpp b/lib/kokkos/core/src/impl/Kokkos_BasicAllocators.cpp
new file mode 100755
index 000000000..8da619fdb
--- /dev/null
+++ b/lib/kokkos/core/src/impl/Kokkos_BasicAllocators.cpp
@@ -0,0 +1,281 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <Kokkos_HostSpace.hpp>
+
+#include <impl/Kokkos_BasicAllocators.hpp>
+#include <impl/Kokkos_Error.hpp>
+
+
+#include <stdint.h> // uintptr_t
+#include <cstdlib> // for malloc, realloc, and free
+#include <cstring> // for memcpy
+#include <sys/mman.h> // for mmap, munmap, MAP_ANON, etc
+#include <unistd.h> // for sysconf, _SC_PAGE_SIZE, _SC_PHYS_PAGES
+
+#include <sstream>
+
+namespace Kokkos { namespace Impl {
+
+/*--------------------------------------------------------------------------*/
+
+void* MallocAllocator::allocate( size_t size )
+{
+ void * ptr = NULL;
+ if (size) {
+ ptr = malloc(size);
+
+ if (!ptr)
+ {
+ std::ostringstream msg ;
+ msg << name() << ": allocate(" << size << ") FAILED";
+ throw_runtime_exception( msg.str() );
+ }
+ }
+ return ptr;
+}
+
+void MallocAllocator::deallocate( void * ptr, size_t /*size*/ )
+{
+ if (ptr) {
+ free(ptr);
+ }
+}
+
+void * MallocAllocator::reallocate(void * old_ptr, size_t /*old_size*/, size_t new_size)
+{
+ void * ptr = realloc(old_ptr, new_size);
+
+ if (new_size > 0u && ptr == NULL) {
+ throw_runtime_exception("Error: Malloc Allocator could not reallocate memory");
+ }
+ return ptr;
+}
+
+/*--------------------------------------------------------------------------*/
+
+namespace {
+
+void * raw_aligned_allocate( size_t size, size_t alignment )
+{
+ void * ptr = NULL;
+ if ( size ) {
+#if defined( __INTEL_COMPILER ) && !defined ( KOKKOS_HAVE_CUDA )
+ ptr = _mm_malloc( size , alignment );
+
+#elif ( defined( _POSIX_C_SOURCE ) && _POSIX_C_SOURCE >= 200112L ) || \
+ ( defined( _XOPEN_SOURCE ) && _XOPEN_SOURCE >= 600 )
+
+ posix_memalign( & ptr, alignment , size );
+
+#else
+    // Over-allocate and round up to guarantee proper alignment.
+ size_t size_padded = size + alignment + sizeof(void *);
+ void * alloc_ptr = malloc( size_padded );
+
+ if (alloc_ptr) {
+ uintptr_t address = reinterpret_cast<uintptr_t>(alloc_ptr);
+ // offset enough to record the alloc_ptr
+ address += sizeof(void *);
+ uintptr_t rem = address % alignment;
+ uintptr_t offset = rem ? (alignment - rem) : 0u;
+ address += offset;
+ ptr = reinterpret_cast<void *>(address);
+ // record the alloc'd pointer
+ address -= sizeof(void *);
+ *reinterpret_cast<void **>(address) = alloc_ptr;
+ }
+#endif
+ }
+ return ptr;
+}
+
+void raw_aligned_deallocate( void * ptr, size_t /*size*/ )
+{
+ if ( ptr ) {
+#if defined( __INTEL_COMPILER ) && !defined ( KOKKOS_HAVE_CUDA )
+ _mm_free( ptr );
+
+#elif ( defined( _POSIX_C_SOURCE ) && _POSIX_C_SOURCE >= 200112L ) || \
+ ( defined( _XOPEN_SOURCE ) && _XOPEN_SOURCE >= 600 )
+ free( ptr );
+#else
+ // get the alloc'd pointer
+ void * alloc_ptr = *(reinterpret_cast<void **>(ptr) -1);
+ free( alloc_ptr );
+#endif
+ }
+
+}
+
+}
+
+void* AlignedAllocator::allocate( size_t size )
+{
+ void * ptr = 0 ;
+
+ if ( size ) {
+ ptr = raw_aligned_allocate(size, MEMORY_ALIGNMENT);
+
+ if (!ptr)
+ {
+ std::ostringstream msg ;
+ msg << name() << ": allocate(" << size << ") FAILED";
+ throw_runtime_exception( msg.str() );
+ }
+ }
+ return ptr;
+}
+
+void AlignedAllocator::deallocate( void * ptr, size_t size )
+{
+ raw_aligned_deallocate( ptr, size);
+}
+
+void * AlignedAllocator::reallocate(void * old_ptr, size_t old_size, size_t new_size)
+{
+  void * ptr = old_ptr;
+
+ if (old_size < new_size) {
+ ptr = allocate( new_size );
+
+ memcpy(ptr, old_ptr, old_size );
+
+ deallocate( old_ptr, old_size );
+ }
+
+ return ptr;
+}
+
+/*--------------------------------------------------------------------------*/
+
+// mmap flags for private anonymous memory allocation
+#if defined( MAP_ANONYMOUS ) && defined( MAP_PRIVATE )
+ #define MMAP_FLAGS (MAP_PRIVATE | MAP_ANONYMOUS)
+#elif defined( MAP_ANON) && defined( MAP_PRIVATE )
+ #define MMAP_FLAGS (MAP_PRIVATE | MAP_ANON)
+#else
+ #define NO_MMAP
+#endif
+
+// huge page tables
+#if !defined( NO_MMAP )
+ #if defined( MAP_HUGETLB )
+ #define MMAP_FLAGS_HUGE (MMAP_FLAGS | MAP_HUGETLB )
+ #elif defined( MMAP_FLAGS )
+ #define MMAP_FLAGS_HUGE MMAP_FLAGS
+ #endif
+ // threshold to use huge pages
+ #define MMAP_USE_HUGE_PAGES (1u << 27)
+#endif
+
+// read write access to private memory
+#if !defined( NO_MMAP )
+ #define MMAP_PROTECTION (PROT_READ | PROT_WRITE)
+#endif
+
+
+void* PageAlignedAllocator::allocate( size_t size )
+{
+ void *ptr = NULL;
+ if (size) {
+#if !defined NO_MMAP
+ if ( size < MMAP_USE_HUGE_PAGES ) {
+ ptr = mmap( NULL, size, MMAP_PROTECTION, MMAP_FLAGS, -1 /*file descriptor*/, 0 /*offset*/);
+ } else {
+ ptr = mmap( NULL, size, MMAP_PROTECTION, MMAP_FLAGS_HUGE, -1 /*file descriptor*/, 0 /*offset*/);
+ }
+ if (ptr == MAP_FAILED) {
+ ptr = NULL;
+ }
+#else
+ static const size_t page_size = 4096; // TODO: read in from sysconf( _SC_PAGE_SIZE )
+
+ ptr = raw_aligned_allocate( size, page_size);
+#endif
+ if (!ptr)
+ {
+ std::ostringstream msg ;
+ msg << name() << ": allocate(" << size << ") FAILED";
+ throw_runtime_exception( msg.str() );
+ }
+ }
+ return ptr;
+}
+
+void PageAlignedAllocator::deallocate( void * ptr, size_t size )
+{
+#if !defined( NO_MMAP )
+ munmap(ptr, size);
+#else
+ raw_aligned_deallocate(ptr, size);
+#endif
+}
+
+void * PageAlignedAllocator::reallocate(void * old_ptr, size_t old_size, size_t new_size)
+{
+ void * ptr = NULL;
+#if defined( NO_MMAP ) || defined( __APPLE__ )
+
+ if (old_size != new_size) {
+ ptr = allocate( new_size );
+
+ memcpy(ptr, old_ptr, (old_size < new_size ? old_size : new_size) );
+
+ deallocate( old_ptr, old_size );
+ }
+ else {
+ ptr = old_ptr;
+ }
+#else
+ ptr = mremap( old_ptr, old_size, new_size, MREMAP_MAYMOVE );
+
+ if (ptr == MAP_FAILED) {
+ throw_runtime_exception("Error: Page Aligned Allocator could not reallocate memory");
+ }
+#endif
+
+ return ptr;
+}
+
+}} // namespace Kokkos::Impl
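// Illustrative sketch (not part of this patch): the malloc fallback inside
// raw_aligned_allocate above over-allocates and stashes the original pointer
// just below the aligned address so deallocate can recover it. Stand-alone
// (assumes alignment is at least sizeof(void*)):
#include <cstdlib>
#include <stdint.h>

void * aligned_alloc_sketch( size_t size , size_t alignment ) {
  void * raw = std::malloc( size + alignment + sizeof(void*) );
  if ( !raw ) return 0 ;
  uintptr_t addr = reinterpret_cast<uintptr_t>(raw) + sizeof(void*);
  const uintptr_t rem = addr % alignment ;
  addr += rem ? ( alignment - rem ) : 0u ;           // round up to the alignment
  void * ptr = reinterpret_cast<void*>(addr);
  reinterpret_cast<void**>(ptr)[-1] = raw ;          // remember the malloc'd pointer
  return ptr ;
}

void aligned_free_sketch( void * ptr ) {
  if ( ptr ) std::free( reinterpret_cast<void**>(ptr)[-1] );
}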
diff --git a/lib/kokkos/core/src/impl/Kokkos_PhysicalLayout.hpp b/lib/kokkos/core/src/impl/Kokkos_BasicAllocators.hpp
similarity index 54%
copy from lib/kokkos/core/src/impl/Kokkos_PhysicalLayout.hpp
copy to lib/kokkos/core/src/impl/Kokkos_BasicAllocators.hpp
index 0dcb3977a..76377c5f1 100755
--- a/lib/kokkos/core/src/impl/Kokkos_PhysicalLayout.hpp
+++ b/lib/kokkos/core/src/impl/Kokkos_BasicAllocators.hpp
@@ -1,84 +1,118 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
-#ifndef KOKKOS_PHYSICAL_LAYOUT_HPP
-#define KOKKOS_PHYSICAL_LAYOUT_HPP
-
-
-#include <Kokkos_View.hpp>
-namespace Kokkos {
-namespace Impl {
-
-
-
-struct PhysicalLayout {
- enum LayoutType {Left,Right,Scalar,Error};
- LayoutType layout_type;
- int rank;
- long long int stride[8]; //distance between two neighboring elements in a given dimension
-
- template< class T , class L , class D , class M >
- PhysicalLayout( const View<T,L,D,M,ViewDefault> & view )
- : layout_type( is_same< typename View<T,L,D,M>::array_layout , LayoutLeft >::value ? Left : (
- is_same< typename View<T,L,D,M>::array_layout , LayoutRight >::value ? Right : Error ))
- , rank( view.Rank )
- {
- for(int i=0;i<8;i++) stride[i] = 0;
- view.stride( stride );
- }
- #ifdef KOKKOS_HAVE_CUDA
- template< class T , class L , class D , class M >
- PhysicalLayout( const View<T,L,D,M,ViewCudaTexture> & view )
- : layout_type( is_same< typename View<T,L,D,M>::array_layout , LayoutLeft >::value ? Left : (
- is_same< typename View<T,L,D,M>::array_layout , LayoutRight >::value ? Right : Error ))
- , rank( view.Rank )
- {
- for(int i=0;i<8;i++) stride[i] = 0;
- view.stride( stride );
- }
- #endif
+#ifndef KOKKOS_BASIC_ALLOCATORS_HPP
+#define KOKKOS_BASIC_ALLOCATORS_HPP
+
+
+namespace Kokkos { namespace Impl {
+
+/// class UnmanagedAllocator
+/// does nothing when deallocate(ptr,size) is called
+class UnmanagedAllocator
+{
+public:
+ static const char * name() { return "Unmanaged Allocator"; }
+
+ static void deallocate(void * /*ptr*/, size_t /*size*/) {}
+};
+
+
+/// class MallocAllocator
+class MallocAllocator
+{
+public:
+ static const char * name()
+ {
+ return "Malloc Allocator";
+ }
+
+ static void* allocate(size_t size);
+
+ static void deallocate(void * ptr, size_t size);
+
+ static void * reallocate(void * old_ptr, size_t old_size, size_t new_size);
+};
+
+
+/// class AlignedAllocator
+/// memory aligned to Kokkos::Impl::MEMORY_ALIGNMENT
+class AlignedAllocator
+{
+public:
+ static const char * name()
+ {
+ return "Aligned Allocator";
+ }
+
+ static void* allocate(size_t size);
+
+ static void deallocate(void * ptr, size_t size);
+
+ static void * reallocate(void * old_ptr, size_t old_size, size_t new_size);
};
-}
-}
-#endif
+
+/// class PageAlignedAllocator
+/// memory aligned to PAGE_SIZE
+class PageAlignedAllocator
+{
+public:
+ static const char * name()
+ {
+ return "Page Aligned Allocator";
+ }
+
+ static void* allocate(size_t size);
+
+ static void deallocate(void * ptr, size_t size);
+
+ static void * reallocate(void * old_ptr, size_t old_size, size_t new_size);
+};
+
+
+}} // namespace Kokkos::Impl
+
+#endif //KOKKOS_BASIC_ALLOCATORS_HPP
+
+
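// Illustrative sketch (not part of this patch): the three allocating classes
// above share the same static allocate/deallocate/reallocate interface, so
// bookkeeping code can take the allocator as a template parameter. A
// hypothetical RAII helper (ScopedBlock is an illustrative name):
#include <cstddef>
#include <impl/Kokkos_BasicAllocators.hpp>

template < class Allocator >
struct ScopedBlock {
  void * ptr ;
  size_t size ;
  explicit ScopedBlock( size_t n ) : ptr( Allocator::allocate(n) ) , size(n) {}
  ~ScopedBlock() { Allocator::deallocate( ptr , size ); }
};

// Usage: ScopedBlock< Kokkos::Impl::AlignedAllocator > buf( 1024 );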
diff --git a/lib/kokkos/core/src/impl/Kokkos_Core.cpp b/lib/kokkos/core/src/impl/Kokkos_Core.cpp
index 25542fa3d..1c3c83cfe 100755
--- a/lib/kokkos/core/src/impl/Kokkos_Core.cpp
+++ b/lib/kokkos/core/src/impl/Kokkos_Core.cpp
@@ -1,441 +1,447 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos
-// Manycore Performance-Portable Multidimensional Arrays
-//
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
#include <Kokkos_Core.hpp>
#include <impl/Kokkos_Error.hpp>
#include <cctype>
#include <cstring>
#include <iostream>
#include <cstdlib>
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
namespace {
bool is_unsigned_int(const char* str)
{
const size_t len = strlen (str);
for (size_t i = 0; i < len; ++i) {
if (! isdigit (str[i])) {
return false;
}
}
return true;
}
void initialize_internal(const InitArguments& args)
{
// Protect declarations, to prevent "unused variable" warnings.
#if defined( KOKKOS_HAVE_OPENMP ) || defined( KOKKOS_HAVE_PTHREAD )
const int num_threads = args.num_threads;
const int use_numa = args.num_numa;
#endif // defined( KOKKOS_HAVE_OPENMP ) || defined( KOKKOS_HAVE_PTHREAD )
#if defined( KOKKOS_HAVE_CUDA )
const int use_gpu = args.device_id;
#endif // defined( KOKKOS_HAVE_CUDA )
#if defined( KOKKOS_HAVE_OPENMP )
if( Impl::is_same< Kokkos::OpenMP , Kokkos::DefaultExecutionSpace >::value ||
Impl::is_same< Kokkos::OpenMP , Kokkos::HostSpace::execution_space >::value ) {
if(num_threads>0) {
if(use_numa>0) {
Kokkos::OpenMP::initialize(num_threads,use_numa);
}
else {
Kokkos::OpenMP::initialize(num_threads);
}
} else {
Kokkos::OpenMP::initialize();
}
//std::cout << "Kokkos::initialize() fyi: OpenMP enabled and initialized" << std::endl ;
}
else {
//std::cout << "Kokkos::initialize() fyi: OpenMP enabled but not initialized" << std::endl ;
}
#endif
#if defined( KOKKOS_HAVE_PTHREAD )
if( Impl::is_same< Kokkos::Threads , Kokkos::DefaultExecutionSpace >::value ||
Impl::is_same< Kokkos::Threads , Kokkos::HostSpace::execution_space >::value ) {
if(num_threads>0) {
if(use_numa>0) {
Kokkos::Threads::initialize(num_threads,use_numa);
}
else {
Kokkos::Threads::initialize(num_threads);
}
} else {
Kokkos::Threads::initialize();
}
//std::cout << "Kokkos::initialize() fyi: Pthread enabled and initialized" << std::endl ;
}
else {
//std::cout << "Kokkos::initialize() fyi: Pthread enabled but not initialized" << std::endl ;
}
#endif
#if defined( KOKKOS_HAVE_SERIAL )
// Prevent "unused variable" warning for 'args' input struct. If
// Serial::initialize() ever needs to take arguments from the input
// struct, you may remove this line of code.
(void) args;
if( Impl::is_same< Kokkos::Serial , Kokkos::DefaultExecutionSpace >::value ||
Impl::is_same< Kokkos::Serial , Kokkos::HostSpace::execution_space >::value ) {
Kokkos::Serial::initialize();
}
#endif
#if defined( KOKKOS_HAVE_CUDA )
if( Impl::is_same< Kokkos::Cuda , Kokkos::DefaultExecutionSpace >::value || 0 < use_gpu ) {
if (use_gpu > -1) {
Kokkos::Cuda::initialize( Kokkos::Cuda::SelectDevice( use_gpu ) );
}
else {
Kokkos::Cuda::initialize();
}
//std::cout << "Kokkos::initialize() fyi: Cuda enabled and initialized" << std::endl ;
}
#endif
+
+#ifdef KOKKOSP_ENABLE_PROFILING
+ Kokkos::Experimental::initialize();
+#endif
}
void finalize_internal( const bool all_spaces = false )
{
#if defined( KOKKOS_HAVE_CUDA )
if( Impl::is_same< Kokkos::Cuda , Kokkos::DefaultExecutionSpace >::value || all_spaces ) {
if(Kokkos::Cuda::is_initialized())
Kokkos::Cuda::finalize();
}
#endif
#if defined( KOKKOS_HAVE_OPENMP )
if( Impl::is_same< Kokkos::OpenMP , Kokkos::DefaultExecutionSpace >::value ||
Impl::is_same< Kokkos::OpenMP , Kokkos::HostSpace::execution_space >::value ||
all_spaces ) {
if(Kokkos::OpenMP::is_initialized())
Kokkos::OpenMP::finalize();
}
#endif
#if defined( KOKKOS_HAVE_PTHREAD )
if( Impl::is_same< Kokkos::Threads , Kokkos::DefaultExecutionSpace >::value ||
Impl::is_same< Kokkos::Threads , Kokkos::HostSpace::execution_space >::value ||
all_spaces ) {
if(Kokkos::Threads::is_initialized())
Kokkos::Threads::finalize();
}
#endif
#if defined( KOKKOS_HAVE_SERIAL )
if( Impl::is_same< Kokkos::Serial , Kokkos::DefaultExecutionSpace >::value ||
Impl::is_same< Kokkos::Serial , Kokkos::HostSpace::execution_space >::value ||
all_spaces ) {
if(Kokkos::Serial::is_initialized())
Kokkos::Serial::finalize();
}
#endif
+#ifdef KOKKOSP_ENABLE_PROFILING
+ Kokkos::Experimental::finalize();
+#endif
+
}
void fence_internal()
{
#if defined( KOKKOS_HAVE_CUDA )
if( Impl::is_same< Kokkos::Cuda , Kokkos::DefaultExecutionSpace >::value ) {
Kokkos::Cuda::fence();
}
#endif
#if defined( KOKKOS_HAVE_OPENMP )
if( Impl::is_same< Kokkos::OpenMP , Kokkos::DefaultExecutionSpace >::value ||
Impl::is_same< Kokkos::OpenMP , Kokkos::HostSpace::execution_space >::value ) {
Kokkos::OpenMP::fence();
}
#endif
#if defined( KOKKOS_HAVE_PTHREAD )
if( Impl::is_same< Kokkos::Threads , Kokkos::DefaultExecutionSpace >::value ||
Impl::is_same< Kokkos::Threads , Kokkos::HostSpace::execution_space >::value ) {
Kokkos::Threads::fence();
}
#endif
#if defined( KOKKOS_HAVE_SERIAL )
if( Impl::is_same< Kokkos::Serial , Kokkos::DefaultExecutionSpace >::value ||
Impl::is_same< Kokkos::Serial , Kokkos::HostSpace::execution_space >::value ) {
Kokkos::Serial::fence();
}
#endif
}
} // namespace
} // namespace Impl
} // namespace Kokkos
//----------------------------------------------------------------------------
namespace Kokkos {
void initialize(int& narg, char* arg[])
{
int num_threads = -1;
int numa = -1;
int device = -1;
int kokkos_threads_found = 0;
int kokkos_numa_found = 0;
int kokkos_device_found = 0;
int kokkos_ndevices_found = 0;
int iarg = 0;
while (iarg < narg) {
if ((strncmp(arg[iarg],"--kokkos-threads",16) == 0) || (strncmp(arg[iarg],"--threads",9) == 0)) {
//Find the number of threads (expecting --threads=XX)
if (!((strncmp(arg[iarg],"--kokkos-threads=",17) == 0) || (strncmp(arg[iarg],"--threads=",10) == 0)))
Impl::throw_runtime_exception("Error: expecting an '=INT' after command line argument '--threads/--kokkos-threads'. Raised by Kokkos::initialize(int narg, char* argc[]).");
char* number = strchr(arg[iarg],'=')+1;
if(!Impl::is_unsigned_int(number) || (strlen(number)==0))
Impl::throw_runtime_exception("Error: expecting an '=INT' after command line argument '--threads/--kokkos-threads'. Raised by Kokkos::initialize(int narg, char* argc[]).");
if((strncmp(arg[iarg],"--kokkos-threads",16) == 0) || !kokkos_threads_found)
num_threads = atoi(number);
//Remove the --kokkos-threads argument from the list but leave --threads
if(strncmp(arg[iarg],"--kokkos-threads",16) == 0) {
for(int k=iarg;k<narg-1;k++) {
arg[k] = arg[k+1];
}
kokkos_threads_found=1;
narg--;
} else {
iarg++;
}
} else if ((strncmp(arg[iarg],"--kokkos-numa",13) == 0) || (strncmp(arg[iarg],"--numa",6) == 0)) {
//Find the number of NUMA regions (expecting --numa=XX)
if (!((strncmp(arg[iarg],"--kokkos-numa=",14) == 0) || (strncmp(arg[iarg],"--numa=",7) == 0)))
Impl::throw_runtime_exception("Error: expecting an '=INT' after command line argument '--numa/--kokkos-numa'. Raised by Kokkos::initialize(int narg, char* argc[]).");
char* number = strchr(arg[iarg],'=')+1;
if(!Impl::is_unsigned_int(number) || (strlen(number)==0))
Impl::throw_runtime_exception("Error: expecting an '=INT' after command line argument '--numa/--kokkos-numa'. Raised by Kokkos::initialize(int narg, char* argc[]).");
if((strncmp(arg[iarg],"--kokkos-numa",13) == 0) || !kokkos_numa_found)
numa = atoi(number);
//Remove the --kokkos-numa argument from the list but leave --numa
if(strncmp(arg[iarg],"--kokkos-numa",13) == 0) {
for(int k=iarg;k<narg-1;k++) {
arg[k] = arg[k+1];
}
kokkos_numa_found=1;
narg--;
} else {
iarg++;
}
} else if ((strncmp(arg[iarg],"--kokkos-device",15) == 0) || (strncmp(arg[iarg],"--device",8) == 0)) {
//Find the device id (expecting --device=XX)
if (!((strncmp(arg[iarg],"--kokkos-device=",16) == 0) || (strncmp(arg[iarg],"--device=",9) == 0)))
Impl::throw_runtime_exception("Error: expecting an '=INT' after command line argument '--device/--kokkos-device'. Raised by Kokkos::initialize(int narg, char* argc[]).");
char* number = strchr(arg[iarg],'=')+1;
if(!Impl::is_unsigned_int(number) || (strlen(number)==0))
Impl::throw_runtime_exception("Error: expecting an '=INT' after command line argument '--device/--kokkos-device'. Raised by Kokkos::initialize(int narg, char* argc[]).");
if((strncmp(arg[iarg],"--kokkos-device",15) == 0) || !kokkos_device_found)
device = atoi(number);
//Remove the --kokkos-device argument from the list but leave --device
if(strncmp(arg[iarg],"--kokkos-device",15) == 0) {
for(int k=iarg;k<narg-1;k++) {
arg[k] = arg[k+1];
}
kokkos_device_found=1;
narg--;
} else {
iarg++;
}
} else if ((strncmp(arg[iarg],"--kokkos-ndevices",17) == 0) || (strncmp(arg[iarg],"--ndevices",10) == 0)) {
//Find the number of devices (expecting --ndevices=XX[,XX])
if (!((strncmp(arg[iarg],"--kokkos-ndevices=",18) == 0) || (strncmp(arg[iarg],"--ndevices=",11) == 0)))
Impl::throw_runtime_exception("Error: expecting an '=INT[,INT]' after command line argument '--ndevices/--kokkos-ndevices'. Raised by Kokkos::initialize(int narg, char* argc[]).");
int ndevices=-1;
int skip_device = 9999;
char* num1 = strchr(arg[iarg],'=')+1;
char* num2 = strpbrk(num1,",");
int num1_len = num2==NULL?strlen(num1):num2-num1;
char* num1_only = new char[num1_len+1];
strncpy(num1_only,num1,num1_len);
num1_only[num1_len]=0;
if(!Impl::is_unsigned_int(num1_only) || (strlen(num1_only)==0)) {
Impl::throw_runtime_exception("Error: expecting an integer number after command line argument '--kokkos-ndevices'. Raised by Kokkos::initialize(int narg, char* argc[]).");
}
if((strncmp(arg[iarg],"--kokkos-ndevices",17) == 0) || !kokkos_ndevices_found)
ndevices = atoi(num1_only);
if( num2 != NULL ) {
if(( !Impl::is_unsigned_int(num2+1) ) || (strlen(num2)==1) )
Impl::throw_runtime_exception("Error: expecting an integer number after command line argument '--kokkos-ndevices=XX,'. Raised by Kokkos::initialize(int narg, char* argc[]).");
if((strncmp(arg[iarg],"--kokkos-ndevices",17) == 0) || !kokkos_ndevices_found)
skip_device = atoi(num2+1);
}
if((strncmp(arg[iarg],"--kokkos-ndevices",17) == 0) || !kokkos_ndevices_found) {
char *str;
if ((str = getenv("SLURM_LOCALID"))) {
int local_rank = atoi(str);
device = local_rank % ndevices;
if (device >= skip_device) device++;
}
if ((str = getenv("MV2_COMM_WORLD_LOCAL_RANK"))) {
int local_rank = atoi(str);
device = local_rank % ndevices;
if (device >= skip_device) device++;
}
if ((str = getenv("OMPI_COMM_WORLD_LOCAL_RANK"))) {
int local_rank = atoi(str);
device = local_rank % ndevices;
if (device >= skip_device) device++;
}
if(device==-1) {
device = 0;
if (device >= skip_device) device++;
}
}
//Remove the --kokkos-ndevices argument from the list but leave --ndevices
if(strncmp(arg[iarg],"--kokkos-ndevices",17) == 0) {
for(int k=iarg;k<narg-1;k++) {
arg[k] = arg[k+1];
}
kokkos_ndevices_found=1;
narg--;
} else {
iarg++;
}
} else if ((strcmp(arg[iarg],"--kokkos-help") == 0) || (strcmp(arg[iarg],"--help") == 0)) {
std::cout << std::endl;
std::cout << "--------------------------------------------------------------------------------" << std::endl;
std::cout << "-------------Kokkos command line arguments--------------------------------------" << std::endl;
std::cout << "--------------------------------------------------------------------------------" << std::endl;
std::cout << "The following arguments exist also without prefix 'kokkos' (e.g. --help)." << std::endl;
std::cout << "The prefixed arguments will be removed from the list by Kokkos::initialize()," << std::endl;
std::cout << "the non-prefixed ones are not removed. Prefixed versions take precedence over " << std::endl;
std::cout << "non prefixed ones, and the last occurence of an argument overwrites prior" << std::endl;
std::cout << "settings." << std::endl;
std::cout << std::endl;
std::cout << "--kokkos-help : print this message" << std::endl;
std::cout << "--kokkos-threads=INT : specify total number of threads or" << std::endl;
std::cout << " number of threads per NUMA region if " << std::endl;
std::cout << " used in conjunction with '--numa' option. " << std::endl;
std::cout << "--kokkos-numa=INT : specify number of NUMA regions used by process." << std::endl;
std::cout << "--kokkos-device=INT : specify device id to be used by Kokkos. " << std::endl;
std::cout << "--kokkos-ndevices=INT[,INT] : used when running MPI jobs. Specify number of" << std::endl;
std::cout << " devices per node to be used. Process to device" << std::endl;
std::cout << " mapping happens by obtaining the local MPI rank" << std::endl;
std::cout << " and assigning devices round-robin. The optional" << std::endl;
std::cout << " second argument allows for an existing device" << std::endl;
std::cout << " to be ignored. This is most useful on workstations" << std::endl;
std::cout << " with multiple GPUs of which one is used to drive" << std::endl;
std::cout << " screen output." << std::endl;
std::cout << std::endl;
std::cout << "--------------------------------------------------------------------------------" << std::endl;
std::cout << std::endl;
//Remove the --kokkos-help argument from the list but leave --help
if(strcmp(arg[iarg],"--kokkos-help") == 0) {
for(int k=iarg;k<narg-1;k++) {
arg[k] = arg[k+1];
}
narg--;
} else {
iarg++;
}
} else
iarg++;
}
InitArguments arguments;
arguments.num_threads = num_threads;
arguments.num_numa = numa;
arguments.device_id = device;
Impl::initialize_internal(arguments);
}
void initialize(const InitArguments& arguments) {
Impl::initialize_internal(arguments);
}
void finalize()
{
Impl::finalize_internal();
}
void finalize_all()
{
enum { all_spaces = true };
Impl::finalize_internal( all_spaces );
}
void fence()
{
Impl::fence_internal();
}
} // namespace Kokkos
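// Illustrative sketch (not part of this patch): the parsing above is what lets
// an application forward its own command line straight to Kokkos. Minimal use:
#include <Kokkos_Core.hpp>

int main( int argc , char * argv[] ) {
  // e.g.  ./a.out --kokkos-threads=8 --kokkos-numa=2 --kokkos-ndevices=2,1
  // The --kokkos-* arguments are consumed and removed from argv by initialize().
  Kokkos::initialize( argc , argv );
  // ... parallel work ...
  Kokkos::finalize();
  return 0 ;
}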
diff --git a/lib/kokkos/core/src/impl/Kokkos_CrsArray_factory.hpp b/lib/kokkos/core/src/impl/Kokkos_CrsArray_factory.hpp
deleted file mode 100755
index 1ed745a70..000000000
--- a/lib/kokkos/core/src/impl/Kokkos_CrsArray_factory.hpp
+++ /dev/null
@@ -1,223 +0,0 @@
-/*
-//@HEADER
-// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
-// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
-// the U.S. Government retains certain rights in this software.
-//
-// Redistribution and use in source and binary forms, with or without
-// modification, are permitted provided that the following conditions are
-// met:
-//
-// 1. Redistributions of source code must retain the above copyright
-// notice, this list of conditions and the following disclaimer.
-//
-// 2. Redistributions in binary form must reproduce the above copyright
-// notice, this list of conditions and the following disclaimer in the
-// documentation and/or other materials provided with the distribution.
-//
-// 3. Neither the name of the Corporation nor the names of the
-// contributors may be used to endorse or promote products derived from
-// this software without specific prior written permission.
-//
-// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
-// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
-// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
-// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
-// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
-// ************************************************************************
-//@HEADER
-*/
-
-#ifndef KOKKOS_IMPL_CRSARRAY_FACTORY_HPP
-#define KOKKOS_IMPL_CRSARRAY_FACTORY_HPP
-
-//----------------------------------------------------------------------------
-//----------------------------------------------------------------------------
-
-namespace Kokkos {
-
-template< class DataType , class Arg1Type , class Arg2Type , typename SizeType >
-inline
-typename CrsArray< DataType , Arg1Type , Arg2Type , SizeType >::HostMirror
-create_mirror( const CrsArray<DataType,Arg1Type,Arg2Type,SizeType > & view )
-{
- // Force copy:
- //typedef Impl::ViewAssignment< Impl::ViewDefault > alloc ; // unused
- typedef CrsArray< DataType , Arg1Type , Arg2Type , SizeType > crsarray_type ;
-
- typename crsarray_type::HostMirror tmp ;
- typename crsarray_type::row_map_type::HostMirror tmp_row_map = create_mirror( view.row_map );
-
- tmp.row_map = tmp_row_map ; // Assignment of 'const' from 'non-const'
- tmp.entries = create_mirror( view.entries );
-
- // Deep copy:
- deep_copy( tmp_row_map , view.row_map );
- deep_copy( tmp.entries , view.entries );
-
- return tmp ;
-}
-
-template< class DataType , class Arg1Type , class Arg2Type , typename SizeType >
-inline
-typename CrsArray< DataType , Arg1Type , Arg2Type , SizeType >::HostMirror
-create_mirror_view( const CrsArray<DataType,Arg1Type,Arg2Type,SizeType > & view ,
- typename Impl::enable_if< ViewTraits<DataType,Arg1Type,Arg2Type,void>::is_hostspace >::type * = 0 )
-{
- return view ;
-}
-
-template< class DataType , class Arg1Type , class Arg2Type , typename SizeType >
-inline
-typename CrsArray< DataType , Arg1Type , Arg2Type , SizeType >::HostMirror
-create_mirror_view( const CrsArray<DataType,Arg1Type,Arg2Type,SizeType > & view ,
- typename Impl::enable_if< ! ViewTraits<DataType,Arg1Type,Arg2Type,void>::is_hostspace >::type * = 0 )
-{
- return create_mirror( view );
-}
-
-
-} // namespace Kokkos
-
-//----------------------------------------------------------------------------
-//----------------------------------------------------------------------------
-
-namespace Kokkos {
-
-template< class CrsArrayType , class InputSizeType >
-inline
-typename CrsArrayType::crsarray_type
-create_crsarray( const std::string & label ,
- const std::vector< InputSizeType > & input )
-{
- typedef CrsArrayType output_type ;
- //typedef std::vector< InputSizeType > input_type ; // unused
-
- typedef typename output_type::entries_type entries_type ;
-
- typedef View< typename output_type::size_type [] ,
- typename output_type::array_layout ,
- typename output_type::execution_space > work_type ;
-
- output_type output ;
-
- // Create the row map:
-
- const size_t length = input.size();
-
- {
- work_type row_work( "tmp" , length + 1 );
-
- typename work_type::HostMirror row_work_host =
- create_mirror_view( row_work );
-
- size_t sum = 0 ;
- row_work_host[0] = 0 ;
- for ( size_t i = 0 ; i < length ; ++i ) {
- row_work_host[i+1] = sum += input[i];
- }
-
- deep_copy( row_work , row_work_host );
-
- output.entries = entries_type( label , sum );
- output.row_map = row_work ;
- }
-
- return output ;
-}
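// Illustrative sketch only (hypothetical names): building a CrsArray from
// per-row entry counts.  The overload above turns the counts into a prefix
// sum for row_map and allocates, but does not fill, the entries array.
//
//   typedef Kokkos::CrsArray< int , Kokkos::DefaultExecutionSpace > graph_type ;
//   std::vector< int > row_sizes( 3 );
//   row_sizes[0] = 2 ; row_sizes[1] = 0 ; row_sizes[2] = 4 ;
//   graph_type graph = Kokkos::create_crsarray< graph_type >( "graph" , row_sizes );
//   // graph.row_map == { 0 , 2 , 2 , 6 } ; graph.entries has length 6, uninitialized.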
-
-//----------------------------------------------------------------------------
-
-template< class CrsArrayType , class InputSizeType >
-inline
-typename CrsArrayType::crsarray_type
-create_crsarray( const std::string & label ,
- const std::vector< std::vector< InputSizeType > > & input )
-{
- typedef CrsArrayType output_type ;
- //typedef std::vector< std::vector< InputSizeType > > input_type ; // unused
- typedef typename output_type::entries_type entries_type ;
- //typedef typename output_type::size_type size_type ; // unused
-
- // mfh 14 Feb 2014: This function doesn't actually create instances
- // of ok_rank, but it needs to declare the typedef in order to do
- // the static "assert" (a compile-time check that the given shape
- // has rank 1). In order to avoid a "declared but unused typedef"
- // warning, we declare an empty instance of this type, with the
- // usual "(void)" marker to avoid a compiler warning for the unused
- // variable.
-
- typedef typename
- Impl::assert_shape_is_rank_one< typename entries_type::shape_type >::type
- ok_rank ;
- {
- ok_rank thing;
- (void) thing;
- }
-
- typedef View< typename output_type::size_type [] ,
- typename output_type::array_layout ,
- typename output_type::execution_space > work_type ;
-
- output_type output ;
-
- // Create the row map:
-
- const size_t length = input.size();
-
- {
- work_type row_work( "tmp" , length + 1 );
-
- typename work_type::HostMirror row_work_host =
- create_mirror_view( row_work );
-
- size_t sum = 0 ;
- row_work_host[0] = 0 ;
- for ( size_t i = 0 ; i < length ; ++i ) {
- row_work_host[i+1] = sum += input[i].size();
- }
-
- deep_copy( row_work , row_work_host );
-
- output.entries = entries_type( label , sum );
- output.row_map = row_work ;
- }
-
- // Fill in the entries:
- {
- typename entries_type::HostMirror host_entries =
- create_mirror_view( output.entries );
-
- size_t sum = 0 ;
- for ( size_t i = 0 ; i < length ; ++i ) {
- for ( size_t j = 0 ; j < input[i].size() ; ++j , ++sum ) {
- host_entries( sum ) = input[i][j] ;
- }
- }
-
- deep_copy( output.entries , host_entries );
- }
-
- return output ;
-}
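// Illustrative sketch only (hypothetical names): the nested-vector overload
// above additionally fills the entries on a host mirror and deep-copies them
// to the target memory space.
//
//   std::vector< std::vector< int > > rows( 2 );
//   rows[0].push_back( 7 ); rows[0].push_back( 8 );
//   rows[1].push_back( 9 );
//   graph_type graph = Kokkos::create_crsarray< graph_type >( "graph" , rows );
//   // graph.row_map == { 0 , 2 , 3 } ; graph.entries == { 7 , 8 , 9 }.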
-
-} // namespace Kokkos
-
-//----------------------------------------------------------------------------
-//----------------------------------------------------------------------------
-
-#endif /* #ifndef KOKKOS_IMPL_CRSARRAY_FACTORY_HPP */
-
diff --git a/lib/kokkos/core/src/impl/Kokkos_Error.cpp b/lib/kokkos/core/src/impl/Kokkos_Error.cpp
index 00fe43884..97cfbfae7 100755
--- a/lib/kokkos/core/src/impl/Kokkos_Error.cpp
+++ b/lib/kokkos/core/src/impl/Kokkos_Error.cpp
@@ -1,195 +1,193 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos
-// Manycore Performance-Portable Multidimensional Arrays
-//
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ostream>
#include <sstream>
#include <iomanip>
#include <stdexcept>
#include <impl/Kokkos_Error.hpp>
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
void host_abort( const char * const message )
{
fwrite(message,1,strlen(message),stderr);
fflush(stderr);
abort();
}
void throw_runtime_exception( const std::string & msg )
{
std::ostringstream o ;
o << msg ;
traceback_callstack( o );
throw std::runtime_error( o.str() );
}
std::string human_memory_size(size_t arg_bytes)
{
double bytes = arg_bytes;
const double K = 1024;
const double M = K*1024;
const double G = M*1024;
std::ostringstream out;
if (bytes < K) {
out << std::setprecision(4) << bytes << " B";
} else if (bytes < M) {
bytes /= K;
out << std::setprecision(4) << bytes << " K";
} else if (bytes < G) {
bytes /= M;
out << std::setprecision(4) << bytes << " M";
} else {
bytes /= G;
out << std::setprecision(4) << bytes << " G";
}
return out.str();
}
}
}
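// Illustrative sketch only: human_memory_size() picks the largest unit that
// keeps the value below 1024 and prints it with 4 significant digits.
//
//   Kokkos::Impl::human_memory_size( 512 );        // "512 B"
//   Kokkos::Impl::human_memory_size( 3ul << 20 );  // "3 M"  (3 * 1024 * 1024 bytes)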
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
#if defined( __GNUC__ ) && defined( ENABLE_TRACEBACK )
/* This is only known to work with GNU C++
* Must be compiled with '-rdynamic'
* Must be linked with '-ldl'
*/
/* Print call stack into an error stream,
 * so one knows in which function the error occurred.
*
* Code copied from:
* http://stupefydeveloper.blogspot.com/2008/10/cc-call-stack.html
*
* License on this site:
* This blog is licensed under a
* Creative Commons Attribution-Share Alike 3.0 Unported License.
*
* http://creativecommons.org/licenses/by-sa/3.0/
*
* Modified to output to std::ostream.
*/
#include <signal.h>
#include <execinfo.h>
#include <cxxabi.h>
#include <dlfcn.h>
#include <stdlib.h>
namespace Kokkos {
namespace Impl {
void traceback_callstack( std::ostream & msg )
{
using namespace abi;
enum { MAX_DEPTH = 32 };
void *trace[MAX_DEPTH];
Dl_info dlinfo;
int status;
int trace_size = backtrace(trace, MAX_DEPTH);
msg << std::endl << "Call stack {" << std::endl ;
for (int i=1; i<trace_size; ++i)
{
if(!dladdr(trace[i], &dlinfo))
continue;
const char * symname = dlinfo.dli_sname;
char * demangled = __cxa_demangle(symname, NULL, 0, &status);
if ( status == 0 && demangled ) {
symname = demangled;
}
if ( symname && *symname != 0 ) {
msg << " object: " << dlinfo.dli_fname
<< " function: " << symname
<< std::endl ;
}
if ( demangled ) {
free(demangled);
}
}
msg << "}" ;
}
}
}
#else
namespace Kokkos {
namespace Impl {
void traceback_callstack( std::ostream & msg )
{
msg << std::endl << "Traceback functionality not available" << std::endl ;
}
}
}
#endif
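// Illustrative sketch only: the GNU-specific traceback above is compiled in
// only when ENABLE_TRACEBACK is defined, and it requires '-rdynamic' when
// compiling and '-ldl' when linking so that dladdr() can resolve symbol
// names.  Any caller of throw_runtime_exception() then gets the call stack
// appended to the exception message, e.g.
//
//   try { Kokkos::Impl::throw_runtime_exception( "bad state" ); }
//   catch ( const std::runtime_error & e ) {
//     std::cerr << e.what();   // "bad state" followed by "Call stack { ... }"
//   }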
diff --git a/lib/kokkos/core/src/impl/Kokkos_Error.hpp b/lib/kokkos/core/src/impl/Kokkos_Error.hpp
index f8d0c15d6..33e203c94 100755
--- a/lib/kokkos/core/src/impl/Kokkos_Error.hpp
+++ b/lib/kokkos/core/src/impl/Kokkos_Error.hpp
@@ -1,80 +1,78 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos
-// Manycore Performance-Portable Multidimensional Arrays
-//
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_IMPL_ERROR_HPP
#define KOKKOS_IMPL_ERROR_HPP
#include <string>
#include <iosfwd>
namespace Kokkos {
namespace Impl {
void host_abort( const char * const );
void throw_runtime_exception( const std::string & );
void traceback_callstack( std::ostream & );
std::string human_memory_size(size_t arg_bytes);
}
}
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
namespace Kokkos {
inline
void abort( const char * const message ) { Kokkos::Impl::host_abort(message); }
}
#endif /* defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST ) */
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
#endif /* #ifndef KOKKOS_IMPL_ERROR_HPP */
diff --git a/lib/kokkos/core/src/impl/Kokkos_FunctorAdapter.hpp b/lib/kokkos/core/src/impl/Kokkos_FunctorAdapter.hpp
index fb5add6a7..ff6230b57 100755
--- a/lib/kokkos/core/src/impl/Kokkos_FunctorAdapter.hpp
+++ b/lib/kokkos/core/src/impl/Kokkos_FunctorAdapter.hpp
@@ -1,960 +1,1070 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_FUNCTORADAPTER_HPP
#define KOKKOS_FUNCTORADAPTER_HPP
#include <cstddef>
#include <Kokkos_Core_fwd.hpp>
#include <impl/Kokkos_Traits.hpp>
#include <impl/Kokkos_Tags.hpp>
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
template< class FunctorType , class ArgTag , class Enable = void >
struct FunctorDeclaresValueType : public Impl::false_type {};
template< class FunctorType , class ArgTag >
struct FunctorDeclaresValueType< FunctorType , ArgTag
, typename Impl::enable_if_type< typename FunctorType::value_type >::type >
: public Impl::true_type {};
/** \brief Query Functor and execution policy argument tag for value type.
*
 * If C++11 is enabled and 'value_type' is not explicitly declared then attempt
* to deduce the type from FunctorType::operator().
*/
template< class FunctorType , class ArgTag , bool Dec = FunctorDeclaresValueType<FunctorType,ArgTag>::value >
struct FunctorValueTraits
{
typedef void value_type ;
typedef void pointer_type ;
typedef void reference_type ;
enum { StaticValueSize = 0 };
KOKKOS_FORCEINLINE_FUNCTION static
unsigned value_count( const FunctorType & ) { return 0 ; }
KOKKOS_FORCEINLINE_FUNCTION static
unsigned value_size( const FunctorType & ) { return 0 ; }
};
+template<class ArgTag>
+struct FunctorValueTraits<void, ArgTag,false>
+{
+ typedef void reference_type;
+};
+
/** \brief FunctorType::value_type is explicitly declared so use it.
*
* Two options for declaration
*
* 1) A plain-old-data (POD) type
* typedef {pod_type} value_type ;
*
* 2) An array of POD of a runtime specified count.
* typedef {pod_type} value_type[] ;
* const unsigned value_count ;
*/
template< class FunctorType , class ArgTag >
-struct FunctorValueTraits< FunctorType , ArgTag , true /* exists FunctorType::value_type */ >
+struct FunctorValueTraits< FunctorType , ArgTag , true /* == exists FunctorType::value_type */ >
{
typedef typename Impl::remove_extent< typename FunctorType::value_type >::type value_type ;
// If not an array then what is the sizeof(value_type)
enum { StaticValueSize = Impl::is_array< typename FunctorType::value_type >::value ? 0 : sizeof(value_type) };
typedef value_type * pointer_type ;
// The reference_type for an array is 'value_type *'
// The reference_type for a single value is 'value_type &'
typedef typename Impl::if_c< ! StaticValueSize , value_type *
, value_type & >::type reference_type ;
// Number of values if single value
template< class F >
KOKKOS_FORCEINLINE_FUNCTION static
typename Impl::enable_if< Impl::is_same<F,FunctorType>::value && StaticValueSize , unsigned >::type
value_count( const F & ) { return 1 ; }
// Number of values if an array, protect via templating because 'f.value_count'
// will only exist when the functor declares the value_type to be an array.
template< class F >
KOKKOS_FORCEINLINE_FUNCTION static
typename Impl::enable_if< Impl::is_same<F,FunctorType>::value && ! StaticValueSize , unsigned >::type
value_count( const F & f ) { return f.value_count ; }
// Total size of the value
KOKKOS_INLINE_FUNCTION static
unsigned value_size( const FunctorType & f ) { return value_count( f ) * sizeof(value_type) ; }
};
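// Illustrative sketch only (hypothetical functors): the two declaration
// options described above.
//
//   struct SumFunctor {                       // option 1: scalar POD value
//     typedef double value_type ;
//     KOKKOS_INLINE_FUNCTION
//     void operator()( const int i , double & sum ) const { sum += 1.0 ; }
//   };
//   // FunctorValueTraits<SumFunctor,void>::StaticValueSize == sizeof(double)
//
//   struct HistFunctor {                      // option 2: runtime-sized array
//     typedef long value_type[] ;
//     const unsigned value_count ;            // required member for array values
//     HistFunctor( unsigned n ) : value_count( n ) {}
//     KOKKOS_INLINE_FUNCTION
//     void operator()( const int i , long hist[] ) const { ++hist[ i % value_count ] ; }
//   };
//   // StaticValueSize == 0 and value_count(f) returns f.value_count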
#if defined( KOKKOS_HAVE_CXX11 )
-// If have C++11 and functor does not explicitly specify a value type
-// then try to deduce the value type from FunctorType::operator().
-// Can only deduce single value type since array length cannot be deduced.
-template< class FunctorType >
+template< class FunctorType , class ArgTag >
struct FunctorValueTraits< FunctorType
- , void /* == ArgTag */
- , false /* == exists FunctorType::value_type */
- >
+ , ArgTag
+ , false /* == exists FunctorType::value_type */
+ >
{
private:
- struct VOID {};
+ struct VOIDTAG {}; // Allow declaration of non-matching operator() with void argument tag.
+ struct REJECTTAG {}; // Reject tagged operator() when using non-tagged execution policy.
+
+ typedef typename
+ Impl::if_c< Impl::is_same< ArgTag , void >::value , VOIDTAG , ArgTag >::type tag_type ;
+ //----------------------------------------
// parallel_for operator without a tag:
+
template< class ArgMember >
KOKKOS_INLINE_FUNCTION
- static VOID deduce( void (FunctorType::*)( ArgMember ) const ) {}
+ static VOIDTAG deduce_reduce_type( VOIDTAG , void (FunctorType::*)( ArgMember ) const ) {}
- // parallel_reduce operator without a tag:
- template< class ArgMember , class T >
+ template< class ArgMember >
KOKKOS_INLINE_FUNCTION
- static T deduce( void (FunctorType::*)( ArgMember , T & ) const ) {}
+ static VOIDTAG deduce_reduce_type( VOIDTAG , void (FunctorType::*)( const ArgMember & ) const ) {}
- // parallel_scan operator without a tag:
- template< class ArgMember , class T >
+ template< class TagType , class ArgMember >
KOKKOS_INLINE_FUNCTION
- static T deduce( void (FunctorType::*)( ArgMember , T & , bool ) const ) {}
-
- typedef decltype( deduce( & FunctorType::operator() ) ) ValueType ;
+ static REJECTTAG deduce_reduce_type( VOIDTAG , void (FunctorType::*)( TagType , ArgMember ) const ) {}
- enum { IS_VOID = Impl::is_same<VOID,ValueType>::value };
+ template< class TagType , class ArgMember >
+ KOKKOS_INLINE_FUNCTION
+ static REJECTTAG deduce_reduce_type( VOIDTAG , void (FunctorType::*)( TagType , const ArgMember & ) const ) {}
-public:
+ template< class TagType , class ArgMember >
+ KOKKOS_INLINE_FUNCTION
+ static REJECTTAG deduce_reduce_type( VOIDTAG , void (FunctorType::*)( const TagType & , ArgMember ) const ) {}
- typedef typename Impl::if_c< IS_VOID , void , ValueType >::type value_type ;
- typedef typename Impl::if_c< IS_VOID , void , ValueType * >::type pointer_type ;
- typedef typename Impl::if_c< IS_VOID , void , ValueType & >::type reference_type ;
+ template< class TagType , class ArgMember >
+ KOKKOS_INLINE_FUNCTION
+ static REJECTTAG deduce_reduce_type( VOIDTAG , void (FunctorType::*)( const TagType & , const ArgMember & ) const ) {}
- enum { StaticValueSize = IS_VOID ? 0 : sizeof(ValueType) };
+ //----------------------------------------
+ // parallel_for operator with a tag:
- KOKKOS_FORCEINLINE_FUNCTION static
- unsigned value_size( const FunctorType & ) { return StaticValueSize ; }
+ template< class ArgMember >
+ KOKKOS_INLINE_FUNCTION
+ static VOIDTAG deduce_reduce_type( tag_type , void (FunctorType::*)( tag_type , ArgMember ) const ) {}
- KOKKOS_FORCEINLINE_FUNCTION static
- unsigned value_count( const FunctorType & ) { return IS_VOID ? 0 : 1 ; }
-};
+ template< class ArgMember >
+ KOKKOS_INLINE_FUNCTION
+ static VOIDTAG deduce_reduce_type( tag_type , void (FunctorType::*)( const tag_type & , ArgMember ) const ) {}
+ template< class ArgMember >
+ KOKKOS_INLINE_FUNCTION
+ static VOIDTAG deduce_reduce_type( tag_type , void (FunctorType::*)( tag_type , const ArgMember & ) const ) {}
-template< class FunctorType , class ArgTag >
-struct FunctorValueTraits< FunctorType
- , ArgTag /* != void */
- , false /* == exists FunctorType::value_type */
- >
-{
-private:
+ template< class ArgMember >
+ KOKKOS_INLINE_FUNCTION
+ static VOIDTAG deduce_reduce_type( tag_type , void (FunctorType::*)( const tag_type & , const ArgMember & ) const ) {}
//----------------------------------------
- // parallel_for operator with a tag:
+ // parallel_reduce operator without a tag:
- struct VOID {}; // to allow valid sizeof(ValueType)
+ template< class ArgMember , class T >
+ KOKKOS_INLINE_FUNCTION
+ static T deduce_reduce_type( VOIDTAG , void (FunctorType::*)( ArgMember , T & ) const ) {}
- template< class ArgMember >
+ template< class ArgMember , class T >
KOKKOS_INLINE_FUNCTION
- static VOID deduce( void (FunctorType::*)( ArgTag , ArgMember ) const ) {}
+ static T deduce_reduce_type( VOIDTAG , void (FunctorType::*)( const ArgMember & , T & ) const ) {}
- template< class ArgMember >
+ template< class TagType , class ArgMember , class T >
+ KOKKOS_INLINE_FUNCTION
+ static REJECTTAG deduce_reduce_type( VOIDTAG , void (FunctorType::*)( TagType , ArgMember , T & ) const ) {}
+
+ template< class TagType , class ArgMember , class T >
+ KOKKOS_INLINE_FUNCTION
+ static REJECTTAG deduce_reduce_type( VOIDTAG , void (FunctorType::*)( TagType , const ArgMember & , T & ) const ) {}
+
+ template< class TagType , class ArgMember , class T >
+ KOKKOS_INLINE_FUNCTION
+ static REJECTTAG deduce_reduce_type( VOIDTAG , void (FunctorType::*)( const TagType & , ArgMember , T & ) const ) {}
+
+ template< class TagType , class ArgMember , class T >
KOKKOS_INLINE_FUNCTION
- static VOID deduce( void (FunctorType::*)( const ArgTag & , ArgMember ) const ) {}
+ static REJECTTAG deduce_reduce_type( VOIDTAG , void (FunctorType::*)( const TagType & , const ArgMember & , T & ) const ) {}
//----------------------------------------
// parallel_reduce operator with a tag:
template< class ArgMember , class T >
KOKKOS_INLINE_FUNCTION
- static T deduce( void (FunctorType::*)( ArgTag , ArgMember , T & ) const ) {}
+ static T deduce_reduce_type( tag_type , void (FunctorType::*)( tag_type , ArgMember , T & ) const ) {}
+
+ template< class ArgMember , class T >
+ KOKKOS_INLINE_FUNCTION
+ static T deduce_reduce_type( tag_type , void (FunctorType::*)( const tag_type & , ArgMember , T & ) const ) {}
template< class ArgMember , class T >
KOKKOS_INLINE_FUNCTION
- static T deduce( void (FunctorType::*)( const ArgTag & , ArgMember , T & ) const ) {}
+ static T deduce_reduce_type( tag_type , void (FunctorType::*)( tag_type , const ArgMember & , T & ) const ) {}
+
+ template< class ArgMember , class T >
+ KOKKOS_INLINE_FUNCTION
+ static T deduce_reduce_type( tag_type , void (FunctorType::*)( const tag_type & , const ArgMember & , T & ) const ) {}
+
+ //----------------------------------------
+ // parallel_scan operator without a tag:
+
+ template< class ArgMember , class T >
+ KOKKOS_INLINE_FUNCTION
+ static T deduce_reduce_type( VOIDTAG , void (FunctorType::*)( ArgMember , T & , bool ) const ) {}
+
+ template< class ArgMember , class T >
+ KOKKOS_INLINE_FUNCTION
+ static T deduce_reduce_type( VOIDTAG , void (FunctorType::*)( const ArgMember & , T & , bool ) const ) {}
+
+ template< class TagType , class ArgMember , class T >
+ KOKKOS_INLINE_FUNCTION
+ static REJECTTAG deduce_reduce_type( VOIDTAG , void (FunctorType::*)( TagType , ArgMember , T & , bool ) const ) {}
+
+ template< class TagType , class ArgMember , class T >
+ KOKKOS_INLINE_FUNCTION
+ static REJECTTAG deduce_reduce_type( VOIDTAG , void (FunctorType::*)( TagType , const ArgMember & , T & , bool ) const ) {}
+
+ template< class TagType , class ArgMember , class T >
+ KOKKOS_INLINE_FUNCTION
+ static REJECTTAG deduce_reduce_type( VOIDTAG , void (FunctorType::*)( const TagType & , ArgMember , T & , bool ) const ) {}
+
+ template< class TagType , class ArgMember , class T >
+ KOKKOS_INLINE_FUNCTION
+ static REJECTTAG deduce_reduce_type( VOIDTAG , void (FunctorType::*)( const TagType & , const ArgMember & , T & , bool ) const ) {}
//----------------------------------------
// parallel_scan operator with a tag:
template< class ArgMember , class T >
KOKKOS_INLINE_FUNCTION
- static T deduce( void (FunctorType::*)( ArgTag , ArgMember , T & , bool ) const ) {}
+ static T deduce_reduce_type( tag_type , void (FunctorType::*)( tag_type , ArgMember , T & , bool ) const ) {}
+
+ template< class ArgMember , class T >
+ KOKKOS_INLINE_FUNCTION
+ static T deduce_reduce_type( tag_type , void (FunctorType::*)( const tag_type & , ArgMember , T & , bool ) const ) {}
+
+ template< class ArgMember , class T >
+ KOKKOS_INLINE_FUNCTION
+ static T deduce_reduce_type( tag_type , void (FunctorType::*)( tag_type , const ArgMember& , T & , bool ) const ) {}
template< class ArgMember , class T >
KOKKOS_INLINE_FUNCTION
- static T deduce( void (FunctorType::*)( const ArgTag & , ArgMember , T & , bool ) const ) {}
+ static T deduce_reduce_type( tag_type , void (FunctorType::*)( const tag_type & , const ArgMember& , T & , bool ) const ) {}
//----------------------------------------
- typedef decltype( deduce( & FunctorType::operator() ) ) ValueType ;
+ typedef decltype( deduce_reduce_type( tag_type() , & FunctorType::operator() ) ) ValueType ;
- enum { IS_VOID = Impl::is_same<VOID,ValueType>::value };
+ enum { IS_VOID = Impl::is_same<VOIDTAG ,ValueType>::value };
+ enum { IS_REJECT = Impl::is_same<REJECTTAG,ValueType>::value };
public:
- typedef typename Impl::if_c< IS_VOID , void , ValueType >::type value_type ;
- typedef typename Impl::if_c< IS_VOID , void , ValueType * >::type pointer_type ;
- typedef typename Impl::if_c< IS_VOID , void , ValueType & >::type reference_type ;
+ typedef typename Impl::if_c< IS_VOID || IS_REJECT , void , ValueType >::type value_type ;
+ typedef typename Impl::if_c< IS_VOID || IS_REJECT , void , ValueType * >::type pointer_type ;
+ typedef typename Impl::if_c< IS_VOID || IS_REJECT , void , ValueType & >::type reference_type ;
- enum { StaticValueSize = IS_VOID ? 0 : sizeof(ValueType) };
+ enum { StaticValueSize = IS_VOID || IS_REJECT ? 0 : sizeof(ValueType) };
KOKKOS_FORCEINLINE_FUNCTION static
unsigned value_size( const FunctorType & ) { return StaticValueSize ; }
KOKKOS_FORCEINLINE_FUNCTION static
- unsigned value_count( const FunctorType & ) { return IS_VOID ? 0 : 1 ; }
+ unsigned value_count( const FunctorType & ) { return IS_VOID || IS_REJECT ? 0 : 1 ; }
};
#endif /* #if defined( KOKKOS_HAVE_CXX11 ) */
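// Illustrative sketch only (hypothetical functor): with C++11 and no explicit
// value_type, the traits above deduce the value type from operator().  A
// tagged operator() is rejected (REJECTTAG) when the execution policy carries
// no tag, so mismatched functors degrade to a void value type rather than a
// wrong deduction.
//
//   struct Dot {
//     KOKKOS_INLINE_FUNCTION
//     void operator()( const int i , double & update ) const { update += 1.0 ; }
//   };
//   // FunctorValueTraits<Dot,void>::value_type is deduced as double.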
} // namespace Impl
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
// Function signatures for FunctorType::init function with a tag and not an array
template< class FunctorType , class ArgTag , bool IsArray = 0 == FunctorValueTraits<FunctorType,ArgTag>::StaticValueSize >
struct FunctorValueInitFunction {
typedef typename FunctorValueTraits<FunctorType,ArgTag>::value_type value_type ;
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag , value_type & ) const );
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag const & , value_type & ) const );
KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( ArgTag , value_type & ) );
KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( ArgTag const & , value_type & ) );
// KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag , value_type volatile & ) const );
// KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag const & , value_type volatile & ) const );
// KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( ArgTag , value_type volatile & ) );
// KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( ArgTag const & , value_type volatile & ) );
};
// Function signatures for FunctorType::init function with a tag and is an array
template< class FunctorType , class ArgTag >
struct FunctorValueInitFunction< FunctorType , ArgTag , true > {
typedef typename FunctorValueTraits<FunctorType,ArgTag>::value_type value_type ;
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag , value_type * ) const );
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag const & , value_type * ) const );
KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( ArgTag , value_type * ) );
KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( ArgTag const & , value_type * ) );
// KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag , value_type volatile * ) const );
// KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag const & , value_type volatile * ) const );
// KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( ArgTag , value_type volatile * ) );
// KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( ArgTag const & , value_type volatile * ) );
};
// Function signatures for FunctorType::init function without a tag and not an array
template< class FunctorType >
struct FunctorValueInitFunction< FunctorType , void , false > {
typedef typename FunctorValueTraits<FunctorType,void>::reference_type value_type ;
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( value_type & ) const );
KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( value_type & ) );
// KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( value_type volatile & ) const );
// KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( value_type volatile & ) );
};
// Function signatures for FunctorType::init function without a tag and is an array
template< class FunctorType >
struct FunctorValueInitFunction< FunctorType , void , true > {
typedef typename FunctorValueTraits<FunctorType,void>::reference_type value_type ;
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( value_type * ) const );
KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( value_type * ) );
// KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( value_type volatile * ) const );
// KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( value_type volatile * ) );
};
// Adapter for value initialization function.
// If a proper FunctorType::init is declared then use it,
// otherwise use default constructor.
template< class FunctorType , class ArgTag
, class T = typename FunctorValueTraits<FunctorType,ArgTag>::reference_type
, class Enable = void >
struct FunctorValueInit ;
/* No 'init' function provided for single value */
template< class FunctorType , class ArgTag , class T , class Enable >
struct FunctorValueInit< FunctorType , ArgTag , T & , Enable >
{
KOKKOS_FORCEINLINE_FUNCTION static
T & init( const FunctorType & f , void * p )
{ return *( new(p) T() ); };
};
/* No 'init' function provided for array value */
template< class FunctorType , class ArgTag , class T , class Enable >
struct FunctorValueInit< FunctorType , ArgTag , T * , Enable >
{
KOKKOS_FORCEINLINE_FUNCTION static
T * init( const FunctorType & f , void * p )
{
const int n = FunctorValueTraits< FunctorType , ArgTag >::value_count(f);
for ( int i = 0 ; i < n ; ++i ) { new( ((T*)p) + i ) T(); }
return (T*)p ;
}
};
+/* 'init' function provided for single value */
+template< class FunctorType , class T >
+struct FunctorValueInit
+ < FunctorType
+ , void
+ , T &
+ // First substitution failure when FunctorType::init does not exist.
+#if defined( KOKKOS_HAVE_CXX11 )
+ // Second substitution failure when FunctorType::init is not compatible.
+ , decltype( FunctorValueInitFunction< FunctorType , void >::enable_if( & FunctorType::init ) )
+#else
+ , typename Impl::enable_if< 0 < sizeof( & FunctorType::init ) >::type
+#endif
+ >
+{
+ KOKKOS_FORCEINLINE_FUNCTION static
+ T & init( const FunctorType & f , void * p )
+ { f.init( *((T*)p) ); return *((T*)p) ; }
+};
+
+/* 'init' function provided for array value */
+template< class FunctorType , class T >
+struct FunctorValueInit
+ < FunctorType
+ , void
+ , T *
+ // First substitution failure when FunctorType::init does not exist.
+#if defined( KOKKOS_HAVE_CXX11 )
+ // Second substitution failure when FunctorType::init is not compatible
+ , decltype( FunctorValueInitFunction< FunctorType , void >::enable_if( & FunctorType::init ) )
+#else
+ , typename Impl::enable_if< 0 < sizeof( & FunctorType::init ) >::type
+#endif
+ >
+{
+ KOKKOS_FORCEINLINE_FUNCTION static
+ T * init( const FunctorType & f , void * p )
+ { f.init( (T*)p ); return (T*)p ; }
+};
+
/* 'init' function provided for single value */
template< class FunctorType , class ArgTag , class T >
struct FunctorValueInit
< FunctorType
, ArgTag
, T &
// First substitution failure when FunctorType::init does not exist.
#if defined( KOKKOS_HAVE_CXX11 )
// Second substitution failure when FunctorType::init is not compatible.
, decltype( FunctorValueInitFunction< FunctorType , ArgTag >::enable_if( & FunctorType::init ) )
#else
, typename Impl::enable_if< 0 < sizeof( & FunctorType::init ) >::type
#endif
>
{
KOKKOS_FORCEINLINE_FUNCTION static
T & init( const FunctorType & f , void * p )
- { f.init( *((T*)p) ); return *((T*)p) ; }
+ { f.init( ArgTag() , *((T*)p) ); return *((T*)p) ; }
};
/* 'init' function provided for array value */
template< class FunctorType , class ArgTag , class T >
struct FunctorValueInit
< FunctorType
, ArgTag
, T *
// First substitution failure when FunctorType::init does not exist.
#if defined( KOKKOS_HAVE_CXX11 )
// Second substitution failure when FunctorType::init is not compatible
, decltype( FunctorValueInitFunction< FunctorType , ArgTag >::enable_if( & FunctorType::init ) )
#else
, typename Impl::enable_if< 0 < sizeof( & FunctorType::init ) >::type
#endif
>
{
KOKKOS_FORCEINLINE_FUNCTION static
T * init( const FunctorType & f , void * p )
- { f.init( (T*)p ); return (T*)p ; }
+ { f.init( ArgTag() , (T*)p ); return (T*)p ; }
};
} // namespace Impl
} // namespace Kokkos
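// Illustrative sketch only (hypothetical functor): a reduction whose identity
// is not the default-constructed value provides an init() member, which the
// FunctorValueInit specializations above detect via SFINAE.
//
//   struct MinReduce {
//     typedef double value_type ;
//     KOKKOS_INLINE_FUNCTION
//     void init( value_type & v ) const { v = 1.0e300 ; }   // custom identity
//     KOKKOS_INLINE_FUNCTION
//     void operator()( const int i , value_type & v ) const
//       { if ( double(i) < v ) v = double(i) ; }
//   };
//   // FunctorValueInit<MinReduce,void> calls f.init(); without an init()
//   // member the default-constructed value_type() would be used instead.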
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
// Signatures for compatible FunctorType::join with tag and not an array
template< class FunctorType , class ArgTag , bool IsArray = 0 == FunctorValueTraits<FunctorType,ArgTag>::StaticValueSize >
struct FunctorValueJoinFunction {
typedef typename FunctorValueTraits<FunctorType,ArgTag>::value_type value_type ;
typedef volatile value_type & vref_type ;
typedef const volatile value_type & cvref_type ;
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag , vref_type , cvref_type ) const );
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag const & , vref_type , cvref_type ) const );
KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( ArgTag , vref_type , cvref_type ) );
KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( ArgTag const & , vref_type , cvref_type ) );
};
// Signatures for compatible FunctorType::join with tag and is an array
template< class FunctorType , class ArgTag >
struct FunctorValueJoinFunction< FunctorType , ArgTag , true > {
typedef typename FunctorValueTraits<FunctorType,ArgTag>::value_type value_type ;
typedef volatile value_type * vptr_type ;
typedef const volatile value_type * cvptr_type ;
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag , vptr_type , cvptr_type ) const );
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag const & , vptr_type , cvptr_type ) const );
KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( ArgTag , vptr_type , cvptr_type ) );
KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( ArgTag const & , vptr_type , cvptr_type ) );
};
// Signatures for compatible FunctorType::join without tag and not an array
template< class FunctorType >
struct FunctorValueJoinFunction< FunctorType , void , false > {
typedef typename FunctorValueTraits<FunctorType,void>::value_type value_type ;
typedef volatile value_type & vref_type ;
typedef const volatile value_type & cvref_type ;
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( vref_type , cvref_type ) const );
KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( vref_type , cvref_type ) );
};
// Signatures for compatible FunctorType::join without tag and is an array
template< class FunctorType >
struct FunctorValueJoinFunction< FunctorType , void , true > {
typedef typename FunctorValueTraits<FunctorType,void>::value_type value_type ;
typedef volatile value_type * vptr_type ;
typedef const volatile value_type * cvptr_type ;
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( vptr_type , cvptr_type ) const );
KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( vptr_type , cvptr_type ) );
};
template< class FunctorType , class ArgTag
, class T = typename FunctorValueTraits<FunctorType,ArgTag>::reference_type
, class Enable = void >
struct FunctorValueJoin ;
/* No 'join' function provided, single value */
template< class FunctorType , class ArgTag , class T , class Enable >
struct FunctorValueJoin< FunctorType , ArgTag , T & , Enable >
{
KOKKOS_FORCEINLINE_FUNCTION static
void join( const FunctorType & f , volatile void * const lhs , const volatile void * const rhs )
{
*((volatile T*)lhs) += *((const volatile T*)rhs);
}
};
/* No 'join' function provided, array of values */
template< class FunctorType , class ArgTag , class T , class Enable >
struct FunctorValueJoin< FunctorType , ArgTag , T * , Enable >
{
KOKKOS_FORCEINLINE_FUNCTION static
void join( const FunctorType & f , volatile void * const lhs , const volatile void * const rhs )
{
const int n = FunctorValueTraits<FunctorType,ArgTag>::value_count(f);
for ( int i = 0 ; i < n ; ++i ) { ((volatile T*)lhs)[i] += ((const volatile T*)rhs)[i]; }
}
};
/* 'join' function provided, single value */
template< class FunctorType , class ArgTag , class T >
struct FunctorValueJoin
< FunctorType
, ArgTag
, T &
// First substitution failure when FunctorType::join does not exist.
#if defined( KOKKOS_HAVE_CXX11 )
// Second substitution failure when enable_if( & Functor::join ) does not exist
, decltype( FunctorValueJoinFunction< FunctorType , ArgTag >::enable_if( & FunctorType::join ) )
#else
, typename Impl::enable_if< 0 < sizeof( & FunctorType::join ) >::type
#endif
>
{
KOKKOS_FORCEINLINE_FUNCTION static
void join( const FunctorType & f , volatile void * const lhs , const volatile void * const rhs )
{
f.join( ArgTag() , *((volatile T *)lhs) , *((const volatile T *)rhs) );
}
};
/* 'join' function provided, no tag, single value */
template< class FunctorType , class T >
struct FunctorValueJoin
< FunctorType
, void
, T &
// First substitution failure when FunctorType::join does not exist.
#if defined( KOKKOS_HAVE_CXX11 )
// Second substitution failure when enable_if( & Functor::join ) does not exist
, decltype( FunctorValueJoinFunction< FunctorType , void >::enable_if( & FunctorType::join ) )
#else
, typename Impl::enable_if< 0 < sizeof( & FunctorType::join ) >::type
#endif
>
{
KOKKOS_FORCEINLINE_FUNCTION static
void join( const FunctorType & f , volatile void * const lhs , const volatile void * const rhs )
{
f.join( *((volatile T *)lhs) , *((const volatile T *)rhs) );
}
};
/* 'join' function provided for array value */
template< class FunctorType , class ArgTag , class T >
struct FunctorValueJoin
< FunctorType
, ArgTag
, T *
// First substitution failure when FunctorType::join does not exist.
#if defined( KOKKOS_HAVE_CXX11 )
// Second substitution failure when enable_if( & Functor::join ) does not exist
, decltype( FunctorValueJoinFunction< FunctorType , ArgTag >::enable_if( & FunctorType::join ) )
#else
, typename Impl::enable_if< 0 < sizeof( & FunctorType::join ) >::type
#endif
>
{
KOKKOS_FORCEINLINE_FUNCTION static
void join( const FunctorType & f , volatile void * const lhs , const volatile void * const rhs )
{
f.join( ArgTag() , (volatile T *)lhs , (const volatile T *)rhs );
}
};
/* 'join' function provided, no tag, array value */
template< class FunctorType , class T >
struct FunctorValueJoin
< FunctorType
, void
, T *
// First substitution failure when FunctorType::join does not exist.
#if defined( KOKKOS_HAVE_CXX11 )
// Second substitution failure when enable_if( & Functor::join ) does not exist
, decltype( FunctorValueJoinFunction< FunctorType , void >::enable_if( & FunctorType::join ) )
#else
, typename Impl::enable_if< 0 < sizeof( & FunctorType::join ) >::type
#endif
>
{
KOKKOS_FORCEINLINE_FUNCTION static
void join( const FunctorType & f , volatile void * const lhs , const volatile void * const rhs )
{
f.join( (volatile T *)lhs , (const volatile T *)rhs );
}
};
} // namespace Impl
} // namespace Kokkos
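// Illustrative sketch only (hypothetical functor): a functor that is not a
// plain sum provides a join() member; the FunctorValueJoin specializations
// above detect it, while functors without join() fall back to 'dst += src'.
//
//   struct MaxReduce {
//     typedef double value_type ;
//     KOKKOS_INLINE_FUNCTION
//     void join( volatile value_type & dst , const volatile value_type & src ) const
//       { if ( src > dst ) dst = src ; }
//     KOKKOS_INLINE_FUNCTION
//     void operator()( const int i , value_type & v ) const
//       { if ( double(i) > v ) v = double(i) ; }
//   };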
-#ifdef KOKKOS_HAVE_CXX11
namespace Kokkos {
namespace Impl {
+#if defined( KOKKOS_HAVE_CXX11 )
+
template<typename ValueType, class JoinOp, class Enable = void>
struct JoinLambdaAdapter {
typedef ValueType value_type;
const JoinOp& lambda;
KOKKOS_INLINE_FUNCTION
JoinLambdaAdapter(const JoinOp& lambda_):lambda(lambda_) {}
KOKKOS_INLINE_FUNCTION
void join(volatile value_type& dst, const volatile value_type& src) const {
lambda(dst,src);
}
KOKKOS_INLINE_FUNCTION
void join(value_type& dst, const value_type& src) const {
lambda(dst,src);
}
KOKKOS_INLINE_FUNCTION
void operator() (volatile value_type& dst, const volatile value_type& src) const {
lambda(dst,src);
}
KOKKOS_INLINE_FUNCTION
void operator() (value_type& dst, const value_type& src) const {
lambda(dst,src);
}
};
template<typename ValueType, class JoinOp>
struct JoinLambdaAdapter<ValueType, JoinOp, decltype( FunctorValueJoinFunction< JoinOp , void >::enable_if( & JoinOp::join ) )> {
typedef ValueType value_type;
typedef StaticAssertSame<ValueType,typename JoinOp::value_type> assert_value_types_match;
const JoinOp& lambda;
KOKKOS_INLINE_FUNCTION
JoinLambdaAdapter(const JoinOp& lambda_):lambda(lambda_) {}
KOKKOS_INLINE_FUNCTION
void join(volatile value_type& dst, const volatile value_type& src) const {
lambda.join(dst,src);
}
KOKKOS_INLINE_FUNCTION
void join(value_type& dst, const value_type& src) const {
lambda.join(dst,src);
}
KOKKOS_INLINE_FUNCTION
void operator() (volatile value_type& dst, const volatile value_type& src) const {
lambda.join(dst,src);
}
KOKKOS_INLINE_FUNCTION
void operator() (value_type& dst, const value_type& src) const {
lambda.join(dst,src);
}
};
+#endif
+
template<typename ValueType>
struct JoinAdd {
typedef ValueType value_type;
KOKKOS_INLINE_FUNCTION
JoinAdd() {}
KOKKOS_INLINE_FUNCTION
void join(volatile value_type& dst, const volatile value_type& src) const {
dst+=src;
}
KOKKOS_INLINE_FUNCTION
void operator() (value_type& dst, const value_type& src) const {
dst+=src;
}
KOKKOS_INLINE_FUNCTION
void operator() (volatile value_type& dst, const volatile value_type& src) const {
dst+=src;
}
};
}
}
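// Illustrative sketch only: JoinAdd is the ready-made '+=' join operation,
// and JoinLambdaAdapter gives a lambda (or a functor with a join() member)
// the join()/operator() interface expected by the reduction back-ends.
//
//   Kokkos::Impl::JoinAdd< double > add ;
//   double dst = 1.0 ;
//   const double src = 2.5 ;
//   add( dst , src );   // dst == 3.5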
-#endif
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
template< class FunctorType , class ArgTag
, class T = typename FunctorValueTraits<FunctorType,ArgTag>::reference_type >
struct FunctorValueOps ;
template< class FunctorType , class ArgTag , class T >
struct FunctorValueOps< FunctorType , ArgTag , T & >
{
KOKKOS_FORCEINLINE_FUNCTION static
T * pointer( T & r ) { return & r ; }
KOKKOS_FORCEINLINE_FUNCTION static
T & reference( void * p ) { return *((T*)p); }
KOKKOS_FORCEINLINE_FUNCTION static
void copy( const FunctorType & , void * const lhs , const void * const rhs )
{ *((T*)lhs) = *((const T*)rhs); }
};
/* Value operations for an array of values */
template< class FunctorType , class ArgTag , class T >
struct FunctorValueOps< FunctorType , ArgTag , T * >
{
KOKKOS_FORCEINLINE_FUNCTION static
T * pointer( T * p ) { return p ; }
KOKKOS_FORCEINLINE_FUNCTION static
T * reference( void * p ) { return ((T*)p); }
KOKKOS_FORCEINLINE_FUNCTION static
void copy( const FunctorType & f , void * const lhs , const void * const rhs )
{
const int n = FunctorValueTraits<FunctorType,ArgTag>::value_count(f);
for ( int i = 0 ; i < n ; ++i ) { ((T*)lhs)[i] = ((const T*)rhs)[i]; }
}
};
} // namespace Impl
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
// Compatible functions for 'final' function and value_type not an array
template< class FunctorType , class ArgTag , bool IsArray = 0 == FunctorValueTraits<FunctorType,ArgTag>::StaticValueSize >
struct FunctorFinalFunction {
typedef typename FunctorValueTraits<FunctorType,ArgTag>::value_type value_type ;
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag , value_type & ) const );
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag const & , value_type & ) const );
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag , value_type & ) );
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag const & , value_type & ) );
KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( ArgTag , value_type & ) );
KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( ArgTag const & , value_type & ) );
// KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag , value_type volatile & ) const );
// KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag const & , value_type volatile & ) const );
// KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag , value_type volatile & ) );
// KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag const & , value_type volatile & ) );
// KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( ArgTag , value_type volatile & ) );
// KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( ArgTag const & , value_type volatile & ) );
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag , value_type const & ) const );
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag const & , value_type const & ) const );
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag , value_type const & ) );
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag const & , value_type const & ) );
KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( ArgTag , value_type const & ) );
KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( ArgTag const & , value_type const & ) );
// KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag , value_type const volatile & ) const );
// KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag const & , value_type const volatile & ) const );
// KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag , value_type const volatile & ) );
// KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag const & , value_type const volatile & ) );
// KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( ArgTag , value_type const volatile & ) );
// KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( ArgTag const & , value_type const volatile & ) );
};
// Compatible functions for 'final' function and value_type is an array
template< class FunctorType , class ArgTag >
struct FunctorFinalFunction< FunctorType , ArgTag , true > {
typedef typename FunctorValueTraits<FunctorType,ArgTag>::value_type value_type ;
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag , value_type * ) const );
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag const & , value_type * ) const );
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag , value_type * ) );
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag const & , value_type * ) );
KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( ArgTag , value_type * ) );
KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( ArgTag const & , value_type * ) );
// KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag , value_type volatile * ) const );
// KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag const & , value_type volatile * ) const );
// KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag , value_type volatile * ) );
// KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag const & , value_type volatile * ) );
// KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( ArgTag , value_type volatile * ) );
// KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( ArgTag const & , value_type volatile * ) );
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag , value_type const * ) const );
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag const & , value_type const * ) const );
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag , value_type const * ) );
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag const & , value_type const * ) );
KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( ArgTag , value_type const * ) );
KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( ArgTag const & , value_type const * ) );
// KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag , value_type const volatile * ) const );
// KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag const & , value_type const volatile * ) const );
// KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag , value_type const volatile * ) );
// KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag const & , value_type const volatile * ) );
// KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( ArgTag , value_type const volatile * ) );
// KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( ArgTag const & , value_type const volatile * ) );
};
template< class FunctorType >
struct FunctorFinalFunction< FunctorType , void , false > {
typedef typename FunctorValueTraits<FunctorType,void>::value_type value_type ;
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( value_type & ) const );
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( value_type & ) );
KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( value_type & ) );
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( const value_type & ) const );
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( const value_type & ) );
KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( const value_type & ) );
};
template< class FunctorType >
struct FunctorFinalFunction< FunctorType , void , true > {
typedef typename FunctorValueTraits<FunctorType,void>::value_type value_type ;
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( value_type * ) const );
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( value_type * ) );
KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( value_type * ) );
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( const value_type * ) const );
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( const value_type * ) );
KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( const value_type * ) );
};
/* No 'final' function provided */
template< class FunctorType , class ArgTag
, class ResultType = typename FunctorValueTraits<FunctorType,ArgTag>::reference_type
, class Enable = void >
struct FunctorFinal
{
KOKKOS_FORCEINLINE_FUNCTION static
void final( const FunctorType & , void * ) {}
};
/* 'final' function provided */
template< class FunctorType , class ArgTag , class T >
struct FunctorFinal
< FunctorType
, ArgTag
, T &
// First substitution failure when FunctorType::final does not exist.
#if defined( KOKKOS_HAVE_CXX11 )
// Second substitution failure when enable_if( & Functor::final ) does not exist
, decltype( FunctorFinalFunction< FunctorType , ArgTag >::enable_if( & FunctorType::final ) )
#else
, typename Impl::enable_if< 0 < sizeof( & FunctorType::final ) >::type
#endif
>
{
KOKKOS_FORCEINLINE_FUNCTION static
void final( const FunctorType & f , void * p ) { f.final( *((T*)p) ); }
KOKKOS_FORCEINLINE_FUNCTION static
void final( FunctorType & f , void * p ) { f.final( *((T*)p) ); }
};
/* 'final' function provided for array value */
template< class FunctorType , class ArgTag , class T >
struct FunctorFinal
< FunctorType
, ArgTag
, T *
// First substitution failure when FunctorType::final does not exist.
#if defined( KOKKOS_HAVE_CXX11 )
// Second substitution failure when enable_if( & Functor::final ) does not exist
, decltype( FunctorFinalFunction< FunctorType , ArgTag >::enable_if( & FunctorType::final ) )
#else
, typename Impl::enable_if< 0 < sizeof( & FunctorType::final ) >::type
#endif
>
{
KOKKOS_FORCEINLINE_FUNCTION static
void final( const FunctorType & f , void * p ) { f.final( (T*)p ); }
KOKKOS_FORCEINLINE_FUNCTION static
void final( FunctorType & f , void * p ) { f.final( (T*)p ); }
};
} // namespace Impl
} // namespace Kokkos
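// Illustrative sketch only (hypothetical functor): a final() member lets a
// functor post-process the reduced value once, after all joins; the
// FunctorFinal specializations above detect it, and the default is a no-op.
//
//   struct MeanFunctor {
//     typedef double value_type ;
//     int count ;
//     MeanFunctor( int n ) : count( n ) {}
//     KOKKOS_INLINE_FUNCTION
//     void operator()( const int i , value_type & sum ) const { sum += 1.0 ; }
//     KOKKOS_INLINE_FUNCTION
//     void final( value_type & sum ) const { sum /= count ; }
//   };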
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
template< class FunctorType , class ArgTag
, class ReferenceType = typename FunctorValueTraits<FunctorType,ArgTag>::reference_type >
struct FunctorApplyFunction {
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag , ReferenceType ) const );
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag const & , ReferenceType ) const );
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag , ReferenceType ) );
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ArgTag const & , ReferenceType ) );
KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( ArgTag , ReferenceType ) );
KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( ArgTag const & , ReferenceType ) );
};
template< class FunctorType , class ReferenceType >
struct FunctorApplyFunction< FunctorType , void , ReferenceType > {
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ReferenceType ) const );
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)( ReferenceType ) );
KOKKOS_INLINE_FUNCTION static void enable_if( void ( *)( ReferenceType ) );
};
template< class FunctorType >
struct FunctorApplyFunction< FunctorType , void , void > {
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)() const );
KOKKOS_INLINE_FUNCTION static void enable_if( void (FunctorType::*)() );
};
template< class FunctorType , class ArgTag , class ReferenceType
, class Enable = void >
struct FunctorApply
{
KOKKOS_FORCEINLINE_FUNCTION static
void apply( const FunctorType & , void * ) {}
};
/* 'apply' function provided for void value */
template< class FunctorType , class ArgTag >
struct FunctorApply
< FunctorType
, ArgTag
, void
// First substitution failure when FunctorType::apply does not exist.
#if defined( KOKKOS_HAVE_CXX11 )
// Second substitution failure when enable_if( & Functor::apply ) does not exist
, decltype( FunctorApplyFunction< FunctorType , ArgTag , void >::enable_if( & FunctorType::apply ) )
#else
, typename Impl::enable_if< 0 < sizeof( & FunctorType::apply ) >::type
#endif
>
{
KOKKOS_FORCEINLINE_FUNCTION static
void apply( FunctorType & f ) { f.apply(); }
KOKKOS_FORCEINLINE_FUNCTION static
void apply( const FunctorType & f ) { f.apply(); }
};
/* 'apply' function provided for single value */
template< class FunctorType , class ArgTag , class T >
struct FunctorApply
< FunctorType
, ArgTag
, T &
// First substitution failure when FunctorType::apply does not exist.
#if defined( KOKKOS_HAVE_CXX11 )
// Second substitution failure when enable_if( & Functor::apply ) does not exist
, decltype( FunctorApplyFunction< FunctorType , ArgTag >::enable_if( & FunctorType::apply ) )
#else
, typename Impl::enable_if< 0 < sizeof( & FunctorType::apply ) >::type
#endif
>
{
KOKKOS_FORCEINLINE_FUNCTION static
void apply( const FunctorType & f , void * p ) { f.apply( *((T*)p) ); }
KOKKOS_FORCEINLINE_FUNCTION static
void apply( FunctorType & f , void * p ) { f.apply( *((T*)p) ); }
};
} // namespace Impl
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
#endif /* KOKKOS_FUNCTORADAPTER_HPP */
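The FunctorFinal and FunctorApply specializations above are selected only when the functor actually declares the optional final()/apply() member: substitution fails once on the member-pointer expression and once on the enable_if overload match, and the generic do-nothing template is used instead. A minimal stand-alone sketch of the same C++11 detection idiom follows; the names HasFinalFunction, CallFinal, WithFinal are hypothetical and not part of Kokkos.

#include <iostream>

// Overload set used only inside decltype(); never defined or called.
template< class F >
struct HasFinalFunction {
  static void enable_if( void (F::*)( double * ) );
  static void enable_if( void (F::*)( double * ) const );
};

// Primary template: functor has no 'final' member, do nothing.
template< class F , class Enable = void >
struct CallFinal {
  static void call( const F & , double * ) {}
};

// Specialization selected only when &F::final matches one of the
// enable_if overloads above (otherwise substitution fails and the
// primary template is used).
template< class F >
struct CallFinal< F , decltype( HasFinalFunction<F>::enable_if( &F::final ) ) > {
  static void call( const F & f , double * p ) { f.final( p ); }
};

struct WithFinal {
  void final( double * p ) const { *p = 42.0; }
};
struct WithoutFinal {};

int main() {
  double x = 0.0;
  CallFinal<WithFinal>::call( WithFinal() , &x );       // calls final: x == 42
  CallFinal<WithoutFinal>::call( WithoutFinal() , &x ); // silently a no-op
  std::cout << x << std::endl;
  return 0;
}

The second template parameter of CallFinal plays the role of the Enable slot in the adapters above: it collapses to void when the member exists and removes the specialization from consideration otherwise.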
diff --git a/lib/kokkos/core/src/impl/Kokkos_HostSpace.cpp b/lib/kokkos/core/src/impl/Kokkos_HostSpace.cpp
index ecb779a4c..5c6a5b03b 100755
--- a/lib/kokkos/core/src/impl/Kokkos_HostSpace.cpp
+++ b/lib/kokkos/core/src/impl/Kokkos_HostSpace.cpp
@@ -1,271 +1,455 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
-#include <memory.h>
-#include <stddef.h>
-#include <stdlib.h>
-#include <iostream>
-#include <sstream>
-#include <cstring>
-#include <Kokkos_HostSpace.hpp>
-#include <impl/Kokkos_MemoryTracking.hpp>
-#include <impl/Kokkos_Error.hpp>
+#include <Kokkos_Macros.hpp>
/*--------------------------------------------------------------------------*/
-namespace Kokkos {
-namespace Impl {
-namespace {
+#if defined( __INTEL_COMPILER ) && ! defined ( KOKKOS_HAVE_CUDA )
-Impl::MemoryTracking<> & host_space_singleton()
-{
- static Impl::MemoryTracking<> self("Kokkos::HostSpace");
- return self ;
-}
+// Intel specialized allocator does not interoperate with CUDA memory allocation
-} // namespace <blank>
-} // namespace Impl
-} // namespace Kokkos
+#define KOKKOS_INTEL_MM_ALLOC_AVAILABLE
-/*--------------------------------------------------------------------------*/
-
-namespace Kokkos {
-namespace Impl {
-
-void * host_allocate_not_thread_safe( const std::string & label , const size_t size )
-{
- void * ptr = 0 ;
+#endif
- if ( size ) {
- size_t size_padded = size ;
- void * ptr_alloc = 0 ;
+/*--------------------------------------------------------------------------*/
-#if defined( __INTEL_COMPILER ) && !defined ( KOKKOS_HAVE_CUDA )
+#if ( defined( _POSIX_C_SOURCE ) && _POSIX_C_SOURCE >= 200112L ) || \
+ ( defined( _XOPEN_SOURCE ) && _XOPEN_SOURCE >= 600 )
- ptr = ptr_alloc = _mm_malloc( size , MEMORY_ALIGNMENT );
+#define KOKKOS_POSIX_MEMALIGN_AVAILABLE
-#elif ( defined( _POSIX_C_SOURCE ) && _POSIX_C_SOURCE >= 200112L ) || \
- ( defined( _XOPEN_SOURCE ) && _XOPEN_SOURCE >= 600 )
+#include <unistd.h>
+#include <sys/mman.h>
- posix_memalign( & ptr_alloc , MEMORY_ALIGNMENT , size );
- ptr = ptr_alloc ;
+/* mmap flags for private anonymous memory allocation */
-#else
-
- {
- // Over-allocate to and round up to guarantee proper alignment.
+#if defined( MAP_ANONYMOUS ) && defined( MAP_PRIVATE )
+ #define KOKKOS_POSIX_MMAP_FLAGS (MAP_PRIVATE | MAP_ANONYMOUS)
+#elif defined( MAP_ANON ) && defined( MAP_PRIVATE )
+ #define KOKKOS_POSIX_MMAP_FLAGS (MAP_PRIVATE | MAP_ANON)
+#endif
- size_padded = ( size + MEMORY_ALIGNMENT - 1 );
+// mmap flags for huge page tables
+#if defined( KOKKOS_POSIX_MMAP_FLAGS )
+ #if defined( MAP_HUGETLB )
+ #define KOKKOS_POSIX_MMAP_FLAGS_HUGE (KOKKOS_POSIX_MMAP_FLAGS | MAP_HUGETLB )
+ #else
+ #define KOKKOS_POSIX_MMAP_FLAGS_HUGE KOKKOS_POSIX_MMAP_FLAGS
+ #endif
+#endif
- ptr_alloc = malloc( size_padded );
+#endif
- const size_t rem = reinterpret_cast<ptrdiff_t>(ptr_alloc) % MEMORY_ALIGNMENT ;
+/*--------------------------------------------------------------------------*/
- ptr = static_cast<unsigned char *>(ptr_alloc) + ( rem ? MEMORY_ALIGNMENT - rem : 0 );
- }
+#include <stddef.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <memory.h>
-#endif
+#include <iostream>
+#include <sstream>
+#include <cstring>
- if ( ptr_alloc && ptr_alloc <= ptr &&
- 0 == ( reinterpret_cast<ptrdiff_t>(ptr) % MEMORY_ALIGNMENT ) ) {
- // Insert allocated pointer and allocation count
- Impl::host_space_singleton().insert( label , ptr_alloc , size_padded );
- }
- else {
- std::ostringstream msg ;
- msg << "Kokkos::Impl::host_allocate_not_thread_safe( "
- << label
- << " , " << size
- << " ) FAILED aligned memory allocation" ;
- Kokkos::Impl::throw_runtime_exception( msg.str() );
- }
- }
+#include <Kokkos_HostSpace.hpp>
+#include <impl/Kokkos_BasicAllocators.hpp>
+#include <impl/Kokkos_Error.hpp>
+#include <Kokkos_Atomic.hpp>
- return ptr ;
-}
+/*--------------------------------------------------------------------------*/
-void host_decrement_not_thread_safe( const void * ptr )
-{
- void * ptr_alloc = Impl::host_space_singleton().decrement( ptr );
+namespace Kokkos {
+namespace Impl {
- if ( ptr_alloc ) {
-#if defined( __INTEL_COMPILER ) && !defined ( KOKKOS_HAVE_CUDA )
- _mm_free( ptr_alloc );
-#else
- free( ptr_alloc );
-#endif
- }
-}
DeepCopy<HostSpace,HostSpace>::DeepCopy( void * dst , const void * src , size_t n )
{
memcpy( dst , src , n );
}
}
}
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace {
static const int QUERY_SPACE_IN_PARALLEL_MAX = 16 ;
typedef int (* QuerySpaceInParallelPtr )();
QuerySpaceInParallelPtr s_in_parallel_query[ QUERY_SPACE_IN_PARALLEL_MAX ] ;
int s_in_parallel_query_count = 0 ;
} // namespace <empty>
void HostSpace::register_in_parallel( int (*device_in_parallel)() )
{
if ( 0 == device_in_parallel ) {
Kokkos::Impl::throw_runtime_exception( std::string("Kokkos::HostSpace::register_in_parallel ERROR : given NULL" ) );
}
int i = -1 ;
if ( ! (device_in_parallel)() ) {
for ( i = 0 ; i < s_in_parallel_query_count && ! (*(s_in_parallel_query[i]))() ; ++i );
}
if ( i < s_in_parallel_query_count ) {
Kokkos::Impl::throw_runtime_exception( std::string("Kokkos::HostSpace::register_in_parallel_query ERROR : called in_parallel" ) );
}
if ( QUERY_SPACE_IN_PARALLEL_MAX <= i ) {
Kokkos::Impl::throw_runtime_exception( std::string("Kokkos::HostSpace::register_in_parallel_query ERROR : exceeded maximum" ) );
}
for ( i = 0 ; i < s_in_parallel_query_count && s_in_parallel_query[i] != device_in_parallel ; ++i );
if ( i == s_in_parallel_query_count ) {
s_in_parallel_query[s_in_parallel_query_count++] = device_in_parallel ;
}
}
int HostSpace::in_parallel()
{
const int n = s_in_parallel_query_count ;
int i = 0 ;
while ( i < n && ! (*(s_in_parallel_query[i]))() ) { ++i ; }
return i < n ;
}
} // namespace Kokkos
/*--------------------------------------------------------------------------*/
namespace Kokkos {
-void * HostSpace::allocate( const std::string & label , const size_t size )
+Impl::AllocationTracker HostSpace::allocate_and_track( const std::string & label, const size_t size )
{
- void * ptr = 0 ;
+ return Impl::AllocationTracker( allocator(), size, label );
+}
- if ( ! HostSpace::in_parallel() ) {
- ptr = Impl::host_allocate_not_thread_safe( label , size );
- }
- else {
- Kokkos::Impl::throw_runtime_exception( std::string("Kokkos::HostSpace::allocate called within a parallel functor") );
- }
+} // namespace Kokkos
- return ptr ;
-}
+/*--------------------------------------------------------------------------*/
-void HostSpace::increment( const void * ptr )
+namespace Kokkos {
+
+/* Default allocation mechanism */
+HostSpace::HostSpace()
+ : m_alloc_mech(
+#if defined( KOKKOS_INTEL_MM_ALLOC_AVAILABLE )
+ HostSpace::INTEL_MM_ALLOC
+#elif defined( KOKKOS_POSIX_MMAP_FLAGS )
+ HostSpace::POSIX_MMAP
+#elif defined( KOKKOS_POSIX_MEMALIGN_AVAILABLE )
+ HostSpace::POSIX_MEMALIGN
+#else
+ HostSpace::STD_MALLOC
+#endif
+ )
+{}
+
+/* Default allocation mechanism */
+HostSpace::HostSpace( const HostSpace::AllocationMechanism & arg_alloc_mech )
+ : m_alloc_mech( HostSpace::STD_MALLOC )
{
- if ( ! HostSpace::in_parallel() ) {
- Impl::host_space_singleton().increment( ptr );
+ if ( arg_alloc_mech == STD_MALLOC ) {
+ m_alloc_mech = HostSpace::STD_MALLOC ;
+ }
+#if defined( KOKKOS_INTEL_MM_ALLOC_AVAILABLE )
+ else if ( arg_alloc_mech == HostSpace::INTEL_MM_ALLOC ) {
+ m_alloc_mech = HostSpace::INTEL_MM_ALLOC ;
+ }
+#elif defined( KOKKOS_POSIX_MEMALIGN_AVAILABLE )
+ else if ( arg_alloc_mech == HostSpace::POSIX_MEMALIGN ) {
+ m_alloc_mech = HostSpace::POSIX_MEMALIGN ;
+ }
+#elif defined( KOKKOS_POSIX_MMAP_FLAGS )
+ else if ( arg_alloc_mech == HostSpace::POSIX_MMAP ) {
+ m_alloc_mech = HostSpace::POSIX_MMAP ;
}
+#endif
else {
- Kokkos::Impl::throw_runtime_exception( std::string("Kokkos::HostSpace::increment called within a parallel functor") );
+ const char * const mech =
+ ( arg_alloc_mech == HostSpace::INTEL_MM_ALLOC ) ? "INTEL_MM_ALLOC" : (
+ ( arg_alloc_mech == HostSpace::POSIX_MEMALIGN ) ? "POSIX_MEMALIGN" : (
+ ( arg_alloc_mech == HostSpace::POSIX_MMAP ) ? "POSIX_MMAP" : "" ));
+
+ std::string msg ;
+ msg.append("Kokkos::HostSpace ");
+ msg.append(mech);
+ msg.append(" is not available" );
+ Kokkos::Impl::throw_runtime_exception( msg );
}
}
-void HostSpace::decrement( const void * ptr )
+void * HostSpace::allocate( const size_t arg_alloc_size ) const
{
- if ( ! HostSpace::in_parallel() ) {
- Impl::host_decrement_not_thread_safe( ptr );
+ static_assert( sizeof(void*) == sizeof(uintptr_t)
+ , "Error sizeof(void*) != sizeof(uintptr_t)" );
+
+ static_assert( Kokkos::Impl::power_of_two< Kokkos::Impl::MEMORY_ALIGNMENT >::value
+ , "Memory alignment must be power of two" );
+
+ constexpr size_t alignment = Kokkos::Impl::MEMORY_ALIGNMENT ;
+ constexpr size_t alignment_mask = alignment - 1 ;
+
+ void * ptr = NULL;
+
+ if ( arg_alloc_size ) {
+
+ if ( m_alloc_mech == STD_MALLOC ) {
+ // Over-allocate to and round up to guarantee proper alignment.
+ size_t size_padded = arg_alloc_size + sizeof(void*) + alignment ;
+
+ void * alloc_ptr = malloc( size_padded );
+
+ if (alloc_ptr) {
+ uintptr_t address = reinterpret_cast<uintptr_t>(alloc_ptr);
+
+ // offset enough to record the alloc_ptr
+ address += sizeof(void *);
+ uintptr_t rem = address % alignment;
+ uintptr_t offset = rem ? (alignment - rem) : 0u;
+ address += offset;
+ ptr = reinterpret_cast<void *>(address);
+ // record the alloc'd pointer
+ address -= sizeof(void *);
+ *reinterpret_cast<void **>(address) = alloc_ptr;
+ }
+ }
+
+#if defined( KOKKOS_INTEL_MM_ALLOC_AVAILABLE )
+ else if ( m_alloc_mech == INTEL_MM_ALLOC ) {
+ ptr = _mm_malloc( arg_alloc_size , alignment );
+ }
+#endif
+
+#if defined( KOKKOS_POSIX_MEMALIGN_AVAILABLE )
+ else if ( m_alloc_mech == POSIX_MEMALIGN ) {
+ posix_memalign( & ptr, alignment , arg_alloc_size );
+ }
+#endif
+
+#if defined( KOKKOS_POSIX_MMAP_FLAGS )
+ else if ( m_alloc_mech == POSIX_MMAP ) {
+ constexpr size_t use_huge_pages = (1u << 27);
+ constexpr int prot = PROT_READ | PROT_WRITE ;
+ const int flags = arg_alloc_size < use_huge_pages
+ ? KOKKOS_POSIX_MMAP_FLAGS
+ : KOKKOS_POSIX_MMAP_FLAGS_HUGE ;
+
+ // read write access to private memory
+
+ ptr = mmap( NULL /* address hint, if NULL OS kernel chooses address */
+ , arg_alloc_size /* size in bytes */
+ , prot /* memory protection */
+ , flags /* visibility of updates */
+ , -1 /* file descriptor */
+ , 0 /* offset */
+ );
+
+/* Associated reallocation:
+ ptr = mremap( old_ptr , old_size , new_size , MREMAP_MAYMOVE );
+*/
+ }
+#endif
}
- else {
- Kokkos::Impl::throw_runtime_exception( std::string("Kokkos::HostSpace::decrement called within a parallel functor") );
+
+ if ( reinterpret_cast<uintptr_t>(ptr) & alignment_mask ) {
+ Kokkos::Impl::throw_runtime_exception( "Kokkos::HostSpace aligned allocation failed" );
}
+
+ return ptr;
}
-int HostSpace::count( const void * ptr ) {
- if ( ! HostSpace::in_parallel() ) {
- Impl::MemoryTracking<>::Entry * const entry =
- Impl::host_space_singleton().query(ptr);
- return entry != NULL?entry->count():0;
- }
- else {
- Kokkos::Impl::throw_runtime_exception( std::string("Kokkos::HostSpace::count called within a parallel functor") );
- return -1;
+
+void HostSpace::deallocate( void * const arg_alloc_ptr , const size_t arg_alloc_size ) const
+{
+ if ( arg_alloc_ptr ) {
+
+ if ( m_alloc_mech == STD_MALLOC ) {
+ void * alloc_ptr = *(reinterpret_cast<void **>(arg_alloc_ptr) -1);
+ free( alloc_ptr );
+ }
+
+#if defined( KOKKOS_INTEL_MM_ALLOC_AVAILABLE )
+ else if ( m_alloc_mech == INTEL_MM_ALLOC ) {
+ _mm_free( arg_alloc_ptr );
+ }
+#endif
+
+#if defined( KOKKOS_POSIX_MEMALIGN_AVAILABLE )
+ else if ( m_alloc_mech == POSIX_MEMALIGN ) {
+ free( arg_alloc_ptr );
+ }
+#endif
+
+#if defined( KOKKOS_POSIX_MMAP_FLAGS )
+ else if ( m_alloc_mech == POSIX_MMAP ) {
+ munmap( arg_alloc_ptr , arg_alloc_size );
+ }
+#endif
+
}
}
-void HostSpace::print_memory_view( std::ostream & o )
+} // namespace Kokkos
+
+namespace Kokkos {
+namespace Experimental {
+namespace Impl {
+
+SharedAllocationRecord< void , void >
+SharedAllocationRecord< Kokkos::HostSpace , void >::s_root_record ;
+
+void
+SharedAllocationRecord< Kokkos::HostSpace , void >::
+deallocate( SharedAllocationRecord< void , void > * arg_rec )
+{
+ delete static_cast<SharedAllocationRecord*>(arg_rec);
+}
+
+SharedAllocationRecord< Kokkos::HostSpace , void >::
+~SharedAllocationRecord()
{
- Impl::host_space_singleton().print( o , std::string(" ") );
+ m_space.deallocate( SharedAllocationRecord< void , void >::m_alloc_ptr
+ , SharedAllocationRecord< void , void >::m_alloc_size
+ );
+}
+
+SharedAllocationRecord< Kokkos::HostSpace , void >::
+SharedAllocationRecord( const Kokkos::HostSpace & arg_space
+ , const std::string & arg_label
+ , const size_t arg_alloc_size
+ , const SharedAllocationRecord< void , void >::function_type arg_dealloc
+ )
+ // Pass through allocated [ SharedAllocationHeader , user_memory ]
+ // Pass through deallocation function
+ : SharedAllocationRecord< void , void >
+ ( & SharedAllocationRecord< Kokkos::HostSpace , void >::s_root_record
+ , reinterpret_cast<SharedAllocationHeader*>( arg_space.allocate( sizeof(SharedAllocationHeader) + arg_alloc_size ) )
+ , sizeof(SharedAllocationHeader) + arg_alloc_size
+ , arg_dealloc
+ )
+ , m_space( arg_space )
+{
+ // Fill in the Header information
+ RecordBase::m_alloc_ptr->m_record = static_cast< SharedAllocationRecord< void , void > * >( this );
+
+ strncpy( RecordBase::m_alloc_ptr->m_label
+ , arg_label.c_str()
+ , SharedAllocationHeader::maximum_label_length
+ );
}
-std::string HostSpace::query_label( const void * p )
+SharedAllocationRecord< Kokkos::HostSpace , void > *
+SharedAllocationRecord< Kokkos::HostSpace , void >::get_record( void * alloc_ptr )
{
- Impl::MemoryTracking<>::Entry * const entry = Impl::host_space_singleton().query(p);
- return std::string( entry ? entry->label() : "<NOT ALLOCATED>" );
+ typedef SharedAllocationHeader Header ;
+ typedef SharedAllocationRecord< Kokkos::HostSpace , void > RecordHost ;
+
+ SharedAllocationHeader const * const head = Header::get_header( alloc_ptr );
+ RecordHost * const record = static_cast< RecordHost * >( head->m_record );
+
+ if ( record->m_alloc_ptr != head ) {
+ Kokkos::Impl::throw_runtime_exception( std::string("Kokkos::Experimental::Impl::SharedAllocationRecord< Kokkos::HostSpace , void >::get_record ERROR" ) );
+ }
+
+ return record ;
}
+// Iterate records to print orphaned memory ...
+void SharedAllocationRecord< Kokkos::HostSpace , void >::
+print_records( std::ostream & s , const Kokkos::HostSpace & space , bool detail )
+{
+ SharedAllocationRecord< void , void >::print_host_accessible_records( s , "HostSpace" , & s_root_record , detail );
+}
+
+} // namespace Impl
+} // namespace Experimental
} // namespace Kokkos
/*--------------------------------------------------------------------------*/
/*--------------------------------------------------------------------------*/
+namespace Kokkos {
+namespace {
+ const unsigned HOST_SPACE_ATOMIC_MASK = 0xFFFF;
+ const unsigned HOST_SPACE_ATOMIC_XOR_MASK = 0x5A39;
+ static int HOST_SPACE_ATOMIC_LOCKS[HOST_SPACE_ATOMIC_MASK+1];
+}
+
+namespace Impl {
+void init_lock_array_host_space() {
+ static int is_initialized = 0;
+ if(! is_initialized)
+ for(int i = 0; i < static_cast<int> (HOST_SPACE_ATOMIC_MASK+1); i++)
+ HOST_SPACE_ATOMIC_LOCKS[i] = 0;
+}
+
+bool lock_address_host_space(void* ptr) {
+ return 0 == atomic_compare_exchange( &HOST_SPACE_ATOMIC_LOCKS[
+ (( size_t(ptr) >> 2 ) & HOST_SPACE_ATOMIC_MASK) ^ HOST_SPACE_ATOMIC_XOR_MASK] ,
+ 0 , 1);
+}
+
+void unlock_address_host_space(void* ptr) {
+ atomic_exchange( &HOST_SPACE_ATOMIC_LOCKS[
+ (( size_t(ptr) >> 2 ) & HOST_SPACE_ATOMIC_MASK) ^ HOST_SPACE_ATOMIC_XOR_MASK] ,
+ 0);
+}
+
+}
+}
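A detail of the rewritten HostSpace.cpp above that is easy to miss: the STD_MALLOC branch of allocate() stashes the raw malloc() pointer in the word immediately preceding the aligned address it hands out, and deallocate() reads it back from there, so no side table is needed. A self-contained sketch of that header-before-the-block pattern, with a fixed 64-byte alignment chosen only for illustration:

#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Over-allocate, then round up to the requested alignment while keeping
// room to stash the original pointer right before the aligned block.
static void * aligned_alloc_with_header( size_t size , size_t alignment )
{
  void * raw = std::malloc( size + sizeof(void*) + alignment );
  if ( ! raw ) return nullptr;

  uintptr_t address = reinterpret_cast<uintptr_t>( raw ) + sizeof(void*);
  const uintptr_t rem = address % alignment;
  address += rem ? ( alignment - rem ) : 0u;

  // Record the pointer returned by malloc just below the aligned address.
  reinterpret_cast<void**>( address )[-1] = raw;
  return reinterpret_cast<void*>( address );
}

static void aligned_free_with_header( void * ptr )
{
  if ( ptr ) std::free( reinterpret_cast<void**>( ptr )[-1] );
}

int main()
{
  void * p = aligned_alloc_with_header( 1000 , 64 );
  std::printf( "aligned: %d\n" , int( reinterpret_cast<uintptr_t>(p) % 64 == 0 ) );
  aligned_free_with_header( p );
  return 0;
}

The other mechanisms (_mm_malloc, posix_memalign, mmap) return pointers that can be released directly, which is why only the STD_MALLOC path in the diff needs the hidden header word.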
diff --git a/lib/kokkos/core/src/impl/Kokkos_MemoryTracking.hpp b/lib/kokkos/core/src/impl/Kokkos_MemoryTracking.hpp
deleted file mode 100755
index 3883fc130..000000000
--- a/lib/kokkos/core/src/impl/Kokkos_MemoryTracking.hpp
+++ /dev/null
@@ -1,374 +0,0 @@
-/*
-//@HEADER
-// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
-// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
-// the U.S. Government retains certain rights in this software.
-//
-// Redistribution and use in source and binary forms, with or without
-// modification, are permitted provided that the following conditions are
-// met:
-//
-// 1. Redistributions of source code must retain the above copyright
-// notice, this list of conditions and the following disclaimer.
-//
-// 2. Redistributions in binary form must reproduce the above copyright
-// notice, this list of conditions and the following disclaimer in the
-// documentation and/or other materials provided with the distribution.
-//
-// 3. Neither the name of the Corporation nor the names of the
-// contributors may be used to endorse or promote products derived from
-// this software without specific prior written permission.
-//
-// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
-// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
-// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
-// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
-// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
-// ************************************************************************
-//@HEADER
-*/
-
-#ifndef KOKKOS_MEMORY_TRACKING_HPP
-#define KOKKOS_MEMORY_TRACKING_HPP
-
-#include <cstddef>
-#include <cstring>
-#include <limits>
-#include <utility>
-#include <vector>
-#include <string>
-#include <sstream>
-#include <iostream>
-
-#include <impl/Kokkos_Error.hpp>
-
-namespace Kokkos {
-namespace Impl {
-namespace {
-
-// Fast search for result[-1] <= val < result[0].
-// Requires result[max] == upper_bound.
-// Start with a binary search until the search range is
-// less than LINEAR_LIMIT, then switch to linear search.
-
-int memory_tracking_upper_bound( const ptrdiff_t * const begin
- , unsigned length
- , const ptrdiff_t value )
-{
- enum { LINEAR_LIMIT = 32 };
-
- // precondition: begin[length-1] == std::numeric_limits<ptrdiff_t>::max()
-
- const ptrdiff_t * first = begin ;
-
- while ( LINEAR_LIMIT < length ) {
- unsigned half = length >> 1 ;
- const ptrdiff_t * middle = first + half ;
-
- if ( value < *middle ) {
- length = half ;
- }
- else {
- first = ++middle ;
- length -= ++half ;
- }
- }
-
- for ( ; ! ( value < *first ) ; ++first ) {}
-
- return first - begin ;
-}
-
-template< class AttributeType = size_t >
-class MemoryTracking {
-public:
-
- class Entry {
- private:
-
- friend class MemoryTracking ;
-
- enum { LABEL_LENGTH = 128 };
-
- Entry( const Entry & );
- Entry & operator = ( const Entry & );
-
- ~Entry() {}
-
- Entry()
- : m_count(0)
- , m_alloc_ptr( reinterpret_cast<void*>( std::numeric_limits<ptrdiff_t>::max() ) )
- , m_alloc_size(0)
- , m_attribute()
- { strcpy( m_label , "sentinel" ); }
-
- Entry( const std::string & arg_label
- , void * const arg_alloc_ptr
- , size_t const arg_alloc_size )
- : m_count( 0 )
- , m_alloc_ptr( arg_alloc_ptr )
- , m_alloc_size( arg_alloc_size )
- , m_attribute()
- {
- strncpy( m_label , arg_label.c_str() , LABEL_LENGTH );
- m_label[ LABEL_LENGTH - 1 ] = 0 ;
- }
-
- char m_label[ LABEL_LENGTH ] ;
- size_t m_count ;
-
- public:
-
- void * const m_alloc_ptr ;
- size_t const m_alloc_size ;
- AttributeType m_attribute ;
-
- size_t count() const { return m_count ; }
- const char * label() const { return m_label ; }
-
- void print( std::ostream & oss ) const
- {
- oss << "{ \"" << m_label
- << "\" count(" << m_count
- << ") memory[ " << m_alloc_ptr
- << " + " << m_alloc_size
- << " ]" ;
- }
- };
-
- //------------------------------------------------------------
- /** \brief Track a memory range defined by the entry.
- * Return the input entry pointer for success.
- * Throw exception for failure.
- */
- Entry * insert( const std::string & arg_label
- , void * const arg_alloc_ptr
- , size_t const arg_alloc_size
- )
- {
- Entry * result = 0 ;
-
- const ptrdiff_t alloc_begin = reinterpret_cast<ptrdiff_t>(arg_alloc_ptr);
- const ptrdiff_t alloc_end = alloc_begin + arg_alloc_size ;
-
- const bool ok_exist = ! m_tracking_end.empty();
-
- const bool ok_input =
- ok_exist &&
- ( 0 < alloc_begin ) &&
- ( alloc_begin < alloc_end ) &&
- ( alloc_end < std::numeric_limits<ptrdiff_t>::max() );
-
- const int i = ok_input
- ? memory_tracking_upper_bound( & m_tracking_end[0] , m_tracking_end.size() , alloc_end )
- : -1 ;
-
- const bool ok_range = ( 0 <= i ) && ( alloc_end <= reinterpret_cast<ptrdiff_t>( m_tracking[i]->m_alloc_ptr ) );
-
- // allocate the new entry only if the vector inserts succeed.
- const bool ok_insert =
- ok_range &&
- ( alloc_end == *m_tracking_end.insert(m_tracking_end.begin()+i,alloc_end) ) &&
- ( 0 == *m_tracking.insert(m_tracking.begin()+i,0) ) &&
- ( 0 != ( result = new Entry(arg_label,arg_alloc_ptr,arg_alloc_size) ) );
-
- if ( ok_insert ) {
- result->m_count = 1 ;
- m_tracking[i] = result ;
- }
- else {
- std::ostringstream msg ;
- msg << m_space
- << "::insert( " << arg_label
- << " , " << arg_alloc_ptr
- << " , " << arg_alloc_size
- << " ) ERROR : " ;
- if ( ! ok_exist ) {
- msg << " called after return from main()" ;
- }
- else if ( ! ok_input ) {
- msg << " bad allocation range" ;
- }
- else if ( ! ok_range ) {
- msg << " overlapping memory range with"
- << " { " << m_tracking[i]->m_label
- << " , " << m_tracking[i]->m_alloc_ptr
- << " , " << m_tracking[i]->m_alloc_size
- << " }" ;
- }
- else {
- msg << " internal allocation error" ;
- }
- Kokkos::Impl::throw_runtime_exception( msg.str() );
- }
-
- return result ;
- }
-
- /** \brief Decrement the tracked memory range.
- * If the count is zero then return the originally inserted pointer.
- * If the count is non zero then return zero.
- */
- void * decrement( void const * const ptr )
- {
- void * result = 0 ;
-
- if ( ptr ) {
- const bool ok_exist = ! m_tracking_end.empty();
-
- const int i = ok_exist
- ? memory_tracking_upper_bound( & m_tracking_end[0] , m_tracking_end.size() , reinterpret_cast<ptrdiff_t>(ptr) )
- : -1 ;
-
- const bool ok_found = ( 0 <= i ) && ( reinterpret_cast<ptrdiff_t>( m_tracking[i]->m_alloc_ptr ) <=
- reinterpret_cast<ptrdiff_t>(ptr) );
-
- if ( ok_found ) {
- if ( 0 == --( m_tracking[i]->m_count ) ) {
- result = m_tracking[i]->m_alloc_ptr ;
- delete m_tracking[i] ;
- m_tracking.erase( m_tracking.begin() + i );
- m_tracking_end.erase( m_tracking_end.begin() + i );
- }
- }
- else {
- // Don't throw as this is likely called from within a destructor.
- std::cerr << m_space
- << "::decrement( " << ptr << " ) ERROR : "
- << ( ! ok_exist ? " called after return from main()"
- : " memory not being tracked" )
- << std::endl ;
- std::cerr.flush();
- }
- }
- return result ;
- }
-
- /** \brief Increment the tracking count. */
- void increment( void const * const ptr )
- {
- if ( ptr ) {
- const bool ok_exist = ! m_tracking_end.empty();
-
- const int i = ok_exist
- ? memory_tracking_upper_bound( & m_tracking_end[0] , m_tracking_end.size() , reinterpret_cast<ptrdiff_t>(ptr) )
- : -1 ;
-
- const bool ok_found = ( 0 <= i ) && ( reinterpret_cast<ptrdiff_t>( m_tracking[i]->m_alloc_ptr ) <=
- reinterpret_cast<ptrdiff_t>(ptr) );
-
- if ( ok_found ) {
- ++( m_tracking[i]->m_count );
- }
- else {
- std::ostringstream msg ;
- msg << m_space
- << "::increment( " << ptr << " ) ERROR : "
- << ( ! ok_exist ? " called after return from main()"
- : " memory not being tracked" )
- << std::endl ;
- Kokkos::Impl::throw_runtime_exception( msg.str() );
- }
- }
- }
-
- /** \brief Query a tracked memory range.
- * Return zero for not found.
- */
- Entry * query( void const * const ptr ) const
- {
- const bool ok_exist = ! m_tracking_end.empty();
-
- const int i = ( ok_exist && ptr )
- ? memory_tracking_upper_bound( & m_tracking_end[0] , m_tracking_end.size() , reinterpret_cast<ptrdiff_t>(ptr) )
- : -1 ;
-
- const bool ok_found = ( 0 <= i ) && ( reinterpret_cast<ptrdiff_t>( m_tracking[i]->m_alloc_ptr ) <=
- reinterpret_cast<ptrdiff_t>(ptr) );
-
- return ok_found ? m_tracking[i] : (Entry *) 0 ;
- }
-
- /** \brief Call the 'print' method on all entries. */
- void print( std::ostream & oss , const std::string & lead ) const
- {
- const size_t n = m_tracking.empty() ? 0 : m_tracking.size() - 1 ;
- for ( size_t i = 0 ; i < n ; ++i ) {
- oss << lead ;
- m_tracking[i]->print( oss );
- oss << std::endl ;
- }
- }
-
- size_t size() const { return m_tracking.size(); }
-
- template< typename iType >
- MemoryTracking & operator[]( const iType & i ) const
- { return *m_tracking[i]; }
-
- /** \brief Construct with a name for error messages */
- explicit MemoryTracking( const std::string & space_name )
- : m_space( space_name )
- , m_tracking()
- , m_tracking_end()
- , m_sentinel()
- {
- m_tracking.reserve( 512 );
- m_tracking_end.reserve( 512 );
- m_tracking.push_back( & m_sentinel );
- m_tracking_end.push_back( reinterpret_cast<ptrdiff_t>( m_sentinel.m_alloc_ptr ) );
- }
-
- /** \brief Print memory leak warning for all entries. */
- ~MemoryTracking()
- {
- try {
- const ptrdiff_t max = std::numeric_limits<ptrdiff_t>::max();
-
- if ( 1 < m_tracking.size() ) {
- std::cerr << m_space << " destroyed with memory leaks:" ;
- print( std::cerr , std::string(" ") );
- }
- else if ( m_tracking.empty() || max != m_tracking_end.back() ) {
- std::cerr << m_space << " corrupted data structure" << std::endl ;
- }
-
- m_space = std::string();
- m_tracking = std::vector<Entry*>();
- m_tracking_end = std::vector<ptrdiff_t>();
- }
- catch( ... ) {}
- }
-
- const std::string & label() const { return m_space ; }
-
-private:
- MemoryTracking();
- MemoryTracking( const MemoryTracking & );
- MemoryTracking & operator = ( const MemoryTracking & );
-
- std::string m_space ;
- std::vector<Entry*> m_tracking ;
- std::vector<ptrdiff_t> m_tracking_end ;
- Entry m_sentinel ;
-};
-
-} /* namespace */
-} /* namespace Impl */
-} /* namespace Kokkos */
-
-#endif
-
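The deleted tracker's memory_tracking_upper_bound() is a binary search that drops to a linear scan once the range is below 32 entries and relies on the sentinel entry (ptrdiff_t max) pushed by the constructor so the scan always terminates. Its result is the index std::upper_bound would produce; a small sketch of that equivalence, with illustrative values only:

#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

int main()
{
  // Sorted end-addresses of tracked ranges, terminated by the sentinel.
  std::vector<ptrdiff_t> ends = { 100 , 250 , 400 ,
                                  std::numeric_limits<ptrdiff_t>::max() };

  const ptrdiff_t query = 260;   // an address inside the third range

  // Index of the first element strictly greater than 'query',
  // i.e. the entry whose range may contain the queried address.
  const std::size_t i =
    std::upper_bound( ends.begin() , ends.end() , query ) - ends.begin();

  assert( i == 2 );              // ends[2] == 400 is the first value > 260
  return 0;
}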
diff --git a/lib/kokkos/core/src/impl/Kokkos_Memory_Fence.hpp b/lib/kokkos/core/src/impl/Kokkos_Memory_Fence.hpp
index eebb0c7f0..17eb0c2f4 100755
--- a/lib/kokkos/core/src/impl/Kokkos_Memory_Fence.hpp
+++ b/lib/kokkos/core/src/impl/Kokkos_Memory_Fence.hpp
@@ -1,73 +1,73 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
#if defined( KOKKOS_ATOMIC_HPP ) && ! defined( KOKKOS_MEMORY_FENCE )
#define KOKKOS_MEMORY_FENCE
-
namespace Kokkos {
//----------------------------------------------------------------------------
KOKKOS_FORCEINLINE_FUNCTION
void memory_fence()
{
#if defined( KOKKOS_ATOMICS_USE_CUDA )
__threadfence();
#elif defined( KOKKOS_ATOMICS_USE_GCC ) || \
( defined( KOKKOS_COMPILER_NVCC ) && defined( KOKKOS_ATOMICS_USE_INTEL ) )
__sync_synchronize();
#elif defined( KOKKOS_ATOMICS_USE_INTEL )
_mm_mfence();
#elif defined( KOKKOS_ATOMICS_USE_OMP31 )
#pragma omp flush
-
+#elif defined( KOKKOS_ATOMICS_USE_WINDOWS )
+ MemoryBarrier();
#else
#error "Error: memory_fence() not defined"
#endif
}
} // namespace kokkos
#endif
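memory_fence() above dispatches to whichever full barrier the active backend provides (__threadfence, __sync_synchronize, _mm_mfence, an OpenMP flush, or the newly added MemoryBarrier on Windows). In portable C++11 the same role is played by std::atomic_thread_fence; a stand-alone sketch of the publish/consume pattern such a fence protects (not Kokkos code):

#include <atomic>
#include <cstdio>
#include <thread>

static int payload = 0;
static std::atomic<int> ready(0);

void producer()
{
  payload = 42;                                            // plain store
  std::atomic_thread_fence( std::memory_order_release );   // like memory_fence()
  ready.store( 1 , std::memory_order_relaxed );            // publish flag
}

void consumer()
{
  while ( ready.load( std::memory_order_relaxed ) == 0 ) { /* spin */ }
  std::atomic_thread_fence( std::memory_order_acquire );   // pairs with the release fence
  std::printf( "payload = %d\n" , payload );               // guaranteed to print 42
}

int main()
{
  std::thread t1( producer ) , t2( consumer );
  t1.join(); t2.join();
  return 0;
}

The release/acquire pair here is the portable analogue of the single full fence the backends above emit.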
diff --git a/lib/kokkos/core/src/impl/Kokkos_PhysicalLayout.hpp b/lib/kokkos/core/src/impl/Kokkos_PhysicalLayout.hpp
index 0dcb3977a..0e87c63e4 100755
--- a/lib/kokkos/core/src/impl/Kokkos_PhysicalLayout.hpp
+++ b/lib/kokkos/core/src/impl/Kokkos_PhysicalLayout.hpp
@@ -1,84 +1,84 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_PHYSICAL_LAYOUT_HPP
#define KOKKOS_PHYSICAL_LAYOUT_HPP
#include <Kokkos_View.hpp>
namespace Kokkos {
namespace Impl {
struct PhysicalLayout {
enum LayoutType {Left,Right,Scalar,Error};
LayoutType layout_type;
int rank;
long long int stride[8]; //distance between two neighboring elements in a given dimension
template< class T , class L , class D , class M >
PhysicalLayout( const View<T,L,D,M,ViewDefault> & view )
: layout_type( is_same< typename View<T,L,D,M>::array_layout , LayoutLeft >::value ? Left : (
is_same< typename View<T,L,D,M>::array_layout , LayoutRight >::value ? Right : Error ))
, rank( view.Rank )
{
for(int i=0;i<8;i++) stride[i] = 0;
view.stride( stride );
}
#ifdef KOKKOS_HAVE_CUDA
template< class T , class L , class D , class M >
PhysicalLayout( const View<T,L,D,M,ViewCudaTexture> & view )
: layout_type( is_same< typename View<T,L,D,M>::array_layout , LayoutLeft >::value ? Left : (
is_same< typename View<T,L,D,M>::array_layout , LayoutRight >::value ? Right : Error ))
, rank( view.Rank )
{
for(int i=0;i<8;i++) stride[i] = 0;
view.stride( stride );
}
#endif
};
}
}
#endif
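PhysicalLayout reports, per dimension, the distance in elements between neighboring entries along that dimension, distinguishing LayoutLeft (column-major) from LayoutRight (row-major). A plain-C++ sketch of what those stride arrays look like for a dense rank-2 array, with no Kokkos dependency and extents chosen only for illustration:

#include <cstdio>

int main()
{
  const long long n0 = 4 , n1 = 3;   // extents of a rank-2 array

  // LayoutLeft (column-major, Fortran order): first index is fastest.
  long long left_stride[2]  = { 1 , n0 };

  // LayoutRight (row-major, C order): last index is fastest.
  long long right_stride[2] = { n1 , 1 };

  // Flat offset of element (i0,i1) under each layout.
  const long long i0 = 2 , i1 = 1;
  std::printf( "LayoutLeft  offset = %lld\n" ,
               i0 * left_stride[0]  + i1 * left_stride[1]  );  // 2 + 1*4 = 6
  std::printf( "LayoutRight offset = %lld\n" ,
               i0 * right_stride[0] + i1 * right_stride[1] );  // 2*3 + 1 = 7
  return 0;
}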
diff --git a/lib/kokkos/core/src/impl/Kokkos_Profiling_DeviceInfo.hpp b/lib/kokkos/core/src/impl/Kokkos_Profiling_DeviceInfo.hpp
new file mode 100755
index 000000000..5da60841d
--- /dev/null
+++ b/lib/kokkos/core/src/impl/Kokkos_Profiling_DeviceInfo.hpp
@@ -0,0 +1,57 @@
+/*
+ //@HEADER
+ // ************************************************************************
+ //
+ // Kokkos v. 2.0
+ // Copyright (2014) Sandia Corporation
+ //
+ // Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+ // the U.S. Government retains certain rights in this software.
+ //
+ // Redistribution and use in source and binary forms, with or without
+ // modification, are permitted provided that the following conditions are
+ // met:
+ //
+ // 1. Redistributions of source code must retain the above copyright
+ // notice, this list of conditions and the following disclaimer.
+ //
+ // 2. Redistributions in binary form must reproduce the above copyright
+ // notice, this list of conditions and the following disclaimer in the
+ // documentation and/or other materials provided with the distribution.
+ //
+ // 3. Neither the name of the Corporation nor the names of the
+ // contributors may be used to endorse or promote products derived from
+ // this software without specific prior written permission.
+ //
+ // THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+ // EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ // IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ // PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+ // CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ // EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ // PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ // PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ // LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+ // NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ // SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ //
+ // Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+ //
+ // ************************************************************************
+ //@HEADER
+*/
+
+#ifndef KOKKOSP_DEVICE_INFO_HPP
+#define KOKKOSP_DEVICE_INFO_HPP
+
+namespace Kokkos {
+namespace Experimental {
+
+ struct KokkosPDeviceInfo {
+ uint32_t deviceID;
+ };
+
+}
+}
+
+#endif
diff --git a/lib/kokkos/core/src/impl/Kokkos_Profiling_Interface.cpp b/lib/kokkos/core/src/impl/Kokkos_Profiling_Interface.cpp
new file mode 100755
index 000000000..85ec1709c
--- /dev/null
+++ b/lib/kokkos/core/src/impl/Kokkos_Profiling_Interface.cpp
@@ -0,0 +1,141 @@
+/*
+ //@HEADER
+ // ************************************************************************
+ //
+ // Kokkos v. 2.0
+ // Copyright (2014) Sandia Corporation
+ //
+ // Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+ // the U.S. Government retains certain rights in this software.
+ //
+ // Redistribution and use in source and binary forms, with or without
+ // modification, are permitted provided that the following conditions are
+ // met:
+ //
+ // 1. Redistributions of source code must retain the above copyright
+ // notice, this list of conditions and the following disclaimer.
+ //
+ // 2. Redistributions in binary form must reproduce the above copyright
+ // notice, this list of conditions and the following disclaimer in the
+ // documentation and/or other materials provided with the distribution.
+ //
+ // 3. Neither the name of the Corporation nor the names of the
+ // contributors may be used to endorse or promote products derived from
+ // this software without specific prior written permission.
+ //
+ // THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+ // EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ // IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ // PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+ // CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ // EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ // PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ // PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ // LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+ // NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ // SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ //
+ // Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+ //
+ // ************************************************************************
+ //@HEADER
+ */
+
+#include <impl/Kokkos_Profiling_Interface.hpp>
+
+#ifdef KOKKOSP_ENABLE_PROFILING
+#include <string.h>
+
+namespace Kokkos {
+ namespace Experimental {
+ bool profileLibraryLoaded() {
+ return (NULL != initProfileLibrary);
+ }
+
+ void beginParallelFor(const std::string& kernelPrefix, const uint32_t devID, uint64_t* kernelID) {
+ if(NULL != beginForCallee) {
+ Kokkos::fence();
+ (*beginForCallee)(kernelPrefix.c_str(), devID, kernelID);
+ }
+ };
+
+ void endParallelFor(const uint64_t kernelID) {
+ if(NULL != endForCallee) {
+ Kokkos::fence();
+ (*endForCallee)(kernelID);
+ }
+ };
+
+ void beginParallelScan(const std::string& kernelPrefix, const uint32_t devID, uint64_t* kernelID) {
+ if(NULL != beginScanCallee) {
+ Kokkos::fence();
+ (*beginScanCallee)(kernelPrefix.c_str(), devID, kernelID);
+ }
+ };
+
+ void endParallelScan(const uint64_t kernelID) {
+ if(NULL != endScanCallee) {
+ Kokkos::fence();
+ (*endScanCallee)(kernelID);
+ }
+ };
+
+ void beginParallelReduce(const std::string& kernelPrefix, const uint32_t devID, uint64_t* kernelID) {
+ if(NULL != beginReduceCallee) {
+ Kokkos::fence();
+ (*beginReduceCallee)(kernelPrefix.c_str(), devID, kernelID);
+ }
+ };
+
+ void endParallelReduce(const uint64_t kernelID) {
+ if(NULL != endReduceCallee) {
+ Kokkos::fence();
+ (*endReduceCallee)(kernelID);
+ }
+ };
+
+ void initialize() {
+ void* firstProfileLibrary;
+
+ char* envProfileLibrary = getenv("KOKKOS_PROFILE_LIBRARY");
+ char* profileLibraryName = strtok(envProfileLibrary, ";");
+
+ if( (NULL != profileLibraryName) && (strcmp(profileLibraryName, "") != 0) ) {
+ firstProfileLibrary = dlopen(profileLibraryName, RTLD_NOW | RTLD_GLOBAL);
+
+ if(NULL == firstProfileLibrary) {
+ std::cerr << "Error: Unable to load KokkosP library: " <<
+ profileLibraryName << std::endl;
+ } else {
+ std::cout << "KOKKOSP: Library Loaded: " << profileLibraryName << std::endl;
+
+ beginForCallee = (beginFunction) dlsym(firstProfileLibrary, "kokkosp_begin_parallel_for");
+ beginScanCallee = (beginFunction) dlsym(firstProfileLibrary, "kokkosp_begin_parallel_scan");
+ beginReduceCallee = (beginFunction) dlsym(firstProfileLibrary, "kokkosp_begin_parallel_reduce");
+
+ endScanCallee = (endFunction) dlsym(firstProfileLibrary, "kokkosp_end_parallel_scan");
+ endForCallee = (endFunction) dlsym(firstProfileLibrary, "kokkosp_end_parallel_for");
+ endReduceCallee = (endFunction) dlsym(firstProfileLibrary, "kokkosp_end_parallel_reduce");
+
+ initProfileLibrary = (initFunction) dlsym(firstProfileLibrary, "kokkosp_init_library");
+ finalizeProfileLibrary = (finalizeFunction) dlsym(firstProfileLibrary, "kokkosp_finalize_library");
+ }
+ }
+
+ if(NULL != initProfileLibrary) {
+ (*initProfileLibrary)(0,
+ (uint64_t) KOKKOSP_INTERFACE_VERSION,
+ (uint32_t) 0,
+ NULL);
+ }
+ };
+
+ void finalize() {
+ if(NULL != finalizeProfileLibrary) {
+ (*finalizeProfileLibrary)();
+ }
+ };
+ }
+}
+
+#endif
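initialize() above resolves the tool entry points with dlsym() from the shared object named by the KOKKOS_PROFILE_LIBRARY environment variable; any hook that is missing simply stays NULL and is skipped. A minimal tool satisfying the typedefs declared in Kokkos_Profiling_Interface.hpp might look like the sketch below; the device-info argument is taken as an opaque pointer here, and the build line (roughly g++ -shared -fPIC tool.cpp -o libtool.so, then point KOKKOS_PROFILE_LIBRARY at the result) is an assumption, not something prescribed by the patch.

#include <cstdint>
#include <cstdio>

// Signatures mirror the typedefs in Kokkos_Profiling_Interface.hpp;
// KokkosPDeviceInfo* is passed as an opaque void* in this sketch.
extern "C" void kokkosp_init_library( const int loadSeq ,
                                      const uint64_t interfaceVersion ,
                                      const uint32_t deviceInfoCount ,
                                      void * /* deviceInfo */ )
{
  std::printf( "tool: init seq=%d version=%llu devices=%u\n" ,
               loadSeq , (unsigned long long) interfaceVersion , deviceInfoCount );
}

extern "C" void kokkosp_finalize_library()
{
  std::printf( "tool: finalize\n" );
}

extern "C" void kokkosp_begin_parallel_for( const char * name ,
                                            const uint32_t devID ,
                                            uint64_t * kernelID )
{
  static uint64_t next = 0;
  *kernelID = next++;                      // hand an id back to the runtime
  std::printf( "tool: begin for '%s' dev=%u id=%llu\n" ,
               name , devID , (unsigned long long) *kernelID );
}

extern "C" void kokkosp_end_parallel_for( const uint64_t kernelID )
{
  std::printf( "tool: end   for id=%llu\n" , (unsigned long long) kernelID );
}

Hooks that the tool does not export (scan and reduce in this sketch) are left NULL by dlsym and the corresponding begin/end calls in the interface above become no-ops.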
diff --git a/lib/kokkos/core/src/impl/Kokkos_Profiling_Interface.hpp b/lib/kokkos/core/src/impl/Kokkos_Profiling_Interface.hpp
new file mode 100755
index 000000000..1e2f715f3
--- /dev/null
+++ b/lib/kokkos/core/src/impl/Kokkos_Profiling_Interface.hpp
@@ -0,0 +1,98 @@
+/*
+ //@HEADER
+ // ************************************************************************
+ //
+ // Kokkos v. 2.0
+ // Copyright (2014) Sandia Corporation
+ //
+ // Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+ // the U.S. Government retains certain rights in this software.
+ //
+ // Redistribution and use in source and binary forms, with or without
+ // modification, are permitted provided that the following conditions are
+ // met:
+ //
+ // 1. Redistributions of source code must retain the above copyright
+ // notice, this list of conditions and the following disclaimer.
+ //
+ // 2. Redistributions in binary form must reproduce the above copyright
+ // notice, this list of conditions and the following disclaimer in the
+ // documentation and/or other materials provided with the distribution.
+ //
+ // 3. Neither the name of the Corporation nor the names of the
+ // contributors may be used to endorse or promote products derived from
+ // this software without specific prior written permission.
+ //
+ // THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+ // EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ // IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ // PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+ // CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ // EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ // PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ // PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ // LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+ // NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ // SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ //
+ // Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+ //
+ // ************************************************************************
+ //@HEADER
+ */
+
+#ifndef KOKKOSP_INTERFACE_HPP
+#define KOKKOSP_INTERFACE_HPP
+
+#include <cstddef>
+#include <Kokkos_Core_fwd.hpp>
+#include <Kokkos_Macros.hpp>
+#include <string>
+
+#ifdef KOKKOSP_ENABLE_PROFILING
+#include <impl/Kokkos_Profiling_DeviceInfo.hpp>
+#include <dlfcn.h>
+#include <iostream>
+#include <stdlib.h>
+#endif
+
+#define KOKKOSP_INTERFACE_VERSION 20150628
+
+#ifdef KOKKOSP_ENABLE_PROFILING
+namespace Kokkos {
+ namespace Experimental {
+
+ typedef void (*initFunction)(const int,
+ const uint64_t,
+ const uint32_t,
+ KokkosPDeviceInfo*);
+ typedef void (*finalizeFunction)();
+ typedef void (*beginFunction)(const char*, const uint32_t, uint64_t*);
+ typedef void (*endFunction)(uint64_t);
+
+ static initFunction initProfileLibrary = NULL;
+ static finalizeFunction finalizeProfileLibrary = NULL;
+ static beginFunction beginForCallee = NULL;
+ static beginFunction beginScanCallee = NULL;
+ static beginFunction beginReduceCallee = NULL;
+ static endFunction endForCallee = NULL;
+ static endFunction endScanCallee = NULL;
+ static endFunction endReduceCallee = NULL;
+
+ bool profileLibraryLoaded();
+
+ void beginParallelFor(const std::string& kernelPrefix, const uint32_t devID, uint64_t* kernelID);
+ void endParallelFor(const uint64_t kernelID);
+ void beginParallelScan(const std::string& kernelPrefix, const uint32_t devID, uint64_t* kernelID);
+ void endParallelScan(const uint64_t kernelID);
+ void beginParallelReduce(const std::string& kernelPrefix, const uint32_t devID, uint64_t* kernelID);
+ void endParallelReduce(const uint64_t kernelID);
+
+ void initialize();
+ void finalize();
+
+ }
+}
+
+#endif
+#endif
diff --git a/lib/kokkos/core/src/impl/Kokkos_Serial.cpp b/lib/kokkos/core/src/impl/Kokkos_Serial.cpp
index db9f7c5b5..562c7afc6 100755
--- a/lib/kokkos/core/src/impl/Kokkos_Serial.cpp
+++ b/lib/kokkos/core/src/impl/Kokkos_Serial.cpp
@@ -1,119 +1,119 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
#include <stdlib.h>
#include <sstream>
#include <Kokkos_Serial.hpp>
#include <impl/Kokkos_Traits.hpp>
#include <impl/Kokkos_Error.hpp>
#if defined( KOKKOS_HAVE_SERIAL )
/*--------------------------------------------------------------------------*/
namespace Kokkos {
namespace Impl {
namespace SerialImpl {
Sentinel::Sentinel() : m_scratch(0), m_reduce_end(0), m_shared_end(0) {}
Sentinel::~Sentinel()
{
if ( m_scratch ) { free( m_scratch ); }
m_scratch = 0 ;
m_reduce_end = 0 ;
m_shared_end = 0 ;
}
Sentinel & Sentinel::singleton()
{
static Sentinel s ; return s ;
}
inline
unsigned align( unsigned n )
{
enum { ALIGN = 0x0100 /* 256 */ , MASK = ALIGN - 1 };
return ( n + MASK ) & ~MASK ;
}
} // namespace
SerialTeamMember::SerialTeamMember( int arg_league_rank
, int arg_league_size
, int arg_shared_size
)
: m_space( ((char *) SerialImpl::Sentinel::singleton().m_scratch) + SerialImpl::Sentinel::singleton().m_reduce_end
, arg_shared_size )
, m_league_rank( arg_league_rank )
, m_league_size( arg_league_size )
{}
} // namespace Impl
void * Serial::scratch_memory_resize( unsigned reduce_size , unsigned shared_size )
{
static Impl::SerialImpl::Sentinel & s = Impl::SerialImpl::Sentinel::singleton();
reduce_size = Impl::SerialImpl::align( reduce_size );
shared_size = Impl::SerialImpl::align( shared_size );
if ( ( s.m_reduce_end < reduce_size ) ||
( s.m_shared_end < s.m_reduce_end + shared_size ) ) {
if ( s.m_scratch ) { free( s.m_scratch ); }
if ( s.m_reduce_end < reduce_size ) s.m_reduce_end = reduce_size ;
if ( s.m_shared_end < s.m_reduce_end + shared_size ) s.m_shared_end = s.m_reduce_end + shared_size ;
s.m_scratch = malloc( s.m_shared_end );
}
return s.m_scratch ;
}
} // namespace Kokkos
#endif // defined( KOKKOS_HAVE_SERIAL )
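scratch_memory_resize() above keeps a single growing scratch buffer, with the reduce segment first and the shared segment behind it, and align() rounds both sizes up to a multiple of 256 bytes using the usual power-of-two mask trick. A tiny sketch of that rounding identity, with values chosen for illustration:

#include <cassert>

// Round n up to the next multiple of a power-of-two 'align'.
static unsigned round_up( unsigned n , unsigned align )
{
  const unsigned mask = align - 1;
  return ( n + mask ) & ~mask;
}

int main()
{
  assert( round_up(    0 , 256 ) ==    0 );
  assert( round_up(    1 , 256 ) ==  256 );
  assert( round_up(  256 , 256 ) ==  256 );
  assert( round_up( 1000 , 256 ) == 1024 );
  return 0;
}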
diff --git a/lib/kokkos/core/src/impl/Kokkos_Serial_TaskPolicy.cpp b/lib/kokkos/core/src/impl/Kokkos_Serial_TaskPolicy.cpp
index d814a78df..688f97f42 100755
--- a/lib/kokkos/core/src/impl/Kokkos_Serial_TaskPolicy.cpp
+++ b/lib/kokkos/core/src/impl/Kokkos_Serial_TaskPolicy.cpp
@@ -1,324 +1,336 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
// Experimental unified task-data parallel manycore LDRD
#include <impl/Kokkos_Serial_TaskPolicy.hpp>
#if defined( KOKKOS_HAVE_SERIAL )
#include <stdlib.h>
#include <stdexcept>
#include <iostream>
#include <sstream>
#include <string>
//----------------------------------------------------------------------------
namespace Kokkos {
+namespace Experimental {
+
+TaskPolicy< Kokkos::Serial >::member_type &
+TaskPolicy< Kokkos::Serial >::member_single()
+{
+ static member_type s(0,1,0);
+ return s ;
+}
+
+} // namespace Experimental
+} // namespace Kokkos
+
+namespace Kokkos {
+namespace Experimental {
namespace Impl {
typedef TaskMember< Kokkos::Serial , void , void > Task ;
//----------------------------------------------------------------------------
namespace {
inline
unsigned padded_sizeof_derived( unsigned sizeof_derived )
{
return sizeof_derived +
( sizeof_derived % sizeof(Task*) ? sizeof(Task*) - sizeof_derived % sizeof(Task*) : 0 );
}
} // namespace
void Task::deallocate( void * ptr )
{
free( ptr );
}
void * Task::allocate( const unsigned arg_sizeof_derived
, const unsigned arg_dependence_capacity )
{
return malloc( padded_sizeof_derived( arg_sizeof_derived ) + arg_dependence_capacity * sizeof(Task*) );
}
Task::~TaskMember()
{
}
Task::TaskMember( const Task::function_verify_type arg_verify
, const Task::function_dealloc_type arg_dealloc
, const Task::function_apply_type arg_apply
, const unsigned arg_sizeof_derived
, const unsigned arg_dependence_capacity
)
: m_dealloc( arg_dealloc )
, m_verify( arg_verify )
, m_apply( arg_apply )
, m_dep( (Task **)( ((unsigned char *) this) + padded_sizeof_derived( arg_sizeof_derived ) ) )
, m_wait( 0 )
, m_next( 0 )
, m_dep_capacity( arg_dependence_capacity )
, m_dep_size( 0 )
, m_ref_count( 0 )
, m_state( TASK_STATE_CONSTRUCTING )
{
for ( unsigned i = 0 ; i < arg_dependence_capacity ; ++i ) m_dep[i] = 0 ;
}
Task::TaskMember( const Task::function_dealloc_type arg_dealloc
, const Task::function_apply_type arg_apply
, const unsigned arg_sizeof_derived
, const unsigned arg_dependence_capacity
)
: m_dealloc( arg_dealloc )
, m_verify( & Task::verify_type<void> )
, m_apply( arg_apply )
, m_dep( (Task **)( ((unsigned char *) this) + padded_sizeof_derived( arg_sizeof_derived ) ) )
, m_wait( 0 )
, m_next( 0 )
, m_dep_capacity( arg_dependence_capacity )
, m_dep_size( 0 )
, m_ref_count( 0 )
, m_state( TASK_STATE_CONSTRUCTING )
{
for ( unsigned i = 0 ; i < arg_dependence_capacity ; ++i ) m_dep[i] = 0 ;
}
//----------------------------------------------------------------------------
void Task::throw_error_add_dependence() const
{
std::cerr << "TaskMember< Serial >::add_dependence ERROR"
<< " state(" << m_state << ")"
<< " dep_size(" << m_dep_size << ")"
<< std::endl ;
throw std::runtime_error("TaskMember< Serial >::add_dependence ERROR");
}
void Task::throw_error_verify_type()
{
throw std::runtime_error("TaskMember< Serial >::verify_type ERROR");
}
//----------------------------------------------------------------------------
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
void Task::assign( Task ** const lhs , Task * rhs , const bool no_throw )
{
- static const char msg_error_header[] = "Kokkos::Impl::TaskManager<Kokkos::Serial>::assign ERROR" ;
+ static const char msg_error_header[] = "Kokkos::Experimental::Impl::TaskManager<Kokkos::Serial>::assign ERROR" ;
static const char msg_error_count[] = ": negative reference count" ;
static const char msg_error_complete[] = ": destroy task that is not complete" ;
static const char msg_error_dependences[] = ": destroy task that has dependences" ;
static const char msg_error_exception[] = ": caught internal exception" ;
const char * msg_error = 0 ;
try {
if ( *lhs ) {
const int count = --((**lhs).m_ref_count);
if ( 0 == count ) {
// Reference count at zero, delete it
// Should only be deallocating a completed task
- if ( (**lhs).m_state == Kokkos::TASK_STATE_COMPLETE ) {
+ if ( (**lhs).m_state == Kokkos::Experimental::TASK_STATE_COMPLETE ) {
// A completed task should not have dependences...
for ( int i = 0 ; i < (**lhs).m_dep_size && 0 == msg_error ; ++i ) {
if ( (**lhs).m_dep[i] ) msg_error = msg_error_dependences ;
}
}
else {
msg_error = msg_error_complete ;
}
if ( 0 == msg_error ) {
// Get deletion function and apply it
const Task::function_dealloc_type d = (**lhs).m_dealloc ;
(*d)( *lhs );
}
}
else if ( count <= 0 ) {
msg_error = msg_error_count ;
}
}
if ( 0 == msg_error && rhs ) { ++( rhs->m_ref_count ); }
*lhs = rhs ;
}
catch( ... ) {
if ( 0 == msg_error ) msg_error = msg_error_exception ;
}
if ( 0 != msg_error ) {
if ( no_throw ) {
std::cerr << msg_error_header << msg_error << std::endl ;
std::cerr.flush();
}
else {
std::string msg(msg_error_header);
msg.append(msg_error);
throw std::runtime_error( msg );
}
}
}
#endif
namespace {
Task * s_ready = 0 ;
Task * s_denied = reinterpret_cast<Task*>( ~((unsigned long)0) );
}
void Task::schedule()
{
// Execute ready tasks in case the task being scheduled
// is dependent upon a waiting and ready task.
Task::execute_ready_tasks();
// spawning : Constructing -> Waiting
// respawning : Executing -> Waiting
// updating : Waiting -> Waiting
// Must not be in a dependence linked list: 0 == t->m_next
const bool ok_state = TASK_STATE_COMPLETE != m_state ;
const bool ok_list = 0 == m_next ;
if ( ok_state && ok_list ) {
// Will be waiting for execution upon return from this function
- m_state = Kokkos::TASK_STATE_WAITING ;
+ m_state = Kokkos::Experimental::TASK_STATE_WAITING ;
// Insert this task into another dependence that is not complete
int i = 0 ;
for ( ; i < m_dep_size ; ++i ) {
Task * const y = m_dep[i] ;
if ( y && s_denied != ( m_next = y->m_wait ) ) {
y->m_wait = this ; // CAS( & y->m_wait , m_next , this );
break ;
}
}
if ( i == m_dep_size ) {
// All dependences are complete, insert into the ready list
m_next = s_ready ;
s_ready = this ; // CAS( & s_ready , m_next = s_ready , this );
}
}
else {
- throw std::runtime_error(std::string("Kokkos::Impl::Task spawn or respawn state error"));
+ throw std::runtime_error(std::string("Kokkos::Experimental::Impl::Task spawn or respawn state error"));
}
}
void Task::execute_ready_tasks()
{
while ( s_ready ) {
// Remove this task from the ready list
// Task * task ;
// while ( ! CAS( & s_ready , task = s_ready , s_ready->m_next ) );
Task * const task = s_ready ;
s_ready = task->m_next ;
task->m_next = 0 ;
// precondition: task->m_state = TASK_STATE_WAITING
// precondition: task->m_dep[i]->m_state == TASK_STATE_COMPLETE for all i
// precondition: does not exist T such that T->m_wait = task
// precondition: does not exist T such that T->m_next = task
- task->m_state = Kokkos::TASK_STATE_EXECUTING ;
+ task->m_state = Kokkos::Experimental::TASK_STATE_EXECUTING ;
(*task->m_apply)( task );
- if ( task->m_state == Kokkos::TASK_STATE_EXECUTING ) {
+ if ( task->m_state == Kokkos::Experimental::TASK_STATE_EXECUTING ) {
// task did not respawn itself
- task->m_state = Kokkos::TASK_STATE_COMPLETE ;
+ task->m_state = Kokkos::Experimental::TASK_STATE_COMPLETE ;
// release dependences:
for ( int i = 0 ; i < task->m_dep_size ; ++i ) {
assign( task->m_dep + i , 0 );
}
// Stop other tasks from adding themselves to 'task->m_wait' ;
Task * x ;
// CAS( & task->m_wait , x = task->m_wait , s_denied );
x = task->m_wait ; task->m_wait = s_denied ;
// update tasks waiting on this task
while ( x ) {
Task * const next = x->m_next ;
x->m_next = 0 ;
x->schedule(); // could happen concurrently
x = next ;
}
}
}
}
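// Editorial walk-through (illustrative, not upstream code): for a hypothetical
// chain  a -> b  where b depends on a,
//
//   b->schedule() : b becomes WAITING and hooks itself onto a->m_wait
//   a->schedule() : a has no incomplete dependences, so it lands on s_ready
//   execute_ready_tasks() :
//     pops a, marks it EXECUTING, runs (*a->m_apply)(a);
//     a did not respawn, so it becomes COMPLETE, its wait list is sealed with
//     s_denied, and b is re-scheduled -- this time onto s_ready;
//     the loop then pops and runs b the same way.
//
// The commented-out CAS calls mark where these serial pointer swaps would need
// atomic compare-and-swap in a concurrent backend.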
-void Task::wait( const Future< void , Kokkos::Serial > & )
-{ execute_ready_tasks(); }
-
} // namespace Impl
+} // namespace Experimental
} // namespace Kokkos
#endif // defined( KOKKOS_HAVE_SERIAL )
diff --git a/lib/kokkos/core/src/impl/Kokkos_Serial_TaskPolicy.hpp b/lib/kokkos/core/src/impl/Kokkos_Serial_TaskPolicy.hpp
index bdd9fd03f..4eec2f66b 100755
--- a/lib/kokkos/core/src/impl/Kokkos_Serial_TaskPolicy.hpp
+++ b/lib/kokkos/core/src/impl/Kokkos_Serial_TaskPolicy.hpp
@@ -1,763 +1,845 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
// Experimental unified task-data parallel manycore LDRD
#ifndef KOKKOS_SERIAL_TASKPOLICY_HPP
#define KOKKOS_SERIAL_TASKPOLICY_HPP
#include <Kokkos_Macros.hpp>
#if defined( KOKKOS_HAVE_SERIAL )
#include <string>
#include <typeinfo>
#include <stdexcept>
#include <Kokkos_Serial.hpp>
#include <Kokkos_TaskPolicy.hpp>
#include <Kokkos_View.hpp>
#include <impl/Kokkos_FunctorAdapter.hpp>
//----------------------------------------------------------------------------
/* Inheritance structure that allows static_cast between the task root type
 * and a task's FunctorType.
*
* task_root_type == TaskMember< Space , void , void >
*
* TaskMember< PolicyType , ResultType , FunctorType >
* : TaskMember< PolicyType::Space , ResultType , FunctorType >
* { ... };
*
* TaskMember< Space , ResultType , FunctorType >
* : TaskMember< Space , ResultType , void >
* , FunctorType
* { ... };
*
* when ResultType != void
*
* TaskMember< Space , ResultType , void >
* : TaskMember< Space , void , void >
* { ... };
*
*/
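// Editorial sketch of why this layering matters (illustrative; it mirrors the
// apply_single()/apply_team() implementations further down). Because every
// concrete task derives, directly or indirectly, from the root type
// TaskMember< Space , void , void >, a stored root pointer can be cast back to
// the full task and from there to the user's functor and result:
//
//   // task_root_type * t actually holds a TaskMember< Serial , R , F >
//   typedef TaskMember< Kokkos::Serial , R , F > derived_type ;
//   derived_type & m = * static_cast< derived_type * >( t );
//   F & functor = m ;            // FunctorType is a base class
//   R & result  = m.m_result ;   // from TaskMember< Serial , R , void >
//
// R and F are placeholders for a task's result and functor types.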
//----------------------------------------------------------------------------
namespace Kokkos {
+namespace Experimental {
namespace Impl {
/** \brief Base class for all tasks in the Serial execution space */
template<>
class TaskMember< Kokkos::Serial , void , void >
{
public:
typedef void (* function_apply_type) ( TaskMember * );
typedef void (* function_dealloc_type)( TaskMember * );
typedef TaskMember * (* function_verify_type) ( TaskMember * );
private:
const function_dealloc_type m_dealloc ; ///< Deallocation
const function_verify_type m_verify ; ///< Result type verification
const function_apply_type m_apply ; ///< Apply function
TaskMember ** const m_dep ; ///< Dependences
TaskMember * m_wait ; ///< Linked list of tasks waiting on this task
TaskMember * m_next ; ///< Linked list of tasks waiting on a different task
const int m_dep_capacity ; ///< Capacity of dependences
int m_dep_size ; ///< Actual count of dependences
int m_ref_count ; ///< Reference count
int m_state ; ///< State of the task
// size = 6 Pointers + 4 ints
TaskMember() /* = delete */ ;
TaskMember( const TaskMember & ) /* = delete */ ;
TaskMember & operator = ( const TaskMember & ) /* = delete */ ;
static void * allocate( const unsigned arg_sizeof_derived , const unsigned arg_dependence_capacity );
static void deallocate( void * );
void throw_error_add_dependence() const ;
static void throw_error_verify_type();
template < class DerivedTaskType >
static
void deallocate( TaskMember * t )
{
DerivedTaskType * ptr = static_cast< DerivedTaskType * >(t);
ptr->~DerivedTaskType();
deallocate( (void *) ptr );
}
protected :
~TaskMember();
// Used by TaskMember< Serial , ResultType , void >
TaskMember( const function_verify_type arg_verify
, const function_dealloc_type arg_dealloc
, const function_apply_type arg_apply
, const unsigned arg_sizeof_derived
, const unsigned arg_dependence_capacity
);
// Used for TaskMember< Serial , void , void >
TaskMember( const function_dealloc_type arg_dealloc
, const function_apply_type arg_apply
, const unsigned arg_sizeof_derived
, const unsigned arg_dependence_capacity
);
public:
template< typename ResultType >
KOKKOS_FUNCTION static
TaskMember * verify_type( TaskMember * t )
{
- enum { check_type = ! Impl::is_same< ResultType , void >::value };
+ enum { check_type = ! Kokkos::Impl::is_same< ResultType , void >::value };
if ( check_type && t != 0 ) {
// Verify that t->m_verify is this function
const function_verify_type self = & TaskMember::template verify_type< ResultType > ;
if ( t->m_verify != self ) {
t = 0 ;
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
throw_error_verify_type();
#endif
}
}
return t ;
}
//----------------------------------------
  /* Inheritance requirements on task types:
* typedef FunctorType::value_type value_type ;
* class DerivedTaskType
* : public TaskMember< Serial , value_type , FunctorType >
* { ... };
* class TaskMember< Serial , value_type , FunctorType >
* : public TaskMember< Serial , value_type , void >
* , public Functor
* { ... };
* If value_type != void
* class TaskMember< Serial , value_type , void >
* : public TaskMember< Serial , void , void >
*
* Allocate space for DerivedTaskType followed by TaskMember*[ dependence_capacity ]
*
*/
/** \brief Allocate and construct a single-thread task */
template< class DerivedTaskType >
static
TaskMember * create( const typename DerivedTaskType::functor_type & arg_functor
- , const unsigned arg_dependence_capacity )
+ , const unsigned arg_dependence_capacity
+ )
{
typedef typename DerivedTaskType::functor_type functor_type ;
typedef typename functor_type::value_type value_type ;
DerivedTaskType * const task =
new( allocate( sizeof(DerivedTaskType) , arg_dependence_capacity ) )
DerivedTaskType( & TaskMember::template deallocate< DerivedTaskType >
, & TaskMember::template apply_single< functor_type , value_type >
, sizeof(DerivedTaskType)
, arg_dependence_capacity
, arg_functor );
return static_cast< TaskMember * >( task );
}
/** \brief Allocate and construct a data parallel task */
template< class DerivedTaskType >
static
TaskMember * create( const typename DerivedTaskType::policy_type & arg_policy
, const typename DerivedTaskType::functor_type & arg_functor
- , const unsigned arg_dependence_capacity )
+ , const unsigned arg_dependence_capacity
+ )
{
DerivedTaskType * const task =
new( allocate( sizeof(DerivedTaskType) , arg_dependence_capacity ) )
DerivedTaskType( & TaskMember::template deallocate< DerivedTaskType >
, sizeof(DerivedTaskType)
, arg_dependence_capacity
, arg_policy
, arg_functor
);
return static_cast< TaskMember * >( task );
}
+ /** \brief Allocate and construct a thread-team task */
+ template< class DerivedTaskType >
+ static
+ TaskMember * create_team( const typename DerivedTaskType::functor_type & arg_functor
+ , const unsigned arg_dependence_capacity
+ )
+ {
+ typedef typename DerivedTaskType::functor_type functor_type ;
+ typedef typename functor_type::value_type value_type ;
+
+ DerivedTaskType * const task =
+ new( allocate( sizeof(DerivedTaskType) , arg_dependence_capacity ) )
+ DerivedTaskType( & TaskMember::template deallocate< DerivedTaskType >
+ , & TaskMember::template apply_team< functor_type , value_type >
+ , sizeof(DerivedTaskType)
+ , arg_dependence_capacity
+ , arg_functor );
+
+ return static_cast< TaskMember * >( task );
+ }
+
void schedule();
static void execute_ready_tasks();
- static void wait( const Future< void , Kokkos::Serial > & );
//----------------------------------------
typedef FutureValueTypeIsVoidError get_result_type ;
KOKKOS_INLINE_FUNCTION
get_result_type get() const { return get_result_type() ; }
KOKKOS_INLINE_FUNCTION
- Kokkos::TaskState get_state() const { return Kokkos::TaskState( m_state ); }
+ Kokkos::Experimental::TaskState get_state() const { return Kokkos::Experimental::TaskState( m_state ); }
//----------------------------------------
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
static
void assign( TaskMember ** const lhs , TaskMember * const rhs , const bool no_throw = false );
#else
KOKKOS_INLINE_FUNCTION static
void assign( TaskMember ** const lhs , TaskMember * const rhs , const bool no_throw = false ) {}
#endif
KOKKOS_INLINE_FUNCTION
TaskMember * get_dependence( int i ) const
- { return ( Kokkos::TASK_STATE_EXECUTING == m_state && 0 <= i && i < m_dep_size ) ? m_dep[i] : (TaskMember*) 0 ; }
+ { return ( Kokkos::Experimental::TASK_STATE_EXECUTING == m_state && 0 <= i && i < m_dep_size ) ? m_dep[i] : (TaskMember*) 0 ; }
KOKKOS_INLINE_FUNCTION
int get_dependence() const
{ return m_dep_size ; }
KOKKOS_INLINE_FUNCTION
void clear_dependence()
{
for ( int i = 0 ; i < m_dep_size ; ++i ) assign( m_dep + i , 0 );
m_dep_size = 0 ;
}
KOKKOS_INLINE_FUNCTION
void add_dependence( TaskMember * before )
{
- if ( ( Kokkos::TASK_STATE_CONSTRUCTING == m_state ||
- Kokkos::TASK_STATE_EXECUTING == m_state ) &&
+ if ( ( Kokkos::Experimental::TASK_STATE_CONSTRUCTING == m_state ||
+ Kokkos::Experimental::TASK_STATE_EXECUTING == m_state ) &&
m_dep_size < m_dep_capacity ) {
assign( m_dep + m_dep_size , before );
++m_dep_size ;
}
else {
throw_error_add_dependence();
}
}
//----------------------------------------
template< class FunctorType , class ResultType >
KOKKOS_INLINE_FUNCTION static
- void apply_single( typename Impl::enable_if< ! Impl::is_same< ResultType , void >::value , TaskMember * >::type t )
+ void apply_single( typename Kokkos::Impl::enable_if< ! Kokkos::Impl::is_same< ResultType , void >::value , TaskMember * >::type t )
{
typedef TaskMember< Kokkos::Serial , ResultType , FunctorType > derived_type ;
// TaskMember< Kokkos::Serial , ResultType , FunctorType >
// : public TaskMember< Kokkos::Serial , ResultType , void >
// , public FunctorType
// { ... };
derived_type & m = * static_cast< derived_type * >( t );
- Impl::FunctorApply< FunctorType , void , ResultType & >::apply( (FunctorType &) m , & m.m_result );
+ Kokkos::Impl::FunctorApply< FunctorType , void , ResultType & >::apply( (FunctorType &) m , & m.m_result );
}
template< class FunctorType , class ResultType >
KOKKOS_INLINE_FUNCTION static
- void apply_single( typename Impl::enable_if< Impl::is_same< ResultType , void >::value , TaskMember * >::type t )
+ void apply_single( typename Kokkos::Impl::enable_if< Kokkos::Impl::is_same< ResultType , void >::value , TaskMember * >::type t )
+ {
+ typedef TaskMember< Kokkos::Serial , ResultType , FunctorType > derived_type ;
+
+ // TaskMember< Kokkos::Serial , ResultType , FunctorType >
+ // : public TaskMember< Kokkos::Serial , ResultType , void >
+ // , public FunctorType
+ // { ... };
+
+ derived_type & m = * static_cast< derived_type * >( t );
+
+ Kokkos::Impl::FunctorApply< FunctorType , void , void >::apply( (FunctorType &) m );
+ }
+
+ //----------------------------------------
+
+ template< class FunctorType , class ResultType >
+ static
+ void apply_team( typename Kokkos::Impl::enable_if< ! Kokkos::Impl::is_same< ResultType , void >::value , TaskMember * >::type t )
{
typedef TaskMember< Kokkos::Serial , ResultType , FunctorType > derived_type ;
+ typedef Kokkos::Impl::SerialTeamMember member_type ;
// TaskMember< Kokkos::Serial , ResultType , FunctorType >
// : public TaskMember< Kokkos::Serial , ResultType , void >
// , public FunctorType
// { ... };
derived_type & m = * static_cast< derived_type * >( t );
- Impl::FunctorApply< FunctorType , void , void >::apply( (FunctorType &) m );
+ m.FunctorType::apply( member_type(0,1,0) , m.m_result );
+ }
+
+ template< class FunctorType , class ResultType >
+ static
+ void apply_team( typename Kokkos::Impl::enable_if< Kokkos::Impl::is_same< ResultType , void >::value , TaskMember * >::type t )
+ {
+ typedef TaskMember< Kokkos::Serial , ResultType , FunctorType > derived_type ;
+ typedef Kokkos::Impl::SerialTeamMember member_type ;
+
+ // TaskMember< Kokkos::Serial , ResultType , FunctorType >
+ // : public TaskMember< Kokkos::Serial , ResultType , void >
+ // , public FunctorType
+ // { ... };
+
+ derived_type & m = * static_cast< derived_type * >( t );
+
+ m.FunctorType::apply( member_type(0,1,0) );
}
};
//----------------------------------------------------------------------------
/** \brief Base class for tasks with a result value in the Serial execution space.
*
* The FunctorType must be void because this class is accessed by the
* Future class for the task and result value.
*
* Must be derived from TaskMember<S,void,void> 'root class' so the Future class
* can correctly static_cast from the 'root class' to this class.
*/
template < class ResultType >
class TaskMember< Kokkos::Serial , ResultType , void >
: public TaskMember< Kokkos::Serial , void , void >
{
public:
ResultType m_result ;
typedef const ResultType & get_result_type ;
KOKKOS_INLINE_FUNCTION
get_result_type get() const { return m_result ; }
protected:
typedef TaskMember< Kokkos::Serial , void , void > task_root_type ;
typedef task_root_type::function_dealloc_type function_dealloc_type ;
typedef task_root_type::function_apply_type function_apply_type ;
inline
TaskMember( const function_dealloc_type arg_dealloc
, const function_apply_type arg_apply
, const unsigned arg_sizeof_derived
, const unsigned arg_dependence_capacity
)
: task_root_type( & task_root_type::template verify_type< ResultType >
, arg_dealloc
, arg_apply
, arg_sizeof_derived
, arg_dependence_capacity )
, m_result()
{}
-
};
template< class ResultType , class FunctorType >
class TaskMember< Kokkos::Serial , ResultType , FunctorType >
: public TaskMember< Kokkos::Serial , ResultType , void >
, public FunctorType
{
public:
typedef FunctorType functor_type ;
typedef TaskMember< Kokkos::Serial , void , void > task_root_type ;
typedef TaskMember< Kokkos::Serial , ResultType , void > task_base_type ;
typedef task_root_type::function_dealloc_type function_dealloc_type ;
typedef task_root_type::function_apply_type function_apply_type ;
inline
TaskMember( const function_dealloc_type arg_dealloc
, const function_apply_type arg_apply
, const unsigned arg_sizeof_derived
, const unsigned arg_dependence_capacity
, const functor_type & arg_functor
)
: task_base_type( arg_dealloc , arg_apply , arg_sizeof_derived , arg_dependence_capacity )
, functor_type( arg_functor )
{}
};
//----------------------------------------------------------------------------
/** \brief ForEach task in the Serial execution space
*
* Derived from TaskMember< Kokkos::Serial , ResultType , FunctorType >
* so that Functor can be cast to task root type without knowing policy.
*/
template< class Arg0 , class Arg1 , class Arg2 , class ResultType , class FunctorType >
class TaskForEach< Kokkos::RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::Serial >
, ResultType
, FunctorType >
: TaskMember< Kokkos::Serial , ResultType , FunctorType >
{
public:
typedef FunctorType functor_type ;
typedef RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::Serial > policy_type ;
private:
- friend class Kokkos::TaskPolicy< Kokkos::Serial > ;
- friend class Kokkos::Impl::TaskMember< Kokkos::Serial , void , void > ;
+ friend class Kokkos::Experimental::TaskPolicy< Kokkos::Serial > ;
+ friend class Kokkos::Experimental::Impl::TaskMember< Kokkos::Serial , void , void > ;
typedef TaskMember< Kokkos::Serial , void , void > task_root_type ;
typedef TaskMember< Kokkos::Serial , ResultType , FunctorType > task_base_type ;
typedef task_root_type::function_dealloc_type function_dealloc_type ;
policy_type m_policy ;
template< class Tag >
inline
- typename Impl::enable_if< Impl::is_same<Tag,void>::value >::type
+ typename Kokkos::Impl::enable_if< Kokkos::Impl::is_same<Tag,void>::value >::type
apply_policy() const
{
const typename policy_type::member_type e = m_policy.end();
for ( typename policy_type::member_type i = m_policy.begin() ; i < e ; ++i ) {
functor_type::operator()(i);
}
}
template< class Tag >
inline
- typename Impl::enable_if< ! Impl::is_same<Tag,void>::value >::type
+ typename Kokkos::Impl::enable_if< ! Kokkos::Impl::is_same<Tag,void>::value >::type
apply_policy() const
{
const Tag tag ;
const typename policy_type::member_type e = m_policy.end();
for ( typename policy_type::member_type i = m_policy.begin() ; i < e ; ++i ) {
functor_type::operator()(tag,i);
}
}
static
void apply_parallel( task_root_type * t )
{
static_cast<TaskForEach*>(t)->template apply_policy< typename policy_type::work_tag >();
task_root_type::template apply_single< functor_type , ResultType >( t );
}
TaskForEach( const function_dealloc_type arg_dealloc
, const int arg_sizeof_derived
, const int arg_dependence_capacity
, const policy_type & arg_policy
, const functor_type & arg_functor
)
: task_base_type( arg_dealloc
, & apply_parallel
, arg_sizeof_derived
, arg_dependence_capacity
, arg_functor )
, m_policy( arg_policy )
{}
TaskForEach() /* = delete */ ;
TaskForEach( const TaskForEach & ) /* = delete */ ;
TaskForEach & operator = ( const TaskForEach & ) /* = delete */ ;
};
//----------------------------------------------------------------------------
/** \brief Reduce task in the Serial execution space
*
* Derived from TaskMember< Kokkos::Serial , ResultType , FunctorType >
* so that Functor can be cast to task root type without knowing policy.
*/
template< class Arg0 , class Arg1 , class Arg2 , class ResultType , class FunctorType >
class TaskReduce< Kokkos::RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::Serial >
, ResultType
, FunctorType >
: TaskMember< Kokkos::Serial , ResultType , FunctorType >
{
public:
typedef FunctorType functor_type ;
typedef RangePolicy< Arg0 , Arg1 , Arg2 , Kokkos::Serial > policy_type ;
private:
- friend class Kokkos::TaskPolicy< Kokkos::Serial > ;
- friend class Kokkos::Impl::TaskMember< Kokkos::Serial , void , void > ;
+ friend class Kokkos::Experimental::TaskPolicy< Kokkos::Serial > ;
+ friend class Kokkos::Experimental::Impl::TaskMember< Kokkos::Serial , void , void > ;
typedef TaskMember< Kokkos::Serial , void , void > task_root_type ;
typedef TaskMember< Kokkos::Serial , ResultType , FunctorType > task_base_type ;
typedef task_root_type::function_dealloc_type function_dealloc_type ;
policy_type m_policy ;
template< class Tag >
inline
- void apply_policy( typename Impl::enable_if< Impl::is_same<Tag,void>::value , ResultType & >::type result ) const
+ void apply_policy( typename Kokkos::Impl::enable_if< Kokkos::Impl::is_same<Tag,void>::value , ResultType & >::type result ) const
{
- Impl::FunctorValueInit< functor_type , Tag >::init( *this , & result );
+ Kokkos::Impl::FunctorValueInit< functor_type , Tag >::init( *this , & result );
const typename policy_type::member_type e = m_policy.end();
for ( typename policy_type::member_type i = m_policy.begin() ; i < e ; ++i ) {
functor_type::operator()( i, result );
}
}
template< class Tag >
inline
- void apply_policy( typename Impl::enable_if< ! Impl::is_same<Tag,void>::value , ResultType & >::type result ) const
+ void apply_policy( typename Kokkos::Impl::enable_if< ! Kokkos::Impl::is_same<Tag,void>::value , ResultType & >::type result ) const
{
- Impl::FunctorValueInit< functor_type , Tag >::init( *this , & result );
+ Kokkos::Impl::FunctorValueInit< functor_type , Tag >::init( *this , & result );
const Tag tag ;
const typename policy_type::member_type e = m_policy.end();
for ( typename policy_type::member_type i = m_policy.begin() ; i < e ; ++i ) {
functor_type::operator()( tag, i, result );
}
}
static
void apply_parallel( task_root_type * t )
{
TaskReduce * const task = static_cast<TaskReduce*>(t);
task->template apply_policy< typename policy_type::work_tag >( task->task_base_type::m_result );
task_root_type::template apply_single< functor_type , ResultType >( t );
}
TaskReduce( const function_dealloc_type arg_dealloc
, const int arg_sizeof_derived
, const int arg_dependence_capacity
, const policy_type & arg_policy
, const functor_type & arg_functor
)
: task_base_type( arg_dealloc
, & apply_parallel
, arg_sizeof_derived
, arg_dependence_capacity
, arg_functor )
, m_policy( arg_policy )
{}
TaskReduce() /* = delete */ ;
TaskReduce( const TaskReduce & ) /* = delete */ ;
TaskReduce & operator = ( const TaskReduce & ) /* = delete */ ;
};
} /* namespace Impl */
+} /* namespace Experimental */
} /* namespace Kokkos */
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
+namespace Experimental {
template<>
class TaskPolicy< Kokkos::Serial >
{
public:
- typedef Kokkos::Serial execution_space ;
+ typedef Kokkos::Serial execution_space ;
+ typedef Kokkos::Impl::SerialTeamMember member_type ;
private:
typedef Impl::TaskMember< execution_space , void , void > task_root_type ;
- TaskPolicy & operator = ( const TaskPolicy & ) /* = delete */ ;
-
template< class FunctorType >
static inline
const task_root_type * get_task_root( const FunctorType * f )
{
typedef Impl::TaskMember< execution_space , typename FunctorType::value_type , FunctorType > task_type ;
return static_cast< const task_root_type * >( static_cast< const task_type * >(f) );
}
template< class FunctorType >
static inline
task_root_type * get_task_root( FunctorType * f )
{
typedef Impl::TaskMember< execution_space , typename FunctorType::value_type , FunctorType > task_type ;
return static_cast< task_root_type * >( static_cast< task_type * >(f) );
}
- const unsigned m_default_dependence_capacity ;
+ unsigned m_default_dependence_capacity ;
public:
KOKKOS_INLINE_FUNCTION
TaskPolicy() : m_default_dependence_capacity(4) {}
KOKKOS_INLINE_FUNCTION
TaskPolicy( const TaskPolicy & rhs ) : m_default_dependence_capacity( rhs.m_default_dependence_capacity ) {}
KOKKOS_INLINE_FUNCTION
explicit
TaskPolicy( const unsigned arg_default_dependence_capacity )
: m_default_dependence_capacity( arg_default_dependence_capacity ) {}
KOKKOS_INLINE_FUNCTION
TaskPolicy( const TaskPolicy &
, const unsigned arg_default_dependence_capacity )
: m_default_dependence_capacity( arg_default_dependence_capacity ) {}
+ TaskPolicy & operator = ( const TaskPolicy &rhs )
+ {
+ m_default_dependence_capacity = rhs.m_default_dependence_capacity;
+ return *this;
+ }
+
//----------------------------------------
template< class ValueType >
KOKKOS_INLINE_FUNCTION
const Future< ValueType , execution_space > &
spawn( const Future< ValueType , execution_space > & f ) const
{
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
f.m_task->schedule();
#endif
return f ;
}
// Create single-thread task
template< class FunctorType >
KOKKOS_INLINE_FUNCTION
Future< typename FunctorType::value_type , execution_space >
create( const FunctorType & functor
, const unsigned dependence_capacity = ~0u ) const
{
typedef typename FunctorType::value_type value_type ;
typedef Impl::TaskMember< execution_space , value_type , FunctorType > task_type ;
return Future< value_type , execution_space >(
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
task_root_type::create< task_type >(
functor , ( ~0u == dependence_capacity ? m_default_dependence_capacity : dependence_capacity ) )
#endif
);
}
+ template< class FunctorType >
+ KOKKOS_INLINE_FUNCTION
+ Future< typename FunctorType::value_type , execution_space >
+ create_team( const FunctorType & functor
+ , const unsigned dependence_capacity = ~0u ) const
+ {
+ typedef typename FunctorType::value_type value_type ;
+ typedef Impl::TaskMember< execution_space , value_type , FunctorType > task_type ;
+ return Future< value_type , execution_space >(
+#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
+ task_root_type::create_team< task_type >(
+ functor , ( ~0u == dependence_capacity ? m_default_dependence_capacity : dependence_capacity ) )
+#endif
+ );
+ }
+
// Create parallel foreach task
template< class PolicyType , class FunctorType >
KOKKOS_INLINE_FUNCTION
Future< typename FunctorType::value_type , execution_space >
create_foreach( const PolicyType & policy
, const FunctorType & functor
, const unsigned dependence_capacity = ~0u ) const
{
typedef typename FunctorType::value_type value_type ;
typedef Impl::TaskForEach< PolicyType , value_type , FunctorType > task_type ;
return Future< value_type , execution_space >(
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
task_root_type::create< task_type >( policy , functor ,
( ~0u == dependence_capacity ? m_default_dependence_capacity : dependence_capacity ) )
#endif
);
}
// Create parallel reduce task
template< class PolicyType , class FunctorType >
KOKKOS_INLINE_FUNCTION
Future< typename FunctorType::value_type , execution_space >
create_reduce( const PolicyType & policy
, const FunctorType & functor
, const unsigned dependence_capacity = ~0u ) const
{
typedef typename FunctorType::value_type value_type ;
typedef Impl::TaskReduce< PolicyType , value_type , FunctorType > task_type ;
return Future< value_type , execution_space >(
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
task_root_type::create< task_type >( policy , functor ,
( ~0u == dependence_capacity ? m_default_dependence_capacity : dependence_capacity ) )
#endif
);
}
// Add dependence
template< class A1 , class A2 , class A3 , class A4 >
KOKKOS_INLINE_FUNCTION
void add_dependence( const Future<A1,A2> & after
, const Future<A3,A4> & before
- , typename Impl::enable_if
- < Impl::is_same< typename Future<A1,A2>::execution_space , execution_space >::value
+ , typename Kokkos::Impl::enable_if
+ < Kokkos::Impl::is_same< typename Future<A1,A2>::execution_space , execution_space >::value
&&
- Impl::is_same< typename Future<A3,A4>::execution_space , execution_space >::value
+ Kokkos::Impl::is_same< typename Future<A3,A4>::execution_space , execution_space >::value
>::type * = 0
) const
{
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
after.m_task->add_dependence( before.m_task );
#endif
}
//----------------------------------------
// Functions for an executing task functor to query dependences,
// set new dependences, and respawn itself.
template< class FunctorType >
KOKKOS_INLINE_FUNCTION
Future< void , execution_space >
get_dependence( const FunctorType * task_functor , int i ) const
{
return Future<void,execution_space>(
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
get_task_root(task_functor)->get_dependence(i)
#endif
);
}
template< class FunctorType >
KOKKOS_INLINE_FUNCTION
int get_dependence( const FunctorType * task_functor ) const
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
{ return get_task_root(task_functor)->get_dependence(); }
#else
{ return 0 ; }
#endif
template< class FunctorType >
KOKKOS_INLINE_FUNCTION
void clear_dependence( FunctorType * task_functor ) const
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
{ get_task_root(task_functor)->clear_dependence(); }
#else
{}
#endif
template< class FunctorType , class A3 , class A4 >
KOKKOS_INLINE_FUNCTION
void add_dependence( FunctorType * task_functor
, const Future<A3,A4> & before
- , typename Impl::enable_if
- < Impl::is_same< typename Future<A3,A4>::execution_space , execution_space >::value
+ , typename Kokkos::Impl::enable_if
+ < Kokkos::Impl::is_same< typename Future<A3,A4>::execution_space , execution_space >::value
>::type * = 0
) const
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
{ get_task_root(task_functor)->add_dependence( before.m_task ); }
#else
{}
#endif
template< class FunctorType >
KOKKOS_INLINE_FUNCTION
void respawn( FunctorType * task_functor ) const
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
{ get_task_root(task_functor)->schedule(); }
#else
{}
#endif
+
+ //----------------------------------------
+
+ static member_type & member_single();
};
inline
void wait( TaskPolicy< Kokkos::Serial > & )
{ Impl::TaskMember< Kokkos::Serial , void , void >::execute_ready_tasks(); }
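// Editorial usage sketch (not part of the patch): a minimal serial task with
// this experimental API might look roughly as follows. The functor shape is an
// assumption based on apply_single()/FunctorApply above -- it exposes a
// value_type and an apply() taking a reference to the result -- and
// Future::get() is assumed to be the usual result accessor.
//
//   struct Work {
//     typedef long value_type ;
//     long n ;
//     void apply( long & result ) const { result = 2 * n ; }
//   };
//
//   Work work ;  work.n = 10 ;
//   Kokkos::Experimental::TaskPolicy< Kokkos::Serial > policy ;
//   Kokkos::Experimental::Future< long , Kokkos::Serial > f =
//     policy.spawn( policy.create( work ) );
//   Kokkos::Experimental::wait( policy );   // drains the ready list
//   long value = f.get();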
-inline
-void wait( const Future< void , Kokkos::Serial > & future )
-{ Impl::TaskMember< Kokkos::Serial , void , void >::wait( future ); }
-
+} /* namespace Experimental */
} // namespace Kokkos
//----------------------------------------------------------------------------
#endif /* defined( KOKKOS_HAVE_SERIAL ) */
#endif /* #define KOKKOS_SERIAL_TASKPOLICY_HPP */
diff --git a/lib/kokkos/core/src/impl/Kokkos_Shape.cpp b/lib/kokkos/core/src/impl/Kokkos_Shape.cpp
index 062946b39..da12db1f3 100755
--- a/lib/kokkos/core/src/impl/Kokkos_Shape.cpp
+++ b/lib/kokkos/core/src/impl/Kokkos_Shape.cpp
@@ -1,178 +1,178 @@
/*
//@HEADER
// ************************************************************************
//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
//
// ************************************************************************
//@HEADER
*/
#include <sstream>
#include <impl/Kokkos_Error.hpp>
#include <impl/Kokkos_Shape.hpp>
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
void assert_counts_are_equal_throw(
const size_t x_count ,
const size_t y_count )
{
std::ostringstream msg ;
msg << "Kokkos::Impl::assert_counts_are_equal_throw( "
<< x_count << " != " << y_count << " )" ;
throw_runtime_exception( msg.str() );
}
void assert_shapes_are_equal_throw(
const unsigned x_scalar_size ,
const unsigned x_rank ,
const size_t x_N0 , const unsigned x_N1 ,
const unsigned x_N2 , const unsigned x_N3 ,
const unsigned x_N4 , const unsigned x_N5 ,
const unsigned x_N6 , const unsigned x_N7 ,
const unsigned y_scalar_size ,
const unsigned y_rank ,
const size_t y_N0 , const unsigned y_N1 ,
const unsigned y_N2 , const unsigned y_N3 ,
const unsigned y_N4 , const unsigned y_N5 ,
const unsigned y_N6 , const unsigned y_N7 )
{
std::ostringstream msg ;
msg << "Kokkos::Impl::assert_shape_are_equal_throw( {"
<< " scalar_size(" << x_scalar_size
<< ") rank(" << x_rank
<< ") dimension(" ;
if ( 0 < x_rank ) { msg << " " << x_N0 ; }
if ( 1 < x_rank ) { msg << " " << x_N1 ; }
if ( 2 < x_rank ) { msg << " " << x_N2 ; }
if ( 3 < x_rank ) { msg << " " << x_N3 ; }
if ( 4 < x_rank ) { msg << " " << x_N4 ; }
if ( 5 < x_rank ) { msg << " " << x_N5 ; }
if ( 6 < x_rank ) { msg << " " << x_N6 ; }
if ( 7 < x_rank ) { msg << " " << x_N7 ; }
msg << " ) } != { "
<< " scalar_size(" << y_scalar_size
<< ") rank(" << y_rank
<< ") dimension(" ;
if ( 0 < y_rank ) { msg << " " << y_N0 ; }
if ( 1 < y_rank ) { msg << " " << y_N1 ; }
if ( 2 < y_rank ) { msg << " " << y_N2 ; }
if ( 3 < y_rank ) { msg << " " << y_N3 ; }
if ( 4 < y_rank ) { msg << " " << y_N4 ; }
if ( 5 < y_rank ) { msg << " " << y_N5 ; }
if ( 6 < y_rank ) { msg << " " << y_N6 ; }
if ( 7 < y_rank ) { msg << " " << y_N7 ; }
msg << " ) } )" ;
throw_runtime_exception( msg.str() );
}
void AssertShapeBoundsAbort< Kokkos::HostSpace >::apply(
const size_t rank ,
const size_t n0 , const size_t n1 ,
const size_t n2 , const size_t n3 ,
const size_t n4 , const size_t n5 ,
const size_t n6 , const size_t n7 ,
const size_t arg_rank ,
const size_t i0 , const size_t i1 ,
const size_t i2 , const size_t i3 ,
const size_t i4 , const size_t i5 ,
const size_t i6 , const size_t i7 )
{
std::ostringstream msg ;
msg << "Kokkos::Impl::AssertShapeBoundsAbort( shape = {" ;
if ( 0 < rank ) { msg << " " << n0 ; }
if ( 1 < rank ) { msg << " " << n1 ; }
if ( 2 < rank ) { msg << " " << n2 ; }
if ( 3 < rank ) { msg << " " << n3 ; }
if ( 4 < rank ) { msg << " " << n4 ; }
if ( 5 < rank ) { msg << " " << n5 ; }
if ( 6 < rank ) { msg << " " << n6 ; }
if ( 7 < rank ) { msg << " " << n7 ; }
msg << " } index = {" ;
if ( 0 < arg_rank ) { msg << " " << i0 ; }
if ( 1 < arg_rank ) { msg << " " << i1 ; }
if ( 2 < arg_rank ) { msg << " " << i2 ; }
if ( 3 < arg_rank ) { msg << " " << i3 ; }
if ( 4 < arg_rank ) { msg << " " << i4 ; }
if ( 5 < arg_rank ) { msg << " " << i5 ; }
if ( 6 < arg_rank ) { msg << " " << i6 ; }
if ( 7 < arg_rank ) { msg << " " << i7 ; }
msg << " } )" ;
throw_runtime_exception( msg.str() );
}
void assert_shape_effective_rank1_at_leastN_throw(
const size_t x_rank , const size_t x_N0 ,
const size_t x_N1 , const size_t x_N2 ,
const size_t x_N3 , const size_t x_N4 ,
const size_t x_N5 , const size_t x_N6 ,
const size_t x_N7 ,
const size_t N0 )
{
std::ostringstream msg ;
msg << "Kokkos::Impl::assert_shape_effective_rank1_at_leastN_throw( shape = {" ;
if ( 0 < x_rank ) { msg << " " << x_N0 ; }
if ( 1 < x_rank ) { msg << " " << x_N1 ; }
if ( 2 < x_rank ) { msg << " " << x_N2 ; }
if ( 3 < x_rank ) { msg << " " << x_N3 ; }
if ( 4 < x_rank ) { msg << " " << x_N4 ; }
if ( 5 < x_rank ) { msg << " " << x_N5 ; }
if ( 6 < x_rank ) { msg << " " << x_N6 ; }
if ( 7 < x_rank ) { msg << " " << x_N7 ; }
msg << " } N = " << N0 << " )" ;
throw_runtime_exception( msg.str() );
}
}
}
diff --git a/lib/kokkos/core/src/impl/Kokkos_Shape.hpp b/lib/kokkos/core/src/impl/Kokkos_Shape.hpp
index 73be5717a..dba730127 100755
--- a/lib/kokkos/core/src/impl/Kokkos_Shape.hpp
+++ b/lib/kokkos/core/src/impl/Kokkos_Shape.hpp
@@ -1,917 +1,917 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_SHAPE_HPP
#define KOKKOS_SHAPE_HPP
#include <typeinfo>
#include <utility>
#include <Kokkos_Core_fwd.hpp>
#include <impl/Kokkos_Traits.hpp>
#include <impl/Kokkos_StaticAssert.hpp>
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
//----------------------------------------------------------------------------
/** \brief The shape of a Kokkos array with dynamic and static dimensions.
* Dynamic dimensions are member values and static dimensions are
* 'static const' values.
*
* The upper bound on the array rank is eight.
*/
template< unsigned ScalarSize ,
unsigned Rank ,
unsigned s0 = 1 ,
unsigned s1 = 1 ,
unsigned s2 = 1 ,
unsigned s3 = 1 ,
unsigned s4 = 1 ,
unsigned s5 = 1 ,
unsigned s6 = 1 ,
unsigned s7 = 1 >
struct Shape ;
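// Editorial examples (illustrative, derived from the specializations below):
// leading zeros in the dimension pack mark the runtime ("dynamic") ranks,
// while non-zero entries are compile-time dimensions. Assuming an 8-byte
// scalar:
//
//   Shape< 8 , 3 , 0 , 0 , 5 >  // rank 3, rank_dynamic 2:
//                               //   N0 and N1 are data members, N2 == 5 is an enum
//   Shape< 8 , 2 , 10 , 20 >    // rank 2, fully static: N0 == 10, N1 == 20
//   Shape< 8 , 1 , 0 >          // rank 1, N0 is a size_t so it may exceed 2^32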
//----------------------------------------------------------------------------
/** \brief Shape equality if the value type, layout, and dimensions
* are equal.
*/
template< unsigned xSize , unsigned xRank ,
unsigned xN0 , unsigned xN1 , unsigned xN2 , unsigned xN3 ,
unsigned xN4 , unsigned xN5 , unsigned xN6 , unsigned xN7 ,
unsigned ySize , unsigned yRank ,
unsigned yN0 , unsigned yN1 , unsigned yN2 , unsigned yN3 ,
unsigned yN4 , unsigned yN5 , unsigned yN6 , unsigned yN7 >
KOKKOS_INLINE_FUNCTION
bool operator == ( const Shape<xSize,xRank,xN0,xN1,xN2,xN3,xN4,xN5,xN6,xN7> & x ,
const Shape<ySize,yRank,yN0,yN1,yN2,yN3,yN4,yN5,yN6,yN7> & y )
{
enum { same_size = xSize == ySize };
enum { same_rank = xRank == yRank };
return same_size && same_rank &&
size_t( x.N0 ) == size_t( y.N0 ) &&
unsigned( x.N1 ) == unsigned( y.N1 ) &&
unsigned( x.N2 ) == unsigned( y.N2 ) &&
unsigned( x.N3 ) == unsigned( y.N3 ) &&
unsigned( x.N4 ) == unsigned( y.N4 ) &&
unsigned( x.N5 ) == unsigned( y.N5 ) &&
unsigned( x.N6 ) == unsigned( y.N6 ) &&
unsigned( x.N7 ) == unsigned( y.N7 ) ;
}
template< unsigned xSize , unsigned xRank ,
unsigned xN0 , unsigned xN1 , unsigned xN2 , unsigned xN3 ,
unsigned xN4 , unsigned xN5 , unsigned xN6 , unsigned xN7 ,
unsigned ySize ,unsigned yRank ,
unsigned yN0 , unsigned yN1 , unsigned yN2 , unsigned yN3 ,
unsigned yN4 , unsigned yN5 , unsigned yN6 , unsigned yN7 >
KOKKOS_INLINE_FUNCTION
bool operator != ( const Shape<xSize,xRank,xN0,xN1,xN2,xN3,xN4,xN5,xN6,xN7> & x ,
const Shape<ySize,yRank,yN0,yN1,yN2,yN3,yN4,yN5,yN6,yN7> & y )
{ return ! operator == ( x , y ); }
//----------------------------------------------------------------------------
void assert_counts_are_equal_throw(
const size_t x_count ,
const size_t y_count );
inline
void assert_counts_are_equal(
const size_t x_count ,
const size_t y_count )
{
if ( x_count != y_count ) {
assert_counts_are_equal_throw( x_count , y_count );
}
}
void assert_shapes_are_equal_throw(
const unsigned x_scalar_size ,
const unsigned x_rank ,
const size_t x_N0 , const unsigned x_N1 ,
const unsigned x_N2 , const unsigned x_N3 ,
const unsigned x_N4 , const unsigned x_N5 ,
const unsigned x_N6 , const unsigned x_N7 ,
const unsigned y_scalar_size ,
const unsigned y_rank ,
const size_t y_N0 , const unsigned y_N1 ,
const unsigned y_N2 , const unsigned y_N3 ,
const unsigned y_N4 , const unsigned y_N5 ,
const unsigned y_N6 , const unsigned y_N7 );
template< unsigned xSize , unsigned xRank ,
unsigned xN0 , unsigned xN1 , unsigned xN2 , unsigned xN3 ,
unsigned xN4 , unsigned xN5 , unsigned xN6 , unsigned xN7 ,
unsigned ySize , unsigned yRank ,
unsigned yN0 , unsigned yN1 , unsigned yN2 , unsigned yN3 ,
unsigned yN4 , unsigned yN5 , unsigned yN6 , unsigned yN7 >
inline
void assert_shapes_are_equal(
const Shape<xSize,xRank,xN0,xN1,xN2,xN3,xN4,xN5,xN6,xN7> & x ,
const Shape<ySize,yRank,yN0,yN1,yN2,yN3,yN4,yN5,yN6,yN7> & y )
{
typedef Shape<xSize,xRank,xN0,xN1,xN2,xN3,xN4,xN5,xN6,xN7> x_type ;
typedef Shape<ySize,yRank,yN0,yN1,yN2,yN3,yN4,yN5,yN6,yN7> y_type ;
if ( x != y ) {
assert_shapes_are_equal_throw(
x_type::scalar_size, x_type::rank, x.N0, x.N1, x.N2, x.N3, x.N4, x.N5, x.N6, x.N7,
y_type::scalar_size, y_type::rank, y.N0, y.N1, y.N2, y.N3, y.N4, y.N5, y.N6, y.N7 );
}
}
template< unsigned xSize , unsigned xRank ,
unsigned xN0 , unsigned xN1 , unsigned xN2 , unsigned xN3 ,
unsigned xN4 , unsigned xN5 , unsigned xN6 , unsigned xN7 ,
unsigned ySize , unsigned yRank ,
unsigned yN0 , unsigned yN1 , unsigned yN2 , unsigned yN3 ,
unsigned yN4 , unsigned yN5 , unsigned yN6 , unsigned yN7 >
void assert_shapes_equal_dimension(
const Shape<xSize,xRank,xN0,xN1,xN2,xN3,xN4,xN5,xN6,xN7> & x ,
const Shape<ySize,yRank,yN0,yN1,yN2,yN3,yN4,yN5,yN6,yN7> & y )
{
typedef Shape<xSize,xRank,xN0,xN1,xN2,xN3,xN4,xN5,xN6,xN7> x_type ;
typedef Shape<ySize,yRank,yN0,yN1,yN2,yN3,yN4,yN5,yN6,yN7> y_type ;
// Omit comparison of scalar_size.
if ( unsigned( x.rank ) != unsigned( y.rank ) ||
size_t( x.N0 ) != size_t( y.N0 ) ||
unsigned( x.N1 ) != unsigned( y.N1 ) ||
unsigned( x.N2 ) != unsigned( y.N2 ) ||
unsigned( x.N3 ) != unsigned( y.N3 ) ||
unsigned( x.N4 ) != unsigned( y.N4 ) ||
unsigned( x.N5 ) != unsigned( y.N5 ) ||
unsigned( x.N6 ) != unsigned( y.N6 ) ||
unsigned( x.N7 ) != unsigned( y.N7 ) ) {
assert_shapes_are_equal_throw(
x_type::scalar_size, x_type::rank, x.N0, x.N1, x.N2, x.N3, x.N4, x.N5, x.N6, x.N7,
y_type::scalar_size, y_type::rank, y.N0, y.N1, y.N2, y.N3, y.N4, y.N5, y.N6, y.N7 );
}
}
//----------------------------------------------------------------------------
template< class ShapeType > struct assert_shape_is_rank_zero ;
template< class ShapeType > struct assert_shape_is_rank_one ;
template< unsigned Size >
struct assert_shape_is_rank_zero< Shape<Size,0> >
: public true_type {};
template< unsigned Size , unsigned s0 >
struct assert_shape_is_rank_one< Shape<Size,1,s0> >
: public true_type {};
//----------------------------------------------------------------------------
/** \brief Array bounds assertion templated on the execution space
* to allow device-specific abort code.
*/
template< class Space >
struct AssertShapeBoundsAbort ;
template<>
struct AssertShapeBoundsAbort< Kokkos::HostSpace >
{
static void apply( const size_t rank ,
const size_t n0 , const size_t n1 ,
const size_t n2 , const size_t n3 ,
const size_t n4 , const size_t n5 ,
const size_t n6 , const size_t n7 ,
const size_t arg_rank ,
const size_t i0 , const size_t i1 ,
const size_t i2 , const size_t i3 ,
const size_t i4 , const size_t i5 ,
const size_t i6 , const size_t i7 );
};
template< class ExecutionSpace >
struct AssertShapeBoundsAbort
{
KOKKOS_INLINE_FUNCTION
static void apply( const size_t rank ,
const size_t n0 , const size_t n1 ,
const size_t n2 , const size_t n3 ,
const size_t n4 , const size_t n5 ,
const size_t n6 , const size_t n7 ,
const size_t arg_rank ,
const size_t i0 , const size_t i1 ,
const size_t i2 , const size_t i3 ,
const size_t i4 , const size_t i5 ,
const size_t i6 , const size_t i7 )
{
AssertShapeBoundsAbort< Kokkos::HostSpace >
::apply( rank , n0 , n1 , n2 , n3 , n4 , n5 , n6 , n7 ,
arg_rank, i0 , i1 , i2 , i3 , i4 , i5 , i6 , i7 );
}
};
template< class ShapeType >
KOKKOS_INLINE_FUNCTION
void assert_shape_bounds( const ShapeType & shape ,
const size_t arg_rank ,
const size_t i0 ,
const size_t i1 = 0 ,
const size_t i2 = 0 ,
const size_t i3 = 0 ,
const size_t i4 = 0 ,
const size_t i5 = 0 ,
const size_t i6 = 0 ,
const size_t i7 = 0 )
{
// Must supply at least as many indices as the rank.
// Every index must be within bounds.
const bool ok = ShapeType::rank <= arg_rank &&
i0 < shape.N0 &&
i1 < shape.N1 &&
i2 < shape.N2 &&
i3 < shape.N3 &&
i4 < shape.N4 &&
i5 < shape.N5 &&
i6 < shape.N6 &&
i7 < shape.N7 ;
if ( ! ok ) {
AssertShapeBoundsAbort< Kokkos::Impl::ActiveExecutionMemorySpace >
::apply( ShapeType::rank ,
shape.N0 , shape.N1 , shape.N2 , shape.N3 ,
shape.N4 , shape.N5 , shape.N6 , shape.N7 ,
arg_rank , i0 , i1 , i2 , i3 , i4 , i5 , i6 , i7 );
}
}
-#if defined( KOKKOS_EXPRESSION_CHECK )
+#if defined( KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK )
#define KOKKOS_ASSERT_SHAPE_BOUNDS_1( S , I0 ) assert_shape_bounds(S,1,I0);
#define KOKKOS_ASSERT_SHAPE_BOUNDS_2( S , I0 , I1 ) assert_shape_bounds(S,2,I0,I1);
#define KOKKOS_ASSERT_SHAPE_BOUNDS_3( S , I0 , I1 , I2 ) assert_shape_bounds(S,3,I0,I1,I2);
#define KOKKOS_ASSERT_SHAPE_BOUNDS_4( S , I0 , I1 , I2 , I3 ) assert_shape_bounds(S,4,I0,I1,I2,I3);
#define KOKKOS_ASSERT_SHAPE_BOUNDS_5( S , I0 , I1 , I2 , I3 , I4 ) assert_shape_bounds(S,5,I0,I1,I2,I3,I4);
#define KOKKOS_ASSERT_SHAPE_BOUNDS_6( S , I0 , I1 , I2 , I3 , I4 , I5 ) assert_shape_bounds(S,6,I0,I1,I2,I3,I4,I5);
#define KOKKOS_ASSERT_SHAPE_BOUNDS_7( S , I0 , I1 , I2 , I3 , I4 , I5 , I6 ) assert_shape_bounds(S,7,I0,I1,I2,I3,I4,I5,I6);
#define KOKKOS_ASSERT_SHAPE_BOUNDS_8( S , I0 , I1 , I2 , I3 , I4 , I5 , I6 , I7 ) assert_shape_bounds(S,8,I0,I1,I2,I3,I4,I5,I6,I7);
#else
#define KOKKOS_ASSERT_SHAPE_BOUNDS_1( S , I0 ) /* */
#define KOKKOS_ASSERT_SHAPE_BOUNDS_2( S , I0 , I1 ) /* */
#define KOKKOS_ASSERT_SHAPE_BOUNDS_3( S , I0 , I1 , I2 ) /* */
#define KOKKOS_ASSERT_SHAPE_BOUNDS_4( S , I0 , I1 , I2 , I3 ) /* */
#define KOKKOS_ASSERT_SHAPE_BOUNDS_5( S , I0 , I1 , I2 , I3 , I4 ) /* */
#define KOKKOS_ASSERT_SHAPE_BOUNDS_6( S , I0 , I1 , I2 , I3 , I4 , I5 ) /* */
#define KOKKOS_ASSERT_SHAPE_BOUNDS_7( S , I0 , I1 , I2 , I3 , I4 , I5 , I6 ) /* */
#define KOKKOS_ASSERT_SHAPE_BOUNDS_8( S , I0 , I1 , I2 , I3 , I4 , I5 , I6 , I7 ) /* */
#endif
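// Editorial note: when KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK is defined, e.g.
//
//   KOKKOS_ASSERT_SHAPE_BOUNDS_2( shape , i , j )
//
// expands to assert_shape_bounds(shape,2,i,j); which aborts (or throws on the
// host) if fewer indices than the rank were supplied or any index is out of
// range. Without the macro defined, the assertions compile away to nothing.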
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
// Specialization and optimization for the Rank 0 shape.
template < unsigned ScalarSize >
struct Shape< ScalarSize , 0, 1,1,1,1, 1,1,1,1 >
{
enum { scalar_size = ScalarSize };
enum { rank_dynamic = 0 };
enum { rank = 0 };
enum { N0 = 1 };
enum { N1 = 1 };
enum { N2 = 1 };
enum { N3 = 1 };
enum { N4 = 1 };
enum { N5 = 1 };
enum { N6 = 1 };
enum { N7 = 1 };
KOKKOS_INLINE_FUNCTION
static
void assign( Shape & ,
unsigned = 0 , unsigned = 0 , unsigned = 0 , unsigned = 0 ,
unsigned = 0 , unsigned = 0 , unsigned = 0 , unsigned = 0 )
{}
};
//----------------------------------------------------------------------------
template< unsigned R > struct assign_shape_dimension ;
#define KOKKOS_ASSIGN_SHAPE_DIMENSION( R ) \
template<> \
struct assign_shape_dimension< R > \
{ \
template< class ShapeType > \
KOKKOS_INLINE_FUNCTION \
assign_shape_dimension( ShapeType & shape \
, typename Impl::enable_if<( R < ShapeType::rank_dynamic ), size_t >::type n \
) { shape.N ## R = n ; } \
};
KOKKOS_ASSIGN_SHAPE_DIMENSION(0)
KOKKOS_ASSIGN_SHAPE_DIMENSION(1)
KOKKOS_ASSIGN_SHAPE_DIMENSION(2)
KOKKOS_ASSIGN_SHAPE_DIMENSION(3)
KOKKOS_ASSIGN_SHAPE_DIMENSION(4)
KOKKOS_ASSIGN_SHAPE_DIMENSION(5)
KOKKOS_ASSIGN_SHAPE_DIMENSION(6)
KOKKOS_ASSIGN_SHAPE_DIMENSION(7)
#undef KOKKOS_ASSIGN_SHAPE_DIMENSION
//----------------------------------------------------------------------------
// All-static dimension array
template < unsigned ScalarSize ,
unsigned Rank ,
unsigned s0 ,
unsigned s1 ,
unsigned s2 ,
unsigned s3 ,
unsigned s4 ,
unsigned s5 ,
unsigned s6 ,
unsigned s7 >
struct Shape {
enum { scalar_size = ScalarSize };
enum { rank_dynamic = 0 };
enum { rank = Rank };
enum { N0 = s0 };
enum { N1 = s1 };
enum { N2 = s2 };
enum { N3 = s3 };
enum { N4 = s4 };
enum { N5 = s5 };
enum { N6 = s6 };
enum { N7 = s7 };
KOKKOS_INLINE_FUNCTION
static
void assign( Shape & ,
unsigned = 0 , unsigned = 0 , unsigned = 0 , unsigned = 0 ,
unsigned = 0 , unsigned = 0 , unsigned = 0 , unsigned = 0 )
{}
};
// 1 == dynamic_rank <= rank <= 8
template < unsigned ScalarSize ,
unsigned Rank ,
unsigned s1 ,
unsigned s2 ,
unsigned s3 ,
unsigned s4 ,
unsigned s5 ,
unsigned s6 ,
unsigned s7 >
struct Shape< ScalarSize , Rank , 0,s1,s2,s3, s4,s5,s6,s7 >
{
enum { scalar_size = ScalarSize };
enum { rank_dynamic = 1 };
enum { rank = Rank };
size_t N0 ; // For 1 == dynamic_rank allow N0 > 2^32
enum { N1 = s1 };
enum { N2 = s2 };
enum { N3 = s3 };
enum { N4 = s4 };
enum { N5 = s5 };
enum { N6 = s6 };
enum { N7 = s7 };
KOKKOS_INLINE_FUNCTION
static
void assign( Shape & s ,
size_t n0 , unsigned = 0 , unsigned = 0 , unsigned = 0 ,
unsigned = 0 , unsigned = 0 , unsigned = 0 , unsigned = 0 )
{ s.N0 = n0 ; }
};
// 2 == dynamic_rank <= rank <= 8
template < unsigned ScalarSize , unsigned Rank ,
unsigned s2 ,
unsigned s3 ,
unsigned s4 ,
unsigned s5 ,
unsigned s6 ,
unsigned s7 >
struct Shape< ScalarSize , Rank , 0,0,s2,s3, s4,s5,s6,s7 >
{
enum { scalar_size = ScalarSize };
enum { rank_dynamic = 2 };
enum { rank = Rank };
unsigned N0 ;
unsigned N1 ;
enum { N2 = s2 };
enum { N3 = s3 };
enum { N4 = s4 };
enum { N5 = s5 };
enum { N6 = s6 };
enum { N7 = s7 };
KOKKOS_INLINE_FUNCTION
static
void assign( Shape & s ,
unsigned n0 , unsigned n1 , unsigned = 0 , unsigned = 0 ,
unsigned = 0 , unsigned = 0 , unsigned = 0 , unsigned = 0 )
{ s.N0 = n0 ; s.N1 = n1 ; }
};
// 3 == dynamic_rank <= rank <= 8
template < unsigned ScalarSize , unsigned Rank ,
unsigned s3 ,
unsigned s4 ,
unsigned s5 ,
unsigned s6 ,
unsigned s7 >
struct Shape< ScalarSize , Rank , 0,0,0,s3, s4,s5,s6,s7>
{
enum { scalar_size = ScalarSize };
enum { rank_dynamic = 3 };
enum { rank = Rank };
unsigned N0 ;
unsigned N1 ;
unsigned N2 ;
enum { N3 = s3 };
enum { N4 = s4 };
enum { N5 = s5 };
enum { N6 = s6 };
enum { N7 = s7 };
KOKKOS_INLINE_FUNCTION
static
void assign( Shape & s ,
unsigned n0 , unsigned n1 , unsigned n2 , unsigned = 0 ,
unsigned = 0 , unsigned = 0 , unsigned = 0 , unsigned = 0 )
{ s.N0 = n0 ; s.N1 = n1 ; s.N2 = n2 ; }
};
// 4 == dynamic_rank <= rank <= 8
template < unsigned ScalarSize , unsigned Rank ,
unsigned s4 ,
unsigned s5 ,
unsigned s6 ,
unsigned s7 >
struct Shape< ScalarSize , Rank, 0,0,0,0, s4,s5,s6,s7 >
{
enum { scalar_size = ScalarSize };
enum { rank_dynamic = 4 };
enum { rank = Rank };
unsigned N0 ;
unsigned N1 ;
unsigned N2 ;
unsigned N3 ;
enum { N4 = s4 };
enum { N5 = s5 };
enum { N6 = s6 };
enum { N7 = s7 };
KOKKOS_INLINE_FUNCTION
static
void assign( Shape & s ,
unsigned n0 , unsigned n1 , unsigned n2 , unsigned n3 ,
unsigned = 0 , unsigned = 0 , unsigned = 0 , unsigned = 0 )
{ s.N0 = n0 ; s.N1 = n1 ; s.N2 = n2 ; s.N3 = n3 ; }
};
// 5 == dynamic_rank <= rank <= 8
template < unsigned ScalarSize , unsigned Rank ,
unsigned s5 ,
unsigned s6 ,
unsigned s7 >
struct Shape< ScalarSize , Rank , 0,0,0,0, 0,s5,s6,s7 >
{
enum { scalar_size = ScalarSize };
enum { rank_dynamic = 5 };
enum { rank = Rank };
unsigned N0 ;
unsigned N1 ;
unsigned N2 ;
unsigned N3 ;
unsigned N4 ;
enum { N5 = s5 };
enum { N6 = s6 };
enum { N7 = s7 };
KOKKOS_INLINE_FUNCTION
static
void assign( Shape & s ,
unsigned n0 , unsigned n1 , unsigned n2 , unsigned n3 ,
unsigned n4 , unsigned = 0 , unsigned = 0 , unsigned = 0 )
{ s.N0 = n0 ; s.N1 = n1 ; s.N2 = n2 ; s.N3 = n3 ; s.N4 = n4 ; }
};
// 6 == dynamic_rank <= rank <= 8
template < unsigned ScalarSize , unsigned Rank ,
unsigned s6 ,
unsigned s7 >
struct Shape< ScalarSize , Rank , 0,0,0,0, 0,0,s6,s7 >
{
enum { scalar_size = ScalarSize };
enum { rank_dynamic = 6 };
enum { rank = Rank };
unsigned N0 ;
unsigned N1 ;
unsigned N2 ;
unsigned N3 ;
unsigned N4 ;
unsigned N5 ;
enum { N6 = s6 };
enum { N7 = s7 };
KOKKOS_INLINE_FUNCTION
static
void assign( Shape & s ,
unsigned n0 , unsigned n1 , unsigned n2 , unsigned n3 ,
unsigned n4 , unsigned n5 = 0 , unsigned = 0 , unsigned = 0 )
{
s.N0 = n0 ; s.N1 = n1 ; s.N2 = n2 ; s.N3 = n3 ;
s.N4 = n4 ; s.N5 = n5 ;
}
};
// 7 == dynamic_rank <= rank <= 8
template < unsigned ScalarSize , unsigned Rank ,
unsigned s7 >
struct Shape< ScalarSize , Rank , 0,0,0,0, 0,0,0,s7 >
{
enum { scalar_size = ScalarSize };
enum { rank_dynamic = 7 };
enum { rank = Rank };
unsigned N0 ;
unsigned N1 ;
unsigned N2 ;
unsigned N3 ;
unsigned N4 ;
unsigned N5 ;
unsigned N6 ;
enum { N7 = s7 };
KOKKOS_INLINE_FUNCTION
static
void assign( Shape & s ,
unsigned n0 , unsigned n1 , unsigned n2 , unsigned n3 ,
unsigned n4 , unsigned n5 , unsigned n6 , unsigned = 0 )
{
s.N0 = n0 ; s.N1 = n1 ; s.N2 = n2 ; s.N3 = n3 ;
s.N4 = n4 ; s.N5 = n5 ; s.N6 = n6 ;
}
};
// 8 == dynamic_rank <= rank <= 8
template < unsigned ScalarSize >
struct Shape< ScalarSize , 8 , 0,0,0,0, 0,0,0,0 >
{
enum { scalar_size = ScalarSize };
enum { rank_dynamic = 8 };
enum { rank = 8 };
unsigned N0 ;
unsigned N1 ;
unsigned N2 ;
unsigned N3 ;
unsigned N4 ;
unsigned N5 ;
unsigned N6 ;
unsigned N7 ;
KOKKOS_INLINE_FUNCTION
static
void assign( Shape & s ,
unsigned n0 , unsigned n1 , unsigned n2 , unsigned n3 ,
unsigned n4 , unsigned n5 , unsigned n6 , unsigned n7 )
{
s.N0 = n0 ; s.N1 = n1 ; s.N2 = n2 ; s.N3 = n3 ;
s.N4 = n4 ; s.N5 = n5 ; s.N6 = n6 ; s.N7 = n7 ;
}
};
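//----------------------------------------------------------------------------
// A minimal usage sketch of the Shape specializations above, assuming an
// 8-byte scalar (e.g. double).  Leading zero extents select how many ranks
// are runtime values; assign() stores only those, the rest remain enum values.
//
//   typedef Shape< 8 , 2 , 0,3, 1,1,1,1,1,1 > shape_type ;  // rank 2, N0 dynamic, N1 = 3
//   shape_type s ;
//   shape_type::assign( s , 100 );  // s.N0 == 100 ; N1 remains the compile-time 3
//   // shape_type::rank == 2 , shape_type::rank_dynamic == 1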
//----------------------------------------------------------------------------
template< class ShapeType , unsigned N ,
unsigned R = ShapeType::rank_dynamic >
struct ShapeInsert ;
template< class ShapeType , unsigned N >
struct ShapeInsert< ShapeType , N , 0 >
{
typedef Shape< ShapeType::scalar_size ,
ShapeType::rank + 1 ,
N ,
ShapeType::N0 ,
ShapeType::N1 ,
ShapeType::N2 ,
ShapeType::N3 ,
ShapeType::N4 ,
ShapeType::N5 ,
ShapeType::N6 > type ;
};
template< class ShapeType , unsigned N >
struct ShapeInsert< ShapeType , N , 1 >
{
typedef Shape< ShapeType::scalar_size ,
ShapeType::rank + 1 ,
0 ,
N ,
ShapeType::N1 ,
ShapeType::N2 ,
ShapeType::N3 ,
ShapeType::N4 ,
ShapeType::N5 ,
ShapeType::N6 > type ;
};
template< class ShapeType , unsigned N >
struct ShapeInsert< ShapeType , N , 2 >
{
typedef Shape< ShapeType::scalar_size ,
ShapeType::rank + 1 ,
0 ,
0 ,
N ,
ShapeType::N2 ,
ShapeType::N3 ,
ShapeType::N4 ,
ShapeType::N5 ,
ShapeType::N6 > type ;
};
template< class ShapeType , unsigned N >
struct ShapeInsert< ShapeType , N , 3 >
{
typedef Shape< ShapeType::scalar_size ,
ShapeType::rank + 1 ,
0 ,
0 ,
0 ,
N ,
ShapeType::N3 ,
ShapeType::N4 ,
ShapeType::N5 ,
ShapeType::N6 > type ;
};
template< class ShapeType , unsigned N >
struct ShapeInsert< ShapeType , N , 4 >
{
typedef Shape< ShapeType::scalar_size ,
ShapeType::rank + 1 ,
0 ,
0 ,
0 ,
0 ,
N ,
ShapeType::N4 ,
ShapeType::N5 ,
ShapeType::N6 > type ;
};
template< class ShapeType , unsigned N >
struct ShapeInsert< ShapeType , N , 5 >
{
typedef Shape< ShapeType::scalar_size ,
ShapeType::rank + 1 ,
0 ,
0 ,
0 ,
0 ,
0 ,
N ,
ShapeType::N5 ,
ShapeType::N6 > type ;
};
template< class ShapeType , unsigned N >
struct ShapeInsert< ShapeType , N , 6 >
{
typedef Shape< ShapeType::scalar_size ,
ShapeType::rank + 1 ,
0 ,
0 ,
0 ,
0 ,
0 ,
0 ,
N ,
ShapeType::N6 > type ;
};
template< class ShapeType , unsigned N >
struct ShapeInsert< ShapeType , N , 7 >
{
typedef Shape< ShapeType::scalar_size ,
ShapeType::rank + 1 ,
0 ,
0 ,
0 ,
0 ,
0 ,
0 ,
0 ,
N > type ;
};
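//----------------------------------------------------------------------------
// A minimal sketch of what ShapeInsert computes: it is used while peeling
// array extents off a data type, adding one dimension per step.  A new static
// extent is placed after the existing dynamic ranks, so dynamic extents stay
// leading.  For example, building the shape of double*[3]:
//
//   typedef Shape< 8 , 0 , 1,1,1,1, 1,1,1,1 >  rank0 ;  // scalar
//   typedef ShapeInsert< rank0 , 3 >::type     rank1 ;  // double[3]
//   typedef ShapeInsert< rank1 , 0 >::type     rank2 ;  // double*[3] : N0 dynamic, N1 == 3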
//----------------------------------------------------------------------------
template< class DstShape , class SrcShape ,
unsigned DstRankDynamic = DstShape::rank_dynamic ,
bool DstRankDynamicOK = unsigned(DstShape::rank_dynamic) >= unsigned(SrcShape::rank_dynamic) >
struct ShapeCompatible { enum { value = false }; };
template< class DstShape , class SrcShape >
struct ShapeCompatible< DstShape , SrcShape , 8 , true >
{
enum { value = unsigned(DstShape::scalar_size) == unsigned(SrcShape::scalar_size) };
};
template< class DstShape , class SrcShape >
struct ShapeCompatible< DstShape , SrcShape , 7 , true >
{
enum { value = unsigned(DstShape::scalar_size) == unsigned(SrcShape::scalar_size) &&
unsigned(DstShape::N7) == unsigned(SrcShape::N7) };
};
template< class DstShape , class SrcShape >
struct ShapeCompatible< DstShape , SrcShape , 6 , true >
{
enum { value = unsigned(DstShape::scalar_size) == unsigned(SrcShape::scalar_size) &&
unsigned(DstShape::N6) == unsigned(SrcShape::N6) &&
unsigned(DstShape::N7) == unsigned(SrcShape::N7) };
};
template< class DstShape , class SrcShape >
struct ShapeCompatible< DstShape , SrcShape , 5 , true >
{
enum { value = unsigned(DstShape::scalar_size) == unsigned(SrcShape::scalar_size) &&
unsigned(DstShape::N5) == unsigned(SrcShape::N5) &&
unsigned(DstShape::N6) == unsigned(SrcShape::N6) &&
unsigned(DstShape::N7) == unsigned(SrcShape::N7) };
};
template< class DstShape , class SrcShape >
struct ShapeCompatible< DstShape , SrcShape , 4 , true >
{
enum { value = unsigned(DstShape::scalar_size) == unsigned(SrcShape::scalar_size) &&
unsigned(DstShape::N4) == unsigned(SrcShape::N4) &&
unsigned(DstShape::N5) == unsigned(SrcShape::N5) &&
unsigned(DstShape::N6) == unsigned(SrcShape::N6) &&
unsigned(DstShape::N7) == unsigned(SrcShape::N7) };
};
template< class DstShape , class SrcShape >
struct ShapeCompatible< DstShape , SrcShape , 3 , true >
{
enum { value = unsigned(DstShape::scalar_size) == unsigned(SrcShape::scalar_size) &&
unsigned(DstShape::N3) == unsigned(SrcShape::N3) &&
unsigned(DstShape::N4) == unsigned(SrcShape::N4) &&
unsigned(DstShape::N5) == unsigned(SrcShape::N5) &&
unsigned(DstShape::N6) == unsigned(SrcShape::N6) &&
unsigned(DstShape::N7) == unsigned(SrcShape::N7) };
};
template< class DstShape , class SrcShape >
struct ShapeCompatible< DstShape , SrcShape , 2 , true >
{
enum { value = unsigned(DstShape::scalar_size) == unsigned(SrcShape::scalar_size) &&
unsigned(DstShape::N2) == unsigned(SrcShape::N2) &&
unsigned(DstShape::N3) == unsigned(SrcShape::N3) &&
unsigned(DstShape::N4) == unsigned(SrcShape::N4) &&
unsigned(DstShape::N5) == unsigned(SrcShape::N5) &&
unsigned(DstShape::N6) == unsigned(SrcShape::N6) &&
unsigned(DstShape::N7) == unsigned(SrcShape::N7) };
};
template< class DstShape , class SrcShape >
struct ShapeCompatible< DstShape , SrcShape , 1 , true >
{
enum { value = unsigned(DstShape::scalar_size) == unsigned(SrcShape::scalar_size) &&
unsigned(DstShape::N1) == unsigned(SrcShape::N1) &&
unsigned(DstShape::N2) == unsigned(SrcShape::N2) &&
unsigned(DstShape::N3) == unsigned(SrcShape::N3) &&
unsigned(DstShape::N4) == unsigned(SrcShape::N4) &&
unsigned(DstShape::N5) == unsigned(SrcShape::N5) &&
unsigned(DstShape::N6) == unsigned(SrcShape::N6) &&
unsigned(DstShape::N7) == unsigned(SrcShape::N7) };
};
template< class DstShape , class SrcShape >
struct ShapeCompatible< DstShape , SrcShape , 0 , true >
{
enum { value = unsigned(DstShape::scalar_size) == unsigned(SrcShape::scalar_size) &&
unsigned(DstShape::N0) == unsigned(SrcShape::N0) &&
unsigned(DstShape::N1) == unsigned(SrcShape::N1) &&
unsigned(DstShape::N2) == unsigned(SrcShape::N2) &&
unsigned(DstShape::N3) == unsigned(SrcShape::N3) &&
unsigned(DstShape::N4) == unsigned(SrcShape::N4) &&
unsigned(DstShape::N5) == unsigned(SrcShape::N5) &&
unsigned(DstShape::N6) == unsigned(SrcShape::N6) &&
unsigned(DstShape::N7) == unsigned(SrcShape::N7) };
};
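//----------------------------------------------------------------------------
// A minimal sketch of what ShapeCompatible answers: a source shape can be
// assigned to a destination shape when the destination has at least as many
// dynamic ranks and every remaining static extent matches exactly.
//
//   typedef Shape< 8 , 2 , 0,0, 1,1,1,1,1,1 > dyn_dyn ;  // e.g. double**
//   typedef Shape< 8 , 2 , 0,3, 1,1,1,1,1,1 > dyn_3   ;  // e.g. double*[3]
//   // ShapeCompatible< dyn_dyn , dyn_3   >::value == true  (static 3 absorbed by a dynamic rank)
//   // ShapeCompatible< dyn_3   , dyn_dyn >::value == false (destination has fewer dynamic ranks)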
} /* namespace Impl */
} /* namespace Kokkos */
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
template< unsigned ScalarSize , unsigned Rank ,
unsigned s0 , unsigned s1 , unsigned s2 , unsigned s3 ,
unsigned s4 , unsigned s5 , unsigned s6 , unsigned s7 ,
typename iType >
KOKKOS_INLINE_FUNCTION
size_t dimension(
const Shape<ScalarSize,Rank,s0,s1,s2,s3,s4,s5,s6,s7> & shape ,
const iType & r )
{
return 0 == r ? shape.N0 : (
1 == r ? shape.N1 : (
2 == r ? shape.N2 : (
3 == r ? shape.N3 : (
4 == r ? shape.N4 : (
5 == r ? shape.N5 : (
6 == r ? shape.N6 : (
7 == r ? shape.N7 : 1 )))))));
}
template< unsigned ScalarSize , unsigned Rank ,
unsigned s0 , unsigned s1 , unsigned s2 , unsigned s3 ,
unsigned s4 , unsigned s5 , unsigned s6 , unsigned s7 >
KOKKOS_INLINE_FUNCTION
size_t cardinality_count(
const Shape<ScalarSize,Rank,s0,s1,s2,s3,s4,s5,s6,s7> & shape )
{
return size_t(shape.N0) * shape.N1 * shape.N2 * shape.N3 *
shape.N4 * shape.N5 * shape.N6 * shape.N7 ;
}
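//----------------------------------------------------------------------------
// A minimal sketch of the two helpers above, for a rank-2 shape with N0
// assigned at runtime and N1 fixed at 3:
//
//   Shape< 8 , 2 , 0,3, 1,1,1,1,1,1 > s ;
//   Shape< 8 , 2 , 0,3, 1,1,1,1,1,1 >::assign( s , 100 );
//   dimension( s , 0 );      // 100, the runtime extent
//   dimension( s , 1 );      // 3, the compile-time extent
//   cardinality_count( s );  // 300, the total number of scalars spanned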
//----------------------------------------------------------------------------
} /* namespace Impl */
} /* namespace Kokkos */
#endif /* #ifndef KOKKOS_CORESHAPE_HPP */
diff --git a/lib/kokkos/core/src/impl/Kokkos_spinwait.hpp b/lib/kokkos/core/src/impl/Kokkos_Singleton.hpp
similarity index 76%
copy from lib/kokkos/core/src/impl/Kokkos_spinwait.hpp
copy to lib/kokkos/core/src/impl/Kokkos_Singleton.hpp
index 966291abd..86bc94ab0 100755
--- a/lib/kokkos/core/src/impl/Kokkos_spinwait.hpp
+++ b/lib/kokkos/core/src/impl/Kokkos_Singleton.hpp
@@ -1,64 +1,55 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
-
-#ifndef KOKKOS_SPINWAIT_HPP
-#define KOKKOS_SPINWAIT_HPP
+#ifndef KOKKOS_SINGLETON_HPP
+#define KOKKOS_SINGLETON_HPP
#include <Kokkos_Macros.hpp>
+#include <cstddef>
-namespace Kokkos {
-namespace Impl {
-
-#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
-void spinwait( volatile int & flag , const int value );
-#else
-KOKKOS_INLINE_FUNCTION
-void spinwait( volatile int & , const int ) {}
-#endif
+namespace Kokkos { namespace Impl {
-} /* namespace Impl */
-} /* namespace Kokkos */
-#endif /* #ifndef KOKKOS_SPINWAIT_HPP */
+}} // namespace Kokkos::Impl
+#endif // KOKKOS_SINGLETON_HPP
diff --git a/lib/kokkos/core/src/impl/Kokkos_StaticAssert.hpp b/lib/kokkos/core/src/impl/Kokkos_StaticAssert.hpp
index f1017c312..25e2ec9dc 100755
--- a/lib/kokkos/core/src/impl/Kokkos_StaticAssert.hpp
+++ b/lib/kokkos/core/src/impl/Kokkos_StaticAssert.hpp
@@ -1,79 +1,79 @@
/*
//@HEADER
// ************************************************************************
//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_STATICASSERT_HPP
#define KOKKOS_STATICASSERT_HPP
namespace Kokkos {
namespace Impl {
template < bool , class T = void >
struct StaticAssert ;
template< class T >
struct StaticAssert< true , T > {
typedef T type ;
static const bool value = true ;
};
template < class A , class B >
struct StaticAssertSame ;
template < class A >
struct StaticAssertSame<A,A> { typedef A type ; };
template < class A , class B >
struct StaticAssertAssignable ;
template < class A >
struct StaticAssertAssignable<A,A> { typedef A type ; };
template < class A >
struct StaticAssertAssignable< const A , A > { typedef const A type ; };
} // namespace Impl
} // namespace Kokkos
#endif /* KOKKOS_STATICASSERT_HPP */
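// A minimal sketch of the pre-C++11 static assertion helpers above: the
// specializations only exist when the condition holds, so referencing ::type
// fails to compile otherwise.
//
//   typedef Kokkos::Impl::StaticAssertSame< int , int >::type same_ok ;       // compiles
//   // typedef Kokkos::Impl::StaticAssertSame< int , long >::type same_bad ;  // would not compile
//   typedef Kokkos::Impl::StaticAssert< ( sizeof(long) >= sizeof(int) ) >::type size_ok ;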
diff --git a/lib/kokkos/core/src/impl/Kokkos_Tags.hpp b/lib/kokkos/core/src/impl/Kokkos_Tags.hpp
index 372ea14b6..4885d3737 100755
--- a/lib/kokkos/core/src/impl/Kokkos_Tags.hpp
+++ b/lib/kokkos/core/src/impl/Kokkos_Tags.hpp
@@ -1,131 +1,156 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos
-// Manycore Performance-Portable Multidimensional Arrays
-//
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_TAGS_HPP
#define KOKKOS_TAGS_HPP
#include <impl/Kokkos_Traits.hpp>
+#include <Kokkos_Core_fwd.hpp>
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
-namespace Impl {
-
-struct LayoutTag {};
+//----------------------------------------------------------------------------
-struct MemorySpaceTag {};
-struct MemoryTraitsTag {};
+template<class ExecutionSpace, class MemorySpace>
+struct Device {
+ typedef ExecutionSpace execution_space;
+ typedef MemorySpace memory_space;
+ typedef Device<execution_space,memory_space> device_type;
+};
+}
-struct ExecutionPolicyTag {};
-struct ExecutionSpaceTag {};
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
+namespace Kokkos {
+namespace Impl {
template< class C , class Enable = void >
struct is_memory_space : public bool_< false > {};
template< class C , class Enable = void >
struct is_execution_space : public bool_< false > {};
template< class C , class Enable = void >
struct is_execution_policy : public bool_< false > {};
template< class C , class Enable = void >
struct is_array_layout : public Impl::false_type {};
template< class C , class Enable = void >
struct is_memory_traits : public Impl::false_type {};
template< class C >
struct is_memory_space< C , typename Impl::enable_if_type< typename C::memory_space >::type >
: public bool_< Impl::is_same< C , typename C::memory_space >::value > {};
template< class C >
struct is_execution_space< C , typename Impl::enable_if_type< typename C::execution_space >::type >
: public bool_< Impl::is_same< C , typename C::execution_space >::value > {};
template< class C >
struct is_execution_policy< C , typename Impl::enable_if_type< typename C::execution_policy >::type >
: public bool_< Impl::is_same< C , typename C::execution_policy >::value > {};
template< class C >
struct is_array_layout< C , typename Impl::enable_if_type< typename C::array_layout >::type >
: public bool_< Impl::is_same< C , typename C::array_layout >::value > {};
template< class C >
struct is_memory_traits< C , typename Impl::enable_if_type< typename C::memory_traits >::type >
: public bool_< Impl::is_same< C , typename C::memory_traits >::value > {};
+
//----------------------------------------------------------------------------
template< class C , class Enable = void >
struct is_space : public Impl::false_type {};
template< class C >
struct is_space< C
, typename Impl::enable_if<(
Impl::is_same< C , typename C::execution_space >::value ||
- Impl::is_same< C , typename C::memory_space >::value
+ Impl::is_same< C , typename C::memory_space >::value ||
+ Impl::is_same< C , Device<
+ typename C::execution_space,
+ typename C::memory_space> >::value
)>::type
>
: public Impl::true_type
{
typedef typename C::execution_space execution_space ;
typedef typename C::memory_space memory_space ;
- // The host_mirror_space defines a space with host-resident memory.
- // If the execution space's memory space is HostSpace then use that execution space.
- // Else use the HostSpace.
+ // The host_memory_space defines a space with host-resident memory.
+ // If the space's memory space is host accessible then use that memory space,
+ // else use HostSpace.
typedef
- typename Impl::if_c< Impl::is_same< typename execution_space::memory_space , HostSpace >::value , execution_space ,
- HostSpace >::type
- host_mirror_space ;
-};
+ typename Impl::if_c< Impl::is_same< memory_space , HostSpace >::value
+#ifdef KOKKOS_HAVE_CUDA
+ || Impl::is_same< memory_space , CudaUVMSpace>::value
+ || Impl::is_same< memory_space , CudaHostPinnedSpace>::value
+#endif
+ , memory_space , HostSpace >::type
+ host_memory_space ;
+ // The host_execution_space defines a space which has access to HostSpace.
+ // If the execution space can access HostSpace then use that execution space.
+ // else use the DefaultHostExecutionSpace.
+#ifdef KOKKOS_HAVE_CUDA
+ typedef
+ typename Impl::if_c< Impl::is_same< execution_space , Cuda >::value
+ , DefaultHostExecutionSpace , execution_space >::type
+ host_execution_space ;
+#else
+ typedef execution_space host_execution_space;
+#endif
+
+ typedef Device<host_execution_space,host_memory_space> host_mirror_space;
+};
}
}
#endif
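// A minimal sketch of the Device / is_space additions above, assuming a
// CUDA-enabled build (KOKKOS_HAVE_CUDA): for Kokkos::Cuda, whose memory space
// is not host accessible, is_space<Cuda>::host_mirror_space is
// Device< DefaultHostExecutionSpace , HostSpace > ; for the host-accessible
// Kokkos::CudaUVMSpace it is Device< DefaultHostExecutionSpace , CudaUVMSpace >.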
diff --git a/lib/kokkos/core/src/impl/Kokkos_Timer.hpp b/lib/kokkos/core/src/impl/Kokkos_Timer.hpp
index 17a5b2c9b..80a326f08 100755
--- a/lib/kokkos/core/src/impl/Kokkos_Timer.hpp
+++ b/lib/kokkos/core/src/impl/Kokkos_Timer.hpp
@@ -1,115 +1,115 @@
/*
//@HEADER
// ************************************************************************
//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_IMPLWALLTIME_HPP
#define KOKKOS_IMPLWALLTIME_HPP
#include <stddef.h>
#ifdef _MSC_VER
#undef KOKKOS_USE_LIBRT
#include <gettimeofday.c>
#else
#ifdef KOKKOS_USE_LIBRT
#include <ctime>
#else
#include <sys/time.h>
#endif
#endif
namespace Kokkos {
namespace Impl {
/** \brief Time since construction */
class Timer {
private:
#ifdef KOKKOS_USE_LIBRT
struct timespec m_old;
#else
struct timeval m_old ;
#endif
Timer( const Timer & );
Timer & operator = ( const Timer & );
public:
inline
void reset() {
#ifdef KOKKOS_USE_LIBRT
clock_gettime(CLOCK_REALTIME, &m_old);
#else
gettimeofday( & m_old , ((struct timezone *) NULL ) );
#endif
}
inline
~Timer() {}
inline
Timer() { reset(); }
inline
double seconds() const
{
#ifdef KOKKOS_USE_LIBRT
struct timespec m_new;
clock_gettime(CLOCK_REALTIME, &m_new);
return ( (double) ( m_new.tv_sec - m_old.tv_sec ) ) +
( (double) ( m_new.tv_nsec - m_old.tv_nsec ) * 1.0e-9 );
#else
struct timeval m_new ;
::gettimeofday( & m_new , ((struct timezone *) NULL ) );
return ( (double) ( m_new.tv_sec - m_old.tv_sec ) ) +
( (double) ( m_new.tv_usec - m_old.tv_usec ) * 1.0e-6 );
#endif
}
};
} // namespace Impl
} // namespace Kokkos
#endif /* #ifndef KOKKOS_IMPLWALLTIME_HPP */
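// A minimal usage sketch of the Timer class above (do_work() is just a
// placeholder for any host-side work):
//
//   Kokkos::Impl::Timer timer ;          // clock starts at construction
//   do_work();
//   double elapsed = timer.seconds() ;   // wall-clock seconds since construction
//   timer.reset() ;                      // restart the clock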
diff --git a/lib/kokkos/core/src/impl/Kokkos_Traits.hpp b/lib/kokkos/core/src/impl/Kokkos_Traits.hpp
index 69bab9996..52358842f 100755
--- a/lib/kokkos/core/src/impl/Kokkos_Traits.hpp
+++ b/lib/kokkos/core/src/impl/Kokkos_Traits.hpp
@@ -1,370 +1,370 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOSTRAITS_HPP
#define KOKKOSTRAITS_HPP
#include <stddef.h>
#include <stdint.h>
#include <Kokkos_Macros.hpp>
namespace Kokkos {
namespace Impl {
/* C++11-conformant compile-time type traits utilities.
* Prefer to use C++11 when portably available.
*/
//----------------------------------------------------------------------------
// C++11 Helpers:
template < class T , T v >
struct integral_constant
{
// Declaration of 'static const' causes an unresolved linker symbol in debug
// static const T value = v ;
enum { value = T(v) };
typedef T value_type;
typedef integral_constant<T,v> type;
KOKKOS_INLINE_FUNCTION operator T() { return v ; }
};
typedef integral_constant<bool,false> false_type ;
typedef integral_constant<bool,true> true_type ;
//----------------------------------------------------------------------------
// C++11 Type relationships:
template< class X , class Y > struct is_same : public false_type {};
template< class X > struct is_same<X,X> : public true_type {};
//----------------------------------------------------------------------------
// C++11 Type properties:
template <typename T> struct is_const : public false_type {};
template <typename T> struct is_const<const T> : public true_type {};
template <typename T> struct is_const<const T & > : public true_type {};
template <typename T> struct is_array : public false_type {};
template <typename T> struct is_array< T[] > : public true_type {};
template <typename T, unsigned N > struct is_array< T[N] > : public true_type {};
//----------------------------------------------------------------------------
// C++11 Type transformations:
template <typename T> struct remove_const { typedef T type; };
template <typename T> struct remove_const<const T> { typedef T type; };
template <typename T> struct remove_const<const T & > { typedef T & type; };
template <typename T> struct add_const { typedef const T type; };
template <typename T> struct add_const<T & > { typedef const T & type; };
template <typename T> struct add_const<const T> { typedef const T type; };
template <typename T> struct add_const<const T & > { typedef const T & type; };
template <typename T> struct remove_reference { typedef T type ; };
template <typename T> struct remove_reference< T & > { typedef T type ; };
template <typename T> struct remove_reference< const T & > { typedef const T type ; };
template <typename T> struct remove_extent { typedef T type ; };
template <typename T> struct remove_extent<T[]> { typedef T type ; };
template <typename T, unsigned N > struct remove_extent<T[N]> { typedef T type ; };
//----------------------------------------------------------------------------
// C++11 Other type generators:
template< bool , class T , class F >
struct condition { typedef F type ; };
template< class T , class F >
struct condition<true,T,F> { typedef T type ; };
template< bool , class = void >
struct enable_if ;
template< class T >
struct enable_if< true , T > { typedef T type ; };
//----------------------------------------------------------------------------
} // namespace Impl
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
// Other traits
namespace Kokkos {
namespace Impl {
//----------------------------------------------------------------------------
template< class , class T = void >
struct enable_if_type { typedef T type ; };
//----------------------------------------------------------------------------
template< bool B >
struct bool_ : public integral_constant<bool,B> {};
template< unsigned I >
struct unsigned_ : public integral_constant<unsigned,I> {};
template< int I >
struct int_ : public integral_constant<int,I> {};
typedef bool_<true> true_;
typedef bool_<false> false_;
//----------------------------------------------------------------------------
// if_
template < bool Cond , typename TrueType , typename FalseType>
struct if_c
{
enum { value = Cond };
typedef FalseType type;
typedef typename remove_const<
typename remove_reference<type>::type >::type value_type ;
typedef typename add_const<value_type>::type const_value_type ;
static KOKKOS_INLINE_FUNCTION
const_value_type & select( const_value_type & v ) { return v ; }
static KOKKOS_INLINE_FUNCTION
value_type & select( value_type & v ) { return v ; }
template< class T >
static KOKKOS_INLINE_FUNCTION
value_type & select( const T & ) { value_type * ptr(0); return *ptr ; }
template< class T >
static KOKKOS_INLINE_FUNCTION
const_value_type & select( const T & , const_value_type & v ) { return v ; }
template< class T >
static KOKKOS_INLINE_FUNCTION
value_type & select( const T & , value_type & v ) { return v ; }
};
template <typename TrueType, typename FalseType>
struct if_c< true , TrueType , FalseType >
{
enum { value = true };
typedef TrueType type;
typedef typename remove_const<
typename remove_reference<type>::type >::type value_type ;
typedef typename add_const<value_type>::type const_value_type ;
static KOKKOS_INLINE_FUNCTION
const_value_type & select( const_value_type & v ) { return v ; }
static KOKKOS_INLINE_FUNCTION
value_type & select( value_type & v ) { return v ; }
template< class T >
static KOKKOS_INLINE_FUNCTION
value_type & select( const T & ) { value_type * ptr(0); return *ptr ; }
template< class F >
static KOKKOS_INLINE_FUNCTION
const_value_type & select( const_value_type & v , const F & ) { return v ; }
template< class F >
static KOKKOS_INLINE_FUNCTION
value_type & select( value_type & v , const F & ) { return v ; }
};
template< typename TrueType >
struct if_c< false , TrueType , void >
{
enum { value = false };
typedef void type ;
typedef void value_type ;
};
template< typename FalseType >
struct if_c< true , void , FalseType >
{
enum { value = true };
typedef void type ;
typedef void value_type ;
};
template <typename Cond, typename TrueType, typename FalseType>
struct if_ : public if_c<Cond::value, TrueType, FalseType> {};
//----------------------------------------------------------------------------
// Allows aliased types:
template< typename T >
struct is_integral : public integral_constant< bool ,
(
Impl::is_same< T , char >::value ||
Impl::is_same< T , unsigned char >::value ||
Impl::is_same< T , short int >::value ||
Impl::is_same< T , unsigned short int >::value ||
Impl::is_same< T , int >::value ||
Impl::is_same< T , unsigned int >::value ||
Impl::is_same< T , long int >::value ||
Impl::is_same< T , unsigned long int >::value ||
Impl::is_same< T , long long int >::value ||
Impl::is_same< T , unsigned long long int >::value ||
Impl::is_same< T , int8_t >::value ||
Impl::is_same< T , int16_t >::value ||
Impl::is_same< T , int32_t >::value ||
Impl::is_same< T , int64_t >::value ||
Impl::is_same< T , uint8_t >::value ||
Impl::is_same< T , uint16_t >::value ||
Impl::is_same< T , uint32_t >::value ||
Impl::is_same< T , uint64_t >::value
)>
{};
//----------------------------------------------------------------------------
template < size_t N >
struct is_power_of_two
{
enum type { value = (N > 0) && !(N & (N-1)) };
};
template < size_t N , bool OK = is_power_of_two<N>::value >
struct power_of_two ;
template < size_t N >
struct power_of_two<N,true>
{
enum type { value = 1+ power_of_two<(N>>1),true>::value };
};
template <>
struct power_of_two<2,true>
{
enum type { value = 1 };
};
template <>
struct power_of_two<1,true>
{
enum type { value = 0 };
};
/** \brief If power of two then return power,
* otherwise return ~0u.
*/
static KOKKOS_FORCEINLINE_FUNCTION
unsigned power_of_two_if_valid( const unsigned N )
{
unsigned p = ~0u ;
if ( N && ! ( N & ( N - 1 ) ) ) {
-#if defined( __CUDA_ARCH__ )
+#if defined( __CUDA_ARCH__ ) && defined( KOKKOS_HAVE_CUDA )
p = __ffs(N) - 1 ;
#elif defined( __GNUC__ ) || defined( __GNUG__ )
p = __builtin_ffs(N) - 1 ;
#elif defined( __INTEL_COMPILER )
p = _bit_scan_forward(N);
#else
p = 0 ;
for ( unsigned j = 1 ; ! ( N & j ) ; j <<= 1 ) { ++p ; }
#endif
}
return p ;
}
//----------------------------------------------------------------------------
template< typename T , T v , bool NonZero = ( v != T(0) ) >
struct integral_nonzero_constant
{
// Declaration of 'static const' causes an unresolved linker symbol in debug
// static const T value = v ;
enum { value = T(v) };
typedef T value_type ;
typedef integral_nonzero_constant<T,v> type ;
KOKKOS_INLINE_FUNCTION integral_nonzero_constant( const T & ) {}
};
template< typename T , T zero >
struct integral_nonzero_constant<T,zero,false>
{
const T value ;
typedef T value_type ;
typedef integral_nonzero_constant<T,0> type ;
KOKKOS_INLINE_FUNCTION integral_nonzero_constant( const T & v ) : value(v) {}
};
//----------------------------------------------------------------------------
template < class C > struct is_integral_constant : public false_
{
typedef void integral_type ;
enum { integral_value = 0 };
};
template < typename T , T v >
struct is_integral_constant< integral_constant<T,v> > : public true_
{
typedef T integral_type ;
enum { integral_value = v };
};
} // namespace Impl
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
#endif /* #ifndef KOKKOSTRAITS_HPP */
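// A few minimal sketches of the traits above:
//
//   Kokkos::Impl::is_same< int , int >::value            // 1
//   Kokkos::Impl::if_c< true , int , double >::type      // int
//   Kokkos::Impl::is_power_of_two< 64 >::value           // 1
//   Kokkos::Impl::power_of_two< 64 >::value              // 6
//   Kokkos::Impl::power_of_two_if_valid( 64 )            // 6 ; returns ~0u for non powers of two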
diff --git a/lib/kokkos/core/src/impl/Kokkos_ViewDefault.hpp b/lib/kokkos/core/src/impl/Kokkos_ViewDefault.hpp
index 75b893bef..8334af3a3 100755
--- a/lib/kokkos/core/src/impl/Kokkos_ViewDefault.hpp
+++ b/lib/kokkos/core/src/impl/Kokkos_ViewDefault.hpp
@@ -1,2818 +1,878 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_VIEWDEFAULT_HPP
#define KOKKOS_VIEWDEFAULT_HPP
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
template<>
struct ViewAssignment< ViewDefault , ViewDefault , void >
{
typedef ViewDefault Specialize ;
//------------------------------------
- /** \brief Compatible value and shape */
+ /** \brief Compatible value and shape, or assignment of LayoutLeft/LayoutRight to LayoutStride */
template< class DT , class DL , class DD , class DM ,
class ST , class SL , class SD , class SM >
KOKKOS_INLINE_FUNCTION
ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
const View<ST,SL,SD,SM,Specialize> & src ,
const typename enable_if<(
ViewAssignable< ViewTraits<DT,DL,DD,DM> ,
ViewTraits<ST,SL,SD,SM> >::value
||
( ViewAssignable< ViewTraits<DT,DL,DD,DM> ,
ViewTraits<ST,SL,SD,SM> >::assignable_value
&&
ShapeCompatible< typename ViewTraits<DT,DL,DD,DM>::shape_type ,
typename ViewTraits<ST,SL,SD,SM>::shape_type >::value
&&
- is_same< typename ViewTraits<DT,DL,DD,DM>::array_layout,LayoutStride>::value )
+ is_same< typename ViewTraits<DT,DL,DD,DM>::array_layout,LayoutStride>::value
+ && (is_same< typename ViewTraits<ST,SL,SD,SM>::array_layout,LayoutLeft>::value ||
+ is_same< typename ViewTraits<ST,SL,SD,SM>::array_layout,LayoutRight>::value))
)>::type * = 0 )
{
- dst.m_management.decrement( dst.m_ptr_on_device );
-
dst.m_offset_map.assign( src.m_offset_map );
dst.m_management = src.m_management ;
- dst.m_ptr_on_device = typename ViewDataManagement< ViewTraits<DT,DL,DD,DM> >::handle_type( src.m_ptr_on_device );
-
- dst.m_management.increment( dst.m_ptr_on_device );
- }
-
- //------------------------------------
- /** \brief Extract Rank-0 from Rank-1 */
-
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> ,
- ViewTraits<ST,SL,SD,SM> >::assignable_value &&
- ( ViewTraits<DT,DL,DD,DM>::rank == 0 ) &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 1 )
- ), unsigned >::type i0 )
- {
- assert_shape_bounds( src.m_offset_map , 1 , i0 );
-
- dst.m_management.decrement( dst.m_ptr_on_device );
-
- dst.m_management = src.m_management ;
-
- dst.m_ptr_on_device = src.ptr_on_device() + i0 ;
-
- dst.m_management.increment( dst.m_ptr_on_device );
- }
-
- //------------------------------------
- /** \brief Extract Rank-0 from Rank-2 */
-
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> ,
- ViewTraits<ST,SL,SD,SM> >::assignable_value &&
- ( ViewTraits<DT,DL,DD,DM>::rank == 0 ) &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 2 )
- ), unsigned >::type i0 ,
- const unsigned i1 )
- {
- assert_shape_bounds( src.m_offset_map , 2 , i0 , i1 );
-
- dst.m_management.decrement( dst.m_ptr_on_device );
-
- dst.m_management = src.m_management ;
-
- dst.m_ptr_on_device = src.ptr_on_device() + src.m_offset_map(i0,i1);
-
- dst.m_management.increment( dst.m_ptr_on_device );
- }
-
- //------------------------------------
- /** \brief Extract Rank-0 from Rank-3 */
-
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> ,
- ViewTraits<ST,SL,SD,SM> >::assignable_value &&
- ( ViewTraits<DT,DL,DD,DM>::rank == 0 ) &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 3 )
- ), unsigned >::type i0 ,
- const unsigned i1 ,
- const unsigned i2 )
- {
- assert_shape_bounds( src.m_offset_map, 3, i0, i1, i2 );
-
- dst.m_management.decrement( dst.m_ptr_on_device );
-
- dst.m_management = src.m_management ;
-
- dst.m_ptr_on_device = src.ptr_on_device() + src.m_offset_map(i0,i1,i2);
-
- dst.m_management.increment( dst.m_ptr_on_device );
- }
-
- //------------------------------------
- /** \brief Extract Rank-0 from Rank-4 */
-
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> ,
- ViewTraits<ST,SL,SD,SM> >::assignable_value &&
- ( ViewTraits<DT,DL,DD,DM>::rank == 0 ) &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 4 )
- ), unsigned >::type i0 ,
- const unsigned i1 ,
- const unsigned i2 ,
- const unsigned i3 )
- {
- assert_shape_bounds( src.m_offset_map, 4, i0, i1, i2, i3 );
-
- dst.m_management.decrement( dst.m_ptr_on_device );
-
- dst.m_management = src.m_management ;
-
- dst.m_ptr_on_device = src.ptr_on_device() + src.m_offset_map(i0,i1,i2,i3);
-
- dst.m_management.increment( dst.m_ptr_on_device );
- }
-
- //------------------------------------
- /** \brief Extract Rank-0 from Rank-5 */
-
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> ,
- ViewTraits<ST,SL,SD,SM> >::assignable_value &&
- ( ViewTraits<DT,DL,DD,DM>::rank == 0 ) &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 5 )
- ), unsigned >::type i0 ,
- const unsigned i1 ,
- const unsigned i2 ,
- const unsigned i3 ,
- const unsigned i4 )
- {
- assert_shape_bounds( src.m_offset_map, 5, i0, i1, i2, i3, i4);
-
- dst.m_management.decrement( dst.m_ptr_on_device );
-
- dst.m_management = src.m_management ;
-
- dst.m_ptr_on_device = src.ptr_on_device() + src.m_offset_map(i0,i1,i2,i3,i4);
-
- dst.m_management.increment( dst.m_ptr_on_device );
- }
-
- //------------------------------------
- /** \brief Extract Rank-0 from Rank-6 */
-
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> ,
- ViewTraits<ST,SL,SD,SM> >::assignable_value &&
- ( ViewTraits<DT,DL,DD,DM>::rank == 0 ) &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 6 )
- ), unsigned >::type i0 ,
- const unsigned i1 ,
- const unsigned i2 ,
- const unsigned i3 ,
- const unsigned i4 ,
- const unsigned i5 )
- {
- assert_shape_bounds( src.m_offset_map, 6, i0, i1, i2, i3, i4, i5);
-
- dst.m_management.decrement( dst.m_ptr_on_device );
-
- dst.m_management = src.m_management ;
-
- dst.m_ptr_on_device = src.ptr_on_device() + src.m_offset_map(i0,i1,i2,i3,i4,i5);
-
- dst.m_management.increment( dst.m_ptr_on_device );
- }
-
- //------------------------------------
- /** \brief Extract Rank-0 from Rank-7 */
-
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> ,
- ViewTraits<ST,SL,SD,SM> >::assignable_value &&
- ( ViewTraits<DT,DL,DD,DM>::rank == 0 ) &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 7 )
- ), unsigned >::type i0 ,
- const unsigned i1 ,
- const unsigned i2 ,
- const unsigned i3 ,
- const unsigned i4 ,
- const unsigned i5 ,
- const unsigned i6 )
- {
- assert_shape_bounds( src.m_offset_map, 7, i0, i1, i2, i3, i4, i5, i6 );
-
- dst.m_management.decrement( dst.m_ptr_on_device );
-
- dst.m_management = src.m_management ;
+ dst.m_ptr_on_device = ViewDataManagement< ViewTraits<DT,DL,DD,DM> >::create_handle( src.m_ptr_on_device, src.m_tracker );
- dst.m_ptr_on_device = src.ptr_on_device() + src.m_offset_map(i0,i1,i2,i3,i4,i5,i6);
+ dst.m_tracker = src.m_tracker ;
- dst.m_management.increment( dst.m_ptr_on_device );
}
- //------------------------------------
- /** \brief Extract Rank-0 from Rank-8 */
-
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> ,
- ViewTraits<ST,SL,SD,SM> >::assignable_value &&
- ( ViewTraits<DT,DL,DD,DM>::rank == 0 ) &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 8 )
- ), unsigned >::type i0 ,
- const unsigned i1 ,
- const unsigned i2 ,
- const unsigned i3 ,
- const unsigned i4 ,
- const unsigned i5 ,
- const unsigned i6 ,
- const unsigned i7 )
- {
- assert_shape_bounds( src.m_offset_map, 8, i0, i1, i2, i3, i4, i5, i6, i7 );
-
- dst.m_management.decrement( dst.m_ptr_on_device );
-
- dst.m_management = src.m_management ;
-
- dst.m_ptr_on_device = src.ptr_on_device() + src.m_offset_map(i0,i1,i2,i3,i4,i5,i6,i7);
- dst.m_management.increment( dst.m_ptr_on_device );
- }
+ /** \brief Assign 1D Strided View to LayoutLeft or LayoutRight if stride[0]==1 */
- //------------------------------------
- /** \brief Extract Rank-1 array from range of Rank-1 array, either layout */
template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM ,
- typename iType >
+ class ST , class SD , class SM >
KOKKOS_INLINE_FUNCTION
ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const std::pair<iType,iType> & range ,
- typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::assignable_value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 1 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank == 1 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank_dynamic == 1 )
- ) >::type * = 0 )
+ const View<ST,LayoutStride,SD,SM,Specialize> & src ,
+ const typename enable_if<(
+ (
+ ViewAssignable< ViewTraits<DT,DL,DD,DM> ,
+ ViewTraits<ST,LayoutStride,SD,SM> >::value
+ ||
+ ( ViewAssignable< ViewTraits<DT,DL,DD,DM> ,
+ ViewTraits<ST,LayoutStride,SD,SM> >::assignable_value
+ &&
+ ShapeCompatible< typename ViewTraits<DT,DL,DD,DM>::shape_type ,
+ typename ViewTraits<ST,LayoutStride,SD,SM>::shape_type >::value
+ )
+ )
+ &&
+ (View<DT,DL,DD,DM,Specialize>::rank==1)
+ && (is_same< typename ViewTraits<DT,DL,DD,DM>::array_layout,LayoutLeft>::value ||
+ is_same< typename ViewTraits<DT,DL,DD,DM>::array_layout,LayoutRight>::value)
+ )>::type * = 0 )
{
- dst.m_management.decrement( dst.m_ptr_on_device );
-
- dst.m_offset_map.N0 = 0 ;
- dst.m_ptr_on_device = 0 ;
-
- if ( range.first < range.second ) {
- assert_shape_bounds( src.m_offset_map , 1 , range.first );
- assert_shape_bounds( src.m_offset_map , 1 , range.second - 1 );
-
- dst.m_management = src.m_management ;
- dst.m_offset_map.N0 = range.second - range.first ;
- dst.m_ptr_on_device = src.ptr_on_device() + range.first ;
-
- dst.m_management.increment( dst.m_ptr_on_device );
+ size_t strides[8];
+ src.stride(strides);
+ if(strides[0]!=1) {
+ abort("Trying to assign strided 1D View to LayoutRight or LayoutLeft which is not stride-1");
}
- }
-
- //------------------------------------
- /** \brief Extract Rank-1 array from LayoutLeft Rank-2 array, using ALL as first argument. */
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const ALL & ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::assignable_value
- &&
- is_same< typename ViewTraits<ST,SL,SD,SM>::array_layout , LayoutLeft >::value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 2 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank == 1 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank_dynamic == 1 )
- ), unsigned >::type i1 )
- {
- dst.m_management.decrement( dst.m_ptr_on_device );
-
- dst.m_management = src.m_management ;
- dst.m_offset_map.N0 = src.m_offset_map.N0 ;
- dst.m_ptr_on_device = src.ptr_on_device() + src.m_offset_map(0,i1);
-
- dst.m_management.increment( dst.m_ptr_on_device );
- }
+ dst.m_offset_map.assign( src.dimension_0(), 0, 0, 0, 0, 0, 0, 0, 0 );
+ dst.m_management = src.m_management ;
- //------------------------------------
- /** \brief Extract Rank-1 array from LayoutLeft Rank-2 array, using a row range as first argument. */
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM ,
- typename IndexType >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const std::pair<IndexType, IndexType>& rowRange,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::assignable_value
- &&
- is_same< typename ViewTraits<ST,SL,SD,SM>::array_layout , LayoutLeft >::value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 2 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank == 1 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank_dynamic == 1 )
- ), IndexType >::type columnIndex )
- {
- dst.m_management.decrement( dst.m_ptr_on_device );
+ dst.m_ptr_on_device = ViewDataManagement< ViewTraits<DT,DL,DD,DM> >::create_handle( src.m_ptr_on_device, src.m_tracker );
- if (rowRange.first < rowRange.second) { // valid row range
- dst.m_management = src.m_management;
- dst.m_offset_map.N0 = rowRange.second - rowRange.first;
- dst.m_ptr_on_device = src.ptr_on_device () +
- src.m_offset_map (rowRange.first, columnIndex);
+ dst.m_tracker = src.m_tracker ;
- dst.m_management.increment( dst.m_ptr_on_device );
- }
- else { // not a valid row range
- dst.m_offset_map.N0 = 0;
- dst.m_ptr_on_device = 0;
- }
}
-
//------------------------------------
- /** \brief Extract Rank-1 array from LayoutRight Rank-2 array. */
+ /** \brief Deep copy data from compatible value type, layout, rank, and specialization.
+ * Check the dimensions and allocation lengths at runtime.
+ */
template< class DT , class DL , class DD , class DM ,
class ST , class SL , class SD , class SM >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
+ inline static
+ void deep_copy( const View<DT,DL,DD,DM,Specialize> & dst ,
const View<ST,SL,SD,SM,Specialize> & src ,
- const unsigned i0 ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::assignable_value
- &&
- is_same< typename ViewTraits<ST,SL,SD,SM>::array_layout , LayoutRight >::value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 2 )
+ const typename Impl::enable_if<(
+ Impl::is_same< typename ViewTraits<DT,DL,DD,DM>::value_type ,
+ typename ViewTraits<ST,SL,SD,SM>::non_const_value_type >::value
&&
- ( ViewTraits<DT,DL,DD,DM>::rank == 1 )
+ Impl::is_same< typename ViewTraits<DT,DL,DD,DM>::array_layout ,
+ typename ViewTraits<ST,SL,SD,SM>::array_layout >::value
&&
- ( ViewTraits<DT,DL,DD,DM>::rank_dynamic == 1 )
- ), ALL >::type & )
+ ( unsigned(ViewTraits<DT,DL,DD,DM>::rank) == unsigned(ViewTraits<ST,SL,SD,SM>::rank) )
+ )>::type * = 0 )
{
- dst.m_management.decrement( dst.m_ptr_on_device );
-
- dst.m_management = src.m_management ;
- dst.m_offset_map.N0 = src.m_offset_map.N1 ;
- dst.m_ptr_on_device = src.ptr_on_device() + src.m_offset_map(i0,0);
-
- dst.m_management.increment( dst.m_ptr_on_device );
- }
+ typedef typename ViewTraits<DT,DL,DD,DM>::memory_space dst_memory_space ;
+ typedef typename ViewTraits<ST,SL,SD,SM>::memory_space src_memory_space ;
- //------------------------------------
- /** \brief Extract Rank-2 array from LayoutLeft Rank-2 array. */
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM ,
- typename iType >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const std::pair<iType,iType> & range ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::assignable_value
- &&
- is_same< typename ViewTraits<ST,SL,SD,SM>::array_layout , LayoutLeft >::value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 2 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank == 2 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank_dynamic == 2 )
- ), unsigned >::type i1 )
- {
- assert_shape_bounds( src.m_offset_map , 2 , range.first , i1 );
- assert_shape_bounds( src.m_offset_map , 2 , range.second - 1 , i1 );
+ if ( dst.ptr_on_device() != src.ptr_on_device() ) {
- dst.m_management.decrement( dst.m_ptr_on_device );
+ Impl::assert_shapes_are_equal( dst.m_offset_map , src.m_offset_map );
- if ( range.first < range.second ) {
- dst.m_management = src.m_management ;
- dst.m_offset_map.N0 = range.second - range.first ;
- dst.m_offset_map.N1 = 1 ;
- dst.m_offset_map.S0 = range.second - range.first ;
- dst.m_ptr_on_device = src.ptr_on_device() + src.m_offset_map(range.first,i1);
+ const size_t nbytes = dst.m_offset_map.scalar_size * dst.m_offset_map.capacity();
- dst.m_management.increment( dst.m_ptr_on_device );
+ DeepCopy< dst_memory_space , src_memory_space >( dst.ptr_on_device() , src.ptr_on_device() , nbytes );
}
}
+};
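// [Editorial sketch, not part of the diff] The deep_copy overload added above
// requires that destination and source Views share value type, layout, and
// rank; shapes are compared at runtime and the allocation is then copied as
// raw bytes between the two memory spaces. A minimal use via the public
// Kokkos::deep_copy front end, assuming two Views with matching extents:
//
//   Kokkos::View<double**, Kokkos::LayoutRight> a("a", 100, 8);
//   Kokkos::View<double**, Kokkos::LayoutRight> b("b", 100, 8);
//   Kokkos::deep_copy(b, a);   // runtime shape check, then byte-wise copy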
- //------------------------------------
- /** \brief Extract Rank-2 array from LayoutLeft Rank-2 array. */
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const ALL & ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::assignable_value
- &&
- is_same< typename ViewTraits<ST,SL,SD,SM>::array_layout , LayoutLeft >::value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 2 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank == 2 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank_dynamic == 2 )
- ), unsigned >::type i1 )
- {
- dst.m_management.decrement( dst.m_ptr_on_device );
-
- dst.m_management = src.m_management ;
- dst.m_offset_map.N0 = src.m_offset_map.N0 ;
- dst.m_offset_map.N1 = 1 ;
-
- dst.m_offset_map.S0 = src.m_offset_map.N0 ;
- dst.m_ptr_on_device = src.ptr_on_device() + src.m_offset_map(0,i1);
-
- dst.m_management.increment( dst.m_ptr_on_device );
- }
-
- //------------------------------------
- /** \brief Extract Rank-2 array from LayoutRight Rank-2 array. */
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const unsigned i0 ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::assignable_value
- &&
- is_same< typename ViewTraits<ST,SL,SD,SM>::array_layout , LayoutRight >::value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 2 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank == 2 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank_dynamic == 2 )
- ), ALL >::type & )
- {
- dst.m_management.decrement( dst.m_ptr_on_device );
-
- dst.m_management = src.m_management ;
- dst.m_offset_map.N0 = 1 ;
- dst.m_offset_map.N1 = src.m_offset_map.N1 ;
- dst.m_offset_map.SR = src.m_offset_map.SR ;
- dst.m_ptr_on_device = src.ptr_on_device() + src.m_offset_map(i0,0);
+} /* namespace Impl */
+} /* namespace Kokkos */
- dst.m_management.increment( dst.m_ptr_on_device );
- }
- //------------------------------------
- /** \brief Extract LayoutRight Rank-N array from range of LayoutRight Rank-N array */
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM ,
- typename iType >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const std::pair<iType,iType> & range ,
- typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::value
- &&
- Impl::is_same< typename ViewTraits<DT,DL,DD,DM>::array_layout , LayoutRight >::value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank > 1 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank_dynamic > 0 )
- )>::type * = 0 )
- {
- //typedef ViewTraits<DT,DL,DD,DM> traits_type ; // unused
- //typedef typename traits_type::shape_type shape_type ; // unused
- //typedef typename View<DT,DL,DD,DM,Specialize>::stride_type stride_type ; // unused
+//----------------------------------------------------------------------------
+//----------------------------------------------------------------------------
- dst.m_management.decrement( dst.m_ptr_on_device );
+namespace Kokkos {
+namespace Impl {
- dst.m_offset_map.assign( 0, 0, 0, 0, 0, 0, 0, 0 );
+template< class ExecSpace , class DT , class DL, class DD, class DM, class DS >
+struct ViewDefaultConstruct< ExecSpace , Kokkos::View<DT,DL,DD,DM,DS> , true >
+{
+ Kokkos::View<DT,DL,DD,DM,DS> * const m_ptr ;
- dst.m_ptr_on_device = 0 ;
+ KOKKOS_FORCEINLINE_FUNCTION
+ void operator()( const typename ExecSpace::size_type& i ) const
+ { new(m_ptr+i) Kokkos::View<DT,DL,DD,DM,DS>(); }
- if ( ( range.first == range.second ) ||
- ( (src.capacity()==0u) && (range.second<src.m_offset_map.N0) )) {
- dst.m_offset_map.assign( 0 , src.m_offset_map.N1 , src.m_offset_map.N2 , src.m_offset_map.N3 ,
- src.m_offset_map.N4 , src.m_offset_map.N5 , src.m_offset_map.N6 , src.m_offset_map.N7 );
- dst.m_offset_map.SR = src.m_offset_map.SR ;
+ ViewDefaultConstruct( Kokkos::View<DT,DL,DD,DM,DS> * pointer , size_t capacity )
+ : m_ptr( pointer )
+ {
+ Kokkos::RangePolicy< ExecSpace > range( 0 , capacity );
+ parallel_for( range , *this );
+ ExecSpace::fence();
}
- else if ( (range.first < range.second) ) {
- assert_shape_bounds( src.m_offset_map , 8 , range.first , 0,0,0,0,0,0,0);
- assert_shape_bounds( src.m_offset_map , 8 , range.second - 1 , 0,0,0,0,0,0,0);
-
- dst.m_offset_map.assign( range.second - range.first
- , src.m_offset_map.N1 , src.m_offset_map.N2 , src.m_offset_map.N3
- , src.m_offset_map.N4 , src.m_offset_map.N5 , src.m_offset_map.N6 , src.m_offset_map.N7 );
-
- dst.m_offset_map.SR = src.m_offset_map.SR ;
-
- dst.m_management = src.m_management ;
-
- dst.m_ptr_on_device = src.ptr_on_device() + range.first * src.m_offset_map.SR ;
-
- dst.m_management.increment( dst.m_ptr_on_device );
- }
- }
-
- //------------------------------------
- /** \brief Extract rank-2 from rank-2 array */
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM ,
- typename iType0 , typename iType1 >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const std::pair<iType0,iType0> & range0 ,
- const std::pair<iType1,iType1> & range1 ,
- typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::value
- &&
- ViewTraits<DT,DL,DD,DM>::rank == 2
- &&
- ViewTraits<DT,DL,DD,DM>::rank_dynamic == 2
- ) >::type * = 0 )
- {
- dst.m_management.decrement( dst.m_ptr_on_device );
- dst.m_offset_map.assign(0,0,0,0, 0,0,0,0);
- dst.m_ptr_on_device = 0 ;
-
- if ( (range0.first == range0.second) ||
- (range1.first == range1.second) ||
- ( ( src.capacity() == 0u ) &&
- ( long(range0.second) < long(src.m_offset_map.N0) ) &&
- ( long(range1.second) < long(src.m_offset_map.N1) ) ) ) {
-
- dst.m_offset_map.assign( src.m_offset_map );
- dst.m_offset_map.N0 = range0.second - range0.first ;
- dst.m_offset_map.N1 = range1.second - range1.first ;
- }
- else if ( (range0.first < range0.second && range1.first < range1.second) ) {
-
- assert_shape_bounds( src.m_offset_map , 2 , range0.first , range1.first );
- assert_shape_bounds( src.m_offset_map , 2 , range0.second - 1 , range1.second - 1 );
-
- dst.m_offset_map.assign( src.m_offset_map );
- dst.m_offset_map.N0 = range0.second - range0.first ;
- dst.m_offset_map.N1 = range1.second - range1.first ;
-
- dst.m_management = src.m_management ;
-
- dst.m_ptr_on_device = src.ptr_on_device() + src.m_offset_map(range0.first,range1.first);
-
- dst.m_management.increment( dst.m_ptr_on_device );
- }
- }
-
- //------------------------------------
- /** \brief Extract rank-2 from rank-2 array */
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM ,
- typename iType >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- ALL ,
- const std::pair<iType,iType> & range1 ,
- typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::value
- &&
- ViewTraits<DT,DL,DD,DM>::rank == 2
- &&
- ViewTraits<DT,DL,DD,DM>::rank_dynamic == 2
- ) >::type * = 0 )
- {
- dst.m_management.decrement( dst.m_ptr_on_device );
- dst.m_offset_map.assign(0,0,0,0, 0,0,0,0);
- dst.m_ptr_on_device = 0 ;
-
- if ( (range1.first == range1.second) || ( (src.capacity()==0) && (range1.second<src.m_offset_map.N1) )) {
- dst.m_offset_map.assign(src.m_offset_map);
- dst.m_offset_map.N1 = range1.second - range1.first ;
- }
- else if ( (range1.first < range1.second) ) {
- assert_shape_bounds( src.m_offset_map , 2 , 0 , range1.first );
- assert_shape_bounds( src.m_offset_map , 2 , src.m_offset_map.N0 - 1 , range1.second - 1 );
-
- dst.m_offset_map.assign(src.m_offset_map);
- dst.m_offset_map.N1 = range1.second - range1.first ;
- dst.m_management = src.m_management ;
-
- dst.m_ptr_on_device = src.ptr_on_device() + src.m_offset_map(0,range1.first);
-
- dst.m_management.increment( dst.m_ptr_on_device );
- }
- }
-
- //------------------------------------
- /** \brief Extract rank-2 from rank-2 array */
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM ,
- typename iType >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const std::pair<iType,iType> & range0 ,
- ALL ,
- typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::value
- &&
- ViewTraits<DT,DL,DD,DM>::rank == 2
- &&
- ViewTraits<DT,DL,DD,DM>::rank_dynamic == 2
- ) >::type * = 0 )
- {
- dst.m_management.decrement( dst.m_ptr_on_device );
- dst.m_offset_map.assign(0,0,0,0, 0,0,0,0);
- dst.m_ptr_on_device = 0 ;
-
- if ( (range0.first == range0.second) || ( (src.capacity()==0) && (range0.second<src.m_offset_map.N0) )) {
- dst.m_offset_map.assign(src.m_offset_map);
- dst.m_offset_map.N0 = range0.second - range0.first ;
- }
- else if ( (range0.first < range0.second) ) {
- assert_shape_bounds( src.m_offset_map , 2 , range0.first , 0 );
- assert_shape_bounds( src.m_offset_map , 2 , range0.second - 1 , src.m_offset_map.N1 - 1 );
-
- dst.m_offset_map.assign(src.m_offset_map);
- dst.m_offset_map.N0 = range0.second - range0.first ;
- dst.m_management = src.m_management ;
-
- dst.m_ptr_on_device = src.ptr_on_device() + src.m_offset_map(range0.first,0);
-
- dst.m_management.increment( dst.m_ptr_on_device );
- }
- }
-
- //------------------------------------
- /** \brief Extract rank-2 from rank-2 array */
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM ,
- typename iType >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const std::pair<iType,iType> & range0 ,
- ALL ,
- typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::value
- &&
- ViewTraits<DT,DL,DD,DM>::rank == 2
- &&
- ViewTraits<DT,DL,DD,DM>::rank_dynamic == 1
- ) >::type * = 0 )
- {
- dst.m_tracking.decrement( dst.ptr_on_device() );
- dst.m_offset_map.assign(0,0,0,0, 0,0,0,0);
- dst.m_ptr_on_device = 0 ;
-
- if ( (range0.first == range0.second) || ( (src.capacity()==0) && (range0.second<src.m_offset_map.N0) )) {
- dst.m_offset_map.assign(src.m_offset_map);
- dst.m_offset_map.N0 = range0.second - range0.first ;
- }
- else if ( (range0.first < range0.second) ) {
- assert_shape_bounds( src.m_offset_map , 2 , range0.first , 0 );
- assert_shape_bounds( src.m_offset_map , 2 , range0.second - 1 , src.m_offset_map.N1 - 1 );
-
- dst.m_offset_map.assign(src.m_offset_map);
- dst.m_offset_map.N0 = range0.second - range0.first ;
- dst.m_tracking = src.m_tracking ;
-
- dst.m_ptr_on_device = src.ptr_on_device() + src.m_offset_map(range0.first,0);
-
- dst.m_tracking.increment( dst.ptr_on_device() );
- }
- }
- //------------------------------------
- /** \brief Extract Rank-2 array from LayoutRight Rank-3 array. */
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const unsigned i0 ,
- const ALL & ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::assignable_value
- &&
- is_same< typename ViewTraits<ST,SL,SD,SM>::array_layout , LayoutRight >::value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 3 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank == 2 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank_dynamic == 2 )
- ), ALL >::type & )
- {
- //typedef ViewTraits<DT,DL,DD,DM> traits_type ; // unused
-
- dst.m_management.decrement( dst.m_ptr_on_device );
-
- dst.m_management = src.m_management ;
- dst.m_offset_map.N0 = src.m_offset_map.N1 ;
- dst.m_offset_map.N1 = src.m_offset_map.N2 ;
- dst.m_offset_map.SR = dst.m_offset_map.N1 ;
- dst.m_ptr_on_device = &src(i0,0,0);
-
- dst.m_management.increment( dst.m_ptr_on_device );
- }
-
- //------------------------------------
- /** \brief Extract Rank-2 array from LayoutRight Rank-4 array. */
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const unsigned i0 ,
- const unsigned i1 ,
- const ALL & ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::assignable_value
- &&
- is_same< typename ViewTraits<ST,SL,SD,SM>::array_layout , LayoutRight >::value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 4 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank == 2 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank_dynamic == 2 )
- ), ALL >::type & )
- {
- //typedef ViewTraits<DT,DL,DD,DM> traits_type ; // unused
-
- dst.m_management.decrement( dst.m_ptr_on_device );
-
- dst.m_management = src.m_management ;
- dst.m_offset_map.N0 = src.m_offset_map.N2 ;
- dst.m_offset_map.N1 = src.m_offset_map.N3 ;
- dst.m_offset_map.SR = dst.m_offset_map.N1 ;
- dst.m_ptr_on_device = &src(i0,i1,0,0);
-
- dst.m_management.increment( dst.m_ptr_on_device );
- }
-
- //------------------------------------
- /** \brief Extract Rank-2 array from LayoutRight Rank-5 array. */
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const unsigned i0 ,
- const unsigned i1 ,
- const unsigned i2 ,
- const ALL & ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::assignable_value
- &&
- is_same< typename ViewTraits<ST,SL,SD,SM>::array_layout , LayoutRight >::value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 5 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank == 2 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank_dynamic == 2 )
- ), ALL >::type & )
- {
- //typedef ViewTraits<DT,DL,DD,DM> traits_type ; // unused
-
- dst.m_management.decrement( dst.m_ptr_on_device );
-
- dst.m_management = src.m_management ;
- dst.m_offset_map.N0 = src.m_offset_map.N3 ;
- dst.m_offset_map.N1 = src.m_offset_map.N4 ;
- dst.m_offset_map.SR = dst.m_offset_map.N1 ;
- dst.m_ptr_on_device = &src(i0,i1,i2,0,0);
-
- dst.m_management.increment( dst.m_ptr_on_device );
- }
-
-
- //------------------------------------
- /** \brief Extract Rank-2 array from LayoutRight Rank-6 array. */
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const unsigned i0 ,
- const unsigned i1 ,
- const unsigned i2 ,
- const unsigned i3 ,
- const ALL & ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::assignable_value
- &&
- is_same< typename ViewTraits<ST,SL,SD,SM>::array_layout , LayoutRight >::value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 6 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank == 2 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank_dynamic == 2 )
- ), ALL >::type & )
- {
- //typedef ViewTraits<DT,DL,DD,DM> traits_type ; // unused
-
- dst.m_management.decrement( dst.m_ptr_on_device );
-
- dst.m_management = src.m_management ;
- dst.m_offset_map.N0 = src.m_offset_map.N4 ;
- dst.m_offset_map.N1 = src.m_offset_map.N5 ;
- dst.m_offset_map.SR = dst.m_offset_map.N1 ;
- dst.m_ptr_on_device = &src(i0,i1,i2,i3,0,0);
-
- dst.m_management.increment( dst.m_ptr_on_device );
- }
-
- //------------------------------------
- /** \brief Extract Rank-2 array from LayoutRight Rank-7 array. */
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const unsigned i0 ,
- const unsigned i1 ,
- const unsigned i2 ,
- const unsigned i3 ,
- const unsigned i4 ,
- const ALL & ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::assignable_value
- &&
- is_same< typename ViewTraits<ST,SL,SD,SM>::array_layout , LayoutRight >::value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 7 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank == 2 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank_dynamic == 2 )
- ), ALL >::type & )
- {
- //typedef ViewTraits<DT,DL,DD,DM> traits_type ; // unused
-
- dst.m_management.decrement( dst.m_ptr_on_device );
-
- dst.m_management = src.m_management ;
- dst.m_offset_map.N0 = src.m_offset_map.N5 ;
- dst.m_offset_map.N1 = src.m_offset_map.N6 ;
- dst.m_offset_map.SR = dst.m_offset_map.N1 ;
- dst.m_ptr_on_device = &src(i0,i1,i2,i3,i4,0,0);
-
- dst.m_management.increment( dst.m_ptr_on_device );
- }
-
-
- //------------------------------------
- /** \brief Extract Rank-2 array from LayoutRight Rank-8 array. */
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const unsigned i0 ,
- const unsigned i1 ,
- const unsigned i2 ,
- const unsigned i3 ,
- const unsigned i4 ,
- const unsigned i5 ,
- const ALL & ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::assignable_value
- &&
- is_same< typename ViewTraits<ST,SL,SD,SM>::array_layout , LayoutRight >::value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 8 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank == 2 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank_dynamic == 2 )
- ), ALL >::type & )
- {
- //typedef ViewTraits<DT,DL,DD,DM> traits_type ; // unused
-
- dst.m_management.decrement( dst.m_ptr_on_device );
-
- dst.m_management = src.m_management ;
- dst.m_offset_map.N0 = src.m_offset_map.N6 ;
- dst.m_offset_map.N1 = src.m_offset_map.N7 ;
- dst.m_offset_map.SR = dst.m_offset_map.N1 ;
- dst.m_ptr_on_device = &src(i0,i1,i2,i3,i4,i5,0,0);
-
- dst.m_management.increment( dst.m_ptr_on_device );
- }
-
- //------------------------------------
- /** \brief Extract Rank-3 array from LayoutRight Rank-4 array. */
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const unsigned i0 ,
- const ALL & ,
- const ALL & ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::assignable_value
- &&
- is_same< typename ViewTraits<ST,SL,SD,SM>::array_layout , LayoutRight >::value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 4 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank == 3 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank_dynamic == 3 )
- ), ALL >::type & )
- {
- //typedef ViewTraits<DT,DL,DD,DM> traits_type ; // unused
-
- dst.m_management.decrement( dst.m_ptr_on_device );
-
- dst.m_management = src.m_management ;
- dst.m_offset_map.N0 = src.m_offset_map.N1 ;
- dst.m_offset_map.N1 = src.m_offset_map.N2 ;
- dst.m_offset_map.N2 = src.m_offset_map.N3 ;
- dst.m_offset_map.SR = dst.m_offset_map.N1 * dst.m_offset_map.N2 ;
- dst.m_ptr_on_device = &src(i0,0,0,0);
-
- dst.m_management.increment( dst.m_ptr_on_device );
- }
-
- //------------------------------------
- /** \brief Extract Rank-3 array from LayoutRight Rank-5 array. */
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const unsigned i0 ,
- const unsigned i1 ,
- const ALL & ,
- const ALL & ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::assignable_value
- &&
- is_same< typename ViewTraits<ST,SL,SD,SM>::array_layout , LayoutRight >::value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 5 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank == 3 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank_dynamic == 3 )
- ), ALL >::type & )
- {
- //typedef ViewTraits<DT,DL,DD,DM> traits_type ; // unused
-
- dst.m_management.decrement( dst.m_ptr_on_device );
-
- dst.m_management = src.m_management ;
- dst.m_offset_map.N0 = src.m_offset_map.N2 ;
- dst.m_offset_map.N1 = src.m_offset_map.N3 ;
- dst.m_offset_map.N2 = src.m_offset_map.N4 ;
- dst.m_offset_map.SR = dst.m_offset_map.N1 * dst.m_offset_map.N2 ;
- dst.m_ptr_on_device = &src(i0,i1,0,0,0);
-
- dst.m_management.increment( dst.m_ptr_on_device );
- }
-
- //------------------------------------
- /** \brief Extract Rank-3 array from LayoutRight Rank-6 array. */
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const unsigned i0 ,
- const unsigned i1 ,
- const unsigned i2 ,
- const ALL & ,
- const ALL & ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::assignable_value
- &&
- is_same< typename ViewTraits<ST,SL,SD,SM>::array_layout , LayoutRight >::value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 6 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank == 3 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank_dynamic == 3 )
- ), ALL >::type & )
- {
- //typedef ViewTraits<DT,DL,DD,DM> traits_type ; // unused
-
- dst.m_management.decrement( dst.m_ptr_on_device );
-
- dst.m_management = src.m_management ;
- dst.m_offset_map.N0 = src.m_offset_map.N3 ;
- dst.m_offset_map.N1 = src.m_offset_map.N4 ;
- dst.m_offset_map.N2 = src.m_offset_map.N5 ;
- dst.m_offset_map.SR = dst.m_offset_map.N1 * dst.m_offset_map.N2 ;
- dst.m_ptr_on_device = &src(i0,i1,i2,0,0,0);
-
- dst.m_management.increment( dst.m_ptr_on_device );
- }
-
- //------------------------------------
- /** \brief Extract Rank-3 array from LayoutRight Rank-7 array. */
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const unsigned i0 ,
- const unsigned i1 ,
- const unsigned i2 ,
- const unsigned i3 ,
- const ALL & ,
- const ALL & ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::assignable_value
- &&
- is_same< typename ViewTraits<ST,SL,SD,SM>::array_layout , LayoutRight >::value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 7 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank == 3 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank_dynamic == 3 )
- ), ALL >::type & )
- {
- //typedef ViewTraits<DT,DL,DD,DM> traits_type ; // unused
-
- dst.m_management.decrement( dst.m_ptr_on_device );
-
- dst.m_management = src.m_management ;
- dst.m_offset_map.N0 = src.m_offset_map.N4 ;
- dst.m_offset_map.N1 = src.m_offset_map.N5 ;
- dst.m_offset_map.N2 = src.m_offset_map.N6 ;
- dst.m_offset_map.SR = dst.m_offset_map.N1 * dst.m_offset_map.N2 ;
- dst.m_ptr_on_device = &src(i0,i1,i2,i3,0,0,0);
-
- dst.m_management.increment( dst.m_ptr_on_device );
- }
-
- //------------------------------------
- /** \brief Extract Rank-3 array from LayoutRight Rank-8 array. */
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const unsigned i0 ,
- const unsigned i1 ,
- const unsigned i2 ,
- const unsigned i3 ,
- const unsigned i4 ,
- const ALL & ,
- const ALL & ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::assignable_value
- &&
- is_same< typename ViewTraits<ST,SL,SD,SM>::array_layout , LayoutRight >::value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 8 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank == 3 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank_dynamic == 3 )
- ), ALL >::type & )
- {
- //typedef ViewTraits<DT,DL,DD,DM> traits_type ; // unused
-
- dst.m_management.decrement( dst.m_ptr_on_device );
-
- dst.m_management = src.m_management ;
- dst.m_offset_map.N0 = src.m_offset_map.N5 ;
- dst.m_offset_map.N1 = src.m_offset_map.N6 ;
- dst.m_offset_map.N2 = src.m_offset_map.N7 ;
- dst.m_offset_map.SR = dst.m_offset_map.N1 * dst.m_offset_map.N2 ;
- dst.m_ptr_on_device = &src(i0,i1,i2,i3,i4,0,0,0);
-
- dst.m_management.increment( dst.m_ptr_on_device );
- }
-
- //------------------------------------
- /** \brief Extract Rank-4 array from LayoutRight Rank-5 array. */
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const unsigned i0 ,
- const ALL & ,
- const ALL & ,
- const ALL & ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::assignable_value
- &&
- is_same< typename ViewTraits<ST,SL,SD,SM>::array_layout , LayoutRight >::value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 5 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank == 4 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank_dynamic == 4 )
- ), ALL >::type & )
- {
- //typedef ViewTraits<DT,DL,DD,DM> traits_type ; // unused
-
- dst.m_management.decrement( dst.m_ptr_on_device );
-
- dst.m_management = src.m_management ;
- dst.m_offset_map.N0 = src.m_offset_map.N1 ;
- dst.m_offset_map.N1 = src.m_offset_map.N2 ;
- dst.m_offset_map.N2 = src.m_offset_map.N3 ;
- dst.m_offset_map.N3 = src.m_offset_map.N4 ;
- dst.m_offset_map.SR = dst.m_offset_map.N1 * dst.m_offset_map.N2 *
- dst.m_offset_map.N3 ;
- dst.m_ptr_on_device = &src(i0,0,0,0,0);
-
- dst.m_management.increment( dst.m_ptr_on_device );
- }
-
- //------------------------------------
- /** \brief Extract Rank-4 array from LayoutRight Rank-6 array. */
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const unsigned i0 ,
- const unsigned i1 ,
- const ALL & ,
- const ALL & ,
- const ALL & ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::assignable_value
- &&
- is_same< typename ViewTraits<ST,SL,SD,SM>::array_layout , LayoutRight >::value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 6 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank == 4 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank_dynamic == 4 )
- ), ALL >::type & )
- {
- //typedef ViewTraits<DT,DL,DD,DM> traits_type ; // unused
-
- dst.m_management.decrement( dst.m_ptr_on_device );
-
- dst.m_management = src.m_management ;
- dst.m_offset_map.N0 = src.m_offset_map.N2 ;
- dst.m_offset_map.N1 = src.m_offset_map.N3 ;
- dst.m_offset_map.N2 = src.m_offset_map.N4 ;
- dst.m_offset_map.N3 = src.m_offset_map.N5 ;
- dst.m_offset_map.SR = dst.m_offset_map.N1 * dst.m_offset_map.N2 *
- dst.m_offset_map.N3 ;
- dst.m_ptr_on_device = &src(i0,i1,0,0,0,0);
-
- dst.m_management.increment( dst.m_ptr_on_device );
- }
-
- //------------------------------------
- /** \brief Extract Rank-4 array from LayoutRight Rank-7 array. */
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const unsigned i0 ,
- const unsigned i1 ,
- const unsigned i2 ,
- const ALL & ,
- const ALL & ,
- const ALL & ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::assignable_value
- &&
- is_same< typename ViewTraits<ST,SL,SD,SM>::array_layout , LayoutRight >::value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 7 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank == 4 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank_dynamic == 4 )
- ), ALL >::type & )
- {
- //typedef ViewTraits<DT,DL,DD,DM> traits_type ; // unused
-
- dst.m_management.decrement( dst.m_ptr_on_device );
-
- dst.m_management = src.m_management ;
- dst.m_offset_map.N0 = src.m_offset_map.N3 ;
- dst.m_offset_map.N1 = src.m_offset_map.N4 ;
- dst.m_offset_map.N2 = src.m_offset_map.N5 ;
- dst.m_offset_map.N3 = src.m_offset_map.N6 ;
- dst.m_offset_map.SR = dst.m_offset_map.N1 * dst.m_offset_map.N2 *
- dst.m_offset_map.N3 ;
- dst.m_ptr_on_device = &src(i0,i1,i2,0,0,0,0);
-
- dst.m_management.increment( dst.m_ptr_on_device );
- }
-
- //------------------------------------
- /** \brief Extract Rank-4 array from LayoutRight Rank-8 array. */
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const unsigned i0 ,
- const unsigned i1 ,
- const unsigned i2 ,
- const unsigned i3 ,
- const ALL & ,
- const ALL & ,
- const ALL & ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::assignable_value
- &&
- is_same< typename ViewTraits<ST,SL,SD,SM>::array_layout , LayoutRight >::value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 8 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank == 4 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank_dynamic == 4 )
- ), ALL >::type & )
- {
- //typedef ViewTraits<DT,DL,DD,DM> traits_type ; // unused
-
- dst.m_management.decrement( dst.m_ptr_on_device );
-
- dst.m_management = src.m_management ;
- dst.m_offset_map.N0 = src.m_offset_map.N4 ;
- dst.m_offset_map.N1 = src.m_offset_map.N5 ;
- dst.m_offset_map.N2 = src.m_offset_map.N6 ;
- dst.m_offset_map.N3 = src.m_offset_map.N7 ;
- dst.m_offset_map.SR = dst.m_offset_map.N1 * dst.m_offset_map.N2 *
- dst.m_offset_map.N3 ;
- dst.m_ptr_on_device = &src(i0,i1,i2,i3,0,0,0,0);
-
- dst.m_management.increment( dst.m_ptr_on_device );
- }
-
- //------------------------------------
- /** \brief Extract Rank-5 array from LayoutRight Rank-6 array. */
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const unsigned i0 ,
- const ALL & ,
- const ALL & ,
- const ALL & ,
- const ALL & ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::assignable_value
- &&
- is_same< typename ViewTraits<ST,SL,SD,SM>::array_layout , LayoutRight >::value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 6 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank == 5 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank_dynamic == 5 )
- ), ALL >::type & )
- {
- //typedef ViewTraits<DT,DL,DD,DM> traits_type ; // unused
-
- dst.m_management.decrement( dst.m_ptr_on_device );
-
- dst.m_management = src.m_management ;
- dst.m_offset_map.N0 = src.m_offset_map.N1 ;
- dst.m_offset_map.N1 = src.m_offset_map.N2 ;
- dst.m_offset_map.N2 = src.m_offset_map.N3 ;
- dst.m_offset_map.N3 = src.m_offset_map.N4 ;
- dst.m_offset_map.N4 = src.m_offset_map.N5 ;
- dst.m_offset_map.SR = dst.m_offset_map.N1 * dst.m_offset_map.N2 *
- dst.m_offset_map.N3 * dst.m_offset_map.N4 ;
- dst.m_ptr_on_device = &src(i0,0,0,0,0,0);
-
- dst.m_management.increment( dst.m_ptr_on_device );
- }
-
- //------------------------------------
- /** \brief Extract Rank-5 array from LayoutRight Rank-7 array. */
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const unsigned i0 ,
- const unsigned i1 ,
- const ALL & ,
- const ALL & ,
- const ALL & ,
- const ALL & ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::assignable_value
- &&
- is_same< typename ViewTraits<ST,SL,SD,SM>::array_layout , LayoutRight >::value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 7 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank == 5 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank_dynamic == 5 )
- ), ALL >::type & )
- {
- //typedef ViewTraits<DT,DL,DD,DM> traits_type ; // unused
-
- dst.m_management.decrement( dst.m_ptr_on_device );
-
- dst.m_management = src.m_management ;
- dst.m_offset_map.N0 = src.m_offset_map.N2 ;
- dst.m_offset_map.N1 = src.m_offset_map.N3 ;
- dst.m_offset_map.N2 = src.m_offset_map.N4 ;
- dst.m_offset_map.N3 = src.m_offset_map.N5 ;
- dst.m_offset_map.N4 = src.m_offset_map.N6 ;
- dst.m_offset_map.SR = dst.m_offset_map.N1 * dst.m_offset_map.N2 *
- dst.m_offset_map.N3 * dst.m_offset_map.N4 ;
- dst.m_ptr_on_device = &src(i0,i1,0,0,0,0,0);
-
- dst.m_management.increment( dst.m_ptr_on_device );
- }
-
- //------------------------------------
- /** \brief Extract Rank-5 array from LayoutRight Rank-8 array. */
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const unsigned i0 ,
- const unsigned i1 ,
- const unsigned i2 ,
- const ALL & ,
- const ALL & ,
- const ALL & ,
- const ALL & ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::assignable_value
- &&
- is_same< typename ViewTraits<ST,SL,SD,SM>::array_layout , LayoutRight >::value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 8 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank == 5 )
- &&
- ( ViewTraits<DT,DL,DD,DM>::rank_dynamic == 5 )
- ), ALL >::type & )
- {
- //typedef ViewTraits<DT,DL,DD,DM> traits_type ; // unused
-
- dst.m_management.decrement( dst.m_ptr_on_device );
-
- dst.m_management = src.m_management ;
- dst.m_offset_map.N0 = src.m_offset_map.N3 ;
- dst.m_offset_map.N1 = src.m_offset_map.N4 ;
- dst.m_offset_map.N2 = src.m_offset_map.N5 ;
- dst.m_offset_map.N3 = src.m_offset_map.N6 ;
- dst.m_offset_map.N4 = src.m_offset_map.N7 ;
- dst.m_offset_map.SR = dst.m_offset_map.N1 * dst.m_offset_map.N2 *
- dst.m_offset_map.N3 * dst.m_offset_map.N4 ;
- dst.m_ptr_on_device = &src(i0,i1,i2,0,0,0,0,0);
-
- dst.m_management.increment( dst.m_ptr_on_device );
- }
-
- //------------------------------------
-
- template< class DT , class DL , class DD , class DM
- , class ST , class SL , class SD , class SM
- , class Type0
- >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const Type0 & arg0 ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::assignable_value
- &&
- is_same< typename ViewTraits<DT,DL,DD,DM>::array_layout , LayoutStride >::value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 1 )
- &&
- ( unsigned(ViewTraits<DT,DL,DD,DM>::rank) ==
- ( ViewOffsetRange< Type0 >::is_range ? 1u : 0 ) )
- )>::type * = 0 )
- {
- enum { src_rank = 1 };
-
- size_t str[2] = {0,0};
-
- src.m_offset_map.stride( str );
-
- const size_t offset = ViewOffsetRange< Type0 >::begin( arg0 ) * str[0] ;
-
- LayoutStride spec ;
-
- // Collapse dimension for non-ranges
- if ( ViewOffsetRange< Type0 >::is_range ) {
- spec.dimension[0] = ViewOffsetRange< Type0 >::dimension( src.m_offset_map.N0 , arg0 );
- spec.stride[0] = str[0] ;
- }
- else {
- spec.dimension[0] = 1 ;
- spec.stride[0] = 1 ;
- }
-
- dst.m_management.decrement( dst.m_ptr_on_device );
- dst.m_management = src.m_management ;
- dst.m_offset_map.assign( spec );
- dst.m_ptr_on_device = src.ptr_on_device() + offset ;
- dst.m_management.increment( dst.m_ptr_on_device );
- }
-
- template< class DT , class DL , class DD , class DM
- , class ST , class SL , class SD , class SM
- , class Type0
- , class Type1
- >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const Type0 & arg0 ,
- const Type1 & arg1 ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::assignable_value
- &&
- is_same< typename ViewTraits<DT,DL,DD,DM>::array_layout , LayoutStride >::value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 2 )
- &&
- ( unsigned(ViewTraits<DT,DL,DD,DM>::rank) ==
- ( ViewOffsetRange< Type0 >::is_range ? 1u : 0 ) +
- ( ViewOffsetRange< Type1 >::is_range ? 1u : 0 ) )
- )>::type * = 0 )
- {
- enum { src_rank = 2 };
-
- const bool is_range[ src_rank ] =
- { ViewOffsetRange< Type0 >::is_range
- , ViewOffsetRange< Type1 >::is_range
- };
-
- const unsigned begin[ src_rank ] =
- { static_cast<unsigned>(ViewOffsetRange< Type0 >::begin( arg0 ))
- , static_cast<unsigned>(ViewOffsetRange< Type1 >::begin( arg1 ))
- };
-
- size_t stride[9] ;
-
- src.m_offset_map.stride( stride );
-
- LayoutStride spec ;
-
- spec.dimension[0] = ViewOffsetRange< Type0 >::dimension( src.m_offset_map.N0 , arg0 );
- spec.dimension[1] = ViewOffsetRange< Type1 >::dimension( src.m_offset_map.N1 , arg1 );
- spec.stride[0] = stride[0] ;
- spec.stride[1] = stride[1] ;
-
- size_t offset = 0 ;
-
- // Collapse dimension for non-ranges
- for ( int i = 0 , j = 0 ; i < int(src_rank) ; ++i ) {
- spec.dimension[j] = spec.dimension[i] ;
- spec.stride[j] = spec.stride[i] ;
- offset += begin[i] * spec.stride[i] ;
- if ( is_range[i] ) { ++j ; }
- }
-
- dst.m_management.decrement( dst.m_ptr_on_device );
- dst.m_management = src.m_management ;
- dst.m_offset_map.assign( spec );
- dst.m_ptr_on_device = src.ptr_on_device() + offset ;
- dst.m_management.increment( dst.m_ptr_on_device );
- }
-
- template< class DT , class DL , class DD , class DM
- , class ST , class SL , class SD , class SM
- , class Type0
- , class Type1
- , class Type2
- >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const Type0 & arg0 ,
- const Type1 & arg1 ,
- const Type2 & arg2 ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::assignable_value
- &&
- is_same< typename ViewTraits<DT,DL,DD,DM>::array_layout , LayoutStride >::value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 3 )
- &&
- ( unsigned(ViewTraits<DT,DL,DD,DM>::rank) ==
- ( ViewOffsetRange< Type0 >::is_range ? 1u : 0 ) +
- ( ViewOffsetRange< Type1 >::is_range ? 1u : 0 ) +
- ( ViewOffsetRange< Type2 >::is_range ? 1u : 0 ) )
- )>::type * = 0 )
- {
- enum { src_rank = 3 };
-
- const bool is_range[ src_rank ] =
- { ViewOffsetRange< Type0 >::is_range
- , ViewOffsetRange< Type1 >::is_range
- , ViewOffsetRange< Type2 >::is_range
- };
-
- // FIXME (mfh 26 Oct 2014) Should use size_type typedef here
- // instead of unsigned. If we did that, the static_casts would be
- // unnecessary.
- const unsigned begin[ src_rank ] = {
- static_cast<unsigned> (ViewOffsetRange< Type0 >::begin (arg0))
- , static_cast<unsigned> (ViewOffsetRange< Type1 >::begin (arg1))
- , static_cast<unsigned> (ViewOffsetRange< Type2 >::begin (arg2))
- };
-
- // FIXME (mfh 26 Oct 2014) Should use size_type typedef here
- // instead of unsigned. If we did that, the static_casts would be
- // unnecessary.
- unsigned dim[ src_rank ] = {
- static_cast<unsigned> (ViewOffsetRange< Type0 >::dimension (src.m_offset_map.N0, arg0))
- , static_cast<unsigned> (ViewOffsetRange< Type1 >::dimension (src.m_offset_map.N1, arg1))
- , static_cast<unsigned> (ViewOffsetRange< Type2 >::dimension (src.m_offset_map.N2, arg2))
- };
-
- size_t stride[9] = {0,0,0,0,0,0,0,0,0};
-
- src.m_offset_map.stride( stride );
-
- LayoutStride spec ;
-
- size_t offset = 0 ;
-
- // Collapse dimension for non-ranges
- for ( int i = 0 , j = 0 ; i < int(src_rank) ; ++i ) {
- spec.dimension[j] = dim[i] ;
- spec.stride[j] = stride[i] ;
- offset += begin[i] * stride[i] ;
- if ( is_range[i] ) { ++j ; }
- }
-
- dst.m_management.decrement( dst.m_ptr_on_device );
- dst.m_management = src.m_management ;
- dst.m_offset_map.assign( spec );
- dst.m_ptr_on_device = src.ptr_on_device() + offset ;
- dst.m_management.increment( dst.m_ptr_on_device );
- }
-
- template< class DT , class DL , class DD , class DM
- , class ST , class SL , class SD , class SM
- , class Type0
- , class Type1
- , class Type2
- , class Type3
- >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const Type0 & arg0 ,
- const Type1 & arg1 ,
- const Type2 & arg2 ,
- const Type3 & arg3 ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::assignable_value
- &&
- is_same< typename ViewTraits<DT,DL,DD,DM>::array_layout , LayoutStride >::value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 4 )
- &&
- ( unsigned(ViewTraits<DT,DL,DD,DM>::rank) ==
- ( ViewOffsetRange< Type0 >::is_range ? 1u : 0 ) +
- ( ViewOffsetRange< Type1 >::is_range ? 1u : 0 ) +
- ( ViewOffsetRange< Type2 >::is_range ? 1u : 0 ) +
- ( ViewOffsetRange< Type3 >::is_range ? 1u : 0 ) )
- )>::type * = 0 )
- {
- enum { src_rank = 4 };
- const bool is_range[ src_rank ] =
- { ViewOffsetRange< Type0 >::is_range
- , ViewOffsetRange< Type1 >::is_range
- , ViewOffsetRange< Type2 >::is_range
- , ViewOffsetRange< Type3 >::is_range
- };
-
- const unsigned begin[ src_rank ] =
- { static_cast<unsigned>(ViewOffsetRange< Type0 >::begin( arg0 ))
- , static_cast<unsigned>(ViewOffsetRange< Type1 >::begin( arg1 ))
- , static_cast<unsigned>(ViewOffsetRange< Type2 >::begin( arg2 ))
- , static_cast<unsigned>(ViewOffsetRange< Type3 >::begin( arg3 ))
- };
-
- unsigned dim[ src_rank ] =
- { static_cast<unsigned>(ViewOffsetRange< Type0 >::dimension( src.m_offset_map.N0 , arg0 ))
- , static_cast<unsigned>(ViewOffsetRange< Type1 >::dimension( src.m_offset_map.N1 , arg1 ))
- , static_cast<unsigned>(ViewOffsetRange< Type2 >::dimension( src.m_offset_map.N2 , arg2 ))
- , static_cast<unsigned>(ViewOffsetRange< Type3 >::dimension( src.m_offset_map.N3 , arg3 ))
- };
-
- size_t stride[9] = {0,0,0,0,0,0,0,0,0};
-
- src.m_offset_map.stride( stride );
-
- LayoutStride spec ;
-
- size_t offset = 0 ;
-
- // Collapse dimension for non-ranges
- for ( int i = 0 , j = 0 ; i < int(src_rank) ; ++i ) {
- spec.dimension[j] = dim[i] ;
- spec.stride[j] = stride[i] ;
- offset += begin[i] * stride[i] ;
- if ( is_range[i] ) { ++j ; }
- }
-
- dst.m_management.decrement( dst.m_ptr_on_device );
- dst.m_management = src.m_management ;
- dst.m_offset_map.assign( spec );
- dst.m_ptr_on_device = src.ptr_on_device() + offset ;
- dst.m_management.increment( dst.m_ptr_on_device );
- }
-
- template< class DT , class DL , class DD , class DM
- , class ST , class SL , class SD , class SM
- , class Type0
- , class Type1
- , class Type2
- , class Type3
- , class Type4
- >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const Type0 & arg0 ,
- const Type1 & arg1 ,
- const Type2 & arg2 ,
- const Type3 & arg3 ,
- const Type4 & arg4 ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::assignable_value
- &&
- is_same< typename ViewTraits<DT,DL,DD,DM>::array_layout , LayoutStride >::value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 5 )
- &&
- ( unsigned(ViewTraits<DT,DL,DD,DM>::rank) ==
- ( ViewOffsetRange< Type0 >::is_range ? 1u : 0 ) +
- ( ViewOffsetRange< Type1 >::is_range ? 1u : 0 ) +
- ( ViewOffsetRange< Type2 >::is_range ? 1u : 0 ) +
- ( ViewOffsetRange< Type3 >::is_range ? 1u : 0 ) +
- ( ViewOffsetRange< Type4 >::is_range ? 1u : 0 ) )
- )>::type * = 0 )
- {
- enum { src_rank = 5 };
- const bool is_range[ src_rank ] =
- { ViewOffsetRange< Type0 >::is_range
- , ViewOffsetRange< Type1 >::is_range
- , ViewOffsetRange< Type2 >::is_range
- , ViewOffsetRange< Type3 >::is_range
- , ViewOffsetRange< Type4 >::is_range
- };
-
- const unsigned begin[ src_rank ] =
- { ViewOffsetRange< Type0 >::begin( arg0 )
- , ViewOffsetRange< Type1 >::begin( arg1 )
- , ViewOffsetRange< Type2 >::begin( arg2 )
- , ViewOffsetRange< Type3 >::begin( arg3 )
- , ViewOffsetRange< Type4 >::begin( arg4 )
- };
-
- unsigned dim[ src_rank ] =
- { ViewOffsetRange< Type0 >::dimension( src.m_offset_map.N0 , arg0 )
- , ViewOffsetRange< Type1 >::dimension( src.m_offset_map.N1 , arg1 )
- , ViewOffsetRange< Type2 >::dimension( src.m_offset_map.N2 , arg2 )
- , ViewOffsetRange< Type3 >::dimension( src.m_offset_map.N3 , arg3 )
- , ViewOffsetRange< Type4 >::dimension( src.m_offset_map.N4 , arg4 )
- };
-
- size_t stride[9] = {0,0,0,0,0,0,0,0,0};
-
- src.m_offset_map.stride( stride );
-
- LayoutStride spec ;
-
- size_t offset = 0 ;
-
- // Collapse dimension for non-ranges
- for ( int i = 0 , j = 0 ; i < int(src_rank) ; ++i ) {
- spec.dimension[j] = dim[i] ;
- spec.stride[j] = stride[i] ;
- offset += begin[i] * stride[i] ;
- if ( is_range[i] ) { ++j ; }
- }
-
- dst.m_management.decrement( dst.m_ptr_on_device );
- dst.m_management = src.m_management ;
- dst.m_offset_map.assign( spec );
- dst.m_ptr_on_device = src.ptr_on_device() + offset ;
- dst.m_management.increment( dst.m_ptr_on_device );
- }
-
- template< class DT , class DL , class DD , class DM
- , class ST , class SL , class SD , class SM
- , class Type0
- , class Type1
- , class Type2
- , class Type3
- , class Type4
- , class Type5
- >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const Type0 & arg0 ,
- const Type1 & arg1 ,
- const Type2 & arg2 ,
- const Type3 & arg3 ,
- const Type4 & arg4 ,
- const Type5 & arg5 ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::assignable_value
- &&
- is_same< typename ViewTraits<DT,DL,DD,DM>::array_layout , LayoutStride >::value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 6 )
- &&
- ( unsigned(ViewTraits<DT,DL,DD,DM>::rank) ==
- ( ViewOffsetRange< Type0 >::is_range ? 1u : 0 ) +
- ( ViewOffsetRange< Type1 >::is_range ? 1u : 0 ) +
- ( ViewOffsetRange< Type2 >::is_range ? 1u : 0 ) +
- ( ViewOffsetRange< Type3 >::is_range ? 1u : 0 ) +
- ( ViewOffsetRange< Type4 >::is_range ? 1u : 0 ) +
- ( ViewOffsetRange< Type5 >::is_range ? 1u : 0 ) )
- )>::type * = 0 )
- {
- enum { src_rank = 6 };
- const bool is_range[ src_rank ] =
- { ViewOffsetRange< Type0 >::is_range
- , ViewOffsetRange< Type1 >::is_range
- , ViewOffsetRange< Type2 >::is_range
- , ViewOffsetRange< Type3 >::is_range
- , ViewOffsetRange< Type4 >::is_range
- , ViewOffsetRange< Type5 >::is_range
- };
-
- const unsigned begin[ src_rank ] =
- { ViewOffsetRange< Type0 >::begin( arg0 )
- , ViewOffsetRange< Type1 >::begin( arg1 )
- , ViewOffsetRange< Type2 >::begin( arg2 )
- , ViewOffsetRange< Type3 >::begin( arg3 )
- , ViewOffsetRange< Type4 >::begin( arg4 )
- , ViewOffsetRange< Type5 >::begin( arg5 )
- };
-
- unsigned dim[ src_rank ] =
- { ViewOffsetRange< Type0 >::dimension( src.m_offset_map.N0 , arg0 )
- , ViewOffsetRange< Type1 >::dimension( src.m_offset_map.N1 , arg1 )
- , ViewOffsetRange< Type2 >::dimension( src.m_offset_map.N2 , arg2 )
- , ViewOffsetRange< Type3 >::dimension( src.m_offset_map.N3 , arg3 )
- , ViewOffsetRange< Type4 >::dimension( src.m_offset_map.N4 , arg4 )
- , ViewOffsetRange< Type5 >::dimension( src.m_offset_map.N5 , arg5 )
- };
-
- size_t stride[9] = {0,0,0,0,0,0,0,0,0};
-
- src.m_offset_map.stride( stride );
-
- LayoutStride spec ;
-
- size_t offset = 0 ;
-
- // Collapse dimension for non-ranges
- for ( int i = 0 , j = 0 ; i < int(src_rank) ; ++i ) {
- spec.dimension[j] = dim[i] ;
- spec.stride[j] = stride[i] ;
- offset += begin[i] * stride[i] ;
- if ( is_range[i] ) { ++j ; }
- }
-
- dst.m_management.decrement( dst.m_ptr_on_device );
- dst.m_management = src.m_management ;
- dst.m_offset_map.assign( spec );
- dst.m_ptr_on_device = src.ptr_on_device() + offset ;
- dst.m_management.increment( dst.m_ptr_on_device );
- }
-
- template< class DT , class DL , class DD , class DM
- , class ST , class SL , class SD , class SM
- , class Type0
- , class Type1
- , class Type2
- , class Type3
- , class Type4
- , class Type5
- , class Type6
- >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const Type0 & arg0 ,
- const Type1 & arg1 ,
- const Type2 & arg2 ,
- const Type3 & arg3 ,
- const Type4 & arg4 ,
- const Type5 & arg5 ,
- const Type6 & arg6 ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::assignable_value
- &&
- is_same< typename ViewTraits<DT,DL,DD,DM>::array_layout , LayoutStride >::value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 7 )
- &&
- ( unsigned(ViewTraits<DT,DL,DD,DM>::rank) ==
- ( ViewOffsetRange< Type0 >::is_range ? 1u : 0 ) +
- ( ViewOffsetRange< Type1 >::is_range ? 1u : 0 ) +
- ( ViewOffsetRange< Type2 >::is_range ? 1u : 0 ) +
- ( ViewOffsetRange< Type3 >::is_range ? 1u : 0 ) +
- ( ViewOffsetRange< Type4 >::is_range ? 1u : 0 ) +
- ( ViewOffsetRange< Type5 >::is_range ? 1u : 0 ) +
- ( ViewOffsetRange< Type6 >::is_range ? 1u : 0 ) )
- )>::type * = 0 )
- {
- enum { src_rank = 7 };
- const bool is_range[ src_rank ] =
- { ViewOffsetRange< Type0 >::is_range
- , ViewOffsetRange< Type1 >::is_range
- , ViewOffsetRange< Type2 >::is_range
- , ViewOffsetRange< Type3 >::is_range
- , ViewOffsetRange< Type4 >::is_range
- , ViewOffsetRange< Type5 >::is_range
- , ViewOffsetRange< Type6 >::is_range
- };
-
- const unsigned begin[ src_rank ] =
- { ViewOffsetRange< Type0 >::begin( arg0 )
- , ViewOffsetRange< Type1 >::begin( arg1 )
- , ViewOffsetRange< Type2 >::begin( arg2 )
- , ViewOffsetRange< Type3 >::begin( arg3 )
- , ViewOffsetRange< Type4 >::begin( arg4 )
- , ViewOffsetRange< Type5 >::begin( arg5 )
- , ViewOffsetRange< Type6 >::begin( arg6 )
- };
-
- unsigned dim[ src_rank ] =
- { ViewOffsetRange< Type0 >::dimension( src.m_offset_map.N0 , arg0 )
- , ViewOffsetRange< Type1 >::dimension( src.m_offset_map.N1 , arg1 )
- , ViewOffsetRange< Type2 >::dimension( src.m_offset_map.N2 , arg2 )
- , ViewOffsetRange< Type3 >::dimension( src.m_offset_map.N3 , arg3 )
- , ViewOffsetRange< Type4 >::dimension( src.m_offset_map.N4 , arg4 )
- , ViewOffsetRange< Type5 >::dimension( src.m_offset_map.N5 , arg5 )
- , ViewOffsetRange< Type6 >::dimension( src.m_offset_map.N6 , arg6 )
- };
-
- size_t stride[9] = {0,0,0,0,0,0,0,0,0};
-
- src.m_offset_map.stride( stride );
-
- LayoutStride spec ;
-
- size_t offset = 0 ;
-
- // Collapse dimension for non-ranges
- for ( int i = 0 , j = 0 ; i < int(src_rank) ; ++i ) {
- spec.dimension[j] = dim[i] ;
- spec.stride[j] = stride[i] ;
- offset += begin[i] * stride[i] ;
- if ( is_range[i] ) { ++j ; }
- }
-
- dst.m_management.decrement( dst.m_ptr_on_device );
- dst.m_management = src.m_management ;
- dst.m_offset_map.assign( spec );
- dst.m_ptr_on_device = src.ptr_on_device() + offset ;
- dst.m_management.increment( dst.m_ptr_on_device );
- }
-
- template< class DT , class DL , class DD , class DM
- , class ST , class SL , class SD , class SM
- , class Type0
- , class Type1
- , class Type2
- , class Type3
- , class Type4
- , class Type5
- , class Type6
- , class Type7
- >
- KOKKOS_INLINE_FUNCTION
- ViewAssignment( View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const Type0 & arg0 ,
- const Type1 & arg1 ,
- const Type2 & arg2 ,
- const Type3 & arg3 ,
- const Type4 & arg4 ,
- const Type5 & arg5 ,
- const Type6 & arg6 ,
- const Type7 & arg7 ,
- const typename enable_if< (
- ViewAssignable< ViewTraits<DT,DL,DD,DM> , ViewTraits<ST,SL,SD,SM> >::assignable_value
- &&
- is_same< typename ViewTraits<DT,DL,DD,DM>::array_layout , LayoutStride >::value
- &&
- ( ViewTraits<ST,SL,SD,SM>::rank == 8 )
- &&
- ( unsigned(ViewTraits<DT,DL,DD,DM>::rank) ==
- ( ViewOffsetRange< Type0 >::is_range ? 1u : 0 ) +
- ( ViewOffsetRange< Type1 >::is_range ? 1u : 0 ) +
- ( ViewOffsetRange< Type2 >::is_range ? 1u : 0 ) +
- ( ViewOffsetRange< Type3 >::is_range ? 1u : 0 ) +
- ( ViewOffsetRange< Type4 >::is_range ? 1u : 0 ) +
- ( ViewOffsetRange< Type5 >::is_range ? 1u : 0 ) +
- ( ViewOffsetRange< Type6 >::is_range ? 1u : 0 ) +
- ( ViewOffsetRange< Type7 >::is_range ? 1u : 0 ) )
- )>::type * = 0 )
- {
- enum { src_rank = 8 };
-
- const bool is_range[ src_rank ] =
- { ViewOffsetRange< Type0 >::is_range
- , ViewOffsetRange< Type1 >::is_range
- , ViewOffsetRange< Type2 >::is_range
- , ViewOffsetRange< Type3 >::is_range
- , ViewOffsetRange< Type4 >::is_range
- , ViewOffsetRange< Type5 >::is_range
- , ViewOffsetRange< Type6 >::is_range
- , ViewOffsetRange< Type7 >::is_range
- };
-
- const unsigned begin[ src_rank ] =
- { ViewOffsetRange< Type0 >::begin( arg0 )
- , ViewOffsetRange< Type1 >::begin( arg1 )
- , ViewOffsetRange< Type2 >::begin( arg2 )
- , ViewOffsetRange< Type3 >::begin( arg3 )
- , ViewOffsetRange< Type4 >::begin( arg4 )
- , ViewOffsetRange< Type5 >::begin( arg5 )
- , ViewOffsetRange< Type6 >::begin( arg6 )
- , ViewOffsetRange< Type7 >::begin( arg7 )
- };
-
- unsigned dim[ src_rank ] =
- { ViewOffsetRange< Type0 >::dimension( src.m_offset_map.N0 , arg0 )
- , ViewOffsetRange< Type1 >::dimension( src.m_offset_map.N1 , arg1 )
- , ViewOffsetRange< Type2 >::dimension( src.m_offset_map.N2 , arg2 )
- , ViewOffsetRange< Type3 >::dimension( src.m_offset_map.N3 , arg3 )
- , ViewOffsetRange< Type4 >::dimension( src.m_offset_map.N4 , arg4 )
- , ViewOffsetRange< Type5 >::dimension( src.m_offset_map.N5 , arg5 )
- , ViewOffsetRange< Type6 >::dimension( src.m_offset_map.N6 , arg6 )
- , ViewOffsetRange< Type7 >::dimension( src.m_offset_map.N7 , arg7 )
- };
-
- size_t stride[9] = {0,0,0,0,0,0,0,0,0};
-
- src.m_offset_map.stride( stride );
-
- LayoutStride spec ;
-
- size_t offset = 0 ;
-
- // Collapse dimension for non-ranges
- for ( int i = 0 , j = 0 ; i < int(src_rank) ; ++i ) {
- spec.dimension[j] = dim[i] ;
- spec.stride[j] = stride[i] ;
- offset += begin[i] * stride[i] ;
- if ( is_range[i] ) { ++j ; }
- }
-
- dst.m_management.decrement( dst.m_ptr_on_device );
-
- dst.m_management = src.m_management ;
-
- dst.m_offset_map.assign( spec );
-
- dst.m_ptr_on_device = src.ptr_on_device() + offset ;
-
- dst.m_management.increment( dst.m_ptr_on_device );
- }
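// [Editorial sketch, not part of the diff] The ViewAssignment constructors
// removed above implement subview extraction: each scalar index collapses a
// dimension, each range (a pair or ALL) keeps one, and the result may fall
// back to LayoutStride. Assuming the Kokkos::subview front end of this era,
// the pattern they back is roughly:
//
//   Kokkos::View<double***, Kokkos::LayoutRight> A("A", 10, 20, 30);
//   auto plane = Kokkos::subview(A, 5, Kokkos::ALL(), Kokkos::ALL());          // rank-2
//   auto block = Kokkos::subview(A, std::make_pair(2, 7), Kokkos::ALL(), 3);   // rank-2, strided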
-
- //------------------------------------
- /** \brief Deep copy data from compatible value type, layout, rank, and specialization.
- * Check the dimensions and allocation lengths at runtime.
- */
- template< class DT , class DL , class DD , class DM ,
- class ST , class SL , class SD , class SM >
- inline static
- void deep_copy( const View<DT,DL,DD,DM,Specialize> & dst ,
- const View<ST,SL,SD,SM,Specialize> & src ,
- const typename Impl::enable_if<(
- Impl::is_same< typename ViewTraits<DT,DL,DD,DM>::value_type ,
- typename ViewTraits<ST,SL,SD,SM>::non_const_value_type >::value
- &&
- Impl::is_same< typename ViewTraits<DT,DL,DD,DM>::array_layout ,
- typename ViewTraits<ST,SL,SD,SM>::array_layout >::value
- &&
- ( unsigned(ViewTraits<DT,DL,DD,DM>::rank) == unsigned(ViewTraits<ST,SL,SD,SM>::rank) )
- )>::type * = 0 )
- {
- typedef typename ViewTraits<DT,DL,DD,DM>::memory_space dst_memory_space ;
- typedef typename ViewTraits<ST,SL,SD,SM>::memory_space src_memory_space ;
-
- if ( dst.ptr_on_device() != src.ptr_on_device() ) {
-
- Impl::assert_shapes_are_equal( dst.m_offset_map , src.m_offset_map );
-
- const size_t nbytes = dst.m_offset_map.scalar_size * dst.m_offset_map.capacity();
-
- DeepCopy< dst_memory_space , src_memory_space >( dst.ptr_on_device() , src.ptr_on_device() , nbytes );
- }
- }
};
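// [Editorial note, not part of the diff] ViewDefaultConstruct above
// default-constructs every element of a freshly allocated buffer in place
// (placement new) from a parallel_for over a RangePolicy on the View's
// execution space, then fences before the allocation is handed back. That is
// what makes an allocation whose element type is not trivially constructible
// (for example, a View of Views) safe to create; a hypothetical use:
//
//   Kokkos::View< Kokkos::View<double*> * > nested("nested", 16);  // each element default-constructed in parallel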
-} /* namespace Impl */
-} /* namespace Kokkos */
-
-//----------------------------------------------------------------------------
-//----------------------------------------------------------------------------
-
-namespace Kokkos {
-namespace Impl {
-
template< class SrcDataType , class SrcArg1Type , class SrcArg2Type , class SrcArg3Type
, class SubArg0_type , class SubArg1_type , class SubArg2_type , class SubArg3_type
, class SubArg4_type , class SubArg5_type , class SubArg6_type , class SubArg7_type
>
struct ViewSubview< View< SrcDataType , SrcArg1Type , SrcArg2Type , SrcArg3Type , Impl::ViewDefault >
, SubArg0_type , SubArg1_type , SubArg2_type , SubArg3_type
, SubArg4_type , SubArg5_type , SubArg6_type , SubArg7_type >
{
private:
typedef View< SrcDataType , SrcArg1Type , SrcArg2Type , SrcArg3Type , Impl::ViewDefault > SrcViewType ;
enum { V0 = Impl::is_same< SubArg0_type , void >::value ? 1 : 0 };
enum { V1 = Impl::is_same< SubArg1_type , void >::value ? 1 : 0 };
enum { V2 = Impl::is_same< SubArg2_type , void >::value ? 1 : 0 };
enum { V3 = Impl::is_same< SubArg3_type , void >::value ? 1 : 0 };
enum { V4 = Impl::is_same< SubArg4_type , void >::value ? 1 : 0 };
enum { V5 = Impl::is_same< SubArg5_type , void >::value ? 1 : 0 };
enum { V6 = Impl::is_same< SubArg6_type , void >::value ? 1 : 0 };
enum { V7 = Impl::is_same< SubArg7_type , void >::value ? 1 : 0 };
// The source view rank must be equal to the input argument rank
// Once a void argument is encountered all subsequent arguments must be void.
enum { InputRank =
Impl::StaticAssert<( SrcViewType::rank ==
( V0 ? 0 : (
V1 ? 1 : (
V2 ? 2 : (
V3 ? 3 : (
V4 ? 4 : (
V5 ? 5 : (
V6 ? 6 : (
V7 ? 7 : 8 ))))))) ))
&&
( SrcViewType::rank ==
( 8 - ( V0 + V1 + V2 + V3 + V4 + V5 + V6 + V7 ) ) )
>::value ? SrcViewType::rank : 0 };
enum { R0 = Impl::ViewOffsetRange< SubArg0_type >::is_range ? 1 : 0 };
enum { R1 = Impl::ViewOffsetRange< SubArg1_type >::is_range ? 1 : 0 };
enum { R2 = Impl::ViewOffsetRange< SubArg2_type >::is_range ? 1 : 0 };
enum { R3 = Impl::ViewOffsetRange< SubArg3_type >::is_range ? 1 : 0 };
enum { R4 = Impl::ViewOffsetRange< SubArg4_type >::is_range ? 1 : 0 };
enum { R5 = Impl::ViewOffsetRange< SubArg5_type >::is_range ? 1 : 0 };
enum { R6 = Impl::ViewOffsetRange< SubArg6_type >::is_range ? 1 : 0 };
enum { R7 = Impl::ViewOffsetRange< SubArg7_type >::is_range ? 1 : 0 };
enum { OutputRank = unsigned(R0) + unsigned(R1) + unsigned(R2) + unsigned(R3)
+ unsigned(R4) + unsigned(R5) + unsigned(R6) + unsigned(R7) };
// Reverse
enum { R0_rev = 0 == InputRank ? 0u : (
1 == InputRank ? unsigned(R0) : (
2 == InputRank ? unsigned(R1) : (
3 == InputRank ? unsigned(R2) : (
4 == InputRank ? unsigned(R3) : (
5 == InputRank ? unsigned(R4) : (
6 == InputRank ? unsigned(R5) : (
7 == InputRank ? unsigned(R6) : unsigned(R7) ))))))) };
typedef typename SrcViewType::array_layout SrcViewLayout ;
// Choose array layout, attempting to preserve original layout if at all possible.
typedef typename Impl::if_c<
( // Same Layout IF
// OutputRank 0
( OutputRank == 0 )
||
// OutputRank 1 or 2, InputLayout Left, Interval 0
// because there is a single stride-one dimension or the second index has a stride.
( OutputRank <= 2 && R0 && Impl::is_same<SrcViewLayout,LayoutLeft>::value )
||
// OutputRank 1 or 2, InputLayout Right, Interval [InputRank-1]
// because there is a single stride-one dimension or the second index has a stride.
( OutputRank <= 2 && R0_rev && Impl::is_same<SrcViewLayout,LayoutRight>::value )
), SrcViewLayout , Kokkos::LayoutStride >::type OutputViewLayout ;
// Choose data type as a purely dynamic rank array to accommodate a runtime range.
typedef typename Impl::if_c< OutputRank == 0 , typename SrcViewType::value_type ,
typename Impl::if_c< OutputRank == 1 , typename SrcViewType::value_type *,
typename Impl::if_c< OutputRank == 2 , typename SrcViewType::value_type **,
typename Impl::if_c< OutputRank == 3 , typename SrcViewType::value_type ***,
typename Impl::if_c< OutputRank == 4 , typename SrcViewType::value_type ****,
typename Impl::if_c< OutputRank == 5 , typename SrcViewType::value_type *****,
typename Impl::if_c< OutputRank == 6 , typename SrcViewType::value_type ******,
typename Impl::if_c< OutputRank == 7 , typename SrcViewType::value_type *******,
typename SrcViewType::value_type ********
>::type >::type >::type >::type >::type >::type >::type >::type OutputData ;
// Choose space.
// If the source view's template arg1 or arg2 is a space then use it,
// otherwise use the source view's execution space.
typedef typename Impl::if_c< Impl::is_space< SrcArg1Type >::value , SrcArg1Type ,
- typename Impl::if_c< Impl::is_space< SrcArg2Type >::value , SrcArg2Type , typename SrcViewType::execution_space
+ typename Impl::if_c< Impl::is_space< SrcArg2Type >::value , SrcArg2Type , typename SrcViewType::device_type
>::type >::type OutputSpace ;
public:
// If keeping the layout then match non-data type arguments
// else keep execution space and memory traits.
typedef typename
Impl::if_c< Impl::is_same< SrcViewLayout , OutputViewLayout >::value
, Kokkos::View< OutputData , SrcArg1Type , SrcArg2Type , SrcArg3Type , Impl::ViewDefault >
, Kokkos::View< OutputData , OutputViewLayout , OutputSpace
, typename SrcViewType::memory_traits
, Impl::ViewDefault >
>::type type ;
};
} /* namespace Impl */
} /* namespace Kokkos */
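// Illustrative sketch (standalone, not part of the Kokkos headers in this
// patch): the ViewSubview deduction above keeps one output dimension per
// range-like subview argument (Kokkos::ALL or an index pair) and collapses
// one dimension per scalar index, so OutputRank is just the count of range
// arguments. The names All, is_range and output_rank below are hypothetical.
#include <cstddef>
#include <iostream>
#include <type_traits>
#include <utility>

struct All {};                                    // stands in for Kokkos::ALL

template< class T > struct is_range : std::false_type {};
template<> struct is_range< All > : std::true_type {};
template< class T > struct is_range< std::pair<T,T> > : std::true_type {};

template< class... Args >
std::size_t output_rank()
{
  const bool rs[] = { is_range<Args>::value... };  // one flag per argument
  std::size_t r = 0;
  for ( bool b : rs ) { if ( b ) ++r; }
  return r;
}

int main()
{
  // subview( rank-3 view, ALL, 3, pair(2,5) ) would yield a rank-2 view:
  std::cout << output_rank< All , int , std::pair<int,int> >() << std::endl;  // 2
  return 0 ;
}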
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
// Construct subview of a Rank 8 view
template< class DstDataType , class DstArg1Type , class DstArg2Type , class DstArg3Type >
template< class SrcDataType , class SrcArg1Type , class SrcArg2Type , class SrcArg3Type
, class SubArg0_type , class SubArg1_type , class SubArg2_type , class SubArg3_type
, class SubArg4_type , class SubArg5_type , class SubArg6_type , class SubArg7_type
>
KOKKOS_INLINE_FUNCTION
View< DstDataType , DstArg1Type , DstArg2Type , DstArg3Type , Impl::ViewDefault >::
View( const View< SrcDataType , SrcArg1Type , SrcArg2Type , SrcArg3Type , Impl::ViewDefault > & src
, const SubArg0_type & arg0
, const SubArg1_type & arg1
, const SubArg2_type & arg2
, const SubArg3_type & arg3
, const SubArg4_type & arg4
, const SubArg5_type & arg5
, const SubArg6_type & arg6
, const SubArg7_type & arg7
)
: m_ptr_on_device( (typename traits::value_type*) NULL)
, m_offset_map()
, m_management()
+ , m_tracker()
{
// This constructor can only be used to construct a subview
// from the source view. This type must match the subview type
// deduced from the source view and subview arguments.
typedef Impl::ViewSubview< View< SrcDataType , SrcArg1Type , SrcArg2Type , SrcArg3Type , Impl::ViewDefault >
, SubArg0_type , SubArg1_type , SubArg2_type , SubArg3_type
, SubArg4_type , SubArg5_type , SubArg6_type , SubArg7_type >
ViewSubviewDeduction ;
enum { is_a_valid_subview_constructor =
Impl::StaticAssert<
Impl::is_same< View , typename ViewSubviewDeduction::type >::value
>::value
};
if ( is_a_valid_subview_constructor ) {
typedef Impl::ViewOffsetRange< SubArg0_type > R0 ;
typedef Impl::ViewOffsetRange< SubArg1_type > R1 ;
typedef Impl::ViewOffsetRange< SubArg2_type > R2 ;
typedef Impl::ViewOffsetRange< SubArg3_type > R3 ;
typedef Impl::ViewOffsetRange< SubArg4_type > R4 ;
typedef Impl::ViewOffsetRange< SubArg5_type > R5 ;
typedef Impl::ViewOffsetRange< SubArg6_type > R6 ;
typedef Impl::ViewOffsetRange< SubArg7_type > R7 ;
// 'assign_subview' returns whether the subview offset_map
// introduces noncontiguity in the view.
const bool introduce_noncontiguity =
m_offset_map.assign_subview( src.m_offset_map
, R0::dimension( src.m_offset_map.N0 , arg0 )
, R1::dimension( src.m_offset_map.N1 , arg1 )
, R2::dimension( src.m_offset_map.N2 , arg2 )
, R3::dimension( src.m_offset_map.N3 , arg3 )
, R4::dimension( src.m_offset_map.N4 , arg4 )
, R5::dimension( src.m_offset_map.N5 , arg5 )
, R6::dimension( src.m_offset_map.N6 , arg6 )
, R7::dimension( src.m_offset_map.N7 , arg7 )
);
if ( m_offset_map.capacity() ) {
m_management = src.m_management ;
if ( introduce_noncontiguity ) m_management.set_noncontiguous();
m_ptr_on_device = src.m_ptr_on_device +
src.m_offset_map( R0::begin( arg0 )
, R1::begin( arg1 )
, R2::begin( arg2 )
, R3::begin( arg3 )
, R4::begin( arg4 )
, R5::begin( arg5 )
, R6::begin( arg6 )
, R7::begin( arg7 ) );
- m_management.increment( m_ptr_on_device );
+ m_tracker = src.m_tracker ;
}
}
}
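// Hedged usage sketch, assuming a Kokkos installation from this era: the
// rank-specific constructors above are what Kokkos::subview() ultimately
// invokes. A scalar index collapses a dimension, Kokkos::ALL() keeps it,
// and the resulting view shares the parent's allocation tracker instead of
// copying data.
#include <Kokkos_Core.hpp>

int main( int argc , char * argv[] )
{
  Kokkos::initialize( argc , argv );
  {
    Kokkos::View< double ** , Kokkos::LayoutRight > a( "a" , 100 , 50 );

    // Row 7 of 'a': rank 2 -> rank 1, stride one, no allocation.
    auto row = Kokkos::subview( a , 7 , Kokkos::ALL() );

    row( 0 ) = 1.0 ;   // aliases a( 7 , 0 )
  }
  Kokkos::finalize();
  return 0 ;
}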
// Construct subview of a Rank 7 view
template< class DstDataType , class DstArg1Type , class DstArg2Type , class DstArg3Type >
template< class SrcDataType , class SrcArg1Type , class SrcArg2Type , class SrcArg3Type
, class SubArg0_type , class SubArg1_type , class SubArg2_type , class SubArg3_type
, class SubArg4_type , class SubArg5_type , class SubArg6_type
>
KOKKOS_INLINE_FUNCTION
View< DstDataType , DstArg1Type , DstArg2Type , DstArg3Type , Impl::ViewDefault >::
View( const View< SrcDataType , SrcArg1Type , SrcArg2Type , SrcArg3Type , Impl::ViewDefault > & src
, const SubArg0_type & arg0
, const SubArg1_type & arg1
, const SubArg2_type & arg2
, const SubArg3_type & arg3
, const SubArg4_type & arg4
, const SubArg5_type & arg5
, const SubArg6_type & arg6
)
: m_ptr_on_device( (typename traits::value_type*) NULL)
, m_offset_map()
, m_management()
+ , m_tracker()
{
// This constructor can only be used to construct a subview
// from the source view. This type must match the subview type
// deduced from the source view and subview arguments.
typedef Impl::ViewSubview< View< SrcDataType , SrcArg1Type , SrcArg2Type , SrcArg3Type , Impl::ViewDefault >
, SubArg0_type , SubArg1_type , SubArg2_type , SubArg3_type
, SubArg4_type , SubArg5_type , SubArg6_type , void >
ViewSubviewDeduction ;
enum { is_a_valid_subview_constructor =
Impl::StaticAssert<
Impl::is_same< View , typename ViewSubviewDeduction::type >::value
>::value
};
if ( is_a_valid_subview_constructor ) {
typedef Impl::ViewOffsetRange< SubArg0_type > R0 ;
typedef Impl::ViewOffsetRange< SubArg1_type > R1 ;
typedef Impl::ViewOffsetRange< SubArg2_type > R2 ;
typedef Impl::ViewOffsetRange< SubArg3_type > R3 ;
typedef Impl::ViewOffsetRange< SubArg4_type > R4 ;
typedef Impl::ViewOffsetRange< SubArg5_type > R5 ;
typedef Impl::ViewOffsetRange< SubArg6_type > R6 ;
// 'assign_subview' returns whether the subview offset_map
// introduces noncontiguity in the view.
const bool introduce_noncontiguity =
m_offset_map.assign_subview( src.m_offset_map
, R0::dimension( src.m_offset_map.N0 , arg0 )
, R1::dimension( src.m_offset_map.N1 , arg1 )
, R2::dimension( src.m_offset_map.N2 , arg2 )
, R3::dimension( src.m_offset_map.N3 , arg3 )
, R4::dimension( src.m_offset_map.N4 , arg4 )
, R5::dimension( src.m_offset_map.N5 , arg5 )
, R6::dimension( src.m_offset_map.N6 , arg6 )
, 0
);
if ( m_offset_map.capacity() ) {
m_management = src.m_management ;
if ( introduce_noncontiguity ) m_management.set_noncontiguous();
m_ptr_on_device = src.m_ptr_on_device +
src.m_offset_map( R0::begin( arg0 )
, R1::begin( arg1 )
, R2::begin( arg2 )
, R3::begin( arg3 )
, R4::begin( arg4 )
, R5::begin( arg5 )
, R6::begin( arg6 )
- , 0 );
- m_management.increment( m_ptr_on_device );
+ );
+ m_tracker = src.m_tracker ;
}
}
}
// Construct subview of a Rank 6 view
template< class DstDataType , class DstArg1Type , class DstArg2Type , class DstArg3Type >
template< class SrcDataType , class SrcArg1Type , class SrcArg2Type , class SrcArg3Type
, class SubArg0_type , class SubArg1_type , class SubArg2_type , class SubArg3_type
, class SubArg4_type , class SubArg5_type
>
KOKKOS_INLINE_FUNCTION
View< DstDataType , DstArg1Type , DstArg2Type , DstArg3Type , Impl::ViewDefault >::
View( const View< SrcDataType , SrcArg1Type , SrcArg2Type , SrcArg3Type , Impl::ViewDefault > & src
, const SubArg0_type & arg0
, const SubArg1_type & arg1
, const SubArg2_type & arg2
, const SubArg3_type & arg3
, const SubArg4_type & arg4
, const SubArg5_type & arg5
)
: m_ptr_on_device( (typename traits::value_type*) NULL)
, m_offset_map()
, m_management()
+ , m_tracker()
{
// This constructor can only be used to construct a subview
// from the source view. This type must match the subview type
// deduced from the source view and subview arguments.
typedef Impl::ViewSubview< View< SrcDataType , SrcArg1Type , SrcArg2Type , SrcArg3Type , Impl::ViewDefault >
, SubArg0_type , SubArg1_type , SubArg2_type , SubArg3_type
, SubArg4_type , SubArg5_type , void , void >
ViewSubviewDeduction ;
enum { is_a_valid_subview_constructor =
Impl::StaticAssert<
Impl::is_same< View , typename ViewSubviewDeduction::type >::value
>::value
};
if ( is_a_valid_subview_constructor ) {
typedef Impl::ViewOffsetRange< SubArg0_type > R0 ;
typedef Impl::ViewOffsetRange< SubArg1_type > R1 ;
typedef Impl::ViewOffsetRange< SubArg2_type > R2 ;
typedef Impl::ViewOffsetRange< SubArg3_type > R3 ;
typedef Impl::ViewOffsetRange< SubArg4_type > R4 ;
typedef Impl::ViewOffsetRange< SubArg5_type > R5 ;
// 'assign_subview' returns whether the subview offset_map
// introduces noncontiguity in the view.
const bool introduce_noncontiguity =
m_offset_map.assign_subview( src.m_offset_map
, R0::dimension( src.m_offset_map.N0 , arg0 )
, R1::dimension( src.m_offset_map.N1 , arg1 )
, R2::dimension( src.m_offset_map.N2 , arg2 )
, R3::dimension( src.m_offset_map.N3 , arg3 )
, R4::dimension( src.m_offset_map.N4 , arg4 )
, R5::dimension( src.m_offset_map.N5 , arg5 )
, 0
, 0
);
if ( m_offset_map.capacity() ) {
m_management = src.m_management ;
if ( introduce_noncontiguity ) m_management.set_noncontiguous();
m_ptr_on_device = src.m_ptr_on_device +
src.m_offset_map( R0::begin( arg0 )
, R1::begin( arg1 )
, R2::begin( arg2 )
, R3::begin( arg3 )
, R4::begin( arg4 )
, R5::begin( arg5 )
- , 0
- , 0 );
- m_management.increment( m_ptr_on_device );
+ );
+ m_tracker = src.m_tracker ;
}
}
}
// Construct subview of a Rank 5 view
template< class DstDataType , class DstArg1Type , class DstArg2Type , class DstArg3Type >
template< class SrcDataType , class SrcArg1Type , class SrcArg2Type , class SrcArg3Type
, class SubArg0_type , class SubArg1_type , class SubArg2_type , class SubArg3_type
, class SubArg4_type
>
KOKKOS_INLINE_FUNCTION
View< DstDataType , DstArg1Type , DstArg2Type , DstArg3Type , Impl::ViewDefault >::
View( const View< SrcDataType , SrcArg1Type , SrcArg2Type , SrcArg3Type , Impl::ViewDefault > & src
, const SubArg0_type & arg0
, const SubArg1_type & arg1
, const SubArg2_type & arg2
, const SubArg3_type & arg3
, const SubArg4_type & arg4
)
: m_ptr_on_device( (typename traits::value_type*) NULL)
, m_offset_map()
, m_management()
+ , m_tracker()
{
// This constructor can only be used to construct a subview
// from the source view. This type must match the subview type
// deduced from the source view and subview arguments.
typedef Impl::ViewSubview< View< SrcDataType , SrcArg1Type , SrcArg2Type , SrcArg3Type , Impl::ViewDefault >
, SubArg0_type , SubArg1_type , SubArg2_type , SubArg3_type
, SubArg4_type , void , void , void >
ViewSubviewDeduction ;
enum { is_a_valid_subview_constructor =
Impl::StaticAssert<
Impl::is_same< View , typename ViewSubviewDeduction::type >::value
>::value
};
if ( is_a_valid_subview_constructor ) {
typedef Impl::ViewOffsetRange< SubArg0_type > R0 ;
typedef Impl::ViewOffsetRange< SubArg1_type > R1 ;
typedef Impl::ViewOffsetRange< SubArg2_type > R2 ;
typedef Impl::ViewOffsetRange< SubArg3_type > R3 ;
typedef Impl::ViewOffsetRange< SubArg4_type > R4 ;
// 'assign_subview' returns whether the subview offset_map
// introduces noncontiguity in the view.
const bool introduce_noncontiguity =
m_offset_map.assign_subview( src.m_offset_map
, R0::dimension( src.m_offset_map.N0 , arg0 )
, R1::dimension( src.m_offset_map.N1 , arg1 )
, R2::dimension( src.m_offset_map.N2 , arg2 )
, R3::dimension( src.m_offset_map.N3 , arg3 )
, R4::dimension( src.m_offset_map.N4 , arg4 )
, 0
, 0
, 0
);
if ( m_offset_map.capacity() ) {
m_management = src.m_management ;
if ( introduce_noncontiguity ) m_management.set_noncontiguous();
m_ptr_on_device = src.m_ptr_on_device +
src.m_offset_map( R0::begin( arg0 )
, R1::begin( arg1 )
, R2::begin( arg2 )
, R3::begin( arg3 )
, R4::begin( arg4 )
- , 0
- , 0
- , 0 );
- m_management.increment( m_ptr_on_device );
+ );
+ m_tracker = src.m_tracker ;
}
}
}
// Construct subview of a Rank 4 view
template< class DstDataType , class DstArg1Type , class DstArg2Type , class DstArg3Type >
template< class SrcDataType , class SrcArg1Type , class SrcArg2Type , class SrcArg3Type
, class SubArg0_type , class SubArg1_type , class SubArg2_type , class SubArg3_type
>
KOKKOS_INLINE_FUNCTION
View< DstDataType , DstArg1Type , DstArg2Type , DstArg3Type , Impl::ViewDefault >::
View( const View< SrcDataType , SrcArg1Type , SrcArg2Type , SrcArg3Type , Impl::ViewDefault > & src
, const SubArg0_type & arg0
, const SubArg1_type & arg1
, const SubArg2_type & arg2
, const SubArg3_type & arg3
)
: m_ptr_on_device( (typename traits::value_type*) NULL)
, m_offset_map()
, m_management()
+ , m_tracker()
{
// This constructor can only be used to construct a subview
// from the source view. This type must match the subview type
// deduced from the source view and subview arguments.
typedef Impl::ViewSubview< View< SrcDataType , SrcArg1Type , SrcArg2Type , SrcArg3Type , Impl::ViewDefault >
, SubArg0_type , SubArg1_type , SubArg2_type , SubArg3_type
, void , void , void , void >
ViewSubviewDeduction ;
enum { is_a_valid_subview_constructor =
Impl::StaticAssert<
Impl::is_same< View , typename ViewSubviewDeduction::type >::value
>::value
};
if ( is_a_valid_subview_constructor ) {
typedef Impl::ViewOffsetRange< SubArg0_type > R0 ;
typedef Impl::ViewOffsetRange< SubArg1_type > R1 ;
typedef Impl::ViewOffsetRange< SubArg2_type > R2 ;
typedef Impl::ViewOffsetRange< SubArg3_type > R3 ;
// 'assign_subview' returns whether the subview offset_map
// introduces noncontiguity in the view.
const bool introduce_noncontiguity =
m_offset_map.assign_subview( src.m_offset_map
, R0::dimension( src.m_offset_map.N0 , arg0 )
, R1::dimension( src.m_offset_map.N1 , arg1 )
, R2::dimension( src.m_offset_map.N2 , arg2 )
, R3::dimension( src.m_offset_map.N3 , arg3 )
, 0
, 0
, 0
, 0
);
if ( m_offset_map.capacity() ) {
m_management = src.m_management ;
if ( introduce_noncontiguity ) m_management.set_noncontiguous();
m_ptr_on_device = src.m_ptr_on_device +
src.m_offset_map( R0::begin( arg0 )
, R1::begin( arg1 )
, R2::begin( arg2 )
, R3::begin( arg3 )
- , 0
- , 0
- , 0
- , 0 );
- m_management.increment( m_ptr_on_device );
+ );
+ m_tracker = src.m_tracker ;
}
}
}
// Construct subview of a Rank 3 view
template< class DstDataType , class DstArg1Type , class DstArg2Type , class DstArg3Type >
template< class SrcDataType , class SrcArg1Type , class SrcArg2Type , class SrcArg3Type
, class SubArg0_type , class SubArg1_type , class SubArg2_type
>
KOKKOS_INLINE_FUNCTION
View< DstDataType , DstArg1Type , DstArg2Type , DstArg3Type , Impl::ViewDefault >::
View( const View< SrcDataType , SrcArg1Type , SrcArg2Type , SrcArg3Type , Impl::ViewDefault > & src
, const SubArg0_type & arg0
, const SubArg1_type & arg1
, const SubArg2_type & arg2
)
: m_ptr_on_device( (typename traits::value_type*) NULL)
, m_offset_map()
, m_management()
+ , m_tracker()
{
// This constructor can only be used to construct a subview
// from the source view. This type must match the subview type
// deduced from the source view and subview arguments.
typedef Impl::ViewSubview< View< SrcDataType , SrcArg1Type , SrcArg2Type , SrcArg3Type , Impl::ViewDefault >
, SubArg0_type , SubArg1_type , SubArg2_type , void , void , void , void , void >
ViewSubviewDeduction ;
enum { is_a_valid_subview_constructor =
Impl::StaticAssert<
Impl::is_same< View , typename ViewSubviewDeduction::type >::value
>::value
};
if ( is_a_valid_subview_constructor ) {
typedef Impl::ViewOffsetRange< SubArg0_type > R0 ;
typedef Impl::ViewOffsetRange< SubArg1_type > R1 ;
typedef Impl::ViewOffsetRange< SubArg2_type > R2 ;
// 'assign_subview' returns whether the subview offset_map
// introduces noncontiguity in the view.
const bool introduce_noncontiguity =
m_offset_map.assign_subview( src.m_offset_map
, R0::dimension( src.m_offset_map.N0 , arg0 )
, R1::dimension( src.m_offset_map.N1 , arg1 )
, R2::dimension( src.m_offset_map.N2 , arg2 )
, 0 , 0 , 0 , 0 , 0);
if ( m_offset_map.capacity() ) {
m_management = src.m_management ;
if ( introduce_noncontiguity ) m_management.set_noncontiguous();
m_ptr_on_device = src.m_ptr_on_device +
src.m_offset_map( R0::begin( arg0 )
, R1::begin( arg1 )
, R2::begin( arg2 )
- , 0 , 0 , 0 , 0 , 0 );
- m_management.increment( m_ptr_on_device );
+ );
+ m_tracker = src.m_tracker ;
}
}
}
// Construct subview of a Rank 2 view
template< class DstDataType , class DstArg1Type , class DstArg2Type , class DstArg3Type >
template< class SrcDataType , class SrcArg1Type , class SrcArg2Type , class SrcArg3Type
, class SubArg0_type , class SubArg1_type
>
KOKKOS_INLINE_FUNCTION
View< DstDataType , DstArg1Type , DstArg2Type , DstArg3Type , Impl::ViewDefault >::
View( const View< SrcDataType , SrcArg1Type , SrcArg2Type , SrcArg3Type , Impl::ViewDefault > & src
, const SubArg0_type & arg0
, const SubArg1_type & arg1
)
: m_ptr_on_device( (typename traits::value_type*) NULL)
, m_offset_map()
, m_management()
+ , m_tracker()
{
// This constructor can only be used to construct a subview
// from the source view. This type must match the subview type
// deduced from the source view and subview arguments.
typedef Impl::ViewSubview< View< SrcDataType , SrcArg1Type , SrcArg2Type , SrcArg3Type , Impl::ViewDefault >
, SubArg0_type , SubArg1_type , void , void , void , void , void , void >
ViewSubviewDeduction ;
enum { is_a_valid_subview_constructor =
Impl::StaticAssert<
Impl::is_same< View , typename ViewSubviewDeduction::type >::value
>::value
};
if ( is_a_valid_subview_constructor ) {
typedef Impl::ViewOffsetRange< SubArg0_type > R0 ;
typedef Impl::ViewOffsetRange< SubArg1_type > R1 ;
// 'assign_subview' returns whether the subview offset_map
// introduces noncontiguity in the view.
const bool introduce_noncontiguity =
m_offset_map.assign_subview( src.m_offset_map
, R0::dimension( src.m_offset_map.N0 , arg0 )
, R1::dimension( src.m_offset_map.N1 , arg1 )
, 0 , 0 , 0 , 0 , 0 , 0 );
if ( m_offset_map.capacity() ) {
m_management = src.m_management ;
if ( introduce_noncontiguity ) m_management.set_noncontiguous();
m_ptr_on_device = src.m_ptr_on_device +
src.m_offset_map( R0::begin( arg0 )
, R1::begin( arg1 )
- , 0 , 0 , 0 , 0 , 0 , 0 );
- m_management.increment( m_ptr_on_device );
+ );
+ m_tracker = src.m_tracker ;
}
}
}
// Construct subview of a Rank 1 view
template< class DstDataType , class DstArg1Type , class DstArg2Type , class DstArg3Type >
template< class SrcDataType , class SrcArg1Type , class SrcArg2Type , class SrcArg3Type
, class SubArg0_type
>
KOKKOS_INLINE_FUNCTION
View< DstDataType , DstArg1Type , DstArg2Type , DstArg3Type , Impl::ViewDefault >::
View( const View< SrcDataType , SrcArg1Type , SrcArg2Type , SrcArg3Type , Impl::ViewDefault > & src
, const SubArg0_type & arg0
)
: m_ptr_on_device( (typename traits::value_type*) NULL)
, m_offset_map()
, m_management()
+ , m_tracker()
{
// This constructor can only be used to construct a subview
// from the source view. This type must match the subview type
// deduced from the source view and subview arguments.
typedef Impl::ViewSubview< View< SrcDataType , SrcArg1Type , SrcArg2Type , SrcArg3Type , Impl::ViewDefault >
, SubArg0_type , void , void , void , void , void , void , void >
ViewSubviewDeduction ;
enum { is_a_valid_subview_constructor =
Impl::StaticAssert<
Impl::is_same< View , typename ViewSubviewDeduction::type >::value
>::value
};
if ( is_a_valid_subview_constructor ) {
typedef Impl::ViewOffsetRange< SubArg0_type > R0 ;
// 'assign_subview' returns whether the subview offset_map
// introduces noncontiguity in the view.
const bool introduce_noncontiguity =
m_offset_map.assign_subview( src.m_offset_map
, R0::dimension( src.m_offset_map.N0 , arg0 )
, 0 , 0 , 0 , 0 , 0 , 0 , 0 );
if ( m_offset_map.capacity() ) {
m_management = src.m_management ;
if ( introduce_noncontiguity ) m_management.set_noncontiguous();
m_ptr_on_device = src.m_ptr_on_device +
src.m_offset_map( R0::begin( arg0 )
- , 0 , 0 , 0 , 0 , 0 , 0 , 0 );
- m_management.increment( m_ptr_on_device );
+ );
+ m_tracker = src.m_tracker ;
}
}
}
} /* namespace Kokkos */
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
#endif /* #ifndef KOKKOS_VIEWDEFAULT_HPP */
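// Illustrative sketch (standalone, not the Kokkos code above): each subview
// constructor in this file sets its data pointer to the parent pointer plus
// the parent offset map evaluated at the subview origin. For a row-major
// (LayoutRight) rank-2 array that origin offset is simply begin1 + N1 * begin0.
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
  const std::size_t N0 = 4 , N1 = 6 ;
  std::vector<double> parent( N0 * N1 , 0.0 );

  // Subview covering rows [1,3) and columns [2,5) of the parent.
  const std::size_t begin0 = 1 , begin1 = 2 ;
  double * sub_ptr = parent.data() + ( begin1 + N1 * begin0 );  // origin offset

  sub_ptr[0] = 7.0 ;                                // writes parent(1,2)
  std::cout << parent[ 1 * N1 + 2 ] << std::endl;   // prints 7
  return 0 ;
}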
diff --git a/lib/kokkos/core/src/impl/Kokkos_ViewOffset.hpp b/lib/kokkos/core/src/impl/Kokkos_ViewOffset.hpp
index 1cced4954..61cd75844 100755
--- a/lib/kokkos/core/src/impl/Kokkos_ViewOffset.hpp
+++ b/lib/kokkos/core/src/impl/Kokkos_ViewOffset.hpp
@@ -1,1335 +1,1348 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_VIEWOFFSET_HPP
#define KOKKOS_VIEWOFFSET_HPP
#include <Kokkos_Pair.hpp>
#include <Kokkos_Layout.hpp>
#include <impl/Kokkos_Traits.hpp>
#include <impl/Kokkos_Shape.hpp>
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
struct ALL ;
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos { namespace Impl {
template < class ShapeType , class LayoutType , typename Enable = void >
struct ViewOffset ;
//----------------------------------------------------------------------------
// LayoutLeft AND ( 1 >= rank OR 0 == rank_dynamic ) : no padding / striding
template < class ShapeType >
struct ViewOffset< ShapeType , LayoutLeft
, typename enable_if<( 1 >= ShapeType::rank
||
0 == ShapeType::rank_dynamic
)>::type >
: public ShapeType
{
typedef size_t size_type ;
typedef ShapeType shape_type ;
typedef LayoutLeft array_layout ;
enum { has_padding = false };
template< unsigned R >
KOKKOS_INLINE_FUNCTION
void assign( size_t n )
{ assign_shape_dimension<R>( *this , n ); }
// Return whether the subview introduced noncontiguity
template< class S , class L >
KOKKOS_INLINE_FUNCTION
typename Impl::enable_if<( 0 == shape_type::rank &&
Impl::is_same<L,LayoutLeft>::value
), bool >::type
assign_subview( const ViewOffset<S,L,void> &
, const size_t n0
, const size_t n1
, const size_t n2
, const size_t n3
, const size_t n4
, const size_t n5
, const size_t n6
, const size_t n7
)
{
return false ; // did not introduce noncontiguity
}
// This subview must be 1 == rank and 1 == rank_dynamic.
// The source dimension #0 must be non-zero and all other dimensions are zero.
// Return whether the subview introduced noncontiguity
template< class S , class L >
KOKKOS_INLINE_FUNCTION
typename Impl::enable_if<( 1 == shape_type::rank &&
1 == shape_type::rank_dynamic &&
1 <= S::rank &&
Impl::is_same<L,LayoutLeft>::value
), bool >::type
assign_subview( const ViewOffset<S,L,void> &
, const size_t n0
, const size_t n1
, const size_t n2
, const size_t n3
, const size_t n4
, const size_t n5
, const size_t n6
, const size_t n7
)
{
// n1 .. n7 must be zero
shape_type::N0 = n0 ;
return false ; // did not introduce noncontiguity
}
KOKKOS_INLINE_FUNCTION
- void assign( size_t n0 , unsigned n1 , unsigned n2 , unsigned n3
- , unsigned n4 , unsigned n5 , unsigned n6 , unsigned n7
- , unsigned = 0 )
+ void assign( size_t n0 , size_t n1 , size_t n2 , size_t n3
+ , size_t n4 , size_t n5 , size_t n6 , size_t n7
+ , size_t = 0 )
{ shape_type::assign( *this , n0, n1, n2, n3, n4, n5, n6, n7 ); }
template< class ShapeRHS >
KOKKOS_INLINE_FUNCTION
void assign( const ViewOffset< ShapeRHS , LayoutLeft > & rhs
, typename enable_if<( int(ShapeRHS::rank) == int(shape_type::rank)
&&
int(ShapeRHS::rank_dynamic) <= int(shape_type::rank_dynamic)
)>::type * = 0 )
{ shape_type::assign( *this , rhs.N0, rhs.N1, rhs.N2, rhs.N3, rhs.N4, rhs.N5, rhs.N6, rhs.N7 ); }
template< class ShapeRHS >
KOKKOS_INLINE_FUNCTION
void assign( const ViewOffset< ShapeRHS , LayoutRight > & rhs
, typename enable_if<( 1 == int(ShapeRHS::rank)
&&
1 == int(shape_type::rank)
&&
1 == int(shape_type::rank_dynamic)
)>::type * = 0 )
{ shape_type::assign( *this , rhs.N0, rhs.N1, rhs.N2, rhs.N3, rhs.N4, rhs.N5, rhs.N6, rhs.N7 ); }
KOKKOS_INLINE_FUNCTION
void set_padding() {}
KOKKOS_INLINE_FUNCTION
size_type cardinality() const
{ return size_type(shape_type::N0) * shape_type::N1 * shape_type::N2 * shape_type::N3 * shape_type::N4 * shape_type::N5 * shape_type::N6 * shape_type::N7 ; }
KOKKOS_INLINE_FUNCTION
size_type capacity() const
{ return size_type(shape_type::N0) * shape_type::N1 * shape_type::N2 * shape_type::N3 * shape_type::N4 * shape_type::N5 * shape_type::N6 * shape_type::N7 ; }
// Stride with [ rank ] value is the total length
template< typename iType >
KOKKOS_INLINE_FUNCTION
void stride( iType * const s ) const
{
s[0] = 1 ;
if ( 0 < shape_type::rank ) { s[1] = shape_type::N0 ; }
if ( 1 < shape_type::rank ) { s[2] = s[1] * shape_type::N1 ; }
if ( 2 < shape_type::rank ) { s[3] = s[2] * shape_type::N2 ; }
if ( 3 < shape_type::rank ) { s[4] = s[3] * shape_type::N3 ; }
if ( 4 < shape_type::rank ) { s[5] = s[4] * shape_type::N4 ; }
if ( 5 < shape_type::rank ) { s[6] = s[5] * shape_type::N5 ; }
if ( 6 < shape_type::rank ) { s[7] = s[6] * shape_type::N6 ; }
if ( 7 < shape_type::rank ) { s[8] = s[7] * shape_type::N7 ; }
}
KOKKOS_INLINE_FUNCTION size_type stride_0() const { return 1 ; }
KOKKOS_INLINE_FUNCTION size_type stride_1() const { return shape_type::N0 ; }
KOKKOS_INLINE_FUNCTION size_type stride_2() const { return shape_type::N0 * shape_type::N1 ; }
KOKKOS_INLINE_FUNCTION size_type stride_3() const { return shape_type::N0 * shape_type::N1 * shape_type::N2 ; }
KOKKOS_INLINE_FUNCTION
size_type stride_4() const
{ return shape_type::N0 * shape_type::N1 * shape_type::N2 * shape_type::N3 ; }
KOKKOS_INLINE_FUNCTION
size_type stride_5() const
{ return shape_type::N0 * shape_type::N1 * shape_type::N2 * shape_type::N3 * shape_type::N4 ; }
KOKKOS_INLINE_FUNCTION
size_type stride_6() const
{ return shape_type::N0 * shape_type::N1 * shape_type::N2 * shape_type::N3 * shape_type::N4 * shape_type::N5 ; }
KOKKOS_INLINE_FUNCTION
size_type stride_7() const
{ return shape_type::N0 * shape_type::N1 * shape_type::N2 * shape_type::N3 * shape_type::N4 * shape_type::N5 * shape_type::N6 ; }
// rank 1
template< typename I0 >
KOKKOS_FORCEINLINE_FUNCTION
size_type operator()( I0 const & i0 ) const { return i0 ; }
// rank 2
template < typename I0 , typename I1 >
KOKKOS_FORCEINLINE_FUNCTION
size_type operator()( I0 const & i0 , I1 const & i1 ) const
{ return i0 + shape_type::N0 * i1 ; }
//rank 3
template <typename I0, typename I1, typename I2>
KOKKOS_FORCEINLINE_FUNCTION
size_type operator()( I0 const& i0
, I1 const& i1
, I2 const& i2
) const
{
return i0 + shape_type::N0 * (
i1 + shape_type::N1 * i2 );
}
//rank 4
template <typename I0, typename I1, typename I2, typename I3>
KOKKOS_FORCEINLINE_FUNCTION
size_type operator()( I0 const& i0, I1 const& i1, I2 const& i2, I3 const& i3 ) const
{
return i0 + shape_type::N0 * (
i1 + shape_type::N1 * (
i2 + shape_type::N2 * i3 ));
}
//rank 5
template < typename I0, typename I1, typename I2, typename I3
,typename I4 >
KOKKOS_FORCEINLINE_FUNCTION
size_type operator()( I0 const& i0, I1 const& i1, I2 const& i2, I3 const& i3, I4 const& i4 ) const
{
return i0 + shape_type::N0 * (
i1 + shape_type::N1 * (
i2 + shape_type::N2 * (
i3 + shape_type::N3 * i4 )));
}
//rank 6
template < typename I0, typename I1, typename I2, typename I3
,typename I4, typename I5 >
KOKKOS_FORCEINLINE_FUNCTION
size_type operator()( I0 const& i0, I1 const& i1, I2 const& i2, I3 const& i3, I4 const& i4, I5 const& i5 ) const
{
return i0 + shape_type::N0 * (
i1 + shape_type::N1 * (
i2 + shape_type::N2 * (
i3 + shape_type::N3 * (
i4 + shape_type::N4 * i5 ))));
}
//rank 7
template < typename I0, typename I1, typename I2, typename I3
,typename I4, typename I5, typename I6 >
KOKKOS_FORCEINLINE_FUNCTION
size_type operator()( I0 const& i0, I1 const& i1, I2 const& i2, I3 const& i3, I4 const& i4, I5 const& i5, I6 const& i6) const
{
return i0 + shape_type::N0 * (
i1 + shape_type::N1 * (
i2 + shape_type::N2 * (
i3 + shape_type::N3 * (
i4 + shape_type::N4 * (
i5 + shape_type::N5 * i6 )))));
}
//rank 8
template < typename I0, typename I1, typename I2, typename I3
,typename I4, typename I5, typename I6, typename I7 >
KOKKOS_FORCEINLINE_FUNCTION
size_type operator()( I0 const& i0, I1 const& i1, I2 const& i2, I3 const& i3, I4 const& i4, I5 const& i5, I6 const& i6, I7 const& i7) const
{
return i0 + shape_type::N0 * (
i1 + shape_type::N1 * (
i2 + shape_type::N2 * (
i3 + shape_type::N3 * (
i4 + shape_type::N4 * (
i5 + shape_type::N5 * (
i6 + shape_type::N6 * i7 ))))));
}
};
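// Illustrative sketch (standalone): the LayoutLeft operator() members above
// factor a column-major offset as i0 + N0*( i1 + N1*i2 ), which equals the
// explicit stride form i0*1 + i1*N0 + i2*N0*N1 reported by stride(). The
// loop below merely checks that identity for a small shape.
#include <cassert>
#include <cstddef>

int main()
{
  const std::size_t N0 = 3 , N1 = 4 , N2 = 5 ;

  for ( std::size_t i2 = 0 ; i2 < N2 ; ++i2 )
  for ( std::size_t i1 = 0 ; i1 < N1 ; ++i1 )
  for ( std::size_t i0 = 0 ; i0 < N0 ; ++i0 ) {
    const std::size_t nested  = i0 + N0 * ( i1 + N1 * i2 );          // operator()
    const std::size_t strided = i0 * 1 + i1 * N0 + i2 * ( N0 * N1 ); // stride()
    assert( nested == strided );
  }
  return 0 ;
}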
//----------------------------------------------------------------------------
// LayoutLeft AND ( 1 < rank AND 0 < rank_dynamic ) : has padding / striding
template < class ShapeType >
struct ViewOffset< ShapeType , LayoutLeft
, typename enable_if<( 1 < ShapeType::rank
&&
0 < ShapeType::rank_dynamic
)>::type >
: public ShapeType
{
typedef size_t size_type ;
typedef ShapeType shape_type ;
typedef LayoutLeft array_layout ;
enum { has_padding = true };
size_type S0 ;
// This subview must be 2 == rank and 2 == rank_dynamic
// due to only having stride #0.
// The source dimension #0 must be non-zero for stride-one leading dimension.
- // If source is rank deficient then set to zero.
- // Return whether the subview introduced noncontiguity
+ // At most one subsequent dimension can be non-zero.
+ // Return whether the subview introduced noncontiguity.
template< class S , class L >
KOKKOS_INLINE_FUNCTION
typename Impl::enable_if<( 2 == shape_type::rank &&
2 == shape_type::rank_dynamic &&
2 <= S::rank &&
Impl::is_same<L,LayoutLeft>::value
), bool >::type
assign_subview( const ViewOffset<S,L,void> & rhs
, const size_t n0
, const size_t n1
, const size_t n2
, const size_t n3
, const size_t n4
, const size_t n5
, const size_t n6
, const size_t n7
)
{
- // N0 = n0 ;
// N1 = second non-zero dimension
// S0 = stride for second non-zero dimension
- shape_type::N0 = 0 ;
+ shape_type::N0 = n0 ;
shape_type::N1 = 0 ;
S0 = 0 ;
- if ( 0 == n0 ) {}
- else if ( n1 ) { shape_type::N0 = n0 ; shape_type::N1 = n1 ; S0 = rhs.stride_1(); }
- else if ( 2 < S::rank && n2 ) { shape_type::N0 = n0 ; shape_type::N1 = n2 ; S0 = rhs.stride_2(); }
- else if ( 3 < S::rank && n3 ) { shape_type::N0 = n0 ; shape_type::N1 = n3 ; S0 = rhs.stride_3(); }
- else if ( 4 < S::rank && n4 ) { shape_type::N0 = n0 ; shape_type::N1 = n4 ; S0 = rhs.stride_4(); }
- else if ( 5 < S::rank && n5 ) { shape_type::N0 = n0 ; shape_type::N1 = n5 ; S0 = rhs.stride_5(); }
- else if ( 6 < S::rank && n6 ) { shape_type::N0 = n0 ; shape_type::N1 = n6 ; S0 = rhs.stride_6(); }
- else if ( 7 < S::rank && n7 ) { shape_type::N0 = n0 ; shape_type::N1 = n7 ; S0 = rhs.stride_7(); }
+ if ( n1 ) { shape_type::N1 = n1 ; S0 = rhs.stride_1(); }
+ else if ( 2 < S::rank && n2 ) { shape_type::N1 = n2 ; S0 = rhs.stride_2(); }
+ else if ( 3 < S::rank && n3 ) { shape_type::N1 = n3 ; S0 = rhs.stride_3(); }
+ else if ( 4 < S::rank && n4 ) { shape_type::N1 = n4 ; S0 = rhs.stride_4(); }
+ else if ( 5 < S::rank && n5 ) { shape_type::N1 = n5 ; S0 = rhs.stride_5(); }
+ else if ( 6 < S::rank && n6 ) { shape_type::N1 = n6 ; S0 = rhs.stride_6(); }
+ else if ( 7 < S::rank && n7 ) { shape_type::N1 = n7 ; S0 = rhs.stride_7(); }
// Noncontiguity is introduced if the first dimension changed
// or if a range was taken of a dimension after the second.
return ( size_t(shape_type::N0) != size_t(rhs.N0) ) || ( 0 == n1 );
}
template< unsigned R >
KOKKOS_INLINE_FUNCTION
void assign( size_t n )
{ assign_shape_dimension<R>( *this , n ); }
KOKKOS_INLINE_FUNCTION
- void assign( size_t n0 , unsigned n1 , unsigned n2 , unsigned n3
- , unsigned n4 , unsigned n5 , unsigned n6 , unsigned n7
- , unsigned = 0 )
+ void assign( size_t n0 , size_t n1 , size_t n2 , size_t n3
+ , size_t n4 , size_t n5 , size_t n6 , size_t n7
+ , size_t = 0 )
{ shape_type::assign( *this , n0, n1, n2, n3, n4, n5, n6, n7 ); S0 = shape_type::N0 ; }
template< class ShapeRHS >
KOKKOS_INLINE_FUNCTION
void assign( const ViewOffset< ShapeRHS , LayoutLeft > & rhs
, typename enable_if<( int(ShapeRHS::rank) == int(shape_type::rank)
&&
int(ShapeRHS::rank_dynamic) <= int(shape_type::rank_dynamic)
&&
int(ShapeRHS::rank_dynamic) == 0
)>::type * = 0 )
{
shape_type::assign( *this , rhs.N0, rhs.N1, rhs.N2, rhs.N3, rhs.N4, rhs.N5, rhs.N6, rhs.N7 );
S0 = shape_type::N0 ; // No padding when dynamic_rank == 0
}
template< class ShapeRHS >
KOKKOS_INLINE_FUNCTION
void assign( const ViewOffset< ShapeRHS , LayoutLeft > & rhs
, typename enable_if<( int(ShapeRHS::rank) == int(shape_type::rank)
&&
int(ShapeRHS::rank_dynamic) <= int(shape_type::rank_dynamic)
&&
int(ShapeRHS::rank_dynamic) > 0
)>::type * = 0 )
{
shape_type::assign( *this , rhs.N0, rhs.N1, rhs.N2, rhs.N3, rhs.N4, rhs.N5, rhs.N6, rhs.N7 );
S0 = rhs.S0 ; // possibly padding when dynamic rank > 0
}
KOKKOS_INLINE_FUNCTION
void set_padding()
{
enum { div = MEMORY_ALIGNMENT / shape_type::scalar_size };
enum { mod = MEMORY_ALIGNMENT % shape_type::scalar_size };
enum { align = 0 == mod ? div : 0 };
if ( align && MEMORY_ALIGNMENT_THRESHOLD * align < S0 ) {
const size_type count_mod = S0 % ( div ? div : 1 );
if ( count_mod ) { S0 += align - count_mod ; }
}
}
KOKKOS_INLINE_FUNCTION
size_type cardinality() const
{ return size_type(shape_type::N0) * shape_type::N1 * shape_type::N2 * shape_type::N3 * shape_type::N4 * shape_type::N5 * shape_type::N6 * shape_type::N7 ; }
KOKKOS_INLINE_FUNCTION
size_type capacity() const
{ return size_type(S0) * shape_type::N1 * shape_type::N2 * shape_type::N3 * shape_type::N4 * shape_type::N5 * shape_type::N6 * shape_type::N7 ; }
// Stride with [ rank ] as total length
template< typename iType >
KOKKOS_INLINE_FUNCTION
void stride( iType * const s ) const
{
s[0] = 1 ;
if ( 0 < shape_type::rank ) { s[1] = S0 ; }
if ( 1 < shape_type::rank ) { s[2] = s[1] * shape_type::N1 ; }
if ( 2 < shape_type::rank ) { s[3] = s[2] * shape_type::N2 ; }
if ( 3 < shape_type::rank ) { s[4] = s[3] * shape_type::N3 ; }
if ( 4 < shape_type::rank ) { s[5] = s[4] * shape_type::N4 ; }
if ( 5 < shape_type::rank ) { s[6] = s[5] * shape_type::N5 ; }
if ( 6 < shape_type::rank ) { s[7] = s[6] * shape_type::N6 ; }
if ( 7 < shape_type::rank ) { s[8] = s[7] * shape_type::N7 ; }
}
KOKKOS_INLINE_FUNCTION size_type stride_0() const { return 1 ; }
KOKKOS_INLINE_FUNCTION size_type stride_1() const { return S0 ; }
KOKKOS_INLINE_FUNCTION size_type stride_2() const { return S0 * shape_type::N1 ; }
KOKKOS_INLINE_FUNCTION size_type stride_3() const { return S0 * shape_type::N1 * shape_type::N2 ; }
KOKKOS_INLINE_FUNCTION
size_type stride_4() const
{ return S0 * shape_type::N1 * shape_type::N2 * shape_type::N3 ; }
KOKKOS_INLINE_FUNCTION
size_type stride_5() const
{ return S0 * shape_type::N1 * shape_type::N2 * shape_type::N3 * shape_type::N4 ; }
KOKKOS_INLINE_FUNCTION
size_type stride_6() const
{ return S0 * shape_type::N1 * shape_type::N2 * shape_type::N3 * shape_type::N4 * shape_type::N5 ; }
KOKKOS_INLINE_FUNCTION
size_type stride_7() const
{ return S0 * shape_type::N1 * shape_type::N2 * shape_type::N3 * shape_type::N4 * shape_type::N5 * shape_type::N6 ; }
// rank 2
template < typename I0 , typename I1 >
KOKKOS_FORCEINLINE_FUNCTION
size_type operator()( I0 const & i0 , I1 const & i1) const
{ return i0 + S0 * i1 ; }
//rank 3
template <typename I0, typename I1, typename I2>
KOKKOS_FORCEINLINE_FUNCTION
size_type operator()( I0 const& i0, I1 const& i1, I2 const& i2 ) const
{
return i0 + S0 * (
i1 + shape_type::N1 * i2 );
}
//rank 4
template <typename I0, typename I1, typename I2, typename I3>
KOKKOS_FORCEINLINE_FUNCTION
size_type operator()( I0 const& i0, I1 const& i1, I2 const& i2, I3 const& i3 ) const
{
return i0 + S0 * (
i1 + shape_type::N1 * (
i2 + shape_type::N2 * i3 ));
}
//rank 5
template < typename I0, typename I1, typename I2, typename I3
,typename I4 >
KOKKOS_FORCEINLINE_FUNCTION
size_type operator()( I0 const& i0, I1 const& i1, I2 const& i2, I3 const& i3, I4 const& i4 ) const
{
return i0 + S0 * (
i1 + shape_type::N1 * (
i2 + shape_type::N2 * (
i3 + shape_type::N3 * i4 )));
}
//rank 6
template < typename I0, typename I1, typename I2, typename I3
,typename I4, typename I5 >
KOKKOS_FORCEINLINE_FUNCTION
size_type operator()( I0 const& i0, I1 const& i1, I2 const& i2, I3 const& i3, I4 const& i4, I5 const& i5 ) const
{
return i0 + S0 * (
i1 + shape_type::N1 * (
i2 + shape_type::N2 * (
i3 + shape_type::N3 * (
i4 + shape_type::N4 * i5 ))));
}
//rank 7
template < typename I0, typename I1, typename I2, typename I3
,typename I4, typename I5, typename I6 >
KOKKOS_FORCEINLINE_FUNCTION
size_type operator()( I0 const& i0, I1 const& i1, I2 const& i2, I3 const& i3, I4 const& i4, I5 const& i5, I6 const& i6 ) const
{
return i0 + S0 * (
i1 + shape_type::N1 * (
i2 + shape_type::N2 * (
i3 + shape_type::N3 * (
i4 + shape_type::N4 * (
i5 + shape_type::N5 * i6 )))));
}
//rank 8
template < typename I0, typename I1, typename I2, typename I3
,typename I4, typename I5, typename I6, typename I7 >
KOKKOS_FORCEINLINE_FUNCTION
size_type operator()( I0 const& i0, I1 const& i1, I2 const& i2, I3 const& i3, I4 const& i4, I5 const& i5, I6 const& i6, I7 const& i7 ) const
{
return i0 + S0 * (
i1 + shape_type::N1 * (
i2 + shape_type::N2 * (
i3 + shape_type::N3 * (
i4 + shape_type::N4 * (
i5 + shape_type::N5 * (
i6 + shape_type::N6 * i7 ))))));
}
};
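// Illustrative sketch (standalone): set_padding() above rounds the leading
// stride S0 up to a multiple of MEMORY_ALIGNMENT / scalar_size (and only when
// S0 already exceeds MEMORY_ALIGNMENT_THRESHOLD * align, a check omitted
// here). The helper below reproduces the rounding with hypothetical numbers:
// 64-byte alignment and double scalars.
#include <cstddef>
#include <iostream>

std::size_t padded_stride( std::size_t S0 ,
                           std::size_t alignment_bytes ,
                           std::size_t scalar_size )
{
  const std::size_t div = alignment_bytes / scalar_size ;
  const std::size_t mod = alignment_bytes % scalar_size ;
  const std::size_t align = ( 0 == mod ) ? div : 0 ;  // pad only exact multiples
  if ( align ) {
    const std::size_t count_mod = S0 % div ;
    if ( count_mod ) { S0 += align - count_mod ; }
  }
  return S0 ;
}

int main()
{
  // A leading dimension of 13 doubles is padded to 16, i.e. 128 bytes.
  std::cout << padded_stride( 13 , 64 , sizeof(double) ) << std::endl;  // 16
  return 0 ;
}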
//----------------------------------------------------------------------------
// LayoutRight AND ( 1 >= rank OR 1 >= rank_dynamic ) : no padding / striding
template < class ShapeType >
struct ViewOffset< ShapeType , LayoutRight
, typename enable_if<( 1 >= ShapeType::rank
||
1 >= ShapeType::rank_dynamic
)>::type >
: public ShapeType
{
typedef size_t size_type;
typedef ShapeType shape_type;
typedef LayoutRight array_layout ;
enum { has_padding = false };
// This subview must be 0 == rank (all arguments are scalar indices).
// Return whether the subview introduced noncontiguity
template< class S , class L >
KOKKOS_INLINE_FUNCTION
typename Impl::enable_if<( 0 == shape_type::rank &&
Impl::is_same<L,LayoutRight>::value
), bool >::type
assign_subview( const ViewOffset<S,L,void> &
, const size_t n0
, const size_t n1
, const size_t n2
, const size_t n3
, const size_t n4
, const size_t n5
, const size_t n6
, const size_t n7
)
{ return false ; }
// This subview must be 1 == rank and 1 == rank_dynamic
// The source view's last dimension must be non-zero
// Return whether the subview introduced noncontiguity
template< class S , class L >
KOKKOS_INLINE_FUNCTION
typename Impl::enable_if<( 1 == shape_type::rank &&
1 == shape_type::rank_dynamic &&
1 <= S::rank &&
Impl::is_same<L,LayoutRight>::value
), bool >::type
assign_subview( const ViewOffset<S,L,void> &
, const size_t n0
, const size_t n1
, const size_t n2
, const size_t n3
, const size_t n4
, const size_t n5
, const size_t n6
, const size_t n7
)
{
shape_type::N0 = S::rank == 1 ? n0 : (
S::rank == 2 ? n1 : (
S::rank == 3 ? n2 : (
S::rank == 4 ? n3 : (
S::rank == 5 ? n4 : (
S::rank == 6 ? n5 : (
S::rank == 7 ? n6 : n7 ))))));
// should have n0 .. n_(rank-2) equal zero
return false ;
}
template< unsigned R >
KOKKOS_INLINE_FUNCTION
- void assign( unsigned n )
+ void assign( size_t n )
{ assign_shape_dimension<R>( *this , n ); }
KOKKOS_INLINE_FUNCTION
- void assign( unsigned n0 , unsigned n1 , unsigned n2 , unsigned n3
- , unsigned n4 , unsigned n5 , unsigned n6 , unsigned n7
- , unsigned = 0 )
+ void assign( size_t n0 , size_t n1 , size_t n2 , size_t n3
+ , size_t n4 , size_t n5 , size_t n6 , size_t n7
+ , size_t = 0 )
{ shape_type::assign( *this , n0, n1, n2, n3, n4, n5, n6, n7 ); }
template< class ShapeRHS >
KOKKOS_INLINE_FUNCTION
void assign( const ViewOffset< ShapeRHS , LayoutRight > & rhs
, typename enable_if<( int(ShapeRHS::rank) == int(shape_type::rank)
&&
int(ShapeRHS::rank_dynamic) <= int(shape_type::rank_dynamic)
)>::type * = 0 )
{ shape_type::assign( *this , rhs.N0, rhs.N1, rhs.N2, rhs.N3, rhs.N4, rhs.N5, rhs.N6, rhs.N7 ); }
template< class ShapeRHS >
KOKKOS_INLINE_FUNCTION
void assign( const ViewOffset< ShapeRHS , LayoutLeft > & rhs
, typename enable_if<( 1 == int(ShapeRHS::rank)
&&
1 == int(shape_type::rank)
&&
1 == int(shape_type::rank_dynamic)
)>::type * = 0 )
{ shape_type::assign( *this , rhs.N0, rhs.N1, rhs.N2, rhs.N3, rhs.N4, rhs.N5, rhs.N6, rhs.N7 ); }
KOKKOS_INLINE_FUNCTION
void set_padding() {}
KOKKOS_INLINE_FUNCTION
size_type cardinality() const
{ return size_type(shape_type::N0) * shape_type::N1 * shape_type::N2 * shape_type::N3 * shape_type::N4 * shape_type::N5 * shape_type::N6 * shape_type::N7 ; }
KOKKOS_INLINE_FUNCTION
size_type capacity() const
{ return size_type(shape_type::N0) * shape_type::N1 * shape_type::N2 * shape_type::N3 * shape_type::N4 * shape_type::N5 * shape_type::N6 * shape_type::N7 ; }
size_type stride_R() const
{
return size_type(shape_type::N1) * shape_type::N2 * shape_type::N3 *
shape_type::N4 * shape_type::N5 * shape_type::N6 * shape_type::N7 ;
};
// Stride with [rank] as total length
template< typename iType >
KOKKOS_INLINE_FUNCTION
void stride( iType * const s ) const
{
size_type n = 1 ;
if ( 7 < shape_type::rank ) { s[7] = n ; n *= shape_type::N7 ; }
if ( 6 < shape_type::rank ) { s[6] = n ; n *= shape_type::N6 ; }
if ( 5 < shape_type::rank ) { s[5] = n ; n *= shape_type::N5 ; }
if ( 4 < shape_type::rank ) { s[4] = n ; n *= shape_type::N4 ; }
if ( 3 < shape_type::rank ) { s[3] = n ; n *= shape_type::N3 ; }
if ( 2 < shape_type::rank ) { s[2] = n ; n *= shape_type::N2 ; }
if ( 1 < shape_type::rank ) { s[1] = n ; n *= shape_type::N1 ; }
if ( 0 < shape_type::rank ) { s[0] = n ; }
s[shape_type::rank] = n * shape_type::N0 ;
}
KOKKOS_INLINE_FUNCTION
size_type stride_7() const { return 1 ; }
KOKKOS_INLINE_FUNCTION
size_type stride_6() const { return shape_type::N7 ; }
KOKKOS_INLINE_FUNCTION
size_type stride_5() const { return shape_type::N7 * shape_type::N6 ; }
KOKKOS_INLINE_FUNCTION
size_type stride_4() const { return shape_type::N7 * shape_type::N6 * shape_type::N5 ; }
KOKKOS_INLINE_FUNCTION
size_type stride_3() const { return shape_type::N7 * shape_type::N6 * shape_type::N5 * shape_type::N4 ; }
KOKKOS_INLINE_FUNCTION
size_type stride_2() const { return shape_type::N7 * shape_type::N6 * shape_type::N5 * shape_type::N4 * shape_type::N3 ; }
KOKKOS_INLINE_FUNCTION
size_type stride_1() const { return shape_type::N7 * shape_type::N6 * shape_type::N5 * shape_type::N4 * shape_type::N3 * shape_type::N2 ; }
KOKKOS_INLINE_FUNCTION
size_type stride_0() const { return shape_type::N7 * shape_type::N6 * shape_type::N5 * shape_type::N4 * shape_type::N3 * shape_type::N2 * shape_type::N1 ; }
+ // rank 1
+ template <typename I0>
+ KOKKOS_FORCEINLINE_FUNCTION
+ size_type operator()( I0 const& i0) const
+ {
+ return i0 ;
+ }
+
// rank 2
template <typename I0, typename I1>
KOKKOS_FORCEINLINE_FUNCTION
size_type operator()( I0 const& i0, I1 const& i1 ) const
{
return i1 + shape_type::N1 * i0 ;
}
template <typename I0, typename I1, typename I2>
KOKKOS_FORCEINLINE_FUNCTION
size_type operator()( I0 const& i0, I1 const& i1, I2 const& i2 ) const
{
return i2 + shape_type::N2 * (
i1 + shape_type::N1 * ( i0 ));
}
template <typename I0, typename I1, typename I2, typename I3>
KOKKOS_FORCEINLINE_FUNCTION
size_type operator()( I0 const& i0, I1 const& i1, I2 const& i2 , I3 const& i3 ) const
{
return i3 + shape_type::N3 * (
i2 + shape_type::N2 * (
i1 + shape_type::N1 * ( i0 )));
}
template < typename I0, typename I1, typename I2, typename I3
,typename I4 >
KOKKOS_FORCEINLINE_FUNCTION
size_type operator()( I0 const& i0, I1 const& i1, I2 const& i2 , I3 const& i3, I4 const& i4 ) const
{
return i4 + shape_type::N4 * (
i3 + shape_type::N3 * (
i2 + shape_type::N2 * (
i1 + shape_type::N1 * ( i0 ))));
}
template < typename I0, typename I1, typename I2, typename I3
,typename I4, typename I5 >
KOKKOS_FORCEINLINE_FUNCTION
size_type operator()( I0 const& i0, I1 const& i1, I2 const& i2 , I3 const& i3, I4 const& i4, I5 const& i5 ) const
{
return i5 + shape_type::N5 * (
i4 + shape_type::N4 * (
i3 + shape_type::N3 * (
i2 + shape_type::N2 * (
i1 + shape_type::N1 * ( i0 )))));
}
template < typename I0, typename I1, typename I2, typename I3
,typename I4, typename I5, typename I6 >
KOKKOS_FORCEINLINE_FUNCTION
size_type operator()( I0 const& i0, I1 const& i1, I2 const& i2 , I3 const& i3, I4 const& i4, I5 const& i5, I6 const& i6 ) const
{
return i6 + shape_type::N6 * (
i5 + shape_type::N5 * (
i4 + shape_type::N4 * (
i3 + shape_type::N3 * (
i2 + shape_type::N2 * (
i1 + shape_type::N1 * ( i0 ))))));
}
template < typename I0, typename I1, typename I2, typename I3
,typename I4, typename I5, typename I6, typename I7 >
KOKKOS_FORCEINLINE_FUNCTION
size_type operator()( I0 const& i0, I1 const& i1, I2 const& i2 , I3 const& i3, I4 const& i4, I5 const& i5, I6 const& i6, I7 const& i7 ) const
{
return i7 + shape_type::N7 * (
i6 + shape_type::N6 * (
i5 + shape_type::N5 * (
i4 + shape_type::N4 * (
i3 + shape_type::N3 * (
i2 + shape_type::N2 * (
i1 + shape_type::N1 * ( i0 )))))));
}
};
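// Illustrative sketch (standalone): the LayoutRight operator() members above
// nest the indices the other way round, i2 + N2*( i1 + N1*i0 ), matching the
// row-major stride form i2*1 + i1*N2 + i0*(N1*N2) reported by stride().
#include <cassert>
#include <cstddef>

int main()
{
  const std::size_t N0 = 3 , N1 = 4 , N2 = 5 ;

  for ( std::size_t i0 = 0 ; i0 < N0 ; ++i0 )
  for ( std::size_t i1 = 0 ; i1 < N1 ; ++i1 )
  for ( std::size_t i2 = 0 ; i2 < N2 ; ++i2 ) {
    const std::size_t nested  = i2 + N2 * ( i1 + N1 * i0 );
    const std::size_t strided = i2 * 1 + i1 * N2 + i0 * ( N1 * N2 );
    assert( nested == strided );
  }
  return 0 ;
}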
//----------------------------------------------------------------------------
// LayoutRight AND ( 1 < rank AND 1 < rank_dynamic ) : has padding / striding
template < class ShapeType >
struct ViewOffset< ShapeType , LayoutRight
, typename enable_if<( 1 < ShapeType::rank
&&
1 < ShapeType::rank_dynamic
)>::type >
: public ShapeType
{
typedef size_t size_type;
typedef ShapeType shape_type;
typedef LayoutRight array_layout ;
enum { has_padding = true };
size_type SR ;
// This subview must be 2 == rank and 2 == rank_dynamic
// due to only having stride #(rank-1).
// The source dimension #(rank-1) must be non-zero for stride-one leading dimension.
- // If source is rank deficient then set to zero.
- // Return whether the subview introduced noncontiguity
+ // At most one prior dimension can be non-zero.
+ // Return whether the subview introduced noncontiguity.
template< class S , class L >
KOKKOS_INLINE_FUNCTION
typename Impl::enable_if<( 2 == shape_type::rank &&
2 == shape_type::rank_dynamic &&
2 <= S::rank &&
Impl::is_same<L,LayoutRight>::value
), bool >::type
assign_subview( const ViewOffset<S,L,void> & rhs
, const size_t n0
, const size_t n1
, const size_t n2
, const size_t n3
, const size_t n4
, const size_t n5
, const size_t n6
, const size_t n7
)
{
const size_type nR = S::rank == 2 ? n1 : (
S::rank == 3 ? n2 : (
S::rank == 4 ? n3 : (
S::rank == 5 ? n4 : (
S::rank == 6 ? n5 : (
S::rank == 7 ? n6 : n7 )))));
// N0 = first non-zero dimension
// N1 = last non-zero dimension
// SR = stride for the first non-zero dimension
shape_type::N0 = 0 ;
- shape_type::N1 = 0 ;
+ shape_type::N1 = nR ;
SR = 0 ;
- if ( 0 == nR ) {}
- else if ( n0 ) { shape_type::N0 = n0 ; shape_type::N1 = nR ; SR = rhs.stride_0(); }
- else if ( 2 < S::rank && n1 ) { shape_type::N0 = n1 ; shape_type::N1 = nR ; SR = rhs.stride_1(); }
- else if ( 3 < S::rank && n2 ) { shape_type::N0 = n2 ; shape_type::N1 = nR ; SR = rhs.stride_2(); }
- else if ( 4 < S::rank && n3 ) { shape_type::N0 = n3 ; shape_type::N1 = nR ; SR = rhs.stride_3(); }
- else if ( 5 < S::rank && n4 ) { shape_type::N0 = n4 ; shape_type::N1 = nR ; SR = rhs.stride_4(); }
- else if ( 6 < S::rank && n5 ) { shape_type::N0 = n5 ; shape_type::N1 = nR ; SR = rhs.stride_5(); }
- else if ( 7 < S::rank && n6 ) { shape_type::N0 = n6 ; shape_type::N1 = nR ; SR = rhs.stride_6(); }
+ if ( n0 ) { shape_type::N0 = n0 ; SR = rhs.stride_0(); }
+ else if ( 2 < S::rank && n1 ) { shape_type::N0 = n1 ; SR = rhs.stride_1(); }
+ else if ( 3 < S::rank && n2 ) { shape_type::N0 = n2 ; SR = rhs.stride_2(); }
+ else if ( 4 < S::rank && n3 ) { shape_type::N0 = n3 ; SR = rhs.stride_3(); }
+ else if ( 5 < S::rank && n4 ) { shape_type::N0 = n4 ; SR = rhs.stride_4(); }
+ else if ( 6 < S::rank && n5 ) { shape_type::N0 = n5 ; SR = rhs.stride_5(); }
+ else if ( 7 < S::rank && n6 ) { shape_type::N0 = n6 ; SR = rhs.stride_6(); }
// Noncontiguity is introduced if the last dimension changed
// or if a range was taken of a dimension other than the second-to-last.
return 2 == S::rank ? ( size_t(shape_type::N1) != size_t(rhs.N1) || 0 == n0 ) : (
3 == S::rank ? ( size_t(shape_type::N1) != size_t(rhs.N2) || 0 == n1 ) : (
4 == S::rank ? ( size_t(shape_type::N1) != size_t(rhs.N3) || 0 == n2 ) : (
5 == S::rank ? ( size_t(shape_type::N1) != size_t(rhs.N4) || 0 == n3 ) : (
6 == S::rank ? ( size_t(shape_type::N1) != size_t(rhs.N5) || 0 == n4 ) : (
7 == S::rank ? ( size_t(shape_type::N1) != size_t(rhs.N6) || 0 == n5 ) : (
( size_t(shape_type::N1) != size_t(rhs.N7) || 0 == n6 ) ))))));
}
template< unsigned R >
KOKKOS_INLINE_FUNCTION
- void assign( unsigned n )
+ void assign( size_t n )
{ assign_shape_dimension<R>( *this , n ); }
KOKKOS_INLINE_FUNCTION
- void assign( unsigned n0 , unsigned n1 , unsigned n2 , unsigned n3
- , unsigned n4 , unsigned n5 , unsigned n6 , unsigned n7
- , unsigned = 0 )
+ void assign( size_t n0 , size_t n1 , size_t n2 , size_t n3
+ , size_t n4 , size_t n5 , size_t n6 , size_t n7
+ , size_t = 0 )
{
shape_type::assign( *this , n0, n1, n2, n3, n4, n5, n6, n7 );
SR = size_type(shape_type::N1) * shape_type::N2 * shape_type::N3 * shape_type::N4 * shape_type::N5 * shape_type::N6 * shape_type::N7 ;
}
template< class ShapeRHS >
KOKKOS_INLINE_FUNCTION
void assign( const ViewOffset< ShapeRHS , LayoutRight > & rhs
, typename enable_if<( int(ShapeRHS::rank) == int(shape_type::rank)
&&
int(ShapeRHS::rank_dynamic) <= int(shape_type::rank_dynamic)
&&
int(ShapeRHS::rank_dynamic) <= 1
)>::type * = 0 )
{
shape_type::assign( *this , rhs.N0, rhs.N1, rhs.N2, rhs.N3, rhs.N4, rhs.N5, rhs.N6, rhs.N7 );
SR = shape_type::N1 * shape_type::N2 * shape_type::N3 * shape_type::N4 * shape_type::N5 * shape_type::N6 * shape_type::N7 ;
}
template< class ShapeRHS >
KOKKOS_INLINE_FUNCTION
void assign( const ViewOffset< ShapeRHS , LayoutRight > & rhs
, typename enable_if<( int(ShapeRHS::rank) == int(shape_type::rank)
&&
int(ShapeRHS::rank_dynamic) <= int(shape_type::rank_dynamic)
&&
int(ShapeRHS::rank_dynamic) > 1
)>::type * = 0 )
{
shape_type::assign( *this , rhs.N0, rhs.N1, rhs.N2, rhs.N3, rhs.N4, rhs.N5, rhs.N6, rhs.N7 );
SR = rhs.SR ;
}
KOKKOS_INLINE_FUNCTION
void set_padding()
{
enum { div = MEMORY_ALIGNMENT / shape_type::scalar_size };
enum { mod = MEMORY_ALIGNMENT % shape_type::scalar_size };
enum { align = 0 == mod ? div : 0 };
if ( align && MEMORY_ALIGNMENT_THRESHOLD * align < SR ) {
const size_type count_mod = SR % ( div ? div : 1 );
if ( count_mod ) { SR += align - count_mod ; }
}
}
KOKKOS_INLINE_FUNCTION
size_type cardinality() const
{ return size_type(shape_type::N0) * shape_type::N1 * shape_type::N2 * shape_type::N3 * shape_type::N4 * shape_type::N5 * shape_type::N6 * shape_type::N7 ; }
KOKKOS_INLINE_FUNCTION
size_type capacity() const { return shape_type::N0 * SR ; }
template< typename iType >
KOKKOS_INLINE_FUNCTION
void stride( iType * const s ) const
{
size_type n = 1 ;
if ( 7 < shape_type::rank ) { s[7] = n ; n *= shape_type::N7 ; }
if ( 6 < shape_type::rank ) { s[6] = n ; n *= shape_type::N6 ; }
if ( 5 < shape_type::rank ) { s[5] = n ; n *= shape_type::N5 ; }
if ( 4 < shape_type::rank ) { s[4] = n ; n *= shape_type::N4 ; }
if ( 3 < shape_type::rank ) { s[3] = n ; n *= shape_type::N3 ; }
if ( 2 < shape_type::rank ) { s[2] = n ; n *= shape_type::N2 ; }
if ( 1 < shape_type::rank ) { s[1] = n ; n *= shape_type::N1 ; }
if ( 0 < shape_type::rank ) { s[0] = SR ; }
s[shape_type::rank] = SR * shape_type::N0 ;
}
KOKKOS_INLINE_FUNCTION
size_type stride_7() const { return 1 ; }
KOKKOS_INLINE_FUNCTION
size_type stride_6() const { return shape_type::N7 ; }
KOKKOS_INLINE_FUNCTION
size_type stride_5() const { return shape_type::N7 * shape_type::N6 ; }
KOKKOS_INLINE_FUNCTION
size_type stride_4() const { return shape_type::N7 * shape_type::N6 * shape_type::N5 ; }
KOKKOS_INLINE_FUNCTION
size_type stride_3() const { return shape_type::N7 * shape_type::N6 * shape_type::N5 * shape_type::N4 ; }
KOKKOS_INLINE_FUNCTION
size_type stride_2() const { return shape_type::N7 * shape_type::N6 * shape_type::N5 * shape_type::N4 * shape_type::N3 ; }
KOKKOS_INLINE_FUNCTION
size_type stride_1() const { return shape_type::N7 * shape_type::N6 * shape_type::N5 * shape_type::N4 * shape_type::N3 * shape_type::N2 ; }
KOKKOS_INLINE_FUNCTION
size_type stride_0() const { return SR ; }
// rank 2
template <typename I0, typename I1>
KOKKOS_FORCEINLINE_FUNCTION
size_type operator()( I0 const& i0, I1 const& i1 ) const
{
return i1 + i0 * SR ;
}
template <typename I0, typename I1, typename I2>
KOKKOS_FORCEINLINE_FUNCTION
size_type operator()( I0 const& i0, I1 const& i1, I2 const& i2 ) const
{
return i2 + shape_type::N2 * ( i1 ) +
i0 * SR ;
}
template <typename I0, typename I1, typename I2, typename I3>
KOKKOS_FORCEINLINE_FUNCTION
size_type operator()( I0 const& i0, I1 const& i1, I2 const& i2 , I3 const& i3 ) const
{
return i3 + shape_type::N3 * (
i2 + shape_type::N2 * ( i1 )) +
i0 * SR ;
}
template < typename I0, typename I1, typename I2, typename I3
,typename I4 >
KOKKOS_FORCEINLINE_FUNCTION
size_type operator()( I0 const& i0, I1 const& i1, I2 const& i2 , I3 const& i3, I4 const& i4 ) const
{
return i4 + shape_type::N4 * (
i3 + shape_type::N3 * (
i2 + shape_type::N2 * ( i1 ))) +
i0 * SR ;
}
template < typename I0, typename I1, typename I2, typename I3
,typename I4, typename I5 >
KOKKOS_FORCEINLINE_FUNCTION
size_type operator()( I0 const& i0, I1 const& i1, I2 const& i2 , I3 const& i3, I4 const& i4, I5 const& i5 ) const
{
return i5 + shape_type::N5 * (
i4 + shape_type::N4 * (
i3 + shape_type::N3 * (
i2 + shape_type::N2 * ( i1 )))) +
i0 * SR ;
}
template < typename I0, typename I1, typename I2, typename I3
,typename I4, typename I5, typename I6 >
KOKKOS_FORCEINLINE_FUNCTION
size_type operator()( I0 const& i0, I1 const& i1, I2 const& i2 , I3 const& i3, I4 const& i4, I5 const& i5, I6 const& i6 ) const
{
return i6 + shape_type::N6 * (
i5 + shape_type::N5 * (
i4 + shape_type::N4 * (
i3 + shape_type::N3 * (
i2 + shape_type::N2 * ( i1 ))))) +
i0 * SR ;
}
template < typename I0, typename I1, typename I2, typename I3
,typename I4, typename I5, typename I6, typename I7 >
KOKKOS_FORCEINLINE_FUNCTION
size_type operator()( I0 const& i0, I1 const& i1, I2 const& i2 , I3 const& i3, I4 const& i4, I5 const& i5, I6 const& i6, I7 const& i7 ) const
{
return i7 + shape_type::N7 * (
i6 + shape_type::N6 * (
i5 + shape_type::N5 * (
i4 + shape_type::N4 * (
i3 + shape_type::N3 * (
i2 + shape_type::N2 * ( i1 )))))) +
i0 * SR ;
}
};
//----------------------------------------------------------------------------
// LayoutStride :
template < class ShapeType >
struct ViewOffset< ShapeType , LayoutStride
, typename enable_if<( 0 < ShapeType::rank )>::type >
: public ShapeType
{
typedef size_t size_type;
typedef ShapeType shape_type;
typedef LayoutStride array_layout ;
size_type S[ shape_type::rank + 1 ];
template< class SType , class L >
KOKKOS_INLINE_FUNCTION
bool assign_subview( const ViewOffset<SType,L,void> & rhs
, const size_type n0
, const size_type n1
, const size_type n2
, const size_type n3
, const size_type n4
, const size_type n5
, const size_type n6
, const size_type n7
)
{
shape_type::assign( *this, 0,0,0,0, 0,0,0,0 );
for ( int i = 0 ; i < int(shape_type::rank+1) ; ++i ) { S[i] = 0 ; }
// preconditions:
// shape_type::rank <= rhs.rank
// shape_type::rank == count of nonzero( rhs_dim[i] )
size_type dim[8] = { n0 , n1 , n2 , n3 , n4 , n5 , n6 , n7 };
size_type str[ SType::rank + 1 ];
rhs.stride( str );
// contract the zero-dimensions
int r = 0 ;
for ( int i = 0 ; i < int(SType::rank) ; ++i ) {
if ( 0 != dim[i] ) {
dim[r] = dim[i] ;
str[r] = str[i] ;
++r ;
}
}
if ( int(shape_type::rank) == r ) {
// The shape is non-zero
for ( int i = 0 ; i < int(shape_type::rank) ; ++i ) {
const size_type cap = dim[i] * ( S[i] = str[i] );
if ( S[ shape_type::rank ] < cap ) S[ shape_type::rank ] = cap ;
}
// set the contracted nonzero dimensions
shape_type::assign( *this, dim[0], dim[1], dim[2], dim[3], dim[4], dim[5], dim[6], dim[7] );
}
return true ; // definitely noncontiguous
}
template< unsigned R >
KOKKOS_INLINE_FUNCTION
- void assign( unsigned n )
+ void assign( size_t n )
{ assign_shape_dimension<R>( *this , n ); }
template< class ShapeRHS , class Layout >
KOKKOS_INLINE_FUNCTION
void assign( const ViewOffset<ShapeRHS,Layout> & rhs
, typename enable_if<( int(ShapeRHS::rank) == int(shape_type::rank) )>::type * = 0 )
{
rhs.stride(S);
shape_type::assign( *this, rhs.N0, rhs.N1, rhs.N2, rhs.N3, rhs.N4, rhs.N5, rhs.N6, rhs.N7 );
}
KOKKOS_INLINE_FUNCTION
void assign( const LayoutStride & layout )
{
size_type max = 0 ;
for ( int i = 0 ; i < shape_type::rank ; ++i ) {
S[i] = layout.stride[i] ;
const size_type m = layout.dimension[i] * S[i] ;
if ( max < m ) { max = m ; }
}
S[ shape_type::rank ] = max ;
shape_type::assign( *this, layout.dimension[0], layout.dimension[1],
layout.dimension[2], layout.dimension[3],
layout.dimension[4], layout.dimension[5],
layout.dimension[6], layout.dimension[7] );
}
KOKKOS_INLINE_FUNCTION
void assign( size_t s0 , size_t s1 , size_t s2 , size_t s3
, size_t s4 , size_t s5 , size_t s6 , size_t s7
, size_t s8 )
{
const size_t str[9] = { s0, s1, s2, s3, s4, s5, s6, s7, s8 };
// Last argument is the total length.
// Total length must be non-zero.
// All strides must be non-zero and less than total length.
bool ok = 0 < str[ shape_type::rank ] ;
for ( int i = 0 ; ( i < shape_type::rank ) &&
( ok = 0 < str[i] && str[i] < str[ shape_type::rank ] ); ++i );
if ( ok ) {
size_t dim[8] = { 1,1,1,1,1,1,1,1 };
int iorder[9] = { 0,0,0,0,0,0,0,0,0 };
// Ordering of strides smallest to largest.
for ( int i = 1 ; i < shape_type::rank ; ++i ) {
int j = i ;
for ( ; 0 < j && str[i] < str[ iorder[j-1] ] ; --j ) {
iorder[j] = iorder[j-1] ;
}
iorder[j] = i ;
}
// Last argument is the total length.
iorder[ shape_type::rank ] = shape_type::rank ;
// Determine dimension associated with each stride.
// Guarantees non-overlap by truncating dimension
// if ( 0 != str[ iorder[i+1] ] % str[ iorder[i] ] )
for ( int i = 0 ; i < shape_type::rank ; ++i ) {
dim[ iorder[i] ] = str[ iorder[i+1] ] / str[ iorder[i] ] ;
}
// Assign dimensions and strides:
shape_type::assign( *this, dim[0], dim[1], dim[2], dim[3], dim[4], dim[5], dim[6], dim[7] );
for ( int i = 0 ; i <= shape_type::rank ; ++i ) { S[i] = str[i] ; }
}
else {
shape_type::assign(*this,0,0,0,0,0,0,0,0);
for ( int i = 0 ; i <= shape_type::rank ; ++i ) { S[i] = 0 ; }
}
}
KOKKOS_INLINE_FUNCTION
void set_padding() {}
KOKKOS_INLINE_FUNCTION
size_type cardinality() const
{ return shape_type::N0 * shape_type::N1 * shape_type::N2 * shape_type::N3 * shape_type::N4 * shape_type::N5 * shape_type::N6 * shape_type::N7 ; }
KOKKOS_INLINE_FUNCTION
size_type capacity() const { return S[ shape_type::rank ]; }
template< typename iType >
KOKKOS_INLINE_FUNCTION
void stride( iType * const s ) const
{ for ( int i = 0 ; i <= shape_type::rank ; ++i ) { s[i] = S[i] ; } }
KOKKOS_INLINE_FUNCTION
size_type stride_0() const { return S[0] ; }
KOKKOS_INLINE_FUNCTION
size_type stride_1() const { return S[1] ; }
KOKKOS_INLINE_FUNCTION
size_type stride_2() const { return S[2] ; }
KOKKOS_INLINE_FUNCTION
size_type stride_3() const { return S[3] ; }
KOKKOS_INLINE_FUNCTION
size_type stride_4() const { return S[4] ; }
KOKKOS_INLINE_FUNCTION
size_type stride_5() const { return S[5] ; }
KOKKOS_INLINE_FUNCTION
size_type stride_6() const { return S[6] ; }
KOKKOS_INLINE_FUNCTION
size_type stride_7() const { return S[7] ; }
// rank 1
template <typename I0 >
KOKKOS_FORCEINLINE_FUNCTION
- size_type operator()( I0 const& i0 ) const
+ typename std::enable_if< (std::is_integral<I0>::value) && (shape_type::rank==1),size_type>::type
+ operator()( I0 const& i0) const
{
return i0 * S[0] ;
}
// rank 2
template <typename I0, typename I1>
KOKKOS_FORCEINLINE_FUNCTION
- size_type operator()( I0 const& i0, I1 const& i1 ) const
+ typename std::enable_if< (std::is_integral<I0>::value) && (shape_type::rank==2),size_type>::type
+ operator()( I0 const& i0, I1 const& i1 ) const
{
return i0 * S[0] + i1 * S[1] ;
}
template <typename I0, typename I1, typename I2>
KOKKOS_FORCEINLINE_FUNCTION
- size_type operator()( I0 const& i0, I1 const& i1, I2 const& i2 ) const
+ typename std::enable_if< (std::is_integral<I0>::value) && (shape_type::rank==3),size_type>::type
+ operator()( I0 const& i0, I1 const& i1, I2 const& i2 ) const
{
return i0 * S[0] + i1 * S[1] + i2 * S[2] ;
}
template <typename I0, typename I1, typename I2, typename I3>
KOKKOS_FORCEINLINE_FUNCTION
- size_type operator()( I0 const& i0, I1 const& i1, I2 const& i2 , I3 const& i3 ) const
+ typename std::enable_if< (std::is_integral<I0>::value) && (shape_type::rank==4),size_type>::type
+ operator()( I0 const& i0, I1 const& i1, I2 const& i2 , I3 const& i3 ) const
{
return i0 * S[0] + i1 * S[1] + i2 * S[2] + i3 * S[3] ;
}
template < typename I0, typename I1, typename I2, typename I3
,typename I4 >
KOKKOS_FORCEINLINE_FUNCTION
- size_type operator()( I0 const& i0, I1 const& i1, I2 const& i2 , I3 const& i3, I4 const& i4 ) const
+ typename std::enable_if< (std::is_integral<I0>::value) && (shape_type::rank==5),size_type>::type
+ operator()( I0 const& i0, I1 const& i1, I2 const& i2 , I3 const& i3, I4 const& i4 ) const
{
return i0 * S[0] + i1 * S[1] + i2 * S[2] + i3 * S[3] + i4 * S[4] ;
}
template < typename I0, typename I1, typename I2, typename I3
,typename I4, typename I5 >
KOKKOS_FORCEINLINE_FUNCTION
- size_type operator()( I0 const& i0, I1 const& i1, I2 const& i2 , I3 const& i3, I4 const& i4, I5 const& i5 ) const
+ typename std::enable_if< (std::is_integral<I0>::value) && (shape_type::rank==6),size_type>::type
+ operator()( I0 const& i0, I1 const& i1, I2 const& i2 , I3 const& i3, I4 const& i4, I5 const& i5 ) const
{
return i0 * S[0] + i1 * S[1] + i2 * S[2] + i3 * S[3] + i4 * S[4] + i5 * S[5] ;
}
template < typename I0, typename I1, typename I2, typename I3
,typename I4, typename I5, typename I6 >
KOKKOS_FORCEINLINE_FUNCTION
- size_type operator()( I0 const& i0, I1 const& i1, I2 const& i2 , I3 const& i3, I4 const& i4, I5 const& i5, I6 const& i6 ) const
+ typename std::enable_if< (std::is_integral<I0>::value) && (shape_type::rank==7),size_type>::type
+ operator()( I0 const& i0, I1 const& i1, I2 const& i2 , I3 const& i3, I4 const& i4, I5 const& i5, I6 const& i6 ) const
{
return i0 * S[0] + i1 * S[1] + i2 * S[2] + i3 * S[3] + i4 * S[4] + i5 * S[5] + i6 * S[6] ;
}
template < typename I0, typename I1, typename I2, typename I3
,typename I4, typename I5, typename I6, typename I7 >
KOKKOS_FORCEINLINE_FUNCTION
- size_type operator()( I0 const& i0, I1 const& i1, I2 const& i2 , I3 const& i3, I4 const& i4, I5 const& i5, I6 const& i6, I7 const& i7 ) const
+ typename std::enable_if< (std::is_integral<I0>::value) && (shape_type::rank==8),size_type>::type
+ operator()( I0 const& i0, I1 const& i1, I2 const& i2 , I3 const& i3, I4 const& i4, I5 const& i5, I6 const& i6, I7 const& i7 ) const
{
return i0 * S[0] + i1 * S[1] + i2 * S[2] + i3 * S[3] + i4 * S[4] + i5 * S[5] + i6 * S[6] + i7 * S[7] ;
}
};
//----------------------------------------------------------------------------
template< class T >
struct ViewOffsetRange {
enum { OK_integral_type = Impl::StaticAssert< Impl::is_integral<T>::value >::value };
enum { is_range = false };
KOKKOS_INLINE_FUNCTION static
size_t dimension( size_t const , T const & ) { return 0 ; }
KOKKOS_INLINE_FUNCTION static
size_t begin( T const & i ) { return size_t(i) ; }
};
template<>
struct ViewOffsetRange<void> {
enum { is_range = false };
};
template<>
struct ViewOffsetRange< Kokkos::ALL > {
enum { is_range = true };
KOKKOS_INLINE_FUNCTION static
size_t dimension( size_t const n , ALL const & ) { return n ; }
KOKKOS_INLINE_FUNCTION static
size_t begin( ALL const & ) { return 0 ; }
};
template< typename iType >
struct ViewOffsetRange< std::pair<iType,iType> > {
enum { OK_integral_type = Impl::StaticAssert< Impl::is_integral<iType>::value >::value };
enum { is_range = true };
KOKKOS_INLINE_FUNCTION static
size_t dimension( size_t const n , std::pair<iType,iType> const & r )
{ return ( size_t(r.first) < size_t(r.second) && size_t(r.second) <= n ) ? size_t(r.second) - size_t(r.first) : 0 ; }
KOKKOS_INLINE_FUNCTION static
size_t begin( std::pair<iType,iType> const & r ) { return size_t(r.first) ; }
};
template< typename iType >
struct ViewOffsetRange< Kokkos::pair<iType,iType> > {
enum { OK_integral_type = Impl::StaticAssert< Impl::is_integral<iType>::value >::value };
enum { is_range = true };
KOKKOS_INLINE_FUNCTION static
size_t dimension( size_t const n , Kokkos::pair<iType,iType> const & r )
{ return ( size_t(r.first) < size_t(r.second) && size_t(r.second) <= n ) ? size_t(r.second) - size_t(r.first) : 0 ; }
KOKKOS_INLINE_FUNCTION static
size_t begin( Kokkos::pair<iType,iType> const & r ) { return size_t(r.first) ; }
};
}} // namespace Kokkos::Impl
#endif //KOKKOS_VIEWOFFSET_HPP
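The LayoutStride operator() overloads above reduce to a dot product of indices with per-rank strides, while the padded LayoutRight case folds everything except the leading index into the single stride SR. Below is a minimal standalone sketch of the strided rank-2 case (not part of the patch; the StrideOffset2D name and the example extents are made up for illustration):

#include <cassert>
#include <cstddef>

// Hypothetical helper, illustration only: a rank-2, stride-based offset map
// mirroring "i0 * S[0] + i1 * S[1]" from ViewOffset< ... , LayoutStride > above.
struct StrideOffset2D {
  std::size_t N0, N1;   // extents
  std::size_t S[3];     // per-rank strides; S[2] stores the total span ("capacity")

  std::size_t operator()(std::size_t i0, std::size_t i1) const
  { return i0 * S[0] + i1 * S[1]; }

  std::size_t capacity() const { return S[2]; }
};

int main()
{
  // A 4x3 array stored row-major: the first index strides by 3, the second by 1.
  StrideOffset2D off{4, 3, {3, 1, 12}};
  assert(off(0, 0) == 0);
  assert(off(2, 1) == 7);        // 2*3 + 1*1
  assert(off.capacity() == 12);  // max over i of dimension[i] * stride[i]
  return 0;
}

The capacity() value mirrors S[ rank ] in the patch: the largest dimension-times-stride product, i.e. the allocation span rather than the element count.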
diff --git a/lib/kokkos/core/src/impl/Kokkos_ViewSupport.hpp b/lib/kokkos/core/src/impl/Kokkos_ViewSupport.hpp
index fbce4fb17..006b35923 100755
--- a/lib/kokkos/core/src/impl/Kokkos_ViewSupport.hpp
+++ b/lib/kokkos/core/src/impl/Kokkos_ViewSupport.hpp
@@ -1,541 +1,518 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_VIEWSUPPORT_HPP
#define KOKKOS_VIEWSUPPORT_HPP
#include <Kokkos_ExecPolicy.hpp>
#include <impl/Kokkos_Shape.hpp>
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
/** \brief Evaluate if LHS = RHS view assignment is allowed. */
template< class ViewLHS , class ViewRHS >
struct ViewAssignable
{
// Same memory space.
// Same value type.
// Compatible 'const' qualifier
// Cannot assign managed = unmanaged
enum { assignable_value =
( is_same< typename ViewLHS::value_type ,
typename ViewRHS::value_type >::value
||
is_same< typename ViewLHS::value_type ,
typename ViewRHS::const_value_type >::value )
&&
is_same< typename ViewLHS::memory_space ,
typename ViewRHS::memory_space >::value
&&
( ! ( ViewLHS::is_managed && ! ViewRHS::is_managed ) )
};
enum { assignable_shape =
// Compatible shape and matching layout:
( ShapeCompatible< typename ViewLHS::shape_type ,
typename ViewRHS::shape_type >::value
&&
is_same< typename ViewLHS::array_layout ,
typename ViewRHS::array_layout >::value )
||
// Matching layout, same rank, and LHS dynamic rank
( is_same< typename ViewLHS::array_layout ,
typename ViewRHS::array_layout >::value
&&
int(ViewLHS::rank) == int(ViewRHS::rank)
&&
int(ViewLHS::rank) == int(ViewLHS::rank_dynamic) )
||
// Both rank-0, any shape and layout
( int(ViewLHS::rank) == 0 && int(ViewRHS::rank) == 0 )
||
// Both rank-1 and LHS is dynamic rank-1, any shape and layout
( int(ViewLHS::rank) == 1 && int(ViewRHS::rank) == 1 &&
int(ViewLHS::rank_dynamic) == 1 )
};
enum { value = assignable_value && assignable_shape };
};
} // namespace Impl
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
template< class ExecSpace , class Type , bool Initialize >
struct ViewDefaultConstruct
{ ViewDefaultConstruct( Type * , size_t ) {} };
/** \brief ViewDataHandle provides the type of the 'data handle' which the view
* uses to access data with the [] operator. It also provides
* an allocate function and a function to extract a raw ptr from the
* data handle. ViewDataHandle also defines an enum ReferenceAble which
* specifies whether references/pointers to elements can be taken and a
* 'return_type' which is what the view operators will give back.
* Specialisation of this object allows three things depending
* on ViewTraits and compiler options:
* (i) Use special allocator (e.g. huge pages/small pages and pinned memory)
* (ii) Use special data handle type (e.g. add Cuda Texture Object)
* (iii) Use special access intrinsics (e.g. texture fetch and non-caching loads)
*/
template< class StaticViewTraits , class Enable = void >
struct ViewDataHandle {
enum { ReturnTypeIsReference = true };
typedef typename StaticViewTraits::value_type * handle_type;
typedef typename StaticViewTraits::value_type & return_type;
+
+ KOKKOS_INLINE_FUNCTION
+ static handle_type create_handle( typename StaticViewTraits::value_type * arg_data_ptr, AllocationTracker const & /*arg_tracker*/ )
+ {
+ return handle_type(arg_data_ptr);
+ }
};
template< class StaticViewTraits , class Enable = void >
class ViewDataManagement : public ViewDataHandle< StaticViewTraits > {
private:
template< class , class > friend class ViewDataManagement ;
struct PotentiallyManaged {};
struct StaticallyUnmanaged {};
/* Statically unmanaged if dictated by the traits, or if not executing in a host-accessible memory space */
typedef typename
Impl::if_c< StaticViewTraits::is_managed &&
Impl::is_same< Kokkos::HostSpace
, Kokkos::Impl::ActiveExecutionMemorySpace >::value
, PotentiallyManaged
, StaticallyUnmanaged
>::type StaticManagementTag ;
enum { Unmanaged = 0x01
, Noncontiguous = 0x02
};
enum { DefaultTraits = Impl::is_same< StaticManagementTag , StaticallyUnmanaged >::value ? Unmanaged : 0 };
unsigned m_traits ; ///< Runtime traits
template< class T >
inline static
unsigned assign( const ViewDataManagement<T> & rhs , const PotentiallyManaged & )
{ return rhs.m_traits | ( rhs.is_managed() && Kokkos::HostSpace::in_parallel() ? unsigned(Unmanaged) : 0u ); }
template< class T >
KOKKOS_INLINE_FUNCTION static
unsigned assign( const ViewDataManagement<T> & rhs , const StaticallyUnmanaged & )
{ return rhs.m_traits | Unmanaged ; }
- inline
- void increment( const void * ptr , const PotentiallyManaged & ) const
- { if ( is_managed() ) StaticViewTraits::memory_space::increment( ptr ); }
-
- inline
- void decrement( const void * ptr , const PotentiallyManaged & ) const
- { if ( is_managed() ) StaticViewTraits::memory_space::decrement( ptr ); }
-
- KOKKOS_INLINE_FUNCTION
- void increment( const void * , const StaticallyUnmanaged & ) const {}
-
- KOKKOS_INLINE_FUNCTION
- void decrement( const void * , const StaticallyUnmanaged & ) const {}
-
public:
typedef typename ViewDataHandle< StaticViewTraits >::handle_type handle_type;
KOKKOS_INLINE_FUNCTION
ViewDataManagement() : m_traits( DefaultTraits ) {}
KOKKOS_INLINE_FUNCTION
ViewDataManagement( const ViewDataManagement & rhs )
: m_traits( assign( rhs , StaticManagementTag() ) ) {}
KOKKOS_INLINE_FUNCTION
ViewDataManagement & operator = ( const ViewDataManagement & rhs )
{ m_traits = assign( rhs , StaticManagementTag() ); return *this ; }
template< class SVT >
KOKKOS_INLINE_FUNCTION
ViewDataManagement( const ViewDataManagement<SVT> & rhs )
: m_traits( assign( rhs , StaticManagementTag() ) ) {}
template< class SVT >
KOKKOS_INLINE_FUNCTION
ViewDataManagement & operator = ( const ViewDataManagement<SVT> & rhs )
{ m_traits = assign( rhs , StaticManagementTag() ); return *this ; }
KOKKOS_INLINE_FUNCTION
bool is_managed() const { return ! ( m_traits & Unmanaged ); }
KOKKOS_INLINE_FUNCTION
bool is_contiguous() const { return ! ( m_traits & Noncontiguous ); }
KOKKOS_INLINE_FUNCTION
void set_unmanaged() { m_traits |= Unmanaged ; }
KOKKOS_INLINE_FUNCTION
void set_noncontiguous() { m_traits |= Noncontiguous ; }
-
- KOKKOS_INLINE_FUNCTION
- void increment( handle_type handle ) const
- { increment( ( typename StaticViewTraits::value_type *) handle , StaticManagementTag() ); }
-
- KOKKOS_INLINE_FUNCTION
- void decrement( handle_type handle ) const
- { decrement( ( typename StaticViewTraits::value_type *) handle , StaticManagementTag() ); }
-
-
- KOKKOS_INLINE_FUNCTION
- void increment( const void * ptr ) const
- { increment( ptr , StaticManagementTag() ); }
-
- KOKKOS_INLINE_FUNCTION
- void decrement( const void * ptr ) const
- { decrement( ptr , StaticManagementTag() ); }
-
-
template< bool Initialize >
static
- handle_type allocate( const std::string & label
- , const Impl::ViewOffset< typename StaticViewTraits::shape_type
- , typename StaticViewTraits::array_layout > & offset_map )
+ handle_type allocate( const std::string & label
+ , const Impl::ViewOffset< typename StaticViewTraits::shape_type, typename StaticViewTraits::array_layout > & offset_map
+ , AllocationTracker & tracker
+ )
{
typedef typename StaticViewTraits::execution_space execution_space ;
typedef typename StaticViewTraits::memory_space memory_space ;
typedef typename StaticViewTraits::value_type value_type ;
const size_t count = offset_map.capacity();
- value_type * ptr = (value_type*) memory_space::allocate( label , sizeof(value_type) * count );
+ tracker = memory_space::allocate_and_track( label, sizeof(value_type) * count );
- // Default construct within the view's execution space.
+ value_type * ptr = reinterpret_cast<value_type *>(tracker.alloc_ptr());
+
+ // Default construct within the view's execution space.
(void) ViewDefaultConstruct< execution_space , value_type , Initialize >( ptr , count );
- return typename ViewDataHandle< StaticViewTraits >::handle_type(ptr);
+ return ViewDataHandle< StaticViewTraits >::create_handle(ptr, tracker);
}
};
} // namespace Impl
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
template< class OutputView , class InputView , unsigned Rank = OutputView::Rank >
struct ViewRemap
{
typedef typename OutputView::size_type size_type ;
const OutputView output ;
const InputView input ;
const size_type n0 ;
const size_type n1 ;
const size_type n2 ;
const size_type n3 ;
const size_type n4 ;
const size_type n5 ;
const size_type n6 ;
const size_type n7 ;
ViewRemap( const OutputView & arg_out , const InputView & arg_in )
: output( arg_out ), input( arg_in )
, n0( std::min( (size_t)arg_out.dimension_0() , (size_t)arg_in.dimension_0() ) )
, n1( std::min( (size_t)arg_out.dimension_1() , (size_t)arg_in.dimension_1() ) )
, n2( std::min( (size_t)arg_out.dimension_2() , (size_t)arg_in.dimension_2() ) )
, n3( std::min( (size_t)arg_out.dimension_3() , (size_t)arg_in.dimension_3() ) )
, n4( std::min( (size_t)arg_out.dimension_4() , (size_t)arg_in.dimension_4() ) )
, n5( std::min( (size_t)arg_out.dimension_5() , (size_t)arg_in.dimension_5() ) )
, n6( std::min( (size_t)arg_out.dimension_6() , (size_t)arg_in.dimension_6() ) )
, n7( std::min( (size_t)arg_out.dimension_7() , (size_t)arg_in.dimension_7() ) )
{
typedef typename OutputView::execution_space execution_space ;
Kokkos::RangePolicy< execution_space > range( 0 , n0 );
parallel_for( range , *this );
}
KOKKOS_INLINE_FUNCTION
void operator()( const size_type i0 ) const
{
for ( size_type i1 = 0 ; i1 < n1 ; ++i1 ) {
for ( size_type i2 = 0 ; i2 < n2 ; ++i2 ) {
for ( size_type i3 = 0 ; i3 < n3 ; ++i3 ) {
for ( size_type i4 = 0 ; i4 < n4 ; ++i4 ) {
for ( size_type i5 = 0 ; i5 < n5 ; ++i5 ) {
for ( size_type i6 = 0 ; i6 < n6 ; ++i6 ) {
for ( size_type i7 = 0 ; i7 < n7 ; ++i7 ) {
output.at(i0,i1,i2,i3,i4,i5,i6,i7) = input.at(i0,i1,i2,i3,i4,i5,i6,i7);
}}}}}}}
}
};
template< class OutputView , class InputView >
struct ViewRemap< OutputView , InputView , 0 >
{
typedef typename OutputView::value_type value_type ;
typedef typename OutputView::memory_space dst_space ;
typedef typename InputView ::memory_space src_space ;
ViewRemap( const OutputView & arg_out , const InputView & arg_in )
{
DeepCopy< dst_space , src_space >( arg_out.ptr_on_device() ,
arg_in.ptr_on_device() ,
sizeof(value_type) );
}
};
//----------------------------------------------------------------------------
template< class ExecSpace , class Type >
struct ViewDefaultConstruct< ExecSpace , Type , true >
{
Type * const m_ptr ;
- KOKKOS_INLINE_FUNCTION
- void operator()( const typename ExecSpace::size_type i ) const
- { new( m_ptr + i ) Type(); }
+ KOKKOS_FORCEINLINE_FUNCTION
+ void operator()( const typename ExecSpace::size_type& i ) const
+ { m_ptr[i] = Type(); }
ViewDefaultConstruct( Type * pointer , size_t capacity )
: m_ptr( pointer )
{
Kokkos::RangePolicy< ExecSpace > range( 0 , capacity );
parallel_for( range , *this );
ExecSpace::fence();
}
};
-template< class OutputView , unsigned Rank = OutputView::Rank >
+template< class OutputView , unsigned Rank = OutputView::Rank ,
+ class Enabled = void >
struct ViewFill
{
typedef typename OutputView::const_value_type const_value_type ;
typedef typename OutputView::size_type size_type ;
const OutputView output ;
const_value_type input ;
ViewFill( const OutputView & arg_out , const_value_type & arg_in )
: output( arg_out ), input( arg_in )
{
typedef typename OutputView::execution_space execution_space ;
Kokkos::RangePolicy< execution_space > range( 0 , output.dimension_0() );
parallel_for( range , *this );
execution_space::fence();
}
KOKKOS_INLINE_FUNCTION
void operator()( const size_type i0 ) const
{
for ( size_type i1 = 0 ; i1 < output.dimension_1() ; ++i1 ) {
for ( size_type i2 = 0 ; i2 < output.dimension_2() ; ++i2 ) {
for ( size_type i3 = 0 ; i3 < output.dimension_3() ; ++i3 ) {
for ( size_type i4 = 0 ; i4 < output.dimension_4() ; ++i4 ) {
for ( size_type i5 = 0 ; i5 < output.dimension_5() ; ++i5 ) {
for ( size_type i6 = 0 ; i6 < output.dimension_6() ; ++i6 ) {
for ( size_type i7 = 0 ; i7 < output.dimension_7() ; ++i7 ) {
output.at(i0,i1,i2,i3,i4,i5,i6,i7) = input ;
}}}}}}}
}
};
template< class OutputView >
struct ViewFill< OutputView , 0 >
{
typedef typename OutputView::const_value_type const_value_type ;
typedef typename OutputView::memory_space dst_space ;
ViewFill( const OutputView & arg_out , const_value_type & arg_in )
{
DeepCopy< dst_space , dst_space >( arg_out.ptr_on_device() , & arg_in ,
sizeof(const_value_type) );
}
};
} // namespace Impl
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
struct ViewAllocateWithoutInitializing {
const std::string label ;
ViewAllocateWithoutInitializing() : label() {}
ViewAllocateWithoutInitializing( const std::string & arg_label ) : label( arg_label ) {}
ViewAllocateWithoutInitializing( const char * const arg_label ) : label( arg_label ) {}
};
struct ViewAllocate {
const std::string label ;
ViewAllocate() : label() {}
ViewAllocate( const std::string & arg_label ) : label( arg_label ) {}
ViewAllocate( const char * const arg_label ) : label( arg_label ) {}
};
}
namespace Kokkos {
namespace Impl {
template< class Traits , class AllocationProperties , class Enable = void >
struct ViewAllocProp : public Kokkos::Impl::false_type {};
template< class Traits >
struct ViewAllocProp< Traits , Kokkos::ViewAllocate
, typename Kokkos::Impl::enable_if<(
Traits::is_managed && ! Kokkos::Impl::is_const< typename Traits::value_type >::value
)>::type >
: public Kokkos::Impl::true_type
{
typedef size_t size_type ;
typedef const ViewAllocate & property_type ;
enum { Initialize = true };
enum { AllowPadding = false };
inline
static const std::string & label( property_type p ) { return p.label ; }
};
template< class Traits >
struct ViewAllocProp< Traits , std::string
, typename Kokkos::Impl::enable_if<(
Traits::is_managed && ! Kokkos::Impl::is_const< typename Traits::value_type >::value
)>::type >
: public Kokkos::Impl::true_type
{
typedef size_t size_type ;
typedef const std::string & property_type ;
enum { Initialize = true };
enum { AllowPadding = false };
inline
static const std::string & label( property_type s ) { return s ; }
};
template< class Traits , unsigned N >
struct ViewAllocProp< Traits , char[N]
, typename Kokkos::Impl::enable_if<(
Traits::is_managed && ! Kokkos::Impl::is_const< typename Traits::value_type >::value
)>::type >
: public Kokkos::Impl::true_type
{
private:
typedef char label_type[N] ;
public:
typedef size_t size_type ;
typedef const label_type & property_type ;
enum { Initialize = true };
enum { AllowPadding = false };
inline
static std::string label( property_type s ) { return std::string(s) ; }
};
template< class Traits >
struct ViewAllocProp< Traits , Kokkos::ViewAllocateWithoutInitializing
, typename Kokkos::Impl::enable_if<(
Traits::is_managed && ! Kokkos::Impl::is_const< typename Traits::value_type >::value
)>::type >
: public Kokkos::Impl::true_type
{
typedef size_t size_type ;
typedef const Kokkos::ViewAllocateWithoutInitializing & property_type ;
enum { Initialize = false };
enum { AllowPadding = false };
inline
static std::string label( property_type s ) { return s.label ; }
};
} // namespace Impl
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
template< class Traits , class PointerProperties , class Enable = void >
struct ViewRawPointerProp : public Kokkos::Impl::false_type {};
template< class Traits , typename T >
struct ViewRawPointerProp< Traits , T ,
typename Kokkos::Impl::enable_if<(
Impl::is_same< T , typename Traits::value_type >::value ||
Impl::is_same< T , typename Traits::non_const_value_type >::value
)>::type >
: public Kokkos::Impl::true_type
{
- typedef size_t size_type ;
+ typedef size_t size_type ;
};
} // namespace Impl
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
#endif /* #ifndef KOKKOS_VIEWSUPPORT_HPP */
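As a rough illustration of what the ViewRemap functor above does, the sketch below (plain C++, not part of the patch; the extents and buffers are made up) copies only the overlapping window between a destination and a source of different shapes, i.e. min(dst_extent, src_extent) per dimension, exactly how the n0..n7 members are computed in the constructor:

#include <algorithm>
#include <cstdio>
#include <vector>

int main()
{
  const std::size_t dn0 = 4, dn1 = 5;   // destination extents
  const std::size_t sn0 = 3, sn1 = 7;   // source extents
  std::vector<double> dst(dn0 * dn1, 0.0), src(sn0 * sn1, 1.0);

  // Remap copies only the common sub-extent in every dimension.
  const std::size_t n0 = std::min(dn0, sn0);
  const std::size_t n1 = std::min(dn1, sn1);

  for (std::size_t i0 = 0; i0 < n0; ++i0)
    for (std::size_t i1 = 0; i1 < n1; ++i1)
      dst[i0 * dn1 + i1] = src[i0 * sn1 + i1];   // copy the overlapping window only

  std::printf("copied %zu x %zu window\n", n0, n1);
  return 0;
}

In the patch the same loop nest runs to rank 8 and the outermost index is distributed with a parallel_for over a RangePolicy of the output view's execution space.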
diff --git a/lib/kokkos/core/src/impl/Kokkos_ViewTileLeft.hpp b/lib/kokkos/core/src/impl/Kokkos_ViewTileLeft.hpp
index 7a9afc4ee..91d30927a 100755
--- a/lib/kokkos/core/src/impl/Kokkos_ViewTileLeft.hpp
+++ b/lib/kokkos/core/src/impl/Kokkos_ViewTileLeft.hpp
@@ -1,195 +1,195 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_VIEWTILELEFT_HPP
#define KOKKOS_VIEWTILELEFT_HPP
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
template< class T , unsigned N0 , unsigned N1 , class MemorySpace , class MemoryTraits >
struct ViewSpecialize< T , void , LayoutTileLeft<N0,N1> , MemorySpace , MemoryTraits >
{
typedef ViewDefault type ;
};
struct ViewTile {};
template< class ShapeType , unsigned N0 , unsigned N1 >
struct ViewOffset< ShapeType
, LayoutTileLeft<N0,N1,true> /* Only accept properly shaped tiles */
, typename Impl::enable_if<( 2 == ShapeType::rank
&&
2 == ShapeType::rank_dynamic
)>::type >
: public ShapeType
{
enum { SHIFT_0 = Impl::power_of_two<N0>::value };
enum { SHIFT_1 = Impl::power_of_two<N1>::value };
enum { MASK_0 = N0 - 1 };
enum { MASK_1 = N1 - 1 };
typedef size_t size_type ;
typedef ShapeType shape_type ;
typedef LayoutTileLeft<N0,N1,true> array_layout ;
enum { has_padding = true };
size_type tile_N0 ;
KOKKOS_INLINE_FUNCTION
void assign( const ViewOffset & rhs )
{
shape_type::N0 = rhs.N0 ;
shape_type::N1 = rhs.N1 ;
tile_N0 = ( rhs.N0 + MASK_0 ) >> SHIFT_0 ; // number of tiles in first dimension
}
KOKKOS_INLINE_FUNCTION
void assign( size_t n0 , size_t n1
, int = 0 , int = 0
, int = 0 , int = 0
, int = 0 , int = 0
, int = 0
)
{
shape_type::N0 = n0 ;
shape_type::N1 = n1 ;
tile_N0 = ( n0 + MASK_0 ) >> SHIFT_0 ; // number of tiles in first dimension
}
KOKKOS_INLINE_FUNCTION
void set_padding() {}
template< typename I0 , typename I1 >
KOKKOS_INLINE_FUNCTION
size_type operator()( I0 const & i0 , I1 const & i1
, int = 0 , int = 0
, int = 0 , int = 0
, int = 0 , int = 0
) const
{
return /* ( ( Tile offset ) * ( Tile size ) ) */
( ( (i0>>SHIFT_0) + tile_N0 * (i1>>SHIFT_1) ) << (SHIFT_0 + SHIFT_1) ) +
/* ( Offset within tile ) */
( (i0 & MASK_0) + ((i1 & MASK_1)<<SHIFT_0) ) ;
}
template< typename I0 , typename I1 >
KOKKOS_INLINE_FUNCTION
size_type tile_begin( I0 const & i_tile0 , I1 const & i_tile1 ) const
{
return ( i_tile0 + tile_N0 * i_tile1 ) << ( SHIFT_0 + SHIFT_1 );
}
KOKKOS_INLINE_FUNCTION
size_type capacity() const
{
// ( TileDim0 * ( TileDim1 ) ) * TileSize
return ( tile_N0 * ( ( shape_type::N1 + MASK_1 ) >> SHIFT_1 ) ) << ( SHIFT_0 + SHIFT_1 );
}
};
template<>
struct ViewAssignment< ViewTile , void , void >
{
// Some compilers have type-matching issues on the integer values when using:
// template< class T , unsigned N0 , unsigned N1 , class A2 , class A3 >
template< class T , unsigned dN0 , unsigned dN1
, class A2 , class A3
, unsigned sN0 , unsigned sN1 >
KOKKOS_INLINE_FUNCTION
ViewAssignment( View< T[dN0][dN1], LayoutLeft, A2, A3, Impl::ViewDefault > & dst
, View< T** , LayoutTileLeft<sN0,sN1,true>, A2, A3, Impl::ViewDefault > const & src
, size_t const i_tile0
, typename Impl::enable_if< unsigned(dN0) == unsigned(sN0) &&
unsigned(dN1) == unsigned(sN1)
, size_t const
>::type i_tile1
)
{
// Destination is always contiguous but source may be non-contiguous
// so don't assign the whole view management object.
// Just query and appropriately set the reference-count state.
if ( ! src.m_management.is_managed() ) dst.m_management.set_unmanaged();
dst.m_ptr_on_device = src.m_ptr_on_device + src.m_offset_map.tile_begin(i_tile0,i_tile1);
- dst.m_management.increment( dst.m_ptr_on_device );
+ dst.m_tracker = src.m_tracker;
}
};
} /* namespace Impl */
} /* namespace Kokkos */
namespace Kokkos {
template< class T , unsigned N0, unsigned N1, class A2, class A3 >
KOKKOS_INLINE_FUNCTION
View< T[N0][N1], LayoutLeft, A2, A3, Impl::ViewDefault >
tile_subview( const View<T**,LayoutTileLeft<N0,N1,true>,A2,A3,Impl::ViewDefault> & src
, const size_t i_tile0
, const size_t i_tile1
)
{
View< T[N0][N1], LayoutLeft, A2, A3, Impl::ViewDefault > dst ;
(void) Impl::ViewAssignment< Impl::ViewTile , void , void >( dst , src , i_tile0 , i_tile1 );
return dst ;
}
} /* namespace Kokkos */
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
#endif /* #ifndef KOKKOS_VIEWTILELEFT_HPP */
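The tile_subview machinery above relies on the shift/mask arithmetic in the LayoutTileLeft ViewOffset. Here is a standalone sketch of that index computation (not part of the patch; the extents below are made up, and the tile dimensions are assumed to be powers of two, as the layout requires):

#include <cassert>
#include <cstddef>

int main()
{
  // A 7-row array tiled with 4x4 tiles (tile extents are powers of two).
  const std::size_t N0 = 7, T0 = 4, T1 = 4;
  const std::size_t SHIFT_0 = 2, SHIFT_1 = 2;             // log2 of the tile extents
  const std::size_t MASK_0 = T0 - 1, MASK_1 = T1 - 1;
  const std::size_t tile_N0 = (N0 + MASK_0) >> SHIFT_0;   // number of tiles along dimension 0

  auto offset = [&](std::size_t i0, std::size_t i1) {
    const std::size_t tile  = (i0 >> SHIFT_0) + tile_N0 * (i1 >> SHIFT_1); // which tile
    const std::size_t local = (i0 & MASK_0) + ((i1 & MASK_1) << SHIFT_0);  // position within the tile
    return (tile << (SHIFT_0 + SHIFT_1)) + local;          // tile base + intra-tile offset
  };

  assert(offset(0, 0) == 0);
  assert(offset(3, 3) == 15);   // last entry of the first 4x4 tile
  assert(offset(4, 0) == 16);   // first entry of the second tile along dimension 0
  return 0;
}

tile_begin() in the patch is the same expression with the intra-tile part set to zero, which is why the destination of tile_subview can be treated as a dense N0 x N1 LayoutLeft block.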
diff --git a/lib/kokkos/core/src/impl/Kokkos_Volatile_Load.hpp b/lib/kokkos/core/src/impl/Kokkos_Volatile_Load.hpp
index ea349e7ab..420ee6389 100755
--- a/lib/kokkos/core/src/impl/Kokkos_Volatile_Load.hpp
+++ b/lib/kokkos/core/src/impl/Kokkos_Volatile_Load.hpp
@@ -1,242 +1,242 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
#if defined( KOKKOS_ATOMIC_HPP ) && ! defined( KOKKOS_VOLATILE_LOAD )
#define KOKKOS_VOLATILE_LOAD
#if defined( __GNUC__ ) /* GNU C */ || \
defined( __GNUG__ ) /* GNU C++ */ || \
defined( __clang__ )
#define KOKKOS_MAY_ALIAS __attribute__((__may_alias__))
#else
#define KOKKOS_MAY_ALIAS
#endif
namespace Kokkos {
//----------------------------------------------------------------------------
template <typename T>
KOKKOS_FORCEINLINE_FUNCTION
T volatile_load(T const volatile * const src_ptr)
{
typedef uint64_t KOKKOS_MAY_ALIAS T64;
typedef uint32_t KOKKOS_MAY_ALIAS T32;
typedef uint16_t KOKKOS_MAY_ALIAS T16;
typedef uint8_t KOKKOS_MAY_ALIAS T8;
enum {
NUM_8 = sizeof(T),
NUM_16 = NUM_8 / 2,
NUM_32 = NUM_8 / 4,
NUM_64 = NUM_8 / 8
};
union {
T const volatile * const ptr;
T64 const volatile * const ptr64;
T32 const volatile * const ptr32;
T16 const volatile * const ptr16;
T8 const volatile * const ptr8;
} src = {src_ptr};
T result;
union {
T * const ptr;
T64 * const ptr64;
T32 * const ptr32;
T16 * const ptr16;
T8 * const ptr8;
} dst = {&result};
for (int i=0; i < NUM_64; ++i) {
dst.ptr64[i] = src.ptr64[i];
}
if ( NUM_64*2 < NUM_32 ) {
dst.ptr32[NUM_64*2] = src.ptr32[NUM_64*2];
}
if ( NUM_32*2 < NUM_16 ) {
dst.ptr16[NUM_32*2] = src.ptr16[NUM_32*2];
}
if ( NUM_16*2 < NUM_8 ) {
dst.ptr8[NUM_16*2] = src.ptr8[NUM_16*2];
}
return result;
}
template <typename T>
KOKKOS_FORCEINLINE_FUNCTION
void volatile_store(T volatile * const dst_ptr, T const volatile * const src_ptr)
{
typedef uint64_t KOKKOS_MAY_ALIAS T64;
typedef uint32_t KOKKOS_MAY_ALIAS T32;
typedef uint16_t KOKKOS_MAY_ALIAS T16;
typedef uint8_t KOKKOS_MAY_ALIAS T8;
enum {
NUM_8 = sizeof(T),
NUM_16 = NUM_8 / 2,
NUM_32 = NUM_8 / 4,
NUM_64 = NUM_8 / 8
};
union {
T const volatile * const ptr;
T64 const volatile * const ptr64;
T32 const volatile * const ptr32;
T16 const volatile * const ptr16;
T8 const volatile * const ptr8;
} src = {src_ptr};
union {
T volatile * const ptr;
T64 volatile * const ptr64;
T32 volatile * const ptr32;
T16 volatile * const ptr16;
T8 volatile * const ptr8;
} dst = {dst_ptr};
for (int i=0; i < NUM_64; ++i) {
dst.ptr64[i] = src.ptr64[i];
}
if ( NUM_64*2 < NUM_32 ) {
dst.ptr32[NUM_64*2] = src.ptr32[NUM_64*2];
}
if ( NUM_32*2 < NUM_16 ) {
dst.ptr16[NUM_32*2] = src.ptr16[NUM_32*2];
}
if ( NUM_16*2 < NUM_8 ) {
dst.ptr8[NUM_16*2] = src.ptr8[NUM_16*2];
}
}
template <typename T>
KOKKOS_FORCEINLINE_FUNCTION
void volatile_store(T volatile * const dst_ptr, T const * const src_ptr)
{
typedef uint64_t KOKKOS_MAY_ALIAS T64;
typedef uint32_t KOKKOS_MAY_ALIAS T32;
typedef uint16_t KOKKOS_MAY_ALIAS T16;
typedef uint8_t KOKKOS_MAY_ALIAS T8;
enum {
NUM_8 = sizeof(T),
NUM_16 = NUM_8 / 2,
NUM_32 = NUM_8 / 4,
NUM_64 = NUM_8 / 8
};
union {
T const * const ptr;
T64 const * const ptr64;
T32 const * const ptr32;
T16 const * const ptr16;
T8 const * const ptr8;
} src = {src_ptr};
union {
T volatile * const ptr;
T64 volatile * const ptr64;
T32 volatile * const ptr32;
T16 volatile * const ptr16;
T8 volatile * const ptr8;
} dst = {dst_ptr};
for (int i=0; i < NUM_64; ++i) {
dst.ptr64[i] = src.ptr64[i];
}
if ( NUM_64*2 < NUM_32 ) {
dst.ptr32[NUM_64*2] = src.ptr32[NUM_64*2];
}
if ( NUM_32*2 < NUM_16 ) {
dst.ptr16[NUM_32*2] = src.ptr16[NUM_32*2];
}
if ( NUM_16*2 < NUM_8 ) {
dst.ptr8[NUM_16*2] = src.ptr8[NUM_16*2];
}
}
template <typename T>
KOKKOS_FORCEINLINE_FUNCTION
void volatile_store(T volatile * dst_ptr, T const volatile & src)
{ volatile_store(dst_ptr, &src); }
template <typename T>
KOKKOS_FORCEINLINE_FUNCTION
void volatile_store(T volatile * dst_ptr, T const & src)
{ volatile_store(dst_ptr, &src); }
template <typename T>
KOKKOS_FORCEINLINE_FUNCTION
T safe_load(T const * const ptr)
{
#if !defined( __MIC__ )
return *ptr;
#else
return volatile_load(ptr);
#endif
}
} // namespace Kokkos
#undef KOKKOS_MAY_ALIAS
#endif
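To show the intent of volatile_load/volatile_store above: the value is moved through volatile pointers in machine-word chunks so the compiler must actually touch memory on every call rather than caching or eliding the access. A simplified standalone sketch follows (not part of the patch; Payload and load_through_volatile are made-up names, and the original relies on the __may_alias__ attribute to make this kind of type punning safe):

#include <cassert>
#include <cstdint>

struct Payload { std::uint64_t a; std::uint64_t b; };

// Copy a 16-byte value out of volatile storage as two 64-bit word loads.
static Payload load_through_volatile(Payload const volatile* src)
{
  Payload out;
  std::uint64_t volatile const* s = reinterpret_cast<std::uint64_t volatile const*>(src);
  std::uint64_t* d = reinterpret_cast<std::uint64_t*>(&out);
  for (std::size_t i = 0; i < sizeof(Payload) / sizeof(std::uint64_t); ++i) d[i] = s[i];
  return out;
}

int main()
{
  Payload shared = {1u, 2u};
  volatile Payload* slot = &shared;            // stand-in for memory written by another thread
  Payload seen = load_through_volatile(slot);
  assert(seen.a == 1u && seen.b == 2u);
  return 0;
}

The patch generalizes this to arbitrary trivially-copyable sizes by falling back to 32-, 16- and 8-bit tail copies, and safe_load() bypasses the volatile path entirely except when compiling for __MIC__.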
diff --git a/lib/kokkos/core/src/impl/Kokkos_hwloc.cpp b/lib/kokkos/core/src/impl/Kokkos_hwloc.cpp
index f192d4716..1d173fb4f 100755
--- a/lib/kokkos/core/src/impl/Kokkos_hwloc.cpp
+++ b/lib/kokkos/core/src/impl/Kokkos_hwloc.cpp
@@ -1,704 +1,704 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
#define DEBUG_PRINT 0
#include <iostream>
#include <sstream>
#include <Kokkos_Macros.hpp>
#include <Kokkos_hwloc.hpp>
#include <impl/Kokkos_Error.hpp>
/*--------------------------------------------------------------------------*/
/*--------------------------------------------------------------------------*/
namespace Kokkos {
namespace hwloc {
/* Return 0 if asynchronous, 1 if synchronous and include process. */
unsigned thread_mapping( const char * const label ,
const bool allow_async ,
unsigned & thread_count ,
unsigned & use_numa_count ,
unsigned & use_cores_per_numa ,
std::pair<unsigned,unsigned> threads_coord[] )
{
const bool hwloc_avail = Kokkos::hwloc::available();
const unsigned avail_numa_count = hwloc_avail ? hwloc::get_available_numa_count() : 1 ;
const unsigned avail_cores_per_numa = hwloc_avail ? hwloc::get_available_cores_per_numa() : thread_count ;
const unsigned avail_threads_per_core = hwloc_avail ? hwloc::get_available_threads_per_core() : 1 ;
// (numa,core) coordinate of the process:
const std::pair<unsigned,unsigned> proc_coord = Kokkos::hwloc::get_this_thread_coordinate();
//------------------------------------------------------------------------
// Defaults for unspecified inputs:
if ( ! use_numa_count ) {
// Default to use all NUMA regions
use_numa_count = ! thread_count ? avail_numa_count : (
thread_count < avail_numa_count ? thread_count : avail_numa_count );
}
if ( ! use_cores_per_numa ) {
// Default to use all but one core if asynchronous, all cores if synchronous.
const unsigned threads_per_numa = thread_count / use_numa_count ;
use_cores_per_numa = ! threads_per_numa ? avail_cores_per_numa - ( allow_async ? 1 : 0 ) : (
threads_per_numa < avail_cores_per_numa ? threads_per_numa : avail_cores_per_numa );
}
if ( ! thread_count ) {
thread_count = use_numa_count * use_cores_per_numa * avail_threads_per_core ;
}
//------------------------------------------------------------------------
// Input verification:
const bool valid_numa = use_numa_count <= avail_numa_count ;
const bool valid_cores = use_cores_per_numa &&
use_cores_per_numa <= avail_cores_per_numa ;
const bool valid_threads = thread_count &&
thread_count <= use_numa_count * use_cores_per_numa * avail_threads_per_core ;
const bool balanced_numa = ! ( thread_count % use_numa_count );
const bool balanced_cores = ! ( thread_count % ( use_numa_count * use_cores_per_numa ) );
const bool valid_input = valid_numa && valid_cores && valid_threads && balanced_numa && balanced_cores ;
if ( ! valid_input ) {
std::ostringstream msg ;
msg << label << " HWLOC ERROR(s)" ;
if ( ! valid_threads ) {
msg << " : thread_count(" << thread_count
<< ") exceeds capacity("
<< use_numa_count * use_cores_per_numa * avail_threads_per_core
<< ")" ;
}
if ( ! valid_numa ) {
msg << " : use_numa_count(" << use_numa_count
<< ") exceeds capacity(" << avail_numa_count << ")" ;
}
if ( ! valid_cores ) {
msg << " : use_cores_per_numa(" << use_cores_per_numa
<< ") exceeds capacity(" << avail_cores_per_numa << ")" ;
}
if ( ! balanced_numa ) {
msg << " : thread_count(" << thread_count
<< ") imbalanced among numa(" << use_numa_count << ")" ;
}
if ( ! balanced_cores ) {
msg << " : thread_count(" << thread_count
<< ") imbalanced among cores(" << use_numa_count * use_cores_per_numa << ")" ;
}
Kokkos::Impl::throw_runtime_exception( msg.str() );
}
const unsigned thread_spawn_synchronous =
( allow_async &&
1 < thread_count &&
( use_numa_count < avail_numa_count ||
use_cores_per_numa < avail_cores_per_numa ) )
? 0 /* asynchronous */
: 1 /* synchronous, threads_coord[0] is process core */ ;
// Determine binding coordinates for to-be-spawned threads so that
// threads may be bound to cores as they are spawned.
const unsigned threads_per_core = thread_count / ( use_numa_count * use_cores_per_numa );
if ( thread_spawn_synchronous ) {
// Working synchronously and include process core as threads_coord[0].
// Swap the NUMA coordinate of the process core with 0
// Swap the CORE coordinate of the process core with 0
for ( unsigned i = 0 , inuma = avail_numa_count - use_numa_count ; inuma < avail_numa_count ; ++inuma ) {
const unsigned numa_coord = 0 == inuma ? proc_coord.first : ( proc_coord.first == inuma ? 0 : inuma );
for ( unsigned icore = avail_cores_per_numa - use_cores_per_numa ; icore < avail_cores_per_numa ; ++icore ) {
const unsigned core_coord = 0 == icore ? proc_coord.second : ( proc_coord.second == icore ? 0 : icore );
for ( unsigned ith = 0 ; ith < threads_per_core ; ++ith , ++i ) {
threads_coord[i].first = numa_coord ;
threads_coord[i].second = core_coord ;
}
}
}
}
else if ( use_numa_count < avail_numa_count ) {
// Working asynchronously and omit the process' NUMA region from the pool.
// Swap the NUMA coordinate of the process core with ( ( avail_numa_count - use_numa_count ) - 1 )
const unsigned numa_coord_swap = ( avail_numa_count - use_numa_count ) - 1 ;
for ( unsigned i = 0 , inuma = avail_numa_count - use_numa_count ; inuma < avail_numa_count ; ++inuma ) {
const unsigned numa_coord = proc_coord.first == inuma ? numa_coord_swap : inuma ;
for ( unsigned icore = avail_cores_per_numa - use_cores_per_numa ; icore < avail_cores_per_numa ; ++icore ) {
const unsigned core_coord = icore ;
for ( unsigned ith = 0 ; ith < threads_per_core ; ++ith , ++i ) {
threads_coord[i].first = numa_coord ;
threads_coord[i].second = core_coord ;
}
}
}
}
else if ( use_cores_per_numa < avail_cores_per_numa ) {
// Working asynchronously and omit the process' core from the pool.
// Swap the CORE coordinate of the process core with ( ( avail_cores_per_numa - use_cores_per_numa ) - 1 )
const unsigned core_coord_swap = ( avail_cores_per_numa - use_cores_per_numa ) - 1 ;
for ( unsigned i = 0 , inuma = avail_numa_count - use_numa_count ; inuma < avail_numa_count ; ++inuma ) {
const unsigned numa_coord = inuma ;
for ( unsigned icore = avail_cores_per_numa - use_cores_per_numa ; icore < avail_cores_per_numa ; ++icore ) {
const unsigned core_coord = proc_coord.second == icore ? core_coord_swap : icore ;
for ( unsigned ith = 0 ; ith < threads_per_core ; ++ith , ++i ) {
threads_coord[i].first = numa_coord ;
threads_coord[i].second = core_coord ;
}
}
}
}
return thread_spawn_synchronous ;
}
} /* namespace hwloc */
} /* namespace Kokkos */
/*--------------------------------------------------------------------------*/
/*--------------------------------------------------------------------------*/
#if defined( KOKKOS_HAVE_HWLOC )
#include <iostream>
#include <sstream>
#include <stdexcept>
/*--------------------------------------------------------------------------*/
/* Third Party Libraries */
/* Hardware locality library: http://www.open-mpi.org/projects/hwloc/ */
#include <hwloc.h>
#define REQUIRED_HWLOC_API_VERSION 0x000010300
#if HWLOC_API_VERSION < REQUIRED_HWLOC_API_VERSION
#error "Requires http://www.open-mpi.org/projects/hwloc/ Version 1.3 or greater"
#endif
/*--------------------------------------------------------------------------*/
namespace Kokkos {
namespace hwloc {
namespace {
#if DEBUG_PRINT
inline
void print_bitmap( std::ostream & s , const hwloc_const_bitmap_t bitmap )
{
s << "{" ;
for ( int i = hwloc_bitmap_first( bitmap ) ;
-1 != i ; i = hwloc_bitmap_next( bitmap , i ) ) {
s << " " << i ;
}
s << " }" ;
}
#endif
enum { MAX_CORE = 1024 };
std::pair<unsigned,unsigned> s_core_topology(0,0);
unsigned s_core_capacity(0);
hwloc_topology_t s_hwloc_topology(0);
hwloc_bitmap_t s_hwloc_location(0);
hwloc_bitmap_t s_process_binding(0);
hwloc_bitmap_t s_core[ MAX_CORE ];
struct Sentinel {
~Sentinel();
Sentinel();
};
bool sentinel()
{
static Sentinel self ;
if ( 0 == s_hwloc_topology ) {
std::cerr << "Kokkos::hwloc ERROR : Called after return from main()" << std::endl ;
std::cerr.flush();
}
return 0 != s_hwloc_topology ;
}
Sentinel::~Sentinel()
{
hwloc_topology_destroy( s_hwloc_topology );
hwloc_bitmap_free( s_process_binding );
hwloc_bitmap_free( s_hwloc_location );
s_core_topology.first = 0 ;
s_core_topology.second = 0 ;
s_core_capacity = 0 ;
s_hwloc_topology = 0 ;
s_hwloc_location = 0 ;
s_process_binding = 0 ;
}
Sentinel::Sentinel()
{
#if defined(__MIC__)
static const bool remove_core_0 = true ;
#else
static const bool remove_core_0 = false ;
#endif
s_core_topology = std::pair<unsigned,unsigned>(0,0);
s_core_capacity = 0 ;
s_hwloc_topology = 0 ;
s_hwloc_location = 0 ;
s_process_binding = 0 ;
for ( unsigned i = 0 ; i < MAX_CORE ; ++i ) s_core[i] = 0 ;
hwloc_topology_init( & s_hwloc_topology );
hwloc_topology_load( s_hwloc_topology );
s_hwloc_location = hwloc_bitmap_alloc();
s_process_binding = hwloc_bitmap_alloc();
hwloc_get_cpubind( s_hwloc_topology , s_process_binding , HWLOC_CPUBIND_PROCESS );
if ( remove_core_0 ) {
const hwloc_obj_t core = hwloc_get_obj_by_type( s_hwloc_topology , HWLOC_OBJ_CORE , 0 );
if ( hwloc_bitmap_intersects( s_process_binding , core->allowed_cpuset ) ) {
hwloc_bitmap_t s_process_no_core_zero = hwloc_bitmap_alloc();
hwloc_bitmap_andnot( s_process_no_core_zero , s_process_binding , core->allowed_cpuset );
bool ok = 0 == hwloc_set_cpubind( s_hwloc_topology ,
s_process_no_core_zero ,
HWLOC_CPUBIND_PROCESS | HWLOC_CPUBIND_STRICT );
if ( ok ) {
hwloc_get_cpubind( s_hwloc_topology , s_process_binding , HWLOC_CPUBIND_PROCESS );
ok = 0 != hwloc_bitmap_isequal( s_process_binding , s_process_no_core_zero );
}
hwloc_bitmap_free( s_process_no_core_zero );
if ( ! ok ) {
std::cerr << "WARNING: Kokkos::hwloc attempted and failed to move process off of core #0" << std::endl ;
}
}
}
// Choose a hwloc object type for the NUMA level, which may not exist.
hwloc_obj_type_t root_type = HWLOC_OBJ_TYPE_MAX ;
{
// Object types to search, in order.
static const hwloc_obj_type_t candidate_root_type[] =
{ HWLOC_OBJ_NODE /* NUMA region */
, HWLOC_OBJ_SOCKET /* hardware socket */
, HWLOC_OBJ_MACHINE /* local machine */
};
enum { CANDIDATE_ROOT_TYPE_COUNT =
sizeof(candidate_root_type) / sizeof(hwloc_obj_type_t) };
for ( int k = 0 ; k < CANDIDATE_ROOT_TYPE_COUNT && HWLOC_OBJ_TYPE_MAX == root_type ; ++k ) {
if ( 0 < hwloc_get_nbobjs_by_type( s_hwloc_topology , candidate_root_type[k] ) ) {
root_type = candidate_root_type[k] ;
}
}
}
// Determine which of these 'root' types are available to this process.
// The process may have been bound (e.g., by MPI) to a subset of these root types.
// Determine the current location of the master (calling) process.
hwloc_bitmap_t proc_cpuset_location = hwloc_bitmap_alloc();
hwloc_get_last_cpu_location( s_hwloc_topology , proc_cpuset_location , HWLOC_CPUBIND_THREAD );
const unsigned max_root = hwloc_get_nbobjs_by_type( s_hwloc_topology , root_type );
unsigned root_base = max_root ;
unsigned root_count = 0 ;
unsigned core_per_root = 0 ;
unsigned pu_per_core = 0 ;
bool symmetric = true ;
for ( unsigned i = 0 ; i < max_root ; ++i ) {
const hwloc_obj_t root = hwloc_get_obj_by_type( s_hwloc_topology , root_type , i );
if ( hwloc_bitmap_intersects( s_process_binding , root->allowed_cpuset ) ) {
++root_count ;
// Remember which root (NUMA) object the master thread is running on.
// This will be logical NUMA rank #0 for this process.
if ( hwloc_bitmap_intersects( proc_cpuset_location, root->allowed_cpuset ) ) {
root_base = i ;
}
// Count available cores:
const unsigned max_core =
hwloc_get_nbobjs_inside_cpuset_by_type( s_hwloc_topology ,
root->allowed_cpuset ,
HWLOC_OBJ_CORE );
unsigned core_count = 0 ;
for ( unsigned j = 0 ; j < max_core ; ++j ) {
const hwloc_obj_t core =
hwloc_get_obj_inside_cpuset_by_type( s_hwloc_topology ,
root->allowed_cpuset ,
HWLOC_OBJ_CORE , j );
// If process' cpuset intersects core's cpuset then process can access this core.
// Must use intersection instead of inclusion because the Intel-Phi
// MPI may bind the process to only one of the core's hyperthreads.
//
// Assumption: if the process can access any hyperthread of the core
// then it has ownership of the entire core.
// This assumes that it would be performance-detrimental
// to spawn more than one MPI process per core and use nested threading.
if ( hwloc_bitmap_intersects( s_process_binding , core->allowed_cpuset ) ) {
++core_count ;
const unsigned pu_count =
hwloc_get_nbobjs_inside_cpuset_by_type( s_hwloc_topology ,
core->allowed_cpuset ,
HWLOC_OBJ_PU );
if ( pu_per_core == 0 ) pu_per_core = pu_count ;
// Enforce symmetry by taking the minimum:
pu_per_core = std::min( pu_per_core , pu_count );
if ( pu_count != pu_per_core ) symmetric = false ;
}
}
if ( 0 == core_per_root ) core_per_root = core_count ;
// Enforce symmetry by taking the minimum:
core_per_root = std::min( core_per_root , core_count );
if ( core_count != core_per_root ) symmetric = false ;
}
}
s_core_topology.first = root_count ;
s_core_topology.second = core_per_root ;
s_core_capacity = pu_per_core ;
// Fill the 's_core' array for fast mapping from a core coordinate to the
// hwloc cpuset object required for thread location querying and binding.
for ( unsigned i = 0 ; i < max_root ; ++i ) {
const unsigned root_rank = ( i + root_base ) % max_root ;
const hwloc_obj_t root = hwloc_get_obj_by_type( s_hwloc_topology , root_type , root_rank );
if ( hwloc_bitmap_intersects( s_process_binding , root->allowed_cpuset ) ) {
const unsigned max_core =
hwloc_get_nbobjs_inside_cpuset_by_type( s_hwloc_topology ,
root->allowed_cpuset ,
HWLOC_OBJ_CORE );
unsigned core_count = 0 ;
for ( unsigned j = 0 ; j < max_core && core_count < core_per_root ; ++j ) {
const hwloc_obj_t core =
hwloc_get_obj_inside_cpuset_by_type( s_hwloc_topology ,
root->allowed_cpuset ,
HWLOC_OBJ_CORE , j );
if ( hwloc_bitmap_intersects( s_process_binding , core->allowed_cpuset ) ) {
s_core[ core_count + core_per_root * i ] = core->allowed_cpuset ;
++core_count ;
}
}
}
}
hwloc_bitmap_free( proc_cpuset_location );
if ( ! symmetric ) {
std::cout << "Kokkos::hwloc WARNING: Using a symmetric subset of a non-symmetric core topology."
<< std::endl ;
}
}
} // namespace
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
bool available()
{ return true ; }
unsigned get_available_numa_count()
{ sentinel(); return s_core_topology.first ; }
unsigned get_available_cores_per_numa()
{ sentinel(); return s_core_topology.second ; }
unsigned get_available_threads_per_core()
{ sentinel(); return s_core_capacity ; }
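// Illustrative-only sketch (not part of this file): the three queries above
// are commonly combined to compute the maximum number of host threads that
// can be pinned one-per-hardware-thread:
//
//   const unsigned capacity = Kokkos::hwloc::get_available_numa_count()
//                           * Kokkos::hwloc::get_available_cores_per_numa()
//                           * Kokkos::hwloc::get_available_threads_per_core();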
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
unsigned bind_this_thread(
const unsigned coordinate_count ,
std::pair<unsigned,unsigned> coordinate[] )
{
unsigned i = 0 ;
try {
const std::pair<unsigned,unsigned> current = get_this_thread_coordinate();
// Match one of the requests:
for ( i = 0 ; i < coordinate_count && current != coordinate[i] ; ++i );
if ( coordinate_count == i ) {
// Match the first request (typically NUMA):
for ( i = 0 ; i < coordinate_count && current.first != coordinate[i].first ; ++i );
}
if ( coordinate_count == i ) {
// Match any unclaimed request:
for ( i = 0 ; i < coordinate_count && ~0u == coordinate[i].first ; ++i );
}
if ( coordinate_count == i || ! bind_this_thread( coordinate[i] ) ) {
// Failed to bind:
i = ~0u ;
}
if ( i < coordinate_count ) {
#if DEBUG_PRINT
if ( current != coordinate[i] ) {
std::cout << " bind_this_thread: rebinding from ("
<< current.first << ","
<< current.second
<< ") to ("
<< coordinate[i].first << ","
<< coordinate[i].second
<< ")" << std::endl ;
}
#endif
coordinate[i].first = ~0u ;
coordinate[i].second = ~0u ;
}
}
catch( ... ) {
i = ~0u ;
}
return i ;
}
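// Illustration of the matching fallback above (hypothetical values): given
// coordinate[] = { (0,1) , (1,0) , (~0u,~0u) } and a calling thread currently
// located at (1,3), the exact match fails, the NUMA-only match selects entry 1,
// and only if that also failed would the wildcard entry (~0u,~0u) be claimed.
// A return value of ~0u indicates that no entry could be bound.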
bool bind_this_thread( const std::pair<unsigned,unsigned> coord )
{
if ( ! sentinel() ) return false ;
#if DEBUG_PRINT
std::cout << "Kokkos::bind_this_thread() at " ;
hwloc_get_last_cpu_location( s_hwloc_topology ,
s_hwloc_location , HWLOC_CPUBIND_THREAD );
print_bitmap( std::cout , s_hwloc_location );
std::cout << " to " ;
print_bitmap( std::cout , s_core[ coord.second + coord.first * s_core_topology.second ] );
std::cout << std::endl ;
#endif
// As safe and fast as possible.
// Fast-lookup by caching the coordinate -> hwloc cpuset mapping in 's_core'.
return coord.first < s_core_topology.first &&
coord.second < s_core_topology.second &&
0 == hwloc_set_cpubind( s_hwloc_topology ,
s_core[ coord.second + coord.first * s_core_topology.second ] ,
HWLOC_CPUBIND_THREAD | HWLOC_CPUBIND_STRICT );
}
bool unbind_this_thread()
{
if ( ! sentinel() ) return false ;
#define HWLOC_DEBUG_PRINT 0
#if HWLOC_DEBUG_PRINT
std::cout << "Kokkos::unbind_this_thread() from " ;
hwloc_get_cpubind( s_hwloc_topology , s_hwloc_location , HWLOC_CPUBIND_THREAD );
print_bitmap( std::cout , s_hwloc_location );
#endif
const bool result =
s_hwloc_topology &&
0 == hwloc_set_cpubind( s_hwloc_topology ,
s_process_binding ,
HWLOC_CPUBIND_THREAD | HWLOC_CPUBIND_STRICT );
#if HWLOC_DEBUG_PRINT
std::cout << " to " ;
hwloc_get_cpubind( s_hwloc_topology , s_hwloc_location , HWLOC_CPUBIND_THREAD );
print_bitmap( std::cout , s_hwloc_location );
std::cout << std::endl ;
#endif
return result ;
#undef HWLOC_DEBUG_PRINT
}
//----------------------------------------------------------------------------
std::pair<unsigned,unsigned> get_this_thread_coordinate()
{
std::pair<unsigned,unsigned> coord(0u,0u);
if ( ! sentinel() ) return coord ;
const unsigned n = s_core_topology.first * s_core_topology.second ;
// Using the pre-allocated 's_hwloc_location' to avoid memory
// allocation by this thread. This call is NOT thread-safe.
hwloc_get_last_cpu_location( s_hwloc_topology ,
s_hwloc_location , HWLOC_CPUBIND_THREAD );
unsigned i = 0 ;
while ( i < n && ! hwloc_bitmap_intersects( s_hwloc_location , s_core[ i ] ) ) ++i ;
if ( i < n ) {
coord.first = i / s_core_topology.second ;
coord.second = i % s_core_topology.second ;
}
return coord ;
}
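// Minimal round-trip sketch (illustrative only, assumes hwloc is available):
//
//   std::pair<unsigned,unsigned> where = Kokkos::hwloc::get_this_thread_coordinate();
//   if ( Kokkos::hwloc::bind_this_thread( where ) ) {
//     // the calling thread is now pinned to the (NUMA,core) pair it was last
//     // observed on, so a subsequent coordinate query returns the same pair
//   }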
//----------------------------------------------------------------------------
} /* namespace hwloc */
} /* namespace Kokkos */
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
#else /* ! defined( KOKKOS_HAVE_HWLOC ) */
namespace Kokkos {
namespace hwloc {
bool available() { return false ; }
unsigned get_available_numa_count() { return 1 ; }
unsigned get_available_cores_per_numa() { return 1 ; }
unsigned get_available_threads_per_core() { return 1 ; }
unsigned bind_this_thread( const unsigned , std::pair<unsigned,unsigned>[] )
{ return ~0 ; }
bool bind_this_thread( const std::pair<unsigned,unsigned> )
{ return false ; }
bool unbind_this_thread()
{ return true ; }
std::pair<unsigned,unsigned> get_this_thread_coordinate()
{ return std::pair<unsigned,unsigned>(0,0); }
} // namespace hwloc
} // namespace Kokkos
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
#endif
diff --git a/lib/kokkos/core/src/impl/Kokkos_spinwait.cpp b/lib/kokkos/core/src/impl/Kokkos_spinwait.cpp
index 1e9ff91c2..abd845da9 100755
--- a/lib/kokkos/core/src/impl/Kokkos_spinwait.cpp
+++ b/lib/kokkos/core/src/impl/Kokkos_spinwait.cpp
@@ -1,80 +1,82 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
#include <Kokkos_Macros.hpp>
#include <impl/Kokkos_spinwait.hpp>
/*--------------------------------------------------------------------------*/
#if ( KOKKOS_ENABLE_ASM )
#if defined( __arm__ )
/* No-operation instruction to idle the thread. */
#define YIELD asm volatile("nop")
#else
/* Pause instruction to prevent excess processor bus usage */
#define YIELD asm volatile("pause\n":::"memory")
#endif
-#elif defined( KOKKOS_HAVE_WINTHREAD )
+#elif defined ( KOKKOS_HAVE_WINTHREAD )
#include <process.h>
#define YIELD Sleep(0)
+#elif defined ( _WIN32 )
+ #define YIELD __asm__ __volatile__("pause\n":::"memory")
#else
#include <sched.h>
#define YIELD sched_yield()
#endif
/*--------------------------------------------------------------------------*/
namespace Kokkos {
namespace Impl {
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
void spinwait( volatile int & flag , const int value )
{
while ( value == flag ) {
YIELD ;
}
}
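// Typical usage sketch (illustrative only): a waiting thread spins on a shared
// flag until another thread changes it away from the value being waited on.
//
//   volatile int flag = 1 ;
//   // waiting thread:   Kokkos::Impl::spinwait( flag , 1 );  // returns once flag != 1
//   // releasing thread: flag = 0 ;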
#endif
} /* namespace Impl */
} /* namespace Kokkos */
diff --git a/lib/kokkos/core/src/impl/Kokkos_spinwait.hpp b/lib/kokkos/core/src/impl/Kokkos_spinwait.hpp
index 966291abd..cc87771fa 100755
--- a/lib/kokkos/core/src/impl/Kokkos_spinwait.hpp
+++ b/lib/kokkos/core/src/impl/Kokkos_spinwait.hpp
@@ -1,64 +1,64 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_SPINWAIT_HPP
#define KOKKOS_SPINWAIT_HPP
#include <Kokkos_Macros.hpp>
namespace Kokkos {
namespace Impl {
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
void spinwait( volatile int & flag , const int value );
#else
KOKKOS_INLINE_FUNCTION
void spinwait( volatile int & , const int ) {}
#endif
} /* namespace Impl */
} /* namespace Kokkos */
#endif /* #ifndef KOKKOS_SPINWAIT_HPP */
diff --git a/lib/kokkos/core/unit_test/Makefile b/lib/kokkos/core/unit_test/Makefile
new file mode 100755
index 000000000..b2d3d5506
--- /dev/null
+++ b/lib/kokkos/core/unit_test/Makefile
@@ -0,0 +1,146 @@
+KOKKOS_PATH = ../..
+
+GTEST_PATH = ../../TPL/gtest
+
+vpath %.cpp ${KOKKOS_PATH}/core/unit_test
+TEST_HEADERS = $(wildcard $(KOKKOS_PATH)/core/unit_test/*.hpp)
+
+default: build_all
+ echo "End Build"
+
+
+include $(KOKKOS_PATH)/Makefile.kokkos
+
+ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
+ CXX = nvcc_wrapper
+ CXXFLAGS ?= -O3
+ LINK = $(CXX)
+ LDFLAGS ?= -lpthread
+else
+ CXX ?= g++
+ CXXFLAGS ?= -O3
+ LINK ?= $(CXX)
+ LDFLAGS ?= -lpthread
+endif
+
+KOKKOS_CXXFLAGS += -I$(GTEST_PATH) -I${KOKKOS_PATH}/core/unit_test
+
+TEST_TARGETS =
+TARGETS =
+
+ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
+ OBJ_CUDA = TestCuda.o UnitTestMain.o gtest-all.o
+ TARGETS += KokkosCore_UnitTest_Cuda
+ TEST_TARGETS += test-cuda
+endif
+
+ifeq ($(KOKKOS_INTERNAL_USE_PTHREADS), 1)
+ OBJ_THREADS = TestThreads.o UnitTestMain.o gtest-all.o
+ TARGETS += KokkosCore_UnitTest_Threads
+ TEST_TARGETS += test-threads
+endif
+
+ifeq ($(KOKKOS_INTERNAL_USE_OPENMP), 1)
+ OBJ_OPENMP = TestOpenMP.o UnitTestMain.o gtest-all.o
+ TARGETS += KokkosCore_UnitTest_OpenMP
+ TEST_TARGETS += test-openmp
+endif
+
+ifeq ($(KOKKOS_INTERNAL_USE_SERIAL), 1)
+ OBJ_SERIAL = TestSerial.o UnitTestMain.o gtest-all.o
+ TARGETS += KokkosCore_UnitTest_Serial
+ TEST_TARGETS += test-serial
+endif
+
+ifeq ($(KOKKOS_INTERNAL_USE_QTHREAD), 1)
+ OBJ_QTHREAD = TestQthread.o UnitTestMain.o gtest-all.o
+ TARGETS += KokkosCore_UnitTest_Qthread
+ TEST_TARGETS += test-qthread
+endif
+
+OBJ_HWLOC = TestHWLOC.o UnitTestMain.o gtest-all.o
+TARGETS += KokkosCore_UnitTest_HWLOC
+TEST_TARGETS += test-hwloc
+
+OBJ_ALLOCATIONTRACKER = TestAllocationTracker.o UnitTestMain.o gtest-all.o
+TARGETS += KokkosCore_UnitTest_AllocationTracker
+TEST_TARGETS += test-allocationtracker
+
+OBJ_DEFAULT = TestDefaultDeviceType.o UnitTestMain.o gtest-all.o
+TARGETS += KokkosCore_UnitTest_Default
+TEST_TARGETS += test-default
+
+OBJ_DEFAULTINIT = TestDefaultDeviceTypeInit.o UnitTestMain.o gtest-all.o
+TARGETS += KokkosCore_UnitTest_DefaultInit
+TEST_TARGETS += test-default-init
+
+
+KokkosCore_UnitTest_Cuda: $(OBJ_CUDA) $(KOKKOS_LINK_DEPENDS)
+ $(LINK) $(KOKKOS_LDFLAGS) $(LDFLAGS) $(EXTRA_PATH) $(OBJ_CUDA) $(KOKKOS_LIBS) $(LIB) -o KokkosCore_UnitTest_Cuda
+
+KokkosCore_UnitTest_Threads: $(OBJ_THREADS) $(KOKKOS_LINK_DEPENDS)
+ $(LINK) $(KOKKOS_LDFLAGS) $(LDFLAGS) $(EXTRA_PATH) $(OBJ_THREADS) $(KOKKOS_LIBS) $(LIB) -o KokkosCore_UnitTest_Threads
+
+KokkosCore_UnitTest_OpenMP: $(OBJ_OPENMP) $(KOKKOS_LINK_DEPENDS)
+ $(LINK) $(KOKKOS_LDFLAGS) $(LDFLAGS) $(EXTRA_PATH) $(OBJ_OPENMP) $(KOKKOS_LIBS) $(LIB) -o KokkosCore_UnitTest_OpenMP
+
+KokkosCore_UnitTest_Serial: $(OBJ_SERIAL) $(KOKKOS_LINK_DEPENDS)
+ $(LINK) $(KOKKOS_LDFLAGS) $(LDFLAGS) $(EXTRA_PATH) $(OBJ_SERIAL) $(KOKKOS_LIBS) $(LIB) -o KokkosCore_UnitTest_Serial
+
+KokkosCore_UnitTest_Qthread: $(OBJ_QTHREAD) $(KOKKOS_LINK_DEPENDS)
+ $(LINK) $(KOKKOS_LDFLAGS) $(LDFLAGS) $(EXTRA_PATH) $(OBJ_QTHREAD) $(KOKKOS_LIBS) $(LIB) -o KokkosCore_UnitTest_Qthread
+
+KokkosCore_UnitTest_HWLOC: $(OBJ_HWLOC) $(KOKKOS_LINK_DEPENDS)
+ $(LINK) $(KOKKOS_LDFLAGS) $(LDFLAGS) $(EXTRA_PATH) $(OBJ_HWLOC) $(KOKKOS_LIBS) $(LIB) -o KokkosCore_UnitTest_HWLOC
+
+KokkosCore_UnitTest_AllocationTracker: $(OBJ_ALLOCATIONTRACKER) $(KOKKOS_LINK_DEPENDS)
+ $(LINK) $(KOKKOS_LDFLAGS) $(LDFLAGS) $(EXTRA_PATH) $(OBJ_ALLOCATIONTRACKER) $(KOKKOS_LIBS) $(LIB) -o KokkosCore_UnitTest_AllocationTracker
+
+KokkosCore_UnitTest_Default: $(OBJ_DEFAULT) $(KOKKOS_LINK_DEPENDS)
+ $(LINK) $(KOKKOS_LDFLAGS) $(LDFLAGS) $(EXTRA_PATH) $(OBJ_DEFAULT) $(KOKKOS_LIBS) $(LIB) -o KokkosCore_UnitTest_Default
+
+KokkosCore_UnitTest_DefaultInit: $(OBJ_DEFAULTINIT) $(KOKKOS_LINK_DEPENDS)
+ $(LINK) $(KOKKOS_LDFLAGS) $(LDFLAGS) $(EXTRA_PATH) $(OBJ_DEFAULTINIT) $(KOKKOS_LIBS) $(LIB) -o KokkosCore_UnitTest_DefaultInit
+
+test-cuda: KokkosCore_UnitTest_Cuda
+ ./KokkosCore_UnitTest_Cuda
+
+test-threads: KokkosCore_UnitTest_Threads
+ ./KokkosCore_UnitTest_Threads
+
+test-openmp: KokkosCore_UnitTest_OpenMP
+ ./KokkosCore_UnitTest_OpenMP
+
+test-serial: KokkosCore_UnitTest_Serial
+ ./KokkosCore_UnitTest_Serial
+
+test-qthread: KokkosCore_UnitTest_Qthread
+ ./KokkosCore_UnitTest_Qthread
+
+test-hwloc: KokkosCore_UnitTest_HWLOC
+ ./KokkosCore_UnitTest_HWLOC
+
+test-allocationtracker: KokkosCore_UnitTest_AllocationTracker
+ ./KokkosCore_UnitTest_AllocationTracker
+
+test-default: KokkosCore_UnitTest_Default
+ ./KokkosCore_UnitTest_Default
+
+test-default-init: KokkosCore_UnitTest_DefaultInit
+ ./KokkosCore_UnitTest_DefaultInit
+
+build_all: $(TARGETS)
+
+test: $(TEST_TARGETS)
+
+clean: kokkos-clean
+ rm -f *.o $(TARGETS)
+
+# Compilation rules
+
+%.o:%.cpp $(KOKKOS_CPP_DEPENDS) $(TEST_HEADERS)
+ $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) $(EXTRA_INC) -c $<
+
+gtest-all.o:$(GTEST_PATH)/gtest/gtest-all.cc
+ $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) $(EXTRA_INC) -c $(GTEST_PATH)/gtest/gtest-all.cc
+
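+# Example invocation (illustrative only; the device selection variables are
+# interpreted by Makefile.kokkos and depend on the local configuration):
+#
+#   make KOKKOS_DEVICES=OpenMP test-openmp
+#   make test    # run every enabled test target
+#   make clean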
diff --git a/lib/kokkos/core/unit_test/TestAggregate.hpp b/lib/kokkos/core/unit_test/TestAggregate.hpp
new file mode 100755
index 000000000..35e7a8930
--- /dev/null
+++ b/lib/kokkos/core/unit_test/TestAggregate.hpp
@@ -0,0 +1,716 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#ifndef TEST_AGGREGATE_HPP
+#define TEST_AGGREGATE_HPP
+
+#include <gtest/gtest.h>
+
+#include <stdexcept>
+#include <sstream>
+#include <iostream>
+
+/*--------------------------------------------------------------------------*/
+
+namespace Test {
+
+struct EmbedArray {};
+
+struct ArrayProxyContiguous {};
+struct ArrayProxyStrided {};
+
+template< typename T , unsigned N = 0 , class Proxy = void >
+struct Array ;
+
+template< typename T >
+struct Array<T,0,ArrayProxyContiguous>
+{
+public:
+ typedef T value_type ;
+
+ enum { StaticLength = 0 };
+ T * const value ;
+ const unsigned count ;
+
+ KOKKOS_INLINE_FUNCTION
+ Array( T * v , unsigned n ) : value(v), count(n) {}
+
+ template< class Proxy >
+ KOKKOS_INLINE_FUNCTION
+ Array & operator = ( const Array<T,0,Proxy> & rhs ) { return *this ; }
+};
+
+template< typename T , unsigned N >
+struct Array<T,N,ArrayProxyContiguous>
+{
+public:
+ typedef T value_type ;
+
+ enum { StaticLength = N };
+ T * const value ;
+
+ KOKKOS_INLINE_FUNCTION
+ Array( T * v , unsigned ) : value(v) {}
+
+ template< class Proxy >
+ KOKKOS_INLINE_FUNCTION
+ Array & operator = ( const Array<T,N,Proxy> & rhs ) { return *this ; }
+};
+
+template< typename T , unsigned N >
+struct Array<T,N,ArrayProxyStrided>
+{
+public:
+ typedef T value_type ;
+
+ enum { StaticLength = N };
+ T * const value ;
+ const unsigned stride ;
+
+ KOKKOS_INLINE_FUNCTION
+ Array( T * v , unsigned , unsigned s ) : value(v), stride(s) {}
+
+ template< class Proxy >
+ KOKKOS_INLINE_FUNCTION
+ Array & operator = ( const Array<T,N,Proxy> & rhs ) { return *this ; }
+};
+
+template< typename T >
+struct Array<T,0,ArrayProxyStrided>
+{
+public:
+ typedef T value_type ;
+
+ enum { StaticLength = 0 };
+ T * const value ;
+ const unsigned count ;
+ const unsigned stride ;
+
+ KOKKOS_INLINE_FUNCTION
+ Array( T * v , unsigned n , unsigned s ) : value(v), count(n), stride(s) {}
+
+ template< class Proxy >
+ KOKKOS_INLINE_FUNCTION
+ Array & operator = ( const Array<T,0,Proxy> & rhs ) { return *this ; }
+};
+
+template< typename T >
+struct Array<T,0,void>
+{
+public:
+ typedef T value_type ;
+
+ enum { StaticLength = 0 };
+ T * value ;
+ const unsigned count ;
+
+ KOKKOS_INLINE_FUNCTION
+ Array() : value(0) , count(0) {}
+
+ template< unsigned N , class Proxy >
+ KOKKOS_INLINE_FUNCTION
+ Array( const Array<T,N,Proxy> & rhs ) : value(rhs.value), count(N) {}
+};
+
+template< typename T , unsigned N >
+struct Array<T,N,void>
+{
+public:
+ typedef T value_type ;
+
+ enum { StaticLength = N };
+ T value[N] ;
+
+ template< class Proxy >
+ KOKKOS_INLINE_FUNCTION
+ Array & operator = ( const Array<T,N,Proxy> & ) { return *this ; }
+};
+
+} // namespace Test
+
+/*--------------------------------------------------------------------------*/
+/*--------------------------------------------------------------------------*/
+
+#if ! defined( KOKKOS_USING_EXPERIMENTAL_VIEW )
+
+namespace Kokkos {
+namespace Impl {
+
+template< typename T , unsigned N >
+struct AnalyzeShape< Test::Array< T , N > >
+ : public ShapeInsert< typename AnalyzeShape< T >::shape , N >::type
+{
+private:
+ typedef AnalyzeShape< T > nested ;
+public:
+
+ typedef Test::EmbedArray specialize ;
+
+ typedef typename ShapeInsert< typename nested::shape , N >::type shape ;
+
+ typedef typename nested::array_intrinsic_type array_intrinsic_type[ N ];
+ typedef Test::Array< T , N > value_type ;
+ typedef Test::Array< T , N > type ;
+
+ typedef const array_intrinsic_type const_array_intrinsic_type ;
+ typedef const value_type const_value_type ;
+ typedef const type const_type ;
+
+ typedef typename nested::non_const_array_intrinsic_type non_const_array_intrinsic_type[ N ];
+ typedef Test::Array< typename nested::non_const_value_type , N > non_const_value_type ;
+ typedef Test::Array< typename nested::non_const_value_type , N > non_const_type ;
+};
+
+template< typename T >
+struct AnalyzeShape< Test::Array< T , 0 > >
+ : public ShapeInsert< typename AnalyzeShape< T >::shape , 0 >::type
+{
+private:
+ typedef AnalyzeShape< T > nested ;
+public:
+
+ typedef Test::EmbedArray specialize ;
+
+ typedef typename ShapeInsert< typename nested::shape , 0 >::type shape ;
+
+ typedef typename nested::array_intrinsic_type * array_intrinsic_type ;
+ typedef Test::Array< T , 0 > value_type ;
+ typedef Test::Array< T , 0 > type ;
+
+ typedef const array_intrinsic_type const_array_intrinsic_type ;
+ typedef const value_type const_value_type ;
+ typedef const type const_type ;
+
+ typedef typename nested::non_const_array_intrinsic_type * non_const_array_intrinsic_type ;
+ typedef Test::Array< typename nested::non_const_value_type , 0 > non_const_value_type ;
+ typedef Test::Array< typename nested::non_const_value_type , 0 > non_const_type ;
+};
+
+/*--------------------------------------------------------------------------*/
+
+template< class ValueType , class MemorySpace , class MemoryTraits >
+struct ViewSpecialize< ValueType
+ , Test::EmbedArray
+ , LayoutLeft
+ , MemorySpace
+ , MemoryTraits >
+{ typedef Test::EmbedArray type ; };
+
+template< class ValueType , class MemorySpace , class MemoryTraits >
+struct ViewSpecialize< ValueType
+ , Test::EmbedArray
+ , LayoutRight
+ , MemorySpace
+ , MemoryTraits >
+{ typedef Test::EmbedArray type ; };
+
+/*--------------------------------------------------------------------------*/
+
+template<>
+struct ViewAssignment< Test::EmbedArray , Test::EmbedArray , void >
+{
+ //------------------------------------
+ /** \brief Compatible value and shape */
+
+ template< class DT , class DL , class DD , class DM ,
+ class ST , class SL , class SD , class SM >
+ KOKKOS_INLINE_FUNCTION
+ ViewAssignment( View<DT,DL,DD,DM,Test::EmbedArray> & dst
+ , const View<ST,SL,SD,SM,Test::EmbedArray> & src
+ , const typename enable_if<(
+ ViewAssignable< ViewTraits<DT,DL,DD,DM> ,
+ ViewTraits<ST,SL,SD,SM> >::value
+ )>::type * = 0
+ )
+ {
+ dst.m_offset_map.assign( src.m_offset_map );
+
+ dst.m_ptr_on_device = src.m_ptr_on_device ;
+
+ dst.m_tracker = src.m_tracker;
+ }
+};
+
+template<>
+struct ViewAssignment< ViewDefault , Test::EmbedArray , void >
+{
+ //------------------------------------
+ /** \brief Compatible value and shape */
+
+ template< class ST , class SL , class SD , class SM >
+ KOKKOS_INLINE_FUNCTION
+ ViewAssignment( typename View<ST,SL,SD,SM,Test::EmbedArray>::array_type & dst
+ , const View<ST,SL,SD,SM,Test::EmbedArray> & src
+ )
+ {
+ dst.m_offset_map.assign( src.m_offset_map );
+
+ dst.m_ptr_on_device = src.m_ptr_on_device ;
+
+ dst.m_tracker = src.m_tracker;
+ }
+};
+
+
+} // namespace Impl
+} // namespace Kokkos
+
+/*--------------------------------------------------------------------------*/
+/*--------------------------------------------------------------------------*/
+
+namespace Kokkos {
+
+template< class DataType ,
+ class Arg1Type ,
+ class Arg2Type ,
+ class Arg3Type >
+class View< DataType , Arg1Type , Arg2Type , Arg3Type , Test::EmbedArray >
+ : public ViewTraits< DataType , Arg1Type , Arg2Type, Arg3Type >
+{
+public:
+
+ typedef ViewTraits< DataType , Arg1Type , Arg2Type, Arg3Type > traits ;
+
+private:
+
+ // Assignment of compatible views requirement:
+ template< class , class , class , class , class > friend class View ;
+
+ // Assignment of compatible subview requirement:
+ template< class , class , class > friend struct Impl::ViewAssignment ;
+
+ typedef Impl::ViewOffset< typename traits::shape_type ,
+ typename traits::array_layout > offset_map_type ;
+
+ typedef Impl::ViewDataManagement< traits > view_data_management ;
+
+ // traits::value_type = Test::Array< T , N >
+
+ typename traits::value_type::value_type * m_ptr_on_device ;
+ offset_map_type m_offset_map ;
+ view_data_management m_management ;
+ Impl::AllocationTracker m_tracker ;
+
+public:
+
+ typedef View< typename traits::array_intrinsic_type ,
+ typename traits::array_layout ,
+ typename traits::execution_space ,
+ typename traits::memory_traits > array_type ;
+
+ typedef View< typename traits::non_const_data_type ,
+ typename traits::array_layout ,
+ typename traits::execution_space ,
+ typename traits::memory_traits > non_const_type ;
+
+ typedef View< typename traits::const_data_type ,
+ typename traits::array_layout ,
+ typename traits::execution_space ,
+ typename traits::memory_traits > const_type ;
+
+ typedef View< typename traits::non_const_data_type ,
+ typename traits::array_layout ,
+ typename traits::host_mirror_space ,
+ void > HostMirror ;
+
+ //------------------------------------
+ // Shape
+
+ enum { Rank = traits::rank - 1 };
+
+ KOKKOS_INLINE_FUNCTION typename traits::shape_type shape() const { return m_offset_map ; }
+ KOKKOS_INLINE_FUNCTION typename traits::size_type dimension_0() const { return m_offset_map.N0 ; }
+ KOKKOS_INLINE_FUNCTION typename traits::size_type dimension_1() const { return m_offset_map.N1 ; }
+ KOKKOS_INLINE_FUNCTION typename traits::size_type dimension_2() const { return m_offset_map.N2 ; }
+ KOKKOS_INLINE_FUNCTION typename traits::size_type dimension_3() const { return m_offset_map.N3 ; }
+ KOKKOS_INLINE_FUNCTION typename traits::size_type dimension_4() const { return m_offset_map.N4 ; }
+ KOKKOS_INLINE_FUNCTION typename traits::size_type dimension_5() const { return m_offset_map.N5 ; }
+ KOKKOS_INLINE_FUNCTION typename traits::size_type dimension_6() const { return m_offset_map.N6 ; }
+ KOKKOS_INLINE_FUNCTION typename traits::size_type dimension_7() const { return m_offset_map.N7 ; }
+ KOKKOS_INLINE_FUNCTION typename traits::size_type size() const
+ {
+ return m_offset_map.N0
+ * m_offset_map.N1
+ * m_offset_map.N2
+ * m_offset_map.N3
+ * m_offset_map.N4
+ * m_offset_map.N5
+ * m_offset_map.N6
+ * m_offset_map.N7
+ ;
+ }
+
+ template< typename iType >
+ KOKKOS_INLINE_FUNCTION
+ typename traits::size_type dimension( const iType & i ) const
+ { return Impl::dimension( m_offset_map , i ); }
+
+ //------------------------------------
+ // Destructor, constructors, assignment operators:
+
+ KOKKOS_INLINE_FUNCTION
+ ~View() {}
+
+ KOKKOS_INLINE_FUNCTION
+ View()
+ : m_ptr_on_device(0)
+ , m_offset_map()
+ , m_management()
+ , m_tracker()
+    { m_offset_map.assign(0,0,0,0,0,0,0,0); }
+
+ KOKKOS_INLINE_FUNCTION
+ View( const View & rhs )
+ : m_ptr_on_device(0)
+ , m_offset_map()
+ , m_management()
+ , m_tracker()
+ {
+ (void) Impl::ViewAssignment<
+ typename traits::specialize ,
+ typename traits::specialize >( *this , rhs );
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ View & operator = ( const View & rhs )
+ {
+ (void) Impl::ViewAssignment<
+ typename traits::specialize ,
+ typename traits::specialize >( *this , rhs );
+ return *this ;
+ }
+
+ //------------------------------------
+ // Construct or assign compatible view:
+
+ template< class RT , class RL , class RD , class RM , class RS >
+ KOKKOS_INLINE_FUNCTION
+ View( const View<RT,RL,RD,RM,RS> & rhs )
+ : m_ptr_on_device(0)
+ , m_offset_map()
+ , m_management()
+ , m_tracker()
+ {
+ (void) Impl::ViewAssignment<
+ typename traits::specialize , RS >( *this , rhs );
+ }
+
+ template< class RT , class RL , class RD , class RM , class RS >
+ KOKKOS_INLINE_FUNCTION
+ View & operator = ( const View<RT,RL,RD,RM,RS> & rhs )
+ {
+ (void) Impl::ViewAssignment<
+ typename traits::specialize , RS >( *this , rhs );
+ return *this ;
+ }
+
+ //------------------------------------
+ // Allocation of a managed view with possible alignment padding.
+
+ template< class AllocationProperties >
+ explicit inline
+ View( const AllocationProperties & prop ,
+ const typename Impl::ViewAllocProp< traits , AllocationProperties >::size_type n0 = 0 ,
+ const size_t n1 = 0 ,
+ const size_t n2 = 0 ,
+ const size_t n3 = 0 ,
+ const size_t n4 = 0 ,
+ const size_t n5 = 0 ,
+ const size_t n6 = 0 ,
+ const size_t n7 = 0 )
+ : m_ptr_on_device(0)
+ , m_offset_map()
+ , m_management()
+ , m_tracker()
+ {
+ typedef Impl::ViewAllocProp< traits , AllocationProperties > Alloc ;
+
+ typedef typename traits::memory_space memory_space ;
+ typedef typename traits::value_type::value_type scalar_type ;
+
+ m_offset_map.assign( n0, n1, n2, n3, n4, n5, n6, n7 );
+ m_offset_map.set_padding();
+
+ m_tracker = memory_space::allocate_and_track( Alloc::label( prop ), sizeof(scalar_type) * m_offset_map.capacity() );
+
+ m_ptr_on_device = reinterpret_cast<scalar_type *>(m_tracker.alloc_ptr());
+
+ (void) Impl::ViewDefaultConstruct< typename traits::execution_space , scalar_type , Alloc::Initialize >( m_ptr_on_device , m_offset_map.capacity() );
+ }
+
+ //------------------------------------
+ // Assign an unmanaged View from pointer, can be called in functors.
+ // No alignment padding is performed.
+
+ typedef Impl::if_c< ! traits::is_managed ,
+ typename traits::value_type::value_type * ,
+ Impl::ViewError::user_pointer_constructor_requires_unmanaged >
+ if_user_pointer_constructor ;
+
+ View( typename if_user_pointer_constructor::type ptr ,
+ const size_t n0 = 0 ,
+ const size_t n1 = 0 ,
+ const size_t n2 = 0 ,
+ const size_t n3 = 0 ,
+ const size_t n4 = 0 ,
+ const size_t n5 = 0 ,
+ const size_t n6 = 0 ,
+ const size_t n7 = 0 )
+ : m_ptr_on_device(0)
+ , m_offset_map()
+ , m_management()
+ , m_tracker()
+ {
+ m_offset_map.assign( n0, n1, n2, n3, n4, n5, n6, n7 );
+ m_ptr_on_device = if_user_pointer_constructor::select( ptr );
+ m_management.set_unmanaged();
+ }
+
+ //------------------------------------
+ // Assign unmanaged View to portion of Device shared memory
+
+ typedef Impl::if_c< ! traits::is_managed ,
+ typename traits::execution_space ,
+ Impl::ViewError::device_shmem_constructor_requires_unmanaged >
+ if_device_shmem_constructor ;
+
+ explicit KOKKOS_INLINE_FUNCTION
+ View( typename if_device_shmem_constructor::type & dev ,
+ const unsigned n0 = 0 ,
+ const unsigned n1 = 0 ,
+ const unsigned n2 = 0 ,
+ const unsigned n3 = 0 ,
+ const unsigned n4 = 0 ,
+ const unsigned n5 = 0 ,
+ const unsigned n6 = 0 ,
+ const unsigned n7 = 0 )
+ : m_ptr_on_device(0)
+ , m_offset_map()
+ , m_management()
+ , m_tracker()
+ {
+ typedef typename traits::value_type::value_type scalar_type ;
+
+ enum { align = 8 };
+ enum { mask = align - 1 };
+
+ typedef Impl::if_c< ! traits::is_managed ,
+ scalar_type * ,
+ Impl::ViewError::device_shmem_constructor_requires_unmanaged >
+ if_device_shmem_pointer ;
+
+ m_offset_map.assign( n0, n1, n2, n3, n4, n5, n6, n7 );
+
+ // Select the first argument:
+ m_ptr_on_device = if_device_shmem_pointer::select(
+ (scalar_type *) dev.get_shmem( unsigned( sizeof(scalar_type) * m_offset_map.capacity() + unsigned(mask) ) & ~unsigned(mask) ) );
+ }
+
+ static inline
+ unsigned shmem_size( const unsigned n0 = 0 ,
+ const unsigned n1 = 0 ,
+ const unsigned n2 = 0 ,
+ const unsigned n3 = 0 ,
+ const unsigned n4 = 0 ,
+ const unsigned n5 = 0 ,
+ const unsigned n6 = 0 ,
+ const unsigned n7 = 0 )
+ {
+ enum { align = 8 };
+ enum { mask = align - 1 };
+
+ typedef typename traits::value_type::value_type scalar_type ;
+
+ offset_map_type offset_map ;
+
+ offset_map.assign( n0, n1, n2, n3, n4, n5, n6, n7 );
+
+ return unsigned( sizeof(scalar_type) * offset_map.capacity() + unsigned(mask) ) & ~unsigned(mask) ;
+ }
+
+ //------------------------------------
+ // Is not allocated
+
+ KOKKOS_INLINE_FUNCTION
+ bool is_null() const { return 0 == m_ptr_on_device ; }
+
+ //------------------------------------
+ // LayoutLeft, rank 2:
+
+ typedef Test::Array< typename traits::value_type::value_type ,
+ traits::value_type::StaticLength ,
+ Test::ArrayProxyStrided > LeftValue ;
+
+ template< typename iType0 >
+ KOKKOS_INLINE_FUNCTION
+ typename Impl::ViewEnableArrayOper< LeftValue , traits, LayoutLeft, 2, iType0 >::type
+ operator[] ( const iType0 & i0 ) const
+ {
+ KOKKOS_ASSERT_SHAPE_BOUNDS_2( m_offset_map, i0, 0 );
+ KOKKOS_RESTRICT_EXECUTION_TO_DATA( typename traits::memory_space , m_ptr_on_device );
+
+ return LeftValue( m_ptr_on_device + i0 , m_offset_map.N1 , m_offset_map.S0 );
+ }
+
+ template< typename iType0 >
+ KOKKOS_INLINE_FUNCTION
+ typename Impl::ViewEnableArrayOper< LeftValue , traits, LayoutLeft, 2, iType0 >::type
+ operator() ( const iType0 & i0 ) const
+ {
+ KOKKOS_ASSERT_SHAPE_BOUNDS_2( m_offset_map, i0, 0 );
+ KOKKOS_RESTRICT_EXECUTION_TO_DATA( typename traits::memory_space , m_ptr_on_device );
+
+ return LeftValue( m_ptr_on_device + i0 , m_offset_map.N1 , m_offset_map.S0 );
+ }
+
+ template< typename iType0 >
+ KOKKOS_INLINE_FUNCTION
+ typename Impl::ViewEnableArrayOper< LeftValue , traits, LayoutLeft, 2, iType0 >::type
+ at( const iType0 & i0 , const int , const int , const int ,
+ const int , const int , const int , const int ) const
+ {
+ KOKKOS_ASSERT_SHAPE_BOUNDS_2( m_offset_map, i0, 0 );
+ KOKKOS_RESTRICT_EXECUTION_TO_DATA( typename traits::memory_space , m_ptr_on_device );
+
+ return LeftValue( m_ptr_on_device + i0 , m_offset_map.N1 , m_offset_map.S0 );
+ }
+
+ //------------------------------------
+ // LayoutRight, rank 2:
+
+ typedef Test::Array< typename traits::value_type::value_type ,
+ traits::value_type::StaticLength ,
+ Test::ArrayProxyContiguous > RightValue ;
+
+ template< typename iType0 >
+ KOKKOS_INLINE_FUNCTION
+ typename Impl::ViewEnableArrayOper< RightValue , traits, LayoutRight, 2, iType0 >::type
+ operator[] ( const iType0 & i0 ) const
+ {
+ KOKKOS_ASSERT_SHAPE_BOUNDS_2( m_offset_map, i0, 0 );
+ KOKKOS_RESTRICT_EXECUTION_TO_DATA( typename traits::memory_space , m_ptr_on_device );
+
+ return RightValue( m_ptr_on_device + i0 , m_offset_map.N1 );
+ }
+
+ template< typename iType0 >
+ KOKKOS_INLINE_FUNCTION
+ typename Impl::ViewEnableArrayOper< RightValue , traits, LayoutRight, 2, iType0 >::type
+ operator() ( const iType0 & i0 ) const
+ {
+ KOKKOS_ASSERT_SHAPE_BOUNDS_2( m_offset_map, i0, 0 );
+ KOKKOS_RESTRICT_EXECUTION_TO_DATA( typename traits::memory_space , m_ptr_on_device );
+
+ return RightValue( m_ptr_on_device + i0 , m_offset_map.N1 );
+ }
+
+ template< typename iType0 >
+ KOKKOS_INLINE_FUNCTION
+ typename Impl::ViewEnableArrayOper< RightValue , traits, LayoutRight, 2, iType0 >::type
+ at( const iType0 & i0 , const int , const int , const int ,
+ const int , const int , const int , const int ) const
+ {
+ KOKKOS_ASSERT_SHAPE_BOUNDS_2( m_offset_map, i0, 0 );
+ KOKKOS_RESTRICT_EXECUTION_TO_DATA( typename traits::memory_space , m_ptr_on_device );
+
+ return RightValue( m_ptr_on_device + i0 , m_offset_map.N1 );
+ }
+
+ //------------------------------------
+ // Access to the underlying contiguous storage of this view specialization.
+ // These methods are specific to specialization of a view.
+
+ KOKKOS_INLINE_FUNCTION
+ typename traits::value_type::value_type * ptr_on_device() const { return m_ptr_on_device ; }
+
+ // Stride of physical storage, dimensioned to at least Rank
+ template< typename iType >
+ KOKKOS_INLINE_FUNCTION
+ void stride( iType * const s ) const
+ { m_offset_map.stride( s ); }
+
+ // Count of contiguously allocated data members including padding.
+ KOKKOS_INLINE_FUNCTION
+ typename traits::size_type capacity() const
+ { return m_offset_map.capacity(); }
+};
+
+} // namespace Kokkos
+
+#endif /* #if ! defined( KOKKOS_USING_EXPERIMENTAL_VIEW ) */
+
+/*--------------------------------------------------------------------------*/
+/*--------------------------------------------------------------------------*/
+
+namespace Test {
+
+template< class DeviceType >
+int TestViewAggregate()
+{
+
+#if ! defined( KOKKOS_USING_EXPERIMENTAL_VIEW )
+
+ typedef Kokkos::View< Test::Array<double,32> * , DeviceType > a32_type ;
+ typedef typename a32_type::array_type a32_base_type ;
+
+ typedef Kokkos::View< Test::Array<double> * , DeviceType > a0_type ;
+ typedef typename a0_type::array_type a0_base_type ;
+
+ a32_type a32("a32",100);
+ a32_base_type a32_base ;
+
+ a0_type a0("a0",100,32);
+ a0_base_type a0_base ;
+
+ a32_base = a32 ;
+ a0_base = a0 ;
+
+#endif /* #if ! defined( KOKKOS_USING_EXPERIMENTAL_VIEW ) */
+
+ return 0 ;
+}
+
+}
+
+
+#endif /* #ifndef TEST_AGGREGATE_HPP */
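+
+// A device-specific unit-test translation unit would typically instantiate the
+// helper above, for example (hypothetical fixture and device names):
+//
+//   TEST_F( openmp , view_aggregate ) { Test::TestViewAggregate< Kokkos::OpenMP >(); }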
diff --git a/lib/kokkos/core/unit_test/TestAggregateReduction.hpp b/lib/kokkos/core/unit_test/TestAggregateReduction.hpp
new file mode 100755
index 000000000..7175d3434
--- /dev/null
+++ b/lib/kokkos/core/unit_test/TestAggregateReduction.hpp
@@ -0,0 +1,189 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#ifndef TEST_AGGREGATE_REDUCTION_HPP
+#define TEST_AGGREGATE_REDUCTION_HPP
+
+#include <gtest/gtest.h>
+
+#include <stdexcept>
+#include <sstream>
+#include <iostream>
+
+namespace Test {
+
+template< typename T , unsigned N >
+struct StaticArray {
+ T value[N] ;
+
+ KOKKOS_INLINE_FUNCTION
+ StaticArray()
+ { for ( unsigned i = 0 ; i < N ; ++i ) value[i] = T(); }
+
+ KOKKOS_INLINE_FUNCTION
+ StaticArray( const StaticArray & rhs )
+ { for ( unsigned i = 0 ; i < N ; ++i ) value[i] = rhs.value[i]; }
+
+ KOKKOS_INLINE_FUNCTION
+ operator T () { return value[0]; }
+
+ KOKKOS_INLINE_FUNCTION
+ StaticArray & operator = ( const T & rhs )
+ {
+ for ( unsigned i = 0 ; i < N ; ++i ) value[i] = rhs ;
+ return *this ;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ StaticArray & operator = ( const StaticArray & rhs )
+ {
+ for ( unsigned i = 0 ; i < N ; ++i ) value[i] = rhs.value[i] ;
+ return *this ;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ StaticArray operator * ( const StaticArray & rhs )
+ {
+ StaticArray tmp ;
+ for ( unsigned i = 0 ; i < N ; ++i ) tmp.value[i] = value[i] * rhs.value[i] ;
+ return tmp ;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ StaticArray operator + ( const StaticArray & rhs )
+ {
+ StaticArray tmp ;
+ for ( unsigned i = 0 ; i < N ; ++i ) tmp.value[i] = value[i] + rhs.value[i] ;
+ return tmp ;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ StaticArray & operator += ( const StaticArray & rhs )
+ {
+ for ( unsigned i = 0 ; i < N ; ++i ) value[i] += rhs.value[i] ;
+ return *this ;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator += ( const volatile StaticArray & rhs ) volatile
+ {
+ for ( unsigned i = 0 ; i < N ; ++i ) value[i] += rhs.value[i] ;
+ }
+};
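+
+// Note: the volatile overload of operator += above is what allows StaticArray
+// to serve as a custom value_type in Kokkos::parallel_reduce, which joins
+// per-thread partial results through volatile references.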
+
+template< typename T , class Space >
+struct DOT {
+ typedef T value_type ;
+ typedef Space execution_space ;
+
+ Kokkos::View< value_type * , Space > a ;
+ Kokkos::View< value_type * , Space > b ;
+
+ DOT( const Kokkos::View< value_type * , Space > arg_a
+ , const Kokkos::View< value_type * , Space > arg_b
+ )
+ : a( arg_a ), b( arg_b ) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const int i , value_type & update ) const
+ {
+ update += a(i) * b(i);
+ }
+};
+
+template< typename T , class Space >
+struct FILL {
+ typedef T value_type ;
+ typedef Space execution_space ;
+
+ Kokkos::View< value_type * , Space > a ;
+ Kokkos::View< value_type * , Space > b ;
+
+ FILL( const Kokkos::View< value_type * , Space > & arg_a
+ , const Kokkos::View< value_type * , Space > & arg_b
+ )
+ : a( arg_a ), b( arg_b ) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const int i ) const
+ {
+ a(i) = i % 2 ? i + 1 : 1 ;
+ b(i) = i % 2 ? 1 : i + 1 ;
+ }
+};
+
+template< class Space >
+void TestViewAggregateReduction()
+{
+ const int count = 2 ;
+ const long result = count % 2 ? ( count * ( ( count + 1 ) / 2 ) )
+ : ( ( count / 2 ) * ( count + 1 ) );
+
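+  // Worked check of the closed form above: FILL below sets a(i)*b(i) == i+1,
+  // so the expected dot product is 1 + 2 + ... + count == count*(count+1)/2
+  // (the parity split only divides the even factor before multiplying).
+  // For count == 2 the expected result is 3.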
+ Kokkos::View< long * , Space > a("a",count);
+ Kokkos::View< long * , Space > b("b",count);
+ Kokkos::View< StaticArray<long,4> * , Space > a4("a4",count);
+ Kokkos::View< StaticArray<long,4> * , Space > b4("b4",count);
+ Kokkos::View< StaticArray<long,10> * , Space > a10("a10",count);
+ Kokkos::View< StaticArray<long,10> * , Space > b10("b10",count);
+
+ Kokkos::parallel_for( count , FILL<long,Space>(a,b) );
+ Kokkos::parallel_for( count , FILL< StaticArray<long,4> , Space >(a4,b4) );
+ Kokkos::parallel_for( count , FILL< StaticArray<long,10> , Space >(a10,b10) );
+
+ long r = 0;
+ StaticArray<long,4> r4 ;
+ StaticArray<long,10> r10 ;
+
+ Kokkos::parallel_reduce( count , DOT<long,Space>(a,b) , r );
+ Kokkos::parallel_reduce( count , DOT< StaticArray<long,4> , Space >(a4,b4) , r4 );
+ Kokkos::parallel_reduce( count , DOT< StaticArray<long,10> , Space >(a10,b10) , r10 );
+
+ ASSERT_EQ( result , r );
+ for ( int i = 0 ; i < 10 ; ++i ) { ASSERT_EQ( result , r10.value[i] ); }
+ for ( int i = 0 ; i < 4 ; ++i ) { ASSERT_EQ( result , r4.value[i] ); }
+}
+
+}
+
+#endif /* #ifndef TEST_AGGREGATE_REDUCTION_HPP */
+
diff --git a/lib/kokkos/core/unit_test/TestAllocationTracker.cpp b/lib/kokkos/core/unit_test/TestAllocationTracker.cpp
new file mode 100755
index 000000000..371b0ac75
--- /dev/null
+++ b/lib/kokkos/core/unit_test/TestAllocationTracker.cpp
@@ -0,0 +1,145 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <gtest/gtest.h>
+
+#include <iostream>
+#include <vector>
+
+#include <Kokkos_Core.hpp>
+
+#include <impl/Kokkos_AllocationTracker.hpp>
+#include <impl/Kokkos_BasicAllocators.hpp>
+
+namespace Test {
+
+class allocation_tracker : public ::testing::Test {
+protected:
+ static void SetUpTestCase()
+ {
+ Kokkos::initialize();
+ }
+
+ static void TearDownTestCase()
+ {
+ Kokkos::finalize();
+ }
+};
+
+TEST_F( allocation_tracker, simple)
+{
+ using namespace Kokkos::Impl;
+
+ {
+ AllocationTracker tracker;
+ EXPECT_FALSE( tracker.is_valid() );
+ }
+
+ // test ref count and label
+ {
+ int size = 100;
+ std::vector<AllocationTracker> trackers(size);
+
+ trackers[0] = AllocationTracker( MallocAllocator(), 128,"Test");
+
+ for (int i=0; i<size; ++i) {
+ trackers[i] = trackers[0];
+ }
+
+ EXPECT_EQ(100u, trackers[0].ref_count());
+ EXPECT_EQ(std::string("Test"), std::string(trackers[0].label()));
+ }
+
+
+ // test circular list
+ {
+ int num_allocs = 3000;
+ unsigned ref_count = 100;
+
+ std::vector<AllocationTracker> trackers(num_allocs);
+
+ for (int i=0; i<num_allocs; ++i) {
+ trackers[i] = AllocationTracker( MallocAllocator(), 128, "Test");
+ std::vector<AllocationTracker> ref_trackers(ref_count);
+ for (unsigned j=0; j<ref_count; ++j) {
+ ref_trackers[j] = trackers[i];
+ }
+ EXPECT_EQ( ref_count + 1u, trackers[i].ref_count() );
+ }
+
+ for (int i=0; i<num_allocs; ++i) {
+ EXPECT_EQ( 1u, trackers[i].ref_count() );
+ }
+ }
+}
+
+TEST_F( allocation_tracker, force_leaks)
+{
+// uncomment to force memory leaks
+#if 0
+ using namespace Kokkos::Impl;
+ Kokkos::kokkos_malloc("Forced Leak", 4096*10);
+ Kokkos::kokkos_malloc<Kokkos::HostSpace>("Forced Leak", 4096*10);
+#endif
+}
+
+TEST_F( allocation_tracker, disable_reference_counting)
+{
+ using namespace Kokkos::Impl;
+ // test ref count and label
+ {
+ int size = 100;
+ std::vector<AllocationTracker> trackers(size);
+
+ trackers[0] = AllocationTracker( MallocAllocator(), 128,"Test");
+
+ for (int i=1; i<size; ++i) {
+ trackers[i] = CopyWithoutTracking::apply(trackers[0]);
+ }
+
+ EXPECT_EQ(1u, trackers[0].ref_count());
+ EXPECT_EQ(std::string("Test"), std::string(trackers[0].label()));
+ }
+}
+
+} // namespace Test
diff --git a/lib/kokkos/core/unit_test/TestAtomic.hpp b/lib/kokkos/core/unit_test/TestAtomic.hpp
new file mode 100755
index 000000000..d273c287e
--- /dev/null
+++ b/lib/kokkos/core/unit_test/TestAtomic.hpp
@@ -0,0 +1,376 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <Kokkos_Core.hpp>
+
+namespace TestAtomic {
+
+// Struct for testing arbitrary size atomics
+
+template<int N>
+struct SuperScalar {
+ double val[N];
+
+ KOKKOS_INLINE_FUNCTION
+ SuperScalar() {
+ for(int i=0; i<N; i++)
+ val[i] = 0.0;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ SuperScalar(const SuperScalar& src) {
+ for(int i=0; i<N; i++)
+ val[i] = src.val[i];
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ SuperScalar(const volatile SuperScalar& src) {
+ for(int i=0; i<N; i++)
+ val[i] = src.val[i];
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ SuperScalar& operator = (const SuperScalar& src) {
+ for(int i=0; i<N; i++)
+ val[i] = src.val[i];
+ return *this;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ SuperScalar& operator = (const volatile SuperScalar& src) {
+ for(int i=0; i<N; i++)
+ val[i] = src.val[i];
+ return *this;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ volatile SuperScalar& operator = (const SuperScalar& src) volatile {
+ for(int i=0; i<N; i++)
+ val[i] = src.val[i];
+ return *this;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ SuperScalar operator + (const SuperScalar& src) {
+ SuperScalar tmp = *this;
+ for(int i=0; i<N; i++)
+ tmp.val[i] += src.val[i];
+ return tmp;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ SuperScalar& operator += (const double& src) {
+ for(int i=0; i<N; i++)
+ val[i] += 1.0*(i+1)*src;
+ return *this;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ SuperScalar& operator += (const SuperScalar& src) {
+ for(int i=0; i<N; i++)
+ val[i] += src.val[i];
+ return *this;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ bool operator == (const SuperScalar& src) {
+ bool compare = true;
+ for(int i=0; i<N; i++)
+ compare = compare && ( val[i] == src.val[i]);
+ return compare;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ bool operator != (const SuperScalar& src) {
+ bool compare = true;
+ for(int i=0; i<N; i++)
+ compare = compare && ( val[i] == src.val[i]);
+ return !compare;
+ }
+
+
+
+ KOKKOS_INLINE_FUNCTION
+ SuperScalar(const double& src) {
+ for(int i=0; i<N; i++)
+ val[i] = 1.0 * (i+1) * src;
+ }
+
+};
+
+template<int N>
+std::ostream& operator<<(std::ostream& os, const SuperScalar<N>& dt)
+{
+ os << "{ ";
+ for(int i=0;i<N-1;i++)
+ os << dt.val[i] << ", ";
+ os << dt.val[N-1] << "}";
+ return os;
+}
+
+template<class T,class DEVICE_TYPE>
+struct ZeroFunctor {
+ typedef DEVICE_TYPE execution_space;
+ typedef typename Kokkos::View<T,execution_space> type;
+ typedef typename Kokkos::View<T,execution_space>::HostMirror h_type;
+ type data;
+ KOKKOS_INLINE_FUNCTION
+ void operator()(int) const {
+ data() = 0;
+ }
+};
+
+//---------------------------------------------------
+//--------------atomic_fetch_add---------------------
+//---------------------------------------------------
+
+template<class T,class DEVICE_TYPE>
+struct AddFunctor{
+ typedef DEVICE_TYPE execution_space;
+ typedef Kokkos::View<T,execution_space> type;
+ type data;
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()(int) const {
+ Kokkos::atomic_fetch_add(&data(),(T)1);
+ }
+};
+
+template<class T, class execution_space >
+T AddLoop(int loop) {
+ struct ZeroFunctor<T,execution_space> f_zero;
+ typename ZeroFunctor<T,execution_space>::type data("Data");
+ typename ZeroFunctor<T,execution_space>::h_type h_data("HData");
+ f_zero.data = data;
+ Kokkos::parallel_for(1,f_zero);
+ execution_space::fence();
+
+ struct AddFunctor<T,execution_space> f_add;
+ f_add.data = data;
+ Kokkos::parallel_for(loop,f_add);
+ execution_space::fence();
+
+ Kokkos::deep_copy(h_data,data);
+ T val = h_data();
+ return val;
+}
+
+template<class T>
+T AddLoopSerial(int loop) {
+ T* data = new T[1];
+ data[0] = 0;
+
+ for(int i=0;i<loop;i++)
+ *data+=(T)1;
+
+ T val = *data;
+ delete [] data;
+ return val;
+}
+
+template<class T,class DEVICE_TYPE>
+struct CASFunctor{
+ typedef DEVICE_TYPE execution_space;
+ typedef Kokkos::View<T,execution_space> type;
+ type data;
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()(int) const {
+ T old = data();
+ T newval, assumed;
+ do {
+ assumed = old;
+ newval = assumed + (T)1;
+ old = Kokkos::atomic_compare_exchange(&data(), assumed, newval);
+ }
+ while( old != assumed );
+ }
+};
+
+template<class T, class execution_space >
+T CASLoop(int loop) {
+ struct ZeroFunctor<T,execution_space> f_zero;
+ typename ZeroFunctor<T,execution_space>::type data("Data");
+ typename ZeroFunctor<T,execution_space>::h_type h_data("HData");
+ f_zero.data = data;
+ Kokkos::parallel_for(1,f_zero);
+ execution_space::fence();
+
+ struct CASFunctor<T,execution_space> f_cas;
+ f_cas.data = data;
+ Kokkos::parallel_for(loop,f_cas);
+ execution_space::fence();
+
+ Kokkos::deep_copy(h_data,data);
+ T val = h_data();
+
+ return val;
+}
+
+template<class T>
+T CASLoopSerial(int loop) {
+ T* data = new T[1];
+ data[0] = 0;
+
+ for(int i=0;i<loop;i++) {
+ T assumed;
+ T newval;
+ T old;
+ do {
+ assumed = *data;
+ newval = assumed + (T)1;
+ old = *data;
+ *data = newval;
+ }
+ while(!(assumed==old));
+ }
+
+ T val = *data;
+ delete [] data;
+ return val;
+}
+
+template<class T,class DEVICE_TYPE>
+struct ExchFunctor{
+ typedef DEVICE_TYPE execution_space;
+ typedef Kokkos::View<T,execution_space> type;
+ type data, data2;
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()(int i) const {
+ T old = Kokkos::atomic_exchange(&data(),(T)i);
+ Kokkos::atomic_fetch_add(&data2(),old);
+ }
+};
+
+template<class T, class execution_space >
+T ExchLoop(int loop) {
+ struct ZeroFunctor<T,execution_space> f_zero;
+ typename ZeroFunctor<T,execution_space>::type data("Data");
+ typename ZeroFunctor<T,execution_space>::h_type h_data("HData");
+ f_zero.data = data;
+ Kokkos::parallel_for(1,f_zero);
+ execution_space::fence();
+
+ typename ZeroFunctor<T,execution_space>::type data2("Data");
+ typename ZeroFunctor<T,execution_space>::h_type h_data2("HData");
+ f_zero.data = data2;
+ Kokkos::parallel_for(1,f_zero);
+ execution_space::fence();
+
+ struct ExchFunctor<T,execution_space> f_exch;
+ f_exch.data = data;
+ f_exch.data2 = data2;
+ Kokkos::parallel_for(loop,f_exch);
+ execution_space::fence();
+
+ Kokkos::deep_copy(h_data,data);
+ Kokkos::deep_copy(h_data2,data2);
+ T val = h_data() + h_data2();
+
+ return val;
+}
+
+template<class T>
+T ExchLoopSerial(int loop) {
+ T* data = new T[1];
+ T* data2 = new T[1];
+ data[0] = 0;
+ data2[0] = 0;
+ for(int i=0;i<loop;i++) {
+ T old = *data;
+ *data=(T) i;
+ *data2+=old;
+ }
+
+ T val = *data2 + *data;
+ delete [] data;
+ delete [] data2;
+ return val;
+}
+
+template<class T, class DeviceType >
+T LoopVariant(int loop, int test) {
+ switch (test) {
+ case 1: return AddLoop<T,DeviceType>(loop);
+ case 2: return CASLoop<T,DeviceType>(loop);
+ case 3: return ExchLoop<T,DeviceType>(loop);
+ }
+ return 0;
+}
+
+template<class T>
+T LoopVariantSerial(int loop, int test) {
+ switch (test) {
+ case 1: return AddLoopSerial<T>(loop);
+ case 2: return CASLoopSerial<T>(loop);
+ case 3: return ExchLoopSerial<T>(loop);
+ }
+ return 0;
+}
+
+template<class T,class DeviceType>
+bool Loop(int loop, int test)
+{
+ T res = LoopVariant<T,DeviceType>(loop,test);
+ T resSerial = LoopVariantSerial<T>(loop,test);
+
+ bool passed = true;
+
+ if ( resSerial != res ) {
+ passed = false;
+
+ std::cout << "Loop<"
+ << typeid(T).name()
+ << ">( test = "
+ << test << " ) FAILED : "
+ << resSerial << " != " << res
+ << std::endl ;
+ }
+
+
+ return passed ;
+}
+
+}
+
diff --git a/lib/kokkos/core/unit_test/TestCXX11.hpp b/lib/kokkos/core/unit_test/TestCXX11.hpp
new file mode 100755
index 000000000..f48c76de5
--- /dev/null
+++ b/lib/kokkos/core/unit_test/TestCXX11.hpp
@@ -0,0 +1,319 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+#include <Kokkos_Core.hpp>
+
+namespace TestCXX11 {
+
+template<class DeviceType>
+struct FunctorAddTest{
+ typedef Kokkos::View<double**,DeviceType> view_type;
+ view_type a_, b_;
+ typedef DeviceType execution_space;
+ FunctorAddTest(view_type & a, view_type &b):a_(a),b_(b) {}
+ void operator() (const int& i) const {
+ b_(i,0) = a_(i,1) + a_(i,2);
+ b_(i,1) = a_(i,0) - a_(i,3);
+ b_(i,2) = a_(i,4) + a_(i,0);
+ b_(i,3) = a_(i,2) - a_(i,1);
+ b_(i,4) = a_(i,3) + a_(i,4);
+ }
+
+ typedef typename Kokkos::TeamPolicy< execution_space >::member_type team_member ;
+ void operator() (const team_member & dev) const {
+ int i = dev.league_rank()*dev.team_size() + dev.team_rank();
+ b_(i,0) = a_(i,1) + a_(i,2);
+ b_(i,1) = a_(i,0) - a_(i,3);
+ b_(i,2) = a_(i,4) + a_(i,0);
+ b_(i,3) = a_(i,2) - a_(i,1);
+ b_(i,4) = a_(i,3) + a_(i,4);
+ }
+};
+
+template<class DeviceType, bool PWRTest>
+double AddTestFunctor() {
+
+ typedef Kokkos::TeamPolicy<DeviceType> policy_type ;
+
+ Kokkos::View<double**,DeviceType> a("A",100,5);
+ Kokkos::View<double**,DeviceType> b("B",100,5);
+ typename Kokkos::View<double**,DeviceType>::HostMirror h_a = Kokkos::create_mirror_view(a);
+ typename Kokkos::View<double**,DeviceType>::HostMirror h_b = Kokkos::create_mirror_view(b);
+
+ for(int i=0;i<100;i++) {
+ for(int j=0;j<5;j++)
+ h_a(i,j) = 0.1*i/(1.1*j+1.0) + 0.5*j;
+ }
+ Kokkos::deep_copy(a,h_a);
+
+ if(PWRTest==false)
+ Kokkos::parallel_for(100,FunctorAddTest<DeviceType>(a,b));
+ else
+ Kokkos::parallel_for(policy_type(25,4),FunctorAddTest<DeviceType>(a,b));
+ Kokkos::deep_copy(h_b,b);
+
+ double result = 0;
+ for(int i=0;i<100;i++) {
+ for(int j=0;j<5;j++)
+ result += h_b(i,j);
+ }
+
+ return result;
+}
+
+
+
+#if defined (KOKKOS_HAVE_CXX11_DISPATCH_LAMBDA)
+template<class DeviceType, bool PWRTest>
+double AddTestLambda() {
+
+ typedef Kokkos::TeamPolicy<DeviceType> policy_type ;
+
+ Kokkos::View<double**,DeviceType> a("A",100,5);
+ Kokkos::View<double**,DeviceType> b("B",100,5);
+ typename Kokkos::View<double**,DeviceType>::HostMirror h_a = Kokkos::create_mirror_view(a);
+ typename Kokkos::View<double**,DeviceType>::HostMirror h_b = Kokkos::create_mirror_view(b);
+
+ for(int i=0;i<100;i++) {
+ for(int j=0;j<5;j++)
+ h_a(i,j) = 0.1*i/(1.1*j+1.0) + 0.5*j;
+ }
+ Kokkos::deep_copy(a,h_a);
+
+ if(PWRTest==false) {
+ Kokkos::parallel_for(100,[=](const int& i) {
+ b(i,0) = a(i,1) + a(i,2);
+ b(i,1) = a(i,0) - a(i,3);
+ b(i,2) = a(i,4) + a(i,0);
+ b(i,3) = a(i,2) - a(i,1);
+ b(i,4) = a(i,3) + a(i,4);
+ });
+ } else {
+ typedef typename policy_type::member_type team_member ;
+ Kokkos::parallel_for(policy_type(25,4),[=](const team_member & dev) {
+ int i = dev.league_rank()*dev.team_size() + dev.team_rank();
+ b(i,0) = a(i,1) + a(i,2);
+ b(i,1) = a(i,0) - a(i,3);
+ b(i,2) = a(i,4) + a(i,0);
+ b(i,3) = a(i,2) - a(i,1);
+ b(i,4) = a(i,3) + a(i,4);
+ });
+ }
+ Kokkos::deep_copy(h_b,b);
+
+ double result = 0;
+ for(int i=0;i<100;i++) {
+ for(int j=0;j<5;j++)
+ result += h_b(i,j);
+ }
+
+ return result;
+}
+
+#else
+template<class DeviceType, bool PWRTest>
+double AddTestLambda() {
+ return AddTestFunctor<DeviceType,PWRTest>();
+}
+#endif
+
+
+template<class DeviceType>
+struct FunctorReduceTest{
+ typedef Kokkos::View<double**,DeviceType> view_type;
+ view_type a_;
+ typedef DeviceType execution_space;
+ typedef double value_type;
+ FunctorReduceTest(view_type & a):a_(a) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (const int& i, value_type& sum) const {
+ sum += a_(i,1) + a_(i,2);
+ sum += a_(i,0) - a_(i,3);
+ sum += a_(i,4) + a_(i,0);
+ sum += a_(i,2) - a_(i,1);
+ sum += a_(i,3) + a_(i,4);
+ }
+
+ typedef typename Kokkos::TeamPolicy< execution_space >::member_type team_member ;
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (const team_member & dev, value_type& sum) const {
+ int i = dev.league_rank()*dev.team_size() + dev.team_rank();
+ sum += a_(i,1) + a_(i,2);
+ sum += a_(i,0) - a_(i,3);
+ sum += a_(i,4) + a_(i,0);
+ sum += a_(i,2) - a_(i,1);
+ sum += a_(i,3) + a_(i,4);
+ }
+ KOKKOS_INLINE_FUNCTION
+ void init(value_type& update) const {update = 0.0;}
+ KOKKOS_INLINE_FUNCTION
+ void join(volatile value_type& update, volatile value_type const& input) const {update += input;}
+};
+
+template<class DeviceType, bool PWRTest>
+double ReduceTestFunctor() {
+
+ typedef Kokkos::TeamPolicy<DeviceType> policy_type ;
+ typedef Kokkos::View<double**,DeviceType> view_type ;
+ typedef Kokkos::View<double,typename view_type::host_mirror_space,Kokkos::MemoryUnmanaged> unmanaged_result ;
+
+ view_type a("A",100,5);
+ typename view_type::HostMirror h_a = Kokkos::create_mirror_view(a);
+
+ for(int i=0;i<100;i++) {
+ for(int j=0;j<5;j++)
+ h_a(i,j) = 0.1*i/(1.1*j+1.0) + 0.5*j;
+ }
+ Kokkos::deep_copy(a,h_a);
+
+ double result = 0.0;
+ if(PWRTest==false)
+ Kokkos::parallel_reduce(100,FunctorReduceTest<DeviceType>(a), unmanaged_result( & result ));
+ else
+ Kokkos::parallel_reduce(policy_type(25,4),FunctorReduceTest<DeviceType>(a), unmanaged_result( & result ));
+
+ return result;
+}
+
+#if defined (KOKKOS_HAVE_CXX11_DISPATCH_LAMBDA)
+template<class DeviceType, bool PWRTest>
+double ReduceTestLambda() {
+
+ typedef Kokkos::TeamPolicy<DeviceType> policy_type ;
+ typedef Kokkos::View<double**,DeviceType> view_type ;
+ typedef Kokkos::View<double,typename view_type::host_mirror_space,Kokkos::MemoryUnmanaged> unmanaged_result ;
+
+ view_type a("A",100,5);
+ typename view_type::HostMirror h_a = Kokkos::create_mirror_view(a);
+
+ for(int i=0;i<100;i++) {
+ for(int j=0;j<5;j++)
+ h_a(i,j) = 0.1*i/(1.1*j+1.0) + 0.5*j;
+ }
+ Kokkos::deep_copy(a,h_a);
+
+ double result = 0.0;
+
+ if(PWRTest==false) {
+ Kokkos::parallel_reduce(100,[=](const int& i, double& sum) {
+ sum += a(i,1) + a(i,2);
+ sum += a(i,0) - a(i,3);
+ sum += a(i,4) + a(i,0);
+ sum += a(i,2) - a(i,1);
+ sum += a(i,3) + a(i,4);
+ }, unmanaged_result( & result ) );
+ } else {
+ typedef typename policy_type::member_type team_member ;
+ Kokkos::parallel_reduce(policy_type(25,4),[=](const team_member & dev, double& sum) {
+ int i = dev.league_rank()*dev.team_size() + dev.team_rank();
+ sum += a(i,1) + a(i,2);
+ sum += a(i,0) - a(i,3);
+ sum += a(i,4) + a(i,0);
+ sum += a(i,2) - a(i,1);
+ sum += a(i,3) + a(i,4);
+ }, unmanaged_result( & result ) );
+ }
+
+ return result;
+}
+
+#else
+template<class DeviceType, bool PWRTest>
+double ReduceTestLambda() {
+ return ReduceTestFunctor<DeviceType,PWRTest>();
+}
+#endif
+
+template<class DeviceType>
+double TestVariantLambda(int test) {
+ switch (test) {
+ case 1: return AddTestLambda<DeviceType,false>();
+ case 2: return AddTestLambda<DeviceType,true>();
+ case 3: return ReduceTestLambda<DeviceType,false>();
+ case 4: return ReduceTestLambda<DeviceType,true>();
+ }
+ return 0;
+}
+
+
+template<class DeviceType>
+double TestVariantFunctor(int test) {
+ switch (test) {
+ case 1: return AddTestFunctor<DeviceType,false>();
+ case 2: return AddTestFunctor<DeviceType,true>();
+ case 3: return ReduceTestFunctor<DeviceType,false>();
+ case 4: return ReduceTestFunctor<DeviceType,true>();
+ }
+ return 0;
+}
+
+template<class DeviceType>
+bool Test(int test) {
+
+#ifdef KOKKOS_HAVE_CXX11_DISPATCH_LAMBDA
+ double res_functor = TestVariantFunctor<DeviceType>(test);
+ double res_lambda = TestVariantLambda<DeviceType>(test);
+
+ char testnames[5][256] = {" "
+ ,"AddTest","AddTest TeamPolicy"
+ ,"ReduceTest","ReduceTest TeamPolicy"
+ };
+ bool passed = true;
+
+ if ( res_functor != res_lambda ) {
+ passed = false;
+
+ std::cout << "CXX11 ( test = '"
+ << testnames[test] << "' ) FAILED : "
+ << res_functor << " != " << res_lambda
+ << std::endl ;
+ }
+
+ return passed ;
+#else
+ return true;
+#endif
+}
+
+}
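
The check performed by TestCXX11::Test boils down to one pattern: the same kernel expressed once as a functor and once as a C++11 lambda must produce identical results. A minimal, self-contained sketch of that idea (illustrative only, not part of the patch; the functor relies on Kokkos' default sum init/join for the declared value_type):

#include <Kokkos_Core.hpp>

struct SumFunctor {
  typedef double value_type ;
  KOKKOS_INLINE_FUNCTION
  void operator()( const int i , double & sum ) const { sum += 1.0 * i ; }
};

double sum_both_ways()
{
  double from_functor = 0 , from_lambda = 0 ;

  Kokkos::parallel_reduce( 100 , SumFunctor() , from_functor );

#if defined (KOKKOS_HAVE_CXX11_DISPATCH_LAMBDA)
  Kokkos::parallel_reduce( 100 , [=]( const int i , double & sum ) { sum += 1.0 * i ; } , from_lambda );
#else
  from_lambda = from_functor ;   // same fallback the file above uses when lambdas are unavailable
#endif

  // Both paths should yield 4950 ( = 0 + 1 + ... + 99 ), so the difference is expected to be 0.
  return from_functor - from_lambda ;
}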
diff --git a/lib/kokkos/core/src/impl/Kokkos_PhysicalLayout.hpp b/lib/kokkos/core/unit_test/TestCXX11Deduction.hpp
similarity index 50%
copy from lib/kokkos/core/src/impl/Kokkos_PhysicalLayout.hpp
copy to lib/kokkos/core/unit_test/TestCXX11Deduction.hpp
index 0dcb3977a..9d20079b2 100755
--- a/lib/kokkos/core/src/impl/Kokkos_PhysicalLayout.hpp
+++ b/lib/kokkos/core/unit_test/TestCXX11Deduction.hpp
@@ -1,84 +1,103 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
+#include <Kokkos_Core.hpp>
+
+#ifndef TESTCXX11DEDUCTION_HPP
+#define TESTCXX11DEDUCTION_HPP
+
+namespace TestCXX11 {
+
+#if defined( KOKKOS_HAVE_CXX11 )
+
+struct TestReductionDeductionTagA {};
+struct TestReductionDeductionTagB {};
+
+template < class ExecSpace >
+struct TestReductionDeductionFunctor {
+
+ // KOKKOS_INLINE_FUNCTION
+ // void operator()( long i , long & value ) const
+ // { value += i + 1 ; }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( TestReductionDeductionTagA , long i , long & value ) const
+ { value += ( 2 * i + 1 ) + ( 2 * i + 2 ); }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const TestReductionDeductionTagB & , const long i , long & value ) const
+ { value += ( 3 * i + 1 ) + ( 3 * i + 2 ) + ( 3 * i + 3 ) ; }
-#ifndef KOKKOS_PHYSICAL_LAYOUT_HPP
-#define KOKKOS_PHYSICAL_LAYOUT_HPP
-
-
-#include <Kokkos_View.hpp>
-namespace Kokkos {
-namespace Impl {
-
-
-
-struct PhysicalLayout {
- enum LayoutType {Left,Right,Scalar,Error};
- LayoutType layout_type;
- int rank;
- long long int stride[8]; //distance between two neighboring elements in a given dimension
-
- template< class T , class L , class D , class M >
- PhysicalLayout( const View<T,L,D,M,ViewDefault> & view )
- : layout_type( is_same< typename View<T,L,D,M>::array_layout , LayoutLeft >::value ? Left : (
- is_same< typename View<T,L,D,M>::array_layout , LayoutRight >::value ? Right : Error ))
- , rank( view.Rank )
- {
- for(int i=0;i<8;i++) stride[i] = 0;
- view.stride( stride );
- }
- #ifdef KOKKOS_HAVE_CUDA
- template< class T , class L , class D , class M >
- PhysicalLayout( const View<T,L,D,M,ViewCudaTexture> & view )
- : layout_type( is_same< typename View<T,L,D,M>::array_layout , LayoutLeft >::value ? Left : (
- is_same< typename View<T,L,D,M>::array_layout , LayoutRight >::value ? Right : Error ))
- , rank( view.Rank )
- {
- for(int i=0;i<8;i++) stride[i] = 0;
- view.stride( stride );
- }
- #endif
};
+template< class ExecSpace >
+void test_reduction_deduction()
+{
+ typedef TestReductionDeductionFunctor< ExecSpace > Functor ;
+
+ const long N = 50 ;
+ // const long answer = N % 2 ? ( N * ((N+1)/2 )) : ( (N/2) * (N+1) );
+ const long answerA = N % 2 ? ( (2*N) * (((2*N)+1)/2 )) : ( ((2*N)/2) * ((2*N)+1) );
+ const long answerB = N % 2 ? ( (3*N) * (((3*N)+1)/2 )) : ( ((3*N)/2) * ((3*N)+1) );
+ long result = 0 ;
+
+ // Kokkos::parallel_reduce( Kokkos::RangePolicy<ExecSpace>(0,N) , Functor() , result );
+ // ASSERT_EQ( answer , result );
+
+ Kokkos::parallel_reduce( Kokkos::RangePolicy<ExecSpace,TestReductionDeductionTagA>(0,N) , Functor() , result );
+ ASSERT_EQ( answerA , result );
+
+ Kokkos::parallel_reduce( Kokkos::RangePolicy<ExecSpace,TestReductionDeductionTagB>(0,N) , Functor() , result );
+ ASSERT_EQ( answerB , result );
}
+
+#else /* ! defined( KOKKOS_HAVE_CXX11 ) */
+
+template< class ExecSpace >
+void test_reduction_deduction() {}
+
+#endif /* ! defined( KOKKOS_HAVE_CXX11 ) */
+
}
+
#endif
+
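
The deduction test above relies on execution-policy work tags: the tag type given to the RangePolicy selects which operator() overload is dispatched, and the reduction value type is deduced from that overload's last argument. A compact illustration of the same mechanism follows (the tag and function names are placeholders, not part of the patch):

#include <Kokkos_Core.hpp>

struct Twice {};    // placeholder tag types, for illustration only
struct Thrice {};

struct TaggedSum {
  KOKKOS_INLINE_FUNCTION
  void operator()( Twice , const long i , long & sum ) const { sum += 2 * i ; }

  KOKKOS_INLINE_FUNCTION
  void operator()( Thrice , const long i , long & sum ) const { sum += 3 * i ; }
};

void tag_dispatch_example()
{
  typedef Kokkos::DefaultExecutionSpace Space ;

  long r2 = 0 , r3 = 0 ;
  Kokkos::parallel_reduce( Kokkos::RangePolicy< Space , Twice  >( 0 , 10 ) , TaggedSum() , r2 );
  Kokkos::parallel_reduce( Kokkos::RangePolicy< Space , Thrice >( 0 , 10 ) , TaggedSum() , r3 );
  // r2 == 90 ( 2 * (0+...+9) ) and r3 == 135 ( 3 * (0+...+9) )
}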
diff --git a/lib/kokkos/core/src/impl/Kokkos_StaticAssert.hpp b/lib/kokkos/core/unit_test/TestCompilerMacros.hpp
similarity index 61%
copy from lib/kokkos/core/src/impl/Kokkos_StaticAssert.hpp
copy to lib/kokkos/core/unit_test/TestCompilerMacros.hpp
index f1017c312..dfa2250c0 100755
--- a/lib/kokkos/core/src/impl/Kokkos_StaticAssert.hpp
+++ b/lib/kokkos/core/unit_test/TestCompilerMacros.hpp
@@ -1,79 +1,93 @@
/*
//@HEADER
// ************************************************************************
//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
//
// ************************************************************************
//@HEADER
*/
-#ifndef KOKKOS_STATICASSERT_HPP
-#define KOKKOS_STATICASSERT_HPP
+#include <Kokkos_Core.hpp>
-namespace Kokkos {
-namespace Impl {
+#define KOKKOS_PRAGMA_UNROLL(a)
-template < bool , class T = void >
-struct StaticAssert ;
+namespace TestCompilerMacros {
-template< class T >
-struct StaticAssert< true , T > {
- typedef T type ;
- static const bool value = true ;
-};
-
-template < class A , class B >
-struct StaticAssertSame ;
-
-template < class A >
-struct StaticAssertSame<A,A> { typedef A type ; };
+template<class DEVICE_TYPE>
+struct AddFunctor {
+ typedef DEVICE_TYPE execution_space;
+ typedef typename Kokkos::View<int**,execution_space> type;
+ type a,b;
+ int length;
-template < class A , class B >
-struct StaticAssertAssignable ;
+ AddFunctor(type a_, type b_):a(a_),b(b_),length(a.dimension_1()) {}
-template < class A >
-struct StaticAssertAssignable<A,A> { typedef A type ; };
-
-template < class A >
-struct StaticAssertAssignable< const A , A > { typedef const A type ; };
-
-} // namespace Impl
-} // namespace Kokkos
+ KOKKOS_INLINE_FUNCTION
+ void operator()(int i) const {
+#ifdef KOKKOS_HAVE_PRAGMA_UNROLL
+ #pragma unroll
+#endif
+#ifdef KOKKOS_HAVE_PRAGMA_IVDEP
+ #pragma ivdep
+#endif
+#ifdef KOKKOS_HAVE_PRAGMA_VECTOR
+ #pragma vector always
+#endif
+#ifdef KOKKOS_HAVE_PRAGMA_LOOPCOUNT
+ #pragma loop count(128)
+#endif
+#ifdef KOKKOS_HAVE_PRAGMA_SIMD
+ #pragma simd
+#endif
+ for(int j=0;j<length;j++)
+ a(i,j) += b(i,j);
+ }
+};
-#endif /* KOKKOS_STATICASSERT_HPP */
+template<class DeviceType>
+bool Test() {
+ typedef typename Kokkos::View<int**,DeviceType> type;
+ type a("A",1024,128);
+ type b("B",1024,128);
+ AddFunctor<DeviceType> f(a,b);
+ Kokkos::parallel_for(1024,f);
+ DeviceType::fence();
+ return true;
+}
+}
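
What the compiler-macro test above verifies is simply that the KOKKOS_HAVE_PRAGMA_* feature macros let vendor-specific loop pragmas be guarded so the same kernel compiles on every supported compiler. A stripped-down version of that guard pattern (illustrative only; the function name is a placeholder):

#include <Kokkos_Core.hpp>

KOKKOS_INLINE_FUNCTION
void axpy( const int n , const double a , const double * x , double * y )
{
#ifdef KOKKOS_HAVE_PRAGMA_IVDEP
  #pragma ivdep
#endif
#ifdef KOKKOS_HAVE_PRAGMA_VECTOR
  #pragma vector always
#endif
  // The pragmas above are only emitted when the corresponding macro is defined.
  for ( int i = 0 ; i < n ; ++i )
    y[i] += a * x[i] ;
}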
diff --git a/lib/kokkos/core/unit_test/TestCuda.cpp b/lib/kokkos/core/unit_test/TestCuda.cpp
new file mode 100755
index 000000000..4a74d1f18
--- /dev/null
+++ b/lib/kokkos/core/unit_test/TestCuda.cpp
@@ -0,0 +1,495 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <gtest/gtest.h>
+
+#include <iostream>
+
+#include <Kokkos_Core.hpp>
+
+//----------------------------------------------------------------------------
+
+#include <impl/Kokkos_ViewTileLeft.hpp>
+
+//----------------------------------------------------------------------------
+
+#include <TestSharedAlloc.hpp>
+#include <TestViewMapping.hpp>
+
+#include <TestViewImpl.hpp>
+#include <TestAtomic.hpp>
+
+#include <TestViewAPI.hpp>
+#include <TestViewSubview.hpp>
+#include <TestTile.hpp>
+
+#include <TestReduce.hpp>
+#include <TestScan.hpp>
+#include <TestRange.hpp>
+#include <TestTeam.hpp>
+#include <TestAggregate.hpp>
+#include <TestAggregateReduction.hpp>
+#include <TestCompilerMacros.hpp>
+#include <TestMemorySpaceTracking.hpp>
+#include <TestTeamVector.hpp>
+#include <TestTemplateMetaFunctions.hpp>
+#include <TestCXX11Deduction.hpp>
+
+//----------------------------------------------------------------------------
+
+class cuda : public ::testing::Test {
+protected:
+ static void SetUpTestCase()
+ {
+ Kokkos::Cuda::print_configuration( std::cout );
+ Kokkos::HostSpace::execution_space::initialize();
+ Kokkos::Cuda::initialize( Kokkos::Cuda::SelectDevice(0) );
+ }
+ static void TearDownTestCase()
+ {
+ Kokkos::Cuda::finalize();
+ Kokkos::HostSpace::execution_space::finalize();
+ }
+};
+
+//----------------------------------------------------------------------------
+
+namespace Test {
+
+__global__
+void test_abort()
+{
+ Kokkos::Impl::VerifyExecutionCanAccessMemorySpace<
+ Kokkos::CudaSpace ,
+ Kokkos::HostSpace >::verify();
+}
+
+__global__
+void test_cuda_spaces_int_value( int * ptr )
+{
+ if ( *ptr == 42 ) { *ptr = 2 * 42 ; }
+}
+
+
+TEST_F( cuda , compiler_macros )
+{
+ ASSERT_TRUE( ( TestCompilerMacros::Test< Kokkos::Cuda >() ) );
+}
+
+TEST_F( cuda , memory_space )
+{
+ TestMemorySpace< Kokkos::Cuda >();
+}
+
+TEST_F( cuda, spaces )
+{
+ if ( Kokkos::CudaUVMSpace::available() ) {
+
+ Kokkos::Impl::AllocationTracker tracker = Kokkos::CudaUVMSpace::allocate_and_track("uvm_ptr",sizeof(int));
+
+ int * uvm_ptr = (int*) tracker.alloc_ptr();
+
+ *uvm_ptr = 42 ;
+
+ Kokkos::Cuda::fence();
+ test_cuda_spaces_int_value<<<1,1>>>(uvm_ptr);
+ Kokkos::Cuda::fence();
+
+ EXPECT_EQ( *uvm_ptr, int(2*42) );
+
+ }
+}
+
+//----------------------------------------------------------------------------
+
+TEST_F( cuda , impl_shared_alloc )
+{
+ test_shared_alloc< Kokkos::CudaSpace , Kokkos::HostSpace::execution_space >();
+ test_shared_alloc< Kokkos::CudaUVMSpace , Kokkos::HostSpace::execution_space >();
+ test_shared_alloc< Kokkos::CudaHostPinnedSpace , Kokkos::HostSpace::execution_space >();
+}
+
+TEST_F( cuda , impl_view_mapping )
+{
+ test_view_mapping< Kokkos::Cuda >();
+ test_view_mapping_subview< Kokkos::Cuda >();
+ test_view_mapping_operator< Kokkos::Cuda >();
+ TestViewMappingAtomic< Kokkos::Cuda >::run();
+}
+
+template< class MemSpace >
+struct TestViewCudaTexture {
+
+ enum { N = 1000 };
+
+ using V = Kokkos::Experimental::View<double*,MemSpace> ;
+ using T = Kokkos::Experimental::View<const double*, MemSpace, Kokkos::MemoryRandomAccess > ;
+
+ V m_base ;
+ T m_tex ;
+
+ struct TagInit {};
+ struct TagTest {};
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const TagInit & , const int i ) const { m_base[i] = i + 1 ; }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const TagTest & , const int i , long & error_count ) const
+ { if ( m_tex[i] != i + 1 ) ++error_count ; }
+
+ TestViewCudaTexture()
+ : m_base("base",N)
+ , m_tex( m_base )
+ {}
+
+ static void run()
+ {
+ EXPECT_TRUE( ( std::is_same< typename V::reference_type
+ , double &
+ >::value ) );
+
+ EXPECT_TRUE( ( std::is_same< typename T::reference_type
+ , const double
+ >::value ) );
+
+ EXPECT_TRUE( V::reference_type_is_lvalue_reference ); // An ordinary view
+ EXPECT_FALSE( T::reference_type_is_lvalue_reference ); // Texture fetch returns by value
+
+ TestViewCudaTexture self ;
+ Kokkos::parallel_for( Kokkos::RangePolicy< Kokkos::Cuda , TagInit >(0,N) , self );
+ long error_count = -1 ;
+ Kokkos::parallel_reduce( Kokkos::RangePolicy< Kokkos::Cuda , TagTest >(0,N) , self , error_count );
+ EXPECT_EQ( error_count , 0 );
+ }
+};
+
+
+TEST_F( cuda , impl_view_texture )
+{
+ TestViewCudaTexture< Kokkos::CudaSpace >::run();
+ TestViewCudaTexture< Kokkos::CudaUVMSpace >::run();
+}
+
+template< class MemSpace , class ExecSpace >
+struct TestViewCudaAccessible {
+
+ enum { N = 1000 };
+
+ using V = Kokkos::Experimental::View<double*,MemSpace> ;
+
+ V m_base ;
+
+ struct TagInit {};
+ struct TagTest {};
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const TagInit & , const int i ) const { m_base[i] = i + 1 ; }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const TagTest & , const int i , long & error_count ) const
+ { if ( m_base[i] != i + 1 ) ++error_count ; }
+
+ TestViewCudaAccessible()
+ : m_base("base",N)
+ {}
+
+ static void run()
+ {
+ TestViewCudaAccessible self ;
+ Kokkos::parallel_for( Kokkos::RangePolicy< typename MemSpace::execution_space , TagInit >(0,N) , self );
+ MemSpace::execution_space::fence();
+ // Next access is a different execution space, must complete prior kernel.
+ long error_count = -1 ;
+ Kokkos::parallel_reduce( Kokkos::RangePolicy< ExecSpace , TagTest >(0,N) , self , error_count );
+ EXPECT_EQ( error_count , 0 );
+ }
+};
+
+
+TEST_F( cuda , impl_view_accessible )
+{
+ TestViewCudaAccessible< Kokkos::CudaSpace , Kokkos::Cuda >::run();
+
+ TestViewCudaAccessible< Kokkos::CudaUVMSpace , Kokkos::Cuda >::run();
+ TestViewCudaAccessible< Kokkos::CudaUVMSpace , Kokkos::HostSpace::execution_space >::run();
+
+ TestViewCudaAccessible< Kokkos::CudaHostPinnedSpace , Kokkos::Cuda >::run();
+ TestViewCudaAccessible< Kokkos::CudaHostPinnedSpace , Kokkos::HostSpace::execution_space >::run();
+}
+
+//----------------------------------------------------------------------------
+
+TEST_F( cuda, view_impl )
+{
+ // test_abort<<<32,32>>>(); // Aborts the kernel with CUDA version 4.1 or greater
+
+ test_view_impl< Kokkos::Cuda >();
+}
+
+TEST_F( cuda, view_api )
+{
+ typedef Kokkos::View< const int * , Kokkos::Cuda , Kokkos::MemoryTraits< Kokkos::RandomAccess > > view_texture_managed ;
+ typedef Kokkos::View< const int * , Kokkos::Cuda , Kokkos::MemoryTraits< Kokkos::RandomAccess | Kokkos::Unmanaged > > view_texture_unmanaged ;
+
+ TestViewAPI< double , Kokkos::Cuda >();
+
+#if 0
+ Kokkos::View<double, Kokkos::Cuda > x("x");
+ Kokkos::View<double[1], Kokkos::Cuda > y("y");
+ // *x = 10 ;
+ // x() = 10 ;
+ // y[0] = 10 ;
+ // y(0) = 10 ;
+#endif
+}
+
+TEST_F( cuda, view_subview_auto_1d_left ) {
+ TestViewSubview::test_auto_1d< Kokkos::LayoutLeft,Kokkos::Cuda >();
+}
+
+TEST_F( cuda, view_subview_auto_1d_right ) {
+ TestViewSubview::test_auto_1d< Kokkos::LayoutRight,Kokkos::Cuda >();
+}
+
+TEST_F( cuda, view_subview_auto_1d_stride ) {
+ TestViewSubview::test_auto_1d< Kokkos::LayoutStride,Kokkos::Cuda >();
+}
+
+TEST_F( cuda, view_subview_assign_strided ) {
+ TestViewSubview::test_1d_strided_assignment< Kokkos::Cuda >();
+}
+
+TEST_F( cuda, view_subview_left_0 ) {
+ TestViewSubview::test_left_0< Kokkos::CudaUVMSpace >();
+}
+
+TEST_F( cuda, view_subview_left_1 ) {
+ TestViewSubview::test_left_1< Kokkos::CudaUVMSpace >();
+}
+
+TEST_F( cuda, view_subview_left_2 ) {
+ TestViewSubview::test_left_2< Kokkos::CudaUVMSpace >();
+}
+
+TEST_F( cuda, view_subview_left_3 ) {
+ TestViewSubview::test_left_3< Kokkos::CudaUVMSpace >();
+}
+
+TEST_F( cuda, view_subview_right_0 ) {
+ TestViewSubview::test_right_0< Kokkos::CudaUVMSpace >();
+}
+
+TEST_F( cuda, view_subview_right_1 ) {
+ TestViewSubview::test_right_1< Kokkos::CudaUVMSpace >();
+}
+
+TEST_F( cuda, view_subview_right_3 ) {
+ TestViewSubview::test_right_3< Kokkos::CudaUVMSpace >();
+}
+
+
+
+
+TEST_F( cuda, range_tag )
+{
+ TestRange< Kokkos::Cuda >::test_for(1000);
+ TestRange< Kokkos::Cuda >::test_reduce(1000);
+ TestRange< Kokkos::Cuda >::test_scan(1000);
+}
+
+TEST_F( cuda, team_tag )
+{
+ TestTeamPolicy< Kokkos::Cuda >::test_for(1000);
+ TestTeamPolicy< Kokkos::Cuda >::test_reduce(1000);
+}
+
+TEST_F( cuda, reduce )
+{
+ TestReduce< long , Kokkos::Cuda >( 10000000 );
+ TestReduce< double , Kokkos::Cuda >( 1000000 );
+}
+
+TEST_F( cuda, reduce_team )
+{
+ TestReduceTeam< long , Kokkos::Cuda >( 10000000 );
+ TestReduceTeam< double , Kokkos::Cuda >( 1000000 );
+}
+
+TEST_F( cuda, shared_team )
+{
+ TestSharedTeam< Kokkos::Cuda >();
+}
+
+TEST_F( cuda, reduce_dynamic )
+{
+ TestReduceDynamic< long , Kokkos::Cuda >( 10000000 );
+ TestReduceDynamic< double , Kokkos::Cuda >( 1000000 );
+}
+
+TEST_F( cuda, reduce_dynamic_view )
+{
+ TestReduceDynamicView< long , Kokkos::Cuda >( 10000000 );
+ TestReduceDynamicView< double , Kokkos::Cuda >( 1000000 );
+}
+
+TEST_F( cuda, atomic )
+{
+ const int loop_count = 1e3 ;
+
+ ASSERT_TRUE( ( TestAtomic::Loop<int,Kokkos::Cuda>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<int,Kokkos::Cuda>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<int,Kokkos::Cuda>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned int,Kokkos::Cuda>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned int,Kokkos::Cuda>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned int,Kokkos::Cuda>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<long int,Kokkos::Cuda>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<long int,Kokkos::Cuda>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<long int,Kokkos::Cuda>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned long int,Kokkos::Cuda>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned long int,Kokkos::Cuda>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned long int,Kokkos::Cuda>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<long long int,Kokkos::Cuda>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<long long int,Kokkos::Cuda>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<long long int,Kokkos::Cuda>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<double,Kokkos::Cuda>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<double,Kokkos::Cuda>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<double,Kokkos::Cuda>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<float,Kokkos::Cuda>(100,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<float,Kokkos::Cuda>(100,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<float,Kokkos::Cuda>(100,3) ) );
+}
+
+//----------------------------------------------------------------------------
+
+TEST_F( cuda, tile_layout)
+{
+ TestTile::test< Kokkos::Cuda , 1 , 1 >( 1 , 1 );
+ TestTile::test< Kokkos::Cuda , 1 , 1 >( 2 , 3 );
+ TestTile::test< Kokkos::Cuda , 1 , 1 >( 9 , 10 );
+
+ TestTile::test< Kokkos::Cuda , 2 , 2 >( 1 , 1 );
+ TestTile::test< Kokkos::Cuda , 2 , 2 >( 2 , 3 );
+ TestTile::test< Kokkos::Cuda , 2 , 2 >( 4 , 4 );
+ TestTile::test< Kokkos::Cuda , 2 , 2 >( 9 , 9 );
+
+ TestTile::test< Kokkos::Cuda , 2 , 4 >( 9 , 9 );
+ TestTile::test< Kokkos::Cuda , 4 , 4 >( 9 , 9 );
+
+ TestTile::test< Kokkos::Cuda , 4 , 4 >( 1 , 1 );
+ TestTile::test< Kokkos::Cuda , 4 , 4 >( 4 , 4 );
+ TestTile::test< Kokkos::Cuda , 4 , 4 >( 9 , 9 );
+ TestTile::test< Kokkos::Cuda , 4 , 4 >( 9 , 11 );
+
+ TestTile::test< Kokkos::Cuda , 8 , 8 >( 1 , 1 );
+ TestTile::test< Kokkos::Cuda , 8 , 8 >( 4 , 4 );
+ TestTile::test< Kokkos::Cuda , 8 , 8 >( 9 , 9 );
+ TestTile::test< Kokkos::Cuda , 8 , 8 >( 9 , 11 );
+}
+
+
+TEST_F( cuda , view_aggregate )
+{
+ TestViewAggregate< Kokkos::Cuda >();
+ TestViewAggregateReduction< Kokkos::Cuda >();
+}
+
+
+TEST_F( cuda , scan )
+{
+ TestScan< Kokkos::Cuda >::test_range( 1 , 1000 );
+ TestScan< Kokkos::Cuda >( 1000000 );
+ TestScan< Kokkos::Cuda >( 10000000 );
+ Kokkos::Cuda::fence();
+}
+
+TEST_F( cuda , team_scan )
+{
+ TestScanTeam< Kokkos::Cuda >( 10 );
+ TestScanTeam< Kokkos::Cuda >( 10000 );
+}
+
+}
+
+//----------------------------------------------------------------------------
+
+TEST_F( cuda , template_meta_functions )
+{
+ TestTemplateMetaFunctions<int, Kokkos::Cuda >();
+}
+
+//----------------------------------------------------------------------------
+
+#ifdef KOKKOS_HAVE_CXX11
+
+namespace Test {
+
+TEST_F( cuda , reduction_deduction )
+{
+ TestCXX11::test_reduction_deduction< Kokkos::Cuda >();
+}
+
+TEST_F( cuda , team_vector )
+{
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Cuda >(0) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Cuda >(1) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Cuda >(2) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Cuda >(3) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Cuda >(4) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Cuda >(5) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Cuda >(6) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Cuda >(7) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Cuda >(8) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Cuda >(9) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Cuda >(10) ) );
+}
+
+}
+#endif
+
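
These TEST_F cases are compiled into a gtest binary; the driver itself is not shown in this part of the patch, but a typical gtest entry point for such a unit-test executable would look like the following sketch (assumed, not taken from the patch):

#include <gtest/gtest.h>

int main( int argc , char *argv[] )
{
  // The cuda fixture's SetUpTestCase()/TearDownTestCase() take care of
  // Kokkos::Cuda::initialize()/finalize() around the whole test suite.
  ::testing::InitGoogleTest( &argc , argv );
  return RUN_ALL_TESTS();
}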
diff --git a/lib/kokkos/core/unit_test/TestDefaultDeviceType.cpp b/lib/kokkos/core/unit_test/TestDefaultDeviceType.cpp
new file mode 100755
index 000000000..d1a525f9e
--- /dev/null
+++ b/lib/kokkos/core/unit_test/TestDefaultDeviceType.cpp
@@ -0,0 +1,250 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <gtest/gtest.h>
+
+#include <Kokkos_Core.hpp>
+
+#if !defined(KOKKOS_HAVE_CUDA) || defined(__CUDACC__)
+//----------------------------------------------------------------------------
+
+#include <TestViewImpl.hpp>
+#include <TestAtomic.hpp>
+
+#include <TestViewAPI.hpp>
+
+#include <TestReduce.hpp>
+#include <TestScan.hpp>
+#include <TestTeam.hpp>
+#include <TestAggregate.hpp>
+#include <TestCompilerMacros.hpp>
+#include <TestCXX11.hpp>
+#include <TestTeamVector.hpp>
+
+namespace Test {
+
+class defaultdevicetype : public ::testing::Test {
+protected:
+ static void SetUpTestCase()
+ {
+ Kokkos::initialize();
+ }
+
+ static void TearDownTestCase()
+ {
+ Kokkos::finalize();
+ }
+};
+
+
+TEST_F( defaultdevicetype, view_impl) {
+ test_view_impl< Kokkos::DefaultExecutionSpace >();
+}
+
+TEST_F( defaultdevicetype, view_api) {
+ TestViewAPI< double , Kokkos::DefaultExecutionSpace >();
+}
+
+TEST_F( defaultdevicetype, long_reduce) {
+ TestReduce< long , Kokkos::DefaultExecutionSpace >( 100000 );
+}
+
+TEST_F( defaultdevicetype, double_reduce) {
+ TestReduce< double , Kokkos::DefaultExecutionSpace >( 100000 );
+}
+
+TEST_F( defaultdevicetype, long_reduce_dynamic ) {
+ TestReduceDynamic< long , Kokkos::DefaultExecutionSpace >( 100000 );
+}
+
+TEST_F( defaultdevicetype, double_reduce_dynamic ) {
+ TestReduceDynamic< double , Kokkos::DefaultExecutionSpace >( 100000 );
+}
+
+TEST_F( defaultdevicetype, long_reduce_dynamic_view ) {
+ TestReduceDynamicView< long , Kokkos::DefaultExecutionSpace >( 100000 );
+}
+
+
+TEST_F( defaultdevicetype , atomics )
+{
+ const int loop_count = 1e4 ;
+
+ ASSERT_TRUE( ( TestAtomic::Loop<int,Kokkos::DefaultExecutionSpace>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<int,Kokkos::DefaultExecutionSpace>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<int,Kokkos::DefaultExecutionSpace>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned int,Kokkos::DefaultExecutionSpace>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned int,Kokkos::DefaultExecutionSpace>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned int,Kokkos::DefaultExecutionSpace>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<long int,Kokkos::DefaultExecutionSpace>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<long int,Kokkos::DefaultExecutionSpace>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<long int,Kokkos::DefaultExecutionSpace>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned long int,Kokkos::DefaultExecutionSpace>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned long int,Kokkos::DefaultExecutionSpace>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned long int,Kokkos::DefaultExecutionSpace>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<long long int,Kokkos::DefaultExecutionSpace>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<long long int,Kokkos::DefaultExecutionSpace>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<long long int,Kokkos::DefaultExecutionSpace>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<double,Kokkos::DefaultExecutionSpace>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<double,Kokkos::DefaultExecutionSpace>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<double,Kokkos::DefaultExecutionSpace>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<float,Kokkos::DefaultExecutionSpace>(100,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<float,Kokkos::DefaultExecutionSpace>(100,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<float,Kokkos::DefaultExecutionSpace>(100,3) ) );
+}
+
+/*TEST_F( defaultdevicetype , view_remap )
+{
+ enum { N0 = 3 , N1 = 2 , N2 = 8 , N3 = 9 };
+
+ typedef Kokkos::View< double*[N1][N2][N3] ,
+ Kokkos::LayoutRight ,
+ Kokkos::DefaultExecutionSpace > output_type ;
+
+ typedef Kokkos::View< int**[N2][N3] ,
+ Kokkos::LayoutLeft ,
+ Kokkos::DefaultExecutionSpace > input_type ;
+
+ typedef Kokkos::View< int*[N0][N2][N3] ,
+ Kokkos::LayoutLeft ,
+ Kokkos::DefaultExecutionSpace > diff_type ;
+
+ output_type output( "output" , N0 );
+ input_type input ( "input" , N0 , N1 );
+ diff_type diff ( "diff" , N0 );
+
+ int value = 0 ;
+ for ( size_t i3 = 0 ; i3 < N3 ; ++i3 ) {
+ for ( size_t i2 = 0 ; i2 < N2 ; ++i2 ) {
+ for ( size_t i1 = 0 ; i1 < N1 ; ++i1 ) {
+ for ( size_t i0 = 0 ; i0 < N0 ; ++i0 ) {
+ input(i0,i1,i2,i3) = ++value ;
+ }}}}
+
+ // Kokkos::deep_copy( diff , input ); // throw with incompatible shape
+ Kokkos::deep_copy( output , input );
+
+ value = 0 ;
+ for ( size_t i3 = 0 ; i3 < N3 ; ++i3 ) {
+ for ( size_t i2 = 0 ; i2 < N2 ; ++i2 ) {
+ for ( size_t i1 = 0 ; i1 < N1 ; ++i1 ) {
+ for ( size_t i0 = 0 ; i0 < N0 ; ++i0 ) {
+ ++value ;
+ ASSERT_EQ( value , ((int) output(i0,i1,i2,i3) ) );
+ }}}}
+}*/
+
+//----------------------------------------------------------------------------
+
+
+TEST_F( defaultdevicetype , view_aggregate )
+{
+ TestViewAggregate< Kokkos::DefaultExecutionSpace >();
+}
+
+//----------------------------------------------------------------------------
+
+TEST_F( defaultdevicetype , scan )
+{
+ TestScan< Kokkos::DefaultExecutionSpace >::test_range( 1 , 1000 );
+ TestScan< Kokkos::DefaultExecutionSpace >( 1000000 );
+ TestScan< Kokkos::DefaultExecutionSpace >( 10000000 );
+ Kokkos::DefaultExecutionSpace::fence();
+}
+
+
+TEST_F( defaultdevicetype , team_scan )
+{
+ TestScanTeam< Kokkos::DefaultExecutionSpace >( 10 );
+ TestScanTeam< Kokkos::DefaultExecutionSpace >( 10000 );
+}
+
+//----------------------------------------------------------------------------
+
+TEST_F( defaultdevicetype , compiler_macros )
+{
+ ASSERT_TRUE( ( TestCompilerMacros::Test< Kokkos::DefaultExecutionSpace >() ) );
+}
+
+
+//----------------------------------------------------------------------------
+#if defined (KOKKOS_HAVE_CXX11)
+TEST_F( defaultdevicetype , cxx11 )
+{
+ ASSERT_TRUE( ( TestCXX11::Test< Kokkos::DefaultExecutionSpace >(1) ) );
+ ASSERT_TRUE( ( TestCXX11::Test< Kokkos::DefaultExecutionSpace >(2) ) );
+ ASSERT_TRUE( ( TestCXX11::Test< Kokkos::DefaultExecutionSpace >(3) ) );
+ ASSERT_TRUE( ( TestCXX11::Test< Kokkos::DefaultExecutionSpace >(4) ) );
+}
+#endif
+
+#if defined (KOKKOS_HAVE_CXX11)
+TEST_F( defaultdevicetype , team_vector )
+{
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::DefaultExecutionSpace >(0) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::DefaultExecutionSpace >(1) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::DefaultExecutionSpace >(2) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::DefaultExecutionSpace >(3) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::DefaultExecutionSpace >(4) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::DefaultExecutionSpace >(5) ) );
+}
+#endif
+
+#if defined (KOKKOS_HAVE_CXX11)
+TEST_F( defaultdevicetype , malloc )
+{
+ int* data = (int*) Kokkos::kokkos_malloc(100*sizeof(int));
+ ASSERT_NO_THROW(data = (int*) Kokkos::kokkos_realloc(data,120*sizeof(int)));
+ Kokkos::kokkos_free(data);
+}
+#endif
+
+} // namespace Test
+
+#endif
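
Outside of gtest, the same fixture pattern (Kokkos::initialize() before any kernel, Kokkos::finalize() only after all Views have been destroyed) applies to any standalone program on the default execution space. A minimal sketch with placeholder names (FillFunctor is not part of the patch):

#include <Kokkos_Core.hpp>

struct FillFunctor {
  Kokkos::View< double* , Kokkos::DefaultExecutionSpace > v ;
  FillFunctor( Kokkos::View< double* , Kokkos::DefaultExecutionSpace > v_ ) : v( v_ ) {}
  KOKKOS_INLINE_FUNCTION
  void operator()( const int i ) const { v( i ) = 1.0 * i ; }
};

int main( int argc , char *argv[] )
{
  Kokkos::initialize( argc , argv );             // as in SetUpTestCase() above
  {
    Kokkos::View< double* , Kokkos::DefaultExecutionSpace > v( "v" , 100 );
    Kokkos::parallel_for( 100 , FillFunctor( v ) );
    Kokkos::DefaultExecutionSpace::fence();
  }                                              // Views go out of scope before finalize
  Kokkos::finalize();                            // as in TearDownTestCase() above
  return 0 ;
}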
diff --git a/lib/kokkos/core/unit_test/TestDefaultDeviceTypeInit.cpp b/lib/kokkos/core/unit_test/TestDefaultDeviceTypeInit.cpp
new file mode 100755
index 000000000..a1e3f8fb0
--- /dev/null
+++ b/lib/kokkos/core/unit_test/TestDefaultDeviceTypeInit.cpp
@@ -0,0 +1,390 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <gtest/gtest.h>
+
+#include <Kokkos_Core.hpp>
+#ifdef KOKKOS_HAVE_OPENMP
+#include <omp.h>
+#endif
+
+#if !defined(KOKKOS_HAVE_CUDA) || defined(__CUDACC__)
+//----------------------------------------------------------------------------
+
+namespace Test {
+
+namespace Impl {
+
+ char** init_kokkos_args(bool do_threads,bool do_numa,bool do_device,bool do_other, int& nargs, Kokkos::InitArguments& init_args) {
+ nargs = (do_threads?1:0) +
+ (do_numa?1:0) +
+ (do_device?1:0) +
+ (do_other?4:0);
+ char** args_kokkos = new char*[nargs];
+ for(int i = 0; i < nargs; i++)
+ args_kokkos[i] = new char[20];
+
+ int threads_idx = do_other?1:0;
+ int numa_idx = (do_other?3:0) + (do_threads?1:0);
+ int device_idx = (do_other?3:0) + (do_threads?1:0) + (do_numa?1:0);
+
+
+ if(do_threads) {
+ int nthreads = 3;
+
+#ifdef KOKKOS_HAVE_OPENMP
+ if(omp_get_max_threads() < 3)
+ nthreads = omp_get_max_threads();
+#endif
+
+ if(Kokkos::hwloc::available()) {
+ if(Kokkos::hwloc::get_available_threads_per_core()<3)
+ nthreads = Kokkos::hwloc::get_available_threads_per_core()
+ * Kokkos::hwloc::get_available_numa_count();
+ }
+
+#ifdef KOKKOS_HAVE_SERIAL
+ if(Kokkos::Impl::is_same<Kokkos::Serial,Kokkos::DefaultExecutionSpace>::value ||
+ Kokkos::Impl::is_same<Kokkos::Serial,Kokkos::DefaultHostExecutionSpace>::value ) {
+ nthreads = 1;
+ }
+#endif
+ init_args.num_threads = nthreads;
+ sprintf(args_kokkos[threads_idx],"--threads=%i",nthreads);
+ }
+
+ if(do_numa) {
+ int numa = 1;
+ if(Kokkos::hwloc::available())
+ numa = Kokkos::hwloc::get_available_numa_count();
+#ifdef KOKKOS_HAVE_SERIAL
+ if(Kokkos::Impl::is_same<Kokkos::Serial,Kokkos::DefaultExecutionSpace>::value ||
+ Kokkos::Impl::is_same<Kokkos::Serial,Kokkos::DefaultHostExecutionSpace>::value ) {
+ numa = 1;
+ }
+#endif
+
+ init_args.num_numa = numa;
+ sprintf(args_kokkos[numa_idx],"--numa=%i",numa);
+ }
+
+ if(do_device) {
+
+ init_args.device_id = 0;
+ sprintf(args_kokkos[device_idx],"--device=%i",0);
+ }
+
+ if(do_other) {
+ sprintf(args_kokkos[0],"--dummyarg=1");
+ sprintf(args_kokkos[threads_idx+(do_threads?1:0)],"--dummy2arg");
+ sprintf(args_kokkos[threads_idx+(do_threads?1:0)+1],"dummy3arg");
+ sprintf(args_kokkos[device_idx+(do_device?1:0)],"dummy4arg=1");
+ }
+
+
+ return args_kokkos;
+ }
+
+ Kokkos::InitArguments init_initstruct(bool do_threads, bool do_numa, bool do_device) {
+ Kokkos::InitArguments args;
+
+ if(do_threads) {
+ int nthreads = 3;
+
+#ifdef KOKKOS_HAVE_OPENMP
+ if(omp_get_max_threads() < 3)
+ nthreads = omp_get_max_threads();
+#endif
+
+ if(Kokkos::hwloc::available()) {
+ if(Kokkos::hwloc::get_available_threads_per_core()<3)
+ nthreads = Kokkos::hwloc::get_available_threads_per_core()
+ * Kokkos::hwloc::get_available_numa_count();
+ }
+#ifdef KOKKOS_HAVE_SERIAL
+ if(Kokkos::Impl::is_same<Kokkos::Serial,Kokkos::DefaultExecutionSpace>::value ||
+ Kokkos::Impl::is_same<Kokkos::Serial,Kokkos::DefaultHostExecutionSpace>::value ) {
+ nthreads = 1;
+ }
+#endif
+
+ args.num_threads = nthreads;
+ }
+
+ if(do_numa) {
+ int numa = 1;
+ if(Kokkos::hwloc::available())
+ numa = Kokkos::hwloc::get_available_numa_count();
+#ifdef KOKKOS_HAVE_SERIAL
+ if(Kokkos::Impl::is_same<Kokkos::Serial,Kokkos::DefaultExecutionSpace>::value ||
+ Kokkos::Impl::is_same<Kokkos::Serial,Kokkos::DefaultHostExecutionSpace>::value ) {
+ numa = 1;
+ }
+#endif
+ args.num_numa = numa;
+ }
+
+ if(do_device) {
+ args.device_id = 0;
+ }
+
+ return args;
+ }
+
+ void check_correct_initialization(const Kokkos::InitArguments& argstruct) {
+ ASSERT_EQ( Kokkos::DefaultExecutionSpace::is_initialized(), 1);
+ ASSERT_EQ( Kokkos::HostSpace::execution_space::is_initialized(), 1);
+
+ //Figure out the number of threads the HostSpace ExecutionSpace should have initialized to
+ int expected_nthreads = argstruct.num_threads;
+ if(expected_nthreads<1) {
+ if(Kokkos::hwloc::available()) {
+ expected_nthreads = Kokkos::hwloc::get_available_numa_count()
+ * Kokkos::hwloc::get_available_cores_per_numa()
+ * Kokkos::hwloc::get_available_threads_per_core();
+ } else {
+ #ifdef KOKKOS_HAVE_OPENMP
+ if(Kokkos::Impl::is_same<Kokkos::HostSpace::execution_space,Kokkos::OpenMP>::value) {
+ expected_nthreads = omp_get_max_threads();
+ } else
+ #endif
+ expected_nthreads = 1;
+
+ }
+ #ifdef KOKKOS_HAVE_SERIAL
+ if(Kokkos::Impl::is_same<Kokkos::DefaultExecutionSpace,Kokkos::Serial>::value ||
+ Kokkos::Impl::is_same<Kokkos::DefaultHostExecutionSpace,Kokkos::Serial>::value )
+ expected_nthreads = 1;
+ #endif
+ }
+
+ int expected_numa = argstruct.num_numa;
+ if(expected_numa<1) {
+ if(Kokkos::hwloc::available()) {
+ expected_numa = Kokkos::hwloc::get_available_numa_count();
+ } else {
+ expected_numa = 1;
+ }
+ #ifdef KOKKOS_HAVE_SERIAL
+ if(Kokkos::Impl::is_same<Kokkos::DefaultExecutionSpace,Kokkos::Serial>::value ||
+ Kokkos::Impl::is_same<Kokkos::DefaultHostExecutionSpace,Kokkos::Serial>::value )
+ expected_numa = 1;
+ #endif
+ }
+ ASSERT_EQ(Kokkos::HostSpace::execution_space::thread_pool_size(),expected_nthreads);
+
+#ifdef KOKKOS_HAVE_CUDA
+ if(Kokkos::Impl::is_same<Kokkos::DefaultExecutionSpace,Kokkos::Cuda>::value) {
+ int device;
+ cudaGetDevice( &device );
+ int expected_device = argstruct.device_id;
+ if(argstruct.device_id<0) {
+ expected_device = 0;
+ }
+ ASSERT_EQ(expected_device,device);
+ }
+#endif
+ }
+
+ //ToDo: Add check whether correct number of threads are actually started
+ void test_no_arguments() {
+ Kokkos::initialize();
+ check_correct_initialization(Kokkos::InitArguments());
+ Kokkos::finalize();
+ }
+
+ void test_commandline_args(int nargs, char** args, const Kokkos::InitArguments& argstruct) {
+ Kokkos::initialize(nargs,args);
+ check_correct_initialization(argstruct);
+ Kokkos::finalize();
+ }
+
+ void test_initstruct_args(const Kokkos::InitArguments& args) {
+ Kokkos::initialize(args);
+ check_correct_initialization(args);
+ Kokkos::finalize();
+ }
+}
+
+class defaultdevicetypeinit : public ::testing::Test {
+protected:
+ static void SetUpTestCase()
+ {
+ }
+
+ static void TearDownTestCase()
+ {
+ }
+};
+
+
+TEST_F( defaultdevicetypeinit, no_args) {
+ Impl::test_no_arguments();
+}
+
+TEST_F( defaultdevicetypeinit, commandline_args_empty) {
+ Kokkos::InitArguments argstruct;
+ int nargs = 0;
+ char** args = Impl::init_kokkos_args(false,false,false,false,nargs, argstruct);
+ Impl::test_commandline_args(nargs,args,argstruct);
+ for(int i = 0; i < nargs; i++)
+ delete [] args[i];
+ delete [] args;
+}
+
+TEST_F( defaultdevicetypeinit, commandline_args_other) {
+ Kokkos::InitArguments argstruct;
+ int nargs = 0;
+ char** args = Impl::init_kokkos_args(false,false,false,true,nargs, argstruct);
+ Impl::test_commandline_args(nargs,args,argstruct);
+ for(int i = 0; i < nargs; i++)
+ delete [] args[i];
+ delete [] args;
+}
+
+TEST_F( defaultdevicetypeinit, commandline_args_nthreads) {
+ Kokkos::InitArguments argstruct;
+ int nargs = 0;
+ char** args = Impl::init_kokkos_args(true,false,false,false,nargs, argstruct);
+ Impl::test_commandline_args(nargs,args,argstruct);
+ for(int i = 0; i < nargs; i++)
+ delete [] args[i];
+ delete [] args;
+}
+
+TEST_F( defaultdevicetypeinit, commandline_args_nthreads_numa) {
+ Kokkos::InitArguments argstruct;
+ int nargs = 0;
+ char** args = Impl::init_kokkos_args(true,true,false,false,nargs, argstruct);
+ Impl::test_commandline_args(nargs,args,argstruct);
+ for(int i = 0; i < nargs; i++)
+ delete [] args[i];
+ delete [] args;
+}
+
+TEST_F( defaultdevicetypeinit, commandline_args_nthreads_numa_device) {
+ Kokkos::InitArguments argstruct;
+ int nargs = 0;
+ char** args = Impl::init_kokkos_args(true,true,true,false,nargs, argstruct);
+ Impl::test_commandline_args(nargs,args,argstruct);
+ for(int i = 0; i < nargs; i++)
+ delete [] args[i];
+ delete [] args;
+}
+
+TEST_F( defaultdevicetypeinit, commandline_args_nthreads_device) {
+ Kokkos::InitArguments argstruct;
+ int nargs = 0;
+ char** args = Impl::init_kokkos_args(true,false,true,false,nargs, argstruct);
+ Impl::test_commandline_args(nargs,args,argstruct);
+ for(int i = 0; i < nargs; i++)
+ delete [] args[i];
+ delete [] args;
+}
+
+TEST_F( defaultdevicetypeinit, commandline_args_numa_device) {
+ Kokkos::InitArguments argstruct;
+ int nargs = 0;
+ char** args = Impl::init_kokkos_args(false,true,true,false,nargs, argstruct);
+ Impl::test_commandline_args(nargs,args,argstruct);
+ for(int i = 0; i < nargs; i++)
+ delete [] args[i];
+ delete [] args;
+}
+
+TEST_F( defaultdevicetypeinit, commandline_args_device) {
+ Kokkos::InitArguments argstruct;
+ int nargs = 0;
+ char** args = Impl::init_kokkos_args(false,false,true,false,nargs, argstruct);
+ Impl::test_commandline_args(nargs,args,argstruct);
+ for(int i = 0; i < nargs; i++)
+ delete [] args[i];
+ delete [] args;
+}
+
+TEST_F( defaultdevicetypeinit, commandline_args_nthreads_numa_device_other) {
+ Kokkos::InitArguments argstruct;
+ int nargs = 0;
+ char** args = Impl::init_kokkos_args(true,true,true,true,nargs, argstruct);
+ Impl::test_commandline_args(nargs,args,argstruct);
+ for(int i = 0; i < nargs; i++)
+ delete [] args[i];
+ delete [] args;
+}
+
+TEST_F( defaultdevicetypeinit, initstruct_default) {
+ Kokkos::InitArguments args;
+ Impl::test_initstruct_args(args);
+}
+
+TEST_F( defaultdevicetypeinit, initstruct_nthreads) {
+ Kokkos::InitArguments args = Impl::init_initstruct(true,false,false);
+ Impl::test_initstruct_args(args);
+}
+
+TEST_F( defaultdevicetypeinit, initstruct_nthreads_numa) {
+ Kokkos::InitArguments args = Impl::init_initstruct(true,true,false);
+ Impl::test_initstruct_args(args);
+}
+
+TEST_F( defaultdevicetypeinit, initstruct_device) {
+ Kokkos::InitArguments args = Impl::init_initstruct(false,false,true);
+ Impl::test_initstruct_args(args);
+}
+
+TEST_F( defaultdevicetypeinit, initstruct_nthreads_device) {
+ Kokkos::InitArguments args = Impl::init_initstruct(true,false,true);
+ Impl::test_initstruct_args(args);
+}
+
+
+TEST_F( defaultdevicetypeinit, initstruct_nthreads_numa_device) {
+ Kokkos::InitArguments args = Impl::init_initstruct(true,true,true);
+ Impl::test_initstruct_args(args);
+}
+
+
+
+} // namespace test
+
+#endif
diff --git a/lib/kokkos/core/src/impl/Kokkos_spinwait.hpp b/lib/kokkos/core/unit_test/TestHWLOC.cpp
similarity index 72%
copy from lib/kokkos/core/src/impl/Kokkos_spinwait.hpp
copy to lib/kokkos/core/unit_test/TestHWLOC.cpp
index 966291abd..1637dec5d 100755
--- a/lib/kokkos/core/src/impl/Kokkos_spinwait.hpp
+++ b/lib/kokkos/core/unit_test/TestHWLOC.cpp
@@ -1,64 +1,69 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
+#include <gtest/gtest.h>
-#ifndef KOKKOS_SPINWAIT_HPP
-#define KOKKOS_SPINWAIT_HPP
+#include <iostream>
+#include <Kokkos_hwloc.hpp>
-#include <Kokkos_Macros.hpp>
+namespace Test {
-namespace Kokkos {
-namespace Impl {
+class hwloc : public ::testing::Test {
+protected:
+ static void SetUpTestCase()
+ {}
-#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
-void spinwait( volatile int & flag , const int value );
-#else
-KOKKOS_INLINE_FUNCTION
-void spinwait( volatile int & , const int ) {}
-#endif
+ static void TearDownTestCase()
+ {}
+};
-} /* namespace Impl */
-} /* namespace Kokkos */
+TEST_F( hwloc, query)
+{
+ std::cout << " NUMA[" << Kokkos::hwloc::get_available_numa_count() << "]"
+ << " CORE[" << Kokkos::hwloc::get_available_cores_per_numa() << "]"
+ << " PU[" << Kokkos::hwloc::get_available_threads_per_core() << "]"
+ << std::endl ;
+}
-#endif /* #ifndef KOKKOS_SPINWAIT_HPP */
+}
diff --git a/lib/kokkos/core/src/impl/Kokkos_spinwait.cpp b/lib/kokkos/core/unit_test/TestMemorySpaceTracking.hpp
similarity index 64%
copy from lib/kokkos/core/src/impl/Kokkos_spinwait.cpp
copy to lib/kokkos/core/unit_test/TestMemorySpaceTracking.hpp
index 1e9ff91c2..80ffcc2af 100755
--- a/lib/kokkos/core/src/impl/Kokkos_spinwait.cpp
+++ b/lib/kokkos/core/unit_test/TestMemorySpaceTracking.hpp
@@ -1,80 +1,100 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
-#include <Kokkos_Macros.hpp>
-#include <impl/Kokkos_spinwait.hpp>
+#include <gtest/gtest.h>
+
+#include <iostream>
+#include <Kokkos_Core.hpp>
/*--------------------------------------------------------------------------*/
-#if ( KOKKOS_ENABLE_ASM )
- #if defined( __arm__ )
- /* No-operation instruction to idle the thread. */
- #define YIELD asm volatile("nop")
- #else
- /* Pause instruction to prevent excess processor bus usage */
- #define YIELD asm volatile("pause\n":::"memory")
- #endif
-#elif defined( KOKKOS_HAVE_WINTHREAD )
- #include <process.h>
- #define YIELD Sleep(0)
-#else
- #include <sched.h>
- #define YIELD sched_yield()
-#endif
+namespace {
-/*--------------------------------------------------------------------------*/
+template<class Arg1>
+class TestMemorySpace {
+public:
+
+ typedef typename Arg1::memory_space MemorySpace;
+ TestMemorySpace() { run_test(); }
+
+ void run_test()
+ {
+
+#if ! defined( KOKKOS_USING_EXPERIMENTAL_VIEW )
+
+ Kokkos::View<int* ,Arg1> invalid;
+ ASSERT_EQ(0u, invalid.tracker().ref_count() );
+
+ {
+ Kokkos::View<int* ,Arg1> a("A",10);
+
+ ASSERT_EQ(1u, a.tracker().ref_count() );
+
+ {
+ Kokkos::View<int* ,Arg1> b = a;
+ ASSERT_EQ(2u, b.tracker().ref_count() );
+
+ Kokkos::View<int* ,Arg1> D("D",10);
+ ASSERT_EQ(1u, D.tracker().ref_count() );
+
+ {
+ Kokkos::View<int* ,Arg1> E("E",10);
+ ASSERT_EQ(1u, E.tracker().ref_count() );
+ }
+
+ ASSERT_EQ(2u, b.tracker().ref_count() );
+ }
+ ASSERT_EQ(1u, a.tracker().ref_count() );
+ }
+
+#endif
-namespace Kokkos {
-namespace Impl {
-#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
-void spinwait( volatile int & flag , const int value )
-{
- while ( value == flag ) {
- YIELD ;
}
+};
+
}
-#endif
-} /* namespace Impl */
-} /* namespace Kokkos */
+/*--------------------------------------------------------------------------*/
+
+
diff --git a/lib/kokkos/core/unit_test/TestOpenMP.cpp b/lib/kokkos/core/unit_test/TestOpenMP.cpp
new file mode 100755
index 000000000..8d4bcd1e2
--- /dev/null
+++ b/lib/kokkos/core/unit_test/TestOpenMP.cpp
@@ -0,0 +1,375 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <gtest/gtest.h>
+
+#include <Kokkos_Core.hpp>
+
+//----------------------------------------------------------------------------
+
+#include <TestViewImpl.hpp>
+#include <TestAtomic.hpp>
+
+#include <TestViewAPI.hpp>
+#include <TestViewSubview.hpp>
+
+#include <TestSharedAlloc.hpp>
+#include <TestViewMapping.hpp>
+
+#include <TestRange.hpp>
+#include <TestTeam.hpp>
+#include <TestReduce.hpp>
+#include <TestScan.hpp>
+#include <TestAggregate.hpp>
+#include <TestAggregateReduction.hpp>
+#include <TestCompilerMacros.hpp>
+#include <TestCXX11.hpp>
+#include <TestCXX11Deduction.hpp>
+#include <TestTeamVector.hpp>
+#include <TestMemorySpaceTracking.hpp>
+#include <TestTemplateMetaFunctions.hpp>
+
+namespace Test {
+
+class openmp : public ::testing::Test {
+protected:
+ static void SetUpTestCase()
+ {
+ const unsigned numa_count = Kokkos::hwloc::get_available_numa_count();
+ const unsigned cores_per_numa = Kokkos::hwloc::get_available_cores_per_numa();
+ const unsigned threads_per_core = Kokkos::hwloc::get_available_threads_per_core();
+
+ const unsigned threads_count = std::max( 1u , numa_count ) *
+ std::max( 2u , ( cores_per_numa * threads_per_core ) / 2 );
+
+ Kokkos::OpenMP::initialize( threads_count );
+ Kokkos::OpenMP::print_configuration( std::cout , true );
+ }
+
+ static void TearDownTestCase()
+ {
+ Kokkos::OpenMP::finalize();
+
+ omp_set_num_threads(1);
+
+ ASSERT_EQ( 1 , omp_get_max_threads() );
+ }
+};
+
+
+TEST_F( openmp , impl_shared_alloc ) {
+ test_shared_alloc< Kokkos::HostSpace , Kokkos::OpenMP >();
+}
+
+TEST_F( openmp , impl_view_mapping ) {
+ test_view_mapping< Kokkos::OpenMP >();
+ test_view_mapping_subview< Kokkos::OpenMP >();
+ test_view_mapping_operator< Kokkos::OpenMP >();
+ TestViewMappingAtomic< Kokkos::OpenMP >::run();
+}
+
+TEST_F( openmp, view_impl) {
+ test_view_impl< Kokkos::OpenMP >();
+}
+
+TEST_F( openmp, view_api) {
+ TestViewAPI< double , Kokkos::OpenMP >();
+}
+
+
+TEST_F( openmp, view_subview_auto_1d_left ) {
+ TestViewSubview::test_auto_1d< Kokkos::LayoutLeft,Kokkos::OpenMP >();
+}
+
+TEST_F( openmp, view_subview_auto_1d_right ) {
+ TestViewSubview::test_auto_1d< Kokkos::LayoutRight,Kokkos::OpenMP >();
+}
+
+TEST_F( openmp, view_subview_auto_1d_stride ) {
+ TestViewSubview::test_auto_1d< Kokkos::LayoutStride,Kokkos::OpenMP >();
+}
+
+TEST_F( openmp, view_subview_assign_strided ) {
+ TestViewSubview::test_1d_strided_assignment< Kokkos::OpenMP >();
+}
+
+TEST_F( openmp, view_subview_left_0 ) {
+ TestViewSubview::test_left_0< Kokkos::OpenMP >();
+}
+
+TEST_F( openmp, view_subview_left_1 ) {
+ TestViewSubview::test_left_1< Kokkos::OpenMP >();
+}
+
+TEST_F( openmp, view_subview_left_2 ) {
+ TestViewSubview::test_left_2< Kokkos::OpenMP >();
+}
+
+TEST_F( openmp, view_subview_left_3 ) {
+ TestViewSubview::test_left_3< Kokkos::OpenMP >();
+}
+
+TEST_F( openmp, view_subview_right_0 ) {
+ TestViewSubview::test_right_0< Kokkos::OpenMP >();
+}
+
+TEST_F( openmp, view_subview_right_1 ) {
+ TestViewSubview::test_right_1< Kokkos::OpenMP >();
+}
+
+TEST_F( openmp, view_subview_right_3 ) {
+ TestViewSubview::test_right_3< Kokkos::OpenMP >();
+}
+
+
+
+TEST_F( openmp , range_tag )
+{
+ TestRange< Kokkos::OpenMP >::test_for(1000);
+ TestRange< Kokkos::OpenMP >::test_reduce(1000);
+ TestRange< Kokkos::OpenMP >::test_scan(1000);
+}
+
+TEST_F( openmp , team_tag )
+{
+ TestTeamPolicy< Kokkos::OpenMP >::test_for(1000);
+ TestTeamPolicy< Kokkos::OpenMP >::test_reduce(1000);
+}
+
+TEST_F( openmp, long_reduce) {
+ TestReduce< long , Kokkos::OpenMP >( 1000000 );
+}
+
+TEST_F( openmp, double_reduce) {
+ TestReduce< double , Kokkos::OpenMP >( 1000000 );
+}
+
+TEST_F( openmp, long_reduce_dynamic ) {
+ TestReduceDynamic< long , Kokkos::OpenMP >( 1000000 );
+}
+
+TEST_F( openmp, double_reduce_dynamic ) {
+ TestReduceDynamic< double , Kokkos::OpenMP >( 1000000 );
+}
+
+TEST_F( openmp, long_reduce_dynamic_view ) {
+ TestReduceDynamicView< long , Kokkos::OpenMP >( 1000000 );
+}
+
+TEST_F( openmp, team_long_reduce) {
+ TestReduceTeam< long , Kokkos::OpenMP >( 100000 );
+}
+
+TEST_F( openmp, team_double_reduce) {
+ TestReduceTeam< double , Kokkos::OpenMP >( 100000 );
+}
+
+TEST_F( openmp, team_shared_request) {
+ TestSharedTeam< Kokkos::OpenMP >();
+}
+
+
+TEST_F( openmp , atomics )
+{
+ const int loop_count = 1e4 ;
+
+ ASSERT_TRUE( ( TestAtomic::Loop<int,Kokkos::OpenMP>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<int,Kokkos::OpenMP>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<int,Kokkos::OpenMP>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned int,Kokkos::OpenMP>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned int,Kokkos::OpenMP>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned int,Kokkos::OpenMP>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<long int,Kokkos::OpenMP>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<long int,Kokkos::OpenMP>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<long int,Kokkos::OpenMP>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned long int,Kokkos::OpenMP>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned long int,Kokkos::OpenMP>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned long int,Kokkos::OpenMP>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<long long int,Kokkos::OpenMP>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<long long int,Kokkos::OpenMP>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<long long int,Kokkos::OpenMP>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<double,Kokkos::OpenMP>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<double,Kokkos::OpenMP>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<double,Kokkos::OpenMP>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<float,Kokkos::OpenMP>(100,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<float,Kokkos::OpenMP>(100,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<float,Kokkos::OpenMP>(100,3) ) );
+
+#if defined( KOKKOS_ENABLE_ASM )
+ ASSERT_TRUE( ( TestAtomic::Loop<Kokkos::complex<double> ,Kokkos::OpenMP>(100,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<Kokkos::complex<double> ,Kokkos::OpenMP>(100,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<Kokkos::complex<double> ,Kokkos::OpenMP>(100,3) ) );
+#endif
+}
+
+TEST_F( openmp , view_remap )
+{
+ enum { N0 = 3 , N1 = 2 , N2 = 8 , N3 = 9 };
+
+ typedef Kokkos::View< double*[N1][N2][N3] ,
+ Kokkos::LayoutRight ,
+ Kokkos::OpenMP > output_type ;
+
+ typedef Kokkos::View< int**[N2][N3] ,
+ Kokkos::LayoutLeft ,
+ Kokkos::OpenMP > input_type ;
+
+ typedef Kokkos::View< int*[N0][N2][N3] ,
+ Kokkos::LayoutLeft ,
+ Kokkos::OpenMP > diff_type ;
+
+ output_type output( "output" , N0 );
+ input_type input ( "input" , N0 , N1 );
+ diff_type diff ( "diff" , N0 );
+
+ int value = 0 ;
+ for ( size_t i3 = 0 ; i3 < N3 ; ++i3 ) {
+ for ( size_t i2 = 0 ; i2 < N2 ; ++i2 ) {
+ for ( size_t i1 = 0 ; i1 < N1 ; ++i1 ) {
+ for ( size_t i0 = 0 ; i0 < N0 ; ++i0 ) {
+ input(i0,i1,i2,i3) = ++value ;
+ }}}}
+
+ // Kokkos::deep_copy( diff , input ); // would throw due to incompatible shapes
+ Kokkos::deep_copy( output , input );
+
+ value = 0 ;
+ for ( size_t i3 = 0 ; i3 < N3 ; ++i3 ) {
+ for ( size_t i2 = 0 ; i2 < N2 ; ++i2 ) {
+ for ( size_t i1 = 0 ; i1 < N1 ; ++i1 ) {
+ for ( size_t i0 = 0 ; i0 < N0 ; ++i0 ) {
+ ++value ;
+ ASSERT_EQ( value , ((int) output(i0,i1,i2,i3) ) );
+ }}}}
+}
+
+//----------------------------------------------------------------------------
+
+
+TEST_F( openmp , view_aggregate )
+{
+ TestViewAggregate< Kokkos::OpenMP >();
+ TestViewAggregateReduction< Kokkos::OpenMP >();
+}
+
+//----------------------------------------------------------------------------
+
+TEST_F( openmp , scan )
+{
+ TestScan< Kokkos::OpenMP >::test_range( 1 , 1000 );
+ TestScan< Kokkos::OpenMP >( 1000000 );
+ TestScan< Kokkos::OpenMP >( 10000000 );
+ Kokkos::OpenMP::fence();
+}
+
+
+TEST_F( openmp , team_scan )
+{
+ TestScanTeam< Kokkos::OpenMP >( 10000 );
+ TestScanTeam< Kokkos::OpenMP >( 10000 );
+}
+
+//----------------------------------------------------------------------------
+
+TEST_F( openmp , compiler_macros )
+{
+ ASSERT_TRUE( ( TestCompilerMacros::Test< Kokkos::OpenMP >() ) );
+}
+
+//----------------------------------------------------------------------------
+
+TEST_F( openmp , memory_space )
+{
+ TestMemorySpace< Kokkos::OpenMP >();
+}
+
+//----------------------------------------------------------------------------
+
+TEST_F( openmp , template_meta_functions )
+{
+ TestTemplateMetaFunctions<int, Kokkos::OpenMP >();
+}
+
+//----------------------------------------------------------------------------
+
+#if defined( KOKKOS_HAVE_CXX11 ) && defined( KOKKOS_HAVE_DEFAULT_DEVICE_TYPE_OPENMP )
+TEST_F( openmp , cxx11 )
+{
+ if ( Kokkos::Impl::is_same< Kokkos::DefaultExecutionSpace , Kokkos::OpenMP >::value ) {
+ ASSERT_TRUE( ( TestCXX11::Test< Kokkos::OpenMP >(1) ) );
+ ASSERT_TRUE( ( TestCXX11::Test< Kokkos::OpenMP >(2) ) );
+ ASSERT_TRUE( ( TestCXX11::Test< Kokkos::OpenMP >(3) ) );
+ ASSERT_TRUE( ( TestCXX11::Test< Kokkos::OpenMP >(4) ) );
+ }
+}
+#endif
+
+#if defined (KOKKOS_HAVE_CXX11)
+TEST_F( openmp , reduction_deduction )
+{
+ TestCXX11::test_reduction_deduction< Kokkos::OpenMP >();
+}
+
+TEST_F( openmp , team_vector )
+{
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::OpenMP >(0) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::OpenMP >(1) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::OpenMP >(2) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::OpenMP >(3) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::OpenMP >(4) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::OpenMP >(5) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::OpenMP >(6) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::OpenMP >(7) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::OpenMP >(8) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::OpenMP >(9) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::OpenMP >(10) ) );
+}
+#endif
+} // namespace test
+
diff --git a/lib/kokkos/core/unit_test/TestQthread.cpp b/lib/kokkos/core/unit_test/TestQthread.cpp
new file mode 100755
index 000000000..19bfa6bde
--- /dev/null
+++ b/lib/kokkos/core/unit_test/TestQthread.cpp
@@ -0,0 +1,283 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <gtest/gtest.h>
+
+#include <Kokkos_Core.hpp>
+#include <Kokkos_Qthread.hpp>
+
+#include <Qthread/Kokkos_Qthread_TaskPolicy.hpp>
+
+//----------------------------------------------------------------------------
+
+#include <TestViewImpl.hpp>
+#include <TestAtomic.hpp>
+
+#include <TestViewAPI.hpp>
+
+#include <TestTeam.hpp>
+#include <TestRange.hpp>
+#include <TestReduce.hpp>
+#include <TestScan.hpp>
+#include <TestAggregate.hpp>
+#include <TestCompilerMacros.hpp>
+#include <TestTaskPolicy.hpp>
+// #include <TestTeamVector.hpp>
+
+namespace Test {
+
+class qthread : public ::testing::Test {
+protected:
+ static void SetUpTestCase()
+ {
+ const unsigned numa_count = Kokkos::hwloc::get_available_numa_count();
+ const unsigned cores_per_numa = Kokkos::hwloc::get_available_cores_per_numa();
+ const unsigned threads_per_core = Kokkos::hwloc::get_available_threads_per_core();
+
+ int threads_count = std::max( 1u , numa_count )
+ * std::max( 2u , ( cores_per_numa * threads_per_core ) / 2 );
+ Kokkos::Qthread::initialize( threads_count );
+ Kokkos::Qthread::print_configuration( std::cout , true );
+ }
+
+ static void TearDownTestCase()
+ {
+ Kokkos::Qthread::finalize();
+ }
+};
+
+TEST_F( qthread , compiler_macros )
+{
+ ASSERT_TRUE( ( TestCompilerMacros::Test< Kokkos::Qthread >() ) );
+}
+
+TEST_F( qthread, view_impl) {
+ test_view_impl< Kokkos::Qthread >();
+}
+
+TEST_F( qthread, view_api) {
+ TestViewAPI< double , Kokkos::Qthread >();
+}
+
+TEST_F( qthread , range_tag )
+{
+ TestRange< Kokkos::Qthread >::test_for(1000);
+ TestRange< Kokkos::Qthread >::test_reduce(1000);
+ TestRange< Kokkos::Qthread >::test_scan(1000);
+}
+
+TEST_F( qthread , team_tag )
+{
+ TestTeamPolicy< Kokkos::Qthread >::test_for( 1000 );
+ TestTeamPolicy< Kokkos::Qthread >::test_reduce( 1000 );
+}
+
+TEST_F( qthread, long_reduce) {
+ TestReduce< long , Kokkos::Qthread >( 1000000 );
+}
+
+TEST_F( qthread, double_reduce) {
+ TestReduce< double , Kokkos::Qthread >( 1000000 );
+}
+
+TEST_F( qthread, long_reduce_dynamic ) {
+ TestReduceDynamic< long , Kokkos::Qthread >( 1000000 );
+}
+
+TEST_F( qthread, double_reduce_dynamic ) {
+ TestReduceDynamic< double , Kokkos::Qthread >( 1000000 );
+}
+
+TEST_F( qthread, long_reduce_dynamic_view ) {
+ TestReduceDynamicView< long , Kokkos::Qthread >( 1000000 );
+}
+
+TEST_F( qthread, team_long_reduce) {
+ TestReduceTeam< long , Kokkos::Qthread >( 1000000 );
+}
+
+TEST_F( qthread, team_double_reduce) {
+ TestReduceTeam< double , Kokkos::Qthread >( 1000000 );
+}
+
+
+TEST_F( qthread , atomics )
+{
+ const int loop_count = 1e4 ;
+
+ ASSERT_TRUE( ( TestAtomic::Loop<int,Kokkos::Qthread>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<int,Kokkos::Qthread>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<int,Kokkos::Qthread>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned int,Kokkos::Qthread>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned int,Kokkos::Qthread>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned int,Kokkos::Qthread>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<long int,Kokkos::Qthread>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<long int,Kokkos::Qthread>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<long int,Kokkos::Qthread>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned long int,Kokkos::Qthread>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned long int,Kokkos::Qthread>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned long int,Kokkos::Qthread>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<long long int,Kokkos::Qthread>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<long long int,Kokkos::Qthread>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<long long int,Kokkos::Qthread>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<double,Kokkos::Qthread>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<double,Kokkos::Qthread>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<double,Kokkos::Qthread>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<float,Kokkos::Qthread>(100,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<float,Kokkos::Qthread>(100,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<float,Kokkos::Qthread>(100,3) ) );
+
+#if defined( KOKKOS_ENABLE_ASM )
+ ASSERT_TRUE( ( TestAtomic::Loop<Kokkos::complex<double> ,Kokkos::Qthread>(100,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<Kokkos::complex<double> ,Kokkos::Qthread>(100,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<Kokkos::complex<double> ,Kokkos::Qthread>(100,3) ) );
+#endif
+
+}
+
+TEST_F( qthread , view_remap )
+{
+ enum { N0 = 3 , N1 = 2 , N2 = 8 , N3 = 9 };
+
+ typedef Kokkos::View< double*[N1][N2][N3] ,
+ Kokkos::LayoutRight ,
+ Kokkos::Qthread > output_type ;
+
+ typedef Kokkos::View< int**[N2][N3] ,
+ Kokkos::LayoutLeft ,
+ Kokkos::Qthread > input_type ;
+
+ typedef Kokkos::View< int*[N0][N2][N3] ,
+ Kokkos::LayoutLeft ,
+ Kokkos::Qthread > diff_type ;
+
+ output_type output( "output" , N0 );
+ input_type input ( "input" , N0 , N1 );
+ diff_type diff ( "diff" , N0 );
+
+ int value = 0 ;
+ for ( size_t i3 = 0 ; i3 < N3 ; ++i3 ) {
+ for ( size_t i2 = 0 ; i2 < N2 ; ++i2 ) {
+ for ( size_t i1 = 0 ; i1 < N1 ; ++i1 ) {
+ for ( size_t i0 = 0 ; i0 < N0 ; ++i0 ) {
+ input(i0,i1,i2,i3) = ++value ;
+ }}}}
+
+ // Kokkos::deep_copy( diff , input ); // would throw due to incompatible shapes
+ Kokkos::deep_copy( output , input );
+
+ value = 0 ;
+ for ( size_t i3 = 0 ; i3 < N3 ; ++i3 ) {
+ for ( size_t i2 = 0 ; i2 < N2 ; ++i2 ) {
+ for ( size_t i1 = 0 ; i1 < N1 ; ++i1 ) {
+ for ( size_t i0 = 0 ; i0 < N0 ; ++i0 ) {
+ ++value ;
+ ASSERT_EQ( value , ((int) output(i0,i1,i2,i3) ) );
+ }}}}
+}
+
+//----------------------------------------------------------------------------
+
+TEST_F( qthread , view_aggregate )
+{
+ TestViewAggregate< Kokkos::Qthread >();
+}
+
+//----------------------------------------------------------------------------
+
+TEST_F( qthread , scan )
+{
+ TestScan< Kokkos::Qthread >::test_range( 1 , 1000 );
+ TestScan< Kokkos::Qthread >( 1000000 );
+ TestScan< Kokkos::Qthread >( 10000000 );
+ Kokkos::Qthread::fence();
+}
+
+TEST_F( qthread, team_shared ) {
+ TestSharedTeam< Kokkos::Qthread >();
+}
+
+TEST_F( qthread , team_scan )
+{
+ TestScanTeam< Kokkos::Qthread >( 10 );
+ TestScanTeam< Kokkos::Qthread >( 10000 );
+}
+
+#if defined (KOKKOS_HAVE_CXX11) && 0 /* disable */
+TEST_F( qthread , team_vector )
+{
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Qthread >(0) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Qthread >(1) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Qthread >(2) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Qthread >(3) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Qthread >(4) ) );
+}
+#endif
+
+//----------------------------------------------------------------------------
+
+TEST_F( qthread , task_policy )
+{
+ TestTaskPolicy::test_task_dep< Kokkos::Qthread >( 10 );
+ for ( long i = 0 ; i < 25 ; ++i ) TestTaskPolicy::test_fib< Kokkos::Qthread >(i);
+ for ( long i = 0 ; i < 35 ; ++i ) TestTaskPolicy::test_fib2< Kokkos::Qthread >(i);
+}
+
+#if defined( KOKKOS_HAVE_CXX11 )
+TEST_F( qthread , task_team )
+{
+ std::cout << "qthread.task_team test disabled due to an unresolved error that causes the test to hang." << std::endl ;
+ // TestTaskPolicy::test_task_team< Kokkos::Qthread >(1000);
+}
+#endif
+
+//----------------------------------------------------------------------------
+
+} // namespace test
+
diff --git a/lib/kokkos/core/unit_test/TestRange.hpp b/lib/kokkos/core/unit_test/TestRange.hpp
new file mode 100755
index 000000000..1af531327
--- /dev/null
+++ b/lib/kokkos/core/unit_test/TestRange.hpp
@@ -0,0 +1,171 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <stdio.h>
+
+#include <Kokkos_Core.hpp>
+
+/*--------------------------------------------------------------------------*/
+
+namespace Test {
+namespace {
+
+template< class ExecSpace >
+struct TestRange {
+
+ typedef int value_type ; ///< typedef required for the parallel_reduce
+
+ typedef Kokkos::View<int*,ExecSpace> view_type ;
+
+ view_type m_flags ;
+
+ struct VerifyInitTag {};
+ struct ResetTag {};
+ struct VerifyResetTag {};
+
+ TestRange( const size_t N )
+ : m_flags( Kokkos::ViewAllocateWithoutInitializing("flags"), N )
+ {}
+
+ static void test_for( const size_t N )
+ {
+ TestRange functor(N);
+
+ typename view_type::HostMirror host_flags = Kokkos::create_mirror_view( functor.m_flags );
+
+ Kokkos::parallel_for( Kokkos::RangePolicy<ExecSpace>(0,N) , functor );
+ Kokkos::parallel_for( Kokkos::RangePolicy<ExecSpace,VerifyInitTag>(0,N) , functor );
+
+ Kokkos::deep_copy( host_flags , functor.m_flags );
+
+ size_t error_count = 0 ;
+ for ( size_t i = 0 ; i < N ; ++i ) {
+ if ( int(i) != host_flags(i) ) ++error_count ;
+ }
+ ASSERT_EQ( error_count , size_t(0) );
+
+ Kokkos::parallel_for( Kokkos::RangePolicy<ExecSpace,ResetTag>(0,N) , functor );
+ Kokkos::parallel_for( std::string("TestKernelFor") , Kokkos::RangePolicy<ExecSpace,VerifyResetTag>(0,N) , functor );
+
+ Kokkos::deep_copy( host_flags , functor.m_flags );
+
+ error_count = 0 ;
+ for ( size_t i = 0 ; i < N ; ++i ) {
+ if ( int(2*i) != host_flags(i) ) ++error_count ;
+ }
+ ASSERT_EQ( error_count , size_t(0) );
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const int i ) const
+ { m_flags(i) = i ; }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const VerifyInitTag & , const int i ) const
+ { if ( i != m_flags(i) ) { printf("TestRange::test_for error at %d != %d\n",i,m_flags(i)); } }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const ResetTag & , const int i ) const
+ { m_flags(i) = 2 * m_flags(i); }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const VerifyResetTag & , const int i ) const
+ { if ( 2 * i != m_flags(i) ) { printf("TestRange::test_for error at %d != %d\n",i,m_flags(i)); } }
+
+ //----------------------------------------
+
+ struct OffsetTag {};
+
+ static void test_reduce( const size_t N )
+ {
+ TestRange functor(N);
+ int total = 0 ;
+
+ Kokkos::parallel_for( Kokkos::RangePolicy<ExecSpace>(0,N) , functor );
+
+ Kokkos::parallel_reduce( "TestKernelReduce" , Kokkos::RangePolicy<ExecSpace>(0,N) , functor , total );
+ // sum( 0 .. N-1 )
+ ASSERT_EQ( size_t((N-1)*(N)/2) , size_t(total) );
+
+ Kokkos::parallel_reduce( Kokkos::RangePolicy<ExecSpace,OffsetTag>(0,N) , functor , total );
+ // sum( 1 .. N )
+ ASSERT_EQ( size_t((N)*(N+1)/2) , size_t(total) );
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const int i , value_type & update ) const
+ { update += m_flags(i); }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const OffsetTag & , const int i , value_type & update ) const
+ { update += 1 + m_flags(i); }
+
+ //----------------------------------------
+
+ static void test_scan( const size_t N )
+ {
+ TestRange functor(N);
+
+ Kokkos::parallel_for( Kokkos::RangePolicy<ExecSpace>(0,N) , functor );
+
+ Kokkos::parallel_scan( "TestKernelScan" , Kokkos::RangePolicy<ExecSpace,OffsetTag>(0,N) , functor );
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const OffsetTag & , const int i , value_type & update , bool final ) const
+ {
+ update += m_flags(i);
+
+ if ( final ) {
+ if ( update != (i*(i+1))/2 ) {
+ printf("TestRange::test_scan error %d : %d != %d\n",i,(i*(i+1))/2,m_flags(i));
+ }
+ }
+ }
+};
+
+} /* namespace */
+} /* namespace Test */
+
+/*--------------------------------------------------------------------------*/
+
diff --git a/lib/kokkos/core/unit_test/TestReduce.hpp b/lib/kokkos/core/unit_test/TestReduce.hpp
new file mode 100755
index 000000000..30b94d40f
--- /dev/null
+++ b/lib/kokkos/core/unit_test/TestReduce.hpp
@@ -0,0 +1,371 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <stdexcept>
+#include <sstream>
+#include <iostream>
+
+#include <Kokkos_Core.hpp>
+
+/*--------------------------------------------------------------------------*/
+
+namespace Test {
+
+template< typename ScalarType , class DeviceType >
+class ReduceFunctor
+{
+public:
+ typedef DeviceType execution_space ;
+ typedef typename execution_space::size_type size_type ;
+
+ struct value_type {
+ ScalarType value[3] ;
+ };
+
+ const size_type nwork ;
+
+ ReduceFunctor( const size_type & arg_nwork ) : nwork( arg_nwork ) {}
+
+ ReduceFunctor( const ReduceFunctor & rhs )
+ : nwork( rhs.nwork ) {}
+
+/*
+ KOKKOS_INLINE_FUNCTION
+ void init( value_type & dst ) const
+ {
+ dst.value[0] = 0 ;
+ dst.value[1] = 0 ;
+ dst.value[2] = 0 ;
+ }
+*/
+
+ KOKKOS_INLINE_FUNCTION
+ void join( volatile value_type & dst ,
+ const volatile value_type & src ) const
+ {
+ dst.value[0] += src.value[0] ;
+ dst.value[1] += src.value[1] ;
+ dst.value[2] += src.value[2] ;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( size_type iwork , value_type & dst ) const
+ {
+ dst.value[0] += 1 ;
+ dst.value[1] += iwork + 1 ;
+ dst.value[2] += nwork - iwork ;
+ }
+};
+
+template< class DeviceType >
+class ReduceFunctorFinal : public ReduceFunctor< long , DeviceType > {
+public:
+
+ typedef typename ReduceFunctor< long , DeviceType >::value_type value_type ;
+
+ ReduceFunctorFinal( const size_t n )
+ : ReduceFunctor<long,DeviceType>(n)
+ {}
+
+ KOKKOS_INLINE_FUNCTION
+ void final( value_type & dst ) const
+ {
+ dst.value[0] = - dst.value[0] ;
+ dst.value[1] = - dst.value[1] ;
+ dst.value[2] = - dst.value[2] ;
+ }
+};
+
+template< typename ScalarType , class DeviceType >
+class RuntimeReduceFunctor
+{
+public:
+ // Required for functor:
+ typedef DeviceType execution_space ;
+ typedef ScalarType value_type[] ;
+ const unsigned value_count ;
+
+
+ // Unit test details:
+
+ typedef typename execution_space::size_type size_type ;
+
+ const size_type nwork ;
+
+ RuntimeReduceFunctor( const size_type arg_nwork ,
+ const size_type arg_count )
+ : value_count( arg_count )
+ , nwork( arg_nwork ) {}
+
+/*
+ KOKKOS_INLINE_FUNCTION
+ void init( value_type dst ) const
+ {
+ for ( unsigned i = 0 ; i < value_count ; ++i ) dst[i] = 0 ;
+ }
+*/
+
+ KOKKOS_INLINE_FUNCTION
+ void join( volatile ScalarType dst[] ,
+ const volatile ScalarType src[] ) const
+ {
+ for ( unsigned i = 0 ; i < value_count ; ++i ) dst[i] += src[i] ;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( size_type iwork , ScalarType dst[] ) const
+ {
+ const size_type tmp[3] = { 1 , iwork + 1 , nwork - iwork };
+
+ for ( size_type i = 0 ; i < value_count ; ++i ) {
+ dst[i] += tmp[ i % 3 ];
+ }
+ }
+};
+
+template< class DeviceType >
+class RuntimeReduceFunctorFinal : public RuntimeReduceFunctor< long , DeviceType > {
+public:
+
+ typedef RuntimeReduceFunctor< long , DeviceType > base_type ;
+ typedef typename base_type::value_type value_type ;
+ typedef long scalar_type ;
+
+ RuntimeReduceFunctorFinal( const size_t theNwork , const size_t count ) : base_type(theNwork,count) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void final( value_type dst ) const
+ {
+ for ( unsigned i = 0 ; i < base_type::value_count ; ++i ) {
+ dst[i] = - dst[i] ;
+ }
+ }
+};
+} // namespace Test
+
+namespace {
+
+template< typename ScalarType , class DeviceType >
+class TestReduce
+{
+public:
+ typedef DeviceType execution_space ;
+ typedef typename execution_space::size_type size_type ;
+
+ //------------------------------------
+
+ TestReduce( const size_type & nwork )
+ {
+ run_test(nwork);
+ run_test_final(nwork);
+ }
+
+ void run_test( const size_type & nwork )
+ {
+ typedef Test::ReduceFunctor< ScalarType , execution_space > functor_type ;
+ typedef typename functor_type::value_type value_type ;
+
+ enum { Count = 3 };
+ enum { Repeat = 100 };
+
+ value_type result[ Repeat ];
+
+ const unsigned long nw = nwork ;
+ const unsigned long nsum = nw % 2 ? nw * (( nw + 1 )/2 )
+ : (nw/2) * ( nw + 1 );
+
+ for ( unsigned i = 0 ; i < Repeat ; ++i ) {
+ Kokkos::parallel_reduce( nwork , functor_type(nwork) , result[i] );
+ }
+
+ for ( unsigned i = 0 ; i < Repeat ; ++i ) {
+ for ( unsigned j = 0 ; j < Count ; ++j ) {
+ const unsigned long correct = 0 == j % 3 ? nw : nsum ;
+ ASSERT_EQ( (ScalarType) correct , result[i].value[j] );
+ }
+ }
+ }
+
+ void run_test_final( const size_type & nwork )
+ {
+ typedef Test::ReduceFunctorFinal< execution_space > functor_type ;
+ typedef typename functor_type::value_type value_type ;
+
+ enum { Count = 3 };
+ enum { Repeat = 100 };
+
+ value_type result[ Repeat ];
+
+ const unsigned long nw = nwork ;
+ const unsigned long nsum = nw % 2 ? nw * (( nw + 1 )/2 )
+ : (nw/2) * ( nw + 1 );
+
+ for ( unsigned i = 0 ; i < Repeat ; ++i ) {
+ Kokkos::parallel_reduce( nwork , functor_type(nwork) , result[i] );
+ }
+
+ for ( unsigned i = 0 ; i < Repeat ; ++i ) {
+ for ( unsigned j = 0 ; j < Count ; ++j ) {
+ const unsigned long correct = 0 == j % 3 ? nw : nsum ;
+ ASSERT_EQ( (ScalarType) correct , - result[i].value[j] );
+ }
+ }
+ }
+};
+
+template< typename ScalarType , class DeviceType >
+class TestReduceDynamic
+{
+public:
+ typedef DeviceType execution_space ;
+ typedef typename execution_space::size_type size_type ;
+
+ //------------------------------------
+
+ TestReduceDynamic( const size_type nwork )
+ {
+ run_test_dynamic(nwork);
+ run_test_dynamic_final(nwork);
+ }
+
+ void run_test_dynamic( const size_type nwork )
+ {
+ typedef Test::RuntimeReduceFunctor< ScalarType , execution_space > functor_type ;
+
+ enum { Count = 3 };
+ enum { Repeat = 100 };
+
+ ScalarType result[ Repeat ][ Count ] ;
+
+ const unsigned long nw = nwork ;
+ const unsigned long nsum = nw % 2 ? nw * (( nw + 1 )/2 )
+ : (nw/2) * ( nw + 1 );
+
+ for ( unsigned i = 0 ; i < Repeat ; ++i ) {
+ Kokkos::parallel_reduce( nwork , functor_type(nwork,Count) , result[i] );
+ }
+
+ for ( unsigned i = 0 ; i < Repeat ; ++i ) {
+ for ( unsigned j = 0 ; j < Count ; ++j ) {
+ const unsigned long correct = 0 == j % 3 ? nw : nsum ;
+ ASSERT_EQ( (ScalarType) correct , result[i][j] );
+ }
+ }
+ }
+
+ void run_test_dynamic_final( const size_type nwork )
+ {
+ typedef Test::RuntimeReduceFunctorFinal< execution_space > functor_type ;
+
+ enum { Count = 3 };
+ enum { Repeat = 100 };
+
+ typename functor_type::scalar_type result[ Repeat ][ Count ] ;
+
+ const unsigned long nw = nwork ;
+ const unsigned long nsum = nw % 2 ? nw * (( nw + 1 )/2 )
+ : (nw/2) * ( nw + 1 );
+
+ for ( unsigned i = 0 ; i < Repeat ; ++i ) {
+ Kokkos::parallel_reduce( "TestKernelReduce" , nwork , functor_type(nwork,Count) , result[i] );
+ }
+
+ for ( unsigned i = 0 ; i < Repeat ; ++i ) {
+ for ( unsigned j = 0 ; j < Count ; ++j ) {
+ const unsigned long correct = 0 == j % 3 ? nw : nsum ;
+ ASSERT_EQ( (ScalarType) correct , - result[i][j] );
+ }
+ }
+ }
+};
+
+template< typename ScalarType , class DeviceType >
+class TestReduceDynamicView
+{
+public:
+ typedef DeviceType execution_space ;
+ typedef typename execution_space::size_type size_type ;
+
+ //------------------------------------
+
+ TestReduceDynamicView( const size_type nwork )
+ {
+ run_test_dynamic_view(nwork);
+ }
+
+ void run_test_dynamic_view( const size_type nwork )
+ {
+ typedef Test::RuntimeReduceFunctor< ScalarType , execution_space > functor_type ;
+
+ typedef Kokkos::View< ScalarType* , DeviceType > result_type ;
+ typedef typename result_type::HostMirror result_host_type ;
+
+ const unsigned CountLimit = 23 ;
+
+ const unsigned long nw = nwork ;
+ const unsigned long nsum = nw % 2 ? nw * (( nw + 1 )/2 )
+ : (nw/2) * ( nw + 1 );
+
+ for ( unsigned count = 0 ; count < CountLimit ; ++count ) {
+
+ result_type result("result",count);
+ result_host_type host_result = Kokkos::create_mirror( result );
+
+ // Test result to host pointer:
+
+ std::string str("TestKernelReduce");
+ Kokkos::parallel_reduce( str , nw , functor_type(nw,count) , host_result.ptr_on_device() );
+
+ for ( unsigned j = 0 ; j < count ; ++j ) {
+ const unsigned long correct = 0 == j % 3 ? nw : nsum ;
+ ASSERT_EQ( host_result(j), (ScalarType) correct );
+ host_result(j) = 0 ;
+ }
+ }
+ }
+};
+
+}
+
+/*--------------------------------------------------------------------------*/
+
diff --git a/lib/kokkos/core/src/impl/Kokkos_PhysicalLayout.hpp b/lib/kokkos/core/unit_test/TestScan.hpp
similarity index 56%
copy from lib/kokkos/core/src/impl/Kokkos_PhysicalLayout.hpp
copy to lib/kokkos/core/unit_test/TestScan.hpp
index 0dcb3977a..eb5e833a1 100755
--- a/lib/kokkos/core/src/impl/Kokkos_PhysicalLayout.hpp
+++ b/lib/kokkos/core/unit_test/TestScan.hpp
@@ -1,84 +1,97 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+//
// ************************************************************************
//@HEADER
*/
-#ifndef KOKKOS_PHYSICAL_LAYOUT_HPP
-#define KOKKOS_PHYSICAL_LAYOUT_HPP
+/*--------------------------------------------------------------------------*/
+#include <stdio.h>
-#include <Kokkos_View.hpp>
-namespace Kokkos {
-namespace Impl {
+namespace Test {
+template< class Device , class WorkSpec = size_t >
+struct TestScan {
+ typedef Device execution_space ;
+ typedef long int value_type ;
-struct PhysicalLayout {
- enum LayoutType {Left,Right,Scalar,Error};
- LayoutType layout_type;
- int rank;
- long long int stride[8]; //distance between two neighboring elements in a given dimension
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const int iwork , value_type & update , const bool final_pass ) const
+ {
+ const value_type n = iwork + 1 ;
+ const value_type imbalance = ( (1000 <= n) && (0 == n % 1000) ) ? 1000 : 0 ;
- template< class T , class L , class D , class M >
- PhysicalLayout( const View<T,L,D,M,ViewDefault> & view )
- : layout_type( is_same< typename View<T,L,D,M>::array_layout , LayoutLeft >::value ? Left : (
- is_same< typename View<T,L,D,M>::array_layout , LayoutRight >::value ? Right : Error ))
- , rank( view.Rank )
- {
- for(int i=0;i<8;i++) stride[i] = 0;
- view.stride( stride );
+ // Insert an artificial load imbalance
+
+ for ( value_type i = 0 ; i < imbalance ; ++i ) { ++update ; }
+
+ update += n - imbalance ;
+
+ if ( final_pass ) {
+ const value_type answer = n & 1 ? ( n * ( ( n + 1 ) / 2 ) ) : ( ( n / 2 ) * ( n + 1 ) );
+
+ if ( answer != update ) {
+ printf("TestScan(%d,%ld) != %ld\n",iwork,update,answer);
+ }
}
- #ifdef KOKKOS_HAVE_CUDA
- template< class T , class L , class D , class M >
- PhysicalLayout( const View<T,L,D,M,ViewCudaTexture> & view )
- : layout_type( is_same< typename View<T,L,D,M>::array_layout , LayoutLeft >::value ? Left : (
- is_same< typename View<T,L,D,M>::array_layout , LayoutRight >::value ? Right : Error ))
- , rank( view.Rank )
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void init( value_type & update ) const { update = 0 ; }
+
+ KOKKOS_INLINE_FUNCTION
+ void join( volatile value_type & update ,
+ volatile const value_type & input ) const
+ { update += input ; }
+
+ TestScan( const WorkSpec & N )
+ { parallel_scan( N , *this ); }
+
+ static void test_range( const WorkSpec & begin , const WorkSpec & end )
{
- for(int i=0;i<8;i++) stride[i] = 0;
- view.stride( stride );
+ for ( WorkSpec i = begin ; i < end ; ++i ) {
+ (void) TestScan( i );
+ }
}
- #endif
};
}
-}
-#endif
+
diff --git a/lib/kokkos/core/unit_test/TestSerial.cpp b/lib/kokkos/core/unit_test/TestSerial.cpp
new file mode 100755
index 000000000..dbe94005e
--- /dev/null
+++ b/lib/kokkos/core/unit_test/TestSerial.cpp
@@ -0,0 +1,419 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <gtest/gtest.h>
+
+#include <Kokkos_Core.hpp>
+
+#if ! defined( KOKKOS_USING_EXPERIMENTAL_VIEW )
+
+#include <impl/Kokkos_ViewTileLeft.hpp>
+#include <TestTile.hpp>
+
+#endif
+
+#include <impl/Kokkos_Serial_TaskPolicy.hpp>
+
+//----------------------------------------------------------------------------
+
+#include <TestSharedAlloc.hpp>
+#include <TestViewMapping.hpp>
+
+#include <TestViewImpl.hpp>
+
+#include <TestViewAPI.hpp>
+#include <TestViewOfClass.hpp>
+#include <TestViewSubview.hpp>
+#include <TestAtomic.hpp>
+#include <TestRange.hpp>
+#include <TestTeam.hpp>
+#include <TestReduce.hpp>
+#include <TestScan.hpp>
+#include <TestAggregate.hpp>
+#include <TestAggregateReduction.hpp>
+#include <TestCompilerMacros.hpp>
+#include <TestTaskPolicy.hpp>
+#include <TestCXX11.hpp>
+#include <TestCXX11Deduction.hpp>
+#include <TestTeamVector.hpp>
+#include <TestMemorySpaceTracking.hpp>
+#include <TestTemplateMetaFunctions.hpp>
+
+namespace Test {
+
+class serial : public ::testing::Test {
+protected:
+ static void SetUpTestCase()
+ {
+ Kokkos::HostSpace::execution_space::initialize();
+ }
+ static void TearDownTestCase()
+ {
+ Kokkos::HostSpace::execution_space::finalize();
+ }
+};
+
+TEST_F( serial , impl_shared_alloc ) {
+ test_shared_alloc< Kokkos::HostSpace , Kokkos::Serial >();
+}
+
+TEST_F( serial , impl_view_mapping ) {
+ test_view_mapping< Kokkos::Serial >();
+ test_view_mapping_subview< Kokkos::Serial >();
+ test_view_mapping_operator< Kokkos::Serial >();
+ TestViewMappingAtomic< Kokkos::Serial >::run();
+}
+
+TEST_F( serial, view_impl) {
+ test_view_impl< Kokkos::Serial >();
+}
+
+TEST_F( serial, view_api) {
+ TestViewAPI< double , Kokkos::Serial >();
+}
+
+TEST_F( serial , view_nested_view )
+{
+ ::Test::view_nested_view< Kokkos::Serial >();
+}
+
+TEST_F( serial, view_subview_auto_1d_left ) {
+ TestViewSubview::test_auto_1d< Kokkos::LayoutLeft,Kokkos::Serial >();
+}
+
+TEST_F( serial, view_subview_auto_1d_right ) {
+ TestViewSubview::test_auto_1d< Kokkos::LayoutRight,Kokkos::Serial >();
+}
+
+TEST_F( serial, view_subview_auto_1d_stride ) {
+ TestViewSubview::test_auto_1d< Kokkos::LayoutStride,Kokkos::Serial >();
+}
+
+TEST_F( serial, view_subview_assign_strided ) {
+ TestViewSubview::test_1d_strided_assignment< Kokkos::Serial >();
+}
+
+TEST_F( serial, view_subview_left_0 ) {
+ TestViewSubview::test_left_0< Kokkos::Serial >();
+}
+
+TEST_F( serial, view_subview_left_1 ) {
+ TestViewSubview::test_left_1< Kokkos::Serial >();
+}
+
+TEST_F( serial, view_subview_left_2 ) {
+ TestViewSubview::test_left_2< Kokkos::Serial >();
+}
+
+TEST_F( serial, view_subview_left_3 ) {
+ TestViewSubview::test_left_3< Kokkos::Serial >();
+}
+
+TEST_F( serial, view_subview_right_0 ) {
+ TestViewSubview::test_right_0< Kokkos::Serial >();
+}
+
+TEST_F( serial, view_subview_right_1 ) {
+ TestViewSubview::test_right_1< Kokkos::Serial >();
+}
+
+TEST_F( serial, view_subview_right_3 ) {
+ TestViewSubview::test_right_3< Kokkos::Serial >();
+}
+
+TEST_F( serial , range_tag )
+{
+ TestRange< Kokkos::Serial >::test_for(1000);
+ TestRange< Kokkos::Serial >::test_reduce(1000);
+ TestRange< Kokkos::Serial >::test_scan(1000);
+}
+
+TEST_F( serial , team_tag )
+{
+ TestTeamPolicy< Kokkos::Serial >::test_for( 1000 );
+ TestTeamPolicy< Kokkos::Serial >::test_reduce( 1000 );
+}
+
+TEST_F( serial, long_reduce) {
+ TestReduce< long , Kokkos::Serial >( 1000000 );
+}
+
+TEST_F( serial, double_reduce) {
+ TestReduce< double , Kokkos::Serial >( 1000000 );
+}
+
+TEST_F( serial, long_reduce_dynamic ) {
+ TestReduceDynamic< long , Kokkos::Serial >( 1000000 );
+}
+
+TEST_F( serial, double_reduce_dynamic ) {
+ TestReduceDynamic< double , Kokkos::Serial >( 1000000 );
+}
+
+TEST_F( serial, long_reduce_dynamic_view ) {
+ TestReduceDynamicView< long , Kokkos::Serial >( 1000000 );
+}
+
+TEST_F( serial , scan )
+{
+ TestScan< Kokkos::Serial >::test_range( 1 , 1000 );
+ TestScan< Kokkos::Serial >( 10 );
+ TestScan< Kokkos::Serial >( 10000 );
+}
+
+TEST_F( serial , team_long_reduce) {
+ TestReduceTeam< long , Kokkos::Serial >( 100000 );
+}
+
+TEST_F( serial , team_double_reduce) {
+ TestReduceTeam< double , Kokkos::Serial >( 100000 );
+}
+
+TEST_F( serial , team_shared_request) {
+ TestSharedTeam< Kokkos::Serial >();
+}
+
+TEST_F( serial , team_scan )
+{
+ TestScanTeam< Kokkos::Serial >( 10 );
+ TestScanTeam< Kokkos::Serial >( 10000 );
+}
+
+
+TEST_F( serial , view_remap )
+{
+ enum { N0 = 3 , N1 = 2 , N2 = 8 , N3 = 9 };
+
+ typedef Kokkos::View< double*[N1][N2][N3] ,
+ Kokkos::LayoutRight ,
+ Kokkos::Serial > output_type ;
+
+ typedef Kokkos::View< int**[N2][N3] ,
+ Kokkos::LayoutLeft ,
+ Kokkos::Serial > input_type ;
+
+ typedef Kokkos::View< int*[N0][N2][N3] ,
+ Kokkos::LayoutLeft ,
+ Kokkos::Serial > diff_type ;
+
+ output_type output( "output" , N0 );
+ input_type input ( "input" , N0 , N1 );
+ diff_type diff ( "diff" , N0 );
+
+ int value = 0 ;
+ for ( size_t i3 = 0 ; i3 < N3 ; ++i3 ) {
+ for ( size_t i2 = 0 ; i2 < N2 ; ++i2 ) {
+ for ( size_t i1 = 0 ; i1 < N1 ; ++i1 ) {
+ for ( size_t i0 = 0 ; i0 < N0 ; ++i0 ) {
+ input(i0,i1,i2,i3) = ++value ;
+ }}}}
+
+ // Kokkos::deep_copy( diff , input ); // throw with incompatible shape
+ Kokkos::deep_copy( output , input );
+
+ value = 0 ;
+ for ( size_t i3 = 0 ; i3 < N3 ; ++i3 ) {
+ for ( size_t i2 = 0 ; i2 < N2 ; ++i2 ) {
+ for ( size_t i1 = 0 ; i1 < N1 ; ++i1 ) {
+ for ( size_t i0 = 0 ; i0 < N0 ; ++i0 ) {
+ ++value ;
+ ASSERT_EQ( value , ((int) output(i0,i1,i2,i3) ) );
+ }}}}
+}
+
+//----------------------------------------------------------------------------
+
+TEST_F( serial , view_aggregate )
+{
+ TestViewAggregate< Kokkos::Serial >();
+ TestViewAggregateReduction< Kokkos::Serial >();
+}
+
+//----------------------------------------------------------------------------
+
+TEST_F( serial , atomics )
+{
+ const int loop_count = 1e6 ;
+
+ ASSERT_TRUE( ( TestAtomic::Loop<int,Kokkos::Serial>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<int,Kokkos::Serial>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<int,Kokkos::Serial>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned int,Kokkos::Serial>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned int,Kokkos::Serial>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned int,Kokkos::Serial>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<long int,Kokkos::Serial>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<long int,Kokkos::Serial>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<long int,Kokkos::Serial>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned long int,Kokkos::Serial>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned long int,Kokkos::Serial>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned long int,Kokkos::Serial>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<long long int,Kokkos::Serial>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<long long int,Kokkos::Serial>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<long long int,Kokkos::Serial>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<double,Kokkos::Serial>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<double,Kokkos::Serial>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<double,Kokkos::Serial>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<float,Kokkos::Serial>(100,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<float,Kokkos::Serial>(100,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<float,Kokkos::Serial>(100,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<Kokkos::complex<double> ,Kokkos::Serial>(100,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<Kokkos::complex<double> ,Kokkos::Serial>(100,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<Kokkos::complex<double> ,Kokkos::Serial>(100,3) ) );
+}
+
+//----------------------------------------------------------------------------
+
+#if ! defined( KOKKOS_USING_EXPERIMENTAL_VIEW )
+
+TEST_F( serial, tile_layout )
+{
+ TestTile::test< Kokkos::Serial , 1 , 1 >( 1 , 1 );
+ TestTile::test< Kokkos::Serial , 1 , 1 >( 2 , 3 );
+ TestTile::test< Kokkos::Serial , 1 , 1 >( 9 , 10 );
+
+ TestTile::test< Kokkos::Serial , 2 , 2 >( 1 , 1 );
+ TestTile::test< Kokkos::Serial , 2 , 2 >( 2 , 3 );
+ TestTile::test< Kokkos::Serial , 2 , 2 >( 4 , 4 );
+ TestTile::test< Kokkos::Serial , 2 , 2 >( 9 , 9 );
+
+ TestTile::test< Kokkos::Serial , 2 , 4 >( 9 , 9 );
+ TestTile::test< Kokkos::Serial , 4 , 2 >( 9 , 9 );
+
+ TestTile::test< Kokkos::Serial , 4 , 4 >( 1 , 1 );
+ TestTile::test< Kokkos::Serial , 4 , 4 >( 4 , 4 );
+ TestTile::test< Kokkos::Serial , 4 , 4 >( 9 , 9 );
+ TestTile::test< Kokkos::Serial , 4 , 4 >( 9 , 11 );
+
+ TestTile::test< Kokkos::Serial , 8 , 8 >( 1 , 1 );
+ TestTile::test< Kokkos::Serial , 8 , 8 >( 4 , 4 );
+ TestTile::test< Kokkos::Serial , 8 , 8 >( 9 , 9 );
+ TestTile::test< Kokkos::Serial , 8 , 8 >( 9 , 11 );
+}
+
+#endif
+
+//----------------------------------------------------------------------------
+
+TEST_F( serial , compiler_macros )
+{
+ ASSERT_TRUE( ( TestCompilerMacros::Test< Kokkos::Serial >() ) );
+}
+
+//----------------------------------------------------------------------------
+
+TEST_F( serial , memory_space )
+{
+ TestMemorySpace< Kokkos::Serial >();
+}
+
+//----------------------------------------------------------------------------
+
+TEST_F( serial , task_policy )
+{
+ TestTaskPolicy::test_task_dep< Kokkos::Serial >( 10 );
+ // TestTaskPolicy::test_norm2< Kokkos::Serial >( 1000 );
+ // for ( long i = 0 ; i < 30 ; ++i ) TestTaskPolicy::test_fib< Kokkos::Serial >(i);
+ // for ( long i = 0 ; i < 40 ; ++i ) TestTaskPolicy::test_fib2< Kokkos::Serial >(i);
+ for ( long i = 0 ; i < 20 ; ++i ) TestTaskPolicy::test_fib< Kokkos::Serial >(i);
+ for ( long i = 0 ; i < 25 ; ++i ) TestTaskPolicy::test_fib2< Kokkos::Serial >(i);
+}
+
+#if defined( KOKKOS_HAVE_CXX11 )
+TEST_F( serial , task_team )
+{
+ TestTaskPolicy::test_task_team< Kokkos::Serial >(1000);
+}
+#endif
+
+//----------------------------------------------------------------------------
+
+TEST_F( serial , template_meta_functions )
+{
+ TestTemplateMetaFunctions<int, Kokkos::Serial >();
+}
+
+//----------------------------------------------------------------------------
+
+#if defined( KOKKOS_HAVE_CXX11 ) && defined( KOKKOS_HAVE_DEFAULT_DEVICE_TYPE_SERIAL )
+TEST_F( serial , cxx11 )
+{
+ if ( Kokkos::Impl::is_same< Kokkos::DefaultExecutionSpace , Kokkos::Serial >::value ) {
+ ASSERT_TRUE( ( TestCXX11::Test< Kokkos::Serial >(1) ) );
+ ASSERT_TRUE( ( TestCXX11::Test< Kokkos::Serial >(2) ) );
+ ASSERT_TRUE( ( TestCXX11::Test< Kokkos::Serial >(3) ) );
+ ASSERT_TRUE( ( TestCXX11::Test< Kokkos::Serial >(4) ) );
+ }
+}
+#endif
+
+#if defined (KOKKOS_HAVE_CXX11)
+TEST_F( serial , reduction_deduction )
+{
+ TestCXX11::test_reduction_deduction< Kokkos::Serial >();
+}
+
+TEST_F( serial , team_vector )
+{
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Serial >(0) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Serial >(1) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Serial >(2) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Serial >(3) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Serial >(4) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Serial >(5) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Serial >(6) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Serial >(7) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Serial >(8) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Serial >(9) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Serial >(10) ) );
+}
+#endif
+
+} // namespace Test
+
diff --git a/lib/kokkos/core/unit_test/TestSharedAlloc.hpp b/lib/kokkos/core/unit_test/TestSharedAlloc.hpp
new file mode 100755
index 000000000..060f5f460
--- /dev/null
+++ b/lib/kokkos/core/unit_test/TestSharedAlloc.hpp
@@ -0,0 +1,204 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <gtest/gtest.h>
+
+#include <stdexcept>
+#include <sstream>
+#include <iostream>
+
+#include <Kokkos_Core.hpp>
+
+/*--------------------------------------------------------------------------*/
+
+namespace Test {
+
+struct SharedAllocDestroy {
+
+ volatile int * count ;
+
+ SharedAllocDestroy() = default ;
+ SharedAllocDestroy( int * arg ) : count( arg ) {}
+
+ void destroy_shared_allocation()
+ {
+ Kokkos::atomic_fetch_add( count , 1 );
+ }
+
+};
+
+template< class MemorySpace , class ExecutionSpace >
+void test_shared_alloc()
+{
+#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
+
+ typedef const Kokkos::Experimental::Impl::SharedAllocationHeader Header ;
+ typedef Kokkos::Experimental::Impl::SharedAllocationTracker Tracker ;
+ typedef Kokkos::Experimental::Impl::SharedAllocationRecord< void , void > RecordBase ;
+ typedef Kokkos::Experimental::Impl::SharedAllocationRecord< MemorySpace , void > RecordMemS ;
+ typedef Kokkos::Experimental::Impl::SharedAllocationRecord< MemorySpace , SharedAllocDestroy > RecordFull ;
+
+ static_assert( sizeof(Tracker) == sizeof(int*), "SharedAllocationTracker has wrong size!" );
+
+ MemorySpace s ;
+
+ const size_t N = 1200 ;
+ const size_t size = 8 ;
+
+ RecordMemS * rarray[ N ];
+ Header * harray[ N ];
+
+ RecordMemS ** const r = rarray ;
+ Header ** const h = harray ;
+
+ Kokkos::RangePolicy< ExecutionSpace > range(0,N);
+
+ //----------------------------------------
+ {
+ Kokkos::parallel_for( range , [=]( size_t i ){
+ char name[64] ;
+ sprintf(name,"test_%.2d",int(i));
+
+ r[i] = RecordMemS::allocate( s , name , size * ( i + 1 ) );
+ h[i] = Header::get_header( r[i]->data() );
+
+ ASSERT_EQ( r[i]->use_count() , 0 );
+
+ for ( size_t j = 0 ; j < ( i / 10 ) + 1 ; ++j ) RecordBase::increment( r[i] );
+
+ ASSERT_EQ( r[i]->use_count() , ( i / 10 ) + 1 );
+ ASSERT_EQ( r[i] , RecordMemS::get_record( r[i]->data() ) );
+ });
+
+ // Sanity check for the whole set of allocation records to which this record belongs.
+ RecordBase::is_sane( r[0] );
+ // RecordMemS::print_records( std::cout , s , true );
+
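+ // Release every reference taken above: decrement() is called until it returns NULL, i.e. until the record has been fully released.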
+ Kokkos::parallel_for( range , [=]( size_t i ){
+ while ( 0 != ( r[i] = static_cast< RecordMemS *>( RecordBase::decrement( r[i] ) ) ) ) {
+ if ( r[i]->use_count() == 1 ) RecordBase::is_sane( r[i] );
+ }
+ });
+ }
+ //----------------------------------------
+ {
+ int destroy_count = 0 ;
+ SharedAllocDestroy counter( & destroy_count );
+
+ Kokkos::parallel_for( range , [=]( size_t i ){
+ char name[64] ;
+ sprintf(name,"test_%.2d",int(i));
+
+ RecordFull * rec = RecordFull::allocate( s , name , size * ( i + 1 ) );
+
+ rec->m_destroy = counter ;
+
+ r[i] = rec ;
+ h[i] = Header::get_header( r[i]->data() );
+
+ ASSERT_EQ( r[i]->use_count() , 0 );
+
+ for ( size_t j = 0 ; j < ( i / 10 ) + 1 ; ++j ) RecordBase::increment( r[i] );
+
+ ASSERT_EQ( r[i]->use_count() , ( i / 10 ) + 1 );
+ ASSERT_EQ( r[i] , RecordMemS::get_record( r[i]->data() ) );
+ });
+
+ RecordBase::is_sane( r[0] );
+
+ Kokkos::parallel_for( range , [=]( size_t i ){
+ while ( 0 != ( r[i] = static_cast< RecordMemS *>( RecordBase::decrement( r[i] ) ) ) ) {
+ if ( r[i]->use_count() == 1 ) RecordBase::is_sane( r[i] );
+ }
+ });
+
+ ASSERT_EQ( destroy_count , int(N) );
+ }
+
+ //----------------------------------------
+ {
+ int destroy_count = 0 ;
+
+ {
+ RecordFull * rec = RecordFull::allocate( s , "test" , size );
+
+ // ... Construction of the allocated { rec->data() , rec->size() }
+
+ // Copy destruction function object into the allocation record
+ rec->m_destroy = SharedAllocDestroy( & destroy_count );
+
+ // Start tracking, increments the use count from 0 to 1
+ Tracker track( rec );
+
+ ASSERT_EQ( rec->use_count() , 1 );
+
+ // Verify construction / destruction increment
+ for ( size_t i = 0 ; i < N ; ++i ) {
+ ASSERT_EQ( rec->use_count() , 1 );
+ {
+ Tracker local_tracker( rec );
+ ASSERT_EQ( rec->use_count() , 2 );
+ }
+ ASSERT_EQ( rec->use_count() , 1 );
+ }
+
+ Kokkos::parallel_for( range , [=]( size_t i ){
+ Tracker local_tracker( rec );
+ ASSERT_GT( rec->use_count() , 1 );
+ });
+
+ ASSERT_EQ( rec->use_count() , 1 );
+
+ // Destruction of 'track' object deallocates the 'rec' and invokes the destroy function object.
+ }
+
+ ASSERT_EQ( destroy_count , 1 );
+ }
+
+#endif /* #if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST ) */
+
+}
+
+
+}
+
diff --git a/lib/kokkos/core/unit_test/TestTaskPolicy.hpp b/lib/kokkos/core/unit_test/TestTaskPolicy.hpp
new file mode 100755
index 000000000..96a5ca3b0
--- /dev/null
+++ b/lib/kokkos/core/unit_test/TestTaskPolicy.hpp
@@ -0,0 +1,494 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+
+#ifndef KOKKOS_UNITTEST_TASKPOLICY_HPP
+#define KOKKOS_UNITTEST_TASKPOLICY_HPP
+
+#include <stdio.h>
+#include <iostream>
+#include <cmath>
+#include <Kokkos_TaskPolicy.hpp>
+
+namespace TestTaskPolicy {
+
+//----------------------------------------------------------------------------
+
+template< class ExecSpace >
+struct FibChild {
+
+ typedef long value_type ;
+
+ Kokkos::Experimental::TaskPolicy<ExecSpace> policy ;
+ const value_type n ;
+ int has_nested ;
+
+ FibChild( const Kokkos::Experimental::TaskPolicy<ExecSpace> & arg_policy
+ , const value_type arg_n )
+ : policy(arg_policy,2) /* default dependence capacity = 2 */
+ , n( arg_n ), has_nested(0) {}
+
+ inline
+ void apply( value_type & result )
+ {
+ if ( n < 2 ) {
+
+ has_nested = -1 ;
+
+ result = n ;
+ }
+ else {
+ if ( has_nested == 0 ) {
+ // Spawn new children and respawn myself to sum their results:
+ has_nested = 2 ;
+
+ Kokkos::Experimental::respawn
+ ( policy
+ , this
+ , Kokkos::Experimental::spawn( policy , FibChild(policy,n-1) )
+ , Kokkos::Experimental::spawn( policy , FibChild(policy,n-2) )
+ );
+
+ }
+ else if ( has_nested == 2 ) {
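+ // Second execution: the two child tasks spawned above have completed, so their results can be read from the recorded dependences.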
+
+ has_nested = -1 ;
+
+ const Kokkos::Experimental::Future<long,ExecSpace> fib_1 = policy.get_dependence(this,0);
+ const Kokkos::Experimental::Future<long,ExecSpace> fib_2 = policy.get_dependence(this,1);
+
+ result = fib_1.get() + fib_2.get();
+ }
+ else {
+ fprintf(stderr,"FibChild(%ld) execution error\n",(long)n);
+ fflush(stderr);
+ }
+ }
+ }
+};
+
+template< class ExecSpace >
+struct FibChild2 {
+
+ typedef long value_type ;
+
+ Kokkos::Experimental::TaskPolicy<ExecSpace> policy ;
+ const value_type n ;
+ int has_nested ;
+
+ FibChild2( const Kokkos::Experimental::TaskPolicy<ExecSpace> & arg_policy
+ , const value_type arg_n )
+ : policy(arg_policy,2) /* default dependence capacity = 2 */
+ , n( arg_n ), has_nested(0) {}
+
+ inline
+ void apply( value_type & result )
+ {
+ if ( 0 == has_nested ) {
+ if ( n < 2 ) {
+
+ has_nested = -1 ;
+
+ result = n ;
+ }
+ else if ( n < 4 ) {
+ // Spawn new children and respawn myself to sum their results:
+ // result = Fib(n-1) + Fib(n-2)
+ has_nested = 2 ;
+ // Kokkos::Experimental::respawn implements the following steps:
+ policy.clear_dependence( this );
+ policy.add_dependence( this , Kokkos::Experimental::spawn( policy , FibChild2(policy,n-1) ) );
+ policy.add_dependence( this , Kokkos::Experimental::spawn( policy , FibChild2(policy,n-2) ) );
+ policy.respawn( this );
+ }
+ else {
+ // Spawn new children and respawn myself to sum their results:
+ // result = Fib(n-1) + Fib(n-2)
+ // result = ( Fib(n-2) + Fib(n-3) ) + ( Fib(n-3) + Fib(n-4) )
+ // result = ( ( Fib(n-3) + Fib(n-4) ) + Fib(n-3) ) + ( Fib(n-3) + Fib(n-4) )
+ // result = 3 * Fib(n-3) + 2 * Fib(n-4)
+ has_nested = 4 ;
+ // Kokkos::Experimental::respawn implements the following steps:
+ policy.clear_dependence( this );
+ policy.add_dependence( this , Kokkos::Experimental::spawn( policy , FibChild2(policy,n-3) ) );
+ policy.add_dependence( this , Kokkos::Experimental::spawn( policy , FibChild2(policy,n-4) ) );
+ policy.respawn( this );
+ }
+ }
+ else if ( 2 == has_nested || 4 == has_nested ) {
+ const Kokkos::Experimental::Future<long,ExecSpace> fib_a = policy.get_dependence(this,0);
+ const Kokkos::Experimental::Future<long,ExecSpace> fib_b = policy.get_dependence(this,1);
+
+ result = ( has_nested == 2 ) ? fib_a.get() + fib_b.get()
+ : 3 * fib_a.get() + 2 * fib_b.get() ;
+
+ has_nested = -1 ;
+ }
+ else {
+ fprintf(stderr,"FibChild2(%ld) execution error\n",(long)n);
+ fflush(stderr);
+ }
+ }
+};
+
+namespace {
+
+long eval_fib( long n )
+{
+ if ( n < 2 ) return n ;
+
+ std::vector<long> fib(n+1);
+
+ fib[0] = 0 ;
+ fib[1] = 1 ;
+
+ for ( long i = 2 ; i <= n ; ++i ) { fib[i] = fib[i-2] + fib[i-1]; }
+
+ return fib[n];
+}
+
+}
+
+template< class ExecSpace >
+void test_fib( long n )
+{
+ Kokkos::Experimental::TaskPolicy<ExecSpace> policy(2);
+
+ Kokkos::Experimental::Future<long,ExecSpace> f = Kokkos::Experimental::spawn( policy , FibChild<ExecSpace>(policy,n) );
+
+ Kokkos::Experimental::wait( policy );
+
+ if ( f.get() != eval_fib(n) ) {
+ std::cout << "Fib(" << n << ") = " << f.get();
+ std::cout << " != " << eval_fib(n);
+ std::cout << std::endl ;
+ }
+}
+
+template< class ExecSpace >
+void test_fib2( long n )
+{
+ Kokkos::Experimental::TaskPolicy<ExecSpace> policy(2); // default dependence capacity
+
+ Kokkos::Experimental::Future<long,ExecSpace> f = Kokkos::Experimental::spawn( policy , FibChild2<ExecSpace>(policy,n) );
+
+ Kokkos::Experimental::wait( policy );
+
+ if ( f.get() != eval_fib(n) ) {
+ std::cout << "Fib2(" << n << ") = " << f.get();
+ std::cout << " != " << eval_fib(n);
+ std::cout << std::endl ;
+ }
+}
+
+//----------------------------------------------------------------------------
+
+template< class ExecSpace >
+struct Norm2 {
+
+ typedef double value_type ;
+
+ const double * const m_x ;
+
+ Norm2( const double * x ) : m_x(x) {}
+
+ inline
+ void init( double & val ) const { val = 0 ; }
+
+ inline
+ void operator()( int i , double & val ) const { val += m_x[i] * m_x[i] ; }
+
+ void apply( double & dst ) const { dst = std::sqrt( dst ); }
+};
+
+template< class ExecSpace >
+void test_norm2( const int n )
+{
+ Kokkos::Experimental::TaskPolicy< ExecSpace > policy ;
+
+ double * const x = new double[n];
+
+ for ( int i = 0 ; i < n ; ++i ) x[i] = 1 ;
+
+ Kokkos::RangePolicy<ExecSpace> r(0,n);
+
+ Kokkos::Experimental::Future<double,ExecSpace> f = Kokkos::Experimental::spawn_reduce( policy , r , Norm2<ExecSpace>(x) );
+
+ Kokkos::Experimental::wait( policy );
+
+#if defined(PRINT)
+ std::cout << "Norm2: " << f.get() << std::endl ;
+#endif
+
+ delete[] x ;
+}
+
+//----------------------------------------------------------------------------
+
+template< class Space >
+struct TaskDep {
+
+ typedef int value_type ;
+ typedef Kokkos::Experimental::TaskPolicy< Space > policy_type ;
+
+ const policy_type policy ;
+ const int input ;
+
+ TaskDep( const policy_type & arg_p , const int arg_i )
+ : policy( arg_p ), input( arg_i ) {}
+
+ void apply( int & val )
+ {
+ val = input ;
+ const int num = policy.get_dependence( this );
+
+ for ( int i = 0 ; i < num ; ++i ) {
+ Kokkos::Experimental::Future<int,Space> f = policy.get_dependence( this , i );
+ val += f.get();
+ }
+ }
+};
+
+
+template< class Space >
+void test_task_dep( const int n )
+{
+ enum { NTEST = 64 };
+
+ Kokkos::Experimental::TaskPolicy< Space > policy ;
+
+ Kokkos::Experimental::Future<int,Space> f[ NTEST ];
+
+ for ( int i = 0 ; i < NTEST ; ++i ) {
+ // Create task in the "constructing" state with capacity for 'n+1' dependences
+ f[i] = policy.create( TaskDep<Space>(policy,0) , n + 1 );
+
+ if ( f[i].get_task_state() != Kokkos::Experimental::TASK_STATE_CONSTRUCTING ) {
+ Kokkos::Impl::throw_runtime_exception("get_task_state() != Kokkos::Experimental::TASK_STATE_CONSTRUCTING");
+ }
+
+ // Only use 'n' dependences
+
+ for ( int j = 0 ; j < n ; ++j ) {
+
+ Kokkos::Experimental::Future<int,Space> nested = policy.create( TaskDep<Space>(policy,j+1) );
+
+ policy.spawn( nested );
+
+ // Add dependence to a "constructing" task
+ policy.add_dependence( f[i] , nested );
+ }
+
+ // Spawn task from the "constructing" to the "waiting" state
+ policy.spawn( f[i] );
+ }
+
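+ // Each task starts from input 0 and accumulates its dependences' results 1..n, so the expected value is the sum 1 + ... + n with the even factor halved first to keep the integer division exact.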
+ const int answer = n % 2 ? n * ( ( n + 1 ) / 2 ) : ( n / 2 ) * ( n + 1 );
+
+ Kokkos::Experimental::wait( policy );
+
+ int error = 0 ;
+ for ( int i = 0 ; i < NTEST ; ++i ) {
+ if ( f[i].get_task_state() != Kokkos::Experimental::TASK_STATE_COMPLETE ) {
+ Kokkos::Impl::throw_runtime_exception("get_task_state() != Kokkos::Experimental::TASK_STATE_COMPLETE");
+ }
+ if ( answer != f[i].get() && 0 == error ) {
+ std::cout << "test_task_dep(" << n << ") ERROR at[" << i << "]"
+ << " answer(" << answer << ") != result(" << f[i].get() << ")" << std::endl ;
+ }
+ }
+}
+
+//----------------------------------------------------------------------------
+
+#if defined( KOKKOS_HAVE_CXX11 )
+
+template< class ExecSpace >
+struct TaskTeam {
+
+ enum { SPAN = 8 };
+
+ typedef void value_type ;
+ typedef Kokkos::Experimental::TaskPolicy<ExecSpace> policy_type ;
+ typedef Kokkos::Experimental::Future<ExecSpace> future_type ;
+ typedef Kokkos::View<long*,ExecSpace> view_type ;
+
+ policy_type policy ;
+ future_type future ;
+
+ view_type result ;
+ const long nvalue ;
+
+ TaskTeam( const policy_type & arg_policy
+ , const view_type & arg_result
+ , const long arg_nvalue )
+ : policy(arg_policy)
+ , future()
+ , result( arg_result )
+ , nvalue( arg_nvalue )
+ {}
+
+ inline
+ void apply( const typename policy_type::member_type & member )
+ {
+ const long end = nvalue + 1 ;
+ const long begin = 0 < end - SPAN ? end - SPAN : 0 ;
+
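+ // If entries remain below 'begin', team rank 0 first spawns a predecessor team task for them and respawns this task to wait on its future; the parallel_for below then fills [begin,end).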
+ if ( 0 < begin && future.get_task_state() == Kokkos::Experimental::TASK_STATE_NULL ) {
+ if ( member.team_rank() == 0 ) {
+ future = policy.spawn( policy.create_team( TaskTeam( policy , result , begin - 1 ) ) );
+ policy.clear_dependence( this );
+ policy.add_dependence( this , future );
+ policy.respawn( this );
+ }
+ return ;
+ }
+
+ Kokkos::parallel_for( Kokkos::TeamThreadRange(member,begin,end)
+ , [&]( int i ) { result[i] = i + 1 ; }
+ );
+ }
+};
+
+template< class ExecSpace >
+struct TaskTeamValue {
+
+ enum { SPAN = 8 };
+
+ typedef long value_type ;
+ typedef Kokkos::Experimental::TaskPolicy<ExecSpace> policy_type ;
+ typedef Kokkos::Experimental::Future<value_type,ExecSpace> future_type ;
+ typedef Kokkos::View<long*,ExecSpace> view_type ;
+
+ policy_type policy ;
+ future_type future ;
+
+ view_type result ;
+ const long nvalue ;
+
+ TaskTeamValue( const policy_type & arg_policy
+ , const view_type & arg_result
+ , const long arg_nvalue )
+ : policy(arg_policy)
+ , future()
+ , result( arg_result )
+ , nvalue( arg_nvalue )
+ {}
+
+ inline
+ void apply( const typename policy_type::member_type & member , value_type & final )
+ {
+ const long end = nvalue + 1 ;
+ const long begin = 0 < end - SPAN ? end - SPAN : 0 ;
+
+ if ( 0 < begin && future.get_task_state() == Kokkos::Experimental::TASK_STATE_NULL ) {
+ if ( member.team_rank() == 0 ) {
+ future = policy.spawn( policy.create_team( TaskTeamValue( policy , result , begin - 1 ) ) );
+ policy.clear_dependence( this );
+ policy.add_dependence( this , future );
+ policy.respawn( this );
+ }
+ return ;
+ }
+
+ Kokkos::parallel_for( Kokkos::TeamThreadRange(member,begin,end)
+ , [&]( int i ) { result[i] = i + 1 ; }
+ );
+
+ if ( member.team_rank() == 0 ) {
+ final = result[nvalue] ;
+ }
+
+ Kokkos::memory_fence();
+ }
+};
+
+template< class ExecSpace >
+void test_task_team( long n )
+{
+ typedef TaskTeam< ExecSpace > task_type ;
+ typedef TaskTeamValue< ExecSpace > task_value_type ;
+ typedef typename task_type::view_type view_type ;
+ typedef typename task_type::policy_type policy_type ;
+
+ typedef typename task_type::future_type future_type ;
+ typedef typename task_value_type::future_type future_value_type ;
+
+ policy_type policy ;
+ view_type result("result",n+1);
+
+ future_type f = policy.spawn( policy.create_team( task_type( policy , result , n ) ) );
+
+ Kokkos::Experimental::wait( policy );
+
+ for ( long i = 0 ; i <= n ; ++i ) {
+ const long answer = i + 1 ;
+ if ( result(i) != answer ) {
+ std::cerr << "test_task_team void ERROR result(" << i << ") = " << result(i) << " != " << answer << std::endl ;
+ }
+ }
+
+ future_value_type fv = policy.spawn( policy.create_team( task_value_type( policy , result , n ) ) );
+
+ Kokkos::Experimental::wait( policy );
+
+ if ( fv.get() != n + 1 ) {
+ std::cerr << "test_task_team value ERROR future = " << fv.get() << " != " << n + 1 << std::endl ;
+ }
+ for ( long i = 0 ; i <= n ; ++i ) {
+ const long answer = i + 1 ;
+ if ( result(i) != answer ) {
+ std::cerr << "test_task_team value ERROR result(" << i << ") = " << result(i) << " != " << answer << std::endl ;
+ }
+ }
+}
+
+#endif
+
+//----------------------------------------------------------------------------
+
+} // namespace TestTaskPolicy
+
+#endif /* #ifndef KOKKOS_UNITTEST_TASKPOLICY_HPP */
+
+
diff --git a/lib/kokkos/core/unit_test/TestTeam.hpp b/lib/kokkos/core/unit_test/TestTeam.hpp
new file mode 100755
index 000000000..4849f18df
--- /dev/null
+++ b/lib/kokkos/core/unit_test/TestTeam.hpp
@@ -0,0 +1,466 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <stdio.h>
+#include <stdexcept>
+#include <sstream>
+#include <iostream>
+
+#include <Kokkos_Core.hpp>
+
+/*--------------------------------------------------------------------------*/
+
+namespace Test {
+namespace {
+
+template< class ExecSpace >
+struct TestTeamPolicy {
+
+ typedef typename Kokkos::TeamPolicy< ExecSpace >::member_type team_member ;
+ typedef Kokkos::View<int**,ExecSpace> view_type ;
+
+ view_type m_flags ;
+
+ TestTeamPolicy( const size_t league_size )
+ : m_flags( Kokkos::ViewAllocateWithoutInitializing("flags")
+ , Kokkos::TeamPolicy< ExecSpace >::team_size_max( *this )
+ , league_size )
+ {}
+
+ struct VerifyInitTag {};
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const team_member & member ) const
+ {
+ const int tid = member.team_rank() + member.team_size() * member.league_rank();
+
+ m_flags( member.team_rank() , member.league_rank() ) = tid ;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const VerifyInitTag & , const team_member & member ) const
+ {
+ const int tid = member.team_rank() + member.team_size() * member.league_rank();
+
+ if ( tid != m_flags( member.team_rank() , member.league_rank() ) ) {
+ printf("TestTeamPolicy member(%d,%d) error %d != %d\n"
+ , member.league_rank() , member.team_rank()
+ , tid , m_flags( member.team_rank() , member.league_rank() ) );
+ }
+ }
+
+ static void test_for( const size_t league_size )
+ {
+ TestTeamPolicy functor( league_size );
+
+ const int team_size = Kokkos::TeamPolicy< ExecSpace >::team_size_max( functor );
+
+ Kokkos::parallel_for( Kokkos::TeamPolicy< ExecSpace >( league_size , team_size ) , functor );
+ Kokkos::parallel_for( Kokkos::TeamPolicy< ExecSpace , VerifyInitTag >( league_size , team_size ) , functor );
+ }
+
+ struct ReduceTag {};
+
+ typedef long value_type ;
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const team_member & member , value_type & update ) const
+ {
+ update += member.team_rank() + member.team_size() * member.league_rank();
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const ReduceTag & , const team_member & member , value_type & update ) const
+ {
+ update += 1 + member.team_rank() + member.team_size() * member.league_rank();
+ }
+
+ static void test_reduce( const size_t league_size )
+ {
+ TestTeamPolicy functor( league_size );
+
+ const int team_size = Kokkos::TeamPolicy< ExecSpace >::team_size_max( functor );
+ const long N = team_size * league_size ;
+
+ long total = 0 ;
+
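+ // Thread ids run 0..N-1, so the untagged reduction sums to (N-1)*N/2; the ReduceTag variant adds 1 per thread, giving N*(N+1)/2.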
+ Kokkos::parallel_reduce( Kokkos::TeamPolicy< ExecSpace >( league_size , team_size ) , functor , total );
+ ASSERT_EQ( size_t((N-1)*(N))/2 , size_t(total) );
+
+ Kokkos::parallel_reduce( Kokkos::TeamPolicy< ExecSpace , ReduceTag >( league_size , team_size ) , functor , total );
+ ASSERT_EQ( (size_t(N)*size_t(N+1))/2 , size_t(total) );
+ }
+};
+
+}
+}
+
+/*--------------------------------------------------------------------------*/
+
+namespace Test {
+
+template< typename ScalarType , class DeviceType >
+class ReduceTeamFunctor
+{
+public:
+ typedef DeviceType execution_space ;
+ typedef Kokkos::TeamPolicy< execution_space > policy_type ;
+ typedef typename execution_space::size_type size_type ;
+
+ struct value_type {
+ ScalarType value[3] ;
+ };
+
+ const size_type nwork ;
+
+ ReduceTeamFunctor( const size_type & arg_nwork ) : nwork( arg_nwork ) {}
+
+ ReduceTeamFunctor( const ReduceTeamFunctor & rhs )
+ : nwork( rhs.nwork ) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void init( value_type & dst ) const
+ {
+ dst.value[0] = 0 ;
+ dst.value[1] = 0 ;
+ dst.value[2] = 0 ;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void join( volatile value_type & dst ,
+ const volatile value_type & src ) const
+ {
+ dst.value[0] += src.value[0] ;
+ dst.value[1] += src.value[1] ;
+ dst.value[2] += src.value[2] ;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const typename policy_type::member_type ind , value_type & dst ) const
+ {
+ const int thread_rank = ind.team_rank() + ind.team_size() * ind.league_rank();
+ const int thread_size = ind.team_size() * ind.league_size();
+ const int chunk = ( nwork + thread_size - 1 ) / thread_size ;
+
+ size_type iwork = chunk * thread_rank ;
+ const size_type iwork_end = iwork + chunk < nwork ? iwork + chunk : nwork ;
+
+ for ( ; iwork < iwork_end ; ++iwork ) {
+ dst.value[0] += 1 ;
+ dst.value[1] += iwork + 1 ;
+ dst.value[2] += nwork - iwork ;
+ }
+ }
+};
+
+} // namespace Test
+
+namespace {
+
+template< typename ScalarType , class DeviceType >
+class TestReduceTeam
+{
+public:
+ typedef DeviceType execution_space ;
+ typedef Kokkos::TeamPolicy< execution_space > policy_type ;
+ typedef typename execution_space::size_type size_type ;
+
+ //------------------------------------
+
+ TestReduceTeam( const size_type & nwork )
+ {
+ run_test(nwork);
+ }
+
+ void run_test( const size_type & nwork )
+ {
+ typedef Test::ReduceTeamFunctor< ScalarType , execution_space > functor_type ;
+ typedef typename functor_type::value_type value_type ;
+ typedef Kokkos::View< value_type, Kokkos::HostSpace, Kokkos::MemoryUnmanaged > result_type ;
+
+ enum { Count = 3 };
+ enum { Repeat = 100 };
+
+ value_type result[ Repeat ];
+
+ const unsigned long nw = nwork ;
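+ // nsum = 1 + 2 + ... + nw; the even factor of nw*(nw+1) is halved first so the division is exact.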
+ const unsigned long nsum = nw % 2 ? nw * (( nw + 1 )/2 )
+ : (nw/2) * ( nw + 1 );
+
+ const unsigned team_size = policy_type::team_size_recommended( functor_type(nwork) );
+ const unsigned league_size = ( nwork + team_size - 1 ) / team_size ;
+
+ policy_type team_exec( league_size , team_size );
+
+ for ( unsigned i = 0 ; i < Repeat ; ++i ) {
+ result_type tmp( & result[i] );
+ Kokkos::parallel_reduce( team_exec , functor_type(nwork) , tmp );
+ }
+
+ execution_space::fence();
+
+ for ( unsigned i = 0 ; i < Repeat ; ++i ) {
+ for ( unsigned j = 0 ; j < Count ; ++j ) {
+ const unsigned long correct = 0 == j % 3 ? nw : nsum ;
+ ASSERT_EQ( (ScalarType) correct , result[i].value[j] );
+ }
+ }
+ }
+};
+
+}
+
+/*--------------------------------------------------------------------------*/
+
+namespace Test {
+
+template< class DeviceType >
+class ScanTeamFunctor
+{
+public:
+ typedef DeviceType execution_space ;
+ typedef Kokkos::TeamPolicy< execution_space > policy_type ;
+
+ typedef long int value_type ;
+ Kokkos::View< value_type , execution_space > accum ;
+ Kokkos::View< value_type , execution_space > total ;
+
+ ScanTeamFunctor() : accum("accum"), total("total") {}
+
+ KOKKOS_INLINE_FUNCTION
+ void init( value_type & error ) const { error = 0 ; }
+
+ KOKKOS_INLINE_FUNCTION
+ void join( value_type volatile & error ,
+ value_type volatile const & input ) const
+ { if ( input ) error = 1 ; }
+
+ struct JoinMax {
+ typedef long int value_type ;
+ KOKKOS_INLINE_FUNCTION
+ void join( value_type volatile & dst
+ , value_type volatile const & input ) const
+ { if ( dst < input ) dst = input ; }
+ };
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const typename policy_type::member_type ind , value_type & error ) const
+ {
+ if ( 0 == ind.league_rank() && 0 == ind.team_rank() ) {
+ const long int thread_count = ind.league_size() * ind.team_size();
+ total() = ( thread_count * ( thread_count + 1 ) ) / 2 ;
+ }
+
+ // Team max:
+ const long int m = ind.team_reduce( (long int) ( ind.league_rank() + ind.team_rank() ) , JoinMax() );
+
+ if ( m != ind.league_rank() + ( ind.team_size() - 1 ) ) {
+ printf("ScanTeamFunctor[%d.%d of %d.%d] reduce_max_answer(%ld) != reduce_max(%ld)\n"
+ , ind.league_rank(), ind.team_rank()
+ , ind.league_size(), ind.team_size()
+ , (long int)(ind.league_rank() + ( ind.team_size() - 1 )) , m );
+ }
+
+ // Scan:
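+ // Each thread contributes (league_rank+1) + (team_rank+1); 'answer' is the exclusive prefix sum of those contributions over the lower team ranks.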
+ const long int answer =
+ ( ind.league_rank() + 1 ) * ind.team_rank() +
+ ( ind.team_rank() * ( ind.team_rank() + 1 ) ) / 2 ;
+
+ const long int result =
+ ind.team_scan( ind.league_rank() + 1 + ind.team_rank() + 1 );
+
+ const long int result2 =
+ ind.team_scan( ind.league_rank() + 1 + ind.team_rank() + 1 );
+
+ if ( answer != result || answer != result2 ) {
+ printf("ScanTeamFunctor[%d.%d of %d.%d] answer(%ld) != scan_first(%ld) or scan_second(%ld)\n",
+ ind.league_rank(), ind.team_rank(),
+ ind.league_size(), ind.team_size(),
+ answer,result,result2);
+ error = 1 ;
+ }
+
+ const long int thread_rank = ind.team_rank() +
+ ind.team_size() * ind.league_rank();
+ ind.team_scan( 1 + thread_rank , accum.ptr_on_device() );
+ }
+};
+
+template< class DeviceType >
+class TestScanTeam
+{
+public:
+ typedef DeviceType execution_space ;
+ typedef long int value_type ;
+
+ typedef Kokkos::TeamPolicy< execution_space > policy_type ;
+ typedef Test::ScanTeamFunctor<DeviceType> functor_type ;
+
+ //------------------------------------
+
+ TestScanTeam( const size_t nteam )
+ {
+ run_test(nteam);
+ }
+
+ void run_test( const size_t nteam )
+ {
+ typedef Kokkos::View< long int , Kokkos::HostSpace , Kokkos::MemoryUnmanaged > result_type ;
+
+ const unsigned REPEAT = 100000 ;
+ const unsigned Repeat = ( REPEAT + nteam - 1 ) / nteam ;
+
+ functor_type functor ;
+
+ policy_type team_exec( nteam , policy_type::team_size_max( functor ) );
+
+ for ( unsigned i = 0 ; i < Repeat ; ++i ) {
+ long int accum = 0 ;
+ long int total = 0 ;
+ long int error = 0 ;
+ Kokkos::deep_copy( functor.accum , total );
+ Kokkos::parallel_reduce( team_exec , functor , result_type( & error ) );
+ DeviceType::fence();
+ Kokkos::deep_copy( accum , functor.accum );
+ Kokkos::deep_copy( total , functor.total );
+
+ ASSERT_EQ( error , 0 );
+ ASSERT_EQ( total , accum );
+ }
+
+ execution_space::fence();
+ }
+};
+
+} // namespace Test
+
+/*--------------------------------------------------------------------------*/
+
+namespace Test {
+
+template< class ExecSpace >
+struct SharedTeamFunctor {
+
+ typedef ExecSpace execution_space ;
+ typedef int value_type ;
+ typedef Kokkos::TeamPolicy< execution_space > policy_type ;
+
+ enum { SHARED_COUNT = 1000 };
+
+ typedef typename ExecSpace::scratch_memory_space shmem_space ;
+
+ // tbd: MemoryUnmanaged should be the default for shared memory space
+ typedef Kokkos::View<int*,shmem_space,Kokkos::MemoryUnmanaged> shared_int_array_type ;
+
+ // Tell how much shared memory will be required by this functor:
+ inline
+ unsigned team_shmem_size( int /* team_size */ ) const
+ {
+ return shared_int_array_type::shmem_size( SHARED_COUNT ) +
+ shared_int_array_type::shmem_size( SHARED_COUNT );
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const typename policy_type::member_type & ind , value_type & update ) const
+ {
+ const shared_int_array_type shared_A( ind.team_shmem() , SHARED_COUNT );
+ const shared_int_array_type shared_B( ind.team_shmem() , SHARED_COUNT );
+
+ if ((shared_A.ptr_on_device () == NULL && SHARED_COUNT > 0) ||
+ (shared_B.ptr_on_device () == NULL && SHARED_COUNT > 0)) {
+ printf ("Failed to allocate shared memory of size %lu\n",
+ static_cast<unsigned long> (SHARED_COUNT));
+ ++update; // failure to allocate is an error
+ }
+ else {
+ for ( int i = ind.team_rank() ; i < SHARED_COUNT ; i += ind.team_size() ) {
+ shared_A[i] = i + ind.league_rank();
+ shared_B[i] = 2 * i + ind.league_rank();
+ }
+
+ ind.team_barrier();
+
+ if ( ind.team_rank() + 1 == ind.team_size() ) {
+ for ( int i = 0 ; i < SHARED_COUNT ; ++i ) {
+ if ( shared_A[i] != i + ind.league_rank() ) {
+ ++update ;
+ }
+ if ( shared_B[i] != 2 * i + ind.league_rank() ) {
+ ++update ;
+ }
+ }
+ }
+ }
+ }
+};
+
+}
+
+namespace {
+
+template< class ExecSpace >
+struct TestSharedTeam {
+
+ TestSharedTeam()
+ { run(); }
+
+ void run()
+ {
+ typedef Test::SharedTeamFunctor<ExecSpace> Functor ;
+ typedef Kokkos::View< typename Functor::value_type , Kokkos::HostSpace , Kokkos::MemoryUnmanaged > result_type ;
+
+ const size_t team_size = Kokkos::TeamPolicy< ExecSpace >::team_size_max( Functor() );
+
+ Kokkos::TeamPolicy< ExecSpace > team_exec( 8192 / team_size , team_size );
+
+ typename Functor::value_type error_count = 0 ;
+
+ Kokkos::parallel_reduce( team_exec , Functor() , result_type( & error_count ) );
+
+ ASSERT_EQ( error_count , 0 );
+ }
+};
+
+}
+
+/*--------------------------------------------------------------------------*/
diff --git a/lib/kokkos/core/unit_test/TestTeamVector.hpp b/lib/kokkos/core/unit_test/TestTeamVector.hpp
new file mode 100755
index 000000000..add8b7ed4
--- /dev/null
+++ b/lib/kokkos/core/unit_test/TestTeamVector.hpp
@@ -0,0 +1,650 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <Kokkos_Core.hpp>
+
+#include <impl/Kokkos_Timer.hpp>
+#include <iostream>
+#include <cstdlib>
+
+namespace TestTeamVector {
+
+struct my_complex {
+ double re,im;
+ int dummy;
+ KOKKOS_INLINE_FUNCTION
+ my_complex() {
+ re = 0.0;
+ im = 0.0;
+ dummy = 0;
+ }
+ KOKKOS_INLINE_FUNCTION
+ my_complex(const my_complex& src) {
+ re = src.re;
+ im = src.im;
+ dummy = src.dummy;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ my_complex(const volatile my_complex& src) {
+ re = src.re;
+ im = src.im;
+ dummy = src.dummy;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ my_complex(const double& val) {
+ re = val;
+ im = 0.0;
+ dummy = 0;
+ }
+ KOKKOS_INLINE_FUNCTION
+ my_complex& operator += (const my_complex& src) {
+ re += src.re;
+ im += src.im;
+ dummy += src.dummy;
+ return *this;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator += (const volatile my_complex& src) volatile {
+ re += src.re;
+ im += src.im;
+ dummy += src.dummy;
+ }
+ KOKKOS_INLINE_FUNCTION
+ my_complex& operator *= (const my_complex& src) {
+ double re_tmp = re*src.re - im*src.im;
+ double im_tmp = re * src.im + im * src.re;
+ re = re_tmp;
+ im = im_tmp;
+ dummy *= src.dummy;
+ return *this;
+ }
+ KOKKOS_INLINE_FUNCTION
+ void operator *= (const volatile my_complex& src) volatile {
+ double re_tmp = re*src.re - im*src.im;
+ double im_tmp = re * src.im + im * src.re;
+ re = re_tmp;
+ im = im_tmp;
+ dummy *= src.dummy;
+ }
+ KOKKOS_INLINE_FUNCTION
+ bool operator == (const my_complex& src) {
+ return (re == src.re) && (im == src.im) && ( dummy == src.dummy );
+ }
+ KOKKOS_INLINE_FUNCTION
+ bool operator != (const my_complex& src) {
+ return (re != src.re) || (im != src.im) || ( dummy != src.dummy );
+ }
+ KOKKOS_INLINE_FUNCTION
+ bool operator != (const double& val) {
+ return (re != val) ||
+ (im != 0) || (dummy != 0);
+ }
+ KOKKOS_INLINE_FUNCTION
+ my_complex& operator= (const int& val) {
+ re = val;
+ im = 0.0;
+ dummy = 0;
+ return *this;
+ }
+ KOKKOS_INLINE_FUNCTION
+ my_complex& operator= (const double& val) {
+ re = val;
+ im = 0.0;
+ dummy = 0;
+ return *this;
+ }
+ KOKKOS_INLINE_FUNCTION
+ operator double() {
+ return re;
+ }
+};
+
+#if defined (KOKKOS_HAVE_CXX11)
+
+
+template<typename Scalar, class ExecutionSpace>
+struct functor_team_for {
+ typedef Kokkos::TeamPolicy<ExecutionSpace> policy_type;
+ typedef ExecutionSpace execution_space;
+
+ Kokkos::View<int,Kokkos::LayoutLeft,ExecutionSpace> flag;
+ functor_team_for(Kokkos::View<int,Kokkos::LayoutLeft,ExecutionSpace> flag_):flag(flag_) {}
+
+ unsigned team_shmem_size(int team_size) const {return team_size*13*sizeof(Scalar)+8;}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (typename policy_type::member_type team) const {
+
+ typedef typename ExecutionSpace::scratch_memory_space shmem_space ;
+ typedef Kokkos::View<Scalar*,shmem_space,Kokkos::MemoryUnmanaged> shared_int;
+ typedef typename shared_int::size_type size_type;
+
+ const size_type shmemSize = team.team_size () * 13;
+ shared_int values = shared_int (team.team_shmem (), shmemSize);
+
+ if (values.ptr_on_device () == NULL || values.dimension_0 () < shmemSize) {
+ printf ("FAILED to allocate shared memory of size %u\n",
+ static_cast<unsigned int> (shmemSize));
+ }
+ else {
+
+ // Initialize shared memory
+ values(team.team_rank ()) = 0;
+
+ // Accumulate value into per-thread shared memory.
+ // This is non-blocking.
+ Kokkos::parallel_for(Kokkos::TeamThreadRange(team,131),[&] (int i) {
+ values(team.team_rank ()) += i - team.league_rank () + team.league_size () + team.team_size ();
+ });
+ // Wait for all memory to be written
+ team.team_barrier ();
+ // One thread per team executes the comparison
+ Kokkos::single(Kokkos::PerTeam(team),[&]() {
+ Scalar test = 0;
+ Scalar value = 0;
+ for (int i = 0; i < 131; ++i) {
+ test += i - team.league_rank () + team.league_size () + team.team_size ();
+ }
+ for (int i = 0; i < team.team_size (); ++i) {
+ value += values(i);
+ }
+ if (test != value) {
+ printf ("FAILED team_parallel_for %i %i %f %f\n",
+ team.league_rank (), team.team_rank (),
+ static_cast<double> (test), static_cast<double> (value));
+ flag() = 1;
+ }
+ });
+ }
+ }
+};
+
+template<typename Scalar, class ExecutionSpace>
+struct functor_team_reduce {
+ typedef Kokkos::TeamPolicy<ExecutionSpace> policy_type;
+ typedef ExecutionSpace execution_space;
+
+ Kokkos::View<int,Kokkos::LayoutLeft,ExecutionSpace> flag;
+ functor_team_reduce(Kokkos::View<int,Kokkos::LayoutLeft,ExecutionSpace> flag_):flag(flag_) {}
+
+ unsigned team_shmem_size(int team_size) const {return team_size*13*sizeof(Scalar)+8;}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (typename policy_type::member_type team) const {
+
+ Scalar value = Scalar();
+ Kokkos::parallel_reduce(Kokkos::TeamThreadRange(team,131),[&] (int i, Scalar& val) {
+ val += i - team.league_rank () + team.league_size () + team.team_size ();
+ },value);
+
+ team.team_barrier ();
+ Kokkos::single(Kokkos::PerTeam(team),[&]() {
+ Scalar test = 0;
+ for (int i = 0; i < 131; ++i) {
+ test += i - team.league_rank () + team.league_size () + team.team_size ();
+ }
+ if (test != value) {
+ if(team.league_rank() == 0)
+ printf ("FAILED team_parallel_reduce %i %i %f %f %lu\n",
+ team.league_rank (), team.team_rank (),
+ static_cast<double> (test), static_cast<double> (value),sizeof(Scalar));
+ flag() = 1;
+ }
+ });
+ }
+};
+
+template<typename Scalar, class ExecutionSpace>
+struct functor_team_reduce_join {
+ typedef Kokkos::TeamPolicy<ExecutionSpace> policy_type;
+ typedef ExecutionSpace execution_space;
+
+ Kokkos::View<int,Kokkos::LayoutLeft,ExecutionSpace> flag;
+ functor_team_reduce_join(Kokkos::View<int,Kokkos::LayoutLeft,ExecutionSpace> flag_):flag(flag_) {}
+
+ unsigned team_shmem_size(int team_size) const {return team_size*13*sizeof(Scalar)+8;}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (typename policy_type::member_type team) const {
+
+ Scalar value = 0;
+
+ Kokkos::parallel_reduce(Kokkos::TeamThreadRange(team,131)
+ , [&] (int i, Scalar& val) {
+ val += i - team.league_rank () + team.league_size () + team.team_size ();
+ }
+ , [&] (volatile Scalar& val, const volatile Scalar& src) {val+=src;}
+ , value
+ );
+
+ team.team_barrier ();
+ Kokkos::single(Kokkos::PerTeam(team),[&]() {
+ Scalar test = 0;
+ for (int i = 0; i < 131; ++i) {
+ test += i - team.league_rank () + team.league_size () + team.team_size ();
+ }
+ if (test != value) {
+ printf ("FAILED team_vector_parallel_reduce_join %i %i %f %f\n",
+ team.league_rank (), team.team_rank (),
+ static_cast<double> (test), static_cast<double> (value));
+ flag() = 1;
+ }
+ });
+ }
+};
+
+template<typename Scalar, class ExecutionSpace>
+struct functor_team_vector_for {
+ typedef Kokkos::TeamPolicy<ExecutionSpace> policy_type;
+ typedef ExecutionSpace execution_space;
+
+ Kokkos::View<int,Kokkos::LayoutLeft,ExecutionSpace> flag;
+ functor_team_vector_for(Kokkos::View<int,Kokkos::LayoutLeft,ExecutionSpace> flag_):flag(flag_) {}
+
+ unsigned team_shmem_size(int team_size) const {return team_size*13*sizeof(Scalar)+8;}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (typename policy_type::member_type team) const {
+
+ typedef typename ExecutionSpace::scratch_memory_space shmem_space ;
+ typedef Kokkos::View<Scalar*,shmem_space,Kokkos::MemoryUnmanaged> shared_int;
+ typedef typename shared_int::size_type size_type;
+
+ const size_type shmemSize = team.team_size () * 13;
+ shared_int values = shared_int (team.team_shmem (), shmemSize);
+
+ if (values.ptr_on_device () == NULL || values.dimension_0 () < shmemSize) {
+ printf ("FAILED to allocate shared memory of size %u\n",
+ static_cast<unsigned int> (shmemSize));
+ }
+ else {
+ Kokkos::single(Kokkos::PerThread(team),[&] () {
+ values(team.team_rank ()) = 0;
+ });
+
+ Kokkos::parallel_for(Kokkos::TeamThreadRange(team,131),[&] (int i) {
+ Kokkos::single(Kokkos::PerThread(team),[&] () {
+ values(team.team_rank ()) += i - team.league_rank () + team.league_size () + team.team_size ();
+ });
+ });
+
+ team.team_barrier ();
+ Kokkos::single(Kokkos::PerTeam(team),[&]() {
+ Scalar test = 0;
+ Scalar value = 0;
+ for (int i = 0; i < 131; ++i) {
+ test += i - team.league_rank () + team.league_size () + team.team_size ();
+ }
+ for (int i = 0; i < team.team_size (); ++i) {
+ value += values(i);
+ }
+ if (test != value) {
+ printf ("FAILED team_vector_parallel_for %i %i %f %f\n",
+ team.league_rank (), team.team_rank (),
+ static_cast<double> (test), static_cast<double> (value));
+ flag() = 1;
+ }
+ });
+ }
+ }
+};
+
+template<typename Scalar, class ExecutionSpace>
+struct functor_team_vector_reduce {
+ typedef Kokkos::TeamPolicy<ExecutionSpace> policy_type;
+ typedef ExecutionSpace execution_space;
+
+ Kokkos::View<int,Kokkos::LayoutLeft,ExecutionSpace> flag;
+ functor_team_vector_reduce(Kokkos::View<int,Kokkos::LayoutLeft,ExecutionSpace> flag_):flag(flag_) {}
+
+ unsigned team_shmem_size(int team_size) const {return team_size*13*sizeof(Scalar)+8;}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (typename policy_type::member_type team) const {
+
+ Scalar value = Scalar();
+ Kokkos::parallel_reduce(Kokkos::TeamThreadRange(team,131),[&] (int i, Scalar& val) {
+ val += i - team.league_rank () + team.league_size () + team.team_size ();
+ },value);
+
+ team.team_barrier ();
+ Kokkos::single(Kokkos::PerTeam(team),[&]() {
+ Scalar test = 0;
+ for (int i = 0; i < 131; ++i) {
+ test += i - team.league_rank () + team.league_size () + team.team_size ();
+ }
+ if (test != value) {
+ if(team.league_rank() == 0)
+ printf ("FAILED team_vector_parallel_reduce %i %i %f %f %lu\n",
+ team.league_rank (), team.team_rank (),
+ static_cast<double> (test), static_cast<double> (value),sizeof(Scalar));
+ flag() = 1;
+ }
+ });
+ }
+};
+
+template<typename Scalar, class ExecutionSpace>
+struct functor_team_vector_reduce_join {
+ typedef Kokkos::TeamPolicy<ExecutionSpace> policy_type;
+ typedef ExecutionSpace execution_space;
+
+ Kokkos::View<int,Kokkos::LayoutLeft,ExecutionSpace> flag;
+ functor_team_vector_reduce_join(Kokkos::View<int,Kokkos::LayoutLeft,ExecutionSpace> flag_):flag(flag_) {}
+
+ unsigned team_shmem_size(int team_size) const {return team_size*13*sizeof(Scalar)+8;}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (typename policy_type::member_type team) const {
+
+ Scalar value = 0;
+ Kokkos::parallel_reduce(Kokkos::TeamThreadRange(team,131)
+ , [&] (int i, Scalar& val) {
+ val += i - team.league_rank () + team.league_size () + team.team_size ();
+ }
+ , [&] (volatile Scalar& val, const volatile Scalar& src) {val+=src;}
+ , value
+ );
+
+ team.team_barrier ();
+ Kokkos::single(Kokkos::PerTeam(team),[&]() {
+ Scalar test = 0;
+ for (int i = 0; i < 131; ++i) {
+ test += i - team.league_rank () + team.league_size () + team.team_size ();
+ }
+ if (test != value) {
+ printf ("FAILED team_vector_parallel_reduce_join %i %i %f %f\n",
+ team.league_rank (), team.team_rank (),
+ static_cast<double> (test), static_cast<double> (value));
+ flag() = 1;
+ }
+ });
+ }
+};
+
+template<typename Scalar, class ExecutionSpace>
+struct functor_vec_single {
+ typedef Kokkos::TeamPolicy<ExecutionSpace> policy_type;
+ typedef ExecutionSpace execution_space;
+
+ Kokkos::View<int,Kokkos::LayoutLeft,ExecutionSpace> flag;
+ functor_vec_single(Kokkos::View<int,Kokkos::LayoutLeft,ExecutionSpace> flag_):flag(flag_) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (typename policy_type::member_type team) const {
+
+ // Warning: this test case intentionally violates permissible semantics.
+ // It is not valid to take a reference to a variable of the enclosing scope
+ // inside a parallel_for and write to it.
+ Scalar value = 0;
+
+ Kokkos::parallel_for(Kokkos::ThreadVectorRange(team,13),[&] (int i) {
+ value = i; // This write is violating Kokkos semantics for nested parallelism
+ });
+
+ Kokkos::single(Kokkos::PerThread(team),[&] (Scalar& val) {
+ val = 1;
+ },value);
+
+ Scalar value2 = 0;
+ Kokkos::parallel_reduce(Kokkos::ThreadVectorRange(team,13), [&] (int i, Scalar& val) {
+ val += value;
+ },value2);
+
+ if(value2!=(value*13)) {
+ printf("FAILED vector_single broadcast %i %i %f %f\n",team.league_rank(),team.team_rank(),(double) value2,(double) value);
+ flag()=1;
+ }
+ }
+};
+
+template<typename Scalar, class ExecutionSpace>
+struct functor_vec_for {
+ typedef Kokkos::TeamPolicy<ExecutionSpace> policy_type;
+ typedef ExecutionSpace execution_space;
+
+ Kokkos::View<int,Kokkos::LayoutLeft,ExecutionSpace> flag;
+ functor_vec_for(Kokkos::View<int,Kokkos::LayoutLeft,ExecutionSpace> flag_):flag(flag_) {}
+
+ unsigned team_shmem_size(int team_size) const {return team_size*13*sizeof(Scalar)+8;}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (typename policy_type::member_type team) const {
+
+ typedef typename ExecutionSpace::scratch_memory_space shmem_space ;
+ typedef Kokkos::View<Scalar*,shmem_space,Kokkos::MemoryUnmanaged> shared_int;
+ shared_int values = shared_int(team.team_shmem(),team.team_size()*13);
+
+ if (values.ptr_on_device () == NULL ||
+ values.dimension_0() < (unsigned) team.team_size() * 13) {
+ printf ("FAILED to allocate memory of size %i\n",
+ static_cast<int> (team.team_size () * 13));
+ flag() = 1;
+ }
+ else {
+ Kokkos::parallel_for(Kokkos::ThreadVectorRange(team,13), [&] (int i) {
+ values(13*team.team_rank() + i) = i - team.team_rank() - team.league_rank() + team.league_size() + team.team_size();
+ });
+
+ Kokkos::single(Kokkos::PerThread(team),[&] () {
+ Scalar test = 0;
+ Scalar value = 0;
+ for (int i = 0; i < 13; ++i) {
+ test += i - team.team_rank() - team.league_rank() + team.league_size() + team.team_size();
+ value += values(13*team.team_rank() + i);
+ }
+ if (test != value) {
+ printf ("FAILED vector_par_for %i %i %f %f\n",
+ team.league_rank (), team.team_rank (),
+ static_cast<double> (test), static_cast<double> (value));
+ flag() = 1;
+ }
+ });
+ }
+ }
+};
+
+template<typename Scalar, class ExecutionSpace>
+struct functor_vec_red {
+ typedef Kokkos::TeamPolicy<ExecutionSpace> policy_type;
+ typedef ExecutionSpace execution_space;
+
+ Kokkos::View<int,Kokkos::LayoutLeft,ExecutionSpace> flag;
+ functor_vec_red(Kokkos::View<int,Kokkos::LayoutLeft,ExecutionSpace> flag_):flag(flag_) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (typename policy_type::member_type team) const {
+ Scalar value = 0;
+
+ Kokkos::parallel_reduce(Kokkos::ThreadVectorRange(team,13),[&] (int i, Scalar& val) {
+ val += i;
+ }, value);
+
+ Kokkos::single(Kokkos::PerThread(team),[&] () {
+ Scalar test = 0;
+ for(int i = 0; i < 13; i++) {
+ test+=i;
+ }
+ if(test!=value) {
+ printf("FAILED vector_par_reduce %i %i %f %f\n",team.league_rank(),team.team_rank(),(double) test,(double) value);
+ flag()=1;
+ }
+ });
+ }
+};
+
+template<typename Scalar, class ExecutionSpace>
+struct functor_vec_red_join {
+ typedef Kokkos::TeamPolicy<ExecutionSpace> policy_type;
+ typedef ExecutionSpace execution_space;
+
+ Kokkos::View<int,Kokkos::LayoutLeft,ExecutionSpace> flag;
+ functor_vec_red_join(Kokkos::View<int,Kokkos::LayoutLeft,ExecutionSpace> flag_):flag(flag_) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (typename policy_type::member_type team) const {
+ Scalar value = 1;
+
+ Kokkos::parallel_reduce(Kokkos::ThreadVectorRange(team,13)
+ , [&] (int i, Scalar& val) { val *= i; }
+ , [&] (Scalar& val, const Scalar& src) {val*=src;}
+ , value
+ );
+
+ Kokkos::single(Kokkos::PerThread(team),[&] () {
+ Scalar test = 1;
+ for(int i = 0; i < 13; i++) {
+ test*=i;
+ }
+ if(test!=value) {
+ printf("FAILED vector_par_reduce_join %i %i %f %f\n",team.league_rank(),team.team_rank(),(double) test,(double) value);
+ flag()=1;
+ }
+ });
+ }
+};
+
+template<typename Scalar, class ExecutionSpace>
+struct functor_vec_scan {
+ typedef Kokkos::TeamPolicy<ExecutionSpace> policy_type;
+ typedef ExecutionSpace execution_space;
+
+ Kokkos::View<int,Kokkos::LayoutLeft,ExecutionSpace> flag;
+ functor_vec_scan(Kokkos::View<int,Kokkos::LayoutLeft,ExecutionSpace> flag_):flag(flag_) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (typename policy_type::member_type team) const {
+ Kokkos::parallel_scan(Kokkos::ThreadVectorRange(team,13),[&] (int i, Scalar& val, bool final) {
+ val += i;
+ if(final) {
+ Scalar test = 0;
+ for(int k = 0; k <= i; k++) {
+ test+=k;
+ }
+ if(test!=val) {
+ printf("FAILED vector_par_scan %i %i %f %f\n",team.league_rank(),team.team_rank(),(double) test,(double) val);
+ flag()=1;
+ }
+ }
+ });
+ }
+};
+
+template<typename Scalar, class ExecutionSpace>
+struct functor_reduce {
+ typedef double value_type;
+ typedef Kokkos::TeamPolicy<ExecutionSpace> policy_type;
+ typedef ExecutionSpace execution_space;
+
+ Kokkos::View<int,Kokkos::LayoutLeft,ExecutionSpace> flag;
+ functor_reduce(Kokkos::View<int,Kokkos::LayoutLeft,ExecutionSpace> flag_):flag(flag_) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (typename policy_type::member_type team, double& sum) const {
+ sum += team.league_rank() * 100 + team.thread_rank();
+ }
+};
+#endif
+
+template<typename Scalar,class ExecutionSpace>
+bool test_scalar(int nteams, int team_size, int test) {
+ Kokkos::View<int,Kokkos::LayoutLeft,ExecutionSpace> d_flag("flag");
+ typename Kokkos::View<int,Kokkos::LayoutLeft,ExecutionSpace>::HostMirror h_flag("h_flag");
+ h_flag() = 0 ;
+ Kokkos::deep_copy(d_flag,h_flag);
+ #ifdef KOKKOS_HAVE_CXX11
+ if(test==0)
+ Kokkos::parallel_for( std::string("A") , Kokkos::TeamPolicy<ExecutionSpace>(nteams,team_size,8),
+ functor_vec_red<Scalar, ExecutionSpace>(d_flag));
+ if(test==1)
+ Kokkos::parallel_for( Kokkos::TeamPolicy<ExecutionSpace>(nteams,team_size,8),
+ functor_vec_red_join<Scalar, ExecutionSpace>(d_flag));
+ if(test==2)
+ Kokkos::parallel_for( Kokkos::TeamPolicy<ExecutionSpace>(nteams,team_size,8),
+ functor_vec_scan<Scalar, ExecutionSpace>(d_flag));
+ if(test==3)
+ Kokkos::parallel_for( Kokkos::TeamPolicy<ExecutionSpace>(nteams,team_size,8),
+ functor_vec_for<Scalar, ExecutionSpace>(d_flag));
+ if(test==4)
+ Kokkos::parallel_for( "B" , Kokkos::TeamPolicy<ExecutionSpace>(nteams,team_size,8),
+ functor_vec_single<Scalar, ExecutionSpace>(d_flag));
+ if(test==5)
+ Kokkos::parallel_for( Kokkos::TeamPolicy<ExecutionSpace>(nteams,team_size),
+ functor_team_for<Scalar, ExecutionSpace>(d_flag));
+ if(test==6)
+ Kokkos::parallel_for( Kokkos::TeamPolicy<ExecutionSpace>(nteams,team_size),
+ functor_team_reduce<Scalar, ExecutionSpace>(d_flag));
+ if(test==7)
+ Kokkos::parallel_for( Kokkos::TeamPolicy<ExecutionSpace>(nteams,team_size),
+ functor_team_reduce_join<Scalar, ExecutionSpace>(d_flag));
+ if(test==8)
+ Kokkos::parallel_for( Kokkos::TeamPolicy<ExecutionSpace>(nteams,team_size,8),
+ functor_team_vector_for<Scalar, ExecutionSpace>(d_flag));
+ if(test==9)
+ Kokkos::parallel_for( Kokkos::TeamPolicy<ExecutionSpace>(nteams,team_size,8),
+ functor_team_vector_reduce<Scalar, ExecutionSpace>(d_flag));
+ if(test==10)
+ Kokkos::parallel_for( Kokkos::TeamPolicy<ExecutionSpace>(nteams,team_size,8),
+ functor_team_vector_reduce_join<Scalar, ExecutionSpace>(d_flag));
+ #endif
+ Kokkos::deep_copy(h_flag,d_flag);
+
+ return (h_flag() == 0);
+}
+
+template<class ExecutionSpace>
+bool Test(int test) {
+ bool passed = true;
+ passed = passed && test_scalar<int, ExecutionSpace>(317,33,test);
+ passed = passed && test_scalar<long long int, ExecutionSpace>(317,33,test);
+ passed = passed && test_scalar<float, ExecutionSpace>(317,33,test);
+ passed = passed && test_scalar<double, ExecutionSpace>(317,33,test);
+ passed = passed && test_scalar<my_complex, ExecutionSpace>(317,33,test);
+ return passed;
+}
+
+}
+
diff --git a/lib/kokkos/core/unit_test/TestTemplateMetaFunctions.hpp b/lib/kokkos/core/unit_test/TestTemplateMetaFunctions.hpp
new file mode 100755
index 000000000..4f136bc64
--- /dev/null
+++ b/lib/kokkos/core/unit_test/TestTemplateMetaFunctions.hpp
@@ -0,0 +1,219 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <Kokkos_Core.hpp>
+
+#define KOKKOS_PRAGMA_UNROLL(a)
+
+namespace {
+
+template<class Scalar, class ExecutionSpace>
+struct SumPlain {
+ typedef ExecutionSpace execution_space;
+ typedef typename Kokkos::View<Scalar*,execution_space> type;
+ type view;
+ SumPlain(type view_):view(view_) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (int i, Scalar& val) {
+ val += Scalar();
+ }
+};
+
+template<class Scalar, class ExecutionSpace>
+struct SumInitJoinFinalValueType {
+ typedef ExecutionSpace execution_space;
+ typedef typename Kokkos::View<Scalar*,execution_space> type;
+ type view;
+ typedef Scalar value_type;
+ SumInitJoinFinalValueType(type view_):view(view_) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void init(value_type& val) const {
+ val = value_type();
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void join(volatile value_type& val, volatile value_type& src) const {
+ val += src;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (int i, value_type& val) const {
+ val += value_type();
+ }
+
+};
+
+template<class Scalar, class ExecutionSpace>
+struct SumInitJoinFinalValueType2 {
+ typedef ExecutionSpace execution_space;
+ typedef typename Kokkos::View<Scalar*,execution_space> type;
+ type view;
+ typedef Scalar value_type;
+ SumInitJoinFinalValueType2(type view_):view(view_) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void init(volatile value_type& val) const {
+ val = value_type();
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void join(volatile value_type& val, const volatile value_type& src) const {
+ val += src;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (int i, value_type& val) const {
+ val += value_type();
+ }
+
+};
+
+template<class Scalar, class ExecutionSpace>
+struct SumInitJoinFinalValueTypeArray {
+ typedef ExecutionSpace execution_space;
+ typedef typename Kokkos::View<Scalar*,execution_space> type;
+ type view;
+ typedef Scalar value_type[];
+ int n;
+ SumInitJoinFinalValueTypeArray(type view_, int n_):view(view_),n(n_) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void init(value_type val) const {
+ for(int k=0;k<n;k++)
+ val[k] = 0;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void join(volatile value_type val, const volatile value_type src) const {
+ for(int k=0;k<n;k++)
+ val[k] += src[k];
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (int i, value_type val) const {
+ for(int k=0;k<n;k++)
+ val[k] += k*i;
+ }
+
+};
+
+template<class Scalar, class ExecutionSpace>
+struct SumWrongInitJoinFinalValueType {
+ typedef ExecutionSpace execution_space;
+ typedef typename Kokkos::View<Scalar*,execution_space> type;
+ type view;
+ typedef Scalar value_type;
+ SumWrongInitJoinFinalValueType(type view_):view(view_) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void init(double& val) const {
+ val = double();
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void join(volatile value_type& val, const value_type& src) const {
+ val += src;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator() (int i, value_type& val) const {
+ val += value_type();
+ }
+
+};
+
+template<class Scalar, class ExecutionSpace>
+void TestTemplateMetaFunctions() {
+ typedef typename Kokkos::View<Scalar*,ExecutionSpace> type;
+ type a("A",100);
+/* #ifdef KOKKOS_HAVE_CXX11
+ int sum_plain_has_init_arg = Kokkos::Impl::FunctorHasInit<SumPlain<Scalar,ExecutionSpace>, Scalar& >::value;
+ ASSERT_EQ(sum_plain_has_init_arg,0);
+ int sum_initjoinfinalvaluetype_has_init_arg = Kokkos::Impl::FunctorHasInit<SumInitJoinFinalValueType<Scalar,ExecutionSpace>, Scalar >::value;
+ ASSERT_EQ(sum_initjoinfinalvaluetype_has_init_arg,1);
+ int sum_initjoinfinalvaluetype_has_init_arg2 = Kokkos::Impl::FunctorHasInit<SumInitJoinFinalValueType2<Scalar,ExecutionSpace>, Scalar >::value;
+ ASSERT_EQ(sum_initjoinfinalvaluetype_has_init_arg2,1);
+ int sum_wronginitjoinfinalvaluetype_has_init_arg = Kokkos::Impl::FunctorHasInit<SumWrongInitJoinFinalValueType<Scalar,ExecutionSpace>, Scalar >::value;
+ ASSERT_EQ(sum_wronginitjoinfinalvaluetype_has_init_arg,0);
+
+ //int sum_initjoinfinalvaluetypearray_has_init_arg = Kokkos::Impl::FunctorHasInit<SumInitJoinFinalValueTypeArray<Scalar,ExecutionSpace>, Scalar[] >::value;
+ //ASSERT_EQ(sum_initjoinfinalvaluetypearray_has_init_arg,1);
+
+ #else
+
+ int sum_plain_has_init_arg = Kokkos::Impl::FunctorHasInit<SumPlain<Scalar,ExecutionSpace>, Scalar& >::value;
+ ASSERT_EQ(sum_plain_has_init_arg,0);
+ int sum_initjoinfinalvaluetype_has_init_arg = Kokkos::Impl::FunctorHasInit<SumInitJoinFinalValueType<Scalar,ExecutionSpace>, Scalar& >::value;
+ ASSERT_EQ(sum_initjoinfinalvaluetype_has_init_arg,1);
+ int sum_wronginitjoinfinalvaluetype_has_init_arg = Kokkos::Impl::FunctorHasInit<SumWrongInitJoinFinalValueType<Scalar,ExecutionSpace>, Scalar& >::value;
+ ASSERT_EQ(sum_wronginitjoinfinalvaluetype_has_init_arg,1);
+
+ #endif
+
+ //printf("Values Init: %i %i %i\n",sum_plain_has_init_arg,sum_initjoinfinalvaluetype_has_init_arg,sum_wronginitjoinfinalvaluetype_has_init_arg);
+
+#ifdef KOKKOS_HAVE_CXX11
+ int sum_plain_has_join_arg = Kokkos::Impl::FunctorHasJoin<SumPlain<Scalar,ExecutionSpace>, Scalar >::value;
+ ASSERT_EQ(sum_plain_has_join_arg,0);
+ int sum_initjoinfinalvaluetype_has_join_arg = Kokkos::Impl::FunctorHasJoin<SumInitJoinFinalValueType<Scalar,ExecutionSpace>, Scalar >::value;
+ ASSERT_EQ(sum_initjoinfinalvaluetype_has_join_arg,1);
+ int sum_initjoinfinalvaluetype_has_join_arg2 = Kokkos::Impl::FunctorHasJoin<SumInitJoinFinalValueType2<Scalar,ExecutionSpace>, Scalar >::value;
+ ASSERT_EQ(sum_initjoinfinalvaluetype_has_join_arg2,1);
+ int sum_wronginitjoinfinalvaluetype_has_join_arg = Kokkos::Impl::FunctorHasJoin<SumWrongInitJoinFinalValueType<Scalar,ExecutionSpace>, Scalar >::value;
+ ASSERT_EQ(sum_wronginitjoinfinalvaluetype_has_join_arg,0);
+#else
+ int sum_plain_has_join_arg = Kokkos::Impl::FunctorHasJoin<SumPlain<Scalar,ExecutionSpace>, Scalar& >::value;
+ ASSERT_EQ(sum_plain_has_join_arg,0);
+ int sum_initjoinfinalvaluetype_has_join_arg = Kokkos::Impl::FunctorHasJoin<SumInitJoinFinalValueType<Scalar,ExecutionSpace>, Scalar& >::value;
+ ASSERT_EQ(sum_initjoinfinalvaluetype_has_join_arg,1);
+ int sum_initjoinfinalvaluetype_has_join_arg2 = Kokkos::Impl::FunctorHasJoin<SumInitJoinFinalValueType2<Scalar,ExecutionSpace>, Scalar& >::value;
+ ASSERT_EQ(sum_initjoinfinalvaluetype_has_join_arg2,1);
+ int sum_wronginitjoinfinalvaluetype_has_join_arg = Kokkos::Impl::FunctorHasJoin<SumWrongInitJoinFinalValueType<Scalar,ExecutionSpace>, Scalar& >::value;
+ ASSERT_EQ(sum_wronginitjoinfinalvaluetype_has_join_arg,1);
+#endif*/
+ //printf("Values Join: %i %i %i\n",sum_plain_has_join_arg,sum_initjoinfinalvaluetype_has_join_arg,sum_wronginitjoinfinalvaluetype_has_join_arg);
+}
+
+}
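TestTemplateMetaFunctions.hpp defines reduction functors with and without init()/join() members to probe the FunctorHasInit/FunctorHasJoin traits (the assertions are currently commented out). As a reminder of how such a functor is consumed, here is a minimal, hedged sketch of a custom init/join reducer used with Kokkos::parallel_reduce; the functor name, the max operation, and the range length in the usage comment are illustrative assumptions, not part of the patch.

// Minimal sketch (not part of the patch): a functor with value_type, init()
// and join(), in the style of SumInitJoinFinalValueType above.
#include <Kokkos_Core.hpp>

struct MaxFunctor {
  typedef long value_type;                        // reduction result type
  Kokkos::View<const long*> data;
  MaxFunctor(Kokkos::View<const long*> d) : data(d) {}

  KOKKOS_INLINE_FUNCTION
  void init(value_type& val) const { val = 0; }   // identity, assuming non-negative data

  KOKKOS_INLINE_FUNCTION
  void join(volatile value_type& dst, const volatile value_type& src) const {
    if (src > dst) dst = src;                     // combine partial results
  }

  KOKKOS_INLINE_FUNCTION
  void operator()(int i, value_type& val) const {
    if (data(i) > val) val = data(i);             // per-element contribution
  }
};

// Usage (illustrative):
//   long result = 0;
//   Kokkos::parallel_reduce(100, MaxFunctor(view), result);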
diff --git a/lib/kokkos/core/unit_test/TestThreads.cpp b/lib/kokkos/core/unit_test/TestThreads.cpp
new file mode 100755
index 000000000..3832998ab
--- /dev/null
+++ b/lib/kokkos/core/unit_test/TestThreads.cpp
@@ -0,0 +1,443 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <gtest/gtest.h>
+
+#include <Kokkos_Macros.hpp>
+
+#if defined( KOKKOS_HAVE_PTHREAD )
+
+#include <Kokkos_Core.hpp>
+
+#include <Threads/Kokkos_Threads_TaskPolicy.hpp>
+
+//----------------------------------------------------------------------------
+
+#include <TestSharedAlloc.hpp>
+#include <TestViewMapping.hpp>
+
+#include <TestViewImpl.hpp>
+
+#include <TestViewAPI.hpp>
+#include <TestViewSubview.hpp>
+#include <TestAtomic.hpp>
+
+#include <TestReduce.hpp>
+#include <TestScan.hpp>
+#include <TestRange.hpp>
+#include <TestTeam.hpp>
+#include <TestAggregate.hpp>
+#include <TestAggregateReduction.hpp>
+#include <TestCompilerMacros.hpp>
+#include <TestCXX11.hpp>
+#include <TestCXX11Deduction.hpp>
+#include <TestTeamVector.hpp>
+#include <TestMemorySpaceTracking.hpp>
+#include <TestTemplateMetaFunctions.hpp>
+
+#include <TestTaskPolicy.hpp>
+
+namespace Test {
+
+class threads : public ::testing::Test {
+protected:
+ static void SetUpTestCase()
+ {
+ // Finalize without initialize is a no-op:
+ Kokkos::Threads::finalize();
+
+ const unsigned numa_count = Kokkos::hwloc::get_available_numa_count();
+ const unsigned cores_per_numa = Kokkos::hwloc::get_available_cores_per_numa();
+ const unsigned threads_per_core = Kokkos::hwloc::get_available_threads_per_core();
+
+ unsigned threads_count = 0 ;
+
+ // Initialize and finalize with no threads:
+ Kokkos::Threads::initialize( 1u );
+ Kokkos::Threads::finalize();
+
+ threads_count = std::max( 1u , numa_count )
+ * std::max( 2u , cores_per_numa * threads_per_core );
+
+ Kokkos::Threads::initialize( threads_count );
+ Kokkos::Threads::finalize();
+
+
+ threads_count = std::max( 1u , numa_count * 2 )
+ * std::max( 2u , ( cores_per_numa * threads_per_core ) / 2 );
+
+ Kokkos::Threads::initialize( threads_count );
+ Kokkos::Threads::finalize();
+
+ // Quick attempt to verify thread start/terminate don't have race condition:
+ threads_count = std::max( 1u , numa_count )
+ * std::max( 2u , ( cores_per_numa * threads_per_core ) / 2 );
+ for ( unsigned i = 0 ; i < 10 ; ++i ) {
+ Kokkos::Threads::initialize( threads_count );
+ Kokkos::Threads::sleep();
+ Kokkos::Threads::wake();
+ Kokkos::Threads::finalize();
+ }
+
+ Kokkos::Threads::initialize( threads_count );
+ Kokkos::Threads::print_configuration( std::cout , true /* detailed */ );
+ }
+
+ static void TearDownTestCase()
+ {
+ Kokkos::Threads::finalize();
+ }
+};
+
+TEST_F( threads , init ) {
+ ;
+}
+
+TEST_F( threads , impl_shared_alloc ) {
+ test_shared_alloc< Kokkos::HostSpace , Kokkos::Threads >();
+}
+
+TEST_F( threads , impl_view_mapping ) {
+ test_view_mapping< Kokkos::Threads >();
+ test_view_mapping_subview< Kokkos::Threads >();
+ test_view_mapping_operator< Kokkos::Threads >();
+ TestViewMappingAtomic< Kokkos::Threads >::run();
+}
+
+
+TEST_F( threads, view_impl) {
+ test_view_impl< Kokkos::Threads >();
+}
+
+TEST_F( threads, view_api) {
+ TestViewAPI< double , Kokkos::Threads >();
+}
+
+TEST_F( threads, view_subview_auto_1d_left ) {
+ TestViewSubview::test_auto_1d< Kokkos::LayoutLeft,Kokkos::Threads >();
+}
+
+TEST_F( threads, view_subview_auto_1d_right ) {
+ TestViewSubview::test_auto_1d< Kokkos::LayoutRight,Kokkos::Threads >();
+}
+
+TEST_F( threads, view_subview_auto_1d_stride ) {
+ TestViewSubview::test_auto_1d< Kokkos::LayoutStride,Kokkos::Threads >();
+}
+
+TEST_F( threads, view_subview_assign_strided ) {
+ TestViewSubview::test_1d_strided_assignment< Kokkos::Threads >();
+}
+
+TEST_F( threads, view_subview_left_0 ) {
+ TestViewSubview::test_left_0< Kokkos::Threads >();
+}
+
+TEST_F( threads, view_subview_left_1 ) {
+ TestViewSubview::test_left_1< Kokkos::Threads >();
+}
+
+TEST_F( threads, view_subview_left_2 ) {
+ TestViewSubview::test_left_2< Kokkos::Threads >();
+}
+
+TEST_F( threads, view_subview_left_3 ) {
+ TestViewSubview::test_left_3< Kokkos::Threads >();
+}
+
+TEST_F( threads, view_subview_right_0 ) {
+ TestViewSubview::test_right_0< Kokkos::Threads >();
+}
+
+TEST_F( threads, view_subview_right_1 ) {
+ TestViewSubview::test_right_1< Kokkos::Threads >();
+}
+
+TEST_F( threads, view_subview_right_3 ) {
+ TestViewSubview::test_right_3< Kokkos::Threads >();
+}
+
+
+TEST_F( threads, view_aggregate ) {
+ TestViewAggregate< Kokkos::Threads >();
+ TestViewAggregateReduction< Kokkos::Threads >();
+}
+
+TEST_F( threads , range_tag )
+{
+ TestRange< Kokkos::Threads >::test_for(1000);
+ TestRange< Kokkos::Threads >::test_reduce(1000);
+ TestRange< Kokkos::Threads >::test_scan(1000);
+}
+
+TEST_F( threads , team_tag )
+{
+ TestTeamPolicy< Kokkos::Threads >::test_for(1000);
+ TestTeamPolicy< Kokkos::Threads >::test_reduce(1000);
+}
+
+TEST_F( threads, long_reduce) {
+ TestReduce< long , Kokkos::Threads >( 1000000 );
+}
+
+TEST_F( threads, double_reduce) {
+ TestReduce< double , Kokkos::Threads >( 1000000 );
+}
+
+TEST_F( threads, team_long_reduce) {
+ TestReduceTeam< long , Kokkos::Threads >( 100000 );
+}
+
+TEST_F( threads, team_double_reduce) {
+ TestReduceTeam< double , Kokkos::Threads >( 100000 );
+}
+
+TEST_F( threads, long_reduce_dynamic ) {
+ TestReduceDynamic< long , Kokkos::Threads >( 1000000 );
+}
+
+TEST_F( threads, double_reduce_dynamic ) {
+ TestReduceDynamic< double , Kokkos::Threads >( 1000000 );
+}
+
+TEST_F( threads, long_reduce_dynamic_view ) {
+ TestReduceDynamicView< long , Kokkos::Threads >( 1000000 );
+}
+
+TEST_F( threads, team_shared_request) {
+ TestSharedTeam< Kokkos::Threads >();
+}
+
+TEST_F( threads , view_remap )
+{
+ enum { N0 = 3 , N1 = 2 , N2 = 8 , N3 = 9 };
+
+ typedef Kokkos::View< double*[N1][N2][N3] ,
+ Kokkos::LayoutRight ,
+ Kokkos::Threads > output_type ;
+
+ typedef Kokkos::View< int**[N2][N3] ,
+ Kokkos::LayoutLeft ,
+ Kokkos::Threads > input_type ;
+
+ typedef Kokkos::View< int*[N0][N2][N3] ,
+ Kokkos::LayoutLeft ,
+ Kokkos::Threads > diff_type ;
+
+ output_type output( "output" , N0 );
+ input_type input ( "input" , N0 , N1 );
+ diff_type diff ( "diff" , N0 );
+
+ int value = 0 ;
+ for ( size_t i3 = 0 ; i3 < N3 ; ++i3 ) {
+ for ( size_t i2 = 0 ; i2 < N2 ; ++i2 ) {
+ for ( size_t i1 = 0 ; i1 < N1 ; ++i1 ) {
+ for ( size_t i0 = 0 ; i0 < N0 ; ++i0 ) {
+ input(i0,i1,i2,i3) = ++value ;
+ }}}}
+
+ // Kokkos::deep_copy( diff , input ); // throw with incompatible shape
+ Kokkos::deep_copy( output , input );
+
+ value = 0 ;
+ for ( size_t i3 = 0 ; i3 < N3 ; ++i3 ) {
+ for ( size_t i2 = 0 ; i2 < N2 ; ++i2 ) {
+ for ( size_t i1 = 0 ; i1 < N1 ; ++i1 ) {
+ for ( size_t i0 = 0 ; i0 < N0 ; ++i0 ) {
+ ++value ;
+ ASSERT_EQ( value , ((int) output(i0,i1,i2,i3) ) );
+ }}}}
+}
+
+//----------------------------------------------------------------------------
+
+TEST_F( threads , atomics )
+{
+ const int loop_count = 1e6 ;
+
+ ASSERT_TRUE( ( TestAtomic::Loop<int,Kokkos::Threads>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<int,Kokkos::Threads>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<int,Kokkos::Threads>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned int,Kokkos::Threads>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned int,Kokkos::Threads>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned int,Kokkos::Threads>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<long int,Kokkos::Threads>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<long int,Kokkos::Threads>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<long int,Kokkos::Threads>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned long int,Kokkos::Threads>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned long int,Kokkos::Threads>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<unsigned long int,Kokkos::Threads>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<long long int,Kokkos::Threads>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<long long int,Kokkos::Threads>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<long long int,Kokkos::Threads>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<double,Kokkos::Threads>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<double,Kokkos::Threads>(loop_count,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<double,Kokkos::Threads>(loop_count,3) ) );
+
+ ASSERT_TRUE( ( TestAtomic::Loop<float,Kokkos::Threads>(100,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<float,Kokkos::Threads>(100,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<float,Kokkos::Threads>(100,3) ) );
+
+#if defined( KOKKOS_ENABLE_ASM )
+ ASSERT_TRUE( ( TestAtomic::Loop<Kokkos::complex<double> ,Kokkos::Threads>(100,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<Kokkos::complex<double> ,Kokkos::Threads>(100,2) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<Kokkos::complex<double> ,Kokkos::Threads>(100,3) ) );
+#endif
+
+ ASSERT_TRUE( ( TestAtomic::Loop<TestAtomic::SuperScalar<3>, Kokkos::Threads>(loop_count,1) ) );
+ ASSERT_TRUE( ( TestAtomic::Loop<TestAtomic::SuperScalar<3>, Kokkos::Threads>(loop_count,2) ) );
+}
+
+//----------------------------------------------------------------------------
+
+#if 0
+TEST_F( threads , scan_small )
+{
+ typedef TestScan< Kokkos::Threads , Kokkos::Impl::ThreadsExecUseScanSmall > TestScanFunctor ;
+ for ( int i = 0 ; i < 1000 ; ++i ) {
+ TestScanFunctor( 10 );
+ TestScanFunctor( 10000 );
+ }
+ TestScanFunctor( 1000000 );
+ TestScanFunctor( 10000000 );
+
+ Kokkos::Threads::fence();
+}
+#endif
+
+TEST_F( threads , scan )
+{
+ TestScan< Kokkos::Threads >::test_range( 1 , 1000 );
+ TestScan< Kokkos::Threads >( 1000000 );
+ TestScan< Kokkos::Threads >( 10000000 );
+ Kokkos::Threads::fence();
+}
+
+//----------------------------------------------------------------------------
+
+TEST_F( threads , team_scan )
+{
+ TestScanTeam< Kokkos::Threads >( 10 );
+ TestScanTeam< Kokkos::Threads >( 10000 );
+}
+
+//----------------------------------------------------------------------------
+
+TEST_F( threads , compiler_macros )
+{
+ ASSERT_TRUE( ( TestCompilerMacros::Test< Kokkos::Threads >() ) );
+}
+
+TEST_F( threads , memory_space )
+{
+ TestMemorySpace< Kokkos::Threads >();
+}
+
+//----------------------------------------------------------------------------
+
+TEST_F( threads , template_meta_functions )
+{
+ TestTemplateMetaFunctions<int, Kokkos::Threads >();
+}
+
+//----------------------------------------------------------------------------
+
+#if defined( KOKKOS_HAVE_CXX11 ) && defined( KOKKOS_HAVE_DEFAULT_DEVICE_TYPE_THREADS )
+TEST_F( threads , cxx11 )
+{
+ if ( Kokkos::Impl::is_same< Kokkos::DefaultExecutionSpace , Kokkos::Threads >::value ) {
+ ASSERT_TRUE( ( TestCXX11::Test< Kokkos::Threads >(1) ) );
+ ASSERT_TRUE( ( TestCXX11::Test< Kokkos::Threads >(2) ) );
+ ASSERT_TRUE( ( TestCXX11::Test< Kokkos::Threads >(3) ) );
+ ASSERT_TRUE( ( TestCXX11::Test< Kokkos::Threads >(4) ) );
+ }
+}
+#endif
+
+#if defined (KOKKOS_HAVE_CXX11)
+
+TEST_F( threads , reduction_deduction )
+{
+ TestCXX11::test_reduction_deduction< Kokkos::Threads >();
+}
+
+TEST_F( threads , team_vector )
+{
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Threads >(0) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Threads >(1) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Threads >(2) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Threads >(3) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Threads >(4) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Threads >(5) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Threads >(6) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Threads >(7) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Threads >(8) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Threads >(9) ) );
+ ASSERT_TRUE( ( TestTeamVector::Test< Kokkos::Threads >(10) ) );
+}
+
+#endif
+
+TEST_F( threads , task_policy )
+{
+ TestTaskPolicy::test_task_dep< Kokkos::Threads >( 10 );
+ for ( long i = 0 ; i < 25 ; ++i ) TestTaskPolicy::test_fib< Kokkos::Threads >(i);
+ for ( long i = 0 ; i < 35 ; ++i ) TestTaskPolicy::test_fib2< Kokkos::Threads >(i);
+}
+
+#if defined( KOKKOS_HAVE_CXX11 )
+TEST_F( threads , task_team )
+{
+ TestTaskPolicy::test_task_team< Kokkos::Threads >(1000);
+}
+#endif
+
+
+} // namespace Test
+
+#endif /* #if defined( KOKKOS_HAVE_PTHREAD ) */
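The SetUpTestCase fixture above sizes the Kokkos::Threads pool from hwloc topology queries before running the tests. The following is a minimal sketch of that sizing logic in isolation, assuming the hwloc queries are available via Kokkos_Core.hpp as in the fixture; the function name is hypothetical and the initialize/finalize calls in the comments use the per-execution-space API of this snapshot.

// Minimal sketch (not part of the patch): hwloc-based thread-count selection
// mirroring SetUpTestCase above.
#include <Kokkos_Core.hpp>
#include <algorithm>

unsigned default_threads_count() {
  const unsigned numa  = Kokkos::hwloc::get_available_numa_count();
  const unsigned cores = Kokkos::hwloc::get_available_cores_per_numa();
  const unsigned hts   = Kokkos::hwloc::get_available_threads_per_core();
  // At least one NUMA region and at least two threads in total.
  return std::max(1u, numa) * std::max(2u, cores * hts);
}

// Usage (illustrative):
//   Kokkos::Threads::initialize(default_threads_count());
//   ... run kernels ...
//   Kokkos::Threads::finalize();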
diff --git a/lib/kokkos/core/unit_test/TestTile.hpp b/lib/kokkos/core/unit_test/TestTile.hpp
new file mode 100755
index 000000000..dfb2bd81b
--- /dev/null
+++ b/lib/kokkos/core/unit_test/TestTile.hpp
@@ -0,0 +1,153 @@
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+
+#ifndef TEST_TILE_HPP
+#define TEST_TILE_HPP
+
+#include <Kokkos_Core.hpp>
+
+namespace TestTile {
+
+template < typename Device , typename TileLayout>
+struct ReduceTileErrors
+{
+ typedef Device execution_space ;
+
+ typedef Kokkos::View< ptrdiff_t**, TileLayout, Device> array_type;
+ typedef Kokkos::View< ptrdiff_t[ TileLayout::N0 ][ TileLayout::N1 ], Kokkos::LayoutLeft , Device > tile_type ;
+
+ array_type m_array ;
+
+ typedef ptrdiff_t value_type;
+
+ ReduceTileErrors( array_type a )
+ : m_array(a)
+ {}
+
+
+ KOKKOS_INLINE_FUNCTION
+ static void init( value_type & errors )
+ {
+ errors = 0;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ static void join( volatile value_type & errors ,
+ const volatile value_type & src_errors )
+ {
+ errors += src_errors;
+ }
+
+ // Initialize
+ KOKKOS_INLINE_FUNCTION
+ void operator()( size_t iwork ) const
+ {
+ const size_t i = iwork % m_array.dimension_0();
+ const size_t j = iwork / m_array.dimension_0();
+ if ( j < m_array.dimension_1() ) {
+ m_array(i,j) = & m_array(i,j) - & m_array(0,0);
+
+// printf("m_array(%d,%d) = %d\n",int(i),int(j),int(m_array(i,j)));
+
+ }
+ }
+
+ // Verify:
+ KOKKOS_INLINE_FUNCTION
+ void operator()( size_t iwork , value_type & errors ) const
+ {
+ const size_t tile_dim0 = ( m_array.dimension_0() + TileLayout::N0 - 1 ) / TileLayout::N0 ;
+ const size_t tile_dim1 = ( m_array.dimension_1() + TileLayout::N1 - 1 ) / TileLayout::N1 ;
+
+ const size_t itile = iwork % tile_dim0 ;
+ const size_t jtile = iwork / tile_dim0 ;
+
+ if ( jtile < tile_dim1 ) {
+
+ tile_type tile = Kokkos::tile_subview( m_array , itile , jtile );
+
+ if ( tile(0,0) != ptrdiff_t(( itile + jtile * tile_dim0 ) * TileLayout::N0 * TileLayout::N1 ) ) {
+ ++errors ;
+ }
+ else {
+
+ for ( size_t j = 0 ; j < size_t(TileLayout::N1) ; ++j ) {
+ for ( size_t i = 0 ; i < size_t(TileLayout::N0) ; ++i ) {
+ const size_t iglobal = i + itile * TileLayout::N0 ;
+ const size_t jglobal = j + jtile * TileLayout::N1 ;
+
+ if ( iglobal < m_array.dimension_0() && jglobal < m_array.dimension_1() ) {
+ if ( tile(i,j) != ptrdiff_t( tile(0,0) + i + j * TileLayout::N0 ) ) ++errors ;
+
+// printf("tile(%d,%d)(%d,%d) = %d\n",int(itile),int(jtile),int(i),int(j),int(tile(i,j)));
+
+ }
+ }
+ }
+ }
+ }
+ }
+};
+
+template< class Space , unsigned N0 , unsigned N1 >
+void test( const size_t dim0 , const size_t dim1 )
+{
+ typedef Kokkos::LayoutTileLeft<N0,N1> array_layout ;
+ typedef ReduceTileErrors< Space , array_layout > functor_type ;
+
+ const size_t tile_dim0 = ( dim0 + N0 - 1 ) / N0 ;
+ const size_t tile_dim1 = ( dim1 + N1 - 1 ) / N1 ;
+
+ typename functor_type::array_type array("",dim0,dim1);
+
+ Kokkos::parallel_for( Kokkos::RangePolicy<Space,size_t>(0,dim0*dim1) , functor_type( array ) );
+
+ ptrdiff_t error = 0 ;
+
+ Kokkos::parallel_reduce( Kokkos::RangePolicy<Space,size_t>(0,tile_dim0*tile_dim1) , functor_type( array ) , error );
+
+ EXPECT_EQ( error , ptrdiff_t(0) );
+}
+
+} /* namespace TestTile */
+
+#endif //TEST_TILE_HPP
+
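TestTile.hpp verifies that a LayoutTileLeft view stores each N0 x N1 tile contiguously and that tile_subview extracts the tile at given tile coordinates. For orientation, here is a minimal sketch of that usage outside the test; the 4x4 tile size, the 10x12 extents, and the function name are illustrative assumptions, not part of the patch.

// Minimal sketch (not part of the patch): a tiled-layout view and extraction
// of one tile, the behavior verified by ReduceTileErrors above.
#include <Kokkos_Core.hpp>

void tile_sketch() {
  typedef Kokkos::LayoutTileLeft<4, 4> tile_layout;
  Kokkos::View<double**, tile_layout, Kokkos::HostSpace> a("a", 10, 12);

  // Elements of one 4x4 tile are contiguous in memory; tile_subview returns
  // a view of the tile at tile coordinates (itile, jtile).
  auto t = Kokkos::tile_subview(a, 1, 2);
  t(0, 0) = 42.0;  // same storage location as a(4, 8)
}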
diff --git a/lib/kokkos/core/unit_test/TestViewAPI.hpp b/lib/kokkos/core/unit_test/TestViewAPI.hpp
new file mode 100755
index 000000000..b0a81cec6
--- /dev/null
+++ b/lib/kokkos/core/unit_test/TestViewAPI.hpp
@@ -0,0 +1,1305 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <gtest/gtest.h>
+
+#include <Kokkos_Core.hpp>
+#include <stdexcept>
+#include <sstream>
+#include <iostream>
+
+/*--------------------------------------------------------------------------*/
+
+#if defined( KOKKOS_USING_EXPERIMENTAL_VIEW )
+
+namespace Test {
+
+template< typename T, class DeviceType >
+class TestViewAPI {
+public:
+ TestViewAPI() {}
+};
+
+}
+
+#else
+
+/*--------------------------------------------------------------------------*/
+
+namespace Test {
+
+template< class T , class L , class D , class M , class S >
+size_t allocation_count( const Kokkos::View<T,L,D,M,S> & view )
+{
+ const size_t card = Kokkos::Impl::cardinality_count( view.shape() );
+ const size_t alloc = view.capacity();
+
+ return card <= alloc ? alloc : 0 ;
+}
+
+/*--------------------------------------------------------------------------*/
+
+template< typename T, class DeviceType>
+struct TestViewOperator
+{
+ typedef DeviceType execution_space ;
+
+ static const unsigned N = 100 ;
+ static const unsigned D = 3 ;
+
+ typedef Kokkos::View< T*[D] , execution_space > view_type ;
+
+ const view_type v1 ;
+ const view_type v2 ;
+
+ TestViewOperator()
+ : v1( "v1" , N )
+ , v2( "v2" , N )
+ {}
+
+ static void testit()
+ {
+ Kokkos::parallel_for( N , TestViewOperator() );
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const unsigned i ) const
+ {
+ const unsigned X = 0 ;
+ const unsigned Y = 1 ;
+ const unsigned Z = 2 ;
+
+ v2(i,X) = v1(i,X);
+ v2(i,Y) = v1(i,Y);
+ v2(i,Z) = v1(i,Z);
+ }
+};
+
+/*--------------------------------------------------------------------------*/
+
+template< class DataType >
+struct rank {
+private:
+ typedef typename Kokkos::Impl::AnalyzeShape<DataType>::shape shape ;
+public:
+ static const unsigned value = shape::rank ;
+};
+
+template< class DataType ,
+ class DeviceType ,
+ unsigned Rank = rank< DataType >::value >
+struct TestViewOperator_LeftAndRight ;
+
+template< class DataType , class DeviceType >
+struct TestViewOperator_LeftAndRight< DataType , DeviceType , 8 >
+{
+ typedef DeviceType execution_space ;
+ typedef typename execution_space::memory_space memory_space ;
+ typedef typename execution_space::size_type size_type ;
+
+ typedef int value_type ;
+
+ KOKKOS_INLINE_FUNCTION
+ static void join( volatile value_type & update ,
+ const volatile value_type & input )
+ { update |= input ; }
+
+ KOKKOS_INLINE_FUNCTION
+ static void init( value_type & update )
+ { update = 0 ; }
+
+
+ typedef Kokkos::
+ View< DataType, Kokkos::LayoutLeft, execution_space > left_view ;
+
+ typedef Kokkos::
+ View< DataType, Kokkos::LayoutRight, execution_space > right_view ;
+
+ typedef Kokkos::
+ View< DataType, Kokkos::LayoutStride, execution_space > stride_view ;
+
+ typedef typename left_view ::shape_type left_shape ;
+ typedef typename right_view::shape_type right_shape ;
+
+ left_shape lsh ;
+ right_shape rsh ;
+ left_view left ;
+ right_view right ;
+ stride_view left_stride ;
+ stride_view right_stride ;
+ long left_alloc ;
+ long right_alloc ;
+
+ TestViewOperator_LeftAndRight()
+ : lsh()
+ , rsh()
+ , left( "left" )
+ , right( "right" )
+ , left_stride( left )
+ , right_stride( right )
+ , left_alloc( allocation_count( left ) )
+ , right_alloc( allocation_count( right ) )
+ {}
+
+ static void testit()
+ {
+ TestViewOperator_LeftAndRight driver ;
+
+ ASSERT_TRUE( (long) Kokkos::Impl::cardinality_count( driver.lsh ) <= driver.left_alloc );
+ ASSERT_TRUE( (long) Kokkos::Impl::cardinality_count( driver.rsh ) <= driver.right_alloc );
+
+ int error_flag = 0 ;
+
+ Kokkos::parallel_reduce( 1 , driver , error_flag );
+
+ ASSERT_EQ( error_flag , 0 );
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const size_type , value_type & update ) const
+ {
+ long offset ;
+
+ offset = -1 ;
+ for ( unsigned i7 = 0 ; i7 < unsigned(lsh.N7) ; ++i7 )
+ for ( unsigned i6 = 0 ; i6 < unsigned(lsh.N6) ; ++i6 )
+ for ( unsigned i5 = 0 ; i5 < unsigned(lsh.N5) ; ++i5 )
+ for ( unsigned i4 = 0 ; i4 < unsigned(lsh.N4) ; ++i4 )
+ for ( unsigned i3 = 0 ; i3 < unsigned(lsh.N3) ; ++i3 )
+ for ( unsigned i2 = 0 ; i2 < unsigned(lsh.N2) ; ++i2 )
+ for ( unsigned i1 = 0 ; i1 < unsigned(lsh.N1) ; ++i1 )
+ for ( unsigned i0 = 0 ; i0 < unsigned(lsh.N0) ; ++i0 )
+ {
+ const long j = & left( i0, i1, i2, i3, i4, i5, i6, i7 ) -
+ & left( 0, 0, 0, 0, 0, 0, 0, 0 );
+ if ( j <= offset || left_alloc <= j ) { update |= 1 ; }
+ offset = j ;
+
+ if ( & left(i0,i1,i2,i3,i4,i5,i6,i7) !=
+ & left_stride(i0,i1,i2,i3,i4,i5,i6,i7) ) {
+ update |= 4 ;
+ }
+ }
+
+ offset = -1 ;
+ for ( unsigned i0 = 0 ; i0 < unsigned(rsh.N0) ; ++i0 )
+ for ( unsigned i1 = 0 ; i1 < unsigned(rsh.N1) ; ++i1 )
+ for ( unsigned i2 = 0 ; i2 < unsigned(rsh.N2) ; ++i2 )
+ for ( unsigned i3 = 0 ; i3 < unsigned(rsh.N3) ; ++i3 )
+ for ( unsigned i4 = 0 ; i4 < unsigned(rsh.N4) ; ++i4 )
+ for ( unsigned i5 = 0 ; i5 < unsigned(rsh.N5) ; ++i5 )
+ for ( unsigned i6 = 0 ; i6 < unsigned(rsh.N6) ; ++i6 )
+ for ( unsigned i7 = 0 ; i7 < unsigned(rsh.N7) ; ++i7 )
+ {
+ const long j = & right( i0, i1, i2, i3, i4, i5, i6, i7 ) -
+ & right( 0, 0, 0, 0, 0, 0, 0, 0 );
+ if ( j <= offset || right_alloc <= j ) { update |= 2 ; }
+ offset = j ;
+
+ if ( & right(i0,i1,i2,i3,i4,i5,i6,i7) !=
+ & right_stride(i0,i1,i2,i3,i4,i5,i6,i7) ) {
+ update |= 8 ;
+ }
+ }
+ }
+};
+
+template< class DataType , class DeviceType >
+struct TestViewOperator_LeftAndRight< DataType , DeviceType , 7 >
+{
+ typedef DeviceType execution_space ;
+ typedef typename execution_space::memory_space memory_space ;
+ typedef typename execution_space::size_type size_type ;
+
+ typedef int value_type ;
+
+ KOKKOS_INLINE_FUNCTION
+ static void join( volatile value_type & update ,
+ const volatile value_type & input )
+ { update |= input ; }
+
+ KOKKOS_INLINE_FUNCTION
+ static void init( value_type & update )
+ { update = 0 ; }
+
+
+ typedef Kokkos::
+ View< DataType, Kokkos::LayoutLeft, execution_space > left_view ;
+
+ typedef Kokkos::
+ View< DataType, Kokkos::LayoutRight, execution_space > right_view ;
+
+ typedef typename left_view ::shape_type left_shape ;
+ typedef typename right_view::shape_type right_shape ;
+
+ left_shape lsh ;
+ right_shape rsh ;
+ left_view left ;
+ right_view right ;
+ long left_alloc ;
+ long right_alloc ;
+
+ TestViewOperator_LeftAndRight()
+ : lsh()
+ , rsh()
+ , left( "left" )
+ , right( "right" )
+ , left_alloc( allocation_count( left ) )
+ , right_alloc( allocation_count( right ) )
+ {}
+
+ static void testit()
+ {
+ TestViewOperator_LeftAndRight driver ;
+
+ ASSERT_TRUE( (long) Kokkos::Impl::cardinality_count( driver.lsh ) <= driver.left_alloc );
+ ASSERT_TRUE( (long) Kokkos::Impl::cardinality_count( driver.rsh ) <= driver.right_alloc );
+
+ int error_flag = 0 ;
+
+ Kokkos::parallel_reduce( 1 , driver , error_flag );
+
+ ASSERT_EQ( error_flag , 0 );
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const size_type , value_type & update ) const
+ {
+ long offset ;
+
+ offset = -1 ;
+ for ( unsigned i6 = 0 ; i6 < unsigned(lsh.N6) ; ++i6 )
+ for ( unsigned i5 = 0 ; i5 < unsigned(lsh.N5) ; ++i5 )
+ for ( unsigned i4 = 0 ; i4 < unsigned(lsh.N4) ; ++i4 )
+ for ( unsigned i3 = 0 ; i3 < unsigned(lsh.N3) ; ++i3 )
+ for ( unsigned i2 = 0 ; i2 < unsigned(lsh.N2) ; ++i2 )
+ for ( unsigned i1 = 0 ; i1 < unsigned(lsh.N1) ; ++i1 )
+ for ( unsigned i0 = 0 ; i0 < unsigned(lsh.N0) ; ++i0 )
+ {
+ const long j = & left( i0, i1, i2, i3, i4, i5, i6 ) -
+ & left( 0, 0, 0, 0, 0, 0, 0 );
+ if ( j <= offset || left_alloc <= j ) { update |= 1 ; }
+ offset = j ;
+ }
+
+ offset = -1 ;
+ for ( unsigned i0 = 0 ; i0 < unsigned(rsh.N0) ; ++i0 )
+ for ( unsigned i1 = 0 ; i1 < unsigned(rsh.N1) ; ++i1 )
+ for ( unsigned i2 = 0 ; i2 < unsigned(rsh.N2) ; ++i2 )
+ for ( unsigned i3 = 0 ; i3 < unsigned(rsh.N3) ; ++i3 )
+ for ( unsigned i4 = 0 ; i4 < unsigned(rsh.N4) ; ++i4 )
+ for ( unsigned i5 = 0 ; i5 < unsigned(rsh.N5) ; ++i5 )
+ for ( unsigned i6 = 0 ; i6 < unsigned(rsh.N6) ; ++i6 )
+ {
+ const long j = & right( i0, i1, i2, i3, i4, i5, i6 ) -
+ & right( 0, 0, 0, 0, 0, 0, 0 );
+ if ( j <= offset || right_alloc <= j ) { update |= 2 ; }
+ offset = j ;
+ }
+ }
+};
+
+template< class DataType , class DeviceType >
+struct TestViewOperator_LeftAndRight< DataType , DeviceType , 6 >
+{
+ typedef DeviceType execution_space ;
+ typedef typename execution_space::memory_space memory_space ;
+ typedef typename execution_space::size_type size_type ;
+
+ typedef int value_type ;
+
+ KOKKOS_INLINE_FUNCTION
+ static void join( volatile value_type & update ,
+ const volatile value_type & input )
+ { update |= input ; }
+
+ KOKKOS_INLINE_FUNCTION
+ static void init( value_type & update )
+ { update = 0 ; }
+
+
+ typedef Kokkos::
+ View< DataType, Kokkos::LayoutLeft, execution_space > left_view ;
+
+ typedef Kokkos::
+ View< DataType, Kokkos::LayoutRight, execution_space > right_view ;
+
+ typedef typename left_view ::shape_type left_shape ;
+ typedef typename right_view::shape_type right_shape ;
+
+ left_shape lsh ;
+ right_shape rsh ;
+ left_view left ;
+ right_view right ;
+ long left_alloc ;
+ long right_alloc ;
+
+ TestViewOperator_LeftAndRight()
+ : lsh()
+ , rsh()
+ , left( "left" )
+ , right( "right" )
+ , left_alloc( allocation_count( left ) )
+ , right_alloc( allocation_count( right ) )
+ {}
+
+ static void testit()
+ {
+ TestViewOperator_LeftAndRight driver ;
+
+ ASSERT_TRUE( (long) Kokkos::Impl::cardinality_count( driver.lsh ) <= driver.left_alloc );
+ ASSERT_TRUE( (long) Kokkos::Impl::cardinality_count( driver.rsh ) <= driver.right_alloc );
+
+ int error_flag = 0 ;
+
+ Kokkos::parallel_reduce( 1 , driver , error_flag );
+
+ ASSERT_EQ( error_flag , 0 );
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const size_type , value_type & update ) const
+ {
+ long offset ;
+
+ offset = -1 ;
+ for ( unsigned i5 = 0 ; i5 < unsigned(lsh.N5) ; ++i5 )
+ for ( unsigned i4 = 0 ; i4 < unsigned(lsh.N4) ; ++i4 )
+ for ( unsigned i3 = 0 ; i3 < unsigned(lsh.N3) ; ++i3 )
+ for ( unsigned i2 = 0 ; i2 < unsigned(lsh.N2) ; ++i2 )
+ for ( unsigned i1 = 0 ; i1 < unsigned(lsh.N1) ; ++i1 )
+ for ( unsigned i0 = 0 ; i0 < unsigned(lsh.N0) ; ++i0 )
+ {
+ const long j = & left( i0, i1, i2, i3, i4, i5 ) -
+ & left( 0, 0, 0, 0, 0, 0 );
+ if ( j <= offset || left_alloc <= j ) { update |= 1 ; }
+ offset = j ;
+ }
+
+ offset = -1 ;
+ for ( unsigned i0 = 0 ; i0 < unsigned(rsh.N0) ; ++i0 )
+ for ( unsigned i1 = 0 ; i1 < unsigned(rsh.N1) ; ++i1 )
+ for ( unsigned i2 = 0 ; i2 < unsigned(rsh.N2) ; ++i2 )
+ for ( unsigned i3 = 0 ; i3 < unsigned(rsh.N3) ; ++i3 )
+ for ( unsigned i4 = 0 ; i4 < unsigned(rsh.N4) ; ++i4 )
+ for ( unsigned i5 = 0 ; i5 < unsigned(rsh.N5) ; ++i5 )
+ {
+ const long j = & right( i0, i1, i2, i3, i4, i5 ) -
+ & right( 0, 0, 0, 0, 0, 0 );
+ if ( j <= offset || right_alloc <= j ) { update |= 2 ; }
+ offset = j ;
+ }
+ }
+};
+
+template< class DataType , class DeviceType >
+struct TestViewOperator_LeftAndRight< DataType , DeviceType , 5 >
+{
+ typedef DeviceType execution_space ;
+ typedef typename execution_space::memory_space memory_space ;
+ typedef typename execution_space::size_type size_type ;
+
+ typedef int value_type ;
+
+ KOKKOS_INLINE_FUNCTION
+ static void join( volatile value_type & update ,
+ const volatile value_type & input )
+ { update |= input ; }
+
+ KOKKOS_INLINE_FUNCTION
+ static void init( value_type & update )
+ { update = 0 ; }
+
+
+ typedef Kokkos::
+ View< DataType, Kokkos::LayoutLeft, execution_space > left_view ;
+
+ typedef Kokkos::
+ View< DataType, Kokkos::LayoutRight, execution_space > right_view ;
+
+ typedef Kokkos::
+ View< DataType, Kokkos::LayoutStride, execution_space > stride_view ;
+
+ typedef typename left_view ::shape_type left_shape ;
+ typedef typename right_view::shape_type right_shape ;
+
+ left_shape lsh ;
+ right_shape rsh ;
+ left_view left ;
+ right_view right ;
+ stride_view left_stride ;
+ stride_view right_stride ;
+ long left_alloc ;
+ long right_alloc ;
+
+ TestViewOperator_LeftAndRight()
+ : lsh()
+ , rsh()
+ , left( "left" )
+ , right( "right" )
+ , left_stride( left )
+ , right_stride( right )
+ , left_alloc( allocation_count( left ) )
+ , right_alloc( allocation_count( right ) )
+ {}
+
+ static void testit()
+ {
+ TestViewOperator_LeftAndRight driver ;
+
+ ASSERT_TRUE( (long) Kokkos::Impl::cardinality_count( driver.lsh ) <= driver.left_alloc );
+ ASSERT_TRUE( (long) Kokkos::Impl::cardinality_count( driver.rsh ) <= driver.right_alloc );
+
+ int error_flag = 0 ;
+
+ Kokkos::parallel_reduce( 1 , driver , error_flag );
+
+ ASSERT_EQ( error_flag , 0 );
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const size_type , value_type & update ) const
+ {
+ long offset ;
+
+ offset = -1 ;
+ for ( unsigned i4 = 0 ; i4 < unsigned(lsh.N4) ; ++i4 )
+ for ( unsigned i3 = 0 ; i3 < unsigned(lsh.N3) ; ++i3 )
+ for ( unsigned i2 = 0 ; i2 < unsigned(lsh.N2) ; ++i2 )
+ for ( unsigned i1 = 0 ; i1 < unsigned(lsh.N1) ; ++i1 )
+ for ( unsigned i0 = 0 ; i0 < unsigned(lsh.N0) ; ++i0 )
+ {
+ const long j = & left( i0, i1, i2, i3, i4 ) -
+ & left( 0, 0, 0, 0, 0 );
+ if ( j <= offset || left_alloc <= j ) { update |= 1 ; }
+ offset = j ;
+
+ if ( & left( i0, i1, i2, i3, i4 ) !=
+ & left_stride( i0, i1, i2, i3, i4 ) ) { update |= 4 ; }
+ }
+
+ offset = -1 ;
+ for ( unsigned i0 = 0 ; i0 < unsigned(rsh.N0) ; ++i0 )
+ for ( unsigned i1 = 0 ; i1 < unsigned(rsh.N1) ; ++i1 )
+ for ( unsigned i2 = 0 ; i2 < unsigned(rsh.N2) ; ++i2 )
+ for ( unsigned i3 = 0 ; i3 < unsigned(rsh.N3) ; ++i3 )
+ for ( unsigned i4 = 0 ; i4 < unsigned(rsh.N4) ; ++i4 )
+ {
+ const long j = & right( i0, i1, i2, i3, i4 ) -
+ & right( 0, 0, 0, 0, 0 );
+ if ( j <= offset || right_alloc <= j ) { update |= 2 ; }
+ offset = j ;
+
+ if ( & right( i0, i1, i2, i3, i4 ) !=
+ & right_stride( i0, i1, i2, i3, i4 ) ) { update |= 8 ; }
+ }
+ }
+};
+
+template< class DataType , class DeviceType >
+struct TestViewOperator_LeftAndRight< DataType , DeviceType , 4 >
+{
+ typedef DeviceType execution_space ;
+ typedef typename execution_space::memory_space memory_space ;
+ typedef typename execution_space::size_type size_type ;
+
+ typedef int value_type ;
+
+ KOKKOS_INLINE_FUNCTION
+ static void join( volatile value_type & update ,
+ const volatile value_type & input )
+ { update |= input ; }
+
+ KOKKOS_INLINE_FUNCTION
+ static void init( value_type & update )
+ { update = 0 ; }
+
+
+ typedef Kokkos::
+ View< DataType, Kokkos::LayoutLeft, execution_space > left_view ;
+
+ typedef Kokkos::
+ View< DataType, Kokkos::LayoutRight, execution_space > right_view ;
+
+ typedef typename left_view ::shape_type left_shape ;
+ typedef typename right_view::shape_type right_shape ;
+
+ left_shape lsh ;
+ right_shape rsh ;
+ left_view left ;
+ right_view right ;
+ long left_alloc ;
+ long right_alloc ;
+
+ TestViewOperator_LeftAndRight()
+ : lsh()
+ , rsh()
+ , left( "left" )
+ , right( "right" )
+ , left_alloc( allocation_count( left ) )
+ , right_alloc( allocation_count( right ) )
+ {}
+
+ static void testit()
+ {
+ TestViewOperator_LeftAndRight driver ;
+
+ ASSERT_TRUE( (long) Kokkos::Impl::cardinality_count( driver.lsh ) <= driver.left_alloc );
+ ASSERT_TRUE( (long) Kokkos::Impl::cardinality_count( driver.rsh ) <= driver.right_alloc );
+
+ int error_flag = 0 ;
+
+ Kokkos::parallel_reduce( 1 , driver , error_flag );
+
+ ASSERT_EQ( error_flag , 0 );
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const size_type , value_type & update ) const
+ {
+ long offset ;
+
+ offset = -1 ;
+ for ( unsigned i3 = 0 ; i3 < unsigned(lsh.N3) ; ++i3 )
+ for ( unsigned i2 = 0 ; i2 < unsigned(lsh.N2) ; ++i2 )
+ for ( unsigned i1 = 0 ; i1 < unsigned(lsh.N1) ; ++i1 )
+ for ( unsigned i0 = 0 ; i0 < unsigned(lsh.N0) ; ++i0 )
+ {
+ const long j = & left( i0, i1, i2, i3 ) -
+ & left( 0, 0, 0, 0 );
+ if ( j <= offset || left_alloc <= j ) { update |= 1 ; }
+ offset = j ;
+ }
+
+ offset = -1 ;
+ for ( unsigned i0 = 0 ; i0 < unsigned(rsh.N0) ; ++i0 )
+ for ( unsigned i1 = 0 ; i1 < unsigned(rsh.N1) ; ++i1 )
+ for ( unsigned i2 = 0 ; i2 < unsigned(rsh.N2) ; ++i2 )
+ for ( unsigned i3 = 0 ; i3 < unsigned(rsh.N3) ; ++i3 )
+ {
+ const long j = & right( i0, i1, i2, i3 ) -
+ & right( 0, 0, 0, 0 );
+ if ( j <= offset || right_alloc <= j ) { update |= 2 ; }
+ offset = j ;
+ }
+ }
+};
+
+template< class DataType , class DeviceType >
+struct TestViewOperator_LeftAndRight< DataType , DeviceType , 3 >
+{
+ typedef DeviceType execution_space ;
+ typedef typename execution_space::memory_space memory_space ;
+ typedef typename execution_space::size_type size_type ;
+
+ typedef int value_type ;
+
+ KOKKOS_INLINE_FUNCTION
+ static void join( volatile value_type & update ,
+ const volatile value_type & input )
+ { update |= input ; }
+
+ KOKKOS_INLINE_FUNCTION
+ static void init( value_type & update )
+ { update = 0 ; }
+
+
+ typedef Kokkos::
+ View< DataType, Kokkos::LayoutLeft, execution_space > left_view ;
+
+ typedef Kokkos::
+ View< DataType, Kokkos::LayoutRight, execution_space > right_view ;
+
+ typedef Kokkos::
+ View< DataType, Kokkos::LayoutStride, execution_space > stride_view ;
+
+ typedef typename left_view ::shape_type left_shape ;
+ typedef typename right_view::shape_type right_shape ;
+
+ left_shape lsh ;
+ right_shape rsh ;
+ left_view left ;
+ right_view right ;
+ stride_view left_stride ;
+ stride_view right_stride ;
+ long left_alloc ;
+ long right_alloc ;
+
+ TestViewOperator_LeftAndRight()
+ : lsh()
+ , rsh()
+ , left( std::string("left") )
+ , right( std::string("right") )
+ , left_stride( left )
+ , right_stride( right )
+ , left_alloc( allocation_count( left ) )
+ , right_alloc( allocation_count( right ) )
+ {}
+
+ static void testit()
+ {
+ TestViewOperator_LeftAndRight driver ;
+
+ ASSERT_TRUE( (long) Kokkos::Impl::cardinality_count( driver.lsh ) <= driver.left_alloc );
+ ASSERT_TRUE( (long) Kokkos::Impl::cardinality_count( driver.rsh ) <= driver.right_alloc );
+
+ int error_flag = 0 ;
+
+ Kokkos::parallel_reduce( 1 , driver , error_flag );
+
+ ASSERT_EQ( error_flag , 0 );
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const size_type , value_type & update ) const
+ {
+ long offset ;
+
+ offset = -1 ;
+ for ( unsigned i2 = 0 ; i2 < unsigned(lsh.N2) ; ++i2 )
+ for ( unsigned i1 = 0 ; i1 < unsigned(lsh.N1) ; ++i1 )
+ for ( unsigned i0 = 0 ; i0 < unsigned(lsh.N0) ; ++i0 )
+ {
+ const long j = & left( i0, i1, i2 ) -
+ & left( 0, 0, 0 );
+ if ( j <= offset || left_alloc <= j ) { update |= 1 ; }
+ offset = j ;
+
+ if ( & left(i0,i1,i2) != & left_stride(i0,i1,i2) ) { update |= 4 ; }
+ }
+
+ offset = -1 ;
+ for ( unsigned i0 = 0 ; i0 < unsigned(rsh.N0) ; ++i0 )
+ for ( unsigned i1 = 0 ; i1 < unsigned(rsh.N1) ; ++i1 )
+ for ( unsigned i2 = 0 ; i2 < unsigned(rsh.N2) ; ++i2 )
+ {
+ const long j = & right( i0, i1, i2 ) -
+ & right( 0, 0, 0 );
+ if ( j <= offset || right_alloc <= j ) { update |= 2 ; }
+ offset = j ;
+
+ if ( & right(i0,i1,i2) != & right_stride(i0,i1,i2) ) { update |= 8 ; }
+ }
+
+ for ( unsigned i0 = 0 ; i0 < unsigned(lsh.N0) ; ++i0 )
+ for ( unsigned i1 = 0 ; i1 < unsigned(lsh.N1) ; ++i1 )
+ for ( unsigned i2 = 0 ; i2 < unsigned(lsh.N2) ; ++i2 )
+ {
+ if ( & left(i0,i1,i2) != & left.at(i0,i1,i2,0,0,0,0,0) ) { update |= 3 ; }
+ if ( & right(i0,i1,i2) != & right.at(i0,i1,i2,0,0,0,0,0) ) { update |= 3 ; }
+ }
+ }
+};
+
+template< class DataType , class DeviceType >
+struct TestViewOperator_LeftAndRight< DataType , DeviceType , 2 >
+{
+ typedef DeviceType execution_space ;
+ typedef typename execution_space::memory_space memory_space ;
+ typedef typename execution_space::size_type size_type ;
+
+ typedef int value_type ;
+
+ KOKKOS_INLINE_FUNCTION
+ static void join( volatile value_type & update ,
+ const volatile value_type & input )
+ { update |= input ; }
+
+ KOKKOS_INLINE_FUNCTION
+ static void init( value_type & update )
+ { update = 0 ; }
+
+
+ typedef Kokkos::
+ View< DataType, Kokkos::LayoutLeft, execution_space > left_view ;
+
+ typedef Kokkos::
+ View< DataType, Kokkos::LayoutRight, execution_space > right_view ;
+
+ typedef typename left_view ::shape_type left_shape ;
+ typedef typename right_view::shape_type right_shape ;
+
+ left_shape lsh ;
+ right_shape rsh ;
+ left_view left ;
+ right_view right ;
+ long left_alloc ;
+ long right_alloc ;
+
+ TestViewOperator_LeftAndRight()
+ : lsh()
+ , rsh()
+ , left( Kokkos::ViewAllocate("left") )
+ , right( Kokkos::ViewAllocate("right") )
+ , left_alloc( allocation_count( left ) )
+ , right_alloc( allocation_count( right ) )
+ {}
+
+ static void testit()
+ {
+ TestViewOperator_LeftAndRight driver ;
+
+ ASSERT_TRUE( (long) Kokkos::Impl::cardinality_count( driver.lsh ) <= driver.left_alloc );
+ ASSERT_TRUE( (long) Kokkos::Impl::cardinality_count( driver.rsh ) <= driver.right_alloc );
+
+ int error_flag = 0 ;
+
+ Kokkos::parallel_reduce( 1 , driver , error_flag );
+
+ ASSERT_EQ( error_flag , 0 );
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const size_type , value_type & update ) const
+ {
+ long offset ;
+
+ offset = -1 ;
+ for ( unsigned i1 = 0 ; i1 < unsigned(lsh.N1) ; ++i1 )
+ for ( unsigned i0 = 0 ; i0 < unsigned(lsh.N0) ; ++i0 )
+ {
+ const long j = & left( i0, i1 ) -
+ & left( 0, 0 );
+ if ( j <= offset || left_alloc <= j ) { update |= 1 ; }
+ offset = j ;
+ }
+
+ offset = -1 ;
+ for ( unsigned i0 = 0 ; i0 < unsigned(rsh.N0) ; ++i0 )
+ for ( unsigned i1 = 0 ; i1 < unsigned(rsh.N1) ; ++i1 )
+ {
+ const long j = & right( i0, i1 ) -
+ & right( 0, 0 );
+ if ( j <= offset || right_alloc <= j ) { update |= 2 ; }
+ offset = j ;
+ }
+
+ for ( unsigned i0 = 0 ; i0 < unsigned(lsh.N0) ; ++i0 )
+ for ( unsigned i1 = 0 ; i1 < unsigned(lsh.N1) ; ++i1 )
+ {
+ if ( & left(i0,i1) != & left.at(i0,i1,0,0,0,0,0,0) ) { update |= 3 ; }
+ if ( & right(i0,i1) != & right.at(i0,i1,0,0,0,0,0,0) ) { update |= 3 ; }
+ }
+ }
+};
+
+template< class DataType , class DeviceType >
+struct TestViewOperator_LeftAndRight< DataType , DeviceType , 1 >
+{
+ typedef DeviceType execution_space ;
+ typedef typename execution_space::memory_space memory_space ;
+ typedef typename execution_space::size_type size_type ;
+
+ typedef int value_type ;
+
+ KOKKOS_INLINE_FUNCTION
+ static void join( volatile value_type & update ,
+ const volatile value_type & input )
+ { update |= input ; }
+
+ KOKKOS_INLINE_FUNCTION
+ static void init( value_type & update )
+ { update = 0 ; }
+
+
+ typedef Kokkos::
+ View< DataType, Kokkos::LayoutLeft, execution_space > left_view ;
+
+ typedef Kokkos::
+ View< DataType, Kokkos::LayoutRight, execution_space > right_view ;
+
+ typedef Kokkos::
+ View< DataType, Kokkos::LayoutStride, execution_space > stride_view ;
+
+ typedef typename left_view ::shape_type left_shape ;
+ typedef typename right_view::shape_type right_shape ;
+
+ left_shape lsh ;
+ right_shape rsh ;
+ left_view left ;
+ right_view right ;
+ stride_view left_stride ;
+ stride_view right_stride ;
+ long left_alloc ;
+ long right_alloc ;
+
+ TestViewOperator_LeftAndRight()
+ : lsh()
+ , rsh()
+ , left( Kokkos::ViewAllocate() )
+ , right( Kokkos::ViewAllocate() )
+ , left_stride( left )
+ , right_stride( right )
+ , left_alloc( allocation_count( left ) )
+ , right_alloc( allocation_count( right ) )
+ {}
+
+ static void testit()
+ {
+ TestViewOperator_LeftAndRight driver ;
+
+ ASSERT_TRUE( (long) Kokkos::Impl::cardinality_count( driver.lsh ) <= driver.left_alloc );
+ ASSERT_TRUE( (long) Kokkos::Impl::cardinality_count( driver.rsh ) <= driver.right_alloc );
+
+ int error_flag = 0 ;
+
+ Kokkos::parallel_reduce( 1 , driver , error_flag );
+
+ ASSERT_EQ( error_flag , 0 );
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const size_type , value_type & update ) const
+ {
+ for ( unsigned i0 = 0 ; i0 < unsigned(lsh.N0) ; ++i0 )
+ {
+ if ( & left(i0) != & left.at(i0,0,0,0,0,0,0,0) ) { update |= 3 ; }
+ if ( & right(i0) != & right.at(i0,0,0,0,0,0,0,0) ) { update |= 3 ; }
+ if ( & left(i0) != & left_stride(i0) ) { update |= 4 ; }
+ if ( & right(i0) != & right_stride(i0) ) { update |= 8 ; }
+ }
+ }
+};
+
+/*--------------------------------------------------------------------------*/
+
+template< typename T, class DeviceType >
+class TestViewAPI
+{
+public:
+ typedef DeviceType device ;
+
+ enum { N0 = 1000 ,
+ N1 = 3 ,
+ N2 = 5 ,
+ N3 = 7 };
+
+ typedef Kokkos::View< T , device > dView0 ;
+ typedef Kokkos::View< T* , device > dView1 ;
+ typedef Kokkos::View< T*[N1] , device > dView2 ;
+ typedef Kokkos::View< T*[N1][N2] , device > dView3 ;
+ typedef Kokkos::View< T*[N1][N2][N3] , device > dView4 ;
+ typedef Kokkos::View< const T*[N1][N2][N3] , device > const_dView4 ;
+
+ typedef Kokkos::View< T****, device, Kokkos::MemoryUnmanaged > dView4_unmanaged ;
+
+ typedef typename dView0::host_mirror_space host ;
+
+ TestViewAPI()
+ {
+ run_test_mirror();
+ run_test();
+ run_test_scalar();
+ run_test_const();
+ run_test_subview();
+ run_test_subview_strided();
+ run_test_vector();
+
+ TestViewOperator< T , device >::testit();
+ TestViewOperator_LeftAndRight< int[2][3][4][2][3][4][2][3] , device >::testit();
+ TestViewOperator_LeftAndRight< int[2][3][4][2][3][4][2] , device >::testit();
+ TestViewOperator_LeftAndRight< int[2][3][4][2][3][4] , device >::testit();
+ TestViewOperator_LeftAndRight< int[2][3][4][2][3] , device >::testit();
+ TestViewOperator_LeftAndRight< int[2][3][4][2] , device >::testit();
+ TestViewOperator_LeftAndRight< int[2][3][4] , device >::testit();
+ TestViewOperator_LeftAndRight< int[2][3] , device >::testit();
+ TestViewOperator_LeftAndRight< int[2] , device >::testit();
+ }
+
+ static void run_test_mirror()
+ {
+ typedef Kokkos::View< int , host > view_type ;
+ typedef typename view_type::HostMirror mirror_type ;
+ view_type a("a");
+ mirror_type am = Kokkos::create_mirror_view(a);
+ mirror_type ax = Kokkos::create_mirror(a);
+ ASSERT_EQ( & a() , & am() );
+ }
+
+ static void run_test_scalar()
+ {
+ typedef typename dView0::HostMirror hView0 ;
+
+ dView0 dx , dy ;
+ hView0 hx , hy ;
+
+ dx = dView0( "dx" );
+ dy = dView0( "dy" );
+
+ hx = Kokkos::create_mirror( dx );
+ hy = Kokkos::create_mirror( dy );
+
+ hx = 1 ;
+
+ Kokkos::deep_copy( dx , hx );
+ Kokkos::deep_copy( dy , dx );
+ Kokkos::deep_copy( hy , dy );
+
+ ASSERT_EQ( hx(), hy() );
+ }
+
+ static void run_test()
+ {
+    // mfh 14 Feb 2014: This test does not otherwise use objects of these
+    // typedef'd types.  To silence "declared but unused typedef" warnings,
+    // we default-construct one instance of each type and cast it to void,
+    // the usual idiom for suppressing unused-variable warnings.
+
+ typedef typename dView0::HostMirror hView0 ;
+ typedef typename dView1::HostMirror hView1 ;
+ typedef typename dView2::HostMirror hView2 ;
+ typedef typename dView3::HostMirror hView3 ;
+ typedef typename dView4::HostMirror hView4 ;
+
+ {
+ hView0 thing;
+ (void) thing;
+ }
+ {
+ hView1 thing;
+ (void) thing;
+ }
+ {
+ hView2 thing;
+ (void) thing;
+ }
+ {
+ hView3 thing;
+ (void) thing;
+ }
+ {
+ hView4 thing;
+ (void) thing;
+ }
+
+ dView4 dx , dy , dz ;
+ hView4 hx , hy , hz ;
+
+ ASSERT_TRUE( dx.is_null() );
+ ASSERT_TRUE( dy.is_null() );
+ ASSERT_TRUE( dz.is_null() );
+ ASSERT_TRUE( hx.is_null() );
+ ASSERT_TRUE( hy.is_null() );
+ ASSERT_TRUE( hz.is_null() );
+ ASSERT_EQ( dx.dimension_0() , 0u );
+ ASSERT_EQ( dy.dimension_0() , 0u );
+ ASSERT_EQ( dz.dimension_0() , 0u );
+ ASSERT_EQ( hx.dimension_0() , 0u );
+ ASSERT_EQ( hy.dimension_0() , 0u );
+ ASSERT_EQ( hz.dimension_0() , 0u );
+ ASSERT_EQ( dx.dimension_1() , unsigned(N1) );
+ ASSERT_EQ( dy.dimension_1() , unsigned(N1) );
+ ASSERT_EQ( dz.dimension_1() , unsigned(N1) );
+ ASSERT_EQ( hx.dimension_1() , unsigned(N1) );
+ ASSERT_EQ( hy.dimension_1() , unsigned(N1) );
+ ASSERT_EQ( hz.dimension_1() , unsigned(N1) );
+
+ dx = dView4( "dx" , N0 );
+ dy = dView4( "dy" , N0 );
+
+ dView4_unmanaged unmanaged_dx = dx;
+ dView4_unmanaged unmanaged_from_ptr_dx = dView4_unmanaged(dx.ptr_on_device(),
+ dx.dimension_0(),
+ dx.dimension_1(),
+ dx.dimension_2(),
+ dx.dimension_3());
+
+ {
+ // Destruction of this view should be harmless
+ const_dView4 unmanaged_from_ptr_const_dx( dx.ptr_on_device() ,
+ dx.dimension_0() ,
+ dx.dimension_1() ,
+ dx.dimension_2() ,
+ dx.dimension_3() );
+ }
+
+ const_dView4 const_dx = dx ;
+
+
+ ASSERT_FALSE( dx.is_null() );
+ ASSERT_FALSE( const_dx.is_null() );
+ ASSERT_FALSE( unmanaged_dx.is_null() );
+ ASSERT_FALSE( unmanaged_from_ptr_dx.is_null() );
+ ASSERT_FALSE( dy.is_null() );
+ ASSERT_NE( dx , dy );
+
+ ASSERT_EQ( dx.dimension_0() , unsigned(N0) );
+ ASSERT_EQ( dx.dimension_1() , unsigned(N1) );
+ ASSERT_EQ( dx.dimension_2() , unsigned(N2) );
+ ASSERT_EQ( dx.dimension_3() , unsigned(N3) );
+
+ ASSERT_EQ( dy.dimension_0() , unsigned(N0) );
+ ASSERT_EQ( dy.dimension_1() , unsigned(N1) );
+ ASSERT_EQ( dy.dimension_2() , unsigned(N2) );
+ ASSERT_EQ( dy.dimension_3() , unsigned(N3) );
+
+ ASSERT_EQ( unmanaged_from_ptr_dx.capacity(),unsigned(N0)*unsigned(N1)*unsigned(N2)*unsigned(N3) );
+
+ hx = Kokkos::create_mirror( dx );
+ hy = Kokkos::create_mirror( dy );
+
+ // T v1 = hx() ; // Generates compile error as intended
+ // T v2 = hx(0,0) ; // Generates compile error as intended
+ // hx(0,0) = v2 ; // Generates compile error as intended
+
+ size_t count = 0 ;
+ for ( size_t ip = 0 ; ip < N0 ; ++ip ) {
+ for ( size_t i1 = 0 ; i1 < hx.dimension_1() ; ++i1 ) {
+ for ( size_t i2 = 0 ; i2 < hx.dimension_2() ; ++i2 ) {
+ for ( size_t i3 = 0 ; i3 < hx.dimension_3() ; ++i3 ) {
+ hx(ip,i1,i2,i3) = ++count ;
+ }}}}
+
+ Kokkos::deep_copy( dx , hx );
+ Kokkos::deep_copy( dy , dx );
+ Kokkos::deep_copy( hy , dy );
+
+ for ( size_t ip = 0 ; ip < N0 ; ++ip ) {
+ for ( size_t i1 = 0 ; i1 < N1 ; ++i1 ) {
+ for ( size_t i2 = 0 ; i2 < N2 ; ++i2 ) {
+ for ( size_t i3 = 0 ; i3 < N3 ; ++i3 ) {
+ { ASSERT_EQ( hx(ip,i1,i2,i3) , hy(ip,i1,i2,i3) ); }
+ }}}}
+
+ Kokkos::deep_copy( dx , T(0) );
+ Kokkos::deep_copy( hx , dx );
+
+ for ( size_t ip = 0 ; ip < N0 ; ++ip ) {
+ for ( size_t i1 = 0 ; i1 < N1 ; ++i1 ) {
+ for ( size_t i2 = 0 ; i2 < N2 ; ++i2 ) {
+ for ( size_t i3 = 0 ; i3 < N3 ; ++i3 ) {
+ { ASSERT_EQ( hx(ip,i1,i2,i3) , T(0) ); }
+ }}}}
+
+ dz = dx ; ASSERT_EQ( dx, dz); ASSERT_NE( dy, dz);
+ dz = dy ; ASSERT_EQ( dy, dz); ASSERT_NE( dx, dz);
+
+ dx = dView4();
+ ASSERT_TRUE( dx.is_null() );
+ ASSERT_FALSE( dy.is_null() );
+ ASSERT_FALSE( dz.is_null() );
+ dy = dView4();
+ ASSERT_TRUE( dx.is_null() );
+ ASSERT_TRUE( dy.is_null() );
+ ASSERT_FALSE( dz.is_null() );
+ dz = dView4();
+ ASSERT_TRUE( dx.is_null() );
+ ASSERT_TRUE( dy.is_null() );
+ ASSERT_TRUE( dz.is_null() );
+ }
+
+ typedef T DataType[2] ;
+
+ static void
+ check_auto_conversion_to_const(
+ const Kokkos::View< const DataType , device > & arg_const ,
+ const Kokkos::View< DataType , device > & arg )
+ {
+ ASSERT_TRUE( arg_const == arg );
+ }
+
+ static void run_test_const()
+ {
+ typedef Kokkos::View< DataType , device > typeX ;
+ typedef Kokkos::View< const DataType , device > const_typeX ;
+ typedef Kokkos::View< const DataType , device , Kokkos::MemoryRandomAccess > const_typeR ;
+ typeX x( "X" );
+ const_typeX xc = x ;
+ const_typeR xr = x ;
+
+ ASSERT_TRUE( xc == x );
+ ASSERT_TRUE( x == xc );
+ ASSERT_TRUE( x.ptr_on_device() == xr.ptr_on_device() );
+
+ // typeX xf = xc ; // setting non-const from const must not compile
+
+ check_auto_conversion_to_const( x , x );
+ }
+
+ static void run_test_subview()
+ {
+ typedef Kokkos::View< const T , device > sView ;
+
+ dView0 d0( "d0" );
+ dView1 d1( "d1" , N0 );
+ dView2 d2( "d2" , N0 );
+ dView3 d3( "d3" , N0 );
+ dView4 d4( "d4" , N0 );
+
+ sView s0 = d0 ;
+ sView s1 = Kokkos::subview( d1 , 1 );
+ sView s2 = Kokkos::subview( d2 , 1 , 1 );
+ sView s3 = Kokkos::subview( d3 , 1 , 1 , 1 );
+ sView s4 = Kokkos::subview( d4 , 1 , 1 , 1 , 1 );
+ }
+
+ static void run_test_subview_strided()
+ {
+ typedef Kokkos::View< int **** , Kokkos::LayoutLeft , host > view_left_4 ;
+ typedef Kokkos::View< int **** , Kokkos::LayoutRight , host > view_right_4 ;
+ typedef Kokkos::View< int ** , Kokkos::LayoutLeft , host > view_left_2 ;
+ typedef Kokkos::View< int ** , Kokkos::LayoutRight , host > view_right_2 ;
+
+ typedef Kokkos::View< int * , Kokkos::LayoutStride , host > view_stride_1 ;
+ typedef Kokkos::View< int ** , Kokkos::LayoutStride , host > view_stride_2 ;
+
+ view_left_2 xl2("xl2", 100 , 200 );
+ view_right_2 xr2("xr2", 100 , 200 );
+ view_stride_1 yl1 = Kokkos::subview( xl2 , 0 , Kokkos::ALL() );
+ view_stride_1 yl2 = Kokkos::subview( xl2 , 1 , Kokkos::ALL() );
+ view_stride_1 yr1 = Kokkos::subview( xr2 , 0 , Kokkos::ALL() );
+ view_stride_1 yr2 = Kokkos::subview( xr2 , 1 , Kokkos::ALL() );
+
+ ASSERT_EQ( yl1.dimension_0() , xl2.dimension_1() );
+ ASSERT_EQ( yl2.dimension_0() , xl2.dimension_1() );
+ ASSERT_EQ( yr1.dimension_0() , xr2.dimension_1() );
+ ASSERT_EQ( yr2.dimension_0() , xr2.dimension_1() );
+
+ ASSERT_EQ( & yl1(0) - & xl2(0,0) , 0 );
+ ASSERT_EQ( & yl2(0) - & xl2(1,0) , 0 );
+ ASSERT_EQ( & yr1(0) - & xr2(0,0) , 0 );
+ ASSERT_EQ( & yr2(0) - & xr2(1,0) , 0 );
+
+ view_left_4 xl4( "xl4" , 10 , 20 , 30 , 40 );
+ view_right_4 xr4( "xr4" , 10 , 20 , 30 , 40 );
+
+ view_stride_2 yl4 = Kokkos::subview( xl4 , 1 , Kokkos::ALL() , 2 , Kokkos::ALL() );
+ view_stride_2 yr4 = Kokkos::subview( xr4 , 1 , Kokkos::ALL() , 2 , Kokkos::ALL() );
+
+ ASSERT_EQ( yl4.dimension_0() , xl4.dimension_1() );
+ ASSERT_EQ( yl4.dimension_1() , xl4.dimension_3() );
+ ASSERT_EQ( yr4.dimension_0() , xr4.dimension_1() );
+ ASSERT_EQ( yr4.dimension_1() , xr4.dimension_3() );
+
+ ASSERT_EQ( & yl4(4,4) - & xl4(1,4,2,4) , 0 );
+ ASSERT_EQ( & yr4(4,4) - & xr4(1,4,2,4) , 0 );
+ }
+
+ static void run_test_vector()
+ {
+ static const unsigned Length = 1000 , Count = 8 ;
+
+ typedef Kokkos::View< T* , Kokkos::LayoutLeft , host > vector_type ;
+ typedef Kokkos::View< T** , Kokkos::LayoutLeft , host > multivector_type ;
+
+ typedef Kokkos::View< T* , Kokkos::LayoutRight , host > vector_right_type ;
+ typedef Kokkos::View< T** , Kokkos::LayoutRight , host > multivector_right_type ;
+
+ typedef Kokkos::View< const T* , Kokkos::LayoutRight, host > const_vector_right_type ;
+ typedef Kokkos::View< const T* , Kokkos::LayoutLeft , host > const_vector_type ;
+ typedef Kokkos::View< const T** , Kokkos::LayoutLeft , host > const_multivector_type ;
+
+ multivector_type mv = multivector_type( "mv" , Length , Count );
+ multivector_right_type mv_right = multivector_right_type( "mv" , Length , Count );
+
+ vector_type v1 = Kokkos::subview( mv , Kokkos::ALL() , 0 );
+ vector_type v2 = Kokkos::subview( mv , Kokkos::ALL() , 1 );
+ vector_type v3 = Kokkos::subview( mv , Kokkos::ALL() , 2 );
+
+ vector_type rv1 = Kokkos::subview( mv_right , 0 , Kokkos::ALL() );
+ vector_type rv2 = Kokkos::subview( mv_right , 1 , Kokkos::ALL() );
+ vector_type rv3 = Kokkos::subview( mv_right , 2 , Kokkos::ALL() );
+
+ multivector_type mv1 = Kokkos::subview( mv , std::make_pair( 1 , 998 ) ,
+ std::make_pair( 2 , 5 ) );
+
+ multivector_right_type mvr1 =
+ Kokkos::subview( mv_right ,
+ std::make_pair( 1 , 998 ) ,
+ std::make_pair( 2 , 5 ) );
+
+ const_vector_type cv1 = Kokkos::subview( mv , Kokkos::ALL(), 0 );
+ const_vector_type cv2 = Kokkos::subview( mv , Kokkos::ALL(), 1 );
+ const_vector_type cv3 = Kokkos::subview( mv , Kokkos::ALL(), 2 );
+
+ vector_right_type vr1 = Kokkos::subview( mv , Kokkos::ALL() , 0 );
+ vector_right_type vr2 = Kokkos::subview( mv , Kokkos::ALL() , 1 );
+ vector_right_type vr3 = Kokkos::subview( mv , Kokkos::ALL() , 2 );
+
+ const_vector_right_type cvr1 = Kokkos::subview( mv , Kokkos::ALL() , 0 );
+ const_vector_right_type cvr2 = Kokkos::subview( mv , Kokkos::ALL() , 1 );
+ const_vector_right_type cvr3 = Kokkos::subview( mv , Kokkos::ALL() , 2 );
+
+ ASSERT_TRUE( & v1[0] == & v1(0) );
+ ASSERT_TRUE( & v1[0] == & mv(0,0) );
+ ASSERT_TRUE( & v2[0] == & mv(0,1) );
+ ASSERT_TRUE( & v3[0] == & mv(0,2) );
+
+ ASSERT_TRUE( & cv1[0] == & mv(0,0) );
+ ASSERT_TRUE( & cv2[0] == & mv(0,1) );
+ ASSERT_TRUE( & cv3[0] == & mv(0,2) );
+
+ ASSERT_TRUE( & vr1[0] == & mv(0,0) );
+ ASSERT_TRUE( & vr2[0] == & mv(0,1) );
+ ASSERT_TRUE( & vr3[0] == & mv(0,2) );
+
+ ASSERT_TRUE( & cvr1[0] == & mv(0,0) );
+ ASSERT_TRUE( & cvr2[0] == & mv(0,1) );
+ ASSERT_TRUE( & cvr3[0] == & mv(0,2) );
+
+ ASSERT_TRUE( & mv1(0,0) == & mv( 1 , 2 ) );
+ ASSERT_TRUE( & mv1(1,1) == & mv( 2 , 3 ) );
+ ASSERT_TRUE( & mv1(3,2) == & mv( 4 , 4 ) );
+ ASSERT_TRUE( & mvr1(0,0) == & mv_right( 1 , 2 ) );
+ ASSERT_TRUE( & mvr1(1,1) == & mv_right( 2 , 3 ) );
+ ASSERT_TRUE( & mvr1(3,2) == & mv_right( 4 , 4 ) );
+
+ const_vector_type c_cv1( v1 );
+ typename vector_type::const_type c_cv2( v2 );
+ typename const_vector_type::const_type c_ccv2( v2 );
+
+ const_multivector_type cmv( mv );
+ typename multivector_type::const_type cmvX( cmv );
+ typename const_multivector_type::const_type ccmvX( cmv );
+ }
+};
+
+} // namespace Test
+
+#endif
+
+/*--------------------------------------------------------------------------*/
+
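The TestViewOperator_LeftAndRight specializations above iterate over every index of a LayoutLeft and a LayoutRight view and check that the pointer offsets increase monotonically, stay inside the allocation, and agree with an equivalent LayoutStride view. A minimal stand-alone host sketch of the layout property being tested (not part of the patch; the extents, the HostSpace memory space, and the main() scaffolding are illustrative assumptions) might look like this:

#include <Kokkos_Core.hpp>
#include <cstdio>

int main( int argc , char ** argv )
{
  Kokkos::initialize( argc , argv );
  {
    const int N0 = 4 , N1 = 3 ;  // hypothetical extents, for illustration only

    Kokkos::View< int** , Kokkos::LayoutLeft  , Kokkos::HostSpace > a( "a" , N0 , N1 );
    Kokkos::View< int** , Kokkos::LayoutRight , Kokkos::HostSpace > b( "b" , N0 , N1 );

    // LayoutLeft:  the leftmost index is stride-1, so consecutive i0 are adjacent.
    // LayoutRight: the rightmost index is stride-1, so consecutive i1 are adjacent.
    std::printf( "LayoutLeft  &a(1,0) - &a(0,0) = %ld\n" , (long)( & a(1,0) - & a(0,0) ) );
    std::printf( "LayoutRight &b(0,1) - &b(0,0) = %ld\n" , (long)( & b(0,1) - & b(0,0) ) );
  }
  Kokkos::finalize();
  return 0 ;
}

Both printed differences are 1; the test code above asserts the generalization of this ordering for ranks one through eight.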
diff --git a/lib/kokkos/core/unit_test/TestViewImpl.hpp b/lib/kokkos/core/unit_test/TestViewImpl.hpp
new file mode 100755
index 000000000..c51588777
--- /dev/null
+++ b/lib/kokkos/core/unit_test/TestViewImpl.hpp
@@ -0,0 +1,289 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <gtest/gtest.h>
+
+#include <stdexcept>
+#include <sstream>
+#include <iostream>
+
+#include <Kokkos_Core.hpp>
+
+/*--------------------------------------------------------------------------*/
+
+#if defined( KOKKOS_USING_EXPERIMENTAL_VIEW )
+
+namespace Test {
+
+template < class Device >
+void test_view_impl() {}
+
+}
+
+#else
+
+/*--------------------------------------------------------------------------*/
+
+namespace Test {
+
+struct DummyMemorySpace
+{
+ typedef DummyMemorySpace memory_space ;
+ typedef unsigned size_type ;
+};
+
+/*--------------------------------------------------------------------------*/
+
+template< class Type >
+struct DefineShape {
+ typedef typename Kokkos::Impl::AnalyzeShape<Type>::shape type ;
+};
+
+template< class Type >
+struct ExtractValueType {
+ typedef typename Kokkos::Impl::AnalyzeShape<Type>::value_type type ;
+};
+
+template< class Type >
+struct ArrayType { typedef Type type ; };
+
+template < class Device >
+void test_view_impl()
+{
+ //typedef typename Device::memory_space memory_space ; // unused
+
+ typedef ArrayType< int[100] >::type type_01 ;
+ typedef ArrayType< int* >::type type_11 ;
+ typedef ArrayType< int[5][6][700] >::type type_03 ;
+ typedef ArrayType< double*[8][9][900] >::type type_14 ;
+ typedef ArrayType< long** >::type type_22 ;
+ typedef ArrayType< short **[5][6][7] >::type type_25 ;
+ typedef ArrayType< const short **[5][6][7] >::type const_type_25 ;
+ typedef ArrayType< short***[5][6][7] >::type type_36 ;
+ typedef ArrayType< const short***[5][6][7] >::type const_type_36 ;
+
+ // mfh 14 Feb 2014: With gcc 4.8.2 -Wall, this emits a warning:
+ //
+ // typedef ‘ok_const_25’ locally defined but not used [-Wunused-local-typedefs]
+ //
+ // It's unfortunate that this is the case, because the typedef is
+ // being used for a compile-time check! We deal with this by
+ // declaring an instance of ok_const_25, and marking it with
+ // "(void)" so that instance doesn't emit an "unused variable"
+ // warning.
+ //
+ // typedef typename Kokkos::Impl::StaticAssertSame<
+ // typename Kokkos::Impl::AnalyzeShape<type_25>::const_type ,
+ // typename Kokkos::Impl::AnalyzeShape<const_type_25>::type
+ // > ok_const_25 ;
+
+ typedef typename Kokkos::Impl::StaticAssertSame<
+ typename Kokkos::Impl::AnalyzeShape<type_25>::const_type,
+ typename Kokkos::Impl::AnalyzeShape<const_type_25>::type
+ > ok_const_25 ;
+
+ typedef typename Kokkos::Impl::StaticAssertSame<
+ typename Kokkos::Impl::AnalyzeShape<type_36>::const_type,
+ typename Kokkos::Impl::AnalyzeShape<const_type_36>::type
+ > ok_const_36 ;
+ {
+ ok_const_25 thing_25 ;
+ ok_const_36 thing_36 ;
+ (void) thing_25 ; // silence warning for unused variable
+ (void) thing_36 ; // silence warning for unused variable
+ }
+
+ ASSERT_TRUE( ( Kokkos::Impl::is_same< ExtractValueType<type_03>::type , int >::value ) );
+ ASSERT_TRUE( ( Kokkos::Impl::is_same< ExtractValueType<type_14>::type , double >::value ) );
+ ASSERT_TRUE( ( Kokkos::Impl::is_same< ExtractValueType<type_22>::type , long >::value ) );
+ ASSERT_TRUE( ( Kokkos::Impl::is_same< ExtractValueType<type_36>::type , short >::value ) );
+
+ ASSERT_FALSE( ( Kokkos::Impl::is_same< ExtractValueType<type_36>::type , int >::value ) );
+
+ typedef typename DefineShape< type_01 >::type shape_01_type ;
+ typedef typename DefineShape< type_11 >::type shape_11_type ;
+ typedef typename DefineShape< type_03 >::type shape_03_type ;
+ typedef typename DefineShape< type_14 >::type shape_14_type ;
+ typedef typename DefineShape< type_22 >::type shape_22_type ;
+ typedef typename DefineShape< type_36 >::type shape_36_type ;
+
+ ASSERT_TRUE( ( Kokkos::Impl::StaticAssert< shape_36_type::rank == 6 >::value ) );
+ ASSERT_TRUE( ( Kokkos::Impl::StaticAssert< shape_03_type::rank == 3 >::value ) );
+
+ shape_01_type shape_01 ; shape_01_type::assign( shape_01 );
+ shape_11_type shape_11 ; shape_11_type::assign( shape_11, 1000 );
+ shape_03_type shape_03 ; shape_03_type::assign( shape_03 );
+ shape_14_type shape_14 ; shape_14_type::assign( shape_14 , 0 );
+ shape_22_type shape_22 ; shape_22_type::assign( shape_22 , 0 , 0 );
+ shape_36_type shape_36 ; shape_36_type::assign( shape_36 , 10 , 20 , 30 );
+
+ ASSERT_TRUE( shape_01.rank_dynamic == 0u );
+ ASSERT_TRUE( shape_01.rank == 1u );
+ ASSERT_TRUE( shape_01.N0 == 100u );
+
+ ASSERT_TRUE( shape_11.rank_dynamic == 1u );
+ ASSERT_TRUE( shape_11.rank == 1u );
+ ASSERT_TRUE( shape_11.N0 == 1000u );
+
+ ASSERT_TRUE( shape_03.rank_dynamic == 0u );
+ ASSERT_TRUE( shape_03.rank == 3u );
+ ASSERT_TRUE( shape_03.N0 == 5u );
+ ASSERT_TRUE( shape_03.N1 == 6u );
+ ASSERT_TRUE( shape_03.N2 == 700u );
+
+ ASSERT_TRUE( shape_14.rank_dynamic == 1u );
+ ASSERT_TRUE( shape_14.rank == 4u );
+ ASSERT_TRUE( shape_14.N0 == 0u );
+ ASSERT_TRUE( shape_14.N1 == 8u );
+ ASSERT_TRUE( shape_14.N2 == 9u );
+ ASSERT_TRUE( shape_14.N3 == 900u );
+
+ ASSERT_TRUE( shape_22.rank_dynamic == 2u );
+ ASSERT_TRUE( shape_22.rank == 2u );
+ ASSERT_TRUE( shape_22.N0 == 0u );
+ ASSERT_TRUE( shape_22.N1 == 0u );
+
+ ASSERT_TRUE( shape_36.rank_dynamic == 3u );
+ ASSERT_TRUE( shape_36.rank == 6u );
+ ASSERT_TRUE( shape_36.N0 == 10u );
+ ASSERT_TRUE( shape_36.N1 == 20u );
+ ASSERT_TRUE( shape_36.N2 == 30u );
+ ASSERT_TRUE( shape_36.N3 == 5u );
+ ASSERT_TRUE( shape_36.N4 == 6u );
+ ASSERT_TRUE( shape_36.N5 == 7u );
+
+
+ ASSERT_TRUE( shape_01 == shape_01 );
+ ASSERT_TRUE( shape_11 == shape_11 );
+ ASSERT_TRUE( shape_36 == shape_36 );
+ ASSERT_TRUE( shape_01 != shape_36 );
+ ASSERT_TRUE( shape_22 != shape_36 );
+
+ //------------------------------------------------------------------------
+
+ typedef Kokkos::Impl::ViewOffset< shape_01_type , Kokkos::LayoutLeft > shape_01_left_offset ;
+ typedef Kokkos::Impl::ViewOffset< shape_11_type , Kokkos::LayoutLeft > shape_11_left_offset ;
+ typedef Kokkos::Impl::ViewOffset< shape_03_type , Kokkos::LayoutLeft > shape_03_left_offset ;
+ typedef Kokkos::Impl::ViewOffset< shape_14_type , Kokkos::LayoutLeft > shape_14_left_offset ;
+ typedef Kokkos::Impl::ViewOffset< shape_22_type , Kokkos::LayoutLeft > shape_22_left_offset ;
+ typedef Kokkos::Impl::ViewOffset< shape_36_type , Kokkos::LayoutLeft > shape_36_left_offset ;
+
+ typedef Kokkos::Impl::ViewOffset< shape_01_type , Kokkos::LayoutRight > shape_01_right_offset ;
+ typedef Kokkos::Impl::ViewOffset< shape_11_type , Kokkos::LayoutRight > shape_11_right_offset ;
+ typedef Kokkos::Impl::ViewOffset< shape_03_type , Kokkos::LayoutRight > shape_03_right_offset ;
+ typedef Kokkos::Impl::ViewOffset< shape_14_type , Kokkos::LayoutRight > shape_14_right_offset ;
+ typedef Kokkos::Impl::ViewOffset< shape_22_type , Kokkos::LayoutRight > shape_22_right_offset ;
+ typedef Kokkos::Impl::ViewOffset< shape_36_type , Kokkos::LayoutRight > shape_36_right_offset ;
+
+ ASSERT_TRUE( ! shape_01_left_offset::has_padding );
+ ASSERT_TRUE( ! shape_11_left_offset::has_padding );
+ ASSERT_TRUE( ! shape_03_left_offset::has_padding );
+ ASSERT_TRUE( shape_14_left_offset::has_padding );
+ ASSERT_TRUE( shape_22_left_offset::has_padding );
+ ASSERT_TRUE( shape_36_left_offset::has_padding );
+
+ ASSERT_TRUE( ! shape_01_right_offset::has_padding );
+ ASSERT_TRUE( ! shape_11_right_offset::has_padding );
+ ASSERT_TRUE( ! shape_03_right_offset::has_padding );
+ ASSERT_TRUE( ! shape_14_right_offset::has_padding );
+ ASSERT_TRUE( shape_22_right_offset::has_padding );
+ ASSERT_TRUE( shape_36_right_offset::has_padding );
+
+ //------------------------------------------------------------------------
+
+ typedef Kokkos::Impl::ViewOffset< shape_01_type , Kokkos::LayoutStride > shape_01_stride_offset ;
+ typedef Kokkos::Impl::ViewOffset< shape_36_type , Kokkos::LayoutStride > shape_36_stride_offset ;
+
+ {
+ shape_01_stride_offset stride_offset_01 ;
+
+ stride_offset_01.assign( 1, stride_offset_01.N0, 0,0,0,0,0,0,0 );
+
+ ASSERT_EQ( int(stride_offset_01.S[0]) , int(1) );
+ ASSERT_EQ( int(stride_offset_01.S[1]) , int(stride_offset_01.N0) );
+ }
+
+ {
+ shape_36_stride_offset stride_offset_36 ;
+
+ size_t str[7] ;
+ str[5] = 1 ;
+ str[4] = str[5] * stride_offset_36.N5 ;
+ str[3] = str[4] * stride_offset_36.N4 ;
+ str[2] = str[3] * stride_offset_36.N3 ;
+ str[1] = str[2] * 100 ;
+ str[0] = str[1] * 200 ;
+ str[6] = str[0] * 300 ;
+
+ stride_offset_36.assign( str[0] , str[1] , str[2] , str[3] , str[4] , str[5] , str[6] , 0 , 0 );
+
+ ASSERT_EQ( size_t(stride_offset_36.S[6]) , size_t(str[6]) );
+ ASSERT_EQ( size_t(stride_offset_36.N2) , size_t(100) );
+ ASSERT_EQ( size_t(stride_offset_36.N1) , size_t(200) );
+ ASSERT_EQ( size_t(stride_offset_36.N0) , size_t(300) );
+ }
+
+ //------------------------------------------------------------------------
+
+ {
+ const int rank = 6 ;
+ const int order[] = { 5 , 3 , 1 , 0 , 2 , 4 };
+ const unsigned dim[] = { 2 , 3 , 5 , 7 , 11 , 13 };
+ Kokkos::LayoutStride stride_6 = Kokkos::LayoutStride::order_dimensions( rank , order , dim );
+ size_t n = 1 ;
+ for ( int i = 0 ; i < rank ; ++i ) {
+ ASSERT_EQ( size_t(dim[i]) , size_t( stride_6.dimension[i] ) );
+ ASSERT_EQ( size_t(n) , size_t( stride_6.stride[ order[i] ] ) );
+ n *= dim[order[i]] ;
+ }
+ }
+
+ //------------------------------------------------------------------------
+}
+
+} /* namespace Test */
+
+#endif
+
+/*--------------------------------------------------------------------------*/
+
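The last block of test_view_impl() above exercises Kokkos::LayoutStride::order_dimensions, which builds a strided layout from a permutation listing the dimensions from fastest-varying (stride 1) to slowest-varying. A stand-alone sketch of the same call (not part of the patch; the rank, extents, and ordering below are illustrative assumptions) could read:

#include <Kokkos_Core.hpp>
#include <cstdio>

int main()
{
  // order[k] names the dimension with the k-th smallest stride, so here
  // dimension 2 is stride-1, then dimension 0, then dimension 1.
  const int      rank    = 3 ;
  const int      order[] = { 2 , 0 , 1 };
  const unsigned dim[]   = { 4 , 5 , 6 };   // extents of dimensions 0, 1, 2

  const Kokkos::LayoutStride layout =
    Kokkos::LayoutStride::order_dimensions( rank , order , dim );

  for ( int i = 0 ; i < rank ; ++i ) {
    std::printf( "dimension %d : extent %d stride %d\n" ,
                 i , (int) layout.dimension[i] , (int) layout.stride[i] );
  }
  return 0 ;
}

With these inputs the strides come out as stride[2] = 1, stride[0] = 6, and stride[1] = 24, matching the loop in the test above that multiplies the extents in the given order.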
diff --git a/lib/kokkos/core/unit_test/TestViewMapping.hpp b/lib/kokkos/core/unit_test/TestViewMapping.hpp
new file mode 100755
index 000000000..31e0c6a7b
--- /dev/null
+++ b/lib/kokkos/core/unit_test/TestViewMapping.hpp
@@ -0,0 +1,1018 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <gtest/gtest.h>
+
+#include <stdexcept>
+#include <sstream>
+#include <iostream>
+
+#include <Kokkos_Core.hpp>
+
+/*--------------------------------------------------------------------------*/
+
+namespace Test {
+
+template< class RangeType >
+void test_view_range( const size_t N , const RangeType & range , const size_t begin , const size_t dim )
+{
+ typedef Kokkos::Experimental::Impl::ViewOffsetRange< RangeType > query ;
+
+ ASSERT_EQ( query::begin( range ) , begin );
+ ASSERT_EQ( query::dimension( N , range ) , dim );
+ ASSERT_EQ( query::is_range , dim != 0 );
+}
+
+
+template< class ExecSpace >
+void test_view_mapping()
+{
+ typedef Kokkos::Experimental::Impl::ViewDimension<> dim_0 ;
+ typedef Kokkos::Experimental::Impl::ViewDimension<2> dim_s2 ;
+ typedef Kokkos::Experimental::Impl::ViewDimension<2,3> dim_s2_s3 ;
+ typedef Kokkos::Experimental::Impl::ViewDimension<2,3,4> dim_s2_s3_s4 ;
+
+ typedef Kokkos::Experimental::Impl::ViewDimension<0> dim_s0 ;
+ typedef Kokkos::Experimental::Impl::ViewDimension<0,3> dim_s0_s3 ;
+ typedef Kokkos::Experimental::Impl::ViewDimension<0,3,4> dim_s0_s3_s4 ;
+
+ typedef Kokkos::Experimental::Impl::ViewDimension<0,0> dim_s0_s0 ;
+ typedef Kokkos::Experimental::Impl::ViewDimension<0,0,4> dim_s0_s0_s4 ;
+
+ typedef Kokkos::Experimental::Impl::ViewDimension<0,0,0> dim_s0_s0_s0 ;
+ typedef Kokkos::Experimental::Impl::ViewDimension<0,0,0,0> dim_s0_s0_s0_s0 ;
+ typedef Kokkos::Experimental::Impl::ViewDimension<0,0,0,0,0> dim_s0_s0_s0_s0_s0 ;
+ typedef Kokkos::Experimental::Impl::ViewDimension<0,0,0,0,0,0> dim_s0_s0_s0_s0_s0_s0 ;
+ typedef Kokkos::Experimental::Impl::ViewDimension<0,0,0,0,0,0,0> dim_s0_s0_s0_s0_s0_s0_s0 ;
+ typedef Kokkos::Experimental::Impl::ViewDimension<0,0,0,0,0,0,0,0> dim_s0_s0_s0_s0_s0_s0_s0_s0 ;
+
+ // Fully static dimensions should not be larger than an int
+ ASSERT_LE( sizeof(dim_0) , sizeof(int) );
+ ASSERT_LE( sizeof(dim_s2) , sizeof(int) );
+ ASSERT_LE( sizeof(dim_s2_s3) , sizeof(int) );
+ ASSERT_LE( sizeof(dim_s2_s3_s4) , sizeof(int) );
+
+ // Rank 1 is size_t
+ ASSERT_EQ( sizeof(dim_s0) , sizeof(size_t) );
+ ASSERT_EQ( sizeof(dim_s0_s3) , sizeof(size_t) );
+ ASSERT_EQ( sizeof(dim_s0_s3_s4) , sizeof(size_t) );
+
+ // Allow for padding
+ ASSERT_LE( sizeof(dim_s0_s0) , 2 * sizeof(size_t) );
+ ASSERT_LE( sizeof(dim_s0_s0_s4) , 2 * sizeof(size_t) );
+
+ ASSERT_LE( sizeof(dim_s0_s0_s0) , 4 * sizeof(size_t) );
+ ASSERT_EQ( sizeof(dim_s0_s0_s0_s0) , 4 * sizeof(unsigned) );
+ ASSERT_LE( sizeof(dim_s0_s0_s0_s0_s0) , 6 * sizeof(unsigned) );
+ ASSERT_EQ( sizeof(dim_s0_s0_s0_s0_s0_s0) , 6 * sizeof(unsigned) );
+ ASSERT_LE( sizeof(dim_s0_s0_s0_s0_s0_s0_s0) , 8 * sizeof(unsigned) );
+ ASSERT_EQ( sizeof(dim_s0_s0_s0_s0_s0_s0_s0_s0) , 8 * sizeof(unsigned) );
+
+ ASSERT_EQ( int(dim_0::rank) , int(0) );
+ ASSERT_EQ( int(dim_0::rank_dynamic) , int(0) );
+
+ ASSERT_EQ( int(dim_s2::rank) , int(1) );
+ ASSERT_EQ( int(dim_s2::rank_dynamic) , int(0) );
+
+ ASSERT_EQ( int(dim_s2_s3::rank) , int(2) );
+ ASSERT_EQ( int(dim_s2_s3::rank_dynamic) , int(0) );
+
+ ASSERT_EQ( int(dim_s2_s3_s4::rank) , int(3) );
+ ASSERT_EQ( int(dim_s2_s3_s4::rank_dynamic) , int(0) );
+
+ ASSERT_EQ( int(dim_s0::rank) , int(1) );
+ ASSERT_EQ( int(dim_s0::rank_dynamic) , int(1) );
+
+ ASSERT_EQ( int(dim_s0_s3::rank) , int(2) );
+ ASSERT_EQ( int(dim_s0_s3::rank_dynamic) , int(1) );
+
+ ASSERT_EQ( int(dim_s0_s3_s4::rank) , int(3) );
+ ASSERT_EQ( int(dim_s0_s3_s4::rank_dynamic) , int(1) );
+
+ ASSERT_EQ( int(dim_s0_s0_s4::rank) , int(3) );
+ ASSERT_EQ( int(dim_s0_s0_s4::rank_dynamic) , int(2) );
+
+ ASSERT_EQ( int(dim_s0_s0_s0::rank) , int(3) );
+ ASSERT_EQ( int(dim_s0_s0_s0::rank_dynamic) , int(3) );
+
+ ASSERT_EQ( int(dim_s0_s0_s0_s0::rank) , int(4) );
+ ASSERT_EQ( int(dim_s0_s0_s0_s0::rank_dynamic) , int(4) );
+
+ ASSERT_EQ( int(dim_s0_s0_s0_s0_s0::rank) , int(5) );
+ ASSERT_EQ( int(dim_s0_s0_s0_s0_s0::rank_dynamic) , int(5) );
+
+ ASSERT_EQ( int(dim_s0_s0_s0_s0_s0_s0::rank) , int(6) );
+ ASSERT_EQ( int(dim_s0_s0_s0_s0_s0_s0::rank_dynamic) , int(6) );
+
+ ASSERT_EQ( int(dim_s0_s0_s0_s0_s0_s0_s0::rank) , int(7) );
+ ASSERT_EQ( int(dim_s0_s0_s0_s0_s0_s0_s0::rank_dynamic) , int(7) );
+
+ ASSERT_EQ( int(dim_s0_s0_s0_s0_s0_s0_s0_s0::rank) , int(8) );
+ ASSERT_EQ( int(dim_s0_s0_s0_s0_s0_s0_s0_s0::rank_dynamic) , int(8) );
+
+ dim_s0 d1( 2, 3, 4, 5, 6, 7, 8, 9 );
+ dim_s0_s0 d2( 2, 3, 4, 5, 6, 7, 8, 9 );
+ dim_s0_s0_s0 d3( 2, 3, 4, 5, 6, 7, 8, 9 );
+ dim_s0_s0_s0_s0 d4( 2, 3, 4, 5, 6, 7, 8, 9 );
+
+ ASSERT_EQ( d1.N0 , 2 );
+ ASSERT_EQ( d2.N0 , 2 );
+ ASSERT_EQ( d3.N0 , 2 );
+ ASSERT_EQ( d4.N0 , 2 );
+
+ ASSERT_EQ( d1.N1 , 1 );
+ ASSERT_EQ( d2.N1 , 3 );
+ ASSERT_EQ( d3.N1 , 3 );
+ ASSERT_EQ( d4.N1 , 3 );
+
+ ASSERT_EQ( d1.N2 , 1 );
+ ASSERT_EQ( d2.N2 , 1 );
+ ASSERT_EQ( d3.N2 , 4 );
+ ASSERT_EQ( d4.N2 , 4 );
+
+ ASSERT_EQ( d1.N3 , 1 );
+ ASSERT_EQ( d2.N3 , 1 );
+ ASSERT_EQ( d3.N3 , 1 );
+ ASSERT_EQ( d4.N3 , 5 );
+
+ //----------------------------------------
+
+ typedef Kokkos::Experimental::Impl::ViewOffset< dim_s0_s0_s0 , Kokkos::LayoutStride > stride_s0_s0_s0 ;
+
+ //----------------------------------------
+ // Static dimension
+ {
+ typedef Kokkos::Experimental::Impl::ViewOffset< dim_s2_s3_s4 , Kokkos::LayoutLeft > left_s2_s3_s4 ;
+
+ ASSERT_EQ( sizeof(left_s2_s3_s4) , sizeof(dim_s2_s3_s4) );
+
+ left_s2_s3_s4 off3 ;
+
+ stride_s0_s0_s0 stride3( off3 );
+
+ ASSERT_EQ( off3.stride_0() , 1 );
+ ASSERT_EQ( off3.stride_1() , 2 );
+ ASSERT_EQ( off3.stride_2() , 6 );
+ ASSERT_EQ( off3.span() , 24 );
+
+ ASSERT_EQ( off3.stride_0() , stride3.stride_0() );
+ ASSERT_EQ( off3.stride_1() , stride3.stride_1() );
+ ASSERT_EQ( off3.stride_2() , stride3.stride_2() );
+ ASSERT_EQ( off3.span() , stride3.span() );
+
+ int offset = 0 ;
+
+ for ( int k = 0 ; k < 4 ; ++k ){
+ for ( int j = 0 ; j < 3 ; ++j ){
+ for ( int i = 0 ; i < 2 ; ++i , ++offset ){
+ ASSERT_EQ( off3(i,j,k) , offset );
+ ASSERT_EQ( stride3(i,j,k) , off3(i,j,k) );
+ }}}
+ }
+
+ //----------------------------------------
+ // Small dimension is unpadded
+ {
+ typedef Kokkos::Experimental::Impl::ViewOffset< dim_s0_s0_s4 , Kokkos::LayoutLeft > left_s0_s0_s4 ;
+
+ left_s0_s0_s4 dyn_off3( std::integral_constant<unsigned,sizeof(int)>(), 2, 3, 0, 0, 0, 0, 0, 0 );
+
+ stride_s0_s0_s0 stride3( dyn_off3 );
+
+ ASSERT_EQ( dyn_off3.m_dim.rank , 3 );
+ ASSERT_EQ( dyn_off3.m_dim.N0 , 2 );
+ ASSERT_EQ( dyn_off3.m_dim.N1 , 3 );
+ ASSERT_EQ( dyn_off3.m_dim.N2 , 4 );
+ ASSERT_EQ( dyn_off3.m_dim.N3 , 1 );
+ ASSERT_EQ( dyn_off3.size() , 2 * 3 * 4 );
+
+ ASSERT_EQ( stride3.m_dim.rank , 3 );
+ ASSERT_EQ( stride3.m_dim.N0 , 2 );
+ ASSERT_EQ( stride3.m_dim.N1 , 3 );
+ ASSERT_EQ( stride3.m_dim.N2 , 4 );
+ ASSERT_EQ( stride3.m_dim.N3 , 1 );
+ ASSERT_EQ( stride3.size() , 2 * 3 * 4 );
+
+ int offset = 0 ;
+
+ for ( int k = 0 ; k < 4 ; ++k ){
+ for ( int j = 0 ; j < 3 ; ++j ){
+ for ( int i = 0 ; i < 2 ; ++i , ++offset ){
+ ASSERT_EQ( offset , dyn_off3(i,j,k) );
+ ASSERT_EQ( stride3(i,j,k) , dyn_off3(i,j,k) );
+ }}}
+
+ ASSERT_EQ( dyn_off3.span() , offset );
+ ASSERT_EQ( stride3.span() , dyn_off3.span() );
+ }
+
+ // Large dimension is likely padded
+ {
+ constexpr int N0 = 2000 ;
+ constexpr int N1 = 300 ;
+
+ typedef Kokkos::Experimental::Impl::ViewOffset< dim_s0_s0_s4 , Kokkos::LayoutLeft > left_s0_s0_s4 ;
+
+ left_s0_s0_s4 dyn_off3( std::integral_constant<unsigned,sizeof(int)>(), N0, N1, 0, 0, 0, 0, 0, 0 );
+
+ stride_s0_s0_s0 stride3( dyn_off3 );
+
+ ASSERT_EQ( dyn_off3.m_dim.rank , 3 );
+ ASSERT_EQ( dyn_off3.m_dim.N0 , N0 );
+ ASSERT_EQ( dyn_off3.m_dim.N1 , N1 );
+ ASSERT_EQ( dyn_off3.m_dim.N2 , 4 );
+ ASSERT_EQ( dyn_off3.m_dim.N3 , 1 );
+ ASSERT_EQ( dyn_off3.size() , N0 * N1 * 4 );
+
+ ASSERT_EQ( stride3.m_dim.rank , 3 );
+ ASSERT_EQ( stride3.m_dim.N0 , N0 );
+ ASSERT_EQ( stride3.m_dim.N1 , N1 );
+ ASSERT_EQ( stride3.m_dim.N2 , 4 );
+ ASSERT_EQ( stride3.m_dim.N3 , 1 );
+ ASSERT_EQ( stride3.size() , N0 * N1 * 4 );
+ ASSERT_EQ( stride3.span() , dyn_off3.span() );
+
+ int offset = 0 ;
+
+ for ( int k = 0 ; k < 4 ; ++k ){
+ for ( int j = 0 ; j < N1 ; ++j ){
+ for ( int i = 0 ; i < N0 ; ++i ){
+ ASSERT_LE( offset , dyn_off3(i,j,k) );
+ ASSERT_EQ( stride3(i,j,k) , dyn_off3(i,j,k) );
+ offset = dyn_off3(i,j,k) + 1 ;
+ }}}
+
+ ASSERT_LE( offset , dyn_off3.span() );
+ }
+
+ //----------------------------------------
+ // Static dimension
+ {
+ typedef Kokkos::Experimental::Impl::ViewOffset< dim_s2_s3_s4 , Kokkos::LayoutRight > right_s2_s3_s4 ;
+
+ ASSERT_EQ( sizeof(right_s2_s3_s4) , sizeof(dim_s2_s3_s4) );
+
+ right_s2_s3_s4 off3 ;
+
+ stride_s0_s0_s0 stride3( off3 );
+
+ ASSERT_EQ( off3.stride_0() , 12 );
+ ASSERT_EQ( off3.stride_1() , 4 );
+ ASSERT_EQ( off3.stride_2() , 1 );
+
+ ASSERT_EQ( off3.dimension_0() , stride3.dimension_0() );
+ ASSERT_EQ( off3.dimension_1() , stride3.dimension_1() );
+ ASSERT_EQ( off3.dimension_2() , stride3.dimension_2() );
+ ASSERT_EQ( off3.stride_0() , stride3.stride_0() );
+ ASSERT_EQ( off3.stride_1() , stride3.stride_1() );
+ ASSERT_EQ( off3.stride_2() , stride3.stride_2() );
+ ASSERT_EQ( off3.span() , stride3.span() );
+
+ int offset = 0 ;
+
+ for ( int i = 0 ; i < 2 ; ++i ){
+ for ( int j = 0 ; j < 3 ; ++j ){
+ for ( int k = 0 ; k < 4 ; ++k , ++offset ){
+ ASSERT_EQ( off3(i,j,k) , offset );
+ ASSERT_EQ( off3(i,j,k) , stride3(i,j,k) );
+ }}}
+
+ ASSERT_EQ( off3.span() , offset );
+ }
+
+ //----------------------------------------
+ // Small dimension is unpadded
+ {
+ typedef Kokkos::Experimental::Impl::ViewOffset< dim_s0_s0_s4 , Kokkos::LayoutRight > right_s0_s0_s4 ;
+
+ right_s0_s0_s4 dyn_off3( std::integral_constant<unsigned,sizeof(int)>(), 2, 3, 0, 0, 0, 0, 0, 0 );
+
+ stride_s0_s0_s0 stride3( dyn_off3 );
+
+ ASSERT_EQ( dyn_off3.m_dim.rank , 3 );
+ ASSERT_EQ( dyn_off3.m_dim.N0 , 2 );
+ ASSERT_EQ( dyn_off3.m_dim.N1 , 3 );
+ ASSERT_EQ( dyn_off3.m_dim.N2 , 4 );
+ ASSERT_EQ( dyn_off3.m_dim.N3 , 1 );
+ ASSERT_EQ( dyn_off3.size() , 2 * 3 * 4 );
+
+ ASSERT_EQ( dyn_off3.dimension_0() , stride3.dimension_0() );
+ ASSERT_EQ( dyn_off3.dimension_1() , stride3.dimension_1() );
+ ASSERT_EQ( dyn_off3.dimension_2() , stride3.dimension_2() );
+ ASSERT_EQ( dyn_off3.stride_0() , stride3.stride_0() );
+ ASSERT_EQ( dyn_off3.stride_1() , stride3.stride_1() );
+ ASSERT_EQ( dyn_off3.stride_2() , stride3.stride_2() );
+ ASSERT_EQ( dyn_off3.span() , stride3.span() );
+
+ int offset = 0 ;
+
+ for ( int i = 0 ; i < 2 ; ++i ){
+ for ( int j = 0 ; j < 3 ; ++j ){
+ for ( int k = 0 ; k < 4 ; ++k , ++offset ){
+ ASSERT_EQ( offset , dyn_off3(i,j,k) );
+ ASSERT_EQ( dyn_off3(i,j,k) , stride3(i,j,k) );
+ }}}
+
+ ASSERT_EQ( dyn_off3.span() , offset );
+ }
+
+ // Large dimension is likely padded
+ {
+ constexpr int N0 = 2000 ;
+ constexpr int N1 = 300 ;
+
+ typedef Kokkos::Experimental::Impl::ViewOffset< dim_s0_s0_s4 , Kokkos::LayoutRight > right_s0_s0_s4 ;
+
+ right_s0_s0_s4 dyn_off3( std::integral_constant<unsigned,sizeof(int)>(), N0, N1, 0, 0, 0, 0, 0, 0 );
+
+ stride_s0_s0_s0 stride3( dyn_off3 );
+
+ ASSERT_EQ( dyn_off3.m_dim.rank , 3 );
+ ASSERT_EQ( dyn_off3.m_dim.N0 , N0 );
+ ASSERT_EQ( dyn_off3.m_dim.N1 , N1 );
+ ASSERT_EQ( dyn_off3.m_dim.N2 , 4 );
+ ASSERT_EQ( dyn_off3.m_dim.N3 , 1 );
+ ASSERT_EQ( dyn_off3.size() , N0 * N1 * 4 );
+
+ ASSERT_EQ( dyn_off3.dimension_0() , stride3.dimension_0() );
+ ASSERT_EQ( dyn_off3.dimension_1() , stride3.dimension_1() );
+ ASSERT_EQ( dyn_off3.dimension_2() , stride3.dimension_2() );
+ ASSERT_EQ( dyn_off3.stride_0() , stride3.stride_0() );
+ ASSERT_EQ( dyn_off3.stride_1() , stride3.stride_1() );
+ ASSERT_EQ( dyn_off3.stride_2() , stride3.stride_2() );
+ ASSERT_EQ( dyn_off3.span() , stride3.span() );
+
+ int offset = 0 ;
+
+ for ( int i = 0 ; i < N0 ; ++i ){
+ for ( int j = 0 ; j < N1 ; ++j ){
+ for ( int k = 0 ; k < 4 ; ++k ){
+ ASSERT_LE( offset , dyn_off3(i,j,k) );
+ ASSERT_EQ( dyn_off3(i,j,k) , stride3(i,j,k) );
+ offset = dyn_off3(i,j,k) + 1 ;
+ }}}
+
+ ASSERT_LE( offset , dyn_off3.span() );
+ }
+
+ //----------------------------------------
+ // Subview
+ {
+ constexpr int N0 = 2000 ;
+ constexpr int N1 = 300 ;
+
+ constexpr int sub_N0 = 1000 ;
+ constexpr int sub_N1 = 200 ;
+ constexpr int sub_N2 = 4 ;
+
+ typedef Kokkos::Experimental::Impl::ViewOffset< dim_s0_s0_s4 , Kokkos::LayoutLeft > left_s0_s0_s4 ;
+
+ left_s0_s0_s4 dyn_off3( std::integral_constant<unsigned,sizeof(int)>(), N0, N1, 0, 0, 0, 0, 0, 0 );
+
+ stride_s0_s0_s0 stride3( dyn_off3 , sub_N0 , sub_N1 , sub_N2 , 0 , 0 , 0 , 0 , 0 );
+
+ ASSERT_EQ( stride3.dimension_0() , sub_N0 );
+ ASSERT_EQ( stride3.dimension_1() , sub_N1 );
+ ASSERT_EQ( stride3.dimension_2() , sub_N2 );
+ ASSERT_EQ( stride3.size() , sub_N0 * sub_N1 * sub_N2 );
+
+ ASSERT_EQ( dyn_off3.stride_0() , stride3.stride_0() );
+ ASSERT_EQ( dyn_off3.stride_1() , stride3.stride_1() );
+ ASSERT_EQ( dyn_off3.stride_2() , stride3.stride_2() );
+ ASSERT_GE( dyn_off3.span() , stride3.span() );
+
+ for ( int k = 0 ; k < sub_N2 ; ++k ){
+ for ( int j = 0 ; j < sub_N1 ; ++j ){
+ for ( int i = 0 ; i < sub_N0 ; ++i ){
+ ASSERT_EQ( stride3(i,j,k) , dyn_off3(i,j,k) );
+ }}}
+ }
+
+ {
+ constexpr int N0 = 2000 ;
+ constexpr int N1 = 300 ;
+
+ constexpr int sub_N0 = 1000 ;
+ constexpr int sub_N1 = 200 ;
+ constexpr int sub_N2 = 4 ;
+
+ typedef Kokkos::Experimental::Impl::ViewOffset< dim_s0_s0_s4 , Kokkos::LayoutRight > right_s0_s0_s4 ;
+
+ right_s0_s0_s4 dyn_off3( std::integral_constant<unsigned,sizeof(int)>(), N0, N1, 0, 0, 0, 0, 0, 0 );
+
+ stride_s0_s0_s0 stride3( dyn_off3 , sub_N0 , sub_N1 , sub_N2 , 0 , 0 , 0 , 0 , 0 );
+
+ ASSERT_EQ( stride3.dimension_0() , sub_N0 );
+ ASSERT_EQ( stride3.dimension_1() , sub_N1 );
+ ASSERT_EQ( stride3.dimension_2() , sub_N2 );
+ ASSERT_EQ( stride3.size() , sub_N0 * sub_N1 * sub_N2 );
+
+ ASSERT_EQ( dyn_off3.stride_0() , stride3.stride_0() );
+ ASSERT_EQ( dyn_off3.stride_1() , stride3.stride_1() );
+ ASSERT_EQ( dyn_off3.stride_2() , stride3.stride_2() );
+ ASSERT_GE( dyn_off3.span() , stride3.span() );
+
+ for ( int i = 0 ; i < sub_N0 ; ++i ){
+ for ( int j = 0 ; j < sub_N1 ; ++j ){
+ for ( int k = 0 ; k < sub_N2 ; ++k ){
+ ASSERT_EQ( stride3(i,j,k) , dyn_off3(i,j,k) );
+ }}}
+ }
+
+ //----------------------------------------
+ {
+ constexpr int N = 1000 ;
+
+ test_view_range( N , N / 2 , N / 2 , 0 );
+ test_view_range( N , Kokkos::Experimental::ALL , 0 , N );
+ test_view_range( N , std::pair<int,int>( N / 4 , 10 + N / 4 ) , N / 4 , 10 );
+ test_view_range( N , Kokkos::pair<int,int>( N / 4 , 10 + N / 4 ) , N / 4 , 10 );
+ }
+ //----------------------------------------
+ // view data analysis
+
+ {
+ typedef Kokkos::Experimental::Impl::ViewDataAnalysis< const int[] > a_const_int_r1 ;
+
+ ASSERT_TRUE( ( std::is_same< typename a_const_int_r1::specialize , void >::value ));
+ ASSERT_TRUE( ( std::is_same< typename a_const_int_r1::dimension , Kokkos::Experimental::Impl::ViewDimension<0> >::value ));
+ ASSERT_TRUE( ( std::is_same< typename a_const_int_r1::type , const int[] >::value ));
+ ASSERT_TRUE( ( std::is_same< typename a_const_int_r1::value_type , const int >::value ));
+ ASSERT_TRUE( ( std::is_same< typename a_const_int_r1::array_scalar_type , const int[] >::value ));
+ ASSERT_TRUE( ( std::is_same< typename a_const_int_r1::const_type , const int[] >::value ));
+ ASSERT_TRUE( ( std::is_same< typename a_const_int_r1::const_value_type , const int >::value ));
+ ASSERT_TRUE( ( std::is_same< typename a_const_int_r1::const_array_scalar_type , const int[] >::value ));
+ ASSERT_TRUE( ( std::is_same< typename a_const_int_r1::non_const_type , int [] >::value ));
+ ASSERT_TRUE( ( std::is_same< typename a_const_int_r1::non_const_value_type , int >::value ));
+
+ typedef Kokkos::Experimental::Impl::ViewDataAnalysis< const int**[4] > a_const_int_r3 ;
+
+ ASSERT_TRUE( ( std::is_same< typename a_const_int_r3::specialize , void >::value ));
+ ASSERT_TRUE( ( std::is_same< typename a_const_int_r3::dimension , Kokkos::Experimental::Impl::ViewDimension<0,0,4> >::value ));
+ ASSERT_TRUE( ( std::is_same< typename a_const_int_r3::type , const int**[4] >::value ));
+ ASSERT_TRUE( ( std::is_same< typename a_const_int_r3::value_type , const int >::value ));
+ ASSERT_TRUE( ( std::is_same< typename a_const_int_r3::array_scalar_type , const int**[4] >::value ));
+ ASSERT_TRUE( ( std::is_same< typename a_const_int_r3::const_type , const int**[4] >::value ));
+ ASSERT_TRUE( ( std::is_same< typename a_const_int_r3::const_value_type , const int >::value ));
+ ASSERT_TRUE( ( std::is_same< typename a_const_int_r3::const_array_scalar_type , const int**[4] >::value ));
+ ASSERT_TRUE( ( std::is_same< typename a_const_int_r3::non_const_type , int**[4] >::value ));
+ ASSERT_TRUE( ( std::is_same< typename a_const_int_r3::non_const_value_type , int >::value ));
+ ASSERT_TRUE( ( std::is_same< typename a_const_int_r3::non_const_array_scalar_type , int**[4] >::value ));
+ }
+
+ //----------------------------------------
+
+ {
+ constexpr int N = 10 ;
+
+ typedef Kokkos::Experimental::View<int*,ExecSpace> T ;
+ typedef Kokkos::Experimental::View<const int*,ExecSpace> C ;
+
+ int data[N] ;
+
+ T vr1(data,N);
+ C cr1(vr1);
+
+ // Generate static_assert error:
+ // T tmp( cr1 );
+
+ ASSERT_EQ( vr1.span() , N );
+ ASSERT_EQ( cr1.span() , N );
+ ASSERT_EQ( vr1.data() , & data[0] );
+ ASSERT_EQ( cr1.data() , & data[0] );
+
+ ASSERT_TRUE( ( std::is_same< typename T::data_type , int* >::value ) );
+ ASSERT_TRUE( ( std::is_same< typename T::const_data_type , const int* >::value ) );
+ ASSERT_TRUE( ( std::is_same< typename T::non_const_data_type , int* >::value ) );
+
+ ASSERT_TRUE( ( std::is_same< typename T::array_scalar_type , int* >::value ) );
+ ASSERT_TRUE( ( std::is_same< typename T::const_array_scalar_type , const int* >::value ) );
+ ASSERT_TRUE( ( std::is_same< typename T::non_const_array_scalar_type , int* >::value ) );
+
+ ASSERT_TRUE( ( std::is_same< typename T::value_type , int >::value ) );
+ ASSERT_TRUE( ( std::is_same< typename T::const_value_type , const int >::value ) );
+ ASSERT_TRUE( ( std::is_same< typename T::non_const_value_type , int >::value ) );
+
+ ASSERT_TRUE( ( std::is_same< typename T::memory_space , typename ExecSpace::memory_space >::value ) );
+ ASSERT_TRUE( ( std::is_same< typename T::reference_type , int & >::value ) );
+
+ ASSERT_EQ( T::Rank , 1 );
+
+ ASSERT_TRUE( ( std::is_same< typename C::data_type , const int* >::value ) );
+ ASSERT_TRUE( ( std::is_same< typename C::const_data_type , const int* >::value ) );
+ ASSERT_TRUE( ( std::is_same< typename C::non_const_data_type , int* >::value ) );
+
+ ASSERT_TRUE( ( std::is_same< typename C::array_scalar_type , const int* >::value ) );
+ ASSERT_TRUE( ( std::is_same< typename C::const_array_scalar_type , const int* >::value ) );
+ ASSERT_TRUE( ( std::is_same< typename C::non_const_array_scalar_type , int* >::value ) );
+
+ ASSERT_TRUE( ( std::is_same< typename C::value_type , const int >::value ) );
+ ASSERT_TRUE( ( std::is_same< typename C::const_value_type , const int >::value ) );
+ ASSERT_TRUE( ( std::is_same< typename C::non_const_value_type , int >::value ) );
+
+ ASSERT_TRUE( ( std::is_same< typename C::memory_space , typename ExecSpace::memory_space >::value ) );
+ ASSERT_TRUE( ( std::is_same< typename C::reference_type , const int & >::value ) );
+
+ ASSERT_EQ( C::Rank , 1 );
+
+ ASSERT_EQ( vr1.dimension_0() , N );
+
+ if ( Kokkos::Impl::VerifyExecutionCanAccessMemorySpace< typename ExecSpace::memory_space , Kokkos::HostSpace >::value ) {
+ for ( int i = 0 ; i < N ; ++i ) data[i] = i + 1 ;
+ for ( int i = 0 ; i < N ; ++i ) ASSERT_EQ( vr1[i] , i + 1 );
+ for ( int i = 0 ; i < N ; ++i ) ASSERT_EQ( cr1[i] , i + 1 );
+
+ {
+ T tmp( vr1 );
+ for ( int i = 0 ; i < N ; ++i ) ASSERT_EQ( tmp[i] , i + 1 );
+ for ( int i = 0 ; i < N ; ++i ) vr1(i) = i + 2 ;
+ for ( int i = 0 ; i < N ; ++i ) ASSERT_EQ( tmp[i] , i + 2 );
+ }
+
+ for ( int i = 0 ; i < N ; ++i ) ASSERT_EQ( vr1[i] , i + 2 );
+ }
+ }
+
+ {
+ constexpr int N = 10 ;
+ typedef Kokkos::Experimental::View<int*,ExecSpace> T ;
+ typedef Kokkos::Experimental::View<const int*,ExecSpace> C ;
+
+ T vr1("vr1",N);
+ C cr1(vr1);
+
+ ASSERT_TRUE( ( std::is_same< typename T::data_type , int* >::value ) );
+ ASSERT_TRUE( ( std::is_same< typename T::const_data_type , const int* >::value ) );
+ ASSERT_TRUE( ( std::is_same< typename T::non_const_data_type , int* >::value ) );
+
+ ASSERT_TRUE( ( std::is_same< typename T::array_scalar_type , int* >::value ) );
+ ASSERT_TRUE( ( std::is_same< typename T::const_array_scalar_type , const int* >::value ) );
+ ASSERT_TRUE( ( std::is_same< typename T::non_const_array_scalar_type , int* >::value ) );
+
+ ASSERT_TRUE( ( std::is_same< typename T::value_type , int >::value ) );
+ ASSERT_TRUE( ( std::is_same< typename T::const_value_type , const int >::value ) );
+ ASSERT_TRUE( ( std::is_same< typename T::non_const_value_type , int >::value ) );
+
+ ASSERT_TRUE( ( std::is_same< typename T::memory_space , typename ExecSpace::memory_space >::value ) );
+ ASSERT_TRUE( ( std::is_same< typename T::reference_type , int & >::value ) );
+ ASSERT_EQ( T::Rank , 1 );
+
+ ASSERT_EQ( vr1.dimension_0() , N );
+
+ if ( Kokkos::Impl::VerifyExecutionCanAccessMemorySpace< typename ExecSpace::memory_space , Kokkos::HostSpace >::value ) {
+ for ( int i = 0 ; i < N ; ++i ) vr1(i) = i + 1 ;
+ for ( int i = 0 ; i < N ; ++i ) ASSERT_EQ( vr1[i] , i + 1 );
+ for ( int i = 0 ; i < N ; ++i ) ASSERT_EQ( cr1[i] , i + 1 );
+
+ {
+ T tmp( vr1 );
+ for ( int i = 0 ; i < N ; ++i ) ASSERT_EQ( tmp[i] , i + 1 );
+ for ( int i = 0 ; i < N ; ++i ) vr1(i) = i + 2 ;
+ for ( int i = 0 ; i < N ; ++i ) ASSERT_EQ( tmp[i] , i + 2 );
+ }
+
+ for ( int i = 0 ; i < N ; ++i ) ASSERT_EQ( vr1[i] , i + 2 );
+ }
+ }
+
+ {
+ using namespace Kokkos::Experimental ;
+
+ typedef typename ExecSpace::memory_space memory_space ;
+ typedef View<int*,memory_space> V ;
+
+ constexpr int N = 10 ;
+
+ memory_space mem_space ;
+
+ V v( "v" , N );
+ V va( view_alloc() , N );
+ V vb( view_alloc( "vb" ) , N );
+ V vc( view_alloc( "vc" , AllowPadding ) , N );
+ V vd( view_alloc( "vd" , WithoutInitializing ) , N );
+ V ve( view_alloc( "ve" , WithoutInitializing , AllowPadding ) , N );
+ V vf( view_alloc( "vf" , mem_space , WithoutInitializing , AllowPadding ) , N );
+ V vg( view_alloc( mem_space , "vg" , WithoutInitializing , AllowPadding ) , N );
+ V vh( view_alloc( WithoutInitializing , AllowPadding ) , N );
+ V vi( view_alloc( WithoutInitializing ) , N );
+ V vj( view_alloc( std::string("vj") , AllowPadding ) , N );
+ V vk( view_alloc( mem_space , std::string("vk") , AllowPadding ) , N );
+ }
+
+ {
+ typedef Kokkos::Experimental::ViewTraits<int***,Kokkos::LayoutStride,ExecSpace> traits_t ;
+ typedef Kokkos::Experimental::Impl::ViewDimension<0,0,0> dims_t ;
+ typedef Kokkos::Experimental::Impl::ViewOffset< dims_t , Kokkos::LayoutStride > offset_t ;
+
+ Kokkos::LayoutStride stride ;
+
+ stride.dimension[0] = 3 ;
+ stride.dimension[1] = 4 ;
+ stride.dimension[2] = 5 ;
+ stride.stride[0] = 4 ;
+ stride.stride[1] = 1 ;
+ stride.stride[2] = 12 ;
+
+ const offset_t offset( stride );
+
+ ASSERT_EQ( offset.dimension_0() , 3 );
+ ASSERT_EQ( offset.dimension_1() , 4 );
+ ASSERT_EQ( offset.dimension_2() , 5 );
+
+ ASSERT_EQ( offset.stride_0() , 4 );
+ ASSERT_EQ( offset.stride_1() , 1 );
+ ASSERT_EQ( offset.stride_2() , 12 );
+
+ ASSERT_EQ( offset.span() , 60 );
+ ASSERT_TRUE( offset.span_is_contiguous() );
+
+ Kokkos::Experimental::Impl::ViewMapping< traits_t , void > v( (int*) 0 , std::false_type() , stride );
+ }
+
+ {
+ typedef Kokkos::Experimental::View<int**,ExecSpace> V ;
+ typedef typename V::HostMirror M ;
+
+ constexpr int N0 = 10 ;
+ constexpr int N1 = 11 ;
+
+ V a("a",N0,N1);
+ M b = Kokkos::Experimental::create_mirror(a);
+ M c = Kokkos::Experimental::create_mirror_view(a);
+
+ for ( int i0 = 0 ; i0 < N0 ; ++i0 )
+ for ( int i1 = 0 ; i1 < N1 ; ++i1 )
+ b(i0,i1) = 1 + i0 + i1 * N0 ;
+
+ Kokkos::Experimental::deep_copy( a , b );
+ Kokkos::Experimental::deep_copy( c , a );
+
+ for ( int i0 = 0 ; i0 < N0 ; ++i0 )
+ for ( int i1 = 0 ; i1 < N1 ; ++i1 )
+ ASSERT_EQ( b(i0,i1) , c(i0,i1) );
+
+ Kokkos::Experimental::resize( b , 5 , 6 );
+ Kokkos::Experimental::realloc( c , 5 , 6 );
+
+ ASSERT_EQ( b.dimension_0() , 5 );
+ ASSERT_EQ( b.dimension_1() , 6 );
+ ASSERT_EQ( c.dimension_0() , 5 );
+ ASSERT_EQ( c.dimension_1() , 6 );
+ }
+}
+
+template< class ExecSpace >
+struct TestViewMappingSubview {
+
+ constexpr static int AN = 10 ;
+ typedef Kokkos::Experimental::View<int*,ExecSpace> AT ;
+ typedef Kokkos::Experimental::Subview< AT , true > AS ;
+
+ constexpr static int BN0 = 10 , BN1 = 11 , BN2 = 12 ;
+ typedef Kokkos::Experimental::View<int***,ExecSpace> BT ;
+ typedef Kokkos::Experimental::Subview< BT , true , true , true > BS ;
+
+ constexpr static int CN0 = 10 , CN1 = 11 , CN2 = 12 ;
+ typedef Kokkos::Experimental::View<int***[13][14],ExecSpace> CT ;
+ typedef Kokkos::Experimental::Subview< CT , true , true , true , false , false > CS ;
+
+ constexpr static int DN0 = 10 , DN1 = 11 , DN2 = 12 ;
+ typedef Kokkos::Experimental::View<int***[13][14],ExecSpace> DT ;
+ typedef Kokkos::Experimental::Subview< DT , false , true , true , true , false > DS ;
+
+
+ typedef Kokkos::Experimental::View<int***[13][14],Kokkos::LayoutLeft,ExecSpace> DLT ;
+ typedef Kokkos::Experimental::Subview< DLT , true , false , false , false , false > DLS1 ;
+
+ static_assert( DLS1::rank == 1 && std::is_same< typename DLS1::array_layout , Kokkos::LayoutLeft >::value
+ , "Subview layout error for rank 1 subview of left-most range of LayoutLeft" );
+
+ typedef Kokkos::Experimental::View<int***[13][14],Kokkos::LayoutRight,ExecSpace> DRT ;
+ typedef Kokkos::Experimental::Subview< DRT , false , false , false , false , true > DRS1 ;
+
+ static_assert( DRS1::rank == 1 && std::is_same< typename DRS1::array_layout , Kokkos::LayoutRight >::value
+ , "Subview layout error for rank 1 subview of right-most range of LayoutRight" );
+
+ AT Aa ;
+ AS Ab ;
+ BT Ba ;
+ BS Bb ;
+ CT Ca ;
+ CS Cb ;
+ DT Da ;
+ DS Db ;
+
+ TestViewMappingSubview()
+ : Aa("Aa",AN)
+ , Ab( Kokkos::Experimental::subview( Aa , std::pair<int,int>(1,AN-1) ) )
+ , Ba("Ba",BN0,BN1,BN2)
+ , Bb( Kokkos::Experimental::subview( Ba
+ , std::pair<int,int>(1,BN0-1)
+ , std::pair<int,int>(1,BN1-1)
+ , std::pair<int,int>(1,BN2-1)
+ ) )
+ , Ca("Ca",CN0,CN1,CN2)
+ , Cb( Kokkos::Experimental::subview( Ca
+ , std::pair<int,int>(1,CN0-1)
+ , std::pair<int,int>(1,CN1-1)
+ , std::pair<int,int>(1,CN2-1)
+ , 1
+ , 2
+ ) )
+ , Da("Da",DN0,DN1,DN2)
+ , Db( Kokkos::Experimental::subview( Da
+ , 1
+ , std::pair<int,int>(1,DN0-1)
+ , std::pair<int,int>(1,DN1-1)
+ , std::pair<int,int>(1,DN2-1)
+ , 2
+ ) )
+ {
+ }
+
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const int , long & error_count ) const
+ {
+ for ( int i = 1 ; i < AN-1 ; ++i ) if( & Aa[i] != & Ab[i-1] ) ++error_count ;
+
+ for ( int i2 = 1 ; i2 < BN2-1 ; ++i2 ) {
+ for ( int i1 = 1 ; i1 < BN1-1 ; ++i1 ) {
+ for ( int i0 = 1 ; i0 < BN0-1 ; ++i0 ) {
+ if ( & Ba(i0,i1,i2) != & Bb(i0-1,i1-1,i2-1) ) ++error_count ;
+ }}}
+
+ for ( int i2 = 1 ; i2 < CN2-1 ; ++i2 ) {
+ for ( int i1 = 1 ; i1 < CN1-1 ; ++i1 ) {
+ for ( int i0 = 1 ; i0 < CN0-1 ; ++i0 ) {
+ if ( & Ca(i0,i1,i2,1,2) != & Cb(i0-1,i1-1,i2-1) ) ++error_count ;
+ }}}
+
+ for ( int i2 = 1 ; i2 < DN2-1 ; ++i2 ) {
+ for ( int i1 = 1 ; i1 < DN1-1 ; ++i1 ) {
+ for ( int i0 = 1 ; i0 < DN0-1 ; ++i0 ) {
+ if ( & Da(1,i0,i1,i2,2) != & Db(i0-1,i1-1,i2-1) ) ++error_count ;
+ }}}
+ }
+
+ static void run()
+ {
+ TestViewMappingSubview self ;
+
+ ASSERT_EQ( self.Da.stride_1() , self.Db.stride_0() );
+ ASSERT_EQ( self.Da.stride_2() , self.Db.stride_1() );
+ ASSERT_EQ( self.Da.stride_3() , self.Db.stride_2() );
+
+ long error_count = -1 ;
+ Kokkos::parallel_reduce( Kokkos::RangePolicy< ExecSpace >(0,1) , self , error_count );
+ ASSERT_EQ( error_count , 0 );
+ }
+
+};
+
+template< class ExecSpace >
+void test_view_mapping_subview()
+{
+ TestViewMappingSubview< ExecSpace >::run();
+}
+
+/*--------------------------------------------------------------------------*/
+
+template< class ViewType >
+struct TestViewMapOperator {
+
+ static_assert( ViewType::reference_type_is_lvalue_reference
+ , "Test only valid for lvalue reference type" );
+
+ const ViewType v ;
+
+ KOKKOS_INLINE_FUNCTION
+ void test_left( size_t i0 , long & error_count ) const
+ {
+ typename ViewType::value_type * const base_ptr = & v(0,0,0,0,0,0,0,0);
+ const size_t n1 = v.dimension_1();
+ const size_t n2 = v.dimension_2();
+ const size_t n3 = v.dimension_3();
+ const size_t n4 = v.dimension_4();
+ const size_t n5 = v.dimension_5();
+ const size_t n6 = v.dimension_6();
+ const size_t n7 = v.dimension_7();
+
+ long offset = 0 ;
+
+ for ( size_t i7 = 0 ; i7 < n7 ; ++i7 )
+ for ( size_t i6 = 0 ; i6 < n6 ; ++i6 )
+ for ( size_t i5 = 0 ; i5 < n5 ; ++i5 )
+ for ( size_t i4 = 0 ; i4 < n4 ; ++i4 )
+ for ( size_t i3 = 0 ; i3 < n3 ; ++i3 )
+ for ( size_t i2 = 0 ; i2 < n2 ; ++i2 )
+ for ( size_t i1 = 0 ; i1 < n1 ; ++i1 )
+ {
+ const long d = & v(i0,i1,i2,i3,i4,i5,i6,i7) - base_ptr ;
+ if ( d < offset ) ++error_count ;
+ offset = d ;
+ }
+
+ if ( v.span() <= size_t(offset) ) ++error_count ;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void test_right( size_t i0 , long & error_count ) const
+ {
+ typename ViewType::value_type * const base_ptr = & v(0,0,0,0,0,0,0,0);
+ const size_t n1 = v.dimension_1();
+ const size_t n2 = v.dimension_2();
+ const size_t n3 = v.dimension_3();
+ const size_t n4 = v.dimension_4();
+ const size_t n5 = v.dimension_5();
+ const size_t n6 = v.dimension_6();
+ const size_t n7 = v.dimension_7();
+
+ long offset = 0 ;
+
+ for ( size_t i1 = 0 ; i1 < n1 ; ++i1 )
+ for ( size_t i2 = 0 ; i2 < n2 ; ++i2 )
+ for ( size_t i3 = 0 ; i3 < n3 ; ++i3 )
+ for ( size_t i4 = 0 ; i4 < n4 ; ++i4 )
+ for ( size_t i5 = 0 ; i5 < n5 ; ++i5 )
+ for ( size_t i6 = 0 ; i6 < n6 ; ++i6 )
+ for ( size_t i7 = 0 ; i7 < n7 ; ++i7 )
+ {
+ const long d = & v(i0,i1,i2,i3,i4,i5,i6,i7) - base_ptr ;
+ if ( d < offset ) ++error_count ;
+ offset = d ;
+ }
+
+ if ( v.span() <= size_t(offset) ) ++error_count ;
+ }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( size_t i , long & error_count ) const
+ {
+ if ( std::is_same< typename ViewType::array_layout , Kokkos::LayoutLeft >::value )
+ test_left(i,error_count);
+ else if ( std::is_same< typename ViewType::array_layout , Kokkos::LayoutRight >::value )
+ test_right(i,error_count);
+ }
+
+ constexpr static size_t N0 = 10 ;
+ constexpr static size_t N1 = 9 ;
+ constexpr static size_t N2 = 8 ;
+ constexpr static size_t N3 = 7 ;
+ constexpr static size_t N4 = 6 ;
+ constexpr static size_t N5 = 5 ;
+ constexpr static size_t N6 = 4 ;
+ constexpr static size_t N7 = 3 ;
+
+ TestViewMapOperator() : v( "Test" , N0, N1, N2, N3, N4, N5, N6, N7 ) {}
+
+ static void run()
+ {
+ TestViewMapOperator self ;
+
+ ASSERT_EQ( self.v.dimension_0() , ( 0 < ViewType::rank ? N0 : 1 ) );
+ ASSERT_EQ( self.v.dimension_1() , ( 1 < ViewType::rank ? N1 : 1 ) );
+ ASSERT_EQ( self.v.dimension_2() , ( 2 < ViewType::rank ? N2 : 1 ) );
+ ASSERT_EQ( self.v.dimension_3() , ( 3 < ViewType::rank ? N3 : 1 ) );
+ ASSERT_EQ( self.v.dimension_4() , ( 4 < ViewType::rank ? N4 : 1 ) );
+ ASSERT_EQ( self.v.dimension_5() , ( 5 < ViewType::rank ? N5 : 1 ) );
+ ASSERT_EQ( self.v.dimension_6() , ( 6 < ViewType::rank ? N6 : 1 ) );
+ ASSERT_EQ( self.v.dimension_7() , ( 7 < ViewType::rank ? N7 : 1 ) );
+
+ ASSERT_LE( self.v.dimension_0()*
+ self.v.dimension_1()*
+ self.v.dimension_2()*
+ self.v.dimension_3()*
+ self.v.dimension_4()*
+ self.v.dimension_5()*
+ self.v.dimension_6()*
+ self.v.dimension_7()
+ , self.v.span() );
+
+ long error_count ;
+ Kokkos::RangePolicy< typename ViewType::execution_space > range(0,self.v.dimension_0());
+ Kokkos::parallel_reduce( range , self , error_count );
+ ASSERT_EQ( 0 , error_count );
+ }
+};
+
+
+template< class ExecSpace >
+void test_view_mapping_operator()
+{
+ TestViewMapOperator< Kokkos::Experimental::View<int,Kokkos::LayoutLeft,ExecSpace> >::run();
+ TestViewMapOperator< Kokkos::Experimental::View<int*,Kokkos::LayoutLeft,ExecSpace> >::run();
+ TestViewMapOperator< Kokkos::Experimental::View<int**,Kokkos::LayoutLeft,ExecSpace> >::run();
+ TestViewMapOperator< Kokkos::Experimental::View<int***,Kokkos::LayoutLeft,ExecSpace> >::run();
+ TestViewMapOperator< Kokkos::Experimental::View<int****,Kokkos::LayoutLeft,ExecSpace> >::run();
+ TestViewMapOperator< Kokkos::Experimental::View<int*****,Kokkos::LayoutLeft,ExecSpace> >::run();
+ TestViewMapOperator< Kokkos::Experimental::View<int******,Kokkos::LayoutLeft,ExecSpace> >::run();
+ TestViewMapOperator< Kokkos::Experimental::View<int*******,Kokkos::LayoutLeft,ExecSpace> >::run();
+
+ TestViewMapOperator< Kokkos::Experimental::View<int,Kokkos::LayoutRight,ExecSpace> >::run();
+ TestViewMapOperator< Kokkos::Experimental::View<int*,Kokkos::LayoutRight,ExecSpace> >::run();
+ TestViewMapOperator< Kokkos::Experimental::View<int**,Kokkos::LayoutRight,ExecSpace> >::run();
+ TestViewMapOperator< Kokkos::Experimental::View<int***,Kokkos::LayoutRight,ExecSpace> >::run();
+ TestViewMapOperator< Kokkos::Experimental::View<int****,Kokkos::LayoutRight,ExecSpace> >::run();
+ TestViewMapOperator< Kokkos::Experimental::View<int*****,Kokkos::LayoutRight,ExecSpace> >::run();
+ TestViewMapOperator< Kokkos::Experimental::View<int******,Kokkos::LayoutRight,ExecSpace> >::run();
+ TestViewMapOperator< Kokkos::Experimental::View<int*******,Kokkos::LayoutRight,ExecSpace> >::run();
+}
+
+/*--------------------------------------------------------------------------*/
+
+template< class ExecSpace >
+struct TestViewMappingAtomic {
+ typedef Kokkos::MemoryTraits< Kokkos::Atomic > mem_trait ;
+
+ typedef Kokkos::Experimental::View< int * , ExecSpace > T ;
+ typedef Kokkos::Experimental::View< int * , ExecSpace , mem_trait > T_atom ;
+
+ T x ;
+ T_atom x_atom ;
+
+ constexpr static size_t N = 100000 ;
+
+ struct TagInit {};
+ struct TagUpdate {};
+ struct TagVerify {};
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const TagInit & , const int i ) const
+ { x(i) = i ; }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const TagUpdate & , const int i ) const
+ { x_atom(i%2) += 1 ; }
+
+ KOKKOS_INLINE_FUNCTION
+ void operator()( const TagVerify & , const int i , long & error_count ) const
+ {
+ if ( i < 2 ) { if ( x(i) != int(i + N / 2) ) ++error_count ; }
+ else { if ( x(i) != int(i) ) ++error_count ; }
+ }
+
+ TestViewMappingAtomic()
+ : x("x",N)
+ , x_atom( x )
+ {}
+
+ static void run()
+ {
+ ASSERT_TRUE( T::reference_type_is_lvalue_reference );
+ ASSERT_FALSE( T_atom::reference_type_is_lvalue_reference );
+
+ TestViewMappingAtomic self ;
+ Kokkos::parallel_for( Kokkos::RangePolicy< ExecSpace , TagInit >(0,N) , self );
+ Kokkos::parallel_for( Kokkos::RangePolicy< ExecSpace , TagUpdate >(0,N) , self );
+ long error_count = -1 ;
+ Kokkos::parallel_reduce( Kokkos::RangePolicy< ExecSpace , TagVerify >(0,N) , self , error_count );
+ ASSERT_EQ( 0 , error_count );
+ }
+};
+
+
+} /* namespace Test */
+
+/*--------------------------------------------------------------------------*/
+
diff --git a/lib/kokkos/core/src/impl/Kokkos_spinwait.cpp b/lib/kokkos/core/unit_test/TestViewOfClass.hpp
similarity index 55%
copy from lib/kokkos/core/src/impl/Kokkos_spinwait.cpp
copy to lib/kokkos/core/unit_test/TestViewOfClass.hpp
index 1e9ff91c2..09abacd80 100755
--- a/lib/kokkos/core/src/impl/Kokkos_spinwait.cpp
+++ b/lib/kokkos/core/unit_test/TestViewOfClass.hpp
@@ -1,80 +1,126 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
-#include <Kokkos_Macros.hpp>
-#include <impl/Kokkos_spinwait.hpp>
+#include <gtest/gtest.h>
+
+#include <Kokkos_Core.hpp>
+#include <stdexcept>
+#include <sstream>
+#include <iostream>
/*--------------------------------------------------------------------------*/
-#if ( KOKKOS_ENABLE_ASM )
- #if defined( __arm__ )
- /* No-operation instruction to idle the thread. */
- #define YIELD asm volatile("nop")
- #else
- /* Pause instruction to prevent excess processor bus usage */
- #define YIELD asm volatile("pause\n":::"memory")
- #endif
-#elif defined( KOKKOS_HAVE_WINTHREAD )
- #include <process.h>
- #define YIELD Sleep(0)
+namespace Test {
+
+namespace {
+volatile int nested_view_count ;
+}
+
+template< class Space >
+class NestedView {
+private:
+ Kokkos::View<int*,Space> member ;
+
+public:
+
+ KOKKOS_INLINE_FUNCTION
+ NestedView()
+#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
+ : member("member",2)
+ { Kokkos::atomic_increment( & nested_view_count ); }
#else
- #include <sched.h>
- #define YIELD sched_yield()
+ : member(){}
#endif
-/*--------------------------------------------------------------------------*/
-
-namespace Kokkos {
-namespace Impl {
+ ~NestedView()
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
-void spinwait( volatile int & flag , const int value )
+ { Kokkos::atomic_decrement( & nested_view_count ); }
+#else
+ {}
+#endif
+
+};
+
+
+template< class Space >
+void view_nested_view()
{
- while ( value == flag ) {
- YIELD ;
+ ASSERT_EQ( 0 , nested_view_count );
+ {
+ Kokkos::View< NestedView<Space> * , Space > a("a_nested_view",2);
+ ASSERT_EQ( 2 , nested_view_count );
+ Kokkos::View< NestedView<Space> * , Space > b("b_nested_view",2);
+ ASSERT_EQ( 4 , nested_view_count );
}
+ // ASSERT_EQ( 0 , nested_view_count );
}
-#endif
-} /* namespace Impl */
-} /* namespace Kokkos */
+}
+
+namespace Kokkos {
+namespace Impl {
+
+template< class ExecSpace , class S >
+struct ViewDefaultConstruct< ExecSpace , Test::NestedView<S> , true >
+{
+ typedef Test::NestedView<S> type ;
+ type * const m_ptr ;
+
+ KOKKOS_FORCEINLINE_FUNCTION
+ void operator()( const typename ExecSpace::size_type& i ) const
+ { new(m_ptr+i) type(); }
+
+ ViewDefaultConstruct( type * pointer , size_t capacity )
+ : m_ptr( pointer )
+ {
+ Kokkos::RangePolicy< ExecSpace > range( 0 , capacity );
+ parallel_for( range , *this );
+ ExecSpace::fence();
+ }
+};
+
+} // namespace Impl
+} // namespace Kokkos
+
+/*--------------------------------------------------------------------------*/
diff --git a/lib/kokkos/core/unit_test/TestViewSubview.hpp b/lib/kokkos/core/unit_test/TestViewSubview.hpp
new file mode 100755
index 000000000..8bf201fb4
--- /dev/null
+++ b/lib/kokkos/core/unit_test/TestViewSubview.hpp
@@ -0,0 +1,632 @@
+/*
+//@HEADER
+// ************************************************************************
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
+// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
+// the U.S. Government retains certain rights in this software.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. Neither the name of the Corporation nor the names of the
+// contributors may be used to endorse or promote products derived from
+// this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
+// ************************************************************************
+//@HEADER
+*/
+
+#include <gtest/gtest.h>
+
+#include <Kokkos_Core.hpp>
+#include <stdexcept>
+#include <sstream>
+#include <iostream>
+
+/*--------------------------------------------------------------------------*/
+
+namespace TestViewSubview {
+
+#if defined( KOKKOS_USING_EXPERIMENTAL_VIEW )
+
+using Kokkos::Experimental::ALL ;
+
+#else
+
+namespace {
+
+const Kokkos::ALL ALL ;
+
+}
+
+#endif
+
+template<class Layout, class Space>
+struct getView {
+ static
+ Kokkos::View<double**,Layout,Space> get(int n, int m) {
+ return Kokkos::View<double**,Layout,Space>("G",n,m);
+ }
+};
+
+template<class Space>
+struct getView<Kokkos::LayoutStride,Space> {
+ static
+ Kokkos::View<double**,Kokkos::LayoutStride,Space> get(int n, int m) {
+ const int rank = 2 ;
+ const int order[] = { 0, 1 };
+ const unsigned dim[] = { unsigned(n), unsigned(m) };
+ Kokkos::LayoutStride stride = Kokkos::LayoutStride::order_dimensions( rank , order , dim );
+ return Kokkos::View<double**,Kokkos::LayoutStride,Space>("G",stride);
+ }
+};
+
+template<class ViewType, class Space>
+struct fill_1D {
+ typedef typename Space::execution_space execution_space;
+ typedef typename ViewType::size_type size_type;
+ ViewType a;
+ double val;
+ fill_1D(ViewType a_, double val_):a(a_),val(val_) {
+ }
+ KOKKOS_INLINE_FUNCTION
+ void operator() (const int i) const {
+ a(i) = val;
+ }
+};
+
+template<class ViewType, class Space>
+struct fill_2D {
+ typedef typename Space::execution_space execution_space;
+ typedef typename ViewType::size_type size_type;
+ ViewType a;
+ double val;
+ fill_2D(ViewType a_, double val_):a(a_),val(val_) {
+ }
+ KOKKOS_INLINE_FUNCTION
+ void operator() (const int i) const{
+ for(int j = 0; j < static_cast<int>(a.dimension_1()); j++)
+ a(i,j) = val;
+ }
+};
+
+template<class Layout, class Space>
+void test_auto_1d ()
+{
+ typedef Kokkos::View<double**, Layout, Space> mv_type;
+ typedef typename mv_type::size_type size_type;
+ const double ZERO = 0.0;
+ const double ONE = 1.0;
+ const double TWO = 2.0;
+
+ const size_type numRows = 10;
+ const size_type numCols = 3;
+
+ mv_type X = getView<Layout,Space>::get(numRows, numCols);
+ typename mv_type::HostMirror X_h = Kokkos::create_mirror_view (X);
+
+ fill_2D<mv_type,Space> f1(X, ONE);
+ Kokkos::parallel_for(X.dimension_0(),f1);
+ Kokkos::deep_copy (X_h, X);
+ for (size_type j = 0; j < numCols; ++j) {
+ for (size_type i = 0; i < numRows; ++i) {
+ ASSERT_TRUE(X_h(i,j) == ONE);
+ }
+ }
+
+ fill_2D<mv_type,Space> f2(X, 0.0);
+ Kokkos::parallel_for(X.dimension_0(),f2);
+ Kokkos::deep_copy (X_h, X);
+ for (size_type j = 0; j < numCols; ++j) {
+ for (size_type i = 0; i < numRows; ++i) {
+ ASSERT_TRUE(X_h(i,j) == ZERO);
+ }
+ }
+
+ fill_2D<mv_type,Space> f3(X, TWO);
+ Kokkos::parallel_for(X.dimension_0(),f3);
+ Kokkos::deep_copy (X_h, X);
+ for (size_type j = 0; j < numCols; ++j) {
+ for (size_type i = 0; i < numRows; ++i) {
+ ASSERT_TRUE(X_h(i,j) == TWO);
+ }
+ }
+
+ for (size_type j = 0; j < numCols; ++j) {
+ auto X_j = Kokkos::subview (X, TestViewSubview::ALL, j);
+
+ fill_1D<decltype(X_j),Space> f4(X_j, ZERO);
+ Kokkos::parallel_for(X_j.dimension_0(),f4);
+ Kokkos::deep_copy (X_h, X);
+ for (size_type i = 0; i < numRows; ++i) {
+ ASSERT_TRUE(X_h(i,j) == ZERO);
+ }
+
+ for (size_type jj = 0; jj < numCols; ++jj) {
+ auto X_jj = Kokkos::subview (X, TestViewSubview::ALL, jj);
+ fill_1D<decltype(X_jj),Space> f5(X_jj, ONE);
+ Kokkos::parallel_for(X_jj.dimension_0(),f5);
+ Kokkos::deep_copy (X_h, X);
+ for (size_type i = 0; i < numRows; ++i) {
+ ASSERT_TRUE(X_h(i,jj) == ONE);
+ }
+ }
+ }
+}
+
+template<class LD, class LS, class Space>
+void test_1d_strided_assignment_impl(bool a, bool b, bool c, bool d, int n, int m) {
+ Kokkos::View<double**,LS,Space> l2d("l2d",n,m);
+
+ int col = n>2?2:0;
+ int row = m>2?2:0;
+
+ if(Kokkos::Impl::VerifyExecutionCanAccessMemorySpace<Kokkos::HostSpace,Space>::value) {
+ if(a) {
+ Kokkos::View<double*,LD,Space> l1da = Kokkos::subview(l2d,TestViewSubview::ALL,row);
+ ASSERT_TRUE( & l1da(0) == & l2d(0,row) );
+ if(n>1)
+ ASSERT_TRUE( & l1da(1) == & l2d(1,row) );
+ }
+ if(b && n>13) {
+ Kokkos::View<double*,LD,Space> l1db = Kokkos::subview(l2d,std::pair<unsigned,unsigned>(2,13),row);
+ ASSERT_TRUE( & l1db(0) == & l2d(2,row) );
+ ASSERT_TRUE( & l1db(1) == & l2d(3,row) );
+ }
+ if(c) {
+ Kokkos::View<double*,LD,Space> l1dc = Kokkos::subview(l2d,col,TestViewSubview::ALL);
+ ASSERT_TRUE( & l1dc(0) == & l2d(col,0) );
+ if(m>1)
+ ASSERT_TRUE( & l1dc(1) == & l2d(col,1) );
+ }
+ if(d && m>13) {
+ Kokkos::View<double*,LD,Space> l1dd = Kokkos::subview(l2d,col,std::pair<unsigned,unsigned>(2,13));
+ ASSERT_TRUE( & l1dd(0) == & l2d(col,2) );
+ ASSERT_TRUE( & l1dd(1) == & l2d(col,3) );
+ }
+ }
+
+}
+
+template<class Space >
+void test_1d_strided_assignment() {
+ test_1d_strided_assignment_impl<Kokkos::LayoutStride,Kokkos::LayoutLeft,Space>(true,true,true,true,17,3);
+ test_1d_strided_assignment_impl<Kokkos::LayoutStride,Kokkos::LayoutRight,Space>(true,true,true,true,17,3);
+
+ test_1d_strided_assignment_impl<Kokkos::LayoutLeft,Kokkos::LayoutLeft,Space>(true,true,false,false,17,3);
+ test_1d_strided_assignment_impl<Kokkos::LayoutRight,Kokkos::LayoutLeft,Space>(true,true,false,false,17,3);
+ test_1d_strided_assignment_impl<Kokkos::LayoutLeft,Kokkos::LayoutRight,Space>(false,false,true,true,17,3);
+ test_1d_strided_assignment_impl<Kokkos::LayoutRight,Kokkos::LayoutRight,Space>(false,false,true,true,17,3);
+
+ test_1d_strided_assignment_impl<Kokkos::LayoutLeft,Kokkos::LayoutLeft,Space>(true,true,false,false,17,1);
+ test_1d_strided_assignment_impl<Kokkos::LayoutLeft,Kokkos::LayoutLeft,Space>(true,true,true,true,1,17);
+ test_1d_strided_assignment_impl<Kokkos::LayoutRight,Kokkos::LayoutLeft,Space>(true,true,true,true,1,17);
+ test_1d_strided_assignment_impl<Kokkos::LayoutRight,Kokkos::LayoutLeft,Space>(true,true,false,false,17,1);
+
+ test_1d_strided_assignment_impl<Kokkos::LayoutLeft,Kokkos::LayoutRight,Space>(true,true,true,true,17,1);
+ test_1d_strided_assignment_impl<Kokkos::LayoutLeft,Kokkos::LayoutRight,Space>(false,false,true,true,1,17);
+ test_1d_strided_assignment_impl<Kokkos::LayoutRight,Kokkos::LayoutRight,Space>(false,false,true,true,1,17);
+ test_1d_strided_assignment_impl<Kokkos::LayoutRight,Kokkos::LayoutRight,Space>(true,true,true,true,17,1);
+}
+
+template< class Space >
+void test_left_0()
+{
+ typedef Kokkos::View< int [2][3][4][5][2][3][4][5] , Kokkos::LayoutLeft , Space >
+ view_static_8_type ;
+
+ view_static_8_type x_static_8("x_static_left_8");
+
+ ASSERT_TRUE( x_static_8.is_contiguous() );
+
+ Kokkos::View<int,Kokkos::LayoutLeft,Space> x0 = Kokkos::subview( x_static_8 , 0, 0, 0, 0, 0, 0, 0, 0 );
+
+ ASSERT_TRUE( x0.is_contiguous() );
+ ASSERT_TRUE( & x0() == & x_static_8(0,0,0,0,0,0,0,0) );
+
+ Kokkos::View<int*,Kokkos::LayoutLeft,Space> x1 =
+ Kokkos::subview( x_static_8, Kokkos::pair<int,int>(0,2), 1, 2, 3, 0, 1, 2, 3 );
+
+ ASSERT_TRUE( x1.is_contiguous() );
+ ASSERT_TRUE( & x1(0) == & x_static_8(0,1,2,3,0,1,2,3) );
+ ASSERT_TRUE( & x1(1) == & x_static_8(1,1,2,3,0,1,2,3) );
+
+ Kokkos::View<int**,Kokkos::LayoutLeft,Space> x2 =
+ Kokkos::subview( x_static_8, Kokkos::pair<int,int>(0,2), 1, 2, 3
+ , Kokkos::pair<int,int>(0,2), 1, 2, 3 );
+
+ ASSERT_TRUE( ! x2.is_contiguous() );
+ ASSERT_TRUE( & x2(0,0) == & x_static_8(0,1,2,3,0,1,2,3) );
+ ASSERT_TRUE( & x2(1,0) == & x_static_8(1,1,2,3,0,1,2,3) );
+ ASSERT_TRUE( & x2(0,1) == & x_static_8(0,1,2,3,1,1,2,3) );
+ ASSERT_TRUE( & x2(1,1) == & x_static_8(1,1,2,3,1,1,2,3) );
+
+ // Kokkos::View<int**,Kokkos::LayoutLeft,Space> error_2 =
+ Kokkos::View<int**,Kokkos::LayoutStride,Space> sx2 =
+ Kokkos::subview( x_static_8, 1, Kokkos::pair<int,int>(0,2), 2, 3
+ , Kokkos::pair<int,int>(0,2), 1, 2, 3 );
+
+ ASSERT_TRUE( ! sx2.is_contiguous() );
+ ASSERT_TRUE( & sx2(0,0) == & x_static_8(1,0,2,3,0,1,2,3) );
+ ASSERT_TRUE( & sx2(1,0) == & x_static_8(1,1,2,3,0,1,2,3) );
+ ASSERT_TRUE( & sx2(0,1) == & x_static_8(1,0,2,3,1,1,2,3) );
+ ASSERT_TRUE( & sx2(1,1) == & x_static_8(1,1,2,3,1,1,2,3) );
+
+ Kokkos::View<int****,Kokkos::LayoutStride,Space> sx4 =
+ Kokkos::subview( x_static_8, 0, Kokkos::pair<int,int>(0,2) /* of [3] */
+ , 1, Kokkos::pair<int,int>(1,3) /* of [5] */
+ , 1, Kokkos::pair<int,int>(0,2) /* of [3] */
+ , 2, Kokkos::pair<int,int>(2,4) /* of [5] */
+ );
+
+ ASSERT_TRUE( ! sx4.is_contiguous() );
+
+ for ( int i0 = 0 ; i0 < (int) sx4.dimension_0() ; ++i0 )
+ for ( int i1 = 0 ; i1 < (int) sx4.dimension_1() ; ++i1 )
+ for ( int i2 = 0 ; i2 < (int) sx4.dimension_2() ; ++i2 )
+ for ( int i3 = 0 ; i3 < (int) sx4.dimension_3() ; ++i3 ) {
+ ASSERT_TRUE( & sx4(i0,i1,i2,i3) == & x_static_8(0,0+i0, 1,1+i1, 1,0+i2, 2,2+i3) );
+ }
+}
+
+template< class Space >
+void test_left_1()
+{
+ typedef Kokkos::View< int ****[2][3][4][5] , Kokkos::LayoutLeft , Space >
+ view_type ;
+
+ view_type x8("x_left_8",2,3,4,5);
+
+ ASSERT_TRUE( x8.is_contiguous() );
+
+ Kokkos::View<int,Kokkos::LayoutLeft,Space> x0 = Kokkos::subview( x8 , 0, 0, 0, 0, 0, 0, 0, 0 );
+
+ ASSERT_TRUE( x0.is_contiguous() );
+ ASSERT_TRUE( & x0() == & x8(0,0,0,0,0,0,0,0) );
+
+ Kokkos::View<int*,Kokkos::LayoutLeft,Space> x1 =
+ Kokkos::subview( x8, Kokkos::pair<int,int>(0,2), 1, 2, 3, 0, 1, 2, 3 );
+
+ ASSERT_TRUE( x1.is_contiguous() );
+ ASSERT_TRUE( & x1(0) == & x8(0,1,2,3,0,1,2,3) );
+ ASSERT_TRUE( & x1(1) == & x8(1,1,2,3,0,1,2,3) );
+
+ Kokkos::View<int**,Kokkos::LayoutLeft,Space> x2 =
+ Kokkos::subview( x8, Kokkos::pair<int,int>(0,2), 1, 2, 3
+ , Kokkos::pair<int,int>(0,2), 1, 2, 3 );
+
+ ASSERT_TRUE( ! x2.is_contiguous() );
+ ASSERT_TRUE( & x2(0,0) == & x8(0,1,2,3,0,1,2,3) );
+ ASSERT_TRUE( & x2(1,0) == & x8(1,1,2,3,0,1,2,3) );
+ ASSERT_TRUE( & x2(0,1) == & x8(0,1,2,3,1,1,2,3) );
+ ASSERT_TRUE( & x2(1,1) == & x8(1,1,2,3,1,1,2,3) );
+
+ // Kokkos::View<int**,Kokkos::LayoutLeft,Space> error_2 =
+ Kokkos::View<int**,Kokkos::LayoutStride,Space> sx2 =
+ Kokkos::subview( x8, 1, Kokkos::pair<int,int>(0,2), 2, 3
+ , Kokkos::pair<int,int>(0,2), 1, 2, 3 );
+
+ ASSERT_TRUE( ! sx2.is_contiguous() );
+ ASSERT_TRUE( & sx2(0,0) == & x8(1,0,2,3,0,1,2,3) );
+ ASSERT_TRUE( & sx2(1,0) == & x8(1,1,2,3,0,1,2,3) );
+ ASSERT_TRUE( & sx2(0,1) == & x8(1,0,2,3,1,1,2,3) );
+ ASSERT_TRUE( & sx2(1,1) == & x8(1,1,2,3,1,1,2,3) );
+
+ Kokkos::View<int****,Kokkos::LayoutStride,Space> sx4 =
+ Kokkos::subview( x8, 0, Kokkos::pair<int,int>(0,2) /* of [3] */
+ , 1, Kokkos::pair<int,int>(1,3) /* of [5] */
+ , 1, Kokkos::pair<int,int>(0,2) /* of [3] */
+ , 2, Kokkos::pair<int,int>(2,4) /* of [5] */
+ );
+
+ ASSERT_TRUE( ! sx4.is_contiguous() );
+
+ for ( int i0 = 0 ; i0 < (int) sx4.dimension_0() ; ++i0 )
+ for ( int i1 = 0 ; i1 < (int) sx4.dimension_1() ; ++i1 )
+ for ( int i2 = 0 ; i2 < (int) sx4.dimension_2() ; ++i2 )
+ for ( int i3 = 0 ; i3 < (int) sx4.dimension_3() ; ++i3 ) {
+ ASSERT_TRUE( & sx4(i0,i1,i2,i3) == & x8(0,0+i0, 1,1+i1, 1,0+i2, 2,2+i3) );
+ }
+}
+
+template< class Space >
+void test_left_2()
+{
+ typedef Kokkos::View< int **** , Kokkos::LayoutLeft , Space > view_type ;
+
+ view_type x4("x4",2,3,4,5);
+
+ ASSERT_TRUE( x4.is_contiguous() );
+
+ Kokkos::View<int,Kokkos::LayoutLeft,Space> x0 = Kokkos::subview( x4 , 0, 0, 0, 0 );
+
+ ASSERT_TRUE( x0.is_contiguous() );
+ ASSERT_TRUE( & x0() == & x4(0,0,0,0) );
+
+ Kokkos::View<int*,Kokkos::LayoutLeft,Space> x1 =
+ Kokkos::subview( x4, Kokkos::pair<int,int>(0,2), 1, 2, 3 );
+
+ ASSERT_TRUE( x1.is_contiguous() );
+ ASSERT_TRUE( & x1(0) == & x4(0,1,2,3) );
+ ASSERT_TRUE( & x1(1) == & x4(1,1,2,3) );
+
+ Kokkos::View<int**,Kokkos::LayoutLeft,Space> x2 =
+ Kokkos::subview( x4, Kokkos::pair<int,int>(0,2), 1, Kokkos::pair<int,int>(1,3), 2 );
+
+ ASSERT_TRUE( ! x2.is_contiguous() );
+ ASSERT_TRUE( & x2(0,0) == & x4(0,1,1,2) );
+ ASSERT_TRUE( & x2(1,0) == & x4(1,1,1,2) );
+ ASSERT_TRUE( & x2(0,1) == & x4(0,1,2,2) );
+ ASSERT_TRUE( & x2(1,1) == & x4(1,1,2,2) );
+
+ // Kokkos::View<int**,Kokkos::LayoutLeft,Space> error_2 =
+ Kokkos::View<int**,Kokkos::LayoutStride,Space> sx2 =
+ Kokkos::subview( x4, 1, Kokkos::pair<int,int>(0,2)
+ , 2, Kokkos::pair<int,int>(1,4) );
+
+ ASSERT_TRUE( ! sx2.is_contiguous() );
+ ASSERT_TRUE( & sx2(0,0) == & x4(1,0,2,1) );
+ ASSERT_TRUE( & sx2(1,0) == & x4(1,1,2,1) );
+ ASSERT_TRUE( & sx2(0,1) == & x4(1,0,2,2) );
+ ASSERT_TRUE( & sx2(1,1) == & x4(1,1,2,2) );
+ ASSERT_TRUE( & sx2(0,2) == & x4(1,0,2,3) );
+ ASSERT_TRUE( & sx2(1,2) == & x4(1,1,2,3) );
+
+ Kokkos::View<int****,Kokkos::LayoutStride,Space> sx4 =
+ Kokkos::subview( x4, Kokkos::pair<int,int>(1,2) /* of [2] */
+ , Kokkos::pair<int,int>(1,3) /* of [3] */
+ , Kokkos::pair<int,int>(0,4) /* of [4] */
+ , Kokkos::pair<int,int>(2,4) /* of [5] */
+ );
+
+ ASSERT_TRUE( ! sx4.is_contiguous() );
+
+ for ( int i0 = 0 ; i0 < (int) sx4.dimension_0() ; ++i0 )
+ for ( int i1 = 0 ; i1 < (int) sx4.dimension_1() ; ++i1 )
+ for ( int i2 = 0 ; i2 < (int) sx4.dimension_2() ; ++i2 )
+ for ( int i3 = 0 ; i3 < (int) sx4.dimension_3() ; ++i3 ) {
+ ASSERT_TRUE( & sx4(i0,i1,i2,i3) == & x4( 1+i0, 1+i1, 0+i2, 2+i3 ) );
+ }
+}
+
+template< class Space >
+void test_left_3()
+{
+ typedef Kokkos::View< int ** , Kokkos::LayoutLeft , Space > view_type ;
+
+ view_type xm("x4",10,5);
+
+ ASSERT_TRUE( xm.is_contiguous() );
+
+ Kokkos::View<int,Kokkos::LayoutLeft,Space> x0 = Kokkos::subview( xm , 5, 3 );
+
+ ASSERT_TRUE( x0.is_contiguous() );
+ ASSERT_TRUE( & x0() == & xm(5,3) );
+
+ Kokkos::View<int*,Kokkos::LayoutLeft,Space> x1 =
+ Kokkos::subview( xm, TestViewSubview::ALL, 3 );
+
+ ASSERT_TRUE( x1.is_contiguous() );
+ for ( int i = 0 ; i < int(xm.dimension_0()) ; ++i ) {
+ ASSERT_TRUE( & x1(i) == & xm(i,3) );
+ }
+
+ Kokkos::View<int**,Kokkos::LayoutLeft,Space> x2 =
+ Kokkos::subview( xm, Kokkos::pair<int,int>(1,9), TestViewSubview::ALL );
+
+ ASSERT_TRUE( ! x2.is_contiguous() );
+ for ( int j = 0 ; j < int(x2.dimension_1()) ; ++j )
+ for ( int i = 0 ; i < int(x2.dimension_0()) ; ++i ) {
+ ASSERT_TRUE( & x2(i,j) == & xm(1+i,j) );
+ }
+
+ Kokkos::View<int**,Kokkos::LayoutLeft,Space> x2c =
+ Kokkos::subview( xm, TestViewSubview::ALL, std::pair<int,int>(2,4) );
+
+ ASSERT_TRUE( x2c.is_contiguous() );
+ for ( int j = 0 ; j < int(x2c.dimension_1()) ; ++j )
+ for ( int i = 0 ; i < int(x2c.dimension_0()) ; ++i ) {
+ ASSERT_TRUE( & x2c(i,j) == & xm(i,2+j) );
+ }
+
+ Kokkos::View<int**,Kokkos::LayoutLeft,Space> x2_n1 =
+ Kokkos::subview( xm , std::pair<int,int>(1,1) , TestViewSubview::ALL );
+
+ ASSERT_TRUE( x2_n1.dimension_0() == 0 );
+ ASSERT_TRUE( x2_n1.dimension_1() == xm.dimension_1() );
+
+ Kokkos::View<int**,Kokkos::LayoutLeft,Space> x2_n2 =
+ Kokkos::subview( xm , TestViewSubview::ALL , std::pair<int,int>(1,1) );
+
+ ASSERT_TRUE( x2_n2.dimension_0() == xm.dimension_0() );
+ ASSERT_TRUE( x2_n2.dimension_1() == 0 );
+}
+
+//----------------------------------------------------------------------------
+
+template< class Space >
+void test_right_0()
+{
+ typedef Kokkos::View< int [2][3][4][5][2][3][4][5] , Kokkos::LayoutRight , Space >
+ view_static_8_type ;
+
+ view_static_8_type x_static_8("x_static_right_8");
+
+ Kokkos::View<int,Kokkos::LayoutRight,Space> x0 = Kokkos::subview( x_static_8 , 0, 0, 0, 0, 0, 0, 0, 0 );
+
+ ASSERT_TRUE( & x0() == & x_static_8(0,0,0,0,0,0,0,0) );
+
+ Kokkos::View<int*,Kokkos::LayoutRight,Space> x1 =
+ Kokkos::subview( x_static_8, 0, 1, 2, 3, 0, 1, 2, Kokkos::pair<int,int>(1,3) );
+
+ ASSERT_TRUE( & x1(0) == & x_static_8(0,1,2,3,0,1,2,1) );
+ ASSERT_TRUE( & x1(1) == & x_static_8(0,1,2,3,0,1,2,2) );
+
+ Kokkos::View<int**,Kokkos::LayoutRight,Space> x2 =
+ Kokkos::subview( x_static_8, 0, 1, 2, Kokkos::pair<int,int>(1,3)
+ , 0, 1, 2, Kokkos::pair<int,int>(1,3) );
+
+ ASSERT_TRUE( & x2(0,0) == & x_static_8(0,1,2,1,0,1,2,1) );
+ ASSERT_TRUE( & x2(1,0) == & x_static_8(0,1,2,2,0,1,2,1) );
+ ASSERT_TRUE( & x2(0,1) == & x_static_8(0,1,2,1,0,1,2,2) );
+ ASSERT_TRUE( & x2(1,1) == & x_static_8(0,1,2,2,0,1,2,2) );
+
+ // Kokkos::View<int**,Kokkos::LayoutRight,Space> error_2 =
+ Kokkos::View<int**,Kokkos::LayoutStride,Space> sx2 =
+ Kokkos::subview( x_static_8, 1, Kokkos::pair<int,int>(0,2), 2, 3
+ , Kokkos::pair<int,int>(0,2), 1, 2, 3 );
+
+ ASSERT_TRUE( & sx2(0,0) == & x_static_8(1,0,2,3,0,1,2,3) );
+ ASSERT_TRUE( & sx2(1,0) == & x_static_8(1,1,2,3,0,1,2,3) );
+ ASSERT_TRUE( & sx2(0,1) == & x_static_8(1,0,2,3,1,1,2,3) );
+ ASSERT_TRUE( & sx2(1,1) == & x_static_8(1,1,2,3,1,1,2,3) );
+
+ Kokkos::View<int****,Kokkos::LayoutStride,Space> sx4 =
+ Kokkos::subview( x_static_8, 0, Kokkos::pair<int,int>(0,2) /* of [3] */
+ , 1, Kokkos::pair<int,int>(1,3) /* of [5] */
+ , 1, Kokkos::pair<int,int>(0,2) /* of [3] */
+ , 2, Kokkos::pair<int,int>(2,4) /* of [5] */
+ );
+
+ for ( int i0 = 0 ; i0 < (int) sx4.dimension_0() ; ++i0 )
+ for ( int i1 = 0 ; i1 < (int) sx4.dimension_1() ; ++i1 )
+ for ( int i2 = 0 ; i2 < (int) sx4.dimension_2() ; ++i2 )
+ for ( int i3 = 0 ; i3 < (int) sx4.dimension_3() ; ++i3 ) {
+ ASSERT_TRUE( & sx4(i0,i1,i2,i3) == & x_static_8(0, 0+i0, 1, 1+i1, 1, 0+i2, 2, 2+i3) );
+ }
+}
+
+template< class Space >
+void test_right_1()
+{
+ typedef Kokkos::View< int ****[2][3][4][5] , Kokkos::LayoutRight , Space >
+ view_type ;
+
+ view_type x8("x_right_8",2,3,4,5);
+
+ Kokkos::View<int,Kokkos::LayoutRight,Space> x0 = Kokkos::subview( x8 , 0, 0, 0, 0, 0, 0, 0, 0 );
+
+ ASSERT_TRUE( & x0() == & x8(0,0,0,0,0,0,0,0) );
+
+ Kokkos::View<int*,Kokkos::LayoutRight,Space> x1 =
+ Kokkos::subview( x8, 0, 1, 2, 3, 0, 1, 2, Kokkos::pair<int,int>(1,3) );
+
+ ASSERT_TRUE( & x1(0) == & x8(0,1,2,3,0,1,2,1) );
+ ASSERT_TRUE( & x1(1) == & x8(0,1,2,3,0,1,2,2) );
+
+ Kokkos::View<int**,Kokkos::LayoutRight,Space> x2 =
+ Kokkos::subview( x8, 0, 1, 2, Kokkos::pair<int,int>(1,3)
+ , 0, 1, 2, Kokkos::pair<int,int>(1,3) );
+
+ ASSERT_TRUE( & x2(0,0) == & x8(0,1,2,1,0,1,2,1) );
+ ASSERT_TRUE( & x2(1,0) == & x8(0,1,2,2,0,1,2,1) );
+ ASSERT_TRUE( & x2(0,1) == & x8(0,1,2,1,0,1,2,2) );
+ ASSERT_TRUE( & x2(1,1) == & x8(0,1,2,2,0,1,2,2) );
+
+ // Kokkos::View<int**,Kokkos::LayoutRight,Space> error_2 =
+ Kokkos::View<int**,Kokkos::LayoutStride,Space> sx2 =
+ Kokkos::subview( x8, 1, Kokkos::pair<int,int>(0,2), 2, 3
+ , Kokkos::pair<int,int>(0,2), 1, 2, 3 );
+
+ ASSERT_TRUE( & sx2(0,0) == & x8(1,0,2,3,0,1,2,3) );
+ ASSERT_TRUE( & sx2(1,0) == & x8(1,1,2,3,0,1,2,3) );
+ ASSERT_TRUE( & sx2(0,1) == & x8(1,0,2,3,1,1,2,3) );
+ ASSERT_TRUE( & sx2(1,1) == & x8(1,1,2,3,1,1,2,3) );
+
+ Kokkos::View<int****,Kokkos::LayoutStride,Space> sx4 =
+ Kokkos::subview( x8, 0, Kokkos::pair<int,int>(0,2) /* of [3] */
+ , 1, Kokkos::pair<int,int>(1,3) /* of [5] */
+ , 1, Kokkos::pair<int,int>(0,2) /* of [3] */
+ , 2, Kokkos::pair<int,int>(2,4) /* of [5] */
+ );
+
+ for ( int i0 = 0 ; i0 < (int) sx4.dimension_0() ; ++i0 )
+ for ( int i1 = 0 ; i1 < (int) sx4.dimension_1() ; ++i1 )
+ for ( int i2 = 0 ; i2 < (int) sx4.dimension_2() ; ++i2 )
+ for ( int i3 = 0 ; i3 < (int) sx4.dimension_3() ; ++i3 ) {
+ ASSERT_TRUE( & sx4(i0,i1,i2,i3) == & x8(0,0+i0, 1,1+i1, 1,0+i2, 2,2+i3) );
+ }
+}
+
+template< class Space >
+void test_right_3()
+{
+ typedef Kokkos::View< int ** , Kokkos::LayoutRight , Space > view_type ;
+
+ view_type xm("x4",10,5);
+
+ ASSERT_TRUE( xm.is_contiguous() );
+
+ Kokkos::View<int,Kokkos::LayoutRight,Space> x0 = Kokkos::subview( xm , 5, 3 );
+
+ ASSERT_TRUE( x0.is_contiguous() );
+ ASSERT_TRUE( & x0() == & xm(5,3) );
+
+ Kokkos::View<int*,Kokkos::LayoutRight,Space> x1 =
+ Kokkos::subview( xm, 3, TestViewSubview::ALL );
+
+ ASSERT_TRUE( x1.is_contiguous() );
+ for ( int i = 0 ; i < int(xm.dimension_1()) ; ++i ) {
+ ASSERT_TRUE( & x1(i) == & xm(3,i) );
+ }
+
+ Kokkos::View<int**,Kokkos::LayoutRight,Space> x2c =
+ Kokkos::subview( xm, Kokkos::pair<int,int>(1,9), TestViewSubview::ALL );
+
+ ASSERT_TRUE( x2c.is_contiguous() );
+ for ( int j = 0 ; j < int(x2c.dimension_1()) ; ++j )
+ for ( int i = 0 ; i < int(x2c.dimension_0()) ; ++i ) {
+ ASSERT_TRUE( & x2c(i,j) == & xm(1+i,j) );
+ }
+
+ Kokkos::View<int**,Kokkos::LayoutRight,Space> x2 =
+ Kokkos::subview( xm, TestViewSubview::ALL, std::pair<int,int>(2,4) );
+
+ ASSERT_TRUE( ! x2.is_contiguous() );
+ for ( int j = 0 ; j < int(x2.dimension_1()) ; ++j )
+ for ( int i = 0 ; i < int(x2.dimension_0()) ; ++i ) {
+ ASSERT_TRUE( & x2(i,j) == & xm(i,2+j) );
+ }
+
+ Kokkos::View<int**,Kokkos::LayoutRight,Space> x2_n1 =
+ Kokkos::subview( xm , std::pair<int,int>(1,1) , TestViewSubview::ALL );
+
+ ASSERT_TRUE( x2_n1.dimension_0() == 0 );
+ ASSERT_TRUE( x2_n1.dimension_1() == xm.dimension_1() );
+
+ Kokkos::View<int**,Kokkos::LayoutRight,Space> x2_n2 =
+ Kokkos::subview( xm , TestViewSubview::ALL , std::pair<int,int>(1,1) );
+
+ ASSERT_TRUE( x2_n2.dimension_0() == xm.dimension_0() );
+ ASSERT_TRUE( x2_n2.dimension_1() == 0 );
+}
+
+//----------------------------------------------------------------------------
+
+}
+
diff --git a/lib/kokkos/core/src/impl/Kokkos_spinwait.hpp b/lib/kokkos/core/unit_test/UnitTestMain.cpp
similarity index 74%
copy from lib/kokkos/core/src/impl/Kokkos_spinwait.hpp
copy to lib/kokkos/core/unit_test/UnitTestMain.cpp
index 966291abd..f952ab3db 100755
--- a/lib/kokkos/core/src/impl/Kokkos_spinwait.hpp
+++ b/lib/kokkos/core/unit_test/UnitTestMain.cpp
@@ -1,64 +1,50 @@
/*
//@HEADER
// ************************************************************************
-//
-// Kokkos: Manycore Performance-Portable Multidimensional Arrays
-// Copyright (2012) Sandia Corporation
-//
+//
+// Kokkos v. 2.0
+// Copyright (2014) Sandia Corporation
+//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
-//
+//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
-// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
-//
+// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
+//
// ************************************************************************
//@HEADER
*/
+#include <gtest/gtest.h>
-#ifndef KOKKOS_SPINWAIT_HPP
-#define KOKKOS_SPINWAIT_HPP
-
-#include <Kokkos_Macros.hpp>
-
-namespace Kokkos {
-namespace Impl {
-
-#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
-void spinwait( volatile int & flag , const int value );
-#else
-KOKKOS_INLINE_FUNCTION
-void spinwait( volatile int & , const int ) {}
-#endif
-
-} /* namespace Impl */
-} /* namespace Kokkos */
-
-#endif /* #ifndef KOKKOS_SPINWAIT_HPP */
+int main(int argc, char *argv[]) {
+ ::testing::InitGoogleTest(&argc,argv);
+ return RUN_ALL_TESTS();
+}
diff --git a/lib/kokkos/doc/Doxyfile b/lib/kokkos/doc/Doxyfile
new file mode 100755
index 000000000..bc5c7486b
--- /dev/null
+++ b/lib/kokkos/doc/Doxyfile
@@ -0,0 +1,127 @@
+#
+# Include the global look and feel options
+#
+@INCLUDE = ../../common/Doxyfile
+#
+# Package options
+#
+PROJECT_NAME = "Kokkos Core Kernels Package"
+PROJECT_NUMBER = "Version of the Day"
+OUTPUT_DIRECTORY = .
+OUTPUT_LANGUAGE = English
+
+EXTRACT_ALL = NO
+EXTRACT_PRIVATE = NO
+EXTRACT_STATIC = YES
+HIDE_UNDOC_MEMBERS = YES
+HIDE_UNDOC_CLASSES = YES
+BRIEF_MEMBER_DESC = YES
+REPEAT_BRIEF = YES
+ALWAYS_DETAILED_SEC = YES
+FULL_PATH_NAMES = NO
+STRIP_FROM_PATH =
+INTERNAL_DOCS = NO
+CLASS_DIAGRAMS = YES
+SOURCE_BROWSER = YES
+INLINE_SOURCES = NO
+STRIP_CODE_COMMENTS = YES
+REFERENCED_BY_RELATION = NO
+REFERENCES_RELATION = NO
+CASE_SENSE_NAMES = YES
+HIDE_SCOPE_NAMES = NO
+VERBATIM_HEADERS = YES
+SHOW_INCLUDE_FILES = YES
+#JAVADOC_AUTOBRIEF = YES
+INHERIT_DOCS = YES
+INLINE_INHERITED_MEMB = YES
+INLINE_INFO = YES
+SORT_MEMBER_DOCS = NO
+TAB_SIZE = 2
+ENABLED_SECTIONS =
+SORT_BRIEF_DOCS = NO
+GENERATE_TODOLIST = YES
+GENERATE_TESTLIST = YES
+QUIET = NO
+WARNINGS = YES
+WARN_IF_UNDOCUMENTED = YES
+WARN_FORMAT = "$file:$line: $text"
+
+#
+# INPUT: Where to find files that Doxygen should process. ../classic
+# has a doc/ subdirectory with its own Doxyfile that points to its own
+# files. The other Kokkos subpackages don't currently have their own
+# Doxyfile files, so we have to do it manually here.
+#
+# mfh 26 Sep 2013: I've only added those directories in the Core
+# subpackage that constitute the "public interface" of that
+# subpackage. Please feel free to include additional subdirectories
+# of ../core if you want to generate their documentation as well.
+#
+# mfh 26 Sep 2013: I've only added the Kokkos subpackages here that I
+# think are ready for Doxygen documentation generation. Please feel
+# free to amend this list as you see fit.
+#
+
+INPUT = index.doc ../classic ../core/src ../containers/src ../linalg/src
+FILE_PATTERNS = *.hpp *.cpp *.cuh *.cu
+RECURSIVE = NO
+EXCLUDE_PATTERNS = *.x *.o *.out
+EXAMPLE_PATH =
+EXAMPLE_RECURSIVE = YES
+EXAMPLE_PATTERNS = *.cpp *.hpp
+IMAGE_PATH =
+INPUT_FILTER =
+ALPHABETICAL_INDEX = YES
+COLS_IN_ALPHA_INDEX = 4
+IGNORE_PREFIX =
+#
+# What diagrams are created
+#
+CLASS_GRAPH = YES
+COLLABORATION_GRAPH = NO
+INCLUDE_GRAPH = NO
+INCLUDED_BY_GRAPH = NO
+GRAPHICAL_HIERARCHY = YES
+#
+# Preprocessing
+#
+ENABLE_PREPROCESSING = YES
+MACRO_EXPANSION = YES
+EXPAND_ONLY_PREDEF = YES
+SEARCH_INCLUDES = YES
+INCLUDE_FILE_PATTERNS =
+PREDEFINED = DOXYGEN_SHOULD_SKIP_THIS DOXYGEN_USE_ONLY
+INCLUDE_PATH = ../src
+EXPAND_AS_DEFINED =
+#
+# Links to other packages
+#
+TAGFILES = ../../common/tag_files/teuchos.tag=../../../teuchos/doc/html ../../common/tag_files/epetra.tag=../../../epetra/doc/html \
+ ../../common/tag_files/belos.tag=../../../belos/doc/html ../../common/tag_files/anasazi.tag=../../../anasazi/doc/html \
+ ../../common/tag_files/kokkos.tag=../../../kokkos/doc/html
+GENERATE_TAGFILE = ../../common/tag_files/tpetra.tag
+ALLEXTERNALS = NO
+EXTERNAL_GROUPS = NO
+#
+# Environment
+#
+PERL_PATH = /usr/bin/perl
+HAVE_DOT = YES
+DOT_PATH =
+MAX_DOT_GRAPH_WIDTH = 1024
+MAX_DOT_GRAPH_HEIGHT = 1024
+#
+# What kind of documentation is generated
+#
+#GENERATE_HTML = YES
+#HTML_OUTPUT = html
+#HTML_HEADER = includes/header.html
+#HTML_FOOTER = includes/footer.html
+#HTML_STYLESHEET = includes/stylesheet.css
+#HTML_ALIGN_MEMBERS = YES
+GENERATE_HTMLHELP = NO
+DISABLE_INDEX = NO
+GENERATE_LATEX = NO
+GENERATE_RTF = NO
+GENERATE_MAN = NO
+GENERATE_XML = NO
diff --git a/lib/kokkos/doc/Kokkos_PG.pdf b/lib/kokkos/doc/Kokkos_PG.pdf
new file mode 100755
index 000000000..3c415698c
Binary files /dev/null and b/lib/kokkos/doc/Kokkos_PG.pdf differ
diff --git a/lib/kokkos/doc/README b/lib/kokkos/doc/README
new file mode 100755
index 000000000..31e75f365
--- /dev/null
+++ b/lib/kokkos/doc/README
@@ -0,0 +1,32 @@
+Kokkos uses the Doxygen tool to provide three documentation
+sources:
+- man pages
+- LaTeX User Guide
+- HTML Online User Guide.
+
+Man Pages
+
+Man pages are available for all files and functions in the directory
+TRILINOS_HOME/doc/kokkos/man, where TRILINOS_HOME is the location of your
+copy of Trilinos. To use these pages with the Unix man utility, add
+the directory to your man path as follows:
+
+setenv MANPATH `echo $MANPATH`:TRILINOS_HOME/doc/kokkos/man
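+
+(An equivalent for Bourne-style shells would be something along the
+lines of: export MANPATH=$MANPATH:TRILINOS_HOME/doc/kokkos/man )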
+
+
+LaTeX User Guide
+
+A PostScript version of this guide is in
+TRILINOS_HOME/doc/kokkos/latex/user_guide.ps. The LaTeX source is in the
+directory TRILINOS_HOME/doc/kokkos/latex.
+
+HTML Online User Guide
+
+The online guide can be accessed by pointing your browser to
+TRILINOS_HOME/doc/kokkos/html/index.html
+
+Any questions, comments or suggestions are welcome. Please send them to
+Mike Heroux at
+
+320-845-7695
+maherou@sandia.gov
diff --git a/lib/kokkos/doc/build_docs b/lib/kokkos/doc/build_docs
new file mode 100755
index 000000000..da1d3e4f6
--- /dev/null
+++ b/lib/kokkos/doc/build_docs
@@ -0,0 +1,15 @@
+#!/bin/sh
+
+if [ $TRILINOS_HOME ]; then
+ echo "TRILINOS_HOME has already been set!"
+else
+ echo "TRILINOS_HOME has not been set. Setting it!"
+ export TRILINOS_HOME=`pwd`/../../..
+fi
+
+echo
+echo "Generating main Kokkos doxygen documentation ..."
+echo
+
+doxygen Doxyfile
+
diff --git a/lib/kokkos/doc/index.doc b/lib/kokkos/doc/index.doc
new file mode 100755
index 000000000..27a9e4f2e
--- /dev/null
+++ b/lib/kokkos/doc/index.doc
@@ -0,0 +1,72 @@
+/*!
+\mainpage Trilinos/Kokkos: Shared-memory programming interface and computational kernels
+
+\section Kokkos_Intro Introduction
+
+The %Kokkos package has two main components. The first, sometimes
+called "%Kokkos Array" or just "%Kokkos," implements a
+performance-portable shared-memory parallel programming model and data
+containers. The second, called "%Kokkos Classic," consists of
+computational kernels that support the %Tpetra package.
+
+\section Kokkos_Kokkos The %Kokkos programming model
+
+%Kokkos implements a performance-portable shared-memory parallel
+programming model and data containers. It lets you write an algorithm
+once, and just change a template parameter to get the optimal data
+layout for your hardware. %Kokkos has back-ends for the following
+parallel programming models:
+
+- Kokkos::Threads: POSIX Threads (Pthreads)
+- Kokkos::OpenMP: OpenMP
+- Kokkos::Cuda: NVIDIA's CUDA programming model for graphics
+ processing units (GPUs)
+- Kokkos::Serial: No thread parallelism
+
+%Kokkos also has optimizations for shared-memory parallel systems with
+nonuniform memory access (NUMA). Its containers can hold data of any
+primitive ("plain old") data type (and some aggregate types). %Kokkos
+Array may be used as a stand-alone programming model.
+
+%Kokkos' parallel operations include the following:
+
+- parallel_for: a thread-parallel "for loop"
+- parallel_reduce: a thread-parallel reduction
+- parallel_scan: a thread-parallel prefix scan operation
+
+as well as expert-level platform-independent interfaces to thread
+"teams," per-team "shared memory," synchronization, and atomic update
+operations.
+
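+As a rough orientation, a minimal sketch of these operations (not part
+of the formal interface description; the names "n", "a", and "sum" are
+placeholders, and a host-accessible default execution space is assumed)
+might look like:
+
+\code
+Kokkos::View<double*> a("a", n);               // array of n doubles
+Kokkos::parallel_for(n, [=](const int i) {     // thread-parallel "for loop"
+  a(i) = 1.0 * i;
+});
+double sum = 0.0;
+Kokkos::parallel_reduce(n, [=](const int i, double& lsum) {
+  lsum += a(i);                                // per-thread partial sum
+}, sum);                                       // combined result written to sum
+\endcode
+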
+%Kokkos' data containers include the following:
+
+- Kokkos::View: A multidimensional array suitable for thread-parallel
+ operations. Its layout (e.g., row-major or column-major) is
+ optimized by default for the particular thread-parallel device.
+- Kokkos::Vector: A drop-in replacement for std::vector that eases
+ porting from standard sequential C++ data structures to %Kokkos'
+ parallel data structures.
+- Kokkos::UnorderedMap: A parallel lookup table comparable in
+ functionality to std::unordered_map.
+
+%Kokkos also uses the above basic containers to implement higher-level
+data structures, like sparse graphs and matrices.
+
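+As an illustrative sketch only (the names "m", "nrows", and "ncols" are
+placeholders), a device-resident View is typically paired with a host
+mirror when data must be transferred:
+
+\code
+Kokkos::View<double**> m("m", nrows, ncols);   // 2-D array in device memory
+auto m_host = Kokkos::create_mirror_view(m);   // host View with matching layout
+// ... fill m_host on the host ...
+Kokkos::deep_copy(m, m_host);                  // copy the host data to the device
+\endcode
+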
+A good place to start learning about %Kokkos would be <a href="http://trilinos.sandia.gov/events/trilinos_user_group_2013/presentations/2013-11-TUG-Kokkos-Tutorial.pdf">these tutorial slides</a> from the 2013 Trilinos Users' Group meeting.
+
+\section Kokkos_Classic %Kokkos Classic
+
+"%Kokkos Classic" consists of computational kernels that support the
+%Tpetra package. These kernels include sparse matrix-vector multiply,
+sparse triangular solve, Gauss-Seidel, and dense vector operations.
+They are templated on the type of objects (\c Scalar) on which they
+operate. This component was not meant to be visible to users; it is
+an implementation detail of the %Tpetra distributed linear algebra
+package.
+
+%Kokkos Classic also implements a shared-memory parallel programming
+model. This inspired and preceded the %Kokkos programming model
+described in the previous section. Users should consider the %Kokkos
+Classic programming model deprecated, and prefer the new %Kokkos
+programming model.
+*/
diff --git a/lib/kokkos/generate_makefile.bash b/lib/kokkos/generate_makefile.bash
new file mode 100755
index 000000000..2e595dcc1
--- /dev/null
+++ b/lib/kokkos/generate_makefile.bash
@@ -0,0 +1,204 @@
+#!/bin/bash
+
+KOKKOS_DEVICES=""
+
+while [[ $# > 0 ]]
+do
+key="$1"
+
+case $key in
+ --kokkos-path*)
+ KOKKOS_PATH="${key#*=}"
+ ;;
+ --prefix*)
+ PREFIX="${key#*=}"
+ ;;
+ --with-cuda)
+ KOKKOS_DEVICES="${KOKKOS_DEVICES},Cuda"
+ CUDA_PATH_NVCC=`which nvcc`
+ CUDA_PATH=${CUDA_PATH_NVCC%/bin/nvcc}
+ ;;
+ --with-cuda*)
+ KOKKOS_DEVICES="${KOKKOS_DEVICES},Cuda"
+ CUDA_PATH="${key#*=}"
+ ;;
+ --with-openmp)
+ KOKKOS_DEVICES="${KOKKOS_DEVICES},OpenMP"
+ ;;
+ --with-pthread)
+ KOKKOS_DEVICES="${KOKKOS_DEVICES},Pthread"
+ ;;
+ --with-serial)
+ KOKKOS_DEVICES="${KOKKOS_DEVICES},Serial"
+ ;;
+ --with-devices*)
+ DEVICES="${key#*=}"
+ KOKKOS_DEVICES="${KOKKOS_DEVICES},${DEVICES}"
+ ;;
+ --with-gtest*)
+ GTEST_PATH="${key#*=}"
+ ;;
+ --with-hwloc*)
+ HWLOC_PATH="${key#*=}"
+ ;;
+ --arch*)
+ KOKKOS_ARCH="${key#*=}"
+ ;;
+ --cxxflags*)
+ CXXFLAGS="${key#*=}"
+ ;;
+ --ldflags*)
+ LDFLAGS="${key#*=}"
+ ;;
+ --debug|-dbg)
+ KOKKOS_DEBUG=yes
+ ;;
+ --compiler*)
+ COMPILER="${key#*=}"
+ ;;
+ --help)
+ echo "Kokkos configure options:"
+ echo "--kokkos-path=/Path/To/Kokkos: Path to the Kokkos root directory"
+ echo ""
+ echo "--with-cuda[=/Path/To/Cuda]: enable Cuda and set path to Cuda Toolkit"
+ echo "--with-openmp: enable OpenMP backend"
+ echo "--with-pthread: enable Pthreads backend"
+ echo "--with-serial: enable Serial backend"
+ echo "--with-devices: explicitly add a set of backends"
+ echo ""
+ echo "--arch=[OPTIONS]: set target architectures. Options are:"
+ echo " SNB = Intel Sandy/Ivy Bridge CPUs"
+ echo " HSW = Intel Haswell CPUs"
+ echo " KNC = Intel Knights Corner Xeon Phi"
+ echo " Kepler30 = NVIDIA Kepler generation CC 3.0"
+ echo " Kepler35 = NVIDIA Kepler generation CC 3.5"
+ echo " Kepler37 = NVIDIA Kepler generation CC 3.7"
+ echo " Maxwell50 = NVIDIA Maxwell generation CC 5.0"
+ echo " Power8 = IBM Power 8 CPUs"
+ echo ""
+ echo "--compiler=/Path/To/Compiler set the compiler"
+ echo "--debug,-dbg: enable Debugging"
+ echo "--cxxflags=[FLAGS] overwrite CXXFLAGS for library build and test build"
+ echo " This will still set certain required flags via"
+ echo " KOKKOS_CXXFLAGS (such as -fopenmp, --std=c++11, etc.)"
+ echo "--ldflags=[FLAGS] overwrite LDFLAGS for library build and test build"
+ echo " This will still set certain required flags via"
+ echo " KOKKOS_LDFLAGS (such as -fopenmp, -lpthread, etc.)"
+ echo "--with-gtest=/Path/To/Gtest: set path to gtest (used in unit and performance tests"
+ echo "--with-hwloc=/Path/To/Hwloc: set path to hwloc"
+ exit 0
+ ;;
+ *)
+ # unknown option
+ ;;
+esac
+shift
+done
+
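+# Example invocation (illustrative only; the option values shown here are
+# placeholders, not requirements):
+#   ./generate_makefile.bash --with-openmp --arch=SNB --prefix=/opt/kokkos
+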
+# If KOKKOS_PATH undefined, assume parent dir of this
+# script is the KOKKOS_PATH
+if [ -z "$KOKKOS_PATH" ]; then
+ KOKKOS_PATH=$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )
+else
+ # Ensure KOKKOS_PATH is abs path
+ KOKKOS_PATH=$( cd $KOKKOS_PATH && pwd )
+fi
+
+KOKKOS_OPTIONS="KOKKOS_PATH=${KOKKOS_PATH}"
+
+if [ ${#COMPILER} -gt 0 ]; then
+KOKKOS_OPTIONS="${KOKKOS_OPTIONS} CXX=${COMPILER}"
+fi
+if [ ${#PREFIX} -gt 0 ]; then
+KOKKOS_OPTIONS="${KOKKOS_OPTIONS} PREFIX=${PREFIX}"
+fi
+if [ ${#KOKKOS_DEVICES} -gt 0 ]; then
+KOKKOS_OPTIONS="${KOKKOS_OPTIONS} KOKKOS_DEVICES=${KOKKOS_DEVICES}"
+fi
+if [ ${#KOKKOS_ARCH} -gt 0 ]; then
+KOKKOS_OPTIONS="${KOKKOS_OPTIONS} KOKKOS_ARCH=${KOKKOS_ARCH}"
+fi
+if [ ${#KOKKOS_DEBUG} -gt 0 ]; then
+KOKKOS_OPTIONS="${KOKKOS_OPTIONS} KOKKOS_DEBUG=${KOKKOS_DEBUG}"
+fi
+if [ ${#CUDA_PATH} -gt 0 ]; then
+KOKKOS_OPTIONS="${KOKKOS_OPTIONS} CUDA_PATH=${CUDA_PATH}"
+fi
+if [ ${#CXXFLAGS} -gt 0 ]; then
+KOKKOS_OPTIONS="${KOKKOS_OPTIONS} CXXFLAGS=\"${CXXFLAGS}\""
+fi
+if [ ${#LDFLAGS} -gt 0 ]; then
+KOKKOS_OPTIONS="${KOKKOS_OPTIONS} LDFLAGS=\"${LDFLAGS}\""
+fi
+if [ ${#GTEST_PATH} -gt 0 ]; then
+KOKKOS_OPTIONS="${KOKKOS_OPTIONS} GTEST_PATH=${GTEST_PATH}"
+else
+GTEST_PATH=${KOKKOS_PATH}/tpls/gtest
+KOKKOS_OPTIONS="${KOKKOS_OPTIONS} GTEST_PATH=${GTEST_PATH}"
+fi
+if [ ${#HWLOC_PATH} -gt 0 ]; then
+KOKKOS_OPTIONS="${KOKKOS_OPTIONS} HWLOC_PATH=${HWLOC_PATH} KOKKOS_USE_TPLS=hwloc"
+fi
+mkdir core
+mkdir core/unit_test
+mkdir core/perf_test
+mkdir containers
+mkdir containers/unit_tests
+mkdir containers/performance_tests
+mkdir algorithms
+mkdir algorithms/unit_tests
+mkdir algorithms/performance_tests
+mkdir example
+mkdir example/fixture
+mkdir example/feint
+mkdir example/fenl
+
+
+echo "Generating Makefile with options " ${KOKKOS_OPTIONS}
+echo "KOKKOS_OPTIONS=${KOKKOS_OPTIONS}" > Makefile
+echo "" >> Makefile
+echo "lib:" >> Makefile
+echo -e "\tcd core; \\" >> Makefile
+echo -e "\tmake -j -j -f ${KOKKOS_PATH}/core/src/Makefile ${KOKKOS_OPTIONS}" >> Makefile
+echo "" >> Makefile
+echo "install: lib" >> Makefile
+echo -e "\tcd core; \\" >> Makefile
+echo -e "\tmake -j -f ${KOKKOS_PATH}/core/src/Makefile ${KOKKOS_OPTIONS} install" >> Makefile
+echo "" >> Makefile
+echo "build-test:" >> Makefile
+echo -e "\tcd core/unit_test; \\" >> Makefile
+echo -e "\tmake -j -f ${KOKKOS_PATH}/core/unit_test/Makefile ${KOKKOS_OPTIONS}" >> Makefile
+echo -e "\tcd core/perf_test; \\" >> Makefile
+echo -e "\tmake -j -f ${KOKKOS_PATH}/core/perf_test/Makefile ${KOKKOS_OPTIONS}" >> Makefile
+echo -e "\tcd containers/unit_tests; \\" >> Makefile
+echo -e "\tmake -j -f ${KOKKOS_PATH}/containers/unit_tests/Makefile ${KOKKOS_OPTIONS}" >> Makefile
+echo -e "\tcd containers/performance_tests; \\" >> Makefile
+echo -e "\tmake -j -f ${KOKKOS_PATH}/containers/performance_tests/Makefile ${KOKKOS_OPTIONS}" >> Makefile
+echo -e "\tcd algorithms/unit_tests; \\" >> Makefile
+echo -e "\tmake -j -f ${KOKKOS_PATH}/algorithms/unit_tests/Makefile ${KOKKOS_OPTIONS}" >> Makefile
+echo -e "\tcd example/fixture; \\" >> Makefile
+echo -e "\tmake -f ${KOKKOS_PATH}/example/fixture/Makefile ${KOKKOS_OPTIONS}" >> Makefile
+echo -e "\tcd example/feint; \\" >> Makefile
+echo -e "\tmake -f ${KOKKOS_PATH}/example/feint/Makefile ${KOKKOS_OPTIONS}" >> Makefile
+echo -e "\tcd example/fenl; \\" >> Makefile
+echo -e "\tmake -f ${KOKKOS_PATH}/example/fenl/Makefile ${KOKKOS_OPTIONS}" >> Makefile
+echo "" >> Makefile
+echo "test: build-test" >> Makefile
+echo -e "\tcd core/unit_test; \\" >> Makefile
+echo -e "\tmake -f ${KOKKOS_PATH}/core/unit_test/Makefile ${KOKKOS_OPTIONS} test" >> Makefile
+echo -e "\tcd core/perf_test; \\" >> Makefile
+echo -e "\tmake -f ${KOKKOS_PATH}/core/perf_test/Makefile ${KOKKOS_OPTIONS} test" >> Makefile
+echo -e "\tcd containers/unit_tests; \\" >> Makefile
+echo -e "\tmake -f ${KOKKOS_PATH}/containers/unit_tests/Makefile ${KOKKOS_OPTIONS} test" >> Makefile
+echo -e "\tcd containers/performance_tests; \\" >> Makefile
+echo -e "\tmake -f ${KOKKOS_PATH}/containers/performance_tests/Makefile ${KOKKOS_OPTIONS} test" >> Makefile
+echo -e "\tcd algorithms/unit_tests; \\" >> Makefile
+echo -e "\tmake -f ${KOKKOS_PATH}/algorithms/unit_tests/Makefile ${KOKKOS_OPTIONS} test" >> Makefile
+echo -e "\tcd example/fixture; \\" >> Makefile
+echo -e "\tmake -f ${KOKKOS_PATH}/example/fixture/Makefile ${KOKKOS_OPTIONS} test" >> Makefile
+echo -e "\tcd example/feint; \\" >> Makefile
+echo -e "\tmake -f ${KOKKOS_PATH}/example/feint/Makefile ${KOKKOS_OPTIONS} test" >> Makefile
+echo -e "\tcd example/fenl; \\" >> Makefile
+echo -e "\tmake -f ${KOKKOS_PATH}/example/fenl/Makefile ${KOKKOS_OPTIONS} test" >> Makefile
+
+
diff --git a/src/KIM/pair_kim_version.h b/src/KIM/pair_kim_version.h
new file mode 100644
index 000000000..02326645f
--- /dev/null
+++ b/src/KIM/pair_kim_version.h
@@ -0,0 +1,64 @@
+/* -*- c++ -*- ----------------------------------------------------------
+ LAMMPS - Large-scale Atomic/Molecular Massively Parallel Simulator
+ http://lammps.sandia.gov, Sandia National Laboratories
+ Steve Plimpton, sjplimp@sandia.gov
+
+ Copyright (2003) Sandia Corporation. Under the terms of Contract
+ DE-AC04-94AL85000 with Sandia Corporation, the U.S. Government retains
+ certain rights in this software. This software is distributed under
+ the GNU General Public License.
+
+ See the README file in the top-level LAMMPS directory.
+------------------------------------------------------------------------- */
+
+/* ----------------------------------------------------------------------
+ Contributing authors: Ryan S. Elliott,
+------------------------------------------------------------------------- */
+
+#ifndef LMP_PAIR_KIM_VERSION_H
+#define LMP_PAIR_KIM_VERSION_H
+
+//
+// Release: This file is part of the pair-kim-v1.7.2 package.
+//
+
+//
+// This file defines the version information for the pair-kim package.
+// The values specified here must conform to the Semantic Versioning
+// 2.0.0 specification.
+//
+// Generally the version numbering for the pair-kim package will
+// parallel the numbering for the kim-api package. However, if
+// additional versioning increments are required for the pair-kim
+// package, the build-metadata field will be used to provide a
+// "sub-patch" version number.
+//
+// The PATCH value should be incremented IMMEDIATELY after an official
+// release.
+//
+// The MINOR value should be incremented AND the PATCH value reset to
+// zero as soon as it becomes clear that the next official release
+// MUST increment the MINOR version value.
+//
+// The MAJOR value should be incremented AND the MINOR and PATCH
+// values reset to zero as soon as it becomes clear that the next
+// official release MUST increment the MAJOR version value.
+//
+// The PRERELEASE value can be set to any value allowed by the
+// Semantic Versioning specification. However, it will generally be
+// empty. This value should be quoted as a string constant.
+//
+// The BUILD_METADATA value can be set to any value allowed by the
+// Semantic Versioning specification. However, it will generally be
+// empty, except when "sub-patch" versioning of the pair-kim
+// package is necessary. This value should be quoted as a string
+// constant.
+//
+
+#define PAIR_KIM_VERSION_MAJOR 1
+#define PAIR_KIM_VERSION_MINOR 7
+#define PAIR_KIM_VERSION_PATCH 2
+//#define PAIR_KIM_VERSION_PRERELEASE
+//#define PAIR_KIM_VERSION_BUILD_METADATA
+
+#endif /* LMP_PAIR_KIM_VERSION_H */
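
The comments above state the Semantic Versioning policy abstractly; the following is a minimal editor's sketch (not part of the patch; the PKV_STR stringification macros and the main() driver are hypothetical) of how the three macros combine into the full "MAJOR.MINOR.PATCH" string that the policy refers to:

  // Editor's sketch: print the full pair-kim version string.
  #include <cstdio>
  #include "pair_kim_version.h"   // the PAIR_KIM_VERSION_* macros defined above

  #define PKV_STR_(x) #x
  #define PKV_STR(x)  PKV_STR_(x)

  int main() {
    // Adjacent string literals concatenate, giving "1.7.2" for the values above.
    std::printf("pair-kim version %s\n",
                PKV_STR(PAIR_KIM_VERSION_MAJOR) "."
                PKV_STR(PAIR_KIM_VERSION_MINOR) "."
                PKV_STR(PAIR_KIM_VERSION_PATCH));
    return 0;
  }
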
diff --git a/src/KOKKOS/Install.sh b/src/KOKKOS/Install.sh
index 0eea611d2..811164ff3 100644
--- a/src/KOKKOS/Install.sh
+++ b/src/KOKKOS/Install.sh
@@ -1,184 +1,186 @@
# Install/unInstall package files in LAMMPS
# mode = 0/1/2 for uninstall/install/update
mode=$1
# arg1 = file, arg2 = file it depends on
action () {
if (test $mode = 0) then
rm -f ../$1
elif (! cmp -s $1 ../$1) then
if (test -z "$2" || test -e ../$2) then
cp $1 ..
if (test $mode = 2) then
echo " updating src/$1"
fi
fi
elif (test -n "$2") then
if (test ! -e ../$2) then
rm -f ../$1
fi
fi
}
# force rebuild of files with LMP_KOKKOS switch
touch ../accelerator_kokkos.h
touch ../memory.h
# list of files with optional dependencies
action angle_charmm_kokkos.cpp angle_charmm.cpp
action angle_charmm_kokkos.h angle_charmm.h
action angle_harmonic_kokkos.cpp angle_harmonic.cpp
action angle_harmonic_kokkos.h angle_harmonic.h
action atom_kokkos.cpp
action atom_kokkos.h
action atom_vec_angle_kokkos.cpp atom_vec_angle.cpp
action atom_vec_angle_kokkos.h atom_vec_angle.h
action atom_vec_atomic_kokkos.cpp
action atom_vec_atomic_kokkos.h
action atom_vec_bond_kokkos.cpp atom_vec_bond.cpp
action atom_vec_bond_kokkos.h atom_vec_bond.h
action atom_vec_charge_kokkos.cpp
action atom_vec_charge_kokkos.h
action atom_vec_full_kokkos.cpp atom_vec_full.cpp
action atom_vec_full_kokkos.h atom_vec_full.h
action atom_vec_kokkos.cpp
action atom_vec_kokkos.h
action atom_vec_molecular_kokkos.cpp atom_vec_molecular.cpp
action atom_vec_molecular_kokkos.h atom_vec_molecular.h
action bond_fene_kokkos.cpp bond_fene.cpp
action bond_fene_kokkos.h bond_fene.h
action bond_harmonic_kokkos.cpp bond_harmonic.cpp
action bond_harmonic_kokkos.h bond_harmonic.h
action comm_kokkos.cpp
action comm_kokkos.h
action dihedral_charmm_kokkos.cpp dihedral_charmm.cpp
action dihedral_charmm_kokkos.h dihedral_charmm.h
action dihedral_opls_kokkos.cpp dihedral_opls.cpp
action dihedral_opls_kokkos.h dihedral_opls.h
action domain_kokkos.cpp
action domain_kokkos.h
action fix_langevin_kokkos.cpp
action fix_langevin_kokkos.h
action fix_nve_kokkos.cpp
action fix_nve_kokkos.h
action improper_harmonic_kokkos.cpp improper_harmonic.cpp
action improper_harmonic_kokkos.h improper_harmonic.h
action kokkos.cpp
action kokkos.h
action kokkos_type.h
action memory_kokkos.h
action modify_kokkos.cpp
action modify_kokkos.h
action neigh_bond_kokkos.cpp
action neigh_bond_kokkos.h
action neigh_full_kokkos.h
action neigh_list_kokkos.cpp
action neigh_list_kokkos.h
action neighbor_kokkos.cpp
action neighbor_kokkos.h
action pair_buck_coul_cut_kokkos.cpp
action pair_buck_coul_cut_kokkos.h
action pair_buck_coul_long_kokkos.cpp pair_buck_coul_long.cpp
action pair_buck_coul_long_kokkos.h pair_buck_coul_long.h
action pair_buck_kokkos.cpp
action pair_buck_kokkos.h
action pair_coul_cut_kokkos.cpp
action pair_coul_cut_kokkos.h
action pair_coul_debye_kokkos.cpp
action pair_coul_debye_kokkos.h
action pair_coul_dsf_kokkos.cpp
action pair_coul_dsf_kokkos.h
action pair_coul_long_kokkos.cpp pair_coul_long.cpp
action pair_coul_long_kokkos.h pair_coul_long.h
action pair_coul_wolf_kokkos.cpp
action pair_coul_wolf_kokkos.h
action pair_eam_kokkos.cpp pair_eam.cpp
action pair_eam_kokkos.h pair_eam.h
action pair_eam_alloy_kokkos.cpp pair_eam_alloy.cpp
action pair_eam_alloy_kokkos.h pair_eam_alloy.h
action pair_eam_fs_kokkos.cpp pair_eam_fs.cpp
action pair_eam_fs_kokkos.h pair_eam_fs.h
action pair_kokkos.h
action pair_lj_charmm_coul_charmm_implicit_kokkos.cpp pair_lj_charmm_coul_charmm_implicit.cpp
action pair_lj_charmm_coul_charmm_implicit_kokkos.h pair_lj_charmm_coul_charmm_implicit.h
action pair_lj_charmm_coul_charmm_kokkos.cpp pair_lj_charmm_coul_charmm.cpp
action pair_lj_charmm_coul_charmm_kokkos.h pair_lj_charmm_coul_charmm.h
action pair_lj_charmm_coul_long_kokkos.cpp pair_lj_charmm_coul_long.cpp
action pair_lj_charmm_coul_long_kokkos.h pair_lj_charmm_coul_long.h
action pair_lj_class2_coul_cut_kokkos.cpp pair_lj_class2_coul_cut.cpp
action pair_lj_class2_coul_cut_kokkos.h pair_lj_class2_coul_cut.h
action pair_lj_class2_coul_long_kokkos.cpp pair_lj_class2_coul_long.cpp
action pair_lj_class2_coul_long_kokkos.h pair_lj_class2_coul_long.h
action pair_lj_class2_kokkos.cpp pair_lj_class2.cpp
action pair_lj_class2_kokkos.h pair_lj_class2.h
action pair_lj_cut_coul_cut_kokkos.cpp
action pair_lj_cut_coul_cut_kokkos.h
action pair_lj_cut_coul_debye_kokkos.cpp
action pair_lj_cut_coul_debye_kokkos.h
action pair_lj_cut_coul_dsf_kokkos.cpp
action pair_lj_cut_coul_dsf_kokkos.h
action pair_lj_cut_coul_long_kokkos.cpp pair_lj_cut_coul_long.cpp
action pair_lj_cut_coul_long_kokkos.h pair_lj_cut_coul_long.h
action pair_lj_cut_kokkos.cpp
action pair_lj_cut_kokkos.h
action pair_lj_expand_kokkos.cpp
action pair_lj_expand_kokkos.h
action pair_lj_gromacs_coul_gromacs_kokkos.cpp
action pair_lj_gromacs_coul_gromacs_kokkos.h
action pair_lj_gromacs_kokkos.cpp
action pair_lj_gromacs_kokkos.h
action pair_lj_sdk_kokkos.cpp pair_lj_sdk.cpp
action pair_lj_sdk_kokkos.h pair_lj_sdk.h
action pair_sw_kokkos.cpp pair_sw.cpp
action pair_sw_kokkos.h pair_sw.h
action pair_table_kokkos.cpp
action pair_table_kokkos.h
action pair_tersoff_kokkos.cpp pair_tersoff.cpp
action pair_tersoff_kokkos.h pair_tersoff.h
action pair_tersoff_mod_kokkos.cpp pair_tersoff_mod.cpp
action pair_tersoff_mod_kokkos.h pair_tersoff_mod.h
action pair_tersoff_zbl_kokkos.cpp pair_tersoff_zbl.cpp
action pair_tersoff_zbl_kokkos.h pair_tersoff_zbl.h
action verlet_kokkos.cpp
action verlet_kokkos.h
# edit 2 Makefile.package files to include/exclude package info
if (test $1 = 1) then
if (test -e ../Makefile.package) then
sed -i -e 's/[^ \t]*kokkos[^ \t]* //g' ../Makefile.package
sed -i -e 's/[^ \t]*KOKKOS[^ \t]* //g' ../Makefile.package
sed -i -e 's|^PKG_INC =[ \t]*|&-DLMP_KOKKOS |' ../Makefile.package
# sed -i -e 's|^PKG_PATH =[ \t]*|&-L..\/..\/lib\/kokkos\/core\/src |' ../Makefile.package
- sed -i -e 's|^PKG_LIB =[ \t]*|&-lkokkoscore |' ../Makefile.package
- sed -i -e 's|^PKG_SYSINC =[ \t]*|&$(KOKKOS_INC) |' ../Makefile.package
- sed -i -e 's|^PKG_SYSLIB =[ \t]*|&$(KOKKOS_LINK) |' ../Makefile.package
+ sed -i -e 's|^PKG_CPP_DEPENDS =[ \t]*|&$(KOKKOS_CPP_DEPENDS) |' ../Makefile.package
+ sed -i -e 's|^PKG_LIB =[ \t]*|&$(KOKKOS_LIBS) |' ../Makefile.package
+ sed -i -e 's|^PKG_LINK_DEPENDS =[ \t]*|&$(KOKKOS_LINK_DEPENDS) |' ../Makefile.package
+ sed -i -e 's|^PKG_SYSINC =[ \t]*|&$(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) |' ../Makefile.package
+ sed -i -e 's|^PKG_SYSLIB =[ \t]*|&$(KOKKOS_LDFLAGS) |' ../Makefile.package
# sed -i -e 's|^PKG_SYSPATH =[ \t]*|&$(kokkos_SYSPATH) |' ../Makefile.package
fi
if (test -e ../Makefile.package.settings) then
+ sed -i -e '/CXX\ =\ \$(CC)/d' ../Makefile.package.settings
sed -i -e '/^include.*kokkos.*$/d' ../Makefile.package.settings
# multiline form needed for BSD sed on Macs
- sed -i -e '4 i \
-include ..\/..\/lib\/kokkos\/Makefile.lammps
-' ../Makefile.package.settings
-
+ sed -i -e '4 i \CXX = $(CC)' ../Makefile.package.settings
+ sed -i -e '5 i \include ..\/..\/lib\/kokkos\/Makefile.kokkos' ../Makefile.package.settings
fi
elif (test $1 = 0) then
if (test -e ../Makefile.package) then
sed -i -e 's/[^ \t]*kokkos[^ \t]* //g' ../Makefile.package
sed -i -e 's/[^ \t]*KOKKOS[^ \t]* //g' ../Makefile.package
fi
if (test -e ../Makefile.package.settings) then
+ sed -i -e '/CXX\ =\ \$(CC)/d' ../Makefile.package.settings
sed -i -e '/^include.*kokkos.*$/d' ../Makefile.package.settings
fi
fi
diff --git a/src/KOKKOS/kokkos_type.h b/src/KOKKOS/kokkos_type.h
index 123bbd1a8..1f9087c3c 100644
--- a/src/KOKKOS/kokkos_type.h
+++ b/src/KOKKOS/kokkos_type.h
@@ -1,683 +1,748 @@
/* -*- c++ -*- ----------------------------------------------------------
LAMMPS - Large-scale Atomic/Molecular Massively Parallel Simulator
http://lammps.sandia.gov, Sandia National Laboratories
Steve Plimpton, sjplimp@sandia.gov
Copyright (2003) Sandia Corporation. Under the terms of Contract
DE-AC04-94AL85000 with Sandia Corporation, the U.S. Government retains
certain rights in this software. This software is distributed under
the GNU General Public License.
See the README file in the top-level LAMMPS directory.
------------------------------------------------------------------------- */
#ifndef LMP_LMPTYPE_KOKKOS_H
#define LMP_LMPTYPE_KOKKOS_H
#include <Kokkos_Core.hpp>
#include <Kokkos_DualView.hpp>
#include <impl/Kokkos_Timer.hpp>
#include <Kokkos_Vectorization.hpp>
#define MAX_TYPES_STACKPARAMS 12
#define NeighClusterSize 8
+ struct lmp_float3 {
+ float x,y,z;
+ KOKKOS_INLINE_FUNCTION
+ lmp_float3():x(0.0f),y(0.0f),z(0.0f) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator += (const lmp_float3& tmp) {
+ x+=tmp.x;
+ y+=tmp.y;
+ z+=tmp.z;
+ }
+ KOKKOS_INLINE_FUNCTION
+ void operator += (const lmp_float3& tmp) volatile {
+ x+=tmp.x;
+ y+=tmp.y;
+ z+=tmp.z;
+ }
+ KOKKOS_INLINE_FUNCTION
+ void operator = (const lmp_float3& tmp) {
+ x=tmp.x;
+ y=tmp.y;
+ z=tmp.z;
+ }
+ KOKKOS_INLINE_FUNCTION
+ void operator = (const lmp_float3& tmp) volatile {
+ x=tmp.x;
+ y=tmp.y;
+ z=tmp.z;
+ }
+ };
+
+ struct lmp_double3 {
+ double x,y,z;
+ KOKKOS_INLINE_FUNCTION
+ lmp_double3():x(0.0),y(0.0),z(0.0) {}
+
+ KOKKOS_INLINE_FUNCTION
+ void operator += (const lmp_double3& tmp) {
+ x+=tmp.x;
+ y+=tmp.y;
+ z+=tmp.z;
+ }
+ KOKKOS_INLINE_FUNCTION
+ void operator += (const lmp_double3& tmp) volatile {
+ x+=tmp.x;
+ y+=tmp.y;
+ z+=tmp.z;
+ }
+ KOKKOS_INLINE_FUNCTION
+ void operator = (const lmp_double3& tmp) {
+ x=tmp.x;
+ y=tmp.y;
+ z=tmp.z;
+ }
+ KOKKOS_INLINE_FUNCTION
+ void operator = (const lmp_double3& tmp) volatile {
+ x=tmp.x;
+ y=tmp.y;
+ z=tmp.z;
+ }
+ };
+
#if !defined(__CUDACC__) && !defined(__VECTOR_TYPES_H__)
struct double2 {
double x, y;
};
struct float2 {
float x, y;
};
- struct double4 {
- double x, y, z, w;
- };
struct float4 {
float x, y, z, w;
};
+ struct double4 {
+ double x, y, z, w;
+ };
#endif
-
// set LMPHostype and LMPDeviceType from Kokkos Default Types
typedef Kokkos::DefaultExecutionSpace LMPDeviceType;
typedef Kokkos::HostSpace::execution_space LMPHostType;
// set ExecutionSpace stuct with variable "space"
template<class Device>
struct ExecutionSpaceFromDevice;
template<>
struct ExecutionSpaceFromDevice<LMPHostType> {
static const LAMMPS_NS::ExecutionSpace space = LAMMPS_NS::Host;
};
#ifdef KOKKOS_HAVE_CUDA
template<>
struct ExecutionSpaceFromDevice<Kokkos::Cuda> {
static const LAMMPS_NS::ExecutionSpace space = LAMMPS_NS::Device;
};
#endif
// define precision
// handle global precision, force, energy, positions, kspace separately
#ifndef PRECISION
#define PRECISION 2
#endif
#if PRECISION==1
typedef float LMP_FLOAT;
typedef float2 LMP_FLOAT2;
+typedef lmp_float3 LMP_FLOAT3;
typedef float4 LMP_FLOAT4;
#else
typedef double LMP_FLOAT;
typedef double2 LMP_FLOAT2;
+typedef lmp_double3 LMP_FLOAT3;
typedef double4 LMP_FLOAT4;
#endif
#ifndef PREC_FORCE
#define PREC_FORCE PRECISION
#endif
#if PREC_FORCE==1
typedef float F_FLOAT;
typedef float2 F_FLOAT2;
+typedef lmp_float3 F_FLOAT3;
typedef float4 F_FLOAT4;
#else
typedef double F_FLOAT;
typedef double2 F_FLOAT2;
+typedef lmp_double3 F_FLOAT3;
typedef double4 F_FLOAT4;
#endif
#ifndef PREC_ENERGY
#define PREC_ENERGY PRECISION
#endif
#if PREC_ENERGY==1
typedef float E_FLOAT;
typedef float2 E_FLOAT2;
typedef float4 E_FLOAT4;
#else
typedef double E_FLOAT;
typedef double2 E_FLOAT2;
typedef double4 E_FLOAT4;
#endif
struct s_EV_FLOAT {
E_FLOAT evdwl;
E_FLOAT ecoul;
E_FLOAT v[6];
KOKKOS_INLINE_FUNCTION
s_EV_FLOAT() {
evdwl = 0;
ecoul = 0;
v[0] = 0; v[1] = 0; v[2] = 0;
v[3] = 0; v[4] = 0; v[5] = 0;
}
KOKKOS_INLINE_FUNCTION
void operator+=(const s_EV_FLOAT &rhs) {
evdwl += rhs.evdwl;
ecoul += rhs.ecoul;
v[0] += rhs.v[0];
v[1] += rhs.v[1];
v[2] += rhs.v[2];
v[3] += rhs.v[3];
v[4] += rhs.v[4];
v[5] += rhs.v[5];
}
KOKKOS_INLINE_FUNCTION
void operator+=(const volatile s_EV_FLOAT &rhs) volatile {
evdwl += rhs.evdwl;
ecoul += rhs.ecoul;
v[0] += rhs.v[0];
v[1] += rhs.v[1];
v[2] += rhs.v[2];
v[3] += rhs.v[3];
v[4] += rhs.v[4];
v[5] += rhs.v[5];
}
};
typedef struct s_EV_FLOAT EV_FLOAT;
#ifndef PREC_POS
#define PREC_POS PRECISION
#endif
#if PREC_POS==1
typedef float X_FLOAT;
typedef float2 X_FLOAT2;
typedef float4 X_FLOAT4;
#else
typedef double X_FLOAT;
typedef double2 X_FLOAT2;
typedef double4 X_FLOAT4;
#endif
#ifndef PREC_VELOCITIES
#define PREC_VELOCITIES PRECISION
#endif
#if PREC_VELOCITIES==1
typedef float V_FLOAT;
typedef float2 V_FLOAT2;
typedef float4 V_FLOAT4;
#else
typedef double V_FLOAT;
typedef double2 V_FLOAT2;
typedef double4 V_FLOAT4;
#endif
#if PREC_KSPACE==1
typedef float K_FLOAT;
typedef float2 K_FLOAT2;
typedef float4 K_FLOAT4;
#else
typedef double K_FLOAT;
typedef double2 K_FLOAT2;
typedef double4 K_FLOAT4;
#endif
// ------------------------------------------------------------------------
// LAMMPS types
template <class DeviceType>
struct ArrayTypes;
template <>
struct ArrayTypes<LMPDeviceType> {
// scalar types
typedef Kokkos::
DualView<int, LMPDeviceType::array_layout, LMPDeviceType> tdual_int_scalar;
typedef tdual_int_scalar::t_dev t_int_scalar;
typedef tdual_int_scalar::t_dev_const t_int_scalar_const;
typedef tdual_int_scalar::t_dev_um t_int_scalar_um;
typedef tdual_int_scalar::t_dev_const_um t_int_scalar_const_um;
typedef Kokkos::
DualView<LMP_FLOAT, LMPDeviceType::array_layout, LMPDeviceType>
tdual_float_scalar;
typedef tdual_float_scalar::t_dev t_float_scalar;
typedef tdual_float_scalar::t_dev_const t_float_scalar_const;
typedef tdual_float_scalar::t_dev_um t_float_scalar_um;
typedef tdual_float_scalar::t_dev_const_um t_float_scalar_const_um;
// generic array types
typedef Kokkos::
DualView<int*, LMPDeviceType::array_layout, LMPDeviceType> tdual_int_1d;
typedef tdual_int_1d::t_dev t_int_1d;
typedef tdual_int_1d::t_dev_const t_int_1d_const;
typedef tdual_int_1d::t_dev_um t_int_1d_um;
typedef tdual_int_1d::t_dev_const_um t_int_1d_const_um;
typedef tdual_int_1d::t_dev_const_randomread t_int_1d_randomread;
typedef Kokkos::
DualView<int*[3], Kokkos::LayoutRight, LMPDeviceType> tdual_int_1d_3;
typedef tdual_int_1d_3::t_dev t_int_1d_3;
typedef tdual_int_1d_3::t_dev_const t_int_1d_3_const;
typedef tdual_int_1d_3::t_dev_um t_int_1d_3_um;
typedef tdual_int_1d_3::t_dev_const_um t_int_1d_3_const_um;
typedef tdual_int_1d_3::t_dev_const_randomread t_int_1d_3_randomread;
typedef Kokkos::
DualView<int**, Kokkos::LayoutRight, LMPDeviceType> tdual_int_2d;
typedef tdual_int_2d::t_dev t_int_2d;
typedef tdual_int_2d::t_dev_const t_int_2d_const;
typedef tdual_int_2d::t_dev_um t_int_2d_um;
typedef tdual_int_2d::t_dev_const_um t_int_2d_const_um;
typedef tdual_int_2d::t_dev_const_randomread t_int_2d_randomread;
typedef Kokkos::
DualView<LAMMPS_NS::tagint*, LMPDeviceType::array_layout, LMPDeviceType>
tdual_tagint_1d;
typedef tdual_tagint_1d::t_dev t_tagint_1d;
typedef tdual_tagint_1d::t_dev_const t_tagint_1d_const;
typedef tdual_tagint_1d::t_dev_um t_tagint_1d_um;
typedef tdual_tagint_1d::t_dev_const_um t_tagint_1d_const_um;
typedef tdual_tagint_1d::t_dev_const_randomread t_tagint_1d_randomread;
typedef Kokkos::
DualView<LAMMPS_NS::tagint**, Kokkos::LayoutRight, LMPDeviceType>
tdual_tagint_2d;
typedef tdual_tagint_2d::t_dev t_tagint_2d;
typedef tdual_tagint_2d::t_dev_const t_tagint_2d_const;
typedef tdual_tagint_2d::t_dev_um t_tagint_2d_um;
typedef tdual_tagint_2d::t_dev_const_um t_tagint_2d_const_um;
typedef tdual_tagint_2d::t_dev_const_randomread t_tagint_2d_randomread;
typedef Kokkos::
DualView<LAMMPS_NS::imageint*, LMPDeviceType::array_layout, LMPDeviceType>
tdual_imageint_1d;
typedef tdual_imageint_1d::t_dev t_imageint_1d;
typedef tdual_imageint_1d::t_dev_const t_imageint_1d_const;
typedef tdual_imageint_1d::t_dev_um t_imageint_1d_um;
typedef tdual_imageint_1d::t_dev_const_um t_imageint_1d_const_um;
typedef tdual_imageint_1d::t_dev_const_randomread t_imageint_1d_randomread;
typedef Kokkos::
DualView<double*, Kokkos::LayoutRight, LMPDeviceType> tdual_double_1d;
typedef tdual_double_1d::t_dev t_double_1d;
typedef tdual_double_1d::t_dev_const t_double_1d_const;
typedef tdual_double_1d::t_dev_um t_double_1d_um;
typedef tdual_double_1d::t_dev_const_um t_double_1d_const_um;
typedef tdual_double_1d::t_dev_const_randomread t_double_1d_randomread;
typedef Kokkos::
DualView<double**, Kokkos::LayoutRight, LMPDeviceType> tdual_double_2d;
typedef tdual_double_2d::t_dev t_double_2d;
typedef tdual_double_2d::t_dev_const t_double_2d_const;
typedef tdual_double_2d::t_dev_um t_double_2d_um;
typedef tdual_double_2d::t_dev_const_um t_double_2d_const_um;
typedef tdual_double_2d::t_dev_const_randomread t_double_2d_randomread;
// 1d float array n
typedef Kokkos::DualView<LMP_FLOAT*, LMPDeviceType::array_layout, LMPDeviceType> tdual_float_1d;
typedef tdual_float_1d::t_dev t_float_1d;
typedef tdual_float_1d::t_dev_const t_float_1d_const;
typedef tdual_float_1d::t_dev_um t_float_1d_um;
typedef tdual_float_1d::t_dev_const_um t_float_1d_const_um;
typedef tdual_float_1d::t_dev_const_randomread t_float_1d_randomread;
//2d float array n
typedef Kokkos::DualView<LMP_FLOAT**, Kokkos::LayoutRight, LMPDeviceType> tdual_float_2d;
typedef tdual_float_2d::t_dev t_float_2d;
typedef tdual_float_2d::t_dev_const t_float_2d_const;
typedef tdual_float_2d::t_dev_um t_float_2d_um;
typedef tdual_float_2d::t_dev_const_um t_float_2d_const_um;
typedef tdual_float_2d::t_dev_const_randomread t_float_2d_randomread;
//Position Types
//1d X_FLOAT array n
typedef Kokkos::DualView<X_FLOAT*, LMPDeviceType::array_layout, LMPDeviceType> tdual_xfloat_1d;
typedef tdual_xfloat_1d::t_dev t_xfloat_1d;
typedef tdual_xfloat_1d::t_dev_const t_xfloat_1d_const;
typedef tdual_xfloat_1d::t_dev_um t_xfloat_1d_um;
typedef tdual_xfloat_1d::t_dev_const_um t_xfloat_1d_const_um;
typedef tdual_xfloat_1d::t_dev_const_randomread t_xfloat_1d_randomread;
//2d X_FLOAT array n*m
typedef Kokkos::DualView<X_FLOAT**, Kokkos::LayoutRight, LMPDeviceType> tdual_xfloat_2d;
typedef tdual_xfloat_2d::t_dev t_xfloat_2d;
typedef tdual_xfloat_2d::t_dev_const t_xfloat_2d_const;
typedef tdual_xfloat_2d::t_dev_um t_xfloat_2d_um;
typedef tdual_xfloat_2d::t_dev_const_um t_xfloat_2d_const_um;
typedef tdual_xfloat_2d::t_dev_const_randomread t_xfloat_2d_randomread;
//2d X_FLOAT array n*4
#ifdef LMP_KOKKOS_NO_LEGACY
typedef Kokkos::DualView<X_FLOAT*[3], Kokkos::LayoutLeft, LMPDeviceType> tdual_x_array;
#else
typedef Kokkos::DualView<X_FLOAT*[3], Kokkos::LayoutRight, LMPDeviceType> tdual_x_array;
#endif
typedef tdual_x_array::t_dev t_x_array;
typedef tdual_x_array::t_dev_const t_x_array_const;
typedef tdual_x_array::t_dev_um t_x_array_um;
typedef tdual_x_array::t_dev_const_um t_x_array_const_um;
typedef tdual_x_array::t_dev_const_randomread t_x_array_randomread;
//Velocity Types
//1d V_FLOAT array n
typedef Kokkos::DualView<V_FLOAT*, LMPDeviceType::array_layout, LMPDeviceType> tdual_vfloat_1d;
typedef tdual_vfloat_1d::t_dev t_vfloat_1d;
typedef tdual_vfloat_1d::t_dev_const t_vfloat_1d_const;
typedef tdual_vfloat_1d::t_dev_um t_vfloat_1d_um;
typedef tdual_vfloat_1d::t_dev_const_um t_vfloat_1d_const_um;
typedef tdual_vfloat_1d::t_dev_const_randomread t_vfloat_1d_randomread;
//2d V_FLOAT array n*m
typedef Kokkos::DualView<V_FLOAT**, Kokkos::LayoutRight, LMPDeviceType> tdual_vfloat_2d;
typedef tdual_vfloat_2d::t_dev t_vfloat_2d;
typedef tdual_vfloat_2d::t_dev_const t_vfloat_2d_const;
typedef tdual_vfloat_2d::t_dev_um t_vfloat_2d_um;
typedef tdual_vfloat_2d::t_dev_const_um t_vfloat_2d_const_um;
typedef tdual_vfloat_2d::t_dev_const_randomread t_vfloat_2d_randomread;
//2d V_FLOAT array n*3
typedef Kokkos::DualView<V_FLOAT*[3], Kokkos::LayoutRight, LMPDeviceType> tdual_v_array;
//typedef Kokkos::DualView<V_FLOAT*[3], LMPDeviceType::array_layout, LMPDeviceType> tdual_v_array;
typedef tdual_v_array::t_dev t_v_array;
typedef tdual_v_array::t_dev_const t_v_array_const;
typedef tdual_v_array::t_dev_um t_v_array_um;
typedef tdual_v_array::t_dev_const_um t_v_array_const_um;
typedef tdual_v_array::t_dev_const_randomread t_v_array_randomread;
//Force Types
//1d F_FLOAT array n
typedef Kokkos::DualView<F_FLOAT*, LMPDeviceType::array_layout, LMPDeviceType> tdual_ffloat_1d;
typedef tdual_ffloat_1d::t_dev t_ffloat_1d;
typedef tdual_ffloat_1d::t_dev_const t_ffloat_1d_const;
typedef tdual_ffloat_1d::t_dev_um t_ffloat_1d_um;
typedef tdual_ffloat_1d::t_dev_const_um t_ffloat_1d_const_um;
typedef tdual_ffloat_1d::t_dev_const_randomread t_ffloat_1d_randomread;
//2d F_FLOAT array n*m
typedef Kokkos::DualView<F_FLOAT**, Kokkos::LayoutRight, LMPDeviceType> tdual_ffloat_2d;
typedef tdual_ffloat_2d::t_dev t_ffloat_2d;
typedef tdual_ffloat_2d::t_dev_const t_ffloat_2d_const;
typedef tdual_ffloat_2d::t_dev_um t_ffloat_2d_um;
typedef tdual_ffloat_2d::t_dev_const_um t_ffloat_2d_const_um;
typedef tdual_ffloat_2d::t_dev_const_randomread t_ffloat_2d_randomread;
//2d F_FLOAT array n*3
typedef Kokkos::DualView<F_FLOAT*[3], Kokkos::LayoutRight, LMPDeviceType> tdual_f_array;
//typedef Kokkos::DualView<F_FLOAT*[3], LMPDeviceType::array_layout, LMPDeviceType> tdual_f_array;
typedef tdual_f_array::t_dev t_f_array;
typedef tdual_f_array::t_dev_const t_f_array_const;
typedef tdual_f_array::t_dev_um t_f_array_um;
typedef tdual_f_array::t_dev_const_um t_f_array_const_um;
typedef tdual_f_array::t_dev_const_randomread t_f_array_randomread;
//2d F_FLOAT array n*6 (for virial)
typedef Kokkos::DualView<F_FLOAT*[6], Kokkos::LayoutRight, LMPDeviceType> tdual_virial_array;
typedef tdual_virial_array::t_dev t_virial_array;
typedef tdual_virial_array::t_dev_const t_virial_array_const;
typedef tdual_virial_array::t_dev_um t_virial_array_um;
typedef tdual_virial_array::t_dev_const_um t_virial_array_const_um;
typedef tdual_virial_array::t_dev_const_randomread t_virial_array_randomread;
//Energy Types
//1d E_FLOAT array n
typedef Kokkos::DualView<E_FLOAT*, LMPDeviceType::array_layout, LMPDeviceType> tdual_efloat_1d;
typedef tdual_efloat_1d::t_dev t_efloat_1d;
typedef tdual_efloat_1d::t_dev_const t_efloat_1d_const;
typedef tdual_efloat_1d::t_dev_um t_efloat_1d_um;
typedef tdual_efloat_1d::t_dev_const_um t_efloat_1d_const_um;
typedef tdual_efloat_1d::t_dev_const_randomread t_efloat_1d_randomread;
//2d E_FLOAT array n*m
typedef Kokkos::DualView<E_FLOAT**, Kokkos::LayoutRight, LMPDeviceType> tdual_efloat_2d;
typedef tdual_efloat_2d::t_dev t_efloat_2d;
typedef tdual_efloat_2d::t_dev_const t_efloat_2d_const;
typedef tdual_efloat_2d::t_dev_um t_efloat_2d_um;
typedef tdual_efloat_2d::t_dev_const_um t_efloat_2d_const_um;
typedef tdual_efloat_2d::t_dev_const_randomread t_efloat_2d_randomread;
//2d E_FLOAT array n*3
typedef Kokkos::DualView<E_FLOAT*[3], Kokkos::LayoutRight, LMPDeviceType> tdual_e_array;
typedef tdual_e_array::t_dev t_e_array;
typedef tdual_e_array::t_dev_const t_e_array_const;
typedef tdual_e_array::t_dev_um t_e_array_um;
typedef tdual_e_array::t_dev_const_um t_e_array_const_um;
typedef tdual_e_array::t_dev_const_randomread t_e_array_randomread;
//Neighbor Types
typedef Kokkos::DualView<int**, LMPDeviceType::array_layout, LMPDeviceType> tdual_neighbors_2d;
typedef tdual_neighbors_2d::t_dev t_neighbors_2d;
typedef tdual_neighbors_2d::t_dev_const t_neighbors_2d_const;
typedef tdual_neighbors_2d::t_dev_um t_neighbors_2d_um;
typedef tdual_neighbors_2d::t_dev_const_um t_neighbors_2d_const_um;
typedef tdual_neighbors_2d::t_dev_const_randomread t_neighbors_2d_randomread;
};
#ifdef KOKKOS_HAVE_CUDA
template <>
struct ArrayTypes<LMPHostType> {
//Scalar Types
typedef Kokkos::DualView<int, LMPDeviceType::array_layout, LMPDeviceType> tdual_int_scalar;
typedef tdual_int_scalar::t_host t_int_scalar;
typedef tdual_int_scalar::t_host_const t_int_scalar_const;
typedef tdual_int_scalar::t_host_um t_int_scalar_um;
typedef tdual_int_scalar::t_host_const_um t_int_scalar_const_um;
typedef Kokkos::DualView<LMP_FLOAT, LMPDeviceType::array_layout, LMPDeviceType> tdual_float_scalar;
typedef tdual_float_scalar::t_host t_float_scalar;
typedef tdual_float_scalar::t_host_const t_float_scalar_const;
typedef tdual_float_scalar::t_host_um t_float_scalar_um;
typedef tdual_float_scalar::t_host_const_um t_float_scalar_const_um;
//Generic ArrayTypes
typedef Kokkos::DualView<int*, LMPDeviceType::array_layout, LMPDeviceType> tdual_int_1d;
typedef tdual_int_1d::t_host t_int_1d;
typedef tdual_int_1d::t_host_const t_int_1d_const;
typedef tdual_int_1d::t_host_um t_int_1d_um;
typedef tdual_int_1d::t_host_const_um t_int_1d_const_um;
typedef tdual_int_1d::t_host_const_randomread t_int_1d_randomread;
typedef Kokkos::DualView<int*[3], Kokkos::LayoutRight, LMPDeviceType> tdual_int_1d_3;
typedef tdual_int_1d_3::t_host t_int_1d_3;
typedef tdual_int_1d_3::t_host_const t_int_1d_3_const;
typedef tdual_int_1d_3::t_host_um t_int_1d_3_um;
typedef tdual_int_1d_3::t_host_const_um t_int_1d_3_const_um;
typedef tdual_int_1d_3::t_host_const_randomread t_int_1d_3_randomread;
typedef Kokkos::DualView<int**, Kokkos::LayoutRight, LMPDeviceType> tdual_int_2d;
typedef tdual_int_2d::t_host t_int_2d;
typedef tdual_int_2d::t_host_const t_int_2d_const;
typedef tdual_int_2d::t_host_um t_int_2d_um;
typedef tdual_int_2d::t_host_const_um t_int_2d_const_um;
typedef tdual_int_2d::t_host_const_randomread t_int_2d_randomread;
typedef Kokkos::DualView<LAMMPS_NS::tagint*, LMPDeviceType::array_layout, LMPDeviceType> tdual_tagint_1d;
typedef tdual_tagint_1d::t_host t_tagint_1d;
typedef tdual_tagint_1d::t_host_const t_tagint_1d_const;
typedef tdual_tagint_1d::t_host_um t_tagint_1d_um;
typedef tdual_tagint_1d::t_host_const_um t_tagint_1d_const_um;
typedef tdual_tagint_1d::t_host_const_randomread t_tagint_1d_randomread;
typedef Kokkos::
DualView<LAMMPS_NS::tagint**, Kokkos::LayoutRight, LMPDeviceType>
tdual_tagint_2d;
typedef tdual_tagint_2d::t_host t_tagint_2d;
typedef tdual_tagint_2d::t_host_const t_tagint_2d_const;
typedef tdual_tagint_2d::t_host_um t_tagint_2d_um;
typedef tdual_tagint_2d::t_host_const_um t_tagint_2d_const_um;
typedef tdual_tagint_2d::t_host_const_randomread t_tagint_2d_randomread;
typedef Kokkos::
DualView<LAMMPS_NS::imageint*, LMPDeviceType::array_layout, LMPDeviceType>
tdual_imageint_1d;
typedef tdual_imageint_1d::t_host t_imageint_1d;
typedef tdual_imageint_1d::t_host_const t_imageint_1d_const;
typedef tdual_imageint_1d::t_host_um t_imageint_1d_um;
typedef tdual_imageint_1d::t_host_const_um t_imageint_1d_const_um;
typedef tdual_imageint_1d::t_host_const_randomread t_imageint_1d_randomread;
typedef Kokkos::
DualView<double*, Kokkos::LayoutRight, LMPDeviceType> tdual_double_1d;
typedef tdual_double_1d::t_host t_double_1d;
typedef tdual_double_1d::t_host_const t_double_1d_const;
typedef tdual_double_1d::t_host_um t_double_1d_um;
typedef tdual_double_1d::t_host_const_um t_double_1d_const_um;
typedef tdual_double_1d::t_host_const_randomread t_double_1d_randomread;
typedef Kokkos::
DualView<double**, Kokkos::LayoutRight, LMPDeviceType> tdual_double_2d;
typedef tdual_double_2d::t_host t_double_2d;
typedef tdual_double_2d::t_host_const t_double_2d_const;
typedef tdual_double_2d::t_host_um t_double_2d_um;
typedef tdual_double_2d::t_host_const_um t_double_2d_const_um;
typedef tdual_double_2d::t_host_const_randomread t_double_2d_randomread;
//1d float array n
typedef Kokkos::DualView<LMP_FLOAT*, LMPDeviceType::array_layout, LMPDeviceType> tdual_float_1d;
typedef tdual_float_1d::t_host t_float_1d;
typedef tdual_float_1d::t_host_const t_float_1d_const;
typedef tdual_float_1d::t_host_um t_float_1d_um;
typedef tdual_float_1d::t_host_const_um t_float_1d_const_um;
typedef tdual_float_1d::t_host_const_randomread t_float_1d_randomread;
//2d float array n
typedef Kokkos::DualView<LMP_FLOAT**, Kokkos::LayoutRight, LMPDeviceType> tdual_float_2d;
typedef tdual_float_2d::t_host t_float_2d;
typedef tdual_float_2d::t_host_const t_float_2d_const;
typedef tdual_float_2d::t_host_um t_float_2d_um;
typedef tdual_float_2d::t_host_const_um t_float_2d_const_um;
typedef tdual_float_2d::t_host_const_randomread t_float_2d_randomread;
//Position Types
//1d X_FLOAT array n
typedef Kokkos::DualView<X_FLOAT*, LMPDeviceType::array_layout, LMPDeviceType> tdual_xfloat_1d;
typedef tdual_xfloat_1d::t_host t_xfloat_1d;
typedef tdual_xfloat_1d::t_host_const t_xfloat_1d_const;
typedef tdual_xfloat_1d::t_host_um t_xfloat_1d_um;
typedef tdual_xfloat_1d::t_host_const_um t_xfloat_1d_const_um;
typedef tdual_xfloat_1d::t_host_const_randomread t_xfloat_1d_randomread;
//2d X_FLOAT array n*m
typedef Kokkos::DualView<X_FLOAT**, Kokkos::LayoutRight, LMPDeviceType> tdual_xfloat_2d;
typedef tdual_xfloat_2d::t_host t_xfloat_2d;
typedef tdual_xfloat_2d::t_host_const t_xfloat_2d_const;
typedef tdual_xfloat_2d::t_host_um t_xfloat_2d_um;
typedef tdual_xfloat_2d::t_host_const_um t_xfloat_2d_const_um;
typedef tdual_xfloat_2d::t_host_const_randomread t_xfloat_2d_randomread;
//2d X_FLOAT array n*3
typedef Kokkos::DualView<X_FLOAT*[3], Kokkos::LayoutRight, LMPDeviceType> tdual_x_array;
typedef tdual_x_array::t_host t_x_array;
typedef tdual_x_array::t_host_const t_x_array_const;
typedef tdual_x_array::t_host_um t_x_array_um;
typedef tdual_x_array::t_host_const_um t_x_array_const_um;
typedef tdual_x_array::t_host_const_randomread t_x_array_randomread;
//Velocity Types
//1d V_FLOAT array n
typedef Kokkos::DualView<V_FLOAT*, LMPDeviceType::array_layout, LMPDeviceType> tdual_vfloat_1d;
typedef tdual_vfloat_1d::t_host t_vfloat_1d;
typedef tdual_vfloat_1d::t_host_const t_vfloat_1d_const;
typedef tdual_vfloat_1d::t_host_um t_vfloat_1d_um;
typedef tdual_vfloat_1d::t_host_const_um t_vfloat_1d_const_um;
typedef tdual_vfloat_1d::t_host_const_randomread t_vfloat_1d_randomread;
//2d V_FLOAT array n*m
typedef Kokkos::DualView<V_FLOAT**, Kokkos::LayoutRight, LMPDeviceType> tdual_vfloat_2d;
typedef tdual_vfloat_2d::t_host t_vfloat_2d;
typedef tdual_vfloat_2d::t_host_const t_vfloat_2d_const;
typedef tdual_vfloat_2d::t_host_um t_vfloat_2d_um;
typedef tdual_vfloat_2d::t_host_const_um t_vfloat_2d_const_um;
typedef tdual_vfloat_2d::t_host_const_randomread t_vfloat_2d_randomread;
//2d V_FLOAT array n*3
typedef Kokkos::DualView<V_FLOAT*[3], Kokkos::LayoutRight, LMPDeviceType> tdual_v_array;
//typedef Kokkos::DualView<V_FLOAT*[3], LMPDeviceType::array_layout, LMPDeviceType> tdual_v_array;
typedef tdual_v_array::t_host t_v_array;
typedef tdual_v_array::t_host_const t_v_array_const;
typedef tdual_v_array::t_host_um t_v_array_um;
typedef tdual_v_array::t_host_const_um t_v_array_const_um;
typedef tdual_v_array::t_host_const_randomread t_v_array_randomread;
//Force Types
//1d F_FLOAT array n
typedef Kokkos::DualView<F_FLOAT*, LMPDeviceType::array_layout, LMPDeviceType> tdual_ffloat_1d;
typedef tdual_ffloat_1d::t_host t_ffloat_1d;
typedef tdual_ffloat_1d::t_host_const t_ffloat_1d_const;
typedef tdual_ffloat_1d::t_host_um t_ffloat_1d_um;
typedef tdual_ffloat_1d::t_host_const_um t_ffloat_1d_const_um;
typedef tdual_ffloat_1d::t_host_const_randomread t_ffloat_1d_randomread;
//2d F_FLOAT array n*m
typedef Kokkos::DualView<F_FLOAT**, Kokkos::LayoutRight, LMPDeviceType> tdual_ffloat_2d;
typedef tdual_ffloat_2d::t_host t_ffloat_2d;
typedef tdual_ffloat_2d::t_host_const t_ffloat_2d_const;
typedef tdual_ffloat_2d::t_host_um t_ffloat_2d_um;
typedef tdual_ffloat_2d::t_host_const_um t_ffloat_2d_const_um;
typedef tdual_ffloat_2d::t_host_const_randomread t_ffloat_2d_randomread;
//2d F_FLOAT array n*3
typedef Kokkos::DualView<F_FLOAT*[3], Kokkos::LayoutRight, LMPDeviceType> tdual_f_array;
//typedef Kokkos::DualView<F_FLOAT*[3], LMPDeviceType::array_layout, LMPDeviceType> tdual_f_array;
typedef tdual_f_array::t_host t_f_array;
typedef tdual_f_array::t_host_const t_f_array_const;
typedef tdual_f_array::t_host_um t_f_array_um;
typedef tdual_f_array::t_host_const_um t_f_array_const_um;
typedef tdual_f_array::t_host_const_randomread t_f_array_randomread;
//2d F_FLOAT array n*6 (for virial)
typedef Kokkos::DualView<F_FLOAT*[6], Kokkos::LayoutRight, LMPDeviceType> tdual_virial_array;
typedef tdual_virial_array::t_host t_virial_array;
typedef tdual_virial_array::t_host_const t_virial_array_const;
typedef tdual_virial_array::t_host_um t_virial_array_um;
typedef tdual_virial_array::t_host_const_um t_virial_array_const_um;
typedef tdual_virial_array::t_host_const_randomread t_virial_array_randomread;
//Energy Types
//1d E_FLOAT array n
typedef Kokkos::DualView<E_FLOAT*, LMPDeviceType::array_layout, LMPDeviceType> tdual_efloat_1d;
typedef tdual_efloat_1d::t_host t_efloat_1d;
typedef tdual_efloat_1d::t_host_const t_efloat_1d_const;
typedef tdual_efloat_1d::t_host_um t_efloat_1d_um;
typedef tdual_efloat_1d::t_host_const_um t_efloat_1d_const_um;
typedef tdual_efloat_1d::t_host_const_randomread t_efloat_1d_randomread;
//2d E_FLOAT array n*m
typedef Kokkos::DualView<E_FLOAT**, Kokkos::LayoutRight, LMPDeviceType> tdual_efloat_2d;
typedef tdual_efloat_2d::t_host t_efloat_2d;
typedef tdual_efloat_2d::t_host_const t_efloat_2d_const;
typedef tdual_efloat_2d::t_host_um t_efloat_2d_um;
typedef tdual_efloat_2d::t_host_const_um t_efloat_2d_const_um;
typedef tdual_efloat_2d::t_host_const_randomread t_efloat_2d_randomread;
//2d E_FLOAT array n*3
typedef Kokkos::DualView<E_FLOAT*[3], Kokkos::LayoutRight, LMPDeviceType> tdual_e_array;
typedef tdual_e_array::t_host t_e_array;
typedef tdual_e_array::t_host_const t_e_array_const;
typedef tdual_e_array::t_host_um t_e_array_um;
typedef tdual_e_array::t_host_const_um t_e_array_const_um;
typedef tdual_e_array::t_host_const_randomread t_e_array_randomread;
//Neighbor Types
typedef Kokkos::DualView<int**, LMPDeviceType::array_layout, LMPDeviceType> tdual_neighbors_2d;
typedef tdual_neighbors_2d::t_host t_neighbors_2d;
typedef tdual_neighbors_2d::t_host_const t_neighbors_2d_const;
typedef tdual_neighbors_2d::t_host_um t_neighbors_2d_um;
typedef tdual_neighbors_2d::t_host_const_um t_neighbors_2d_const_um;
typedef tdual_neighbors_2d::t_host_const_randomread t_neighbors_2d_randomread;
};
#endif
//default LAMMPS Types
typedef struct ArrayTypes<LMPDeviceType> DAT;
typedef struct ArrayTypes<LMPHostType> HAT;
template<class DeviceType, class BufferView, class DualView>
void buffer_view(BufferView &buf, DualView &view,
const size_t n0,
const size_t n1 = 0,
const size_t n2 = 0,
const size_t n3 = 0,
const size_t n4 = 0,
const size_t n5 = 0,
const size_t n6 = 0,
const size_t n7 = 0) {
buf = BufferView(
view.template view<DeviceType>().ptr_on_device(),
n0,n1,n2,n3,n4,n5,n6,n7);
}
template<class DeviceType>
struct MemsetZeroFunctor {
- typedef DeviceType device_type ;
+ typedef DeviceType execution_space ;
void* ptr;
KOKKOS_INLINE_FUNCTION void operator()(const int i) const {
((int*)ptr)[i] = 0;
}
};
template<class ViewType>
void memset_kokkos (ViewType &view) {
- static MemsetZeroFunctor<typename ViewType::device_type> f;
+ static MemsetZeroFunctor<typename ViewType::execution_space> f;
f.ptr = view.ptr_on_device();
Kokkos::parallel_for(view.capacity()*sizeof(typename ViewType::value_type)/4, f);
- ViewType::device_type::fence();
+ ViewType::execution_space::fence();
}
#endif
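
The ArrayTypes typedefs above all wrap Kokkos::DualView, which keeps a host copy and a device copy of the same data and tracks which side was modified last. As a minimal, hedged sketch of the usual cycle (editor's illustration, not part of the patch; the function and variable names are invented), one fills the host view, marks the host side modified, syncs to the device, and then hands the device view to kernels:

  // Editor's sketch: host-fill / modify / sync / device-view cycle for a DAT typedef.
  #include "kokkos_type.h"   // assumed include path for the typedefs above

  void fill_and_sync(const int n) {
    DAT::tdual_int_1d k_list("sketch:list", n);           // allocates host and device copies
    for (int i = 0; i < n; i++) k_list.h_view(i) = i;     // write through the host view
    k_list.modify<LMPHostType>();                         // flag the host copy as newer
    k_list.sync<LMPDeviceType>();                         // deep-copy host -> device if needed
    DAT::t_int_1d d_list = k_list.view<LMPDeviceType>();  // plain device view for kernels
    (void) d_list;
  }

When there is no separate device memory space (e.g. a build without CUDA), the sync step degenerates to a no-op, so the same code path serves host-only and device builds.
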
diff --git a/src/KOKKOS/pair_kokkos.h b/src/KOKKOS/pair_kokkos.h
index f3bef77b8..5f9b347dd 100644
--- a/src/KOKKOS/pair_kokkos.h
+++ b/src/KOKKOS/pair_kokkos.h
@@ -1,760 +1,754 @@
/* -*- c++ -*- ----------------------------------------------------------
LAMMPS - Large-scale Atomic/Molecular Massively Parallel Simulator
http://lammps.sandia.gov, Sandia National Laboratories
Steve Plimpton, sjplimp@sandia.gov
Copyright (2003) Sandia Corporation. Under the terms of Contract
DE-AC04-94AL85000 with Sandia Corporation, the U.S. Government retains
certain rights in this software. This software is distributed under
the GNU General Public License.
See the README file in the top-level LAMMPS directory.
------------------------------------------------------------------------- */
#ifdef PAIR_CLASS
#else
#ifndef LMP_PAIR_KOKKOS_H
#define LMP_PAIR_KOKKOS_H
#include "Kokkos_Macros.hpp"
#include "pair.h"
#include "neigh_list_kokkos.h"
#include "Kokkos_Vectorization.hpp"
namespace LAMMPS_NS {
template<int Table>
struct CoulLongTable {
enum {DoTable = Table};
};
// Tags for doing coulomb calculations or not
// They facilitate function overloading, since
// partial template specialization of member functions is not allowed
struct CoulTag {};
struct NoCoulTag {};
template<int FLAG>
struct DoCoul {
typedef NoCoulTag type;
};
template<>
struct DoCoul<1> {
typedef CoulTag type;
};
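// (Editor's note, not part of the patch.) The structs above implement the
// classic tag-dispatch idiom: because member functions cannot be partially
// specialized, two overloads of compute_item are provided below that differ
// only in a trailing tag argument (const NoCoulTag& vs. const CoulTag&), and
// callers select one at compile time by passing
// typename DoCoul<PairStyle::COUL_FLAG>::type().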
// Determine memory traits for force array
// Use the atomic memory trait when running the HALFTHREAD neighbor list style
template<int NEIGHFLAG>
struct AtomicF {
enum {value = Kokkos::Unmanaged};
};
template<>
struct AtomicF<HALFTHREAD> {
enum {value = Kokkos::Atomic|Kokkos::Unmanaged};
};
//Specialisation for Neighborlist types Half, HalfThread, Full
template <class PairStyle, int NEIGHFLAG, bool STACKPARAMS, class Specialisation = void>
struct PairComputeFunctor {
typedef typename PairStyle::device_type device_type ;
// Reduction type, contains evdwl, ecoul and virial[6]
typedef EV_FLOAT value_type;
// The copy of the pair style
PairStyle c;
// The force array is atomic for Half/Thread neighbor style
Kokkos::View<F_FLOAT*[3], typename DAT::t_f_array::array_layout,
device_type,Kokkos::MemoryTraits<AtomicF<NEIGHFLAG>::value> > f;
// The eatom and vatom arrays are atomic for Half/Thread neighbor style
Kokkos::View<E_FLOAT*, typename DAT::t_efloat_1d::array_layout,
device_type,Kokkos::MemoryTraits<AtomicF<NEIGHFLAG>::value> > eatom;
Kokkos::View<F_FLOAT*[6], typename DAT::t_virial_array::array_layout,
device_type,Kokkos::MemoryTraits<AtomicF<NEIGHFLAG>::value> > vatom;
NeighListKokkos<device_type> list;
PairComputeFunctor(PairStyle* c_ptr,
NeighListKokkos<device_type>* list_ptr):
c(*c_ptr),f(c.f),eatom(c.d_eatom),
vatom(c.d_vatom),list(*list_ptr) {};
// Call cleanup_copy, which sets to NULL those allocations that are destructed by the PairStyle
~PairComputeFunctor() {c.cleanup_copy();list.clean_copy();};
KOKKOS_INLINE_FUNCTION int sbmask(const int& j) const {
return j >> SBBITS & 3;
}
// Loop over neighbors of one atom without coulomb interaction
// This function is called in parallel
template<int EVFLAG, int NEWTON_PAIR>
KOKKOS_FUNCTION
EV_FLOAT compute_item(const int& ii,
const NeighListKokkos<device_type> &list, const NoCoulTag&) const {
EV_FLOAT ev;
const int i = list.d_ilist[ii];
const X_FLOAT xtmp = c.x(i,0);
const X_FLOAT ytmp = c.x(i,1);
const X_FLOAT ztmp = c.x(i,2);
const int itype = c.type(i);
const AtomNeighborsConst neighbors_i = list.get_neighbors_const(i);
const int jnum = list.d_numneigh[i];
F_FLOAT fxtmp = 0.0;
F_FLOAT fytmp = 0.0;
F_FLOAT fztmp = 0.0;
for (int jj = 0; jj < jnum; jj++) {
int j = neighbors_i(jj);
const F_FLOAT factor_lj = c.special_lj[sbmask(j)];
j &= NEIGHMASK;
const X_FLOAT delx = xtmp - c.x(j,0);
const X_FLOAT dely = ytmp - c.x(j,1);
const X_FLOAT delz = ztmp - c.x(j,2);
const int jtype = c.type(j);
const F_FLOAT rsq = delx*delx + dely*dely + delz*delz;
if(rsq < (STACKPARAMS?c.m_cutsq[itype][jtype]:c.d_cutsq(itype,jtype))) {
const F_FLOAT fpair = factor_lj*c.template compute_fpair<STACKPARAMS,Specialisation>(rsq,i,j,itype,jtype);
fxtmp += delx*fpair;
fytmp += dely*fpair;
fztmp += delz*fpair;
if ((NEIGHFLAG==HALF || NEIGHFLAG==HALFTHREAD) && (NEWTON_PAIR || j < c.nlocal)) {
f(j,0) -= delx*fpair;
f(j,1) -= dely*fpair;
f(j,2) -= delz*fpair;
}
if (EVFLAG) {
F_FLOAT evdwl = 0.0;
if (c.eflag) {
evdwl = factor_lj * c.template compute_evdwl<STACKPARAMS,Specialisation>(rsq,i,j,itype,jtype);
ev.evdwl += (((NEIGHFLAG==HALF || NEIGHFLAG==HALFTHREAD)&&(NEWTON_PAIR||(j<c.nlocal)))?1.0:0.5)*evdwl;
}
if (c.vflag_either || c.eflag_atom) ev_tally(ev,i,j,evdwl,fpair,delx,dely,delz);
}
}
}
f(i,0) += fxtmp;
f(i,1) += fytmp;
f(i,2) += fztmp;
return ev;
}
// Loop over neighbors of one atom with coulomb interaction
// This function is called in parallel
template<int EVFLAG, int NEWTON_PAIR>
KOKKOS_FUNCTION
EV_FLOAT compute_item(const int& ii,
const NeighListKokkos<device_type> &list, const CoulTag& ) const {
EV_FLOAT ev;
const int i = list.d_ilist[ii];
const X_FLOAT xtmp = c.x(i,0);
const X_FLOAT ytmp = c.x(i,1);
const X_FLOAT ztmp = c.x(i,2);
const int itype = c.type(i);
const F_FLOAT qtmp = c.q(i);
const AtomNeighborsConst neighbors_i = list.get_neighbors_const(i);
const int jnum = list.d_numneigh[i];
F_FLOAT fxtmp = 0.0;
F_FLOAT fytmp = 0.0;
F_FLOAT fztmp = 0.0;
for (int jj = 0; jj < jnum; jj++) {
int j = neighbors_i(jj);
const F_FLOAT factor_lj = c.special_lj[sbmask(j)];
const F_FLOAT factor_coul = c.special_coul[sbmask(j)];
j &= NEIGHMASK;
const X_FLOAT delx = xtmp - c.x(j,0);
const X_FLOAT dely = ytmp - c.x(j,1);
const X_FLOAT delz = ztmp - c.x(j,2);
const int jtype = c.type(j);
const F_FLOAT rsq = delx*delx + dely*dely + delz*delz;
if(rsq < (STACKPARAMS?c.m_cutsq[itype][jtype]:c.d_cutsq(itype,jtype))) {
F_FLOAT fpair = F_FLOAT();
if(rsq < (STACKPARAMS?c.m_cut_ljsq[itype][jtype]:c.d_cut_ljsq(itype,jtype)))
fpair+=factor_lj*c.template compute_fpair<STACKPARAMS,Specialisation>(rsq,i,j,itype,jtype);
if(rsq < (STACKPARAMS?c.m_cut_coulsq[itype][jtype]:c.d_cut_coulsq(itype,jtype)))
fpair+=c.template compute_fcoul<STACKPARAMS,Specialisation>(rsq,i,j,itype,jtype,factor_coul,qtmp);
fxtmp += delx*fpair;
fytmp += dely*fpair;
fztmp += delz*fpair;
if ((NEIGHFLAG==HALF || NEIGHFLAG==HALFTHREAD) && (NEWTON_PAIR || j < c.nlocal)) {
f(j,0) -= delx*fpair;
f(j,1) -= dely*fpair;
f(j,2) -= delz*fpair;
}
if (EVFLAG) {
F_FLOAT evdwl = 0.0;
F_FLOAT ecoul = 0.0;
if (c.eflag) {
if(rsq < (STACKPARAMS?c.m_cut_ljsq[itype][jtype]:c.d_cut_ljsq(itype,jtype))) {
evdwl = factor_lj * c.template compute_evdwl<STACKPARAMS,Specialisation>(rsq,i,j,itype,jtype);
ev.evdwl += (((NEIGHFLAG==HALF || NEIGHFLAG==HALFTHREAD)&&(NEWTON_PAIR||(j<c.nlocal)))?1.0:0.5)*evdwl;
}
if(rsq < (STACKPARAMS?c.m_cut_coulsq[itype][jtype]:c.d_cut_coulsq(itype,jtype))) {
ecoul = c.template compute_ecoul<STACKPARAMS,Specialisation>(rsq,i,j,itype,jtype,factor_coul,qtmp);
ev.ecoul += (((NEIGHFLAG==HALF || NEIGHFLAG==HALFTHREAD)&&(NEWTON_PAIR||(j<c.nlocal)))?1.0:0.5)*ecoul;
}
}
if (c.vflag_either || c.eflag_atom) ev_tally(ev,i,j,evdwl+ecoul,fpair,delx,dely,delz);
}
}
}
f(i,0) += fxtmp;
f(i,1) += fytmp;
f(i,2) += fztmp;
return ev;
}
KOKKOS_INLINE_FUNCTION
void ev_tally(EV_FLOAT &ev, const int &i, const int &j,
const F_FLOAT &epair, const F_FLOAT &fpair, const F_FLOAT &delx,
const F_FLOAT &dely, const F_FLOAT &delz) const
{
const int EFLAG = c.eflag;
const int NEWTON_PAIR = c.newton_pair;
const int VFLAG = c.vflag_either;
if (EFLAG) {
if (c.eflag_atom) {
const E_FLOAT epairhalf = 0.5 * epair;
if (NEWTON_PAIR || i < c.nlocal) eatom[i] += epairhalf;
if ((NEWTON_PAIR || j < c.nlocal) && NEIGHFLAG != FULL) eatom[j] += epairhalf;
}
}
if (VFLAG) {
const E_FLOAT v0 = delx*delx*fpair;
const E_FLOAT v1 = dely*dely*fpair;
const E_FLOAT v2 = delz*delz*fpair;
const E_FLOAT v3 = delx*dely*fpair;
const E_FLOAT v4 = delx*delz*fpair;
const E_FLOAT v5 = dely*delz*fpair;
if (c.vflag_global) {
if (NEIGHFLAG!=FULL) {
if (NEWTON_PAIR) {
ev.v[0] += v0;
ev.v[1] += v1;
ev.v[2] += v2;
ev.v[3] += v3;
ev.v[4] += v4;
ev.v[5] += v5;
} else {
if (i < c.nlocal) {
ev.v[0] += 0.5*v0;
ev.v[1] += 0.5*v1;
ev.v[2] += 0.5*v2;
ev.v[3] += 0.5*v3;
ev.v[4] += 0.5*v4;
ev.v[5] += 0.5*v5;
}
if (j < c.nlocal) {
ev.v[0] += 0.5*v0;
ev.v[1] += 0.5*v1;
ev.v[2] += 0.5*v2;
ev.v[3] += 0.5*v3;
ev.v[4] += 0.5*v4;
ev.v[5] += 0.5*v5;
}
}
} else {
ev.v[0] += 0.5*v0;
ev.v[1] += 0.5*v1;
ev.v[2] += 0.5*v2;
ev.v[3] += 0.5*v3;
ev.v[4] += 0.5*v4;
ev.v[5] += 0.5*v5;
}
}
if (c.vflag_atom) {
if (NEWTON_PAIR || i < c.nlocal) {
vatom(i,0) += 0.5*v0;
vatom(i,1) += 0.5*v1;
vatom(i,2) += 0.5*v2;
vatom(i,3) += 0.5*v3;
vatom(i,4) += 0.5*v4;
vatom(i,5) += 0.5*v5;
}
if ((NEWTON_PAIR || j < c.nlocal) && NEIGHFLAG != FULL) {
vatom(j,0) += 0.5*v0;
vatom(j,1) += 0.5*v1;
vatom(j,2) += 0.5*v2;
vatom(j,3) += 0.5*v3;
vatom(j,4) += 0.5*v4;
vatom(j,5) += 0.5*v5;
}
}
}
}
KOKKOS_INLINE_FUNCTION
void operator()(const int i) const {
if (c.newton_pair) compute_item<0,1>(i,list,typename DoCoul<PairStyle::COUL_FLAG>::type());
else compute_item<0,0>(i,list,typename DoCoul<PairStyle::COUL_FLAG>::type());
}
KOKKOS_INLINE_FUNCTION
void operator()(const int i, value_type &energy_virial) const {
if (c.newton_pair)
energy_virial += compute_item<1,1>(i,list,typename DoCoul<PairStyle::COUL_FLAG>::type());
else
energy_virial += compute_item<1,0>(i,list,typename DoCoul<PairStyle::COUL_FLAG>::type());
}
};
template <class PairStyle, bool STACKPARAMS, class Specialisation>
struct PairComputeFunctor<PairStyle,FULLCLUSTER,STACKPARAMS,Specialisation> {
typedef typename PairStyle::device_type device_type ;
- typedef Kokkos::Vectorization<device_type,NeighClusterSize> vectorization;
typedef EV_FLOAT value_type;
PairStyle c;
NeighListKokkos<device_type> list;
PairComputeFunctor(PairStyle* c_ptr,
NeighListKokkos<device_type>* list_ptr):
c(*c_ptr),list(*list_ptr) {};
~PairComputeFunctor() {c.cleanup_copy();list.clean_copy();};
KOKKOS_INLINE_FUNCTION int sbmask(const int& j) const {
return j >> SBBITS & 3;
}
template<int EVFLAG, int NEWTON_PAIR>
KOKKOS_FUNCTION
EV_FLOAT compute_item(const typename Kokkos::TeamPolicy<device_type>::member_type& dev,
const NeighListKokkos<device_type> &list, const NoCoulTag& ) const {
EV_FLOAT ev;
- const int i = vectorization::global_thread_rank(dev);
+ const int i = dev.league_rank()*dev.team_size() + dev.team_rank();
const X_FLOAT xtmp = c.c_x(i,0);
const X_FLOAT ytmp = c.c_x(i,1);
const X_FLOAT ztmp = c.c_x(i,2);
const int itype = c.type(i);
const AtomNeighborsConst neighbors_i = list.get_neighbors_const(i);
const int jnum = list.d_numneigh[i];
- F_FLOAT fxtmp = 0.0;
- F_FLOAT fytmp = 0.0;
- F_FLOAT fztmp = 0.0;
+ F_FLOAT3 ftmp;
for (int jj = 0; jj < jnum; jj++) {
const int jjj = neighbors_i(jj);
- for (int k = vectorization::begin(); k<NeighClusterSize; k+=vectorization::increment) {
+ Kokkos::parallel_reduce(Kokkos::ThreadVectorRange(dev,NeighClusterSize),[&] (const int& k, F_FLOAT3& fftmp) {
const F_FLOAT factor_lj = c.special_lj[sbmask(jjj+k)];
const int j = (jjj + k)&NEIGHMASK;
- if((j==i)||(j>=c.nall)) continue;
+ if((j==i)||(j>=c.nall)) return;
const X_FLOAT delx = xtmp - c.c_x(j,0);
const X_FLOAT dely = ytmp - c.c_x(j,1);
const X_FLOAT delz = ztmp - c.c_x(j,2);
const int jtype = c.type(j);
const F_FLOAT rsq = (delx*delx + dely*dely + delz*delz);
if(rsq < (STACKPARAMS?c.m_cutsq[itype][jtype]:c.d_cutsq(itype,jtype))) {
const F_FLOAT fpair = factor_lj*c.template compute_fpair<STACKPARAMS,Specialisation>(rsq,i,j,itype,jtype);
- fxtmp += delx*fpair;
- fytmp += dely*fpair;
- fztmp += delz*fpair;
+ fftmp.x += delx*fpair;
+ fftmp.y += dely*fpair;
+ fftmp.z += delz*fpair;
if (EVFLAG) {
F_FLOAT evdwl = 0.0;
if (c.eflag) {
evdwl = 0.5*
factor_lj * c.template compute_evdwl<STACKPARAMS,Specialisation>(rsq,i,j,itype,jtype);
ev.evdwl += evdwl;
}
if (c.vflag_either || c.eflag_atom) ev_tally(ev,i,j,evdwl,fpair,delx,dely,delz);
}
}
- }
+ },ftmp);
}
- const F_FLOAT fx = vectorization::reduce(fxtmp);
- const F_FLOAT fy = vectorization::reduce(fytmp);
- const F_FLOAT fz = vectorization::reduce(fztmp);
- if(vectorization::is_lane_0(dev)) {
- c.f(i,0) += fx;
- c.f(i,1) += fy;
- c.f(i,2) += fz;
- }
+ Kokkos::single(Kokkos::PerThread(dev), [&]() {
+ c.f(i,0) += ftmp.x;
+ c.f(i,1) += ftmp.y;
+ c.f(i,2) += ftmp.z;
+ });
return ev;
}
KOKKOS_INLINE_FUNCTION
void ev_tally(EV_FLOAT &ev, const int &i, const int &j,
const F_FLOAT &epair, const F_FLOAT &fpair, const F_FLOAT &delx,
const F_FLOAT &dely, const F_FLOAT &delz) const
{
const int EFLAG = c.eflag;
const int NEWTON_PAIR = c.newton_pair;
const int VFLAG = c.vflag_either;
if (EFLAG) {
if (c.eflag_atom) {
const E_FLOAT epairhalf = 0.5 * epair;
if (NEWTON_PAIR || i < c.nlocal) c.d_eatom[i] += epairhalf;
if (NEWTON_PAIR || j < c.nlocal) c.d_eatom[j] += epairhalf;
}
}
if (VFLAG) {
const E_FLOAT v0 = delx*delx*fpair;
const E_FLOAT v1 = dely*dely*fpair;
const E_FLOAT v2 = delz*delz*fpair;
const E_FLOAT v3 = delx*dely*fpair;
const E_FLOAT v4 = delx*delz*fpair;
const E_FLOAT v5 = dely*delz*fpair;
if (c.vflag_global) {
ev.v[0] += 0.5*v0;
ev.v[1] += 0.5*v1;
ev.v[2] += 0.5*v2;
ev.v[3] += 0.5*v3;
ev.v[4] += 0.5*v4;
ev.v[5] += 0.5*v5;
}
if (c.vflag_atom) {
if (i < c.nlocal) {
c.d_vatom(i,0) += 0.5*v0;
c.d_vatom(i,1) += 0.5*v1;
c.d_vatom(i,2) += 0.5*v2;
c.d_vatom(i,3) += 0.5*v3;
c.d_vatom(i,4) += 0.5*v4;
c.d_vatom(i,5) += 0.5*v5;
}
}
}
}
KOKKOS_INLINE_FUNCTION
void operator()(const typename Kokkos::TeamPolicy<device_type>::member_type& dev) const {
if (c.newton_pair) compute_item<0,1>(dev,list,typename DoCoul<PairStyle::COUL_FLAG>::type());
else compute_item<0,0>(dev,list,typename DoCoul<PairStyle::COUL_FLAG>::type());
}
KOKKOS_INLINE_FUNCTION
void operator()(const typename Kokkos::TeamPolicy<device_type>::member_type& dev, value_type &energy_virial) const {
if (c.newton_pair)
energy_virial += compute_item<1,1>(dev,list,typename DoCoul<PairStyle::COUL_FLAG>::type());
else
energy_virial += compute_item<1,0>(dev,list,typename DoCoul<PairStyle::COUL_FLAG>::type());
}
};
template <class PairStyle, bool STACKPARAMS, class Specialisation>
struct PairComputeFunctor<PairStyle,N2,STACKPARAMS,Specialisation> {
typedef typename PairStyle::device_type device_type ;
typedef EV_FLOAT value_type;
PairStyle c;
NeighListKokkos<device_type> list;
PairComputeFunctor(PairStyle* c_ptr,
NeighListKokkos<device_type>* list_ptr):
c(*c_ptr),list(*list_ptr) {};
~PairComputeFunctor() {c.cleanup_copy();list.clean_copy();};
KOKKOS_INLINE_FUNCTION int sbmask(const int& j) const {
return j >> SBBITS & 3;
}
template<int EVFLAG, int NEWTON_PAIR>
KOKKOS_FUNCTION
EV_FLOAT compute_item(const int& ii,
const NeighListKokkos<device_type> &list, const NoCoulTag&) const {
(void) list;
EV_FLOAT ev;
const int i = ii;//list.d_ilist[ii];
const X_FLOAT xtmp = c.x(i,0);
const X_FLOAT ytmp = c.x(i,1);
const X_FLOAT ztmp = c.x(i,2);
const int itype = c.type(i);
//const AtomNeighborsConst neighbors_i = list.get_neighbors_const(i);
const int jnum = c.nall;
F_FLOAT fxtmp = 0.0;
F_FLOAT fytmp = 0.0;
F_FLOAT fztmp = 0.0;
for (int jj = 0; jj < jnum; jj++) {
int j = jj;//neighbors_i(jj);
if(i==j) continue;
const F_FLOAT factor_lj = c.special_lj[sbmask(j)];
j &= NEIGHMASK;
const X_FLOAT delx = xtmp - c.x(j,0);
const X_FLOAT dely = ytmp - c.x(j,1);
const X_FLOAT delz = ztmp - c.x(j,2);
const int jtype = c.type(j);
const F_FLOAT rsq = delx*delx + dely*dely + delz*delz;
if(rsq < (STACKPARAMS?c.m_cutsq[itype][jtype]:c.d_cutsq(itype,jtype))) {
const F_FLOAT fpair = factor_lj*c.template compute_fpair<STACKPARAMS,Specialisation>(rsq,i,j,itype,jtype);
fxtmp += delx*fpair;
fytmp += dely*fpair;
fztmp += delz*fpair;
if (EVFLAG) {
F_FLOAT evdwl = 0.0;
if (c.eflag) {
evdwl = 0.5*
factor_lj * c.template compute_evdwl<STACKPARAMS,Specialisation>(rsq,i,j,itype,jtype);
ev.evdwl += evdwl;
}
if (c.vflag_either || c.eflag_atom) ev_tally(ev,i,j,evdwl,fpair,delx,dely,delz);
}
}
}
c.f(i,0) += fxtmp;
c.f(i,1) += fytmp;
c.f(i,2) += fztmp;
return ev;
}
KOKKOS_INLINE_FUNCTION
void ev_tally(EV_FLOAT &ev, const int &i, const int &j,
const F_FLOAT &epair, const F_FLOAT &fpair, const F_FLOAT &delx,
const F_FLOAT &dely, const F_FLOAT &delz) const
{
const int EFLAG = c.eflag;
const int VFLAG = c.vflag_either;
if (EFLAG) {
if (c.eflag_atom) {
const E_FLOAT epairhalf = 0.5 * epair;
if (i < c.nlocal) c.d_eatom[i] += epairhalf;
if (j < c.nlocal) c.d_eatom[j] += epairhalf;
}
}
if (VFLAG) {
const E_FLOAT v0 = delx*delx*fpair;
const E_FLOAT v1 = dely*dely*fpair;
const E_FLOAT v2 = delz*delz*fpair;
const E_FLOAT v3 = delx*dely*fpair;
const E_FLOAT v4 = delx*delz*fpair;
const E_FLOAT v5 = dely*delz*fpair;
if (c.vflag_global) {
ev.v[0] += 0.5*v0;
ev.v[1] += 0.5*v1;
ev.v[2] += 0.5*v2;
ev.v[3] += 0.5*v3;
ev.v[4] += 0.5*v4;
ev.v[5] += 0.5*v5;
}
if (c.vflag_atom) {
if (i < c.nlocal) {
c.d_vatom(i,0) += 0.5*v0;
c.d_vatom(i,1) += 0.5*v1;
c.d_vatom(i,2) += 0.5*v2;
c.d_vatom(i,3) += 0.5*v3;
c.d_vatom(i,4) += 0.5*v4;
c.d_vatom(i,5) += 0.5*v5;
}
}
}
}
KOKKOS_INLINE_FUNCTION
void operator()(const int i) const {
compute_item<0,0>(i,list,typename DoCoul<PairStyle::COUL_FLAG>::type());
}
KOKKOS_INLINE_FUNCTION
void operator()(const int i, value_type &energy_virial) const {
energy_virial += compute_item<1,0>(i,list,typename DoCoul<PairStyle::COUL_FLAG>::type());
}
};
// Filter out Neighflags which are not supported for PairStyle.
// The enable_if clause invalidates the last parameter of the function, so that
// a match is only achieved if PairStyle supports the specific neighbor list variant.
// This uses the fact that failure to match template parameters is not an error (SFINAE).
// Because one overload carries the enable_if condition negated with ! and the other does not,
// exactly one of the two versions of pair_compute_neighlist and pair_compute_fullcluster
// will match - either the dummy version or the real one further below.
template<class PairStyle, unsigned NEIGHFLAG, class Specialisation>
EV_FLOAT pair_compute_neighlist (PairStyle* fpair, typename Kokkos::Impl::enable_if<!(NEIGHFLAG&PairStyle::EnabledNeighFlags), NeighListKokkos<typename PairStyle::device_type>*>::type list) {
EV_FLOAT ev;
(void) fpair;
(void) list;
printf("ERROR: calling pair_compute with invalid neighbor list style: requested %i available %i \n",NEIGHFLAG,PairStyle::EnabledNeighFlags);
return ev;
}
template<class PairStyle, class Specialisation>
EV_FLOAT pair_compute_fullcluster (PairStyle* fpair, typename Kokkos::Impl::enable_if<!(FULLCLUSTER&PairStyle::EnabledNeighFlags), NeighListKokkos<typename PairStyle::device_type>*>::type list) {
EV_FLOAT ev;
(void) fpair;
(void) list;
printf("ERROR: calling pair_compute with invalid neighbor list style: requested %i available %i \n",FULLCLUSTER,PairStyle::EnabledNeighFlags);
return ev;
}
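// Illustrative sketch only, not used anywhere in LAMMPS: the same enable_if
// selection reduced to its core. The overload whose condition is false has its
// parameter type invalidated and silently drops out of overload resolution, so
// exactly one candidate remains for any Style/FLAG combination. ExampleStyle
// and example_dispatch are hypothetical names introduced just for this sketch.
struct ExampleStyle { enum {EnabledNeighFlags = 1|2}; };  // supports flag bits 1 and 2 only
template<class Style, unsigned FLAG>
int example_dispatch(typename Kokkos::Impl::enable_if<!(FLAG & Style::EnabledNeighFlags), int>::type) {
  return 0;  // "dummy" version: Style does not support FLAG
}
template<class Style, unsigned FLAG>
int example_dispatch(typename Kokkos::Impl::enable_if<(FLAG & Style::EnabledNeighFlags), int>::type) {
  return 1;  // "real" version: Style supports FLAG
}
// example_dispatch<ExampleStyle,2>(0) picks the real version and returns 1;
// example_dispatch<ExampleStyle,4>(0) picks the dummy one and returns 0.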
// Submit ParallelFor for NEIGHFLAG=HALF,HALFTHREAD,FULL,N2
template<class PairStyle, unsigned NEIGHFLAG, class Specialisation>
EV_FLOAT pair_compute_neighlist (PairStyle* fpair, typename Kokkos::Impl::enable_if<NEIGHFLAG&PairStyle::EnabledNeighFlags, NeighListKokkos<typename PairStyle::device_type>*>::type list) {
EV_FLOAT ev;
if(fpair->atom->ntypes > MAX_TYPES_STACKPARAMS) {
PairComputeFunctor<PairStyle,NEIGHFLAG,false,Specialisation > ff(fpair,list);
if (fpair->eflag || fpair->vflag) Kokkos::parallel_reduce(list->inum,ff,ev);
else Kokkos::parallel_for(list->inum,ff);
} else {
PairComputeFunctor<PairStyle,NEIGHFLAG,true,Specialisation > ff(fpair,list);
if (fpair->eflag || fpair->vflag) Kokkos::parallel_reduce(list->inum,ff,ev);
else Kokkos::parallel_for(list->inum,ff);
}
return ev;
}
// Submit ParallelFor for NEIGHFLAG=FULLCLUSTER
template<class PairStyle, class Specialisation>
EV_FLOAT pair_compute_fullcluster (PairStyle* fpair, typename Kokkos::Impl::enable_if<FULLCLUSTER&PairStyle::EnabledNeighFlags, NeighListKokkos<typename PairStyle::device_type>*>::type list) {
EV_FLOAT ev;
if(fpair->atom->ntypes > MAX_TYPES_STACKPARAMS) {
typedef PairComputeFunctor<PairStyle,FULLCLUSTER,false,Specialisation >
f_type;
f_type ff(fpair, list);
#ifdef KOKKOS_HAVE_CUDA
- const int teamsize = Kokkos::Impl::is_same<typename f_type::device_type, Kokkos::Cuda>::value ? 256 : 1;
+ const int teamsize = Kokkos::Impl::is_same<typename f_type::device_type, Kokkos::Cuda>::value ? 32 : 1;
#else
const int teamsize = 1;
#endif
- const int nteams = (list->inum*f_type::vectorization::increment+teamsize-1)/teamsize;
- Kokkos::TeamPolicy<typename f_type::device_type> config(nteams,teamsize);
+ const int nteams = (list->inum+teamsize-1)/teamsize;
+ Kokkos::TeamPolicy<typename f_type::device_type> config(nteams,teamsize,NeighClusterSize);
if (fpair->eflag || fpair->vflag) Kokkos::parallel_reduce(config,ff,ev);
else Kokkos::parallel_for(config,ff);
} else {
typedef PairComputeFunctor<PairStyle,FULLCLUSTER,true,Specialisation >
f_type;
f_type ff(fpair, list);
#ifdef KOKKOS_HAVE_CUDA
- const int teamsize = Kokkos::Impl::is_same<typename f_type::device_type, Kokkos::Cuda>::value ? 256 : 1;
+ const int teamsize = Kokkos::Impl::is_same<typename f_type::device_type, Kokkos::Cuda>::value ? 32 : 1;
#else
const int teamsize = 1;
#endif
- const int nteams = (list->inum*f_type::vectorization::increment+teamsize-1)/teamsize;
- Kokkos::TeamPolicy<typename f_type::device_type> config(nteams,teamsize);
+ const int nteams = (list->inum+teamsize-1)/teamsize;
+ Kokkos::TeamPolicy<typename f_type::device_type> config(nteams,teamsize,NeighClusterSize);
if (fpair->eflag || fpair->vflag) Kokkos::parallel_reduce(config,ff,ev);
else Kokkos::parallel_for(config,ff);
}
return ev;
}
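// Sketch (assumption, not part of the upstream code): with the three-argument
// TeamPolicy above the vector length comes from NeighClusterSize, so the league
// size is a plain ceiling division that makes nteams*teamsize just cover all
// list->inum entries. example_league_size is a hypothetical helper name.
inline int example_league_size(const int inum, const int teamsize) {
  return (inum + teamsize - 1) / teamsize;  // e.g. inum=1000, teamsize=32 -> 32 teams
}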
template<class PairStyle, class Specialisation>
EV_FLOAT pair_compute (PairStyle* fpair, NeighListKokkos<typename PairStyle::device_type>* list) {
EV_FLOAT ev;
if (fpair->neighflag == FULL) {
ev = pair_compute_neighlist<PairStyle,FULL,Specialisation> (fpair,list);
} else if (fpair->neighflag == HALFTHREAD) {
ev = pair_compute_neighlist<PairStyle,HALFTHREAD,Specialisation> (fpair,list);
} else if (fpair->neighflag == HALF) {
ev = pair_compute_neighlist<PairStyle,HALF,Specialisation> (fpair,list);
} else if (fpair->neighflag == N2) {
ev = pair_compute_neighlist<PairStyle,N2,Specialisation> (fpair,list);
} else if (fpair->neighflag == FULLCLUSTER) {
ev = pair_compute_fullcluster<PairStyle,Specialisation> (fpair,list);
}
return ev;
}
template<class DeviceType>
struct PairVirialFDotRCompute {
typedef ArrayTypes<DeviceType> AT;
typedef EV_FLOAT value_type;
- typename AT::t_x_array_const x;
- typename AT::t_f_array_const f;
+ typename AT::t_x_array_const_um x;
+ typename AT::t_f_array_const_um f;
const int offset;
- PairVirialFDotRCompute( typename AT::t_x_array_const x_,
- typename AT::t_f_array_const f_,
+ PairVirialFDotRCompute( typename AT::t_x_array_const_um x_,
+ typename AT::t_f_array_const_um f_,
const int offset_):x(x_),f(f_),offset(offset_) {}
KOKKOS_INLINE_FUNCTION
void operator()(const int j, value_type &energy_virial) const {
const int i = j + offset;
energy_virial.v[0] += f(i,0)*x(i,0);
energy_virial.v[1] += f(i,1)*x(i,1);
energy_virial.v[2] += f(i,2)*x(i,2);
energy_virial.v[3] += f(i,1)*x(i,0);
energy_virial.v[4] += f(i,2)*x(i,0);
energy_virial.v[5] += f(i,2)*x(i,1);
}
};
template<class PairStyle>
void pair_virial_fdotr_compute(PairStyle* fpair) {
EV_FLOAT virial;
if (fpair->neighbor->includegroup == 0) {
int nall = fpair->atom->nlocal + fpair->atom->nghost;
Kokkos::parallel_reduce(nall,PairVirialFDotRCompute<typename PairStyle::device_type>(fpair->x,fpair->f,0),virial);
} else {
Kokkos::parallel_reduce(fpair->atom->nfirst,PairVirialFDotRCompute<typename PairStyle::device_type>(fpair->x,fpair->f,0),virial);
EV_FLOAT virial_ghost;
Kokkos::parallel_reduce(fpair->atom->nghost,PairVirialFDotRCompute<typename PairStyle::device_type>(fpair->x,fpair->f,fpair->atom->nlocal),virial_ghost);
virial+=virial_ghost;
}
fpair->vflag_fdotr = 0;
fpair->virial[0] = virial.v[0];
fpair->virial[1] = virial.v[1];
fpair->virial[2] = virial.v[2];
fpair->virial[3] = virial.v[3];
fpair->virial[4] = virial.v[4];
fpair->virial[5] = virial.v[5];
}
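// The reduction above accumulates the standard f-dot-r virial,
//   W_ab = sum_i x_(i,a) * f_(i,b),
// over owned atoms (plus a separate pass over ghosts when includegroup is set),
// storing the six independent components xx,yy,zz,xy,xz,yz in virial[0..5].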
}
#endif
#endif
/* ERROR/WARNING messages:
*/
diff --git a/src/KOKKOS/pair_table_kokkos.cpp b/src/KOKKOS/pair_table_kokkos.cpp
index dfd6787c9..f0c6068bb 100644
--- a/src/KOKKOS/pair_table_kokkos.cpp
+++ b/src/KOKKOS/pair_table_kokkos.cpp
@@ -1,1382 +1,1382 @@
/* ----------------------------------------------------------------------
LAMMPS - Large-scale Atomic/Molecular Massively Parallel Simulator
http://lammps.sandia.gov, Sandia National Laboratories
Steve Plimpton, sjplimp@sandia.gov
Copyright (2003) Sandia Corporation. Under the terms of Contract
DE-AC04-94AL85000 with Sandia Corporation, the U.S. Government retains
certain rights in this software. This software is distributed under
the GNU General Public License.
See the README file in the top-level LAMMPS directory.
------------------------------------------------------------------------- */
/* ----------------------------------------------------------------------
Contributing author: Paul Crozier (SNL)
------------------------------------------------------------------------- */
#include "mpi.h"
#include "math.h"
#include "stdlib.h"
#include "string.h"
#include "pair_table_kokkos.h"
#include "kokkos.h"
#include "atom.h"
#include "force.h"
#include "comm.h"
#include "neighbor.h"
#include "neigh_list.h"
#include "neigh_request.h"
#include "memory.h"
#include "error.h"
#include "atom_masks.h"
using namespace LAMMPS_NS;
enum{NONE,RLINEAR,RSQ,BMP};
enum{FULL,HALFTHREAD,HALF};
#define MAXLINE 1024
/* ---------------------------------------------------------------------- */
template<class DeviceType>
PairTableKokkos<DeviceType>::PairTableKokkos(LAMMPS *lmp) : Pair(lmp)
{
update_table = 0;
atomKK = (AtomKokkos *) atom;
ntables = 0;
tables = NULL;
execution_space = ExecutionSpaceFromDevice<DeviceType>::space;
datamask_read = X_MASK | F_MASK | TYPE_MASK | ENERGY_MASK | VIRIAL_MASK;
datamask_modify = F_MASK | ENERGY_MASK | VIRIAL_MASK;
h_table = new TableHost();
d_table = new TableDevice();
}
/* ---------------------------------------------------------------------- */
template<class DeviceType>
PairTableKokkos<DeviceType>::~PairTableKokkos()
{
/* for (int m = 0; m < ntables; m++) free_table(&tables[m]);
memory->sfree(tables);
if (allocated) {
memory->destroy(setflag);
memory->destroy(cutsq);
memory->destroy(tabindex);
}*/
delete h_table;
delete d_table;
}
/* ---------------------------------------------------------------------- */
template<class DeviceType>
void PairTableKokkos<DeviceType>::compute(int eflag_in, int vflag_in)
{
if(update_table)
create_kokkos_tables();
if(tabstyle == LOOKUP)
compute_style<LOOKUP>(eflag_in,vflag_in);
if(tabstyle == LINEAR)
compute_style<LINEAR>(eflag_in,vflag_in);
if(tabstyle == SPLINE)
compute_style<SPLINE>(eflag_in,vflag_in);
if(tabstyle == BITMAP)
compute_style<BITMAP>(eflag_in,vflag_in);
}
template<class DeviceType>
template<int TABSTYLE>
void PairTableKokkos<DeviceType>::compute_style(int eflag_in, int vflag_in)
{
eflag = eflag_in;
vflag = vflag_in;
if (neighflag == FULL || neighflag == FULLCLUSTER) no_virial_fdotr_compute = 1;
if (eflag || vflag) ev_setup(eflag,vflag);
else evflag = vflag_fdotr = 0;
atomKK->sync(execution_space,datamask_read);
//k_cutsq.template sync<DeviceType>();
//k_params.template sync<DeviceType>();
if (eflag || vflag) atomKK->modified(execution_space,datamask_modify);
else atomKK->modified(execution_space,F_MASK);
x = c_x = atomKK->k_x.view<DeviceType>();
f = atomKK->k_f.view<DeviceType>();
type = atomKK->k_type.view<DeviceType>();
nlocal = atom->nlocal;
nall = atom->nlocal + atom->nghost;
special_lj[0] = force->special_lj[0];
special_lj[1] = force->special_lj[1];
special_lj[2] = force->special_lj[2];
special_lj[3] = force->special_lj[3];
newton_pair = force->newton_pair;
d_cutsq = d_table->cutsq;
// loop over neighbors of my atoms
EV_FLOAT ev;
if(atom->ntypes > MAX_TYPES_STACKPARAMS) {
if (neighflag == FULL) {
PairComputeFunctor<PairTableKokkos<DeviceType>,FULL,false,S_TableCompute<DeviceType,TABSTYLE> >
ff(this,(NeighListKokkos<DeviceType>*) list);
if (eflag || vflag) Kokkos::parallel_reduce(list->inum,ff,ev);
else Kokkos::parallel_for(list->inum,ff);
} else if (neighflag == HALFTHREAD) {
PairComputeFunctor<PairTableKokkos<DeviceType>,HALFTHREAD,false,S_TableCompute<DeviceType,TABSTYLE> >
ff(this,(NeighListKokkos<DeviceType>*) list);
if (eflag || vflag) Kokkos::parallel_reduce(list->inum,ff,ev);
else Kokkos::parallel_for(list->inum,ff);
} else if (neighflag == HALF) {
PairComputeFunctor<PairTableKokkos<DeviceType>,HALF,false,S_TableCompute<DeviceType,TABSTYLE> >
f(this,(NeighListKokkos<DeviceType>*) list);
if (eflag || vflag) Kokkos::parallel_reduce(list->inum,f,ev);
else Kokkos::parallel_for(list->inum,f);
} else if (neighflag == N2) {
PairComputeFunctor<PairTableKokkos<DeviceType>,N2,false,S_TableCompute<DeviceType,TABSTYLE> >
f(this,(NeighListKokkos<DeviceType>*) list);
if (eflag || vflag) Kokkos::parallel_reduce(nlocal,f,ev);
else Kokkos::parallel_for(nlocal,f);
} else if (neighflag == FULLCLUSTER) {
typedef PairComputeFunctor<PairTableKokkos<DeviceType>,FULLCLUSTER,false,S_TableCompute<DeviceType,TABSTYLE> >
f_type;
f_type f(this,(NeighListKokkos<DeviceType>*) list);
#ifdef KOKKOS_HAVE_CUDA
- const int teamsize = Kokkos::Impl::is_same<typename f_type::device_type, Kokkos::Cuda>::value ? 256 : 1;
+ const int teamsize = Kokkos::Impl::is_same<DeviceType, Kokkos::Cuda>::value ? 32 : 1;
#else
const int teamsize = 1;
#endif
- const int nteams = (list->inum*f_type::vectorization::increment+teamsize-1)/teamsize;
- Kokkos::TeamPolicy<DeviceType> config(nteams,teamsize);
+ const int nteams = (list->inum+teamsize-1)/teamsize;
+ Kokkos::TeamPolicy<DeviceType> config(nteams,teamsize,NeighClusterSize);
if (eflag || vflag) Kokkos::parallel_reduce(config,f,ev);
else Kokkos::parallel_for(config,f);
}
} else {
if (neighflag == FULL) {
PairComputeFunctor<PairTableKokkos<DeviceType>,FULL,true,S_TableCompute<DeviceType,TABSTYLE> >
f(this,(NeighListKokkos<DeviceType>*) list);
if (eflag || vflag) Kokkos::parallel_reduce(list->inum,f,ev);
else Kokkos::parallel_for(list->inum,f);
} else if (neighflag == HALFTHREAD) {
PairComputeFunctor<PairTableKokkos<DeviceType>,HALFTHREAD,true,S_TableCompute<DeviceType,TABSTYLE> >
f(this,(NeighListKokkos<DeviceType>*) list);
if (eflag || vflag) Kokkos::parallel_reduce(list->inum,f,ev);
else Kokkos::parallel_for(list->inum,f);
} else if (neighflag == HALF) {
PairComputeFunctor<PairTableKokkos<DeviceType>,HALF,true,S_TableCompute<DeviceType,TABSTYLE> >
f(this,(NeighListKokkos<DeviceType>*) list);
if (eflag || vflag) Kokkos::parallel_reduce(list->inum,f,ev);
else Kokkos::parallel_for(list->inum,f);
} else if (neighflag == N2) {
PairComputeFunctor<PairTableKokkos<DeviceType>,N2,true,S_TableCompute<DeviceType,TABSTYLE> >
f(this,(NeighListKokkos<DeviceType>*) list);
if (eflag || vflag) Kokkos::parallel_reduce(nlocal,f,ev);
else Kokkos::parallel_for(nlocal,f);
} else if (neighflag == FULLCLUSTER) {
typedef PairComputeFunctor<PairTableKokkos<DeviceType>,FULLCLUSTER,true,S_TableCompute<DeviceType,TABSTYLE> >
f_type;
f_type f(this,(NeighListKokkos<DeviceType>*) list);
#ifdef KOKKOS_HAVE_CUDA
- const int teamsize = Kokkos::Impl::is_same<typename f_type::device_type, Kokkos::Cuda>::value ? 256 : 1;
+ const int teamsize = Kokkos::Impl::is_same<DeviceType, Kokkos::Cuda>::value ? 32 : 1;
#else
const int teamsize = 1;
#endif
- const int nteams = (list->inum*f_type::vectorization::increment+teamsize-1)/teamsize;
- Kokkos::TeamPolicy<DeviceType> config(nteams,teamsize);
+ const int nteams = (list->inum+teamsize-1)/teamsize;
+ Kokkos::TeamPolicy<DeviceType> config(nteams,teamsize,NeighClusterSize);
if (eflag || vflag) Kokkos::parallel_reduce(config,f,ev);
else Kokkos::parallel_for(config,f);
}
}
DeviceType::fence();
if (eflag) eng_vdwl += ev.evdwl;
if (vflag_global) {
virial[0] += ev.v[0];
virial[1] += ev.v[1];
virial[2] += ev.v[2];
virial[3] += ev.v[3];
virial[4] += ev.v[4];
virial[5] += ev.v[5];
}
if (vflag_fdotr) pair_virial_fdotr_compute(this);
}
template<class DeviceType>
template<bool STACKPARAMS, class Specialisation>
KOKKOS_INLINE_FUNCTION
F_FLOAT PairTableKokkos<DeviceType>::
compute_fpair(const F_FLOAT& rsq, const int& i, const int&j, const int& itype, const int& jtype) const {
(void) i;
(void) j;
union_int_float_t rsq_lookup;
double fpair;
const int tidx = d_table_const.tabindex(itype,jtype);
//const Table* const tb = &tables[tabindex[itype][jtype]];
//if (rsq < d_table_const.innersq(tidx))
// error->one(FLERR,"Pair distance < table inner cutoff");
if (Specialisation::TabStyle == LOOKUP) {
const int itable = static_cast<int> ((rsq - d_table_const.innersq(tidx)) * d_table_const.invdelta(tidx));
//if (itable >= tlm1)
// error->one(FLERR,"Pair distance > table outer cutoff");
fpair = d_table_const.f(tidx,itable);
} else if (Specialisation::TabStyle == LINEAR) {
const int itable = static_cast<int> ((rsq - d_table_const.innersq(tidx)) * d_table_const.invdelta(tidx));
//if (itable >= tlm1)
// error->one(FLERR,"Pair distance > table outer cutoff");
const double fraction = (rsq - d_table_const.rsq(tidx,itable)) * d_table_const.invdelta(tidx);
fpair = d_table_const.f(tidx,itable) + fraction*d_table_const.df(tidx,itable);
} else if (Specialisation::TabStyle == SPLINE) {
const int itable = static_cast<int> ((rsq - d_table_const.innersq(tidx)) * d_table_const.invdelta(tidx));
//if (itable >= tlm1)
// error->one(FLERR,"Pair distance > table outer cutoff");
const double b = (rsq - d_table_const.rsq(tidx,itable)) * d_table_const.invdelta(tidx);
const double a = 1.0 - b;
fpair = a * d_table_const.f(tidx,itable) + b * d_table_const.f(tidx,itable+1) +
((a*a*a-a)*d_table_const.f2(tidx,itable) + (b*b*b-b)*d_table_const.f2(tidx,itable+1)) *
d_table_const.deltasq6(tidx);
} else {
rsq_lookup.f = rsq;
int itable = rsq_lookup.i & d_table_const.nmask(tidx);
itable >>= d_table_const.nshiftbits(tidx);
const double fraction = (rsq_lookup.f - d_table_const.rsq(tidx,itable)) * d_table_const.drsq(tidx,itable);
fpair = d_table_const.f(tidx,itable) + fraction*d_table_const.df(tidx,itable);
}
return fpair;
}
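// Worked example for the LINEAR branch above (numbers are hypothetical): with
// innersq = 1.0, invdelta = 10.0 (i.e. delta = 0.1) and rsq = 1.37,
//   itable   = (int)((1.37 - 1.0)*10.0)  = 3
//   fraction = (1.37 - rsq(tidx,3))*10.0 = (1.37 - 1.30)*10.0 = 0.7
// so fpair = f(tidx,3) + 0.7*df(tidx,3), i.e. a linear blend between the
// tabulated f/r values at the lower edges of bins 3 and 4.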
template<class DeviceType>
template<bool STACKPARAMS, class Specialisation>
KOKKOS_INLINE_FUNCTION
F_FLOAT PairTableKokkos<DeviceType>::
compute_evdwl(const F_FLOAT& rsq, const int& i, const int&j, const int& itype, const int& jtype) const {
(void) i;
(void) j;
double evdwl;
union_int_float_t rsq_lookup;
const int tidx = d_table_const.tabindex(itype,jtype);
//const Table* const tb = &tables[tabindex[itype][jtype]];
//if (rsq < d_table_const.innersq(tidx))
// error->one(FLERR,"Pair distance < table inner cutoff");
if (Specialisation::TabStyle == LOOKUP) {
const int itable = static_cast<int> ((rsq - d_table_const.innersq(tidx)) * d_table_const.invdelta(tidx));
//if (itable >= tlm1)
// error->one(FLERR,"Pair distance > table outer cutoff");
evdwl = d_table_const.e(tidx,itable);
} else if (Specialisation::TabStyle == LINEAR) {
const int itable = static_cast<int> ((rsq - d_table_const.innersq(tidx)) * d_table_const.invdelta(tidx));
//if (itable >= tlm1)
// error->one(FLERR,"Pair distance > table outer cutoff");
const double fraction = (rsq - d_table_const.rsq(tidx,itable)) * d_table_const.invdelta(tidx);
evdwl = d_table_const.e(tidx,itable) + fraction*d_table_const.de(tidx,itable);
} else if (Specialisation::TabStyle == SPLINE) {
const int itable = static_cast<int> ((rsq - d_table_const.innersq(tidx)) * d_table_const.invdelta(tidx));
//if (itable >= tlm1)
// error->one(FLERR,"Pair distance > table outer cutoff");
const double b = (rsq - d_table_const.rsq(tidx,itable)) * d_table_const.invdelta(tidx);
const double a = 1.0 - b;
evdwl = a * d_table_const.e(tidx,itable) + b * d_table_const.e(tidx,itable+1) +
((a*a*a-a)*d_table_const.e2(tidx,itable) + (b*b*b-b)*d_table_const.e2(tidx,itable+1)) *
d_table_const.deltasq6(tidx);
} else {
rsq_lookup.f = rsq;
int itable = rsq_lookup.i & d_table_const.nmask(tidx);
itable >>= d_table_const.nshiftbits(tidx);
const double fraction = (rsq_lookup.f - d_table_const.rsq(tidx,itable)) * d_table_const.drsq(tidx,itable);
evdwl = d_table_const.e(tidx,itable) + fraction*d_table_const.de(tidx,itable);
}
return evdwl;
}
template<class DeviceType>
void PairTableKokkos<DeviceType>::create_kokkos_tables()
{
const int tlm1 = tablength-1;
memory->create_kokkos(d_table->nshiftbits,h_table->nshiftbits,ntables,"Table::nshiftbits");
memory->create_kokkos(d_table->nmask,h_table->nmask,ntables,"Table::nmask");
memory->create_kokkos(d_table->innersq,h_table->innersq,ntables,"Table::innersq");
memory->create_kokkos(d_table->invdelta,h_table->invdelta,ntables,"Table::invdelta");
memory->create_kokkos(d_table->deltasq6,h_table->deltasq6,ntables,"Table::deltasq6");
if(tabstyle == LOOKUP) {
memory->create_kokkos(d_table->e,h_table->e,ntables,tlm1,"Table::e");
memory->create_kokkos(d_table->f,h_table->f,ntables,tlm1,"Table::f");
}
if(tabstyle == LINEAR) {
memory->create_kokkos(d_table->rsq,h_table->rsq,ntables,tablength,"Table::rsq");
memory->create_kokkos(d_table->e,h_table->e,ntables,tablength,"Table::e");
memory->create_kokkos(d_table->f,h_table->f,ntables,tablength,"Table::f");
memory->create_kokkos(d_table->de,h_table->de,ntables,tlm1,"Table::de");
memory->create_kokkos(d_table->df,h_table->df,ntables,tlm1,"Table::df");
}
if(tabstyle == SPLINE) {
memory->create_kokkos(d_table->rsq,h_table->rsq,ntables,tablength,"Table::rsq");
memory->create_kokkos(d_table->e,h_table->e,ntables,tablength,"Table::e");
memory->create_kokkos(d_table->f,h_table->f,ntables,tablength,"Table::f");
memory->create_kokkos(d_table->e2,h_table->e2,ntables,tablength,"Table::e2");
memory->create_kokkos(d_table->f2,h_table->f2,ntables,tablength,"Table::f2");
}
if(tabstyle == BITMAP) {
int ntable = 1 << tablength;
memory->create_kokkos(d_table->rsq,h_table->rsq,ntables,ntable,"Table::rsq");
memory->create_kokkos(d_table->e,h_table->e,ntables,ntable,"Table::e");
memory->create_kokkos(d_table->f,h_table->f,ntables,ntable,"Table::f");
memory->create_kokkos(d_table->de,h_table->de,ntables,ntable,"Table::de");
memory->create_kokkos(d_table->df,h_table->df,ntables,ntable,"Table::df");
memory->create_kokkos(d_table->drsq,h_table->drsq,ntables,ntable,"Table::drsq");
}
for(int i=0; i < ntables; i++) {
Table* tb = &tables[i];
h_table->nshiftbits[i] = tb->nshiftbits;
h_table->nmask[i] = tb->nmask;
h_table->innersq[i] = tb->innersq;
h_table->invdelta[i] = tb->invdelta;
h_table->deltasq6[i] = tb->deltasq6;
for(int j = 0; j<h_table->rsq.dimension_1(); j++)
h_table->rsq(i,j) = tb->rsq[j];
for(int j = 0; j<h_table->drsq.dimension_1(); j++)
h_table->drsq(i,j) = tb->drsq[j];
for(int j = 0; j<h_table->e.dimension_1(); j++)
h_table->e(i,j) = tb->e[j];
for(int j = 0; j<h_table->de.dimension_1(); j++)
h_table->de(i,j) = tb->de[j];
for(int j = 0; j<h_table->f.dimension_1(); j++)
h_table->f(i,j) = tb->f[j];
for(int j = 0; j<h_table->df.dimension_1(); j++)
h_table->df(i,j) = tb->df[j];
for(int j = 0; j<h_table->e2.dimension_1(); j++)
h_table->e2(i,j) = tb->e2[j];
for(int j = 0; j<h_table->f2.dimension_1(); j++)
h_table->f2(i,j) = tb->f2[j];
}
Kokkos::deep_copy(d_table->nshiftbits,h_table->nshiftbits);
Kokkos::deep_copy(d_table->nmask,h_table->nmask);
Kokkos::deep_copy(d_table->innersq,h_table->innersq);
Kokkos::deep_copy(d_table->invdelta,h_table->invdelta);
Kokkos::deep_copy(d_table->deltasq6,h_table->deltasq6);
Kokkos::deep_copy(d_table->rsq,h_table->rsq);
Kokkos::deep_copy(d_table->drsq,h_table->drsq);
Kokkos::deep_copy(d_table->e,h_table->e);
Kokkos::deep_copy(d_table->de,h_table->de);
Kokkos::deep_copy(d_table->f,h_table->f);
Kokkos::deep_copy(d_table->df,h_table->df);
Kokkos::deep_copy(d_table->e2,h_table->e2);
Kokkos::deep_copy(d_table->f2,h_table->f2);
Kokkos::deep_copy(d_table->tabindex,h_table->tabindex);
d_table_const.nshiftbits = d_table->nshiftbits;
d_table_const.nmask = d_table->nmask;
d_table_const.innersq = d_table->innersq;
d_table_const.invdelta = d_table->invdelta;
d_table_const.deltasq6 = d_table->deltasq6;
d_table_const.rsq = d_table->rsq;
d_table_const.drsq = d_table->drsq;
d_table_const.e = d_table->e;
d_table_const.de = d_table->de;
d_table_const.f = d_table->f;
d_table_const.df = d_table->df;
d_table_const.e2 = d_table->e2;
d_table_const.f2 = d_table->f2;
Kokkos::deep_copy(d_table->cutsq,h_table->cutsq);
update_table = 0;
}
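// The function above follows the usual Kokkos host/device mirroring pattern:
// fill the host views, then deep_copy them to the device once per table rebuild.
// A generic sketch of that pattern (names are illustrative, not LAMMPS API):
//   Kokkos::View<double*,LMPDeviceType> d_v("example",n);
//   typename Kokkos::View<double*,LMPDeviceType>::HostMirror h_v =
//     Kokkos::create_mirror_view(d_v);
//   for (int i = 0; i < n; i++) h_v(i) = some_host_data[i];
//   Kokkos::deep_copy(d_v,h_v);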
/* ----------------------------------------------------------------------
allocate all arrays
------------------------------------------------------------------------- */
template<class DeviceType>
void PairTableKokkos<DeviceType>::allocate()
{
allocated = 1;
const int nt = atom->ntypes + 1;
memory->create(setflag,nt,nt,"pair:setflag");
memory->create_kokkos(d_table->cutsq,h_table->cutsq,cutsq,nt,nt,"pair:cutsq");
memory->create_kokkos(d_table->tabindex,h_table->tabindex,tabindex,nt,nt,"pair:tabindex");
d_table_const.cutsq = d_table->cutsq;
d_table_const.tabindex = d_table->tabindex;
memset(&setflag[0][0],0,nt*nt*sizeof(int));
memset(&cutsq[0][0],0,nt*nt*sizeof(double));
memset(&tabindex[0][0],0,nt*nt*sizeof(int));
}
/* ----------------------------------------------------------------------
global settings
------------------------------------------------------------------------- */
template<class DeviceType>
void PairTableKokkos<DeviceType>::settings(int narg, char **arg)
{
if (narg < 2) error->all(FLERR,"Illegal pair_style command");
// new settings
if (strcmp(arg[0],"lookup") == 0) tabstyle = LOOKUP;
else if (strcmp(arg[0],"linear") == 0) tabstyle = LINEAR;
else if (strcmp(arg[0],"spline") == 0) tabstyle = SPLINE;
else if (strcmp(arg[0],"bitmap") == 0) tabstyle = BITMAP;
else error->all(FLERR,"Unknown table style in pair_style command");
tablength = force->inumeric(FLERR,arg[1]);
if (tablength < 2) error->all(FLERR,"Illegal number of pair table entries");
// optional keywords
// assert the tabulation is compatible with a specific long-range solver
int iarg = 2;
while (iarg < narg) {
if (strcmp(arg[iarg],"ewald") == 0) ewaldflag = 1;
else if (strcmp(arg[iarg],"pppm") == 0) pppmflag = 1;
else if (strcmp(arg[iarg],"msm") == 0) msmflag = 1;
else if (strcmp(arg[iarg],"dispersion") == 0) dispersionflag = 1;
else if (strcmp(arg[iarg],"tip4p") == 0) tip4pflag = 1;
else error->all(FLERR,"Illegal pair_style command");
iarg++;
}
// delete old tables, since cannot just change settings
for (int m = 0; m < ntables; m++) free_table(&tables[m]);
memory->sfree(tables);
if (allocated) {
memory->destroy(setflag);
d_table_const.tabindex = d_table->tabindex = typename ArrayTypes<DeviceType>::t_int_2d();
h_table->tabindex = typename ArrayTypes<LMPHostType>::t_int_2d();
d_table_const.cutsq = d_table->cutsq = typename ArrayTypes<DeviceType>::t_ffloat_2d();
h_table->cutsq = typename ArrayTypes<LMPHostType>::t_ffloat_2d();
}
allocated = 0;
ntables = 0;
tables = NULL;
}
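// Example of an input script line handled by this settings() function
// (a sketch; the table length of 1000 is an arbitrary choice):
//   pair_style table/kk linear 1000
// arg[0] selects the interpolation style and arg[1] the number of table points;
// the optional trailing keywords (ewald, pppm, ...) only flag long-range compatibility.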
/* ----------------------------------------------------------------------
set coeffs for one or more type pairs
------------------------------------------------------------------------- */
template<class DeviceType>
void PairTableKokkos<DeviceType>::coeff(int narg, char **arg)
{
if (narg != 4 && narg != 5) error->all(FLERR,"Illegal pair_coeff command");
if (!allocated) allocate();
int ilo,ihi,jlo,jhi;
force->bounds(arg[0],atom->ntypes,ilo,ihi);
force->bounds(arg[1],atom->ntypes,jlo,jhi);
int me;
MPI_Comm_rank(world,&me);
tables = (Table *)
memory->srealloc(tables,(ntables+1)*sizeof(Table),"pair:tables");
Table *tb = &tables[ntables];
null_table(tb);
if (me == 0) read_table(tb,arg[2],arg[3]);
bcast_table(tb);
// set table cutoff
if (narg == 5) tb->cut = force->numeric(FLERR,arg[4]);
else if (tb->rflag) tb->cut = tb->rhi;
else tb->cut = tb->rfile[tb->ninput-1];
// error check on table parameters
// ensure cutoff is within table
// for BITMAP tables, file values can be in non-ascending order
if (tb->ninput <= 1) error->one(FLERR,"Invalid pair table length");
double rlo,rhi;
if (tb->rflag == 0) {
rlo = tb->rfile[0];
rhi = tb->rfile[tb->ninput-1];
} else {
rlo = tb->rlo;
rhi = tb->rhi;
}
if (tb->cut <= rlo || tb->cut > rhi)
error->all(FLERR,"Invalid pair table cutoff");
if (rlo <= 0.0) error->all(FLERR,"Invalid pair table cutoff");
// match = 1 if don't need to spline read-in tables
// this is only the case if r values needed by final tables
// exactly match r values read from file
// for tabstyle SPLINE, always need to build spline tables
tb->match = 0;
if (tabstyle == LINEAR && tb->ninput == tablength &&
tb->rflag == RSQ && tb->rhi == tb->cut) tb->match = 1;
if (tabstyle == BITMAP && tb->ninput == 1 << tablength &&
tb->rflag == BMP && tb->rhi == tb->cut) tb->match = 1;
if (tb->rflag == BMP && tb->match == 0)
error->all(FLERR,"Bitmapped table in file does not match requested table");
// spline read-in values and compute r,e,f vectors within table
if (tb->match == 0) spline_table(tb);
compute_table(tb);
// store ptr to table in tabindex
int count = 0;
for (int i = ilo; i <= ihi; i++) {
for (int j = MAX(jlo,i); j <= jhi; j++) {
tabindex[i][j] = ntables;
setflag[i][j] = 1;
count++;
}
}
if (count == 0) error->all(FLERR,"Illegal pair_coeff command");
ntables++;
}
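// Example pair_coeff line consumed by coeff() above (file name, keyword and
// cutoff are hypothetical):
//   pair_coeff 1 1 spce.table OO_TABLE 9.0
// arg[0]/arg[1] are the type ranges, arg[2] the table file, arg[3] the section
// keyword to look up, and the optional arg[4] overrides the cutoff from the file.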
/* ----------------------------------------------------------------------
init for one type pair i,j and corresponding j,i
------------------------------------------------------------------------- */
template<class DeviceType>
double PairTableKokkos<DeviceType>::init_one(int i, int j)
{
if (setflag[i][j] == 0) error->all(FLERR,"All pair coeffs are not set");
tabindex[j][i] = tabindex[i][j];
if(i<MAX_TYPES_STACKPARAMS+1 && j<MAX_TYPES_STACKPARAMS+1) {
m_cutsq[j][i] = m_cutsq[i][j] = tables[tabindex[i][j]].cut*tables[tabindex[i][j]].cut;
}
return tables[tabindex[i][j]].cut;
}
/* ----------------------------------------------------------------------
read a table section from a tabulated potential file
only called by proc 0
this function sets these values in Table:
ninput,rfile,efile,ffile,rflag,rlo,rhi,fpflag,fplo,fphi,ntablebits
------------------------------------------------------------------------- */
template<class DeviceType>
void PairTableKokkos<DeviceType>::read_table(Table *tb, char *file, char *keyword)
{
char line[MAXLINE];
// open file
FILE *fp = force->open_potential(file);
if (fp == NULL) {
char str[128];
sprintf(str,"Cannot open file %s",file);
error->one(FLERR,str);
}
// loop until section found with matching keyword
while (1) {
if (fgets(line,MAXLINE,fp) == NULL)
error->one(FLERR,"Did not find keyword in table file");
if (strspn(line," \t\n\r") == strlen(line)) continue; // blank line
if (line[0] == '#') continue; // comment
char *word = strtok(line," \t\n\r");
if (strcmp(word,keyword) == 0) break; // matching keyword
fgets(line,MAXLINE,fp); // no match, skip section
param_extract(tb,line);
fgets(line,MAXLINE,fp);
for (int i = 0; i < tb->ninput; i++) fgets(line,MAXLINE,fp);
}
// read args on 2nd line of section
// allocate table arrays for file values
fgets(line,MAXLINE,fp);
param_extract(tb,line);
memory->create(tb->rfile,tb->ninput,"pair:rfile");
memory->create(tb->efile,tb->ninput,"pair:efile");
memory->create(tb->ffile,tb->ninput,"pair:ffile");
// setup bitmap parameters for table to read in
tb->ntablebits = 0;
int masklo,maskhi,nmask,nshiftbits;
if (tb->rflag == BMP) {
while (1 << tb->ntablebits < tb->ninput) tb->ntablebits++;
if (1 << tb->ntablebits != tb->ninput)
error->one(FLERR,"Bitmapped table is incorrect length in table file");
init_bitmap(tb->rlo,tb->rhi,tb->ntablebits,masklo,maskhi,nmask,nshiftbits);
}
// read r,e,f table values from file
// if rflag set, compute r
// if rflag not set, use r from file
int itmp;
double rtmp;
union_int_float_t rsq_lookup;
fgets(line,MAXLINE,fp);
for (int i = 0; i < tb->ninput; i++) {
fgets(line,MAXLINE,fp);
sscanf(line,"%d %lg %lg %lg",&itmp,&rtmp,&tb->efile[i],&tb->ffile[i]);
if (tb->rflag == RLINEAR)
rtmp = tb->rlo + (tb->rhi - tb->rlo)*i/(tb->ninput-1);
else if (tb->rflag == RSQ) {
rtmp = tb->rlo*tb->rlo +
(tb->rhi*tb->rhi - tb->rlo*tb->rlo)*i/(tb->ninput-1);
rtmp = sqrt(rtmp);
} else if (tb->rflag == BMP) {
rsq_lookup.i = i << nshiftbits;
rsq_lookup.i |= masklo;
if (rsq_lookup.f < tb->rlo*tb->rlo) {
rsq_lookup.i = i << nshiftbits;
rsq_lookup.i |= maskhi;
}
rtmp = sqrtf(rsq_lookup.f);
}
tb->rfile[i] = rtmp;
}
// close file
fclose(fp);
}
/* ----------------------------------------------------------------------
broadcast read-in table info from proc 0 to other procs
this function communicates these values in Table:
ninput,rfile,efile,ffile,rflag,rlo,rhi,fpflag,fplo,fphi
------------------------------------------------------------------------- */
template<class DeviceType>
void PairTableKokkos<DeviceType>::bcast_table(Table *tb)
{
MPI_Bcast(&tb->ninput,1,MPI_INT,0,world);
int me;
MPI_Comm_rank(world,&me);
if (me > 0) {
memory->create(tb->rfile,tb->ninput,"pair:rfile");
memory->create(tb->efile,tb->ninput,"pair:efile");
memory->create(tb->ffile,tb->ninput,"pair:ffile");
}
MPI_Bcast(tb->rfile,tb->ninput,MPI_DOUBLE,0,world);
MPI_Bcast(tb->efile,tb->ninput,MPI_DOUBLE,0,world);
MPI_Bcast(tb->ffile,tb->ninput,MPI_DOUBLE,0,world);
MPI_Bcast(&tb->rflag,1,MPI_INT,0,world);
if (tb->rflag) {
MPI_Bcast(&tb->rlo,1,MPI_DOUBLE,0,world);
MPI_Bcast(&tb->rhi,1,MPI_DOUBLE,0,world);
}
MPI_Bcast(&tb->fpflag,1,MPI_INT,0,world);
if (tb->fpflag) {
MPI_Bcast(&tb->fplo,1,MPI_DOUBLE,0,world);
MPI_Bcast(&tb->fphi,1,MPI_DOUBLE,0,world);
}
}
/* ----------------------------------------------------------------------
build spline representation of e,f over entire range of read-in table
this function sets these values in Table: e2file,f2file
------------------------------------------------------------------------- */
template<class DeviceType>
void PairTableKokkos<DeviceType>::spline_table(Table *tb)
{
memory->create(tb->e2file,tb->ninput,"pair:e2file");
memory->create(tb->f2file,tb->ninput,"pair:f2file");
double ep0 = - tb->ffile[0];
double epn = - tb->ffile[tb->ninput-1];
spline(tb->rfile,tb->efile,tb->ninput,ep0,epn,tb->e2file);
if (tb->fpflag == 0) {
tb->fplo = (tb->ffile[1] - tb->ffile[0]) / (tb->rfile[1] - tb->rfile[0]);
tb->fphi = (tb->ffile[tb->ninput-1] - tb->ffile[tb->ninput-2]) /
(tb->rfile[tb->ninput-1] - tb->rfile[tb->ninput-2]);
}
double fp0 = tb->fplo;
double fpn = tb->fphi;
spline(tb->rfile,tb->ffile,tb->ninput,fp0,fpn,tb->f2file);
}
/* ----------------------------------------------------------------------
extract attributes from parameter line in table section
format of line: N value R/RSQ/BITMAP lo hi FP fplo fphi
N is required, other params are optional
------------------------------------------------------------------------- */
template<class DeviceType>
void PairTableKokkos<DeviceType>::param_extract(Table *tb, char *line)
{
tb->ninput = 0;
tb->rflag = NONE;
tb->fpflag = 0;
char *word = strtok(line," \t\n\r\f");
while (word) {
if (strcmp(word,"N") == 0) {
word = strtok(NULL," \t\n\r\f");
tb->ninput = atoi(word);
} else if (strcmp(word,"R") == 0 || strcmp(word,"RSQ") == 0 ||
strcmp(word,"BITMAP") == 0) {
if (strcmp(word,"R") == 0) tb->rflag = RLINEAR;
else if (strcmp(word,"RSQ") == 0) tb->rflag = RSQ;
else if (strcmp(word,"BITMAP") == 0) tb->rflag = BMP;
word = strtok(NULL," \t\n\r\f");
tb->rlo = atof(word);
word = strtok(NULL," \t\n\r\f");
tb->rhi = atof(word);
} else if (strcmp(word,"FP") == 0) {
tb->fpflag = 1;
word = strtok(NULL," \t\n\r\f");
tb->fplo = atof(word);
word = strtok(NULL," \t\n\r\f");
tb->fphi = atof(word);
} else {
error->one(FLERR,"Invalid keyword in pair table parameters");
}
word = strtok(NULL," \t\n\r\f");
}
if (tb->ninput == 0) error->one(FLERR,"Pair table parameters did not set N");
}
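// Sketch of a table file section as parsed by read_table()/param_extract()
// (keyword and values are made up for illustration):
//   OO_TABLE
//   N 500 R 1.02 10.0
//
//   1 1.02  212.4  -392.1
//   2 1.038 201.9  -380.6
//   ...
// "N 500" sets ninput; "R lo hi" (or RSQ/BITMAP) sets rflag and the range;
// each data line is: index, r, energy, force.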
/* ----------------------------------------------------------------------
compute r,e,f vectors from splined values
------------------------------------------------------------------------- */
template<class DeviceType>
void PairTableKokkos<DeviceType>::compute_table(Table *tb)
{
update_table = 1;
int tlm1 = tablength-1;
// inner = inner table bound
// cut = outer table bound
// delta = table spacing in rsq for N-1 bins
double inner;
if (tb->rflag) inner = tb->rlo;
else inner = tb->rfile[0];
tb->innersq = inner*inner;
tb->delta = (tb->cut*tb->cut - tb->innersq) / tlm1;
tb->invdelta = 1.0/tb->delta;
// direct lookup tables
// N-1 evenly spaced bins in rsq from inner to cut
// e,f = value at midpt of bin
// e,f are N-1 in length since store 1 value at bin midpt
// f is converted to f/r when stored in f[i]
// e,f are never a match to read-in values, always computed via spline interp
if (tabstyle == LOOKUP) {
memory->create(tb->e,tlm1,"pair:e");
memory->create(tb->f,tlm1,"pair:f");
double r,rsq;
for (int i = 0; i < tlm1; i++) {
rsq = tb->innersq + (i+0.5)*tb->delta;
r = sqrt(rsq);
tb->e[i] = splint(tb->rfile,tb->efile,tb->e2file,tb->ninput,r);
tb->f[i] = splint(tb->rfile,tb->ffile,tb->f2file,tb->ninput,r)/r;
}
}
// linear tables
// N-1 evenly spaced bins in rsq from inner to cut
// rsq,e,f = value at lower edge of bin
// de,df values = delta from lower edge to upper edge of bin
// rsq,e,f are N in length so de,df arrays can compute difference
// f is converted to f/r when stored in f[i]
// e,f can match read-in values, else compute via spline interp
if (tabstyle == LINEAR) {
memory->create(tb->rsq,tablength,"pair:rsq");
memory->create(tb->e,tablength,"pair:e");
memory->create(tb->f,tablength,"pair:f");
memory->create(tb->de,tlm1,"pair:de");
memory->create(tb->df,tlm1,"pair:df");
double r,rsq;
for (int i = 0; i < tablength; i++) {
rsq = tb->innersq + i*tb->delta;
r = sqrt(rsq);
tb->rsq[i] = rsq;
if (tb->match) {
tb->e[i] = tb->efile[i];
tb->f[i] = tb->ffile[i]/r;
} else {
tb->e[i] = splint(tb->rfile,tb->efile,tb->e2file,tb->ninput,r);
tb->f[i] = splint(tb->rfile,tb->ffile,tb->f2file,tb->ninput,r)/r;
}
}
for (int i = 0; i < tlm1; i++) {
tb->de[i] = tb->e[i+1] - tb->e[i];
tb->df[i] = tb->f[i+1] - tb->f[i];
}
}
// cubic spline tables
// N-1 evenly spaced bins in rsq from inner to cut
// rsq,e,f = value at lower edge of bin
// e2,f2 = spline coefficient for each bin
// rsq,e,f,e2,f2 are N in length so have N-1 spline bins
// f is converted to f/r after e is splined
// e,f can match read-in values, else compute via spline interp
if (tabstyle == SPLINE) {
memory->create(tb->rsq,tablength,"pair:rsq");
memory->create(tb->e,tablength,"pair:e");
memory->create(tb->f,tablength,"pair:f");
memory->create(tb->e2,tablength,"pair:e2");
memory->create(tb->f2,tablength,"pair:f2");
tb->deltasq6 = tb->delta*tb->delta / 6.0;
double r,rsq;
for (int i = 0; i < tablength; i++) {
rsq = tb->innersq + i*tb->delta;
r = sqrt(rsq);
tb->rsq[i] = rsq;
if (tb->match) {
tb->e[i] = tb->efile[i];
tb->f[i] = tb->ffile[i]/r;
} else {
tb->e[i] = splint(tb->rfile,tb->efile,tb->e2file,tb->ninput,r);
tb->f[i] = splint(tb->rfile,tb->ffile,tb->f2file,tb->ninput,r);
}
}
// ep0,epn = dh/dg at inner and at cut
// h(r) = e(r) and g(r) = r^2
// dh/dg = (de/dr) / 2r = -f/2r
double ep0 = - tb->f[0] / (2.0 * sqrt(tb->innersq));
double epn = - tb->f[tlm1] / (2.0 * tb->cut);
spline(tb->rsq,tb->e,tablength,ep0,epn,tb->e2);
// fp0,fpn = dh/dg at inner and at cut
// h(r) = f(r)/r and g(r) = r^2
// dh/dg = (1/r df/dr - f/r^2) / 2r
// dh/dg in secant approx = (f(r2)/r2 - f(r1)/r1) / (g(r2) - g(r1))
double fp0,fpn;
double secant_factor = 0.1;
if (tb->fpflag) fp0 = (tb->fplo/sqrt(tb->innersq) - tb->f[0]/tb->innersq) /
(2.0 * sqrt(tb->innersq));
else {
double rsq1 = tb->innersq;
double rsq2 = rsq1 + secant_factor*tb->delta;
fp0 = (splint(tb->rfile,tb->ffile,tb->f2file,tb->ninput,sqrt(rsq2)) /
sqrt(rsq2) - tb->f[0] / sqrt(rsq1)) / (secant_factor*tb->delta);
}
if (tb->fpflag && tb->cut == tb->rfile[tb->ninput-1]) fpn =
(tb->fphi/tb->cut - tb->f[tlm1]/(tb->cut*tb->cut)) / (2.0 * tb->cut);
else {
double rsq2 = tb->cut * tb->cut;
double rsq1 = rsq2 - secant_factor*tb->delta;
fpn = (tb->f[tlm1] / sqrt(rsq2) -
splint(tb->rfile,tb->ffile,tb->f2file,tb->ninput,sqrt(rsq1)) /
sqrt(rsq1)) / (secant_factor*tb->delta);
}
for (int i = 0; i < tablength; i++) tb->f[i] /= sqrt(tb->rsq[i]);
spline(tb->rsq,tb->f,tablength,fp0,fpn,tb->f2);
}
// bitmapped linear tables
// 2^N bins from inner to cut, spaced in bitmapped manner
// f is converted to f/r when stored in f[i]
// e,f can match read-in values, else compute via spline interp
if (tabstyle == BITMAP) {
double r;
union_int_float_t rsq_lookup;
int masklo,maskhi;
// linear lookup tables of length ntable = 2^n
// stored value = value at lower edge of bin
init_bitmap(inner,tb->cut,tablength,masklo,maskhi,tb->nmask,tb->nshiftbits);
int ntable = 1 << tablength;
int ntablem1 = ntable - 1;
memory->create(tb->rsq,ntable,"pair:rsq");
memory->create(tb->e,ntable,"pair:e");
memory->create(tb->f,ntable,"pair:f");
memory->create(tb->de,ntable,"pair:de");
memory->create(tb->df,ntable,"pair:df");
memory->create(tb->drsq,ntable,"pair:drsq");
union_int_float_t minrsq_lookup;
minrsq_lookup.i = 0 << tb->nshiftbits;
minrsq_lookup.i |= maskhi;
for (int i = 0; i < ntable; i++) {
rsq_lookup.i = i << tb->nshiftbits;
rsq_lookup.i |= masklo;
if (rsq_lookup.f < tb->innersq) {
rsq_lookup.i = i << tb->nshiftbits;
rsq_lookup.i |= maskhi;
}
r = sqrtf(rsq_lookup.f);
tb->rsq[i] = rsq_lookup.f;
if (tb->match) {
tb->e[i] = tb->efile[i];
tb->f[i] = tb->ffile[i]/r;
} else {
tb->e[i] = splint(tb->rfile,tb->efile,tb->e2file,tb->ninput,r);
tb->f[i] = splint(tb->rfile,tb->ffile,tb->f2file,tb->ninput,r)/r;
}
minrsq_lookup.f = MIN(minrsq_lookup.f,rsq_lookup.f);
}
tb->innersq = minrsq_lookup.f;
for (int i = 0; i < ntablem1; i++) {
tb->de[i] = tb->e[i+1] - tb->e[i];
tb->df[i] = tb->f[i+1] - tb->f[i];
tb->drsq[i] = 1.0/(tb->rsq[i+1] - tb->rsq[i]);
}
// get the delta values for the last table entries
// tables are connected periodically between 0 and ntablem1
tb->de[ntablem1] = tb->e[0] - tb->e[ntablem1];
tb->df[ntablem1] = tb->f[0] - tb->f[ntablem1];
tb->drsq[ntablem1] = 1.0/(tb->rsq[0] - tb->rsq[ntablem1]);
// get the correct delta values at itablemax
// smallest r is in bin itablemin
// largest r is in bin itablemax, which is itablemin-1,
// or ntablem1 if itablemin=0
// deltas at itablemax only needed if corresponding rsq < cut*cut
// if so, compute deltas between rsq and cut*cut
// if tb->match, data at cut*cut is unavailable, so we'll take
// deltas at itablemax-1 as a good approximation
double e_tmp,f_tmp;
int itablemin = minrsq_lookup.i & tb->nmask;
itablemin >>= tb->nshiftbits;
int itablemax = itablemin - 1;
if (itablemin == 0) itablemax = ntablem1;
int itablemaxm1 = itablemax - 1;
if (itablemax == 0) itablemaxm1 = ntablem1;
rsq_lookup.i = itablemax << tb->nshiftbits;
rsq_lookup.i |= maskhi;
if (rsq_lookup.f < tb->cut*tb->cut) {
if (tb->match) {
tb->de[itablemax] = tb->de[itablemaxm1];
tb->df[itablemax] = tb->df[itablemaxm1];
tb->drsq[itablemax] = tb->drsq[itablemaxm1];
} else {
rsq_lookup.f = tb->cut*tb->cut;
r = sqrtf(rsq_lookup.f);
e_tmp = splint(tb->rfile,tb->efile,tb->e2file,tb->ninput,r);
f_tmp = splint(tb->rfile,tb->ffile,tb->f2file,tb->ninput,r)/r;
tb->de[itablemax] = e_tmp - tb->e[itablemax];
tb->df[itablemax] = f_tmp - tb->f[itablemax];
tb->drsq[itablemax] = 1.0/(rsq_lookup.f - tb->rsq[itablemax]);
}
}
}
}
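// Note on the BITMAP branch above: the table index is taken directly from the
// bit pattern of rsq (via union_int_float_t), keeping the exponent and the
// leading mantissa bits after masking with nmask and shifting by nshiftbits.
// The bin lookup therefore needs no division, and bin widths grow with the
// magnitude of rsq (each power-of-two interval is split into equal bins).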
/* ----------------------------------------------------------------------
set all ptrs in a table to NULL, so can be freed safely
------------------------------------------------------------------------- */
template<class DeviceType>
void PairTableKokkos<DeviceType>::null_table(Table *tb)
{
tb->rfile = tb->efile = tb->ffile = NULL;
tb->e2file = tb->f2file = NULL;
tb->rsq = tb->drsq = tb->e = tb->de = NULL;
tb->f = tb->df = tb->e2 = tb->f2 = NULL;
}
/* ----------------------------------------------------------------------
free all arrays in a table
------------------------------------------------------------------------- */
template<class DeviceType>
void PairTableKokkos<DeviceType>::free_table(Table *tb)
{
memory->destroy(tb->rfile);
memory->destroy(tb->efile);
memory->destroy(tb->ffile);
memory->destroy(tb->e2file);
memory->destroy(tb->f2file);
memory->destroy(tb->rsq);
memory->destroy(tb->drsq);
memory->destroy(tb->e);
memory->destroy(tb->de);
memory->destroy(tb->f);
memory->destroy(tb->df);
memory->destroy(tb->e2);
memory->destroy(tb->f2);
}
/* ----------------------------------------------------------------------
spline and splint routines modified from Numerical Recipes
------------------------------------------------------------------------- */
template<class DeviceType>
void PairTableKokkos<DeviceType>::spline(double *x, double *y, int n,
double yp1, double ypn, double *y2)
{
int i,k;
double p,qn,sig,un;
double *u = new double[n];
if (yp1 > 0.99e30) y2[0] = u[0] = 0.0;
else {
y2[0] = -0.5;
u[0] = (3.0/(x[1]-x[0])) * ((y[1]-y[0]) / (x[1]-x[0]) - yp1);
}
for (i = 1; i < n-1; i++) {
sig = (x[i]-x[i-1]) / (x[i+1]-x[i-1]);
p = sig*y2[i-1] + 2.0;
y2[i] = (sig-1.0) / p;
u[i] = (y[i+1]-y[i]) / (x[i+1]-x[i]) - (y[i]-y[i-1]) / (x[i]-x[i-1]);
u[i] = (6.0*u[i] / (x[i+1]-x[i-1]) - sig*u[i-1]) / p;
}
if (ypn > 0.99e30) qn = un = 0.0;
else {
qn = 0.5;
un = (3.0/(x[n-1]-x[n-2])) * (ypn - (y[n-1]-y[n-2]) / (x[n-1]-x[n-2]));
}
y2[n-1] = (un-qn*u[n-2]) / (qn*y2[n-2] + 1.0);
for (k = n-2; k >= 0; k--) y2[k] = y2[k]*y2[k+1] + u[k];
delete [] u;
}
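// As in the Numerical Recipes original, passing yp1 or ypn > 0.99e30 selects
// the "natural" boundary condition (zero second derivative at that end);
// otherwise the given first derivative is imposed. The computed y2[] array is
// later passed to splint() below to interpolate e and f at arbitrary r.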
/* ---------------------------------------------------------------------- */
template<class DeviceType>
double PairTableKokkos<DeviceType>::splint(double *xa, double *ya, double *y2a, int n, double x)
{
int klo,khi,k;
double h,b,a,y;
klo = 0;
khi = n-1;
while (khi-klo > 1) {
k = (khi+klo) >> 1;
if (xa[k] > x) khi = k;
else klo = k;
}
h = xa[khi]-xa[klo];
a = (xa[khi]-x) / h;
b = (x-xa[klo]) / h;
y = a*ya[klo] + b*ya[khi] +
((a*a*a-a)*y2a[klo] + (b*b*b-b)*y2a[khi]) * (h*h)/6.0;
return y;
}
/* ----------------------------------------------------------------------
proc 0 writes to restart file
------------------------------------------------------------------------- */
template<class DeviceType>
void PairTableKokkos<DeviceType>::write_restart(FILE *fp)
{
write_restart_settings(fp);
}
/* ----------------------------------------------------------------------
proc 0 reads from restart file, bcasts
------------------------------------------------------------------------- */
template<class DeviceType>
void PairTableKokkos<DeviceType>::read_restart(FILE *fp)
{
read_restart_settings(fp);
allocate();
}
/* ----------------------------------------------------------------------
proc 0 writes to restart file
------------------------------------------------------------------------- */
template<class DeviceType>
void PairTableKokkos<DeviceType>::write_restart_settings(FILE *fp)
{
fwrite(&tabstyle,sizeof(int),1,fp);
fwrite(&tablength,sizeof(int),1,fp);
fwrite(&ewaldflag,sizeof(int),1,fp);
fwrite(&pppmflag,sizeof(int),1,fp);
fwrite(&msmflag,sizeof(int),1,fp);
fwrite(&dispersionflag,sizeof(int),1,fp);
fwrite(&tip4pflag,sizeof(int),1,fp);
}
/* ----------------------------------------------------------------------
proc 0 reads from restart file, bcasts
------------------------------------------------------------------------- */
template<class DeviceType>
void PairTableKokkos<DeviceType>::read_restart_settings(FILE *fp)
{
if (comm->me == 0) {
fread(&tabstyle,sizeof(int),1,fp);
fread(&tablength,sizeof(int),1,fp);
fread(&ewaldflag,sizeof(int),1,fp);
fread(&pppmflag,sizeof(int),1,fp);
fread(&msmflag,sizeof(int),1,fp);
fread(&dispersionflag,sizeof(int),1,fp);
fread(&tip4pflag,sizeof(int),1,fp);
}
MPI_Bcast(&tabstyle,1,MPI_INT,0,world);
MPI_Bcast(&tablength,1,MPI_INT,0,world);
MPI_Bcast(&ewaldflag,1,MPI_INT,0,world);
MPI_Bcast(&pppmflag,1,MPI_INT,0,world);
MPI_Bcast(&msmflag,1,MPI_INT,0,world);
MPI_Bcast(&dispersionflag,1,MPI_INT,0,world);
MPI_Bcast(&tip4pflag,1,MPI_INT,0,world);
}
/* ---------------------------------------------------------------------- */
template<class DeviceType>
double PairTableKokkos<DeviceType>::single(int i, int j, int itype, int jtype, double rsq,
double factor_coul, double factor_lj,
double &fforce)
{
int itable;
double fraction,value,a,b,phi;
int tlm1 = tablength - 1;
Table *tb = &tables[tabindex[itype][jtype]];
if (rsq < tb->innersq) error->one(FLERR,"Pair distance < table inner cutoff");
if (tabstyle == LOOKUP) {
itable = static_cast<int> ((rsq-tb->innersq) * tb->invdelta);
if (itable >= tlm1) error->one(FLERR,"Pair distance > table outer cutoff");
fforce = factor_lj * tb->f[itable];
} else if (tabstyle == LINEAR) {
itable = static_cast<int> ((rsq-tb->innersq) * tb->invdelta);
if (itable >= tlm1) error->one(FLERR,"Pair distance > table outer cutoff");
fraction = (rsq - tb->rsq[itable]) * tb->invdelta;
value = tb->f[itable] + fraction*tb->df[itable];
fforce = factor_lj * value;
} else if (tabstyle == SPLINE) {
itable = static_cast<int> ((rsq-tb->innersq) * tb->invdelta);
if (itable >= tlm1) error->one(FLERR,"Pair distance > table outer cutoff");
b = (rsq - tb->rsq[itable]) * tb->invdelta;
a = 1.0 - b;
value = a * tb->f[itable] + b * tb->f[itable+1] +
((a*a*a-a)*tb->f2[itable] + (b*b*b-b)*tb->f2[itable+1]) *
tb->deltasq6;
fforce = factor_lj * value;
} else {
union_int_float_t rsq_lookup;
rsq_lookup.f = rsq;
itable = rsq_lookup.i & tb->nmask;
itable >>= tb->nshiftbits;
fraction = (rsq_lookup.f - tb->rsq[itable]) * tb->drsq[itable];
value = tb->f[itable] + fraction*tb->df[itable];
fforce = factor_lj * value;
}
if (tabstyle == LOOKUP)
phi = tb->e[itable];
else if (tabstyle == LINEAR || tabstyle == BITMAP)
phi = tb->e[itable] + fraction*tb->de[itable];
else
phi = a * tb->e[itable] + b * tb->e[itable+1] +
((a*a*a-a)*tb->e2[itable] + (b*b*b-b)*tb->e2[itable+1]) * tb->deltasq6;
return factor_lj*phi;
}
/* ----------------------------------------------------------------------
return the Coulomb cutoff for tabled potentials
called by KSpace solvers which require that all pairwise cutoffs be the same
   loop over all tables, not just those indexed by tabindex[i][j], since
   there is no way to know which tables are active before pair::init() has been called
------------------------------------------------------------------------- */
template<class DeviceType>
void *PairTableKokkos<DeviceType>::extract(const char *str, int &dim)
{
if (strcmp(str,"cut_coul") != 0) return NULL;
if (ntables == 0) error->all(FLERR,"All pair coeffs are not set");
double cut_coul = tables[0].cut;
for (int m = 1; m < ntables; m++)
if (tables[m].cut != cut_coul)
error->all(FLERR,
"Pair table cutoffs must all be equal to use with KSpace");
dim = 0;
return &tables[0].cut;
}
template<class DeviceType>
void PairTableKokkos<DeviceType>::init_style()
{
neighbor->request(this,instance_me);
neighflag = lmp->kokkos->neighflag;
int irequest = neighbor->nrequest - 1;
neighbor->requests[irequest]->
kokkos_host = Kokkos::Impl::is_same<DeviceType,LMPHostType>::value &&
!Kokkos::Impl::is_same<DeviceType,LMPDeviceType>::value;
neighbor->requests[irequest]->
kokkos_device = Kokkos::Impl::is_same<DeviceType,LMPDeviceType>::value;
if (neighflag == FULL) {
neighbor->requests[irequest]->full = 1;
neighbor->requests[irequest]->half = 0;
neighbor->requests[irequest]->full_cluster = 0;
} else if (neighflag == HALF || neighflag == HALFTHREAD) {
neighbor->requests[irequest]->full = 0;
neighbor->requests[irequest]->half = 1;
neighbor->requests[irequest]->full_cluster = 0;
} else if (neighflag == N2) {
neighbor->requests[irequest]->full = 0;
neighbor->requests[irequest]->half = 0;
neighbor->requests[irequest]->full_cluster = 0;
} else if (neighflag == FULLCLUSTER) {
neighbor->requests[irequest]->full_cluster = 1;
neighbor->requests[irequest]->full = 1;
neighbor->requests[irequest]->half = 0;
} else {
error->all(FLERR,"Cannot use chosen neighbor list style with lj/cut/kk");
}
}
/*
template <class DeviceType> template<int NEIGHFLAG>
KOKKOS_INLINE_FUNCTION
void PairTableKokkos<DeviceType>::
ev_tally(EV_FLOAT &ev, const int &i, const int &j, const F_FLOAT &fpair,
const F_FLOAT &delx, const F_FLOAT &dely, const F_FLOAT &delz) const
{
const int EFLAG = eflag;
const int NEWTON_PAIR = newton_pair;
const int VFLAG = vflag_either;
if (EFLAG) {
if (eflag_atom) {
E_FLOAT epairhalf = 0.5 * (ev.evdwl + ev.ecoul);
if (NEWTON_PAIR || i < nlocal) eatom[i] += epairhalf;
if (NEWTON_PAIR || j < nlocal) eatom[j] += epairhalf;
}
}
if (VFLAG) {
const E_FLOAT v0 = delx*delx*fpair;
const E_FLOAT v1 = dely*dely*fpair;
const E_FLOAT v2 = delz*delz*fpair;
const E_FLOAT v3 = delx*dely*fpair;
const E_FLOAT v4 = delx*delz*fpair;
const E_FLOAT v5 = dely*delz*fpair;
if (vflag_global) {
if (NEIGHFLAG) {
if (NEWTON_PAIR) {
ev.v[0] += v0;
ev.v[1] += v1;
ev.v[2] += v2;
ev.v[3] += v3;
ev.v[4] += v4;
ev.v[5] += v5;
} else {
if (i < nlocal) {
ev.v[0] += 0.5*v0;
ev.v[1] += 0.5*v1;
ev.v[2] += 0.5*v2;
ev.v[3] += 0.5*v3;
ev.v[4] += 0.5*v4;
ev.v[5] += 0.5*v5;
}
if (j < nlocal) {
ev.v[0] += 0.5*v0;
ev.v[1] += 0.5*v1;
ev.v[2] += 0.5*v2;
ev.v[3] += 0.5*v3;
ev.v[4] += 0.5*v4;
ev.v[5] += 0.5*v5;
}
}
} else {
ev.v[0] += 0.5*v0;
ev.v[1] += 0.5*v1;
ev.v[2] += 0.5*v2;
ev.v[3] += 0.5*v3;
ev.v[4] += 0.5*v4;
ev.v[5] += 0.5*v5;
}
}
if (vflag_atom) {
if (NEWTON_PAIR || i < nlocal) {
d_vatom(i,0) += 0.5*v0;
d_vatom(i,1) += 0.5*v1;
d_vatom(i,2) += 0.5*v2;
d_vatom(i,3) += 0.5*v3;
d_vatom(i,4) += 0.5*v4;
d_vatom(i,5) += 0.5*v5;
}
if (NEWTON_PAIR || (NEIGHFLAG && j < nlocal)) {
d_vatom(j,0) += 0.5*v0;
d_vatom(j,1) += 0.5*v1;
d_vatom(j,2) += 0.5*v2;
d_vatom(j,3) += 0.5*v3;
d_vatom(j,4) += 0.5*v4;
d_vatom(j,5) += 0.5*v5;
}
}
}
}
*/
template<class DeviceType>
void PairTableKokkos<DeviceType>::cleanup_copy() {
// WHY needed: this prevents parent copy from deallocating any arrays
allocated = 0;
cutsq = NULL;
eatom = NULL;
vatom = NULL;
h_table=NULL; d_table=NULL;
}
template class PairTableKokkos<LMPDeviceType>;
#ifdef KOKKOS_HAVE_CUDA
template class PairTableKokkos<LMPHostType>;
#endif
diff --git a/src/MAKE/MACHINES/Makefile.beacon b/src/MAKE/MACHINES/Makefile.beacon
index 60b9907c6..f547c2e7f 100755
--- a/src/MAKE/MACHINES/Makefile.beacon
+++ b/src/MAKE/MACHINES/Makefile.beacon
@@ -1,114 +1,116 @@
# linux = RedHat Linux box, Intel icc, MPICH2, FFTW
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = mpiicpc -openmp -DLMP_INTEL_OFFLOAD -DLAMMPS_MEMALIGN=64
MIC_OPT = -offload-option,mic,compiler,"-fp-model fast=2 -mGLOB_default_function_attrs=\"gather_scatter_loop_unroll=4\""
CCFLAGS = -O3 -xAVX -fno-alias -ansi-alias -restrict -override-limits $(MIC_OPT)
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = mpiicpc -openmp
LINKFLAGS = -O3 -xAVX
LIB =
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP -DLAMMPS_JPEG
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -DMPICH_SKIP_MPICXX
MPI_PATH =
MPI_LIB =
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC = -DFFT_MKL -DFFT_SINGLE -I$(MKLROOT)
FFT_PATH =
FFT_LIB = -L$(MKLROOT) -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB = -ljpeg
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
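
This is the first of many identical changes to the machine makefiles in this patch: each one gains EXTRA_CPP_DEPENDS and EXTRA_LINK_DEPENDS, taken from PKG_CPP_DEPENDS and PKG_LINK_DEPENDS (presumably provided by the Makefile.package / Makefile.package.settings files included just above), and lists them as additional prerequisites of the compile, dependency, link, lib and shlib rules. The effect is that objects are recompiled and the executable or library is relinked whenever a package-supplied dependency changes, not only when a LAMMPS source file does. A sketch of what the package side might export; the values below are illustrative only and are not part of this patch:

# hypothetical fragment of Makefile.package.settings for a bundled "foo" library
PKG_CPP_DEPENDS  = ../../lib/foo/foo_config.h   # touching this header forces recompilation
PKG_LINK_DEPENDS = ../../lib/foo/libfoo.a       # rebuilding this library forces a relink

Since GNU make reruns a recipe whenever any prerequisite is newer than its target, forwarding these variables into the pattern rules and into the $(EXE), lib and shlib targets is all that is needed.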
diff --git a/src/MAKE/MACHINES/Makefile.bgl b/src/MAKE/MACHINES/Makefile.bgl
index 05d1cf33b..47e084216 100644
--- a/src/MAKE/MACHINES/Makefile.bgl
+++ b/src/MAKE/MACHINES/Makefile.bgl
@@ -1,119 +1,121 @@
# bgl = LLNL Blue Gene Light machine, xlC, native MPI, FFTW
SHELL = /bin/sh
.SUFFIXES: .cpp .u
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = /opt/ibmcmp/vacpp/7.0/bin/blrts_xlC
CCFLAGS = -O3
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = /opt/ibmcmp/vacpp/7.0/bin/blrts_xlC
LINKFLAGS = -O \
-L/opt/ibmcmp/xlf/9.1/blrts_lib \
-L/opt/ibmcmp/vacpp/7.0/blrts_lib \
-L/bgl/local/lib \
-L/bgl/local/bglfftwgel-2.1.5.pre5/lib
LIB = -lxlopt -lxlomp_ser -lxl -lxlfmath -lm \
-lmsglayer.rts -lrts.rts -ldevices.rts -lmassv
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -DMPICH_SKIP_MPICXX
MPI_PATH =
MPI_LIB = -lmpich.rts
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC = -DFFT_FFTW
FFT_PATH =
FFT_LIB = -lfftw
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
diff --git a/src/MAKE/MACHINES/Makefile.bgq b/src/MAKE/MACHINES/Makefile.bgq
index 3ad1e98b0..c4081b04a 100644
--- a/src/MAKE/MACHINES/Makefile.bgq
+++ b/src/MAKE/MACHINES/Makefile.bgq
@@ -1,55 +1,56 @@
# bgq = IBM Blue Gene/Q, multiple compiler options, native MPI, ALCF FFTW2
SHELL = /bin/bash
.SUFFIXES: .cpp .u
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
# select which compiler by editing Makefile.bgq.details
include ../MAKE/Makefile.bgq.details
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
-ifneq ($(COMPILER),XLC)
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-endif
+
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
+ $(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
-
-
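
Besides the dependency variables, the bgq makefile also drops the ifneq ($(COMPILER),XLC) guard around the %.d rule and gains the %.cu rule that the other machine makefiles already carry, so dependency files are now generated for every compiler choice. For reference, the %.d rule simply captures the compiler's -M output as a makefile fragment; the file and header names below are illustrative:

# pair_lj_cut.d, as written by "$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) pair_lj_cut.cpp > pair_lj_cut.d"
pair_lj_cut.o: ../pair_lj_cut.cpp ../pair_lj_cut.h ../pair.h ../atom.h

# collected and included by every machine makefile:
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)

sinclude is GNU make's synonym for -include, so a clean tree in which no .d files exist yet does not abort the build.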
diff --git a/src/MAKE/MACHINES/Makefile.chama b/src/MAKE/MACHINES/Makefile.chama
index 8a0ca9413..ef6d25ebc 100644
--- a/src/MAKE/MACHINES/Makefile.chama
+++ b/src/MAKE/MACHINES/Makefile.chama
@@ -1,109 +1,115 @@
# chama - Intel SandyBridge, mpic++, openmpi, no FFTW
# need to load the following modules:
# 1) intel/12.1
# 2) openmpi-intel/1.4
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = mpic++
CCFLAGS = -O3 -axAVX -funroll-loops -fstrict-aliasing -openmp
DEPFLAGS = -M
LINK = mpic++
LINKFLAGS = -O3 -axAVX -openmp
LIB = -lstdc++
ARCHIVE = ar
ARFLAGS = -rcsv
SIZE = size
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC =
MPI_PATH =
MPI_LIB =
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
#FFT_INC = -DFFT_FFTW -I${FFTW_INCLUDE}
#FFT_PATH = -L${FFTW_LIB}
#FFT_LIB = -lfftw
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
-include Makefile.package.settings
-include Makefile.package
+include Makefile.package.settings
+include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
-# Library target
+# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
+ $(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
+ $(OBJ) $(EXTRA_LIB) $(LIB)
+
# Compilation rules
-%.o:%.cpp
- $(CC) $(CCFLAGS) $(EXTRA_INC) -c $<
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
+ $(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
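
Makefile.chama receives more than the dependency variables: it gains the previously missing shlib target, and its %.o rule is aligned with the other machine files by passing $(SHFLAGS). Note that, unlike its neighbours, Makefile.chama does not define SHFLAGS or SHLIBFLAGS in its compiler section, so both expand to nothing here; to actually produce a usable shared library one would also add the usual settings, e.g.:

SHFLAGS = -fPIC        # position-independent objects
SHLIBFLAGS = -shared   # linker flag used by the shlib target

The values above are simply what the other machine makefiles in this directory use; whether they are appropriate for chama's Intel toolchain is an assumption.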
diff --git a/src/MAKE/MACHINES/Makefile.cygwin b/src/MAKE/MACHINES/Makefile.cygwin
index 159a821b5..41e9d811d 100644
--- a/src/MAKE/MACHINES/Makefile.cygwin
+++ b/src/MAKE/MACHINES/Makefile.cygwin
@@ -1,113 +1,115 @@
# cygwin = Windows Cygwin, mpicxx, MPICH, FFTW
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = mpicxx
CCFLAGS = -O
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = mpicxx
LINKFLAGS = -O
LIB =
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -DMPICH_SKIP_MPICXX
MPI_PATH = -L/cygdrive/c/cygwin/mpich2-1.0.4p1/lib
MPI_LIB = -lmpich
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC = -DFFT_FFTW -I/cygdrive/c/cygwin/usr/local/include
FFT_PATH = -L/cygdrive/c/cygwin/usr/local/lib
FFT_LIB = -lfftw
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
diff --git a/src/MAKE/MACHINES/Makefile.glory b/src/MAKE/MACHINES/Makefile.glory
index 4c1db4bc0..154c194b4 100644
--- a/src/MAKE/MACHINES/Makefile.glory
+++ b/src/MAKE/MACHINES/Makefile.glory
@@ -1,130 +1,132 @@
# glory = Linux cluster with 4-way quad cores, Intel mpicxx, native MPI, FFTW
SHELL = /bin/sh
# this Makefile builds LAMMPS for mvapich running on Glory
# to invoke this Makefile, you need these modules loaded:
# compilers/intel-11.1-f064-c064
# mpi/mvapich-1.1_intel-11.1-f064-c064
# libraries/fftw-2.1.5
# you can determine which modules are loaded by typing:
# module list
# these modules are not the default ones, but can be enabled by
# lines like this in your .cshrc or other start-up shell file
# or by typing them before you build LAMMPS:
# module swap mpi misc/env-mvapich
# module load compilers/intel-11.1-f064-c064
# module load mpi/mvapich-1.1_intel-11.1-f064-c064
# module load libraries/fftw-2.1.5
# these same modules need to be loaded to submit a LAMMPS job,
# either interactively or via a batch script
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = mpicxx
CCFLAGS = -O
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = mpicxx
LINKFLAGS = -O
LIB = -lstdc++ -lm
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC =
MPI_PATH =
MPI_LIB =
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC = -DFFT_FFTW -I${FFTW_INCLUDE}
FFT_PATH = -L${FFTW_LIB}
FFT_LIB = -lfftw
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
diff --git a/src/MAKE/MACHINES/Makefile.jaguar b/src/MAKE/MACHINES/Makefile.jaguar
index b065bf653..cc550becb 100644
--- a/src/MAKE/MACHINES/Makefile.jaguar
+++ b/src/MAKE/MACHINES/Makefile.jaguar
@@ -1,113 +1,115 @@
# jaguar = ORNL Jaguar Cray XT5, CC, native MPICH, FFTW
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CXX = CC
CCFLAGS = -g -O
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = $(CXX)
LINKFLAGS = -g -O
LIB =
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP -DNODE_PARTITION
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -DMPICH_SKIP_MPICXX
MPI_PATH =
MPI_LIB = -lmpich -lpthread
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC = -DFFT_FFTW3 -I$(FFTW_INC)
FFT_PATH = -L$(FFTW_DIR)
FFT_LIB = -lfftw3
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
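
One FFT detail worth noting when comparing these files: the -DFFT_* define in FFT_INC and the library named in FFT_LIB must be a matched pair for the FFTW generation that is actually installed. Makefile.jaguar is the only file in this group linking FFTW 3; the others that use FFTW at all still reference the 2.x library name:

# FFTW 2.x style, as in Makefile.bgl, Makefile.mac, Makefile.glory, Makefile.power, ...
FFT_INC = -DFFT_FFTW
FFT_LIB = -lfftw

# FFTW 3.x style, as in Makefile.jaguar
FFT_INC = -DFFT_FFTW3 -I$(FFTW_INC)
FFT_PATH = -L$(FFTW_DIR)
FFT_LIB = -lfftw3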
diff --git a/src/MAKE/MACHINES/Makefile.mac b/src/MAKE/MACHINES/Makefile.mac
index dff1a0bf2..99420e6e1 100755
--- a/src/MAKE/MACHINES/Makefile.mac
+++ b/src/MAKE/MACHINES/Makefile.mac
@@ -1,113 +1,115 @@
# mac = Apple PowerBook G4 laptop, c++, no MPI, FFTW 2.1.5
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = c++
CCFLAGS = -O
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = c++
LINKFLAGS = -O
LIB =
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -I../STUBS
MPI_PATH = -L../STUBS
MPI_LIB = -lmpi_stubs
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC = -DFFT_FFTW
FFT_PATH =
FFT_LIB = -lfftw
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
diff --git a/src/MAKE/MACHINES/Makefile.mac_mpi b/src/MAKE/MACHINES/Makefile.mac_mpi
index 119e0f415..fc8bf69de 100755
--- a/src/MAKE/MACHINES/Makefile.mac_mpi
+++ b/src/MAKE/MACHINES/Makefile.mac_mpi
@@ -1,116 +1,118 @@
# mac_mpi = Apple laptop, MacPorts Open MPI 1.4.3, gcc 4.8, fftw, jpeg
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# generally no need to edit this section
# unless additional compiler/linker flags or libraries needed for your machine
CC = /opt/local/bin/mpicxx-openmpi-mp
CCFLAGS = -O3
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = /opt/local/bin/mpicxx-openmpi-mp
LINKFLAGS = -O3
LIB =
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP -DLAMMPS_JPEG
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -DOMPI_SKIP_MPICXX
MPI_PATH =
MPI_LIB =
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFTW = /usr/local
FFT_INC = -DFFT_FFTW -I${FFTW}/include
FFT_PATH = -L${FFTW}/lib
FFT_LIB = -lfftw
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC = -I/opt/local/include
JPG_PATH = -L/opt/local/lib
JPG_LIB = -ljpeg
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
diff --git a/src/MAKE/MACHINES/Makefile.mingw32-cross b/src/MAKE/MACHINES/Makefile.mingw32-cross
index 204666056..458186e41 100644
--- a/src/MAKE/MACHINES/Makefile.mingw32-cross
+++ b/src/MAKE/MACHINES/Makefile.mingw32-cross
@@ -1,122 +1,120 @@
# mingw32-cross = Win 32-bit, gcc-4.7.1, MinGW, internal FFT, no MPI, OpenMP
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = i686-w64-mingw32-g++
CCFLAGS = -O3 -march=i686 -mtune=generic -mfpmath=387 -mpc64 -fopenmp \
-ffast-math -fstrict-aliasing -Wall -W -Wno-uninitialized
SHFLAGS = # -fPIC (not needed on windows, all code is PIC)
DEPFLAGS = -M
LINK = i686-w64-mingw32-g++ -static
LINKFLAGS = -O2 -march=i686 -mtune=generic -mfpmath=387 -mpc64 -fopenmp
LIB = -lwsock32 -static-libgcc -lquadmath
SIZE = i686-w64-mingw32-size
ARCHIVE = i686-w64-mingw32-ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# name of object file subdir for libraries in lib with leading '/'
LIBOBJDIR = /Obj_mingw32
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_SMALLSMALL -DLAMMPS_JPEG -DLAMMPS_PNG -DLAMMPS_XDR -DLAMMPS_GZIP -DLAMMPS_FFMPEG
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -I../STUBS
MPI_PATH = -L../STUBS
MPI_LIB = -lmpi_mingw32
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC =
FFT_PATH =
FFT_LIB =
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB = -ljpeg -lpng -lz
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
-
-# Local Variables:
-# mode: makefile
-# End:
diff --git a/src/MAKE/MACHINES/Makefile.mingw32-cross-mpi b/src/MAKE/MACHINES/Makefile.mingw32-cross-mpi
index e0b298e39..91e4e5b7c 100644
--- a/src/MAKE/MACHINES/Makefile.mingw32-cross-mpi
+++ b/src/MAKE/MACHINES/Makefile.mingw32-cross-mpi
@@ -1,122 +1,120 @@
# mingw32-cross-mpi = Win 32-bit, gcc-4.7.1, MinGW, internal FFT, MPICH2, OpenMP
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = i686-w64-mingw32-g++
CCFLAGS = -O3 -march=i686 -mtune=generic -mfpmath=387 -mpc64 -fopenmp \
-ffast-math -fstrict-aliasing -Wall -W -Wno-uninitialized
SHFLAGS = # -fPIC (not needed on windows, all code is PIC)
DEPFLAGS = -M
LINK = i686-w64-mingw32-g++ -static
LINKFLAGS = -O2 -march=i686 -mtune=generic -mfpmath=387 -mpc64 -fopenmp
LIB = -lwsock32 -static-libgcc -lquadmath
SIZE = i686-w64-mingw32-size
ARCHIVE = i686-w64-mingw32-ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# name of object file subdir for libraries in lib with leading '/'
LIBOBJDIR = /Obj_mingw32-mpi
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_SMALLSMALL -DLAMMPS_JPEG -DLAMMPS_PNG -DLAMMPS_XDR -DLAMMPS_GZIP -DLAMMPS_FFMPEG
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -I../../tools/mingw-cross/mpich2-win32/include
MPI_PATH = -L../../tools/mingw-cross/mpich2-win32/lib
MPI_LIB = -lmpi
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC =
FFT_PATH =
FFT_LIB =
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB = -ljpeg -lpng -lz
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
-
-# Local Variables:
-# mode: makefile
-# End:
diff --git a/src/MAKE/MACHINES/Makefile.mingw64-cross b/src/MAKE/MACHINES/Makefile.mingw64-cross
index 5df08668a..74283c9d8 100644
--- a/src/MAKE/MACHINES/Makefile.mingw64-cross
+++ b/src/MAKE/MACHINES/Makefile.mingw64-cross
@@ -1,122 +1,120 @@
# mingw64-cross = Win 64-bit, gcc-4.7.1, MinGW, internal FFT, no MPI, OpenMP
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = x86_64-w64-mingw32-g++
CCFLAGS = -O3 -march=core2 -mtune=core2 -mpc64 -msse2 -fopenmp \
-ffast-math -fstrict-aliasing -Wall -W -Wno-uninitialized
SHFLAGS = # -fPIC (not needed on windows, all code is PIC)
DEPFLAGS = -M
LINK = x86_64-w64-mingw32-g++ -static
LINKFLAGS = -O2 -march=core2 -mtune=core2 -mpc64 -msse2 -fopenmp
LIB = -lwsock32 -static-libgcc -lquadmath
SIZE = x86_64-w64-mingw32-size
ARCHIVE = x86_64-w64-mingw32-ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# name of object file subdir for libraries in lib with leading '/'
LIBOBJDIR = /Obj_mingw64
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_SMALLBIG -DLAMMPS_JPEG -DLAMMPS_PNG -DLAMMPS_XDR -DLAMMPS_GZIP -DLAMMPS_FFMPEG
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -I../STUBS
MPI_PATH = -L../STUBS
MPI_LIB = -lmpi_mingw64
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC =
FFT_PATH =
FFT_LIB =
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB = -ljpeg -lpng -lz
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
-
-# Local Variables:
-# mode: makefile
-# End:
diff --git a/src/MAKE/MACHINES/Makefile.mingw64-cross-mpi b/src/MAKE/MACHINES/Makefile.mingw64-cross-mpi
index 98b27798c..6ee48069e 100644
--- a/src/MAKE/MACHINES/Makefile.mingw64-cross-mpi
+++ b/src/MAKE/MACHINES/Makefile.mingw64-cross-mpi
@@ -1,122 +1,120 @@
# mingw64-cross-mpi = Win 64-bit, gcc-4.7.1, MinGW, internal FFT, MPICH2, OpenMP
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = x86_64-w64-mingw32-g++
CCFLAGS = -O3 -march=core2 -mtune=core2 -mpc64 -msse2 -fopenmp \
-ffast-math -fstrict-aliasing -Wall -W -Wno-uninitialized
SHFLAGS = # -fPIC (not needed on windows, all code is PIC)
DEPFLAGS = -M
LINK = x86_64-w64-mingw32-g++ -static
LINKFLAGS = -O2 -march=core2 -mtune=core2 -mpc64 -msse2 -fopenmp
LIB = -lwsock32 -static-libgcc -lquadmath
SIZE = x86_64-w64-mingw32-size
ARCHIVE = x86_64-w64-mingw32-ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# name of object file subdir for libraries in lib with leading '/'
LIBOBJDIR = /Obj_mingw64-mpi
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_SMALLBIG -DLAMMPS_JPEG -DLAMMPS_PNG -DLAMMPS_XDR -DLAMMPS_GZIP -DLAMMPS_FFMPEG
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -I../../tools/mingw-cross/mpich2-win64/include
MPI_PATH = -L../../tools/mingw-cross/mpich2-win64/lib
MPI_LIB = -lmpi
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC =
FFT_PATH =
FFT_LIB =
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB = -ljpeg -lpng -lz
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
-
-# Local Variables:
-# mode: makefile
-# End:
diff --git a/src/MAKE/MACHINES/Makefile.myrinet b/src/MAKE/MACHINES/Makefile.myrinet
index 94c132200..59f2efb58 100755
--- a/src/MAKE/MACHINES/Makefile.myrinet
+++ b/src/MAKE/MACHINES/Makefile.myrinet
@@ -1,113 +1,115 @@
# myrinet = cluster, g++, myrinet MPI, no FFTs
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = g++
CCFLAGS = -O
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = g++
LINKFLAGS = -O
LIB =
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -I/opt/mpich-mx/include
MPI_PATH = -L/opt/mpich-mx/lib -L/opt/mx/lib
MPI_LIB = -lmpich -lmyriexpress
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC = -DFFT_NONE
FFT_PATH =
FFT_LIB =
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
diff --git a/src/MAKE/MACHINES/Makefile.power b/src/MAKE/MACHINES/Makefile.power
index 8199e762b..14a035dd2 100644
--- a/src/MAKE/MACHINES/Makefile.power
+++ b/src/MAKE/MACHINES/Makefile.power
@@ -1,114 +1,116 @@
# power = IBM Power5+, mpCC_r, native MPI, FFTW
SHELL = /bin/sh
.SUFFIXES: .cpp .u
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = mpCC_r
CCFLAGS = -O3 -qnoipa -qlanglvl=oldmath
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = mpCC_r
LINKFLAGS = -O -qnoipa -qlanglvl=oldmath -bmaxdata:0x70000000
LIB = -lm
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC =
MPI_PATH =
MPI_LIB =
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC = -DFFT_FFTW -I/scr/oppe/LAMMPS/fftw-2.1.5/include
FFT_PATH = -L/scr/oppe/LAMMPS/fftw-2.1.5/lib
FFT_LIB = -lfftw
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
diff --git a/src/MAKE/MACHINES/Makefile.redsky b/src/MAKE/MACHINES/Makefile.redsky
index daae95cf9..a310f5650 100644
--- a/src/MAKE/MACHINES/Makefile.redsky
+++ b/src/MAKE/MACHINES/Makefile.redsky
@@ -1,133 +1,135 @@
# redsky - SUN X6275 nodes, Nehalem procs, mpic++, openmpi, OpenMP, no FFTW
SHELL = /bin/sh
# This Makefile builds LAMMPS for RedSky with OpenMPI.
# To use this Makefile, you need appropriate modules loaded.
# You can determine which modules are loaded by typing:
# module list
# These modules can be enabled by lines like this in your .cshrc or
# other start-up shell file or by typing them before you build LAMMPS:
# module load mpi/openmpi-1.4.2_oobpr_intel-11.1-f064-c064
# module load libraries/intel-mkl-11.1.064
# module load libraries/fftw-2.1.5_openmpi-1.4.2_oobpr_intel-11.1-f064-c064
# These same modules need to be loaded to submit a LAMMPS job,
# either interactively or via a batch script.
# IMPORTANT NOTE:
# to run efficiently on RedSky, use the "numa_wrapper" mpiexec option,
# to ensure processes and their memory are locked to specific cores
# e.g. in your batch script:
# nodes=$SLURM_JOB_NUM_NODES
# cores=8
# mpiexec --npernode $cores numa_wrapper --ppn $cores lmp_redsky < in > out
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = mpic++ -fopenmp
CCFLAGS = -O2 -xsse4.2 -funroll-loops -fstrict-aliasing
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = mpic++ -fopenmp
LINKFLAGS = -O -xsse4.2
LIB = -lstdc++
SIZE = size
ARCHIVE = ar
ARFLAGS = -rcsv
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC =
MPI_PATH =
MPI_LIB =
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
#FFT_INC = -DFFT_FFTW -I${FFTW_INCLUDE}
#FFT_PATH = -L${FFTW_LIB}
#FFT_LIB = -lfftw
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
diff --git a/src/MAKE/MACHINES/Makefile.serial b/src/MAKE/MACHINES/Makefile.serial
index c63e49f08..bff42d953 100755
--- a/src/MAKE/MACHINES/Makefile.serial
+++ b/src/MAKE/MACHINES/Makefile.serial
@@ -1,113 +1,115 @@
# serial = RedHat Linux box, g++4, no MPI, no FFTs
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = g++
CCFLAGS = -O
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = g++
LINKFLAGS = -O
LIB =
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -I../STUBS
MPI_PATH = -L../STUBS
MPI_LIB = -lmpi_stubs
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC =
FFT_PATH =
FFT_LIB =
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
diff --git a/src/MAKE/MACHINES/Makefile.stampede b/src/MAKE/MACHINES/Makefile.stampede
index 864197f5b..e8b363896 100755
--- a/src/MAKE/MACHINES/Makefile.stampede
+++ b/src/MAKE/MACHINES/Makefile.stampede
@@ -1,114 +1,116 @@
# stampede = Intel Compiler, MKL FFT, Offload to Xeon Phi
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = mpicc -openmp -DLMP_INTEL_OFFLOAD -DLAMMPS_MEMALIGN=64
MIC_OPT = -offload-option,mic,compiler,"-fp-model fast=2 -mGLOB_default_function_attrs=\"gather_scatter_loop_unroll=4\""
CCFLAGS = -O3 -xAVX -fno-alias -ansi-alias -restrict -override-limits $(MIC_OPT)
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = mpicc -openmp
LINKFLAGS = -O3 -xAVX
LIB =
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP -DLAMMPS_JPEG
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -DMPICH_SKIP_MPICXX
MPI_PATH =
MPI_LIB =
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC = -DFFT_MKL -DFFT_SINGLE -I$(TACC_MKL_INC)
FFT_PATH =
FFT_LIB = -L$(TACC_MKL_LIB) -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB = -ljpeg
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
diff --git a/src/MAKE/MACHINES/Makefile.storm b/src/MAKE/MACHINES/Makefile.storm
index d6e904b24..cb41e3ada 100644
--- a/src/MAKE/MACHINES/Makefile.storm
+++ b/src/MAKE/MACHINES/Makefile.storm
@@ -1,107 +1,116 @@
# storm = Cray Red Storm XT3, Cray CC, native MPI, FFTW
SHELL = /bin/sh
.SUFFIXES: .cpp .d
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = CC
CCFLAGS = -fastsse
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = CC
LINKFLAGS = -O
LIB = -lstdc++
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -DMPICH_SKIP_MPICXX
MPI_PATH =
MPI_LIB =
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC = -DFFT_FFTW -I/projects/fftw/fftw-2.1.5/include
FFT_PATH =
FFT_LIB = -lfftw
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-.cpp.o:
- $(CC) $(CCFLAGS) $(EXTRA_INC) -c $<
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
+ $(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
+
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
+ $(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
+
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
+ $(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
-$(OBJ): $(INC)
+DEPENDS = $(OBJ:.o=.d)
+sinclude $(DEPENDS)
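
Makefile.storm previously used old-style suffix rules plus a blanket $(OBJ): $(INC) dependency; it now follows the same pattern rules and automatically generated dependencies as the other machine makefiles. Each %.d file is written by the compiler via DEPFLAGS (-M), listing exactly the headers that source file includes, and sinclude reads those files without complaining when they do not yet exist (e.g. on the first build). A generated fragment looks roughly like this (a sketch; the header names are only illustrative):

    # pair_lj_cut.d, produced by "CC -M pair_lj_cut.cpp"
    pair_lj_cut.o: pair_lj_cut.cpp pair_lj_cut.h pair.h atom.h \
      force.h neigh_list.h memory.h error.h

The practical effect is that only objects whose headers actually changed are recompiled, instead of every object whenever anything in $(INC) was touched.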
diff --git a/src/MAKE/MACHINES/Makefile.tacc b/src/MAKE/MACHINES/Makefile.tacc
index cd3b70588..4f2e9bd05 100644
--- a/src/MAKE/MACHINES/Makefile.tacc
+++ b/src/MAKE/MACHINES/Makefile.tacc
@@ -1,116 +1,118 @@
# tacc = UT Lonestar TACC machine, mpiCC, MPI, FFTW
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = mpiCC
CCFLAGS = -O
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = mpiCC
LINKFLAGS = -O
LIB =
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -DMPICH_SKIP_MPICXX
MPI_PATH =
MPI_LIB = -lmpich -lpthread
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFTW_INC = ${TACC_FFTW2_INC}
FFTW_LIB = ${TACC_FFTW2_LIB}
FFT_INC = -DFFT_FFTW -I${FFTW_INC}
FFT_PATH = -L${FFTW_LIB}
FFT_LIB = ${FFTW_LIB}/libfftw.a
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
diff --git a/src/MAKE/MACHINES/Makefile.ubuntu b/src/MAKE/MACHINES/Makefile.ubuntu
index 45506b0e7..df4e60334 100644
--- a/src/MAKE/MACHINES/Makefile.ubuntu
+++ b/src/MAKE/MACHINES/Makefile.ubuntu
@@ -1,117 +1,119 @@
# ubuntu = Ubuntu Linux box, g++, openmpi, FFTW3
# you have to install the packages g++, mpi-default-bin, mpi-default-dev,
# libfftw3-dev, libjpeg-dev and libpng12-dev to compile LAMMPS with this
# makefile
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = mpic++
CCFLAGS = -g -O3 # -Wunused
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = mpic++
LINKFLAGS = -g -O3
LIB =
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP -DLAMMPS_JPEG -DLAMMPS_PNG -DLAMMPS_FFMPEG
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC =
MPI_PATH =
MPI_LIB =
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC = -DFFT_FFTW3
FFT_PATH =
FFT_LIB = -lfftw3
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB = -ljpeg -lpng
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
diff --git a/src/MAKE/MACHINES/Makefile.ubuntu_simple b/src/MAKE/MACHINES/Makefile.ubuntu_simple
index 4db9f7c57..e4e45eae0 100644
--- a/src/MAKE/MACHINES/Makefile.ubuntu_simple
+++ b/src/MAKE/MACHINES/Makefile.ubuntu_simple
@@ -1,116 +1,118 @@
# ubuntu_simple = Ubuntu Linux box, g++, openmpi, KISS FFT
# you have to install the packages g++, mpi-default-bin and mpi-default-dev
# to compile LAMMPS with this makefile
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = mpic++
CCFLAGS = -g -O3 # -Wunused
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = mpic++
LINKFLAGS = -g -O3
LIB =
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC =
MPI_PATH =
MPI_LIB =
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC =
FFT_PATH =
FFT_LIB =
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
diff --git a/src/MAKE/MACHINES/Makefile.xe6 b/src/MAKE/MACHINES/Makefile.xe6
index 127527e83..3ad85bca4 100644
--- a/src/MAKE/MACHINES/Makefile.xe6
+++ b/src/MAKE/MACHINES/Makefile.xe6
@@ -1,114 +1,116 @@
# xe6 = Cray XE6, Cray CC, native MPI, FFTW
SHELL = /bin/sh
.SUFFIXES: .cpp .d
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = CC
CCFLAGS = -fastsse
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = CC
LINKFLAGS = -O
LIB = -lstdc++
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -DMPICH_SKIP_MPICXX
MPI_PATH =
MPI_LIB =
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC = -DFFT_FFTW -I/home/sjplimp/fftw/fftw
FFT_PATH = -L/home/sjplimp/fftw/fftw/.libs
FFT_LIB = -lfftw
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
diff --git a/src/MAKE/MACHINES/Makefile.xt3 b/src/MAKE/MACHINES/Makefile.xt3
index 4734d3ebd..274a52369 100644
--- a/src/MAKE/MACHINES/Makefile.xt3
+++ b/src/MAKE/MACHINES/Makefile.xt3
@@ -1,115 +1,117 @@
# xt3 = PSC BigBen Cray XT3, CC, native MPI, FFTW
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = CC
CCFLAGS = -O3 --target=catamount \
-fomit-frame-pointer -finline-functions \
-Wall -Wno-unused -funroll-all-loops
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = CC
LINKFLAGS = --target=catamount -O
LIB = -lgmalloc
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP -DLAMMPS_XDR
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -DMPICH_SKIP_MPICXX
MPI_PATH =
MPI_LIB = -lmpich -lpthread
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC = -DFFT_FFTW -I$(FFTW_INC)
FFT_PATH = -L$(FFTW_LIB)
FFT_LIB = -ldfftw
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
diff --git a/src/MAKE/MACHINES/Makefile.xt5 b/src/MAKE/MACHINES/Makefile.xt5
index 9f9a42166..920b5e5bd 100644
--- a/src/MAKE/MACHINES/Makefile.xt5
+++ b/src/MAKE/MACHINES/Makefile.xt5
@@ -1,114 +1,116 @@
# xt5 = Cray XT5, Cray CC, native MPI, FFTW
SHELL = /bin/sh
.SUFFIXES: .cpp .d
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = CC
CCFLAGS = -fastsse
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = CC
LINKFLAGS = -O
LIB = -lstdc++
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -DMPICH_SKIP_MPICXX
MPI_PATH =
MPI_LIB =
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC = -DFFT_FFTW -I/home/sjplimp/fftw/fftw
FFT_PATH = -L/home/sjplimp/fftw/fftw/.libs
FFT_LIB = -lfftw
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
diff --git a/src/MAKE/Makefile.mpi b/src/MAKE/Makefile.mpi
index 3d1766bbd..c88985f33 100755
--- a/src/MAKE/Makefile.mpi
+++ b/src/MAKE/Makefile.mpi
@@ -1,113 +1,115 @@
# mpi = default MPI compiler, default MPI
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = mpicxx
CCFLAGS = -g -O3
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = mpicxx
LINKFLAGS = -g -O
LIB =
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -DMPICH_SKIP_MPICXX -DOMPI_SKIP_MPICXX=1
MPI_PATH =
MPI_LIB =
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC =
FFT_PATH =
FFT_LIB =
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
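
In this makefile MPI_INC carries only the -DMPICH_SKIP_MPICXX and -DOMPI_SKIP_MPICXX=1 defines, because the mpicxx wrapper already supplies the include and library paths; the defines merely stop mpi.h from pulling in the MPICH/Open MPI C++ bindings, which can conflict at link time. Without a wrapper compiler, the three variables can be filled in explicitly, roughly (a sketch; the /opt/mpich paths are placeholders for wherever MPICH is installed):

    MPI_INC  = -DMPICH_SKIP_MPICXX -I/opt/mpich/include
    MPI_PATH = -L/opt/mpich/lib
    MPI_LIB  = -lmpich -lpthread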
diff --git a/src/MAKE/Makefile.serial b/src/MAKE/Makefile.serial
index 072b1fbb0..2833839ee 100755
--- a/src/MAKE/Makefile.serial
+++ b/src/MAKE/Makefile.serial
@@ -1,113 +1,115 @@
# serial = g++ compiler, no MPI, internal FFT
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = g++
CCFLAGS = -g -O3
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = g++
LINKFLAGS = -g -O
LIB =
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -I../STUBS
MPI_PATH = -L../STUBS
MPI_LIB = -lmpi_stubs
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC =
FFT_PATH =
FFT_LIB =
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
diff --git a/src/MAKE/OPTIONS/Makefile.fftw b/src/MAKE/OPTIONS/Makefile.fftw
index 8675aa4bf..966f823a5 100755
--- a/src/MAKE/OPTIONS/Makefile.fftw
+++ b/src/MAKE/OPTIONS/Makefile.fftw
@@ -1,113 +1,115 @@
# fftw = default MPI compiler, default MPI, FFTW support
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = mpicxx
CCFLAGS = -g -O3
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = mpicxx
LINKFLAGS = -g -O
LIB =
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -DMPICH_SKIP_MPICXX -DOMPI_SKIP_MPICXX=1
MPI_PATH =
MPI_LIB =
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC = -DFFT_FFTW3 -I/usr/local/include
FFT_PATH = -L/usr/local/lib
FFT_LIB = -lfftw3
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
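
This variant links the double-precision FFTW3 library. LAMMPS can also run its FFTs in single precision by adding -DFFT_SINGLE, in which case the single-precision FFTW3 library must be linked instead, roughly (a sketch, assuming libfftw3f was installed next to libfftw3):

    FFT_INC  = -DFFT_FFTW3 -DFFT_SINGLE -I/usr/local/include
    FFT_PATH = -L/usr/local/lib
    FFT_LIB  = -lfftw3f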
diff --git a/src/MAKE/OPTIONS/Makefile.intel_cpu b/src/MAKE/OPTIONS/Makefile.intel_cpu
index d044f6bd2..7c62515c6 100755
--- a/src/MAKE/OPTIONS/Makefile.intel_cpu
+++ b/src/MAKE/OPTIONS/Makefile.intel_cpu
@@ -1,115 +1,117 @@
# intel_cpu = USER-INTEL package with CPU optimizations, Intel MPI, MKL FFT
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = mpiicpc
CCFLAGS = -g -O3 -openmp -DLAMMPS_MEMALIGN=64 -no-offload \
-xHost -fno-alias -ansi-alias -restrict -override-limits
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = mpiicpc
LINKFLAGS = -g -O3 -openmp -xHost
LIB =
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP -DLAMMPS_JPEG
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -DMPICH_SKIP_MPICXX -DOMPI_SKIP_MPICXX=1
MPI_PATH =
MPI_LIB =
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC = -DFFT_MKL -DFFT_SINGLE
FFT_PATH =
FFT_LIB = -L$(MKLROOT)/lib/intel64/ -lmkl_intel_ilp64 \
-lmkl_intel_thread -lmkl_core
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB = -ljpeg
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
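
The MKL line above links the threaded MKL layer (-lmkl_intel_thread), which relies on the OpenMP runtime already requested in LINKFLAGS. If a thread-free MKL is preferred, the usual alternative is the sequential layer, roughly (a sketch of the standard MKL link-line combination, not part of this patch):

    FFT_LIB = -L$(MKLROOT)/lib/intel64/ -lmkl_intel_ilp64 \
              -lmkl_sequential -lmkl_core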
diff --git a/src/MAKE/OPTIONS/Makefile.intel_phi b/src/MAKE/OPTIONS/Makefile.intel_phi
index edba1df1b..ad9b0d35a 100755
--- a/src/MAKE/OPTIONS/Makefile.intel_phi
+++ b/src/MAKE/OPTIONS/Makefile.intel_phi
@@ -1,116 +1,118 @@
# intel_phi = USER-INTEL package with Phi offload support, Intel MPI, MKL FFT
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = mpiicpc
MIC_OPT = -offload-option,mic,compiler,"-fp-model fast=2 -mGLOB_default_function_attrs=\"gather_scatter_loop_unroll=4\""
CCFLAGS = -g -O3 -openmp -DLMP_INTEL_OFFLOAD -DLAMMPS_MEMALIGN=64 \
-xHost -fno-alias -ansi-alias -restrict \
-override-limits $(MIC_OPT)
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = mpiicpc
LINKFLAGS = -g -O3 -xHost -openmp -offload
LIB =
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP -DLAMMPS_JPEG
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -DMPICH_SKIP_MPICXX -DOMPI_SKIP_MPICXX=1
MPI_PATH =
MPI_LIB =
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC = -DFFT_MKL -DFFT_SINGLE
FFT_PATH =
FFT_LIB = -L$(MKLROOT)/lib/intel64/ -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB = -ljpeg
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
diff --git a/src/MAKE/OPTIONS/Makefile.jpeg b/src/MAKE/OPTIONS/Makefile.jpeg
index b95868a4b..8cc5e86a6 100755
--- a/src/MAKE/OPTIONS/Makefile.jpeg
+++ b/src/MAKE/OPTIONS/Makefile.jpeg
@@ -1,113 +1,115 @@
# jpeg = default MPI compiler, default MPI, JPEG support
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = mpicxx
CCFLAGS = -g -O3
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = mpicxx
LINKFLAGS = -g -O
LIB =
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP -DLAMMPS_JPEG
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -DMPICH_SKIP_MPICXX -DOMPI_SKIP_MPICXX=1
MPI_PATH =
MPI_LIB =
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC =
FFT_PATH =
FFT_LIB =
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC = -I/usr/include
JPG_PATH = -L/usr/lib
JPG_LIB = -ljpeg
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
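
The same pattern extends to PNG output: add -DLAMMPS_PNG next to -DLAMMPS_JPEG and list both libraries, as the Ubuntu makefile above already does. A sketch, assuming system-wide libjpeg and libpng installations:

    LMP_INC  = -DLAMMPS_GZIP -DLAMMPS_JPEG -DLAMMPS_PNG
    JPG_INC  = -I/usr/include
    JPG_PATH = -L/usr/lib
    JPG_LIB  = -ljpeg -lpng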
diff --git a/src/MAKE/OPTIONS/Makefile.kokkos_cuda b/src/MAKE/OPTIONS/Makefile.kokkos_cuda
index e307885f2..b48a122fb 100755
--- a/src/MAKE/OPTIONS/Makefile.kokkos_cuda
+++ b/src/MAKE/OPTIONS/Makefile.kokkos_cuda
@@ -1,115 +1,118 @@
# kokkos_cuda = KOKKOS package with CUDA support, nvcc/mpicxx compiler
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = nvcc -ccbin=mpicxx
-CCFLAGS = -g -O3 -arch=sm_20
+CCFLAGS = -g -O3
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = nvcc -ccbin=mpicxx
LINKFLAGS = -g -O
LIB =
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
-CUDA = yes
-OMP = yes
+KOKKOS_DEVICES = Cuda, OpenMP
+KOKKOS_ARCH = Kepler35
+CUDA_PATH = /usr/local/cuda
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -DMPICH_SKIP_MPICXX -DOMPI_SKIP_MPICXX=1
MPI_PATH =
MPI_LIB =
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC =
FFT_PATH =
FFT_LIB =
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
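
The per-feature switches (CUDA = yes, OMP = yes) are replaced by the standard Kokkos build variables: KOKKOS_DEVICES selects which back ends to compile (here both Cuda and OpenMP), KOKKOS_ARCH names the target hardware so the hard-coded -arch=sm_20 in CCFLAGS is no longer needed (Kepler35 corresponds to compute capability 3.5), and CUDA_PATH points nvcc at the CUDA toolkit. Targeting a different GPU generation only means changing the architecture entry, e.g. (a sketch; Kepler37 is the Kokkos name for sm_37, Tesla K80-class cards):

    KOKKOS_DEVICES = Cuda, OpenMP
    KOKKOS_ARCH = Kepler37
    CUDA_PATH = /usr/local/cuda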
diff --git a/src/MAKE/OPTIONS/Makefile.kokkos_omp b/src/MAKE/OPTIONS/Makefile.kokkos_omp
index 68323fbf1..90fc7baa5 100644
--- a/src/MAKE/OPTIONS/Makefile.kokkos_omp
+++ b/src/MAKE/OPTIONS/Makefile.kokkos_omp
@@ -1,114 +1,116 @@
# kokkos_omp = KOKKOS package with OMP support, MPI compiler, default MPI
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = mpicxx
CCFLAGS = -g -O3
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = mpicxx
LINKFLAGS = -g -O
LIB =
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
-OMP = yes
+KOKKOS_DEVICES = OpenMP
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -DMPICH_SKIP_MPICXX -DOMPI_SKIP_MPICXX=1
MPI_PATH =
MPI_LIB =
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC =
FFT_PATH =
FFT_LIB =
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
diff --git a/src/MAKE/OPTIONS/Makefile.kokkos_phi b/src/MAKE/OPTIONS/Makefile.kokkos_phi
index c193b3d5a..cbc220458 100644
--- a/src/MAKE/OPTIONS/Makefile.kokkos_phi
+++ b/src/MAKE/OPTIONS/Makefile.kokkos_phi
@@ -1,115 +1,117 @@
# kokkos_phi = KOKKOS package with PHI support, MPI compiler, default MPI
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = mpicxx
CCFLAGS = -g -O3
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = mpicxx
LINKFLAGS = -g -O
LIB =
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
-MIC = yes
-OMP = yes
+KOKKOS_DEVICES = OpenMP
+KOKKOS_ARCH = KNC
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -DMPICH_SKIP_MPICXX -DOMPI_SKIP_MPICXX=1
MPI_PATH =
MPI_LIB =
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC =
FFT_PATH =
FFT_LIB =
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
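
Makefile.kokkos_phi drops the old MIC/OMP toggles in favor of the Kokkos build variables, which select the Kokkos backend and target architecture directly. With the values in this patch the KOKKOS package is built with its OpenMP backend for an Intel Xeon Phi (Knights Corner) coprocessor; other targets follow the same pattern, sketched below (the commented-out values are illustrative and not part of this patch):

KOKKOS_DEVICES = OpenMP        # threading backend used by the KOKKOS package
KOKKOS_ARCH    = KNC           # cross-compile for Knights Corner
# a CUDA build would instead set, for example:
# KOKKOS_DEVICES = Cuda, OpenMP
# KOKKOS_ARCH    = Kepler35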
diff --git a/src/MAKE/OPTIONS/Makefile.mpich_g++ b/src/MAKE/OPTIONS/Makefile.mpich_g++
index 4ffabf74e..0be96f925 100755
--- a/src/MAKE/OPTIONS/Makefile.mpich_g++
+++ b/src/MAKE/OPTIONS/Makefile.mpich_g++
@@ -1,113 +1,115 @@
# mpich_g++ = g++ compiler and MPICH via MPI wrapper
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = mpicxx -cxx=g++
CCFLAGS = -g -O3
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = mpicxx -cxx=g++
LINKFLAGS = -g -O
LIB =
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -DMPICH_SKIP_MPICXX -DOMPI_SKIP_MPICXX=1
MPI_PATH =
MPI_LIB =
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC =
FFT_PATH =
FFT_LIB =
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
diff --git a/src/MAKE/OPTIONS/Makefile.mpich_icc b/src/MAKE/OPTIONS/Makefile.mpich_icc
index baf07d08c..a56f5b7a2 100755
--- a/src/MAKE/OPTIONS/Makefile.mpich_icc
+++ b/src/MAKE/OPTIONS/Makefile.mpich_icc
@@ -1,113 +1,115 @@
# mpich_icc = icc compiler and MPICH via MPI wrapper
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = mpicxx -cxx=icc
CCFLAGS = -g -O3
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = mpicxx -cxx=icc
LINKFLAGS = -g -O
LIB =
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -DMPICH_SKIP_MPICXX -DOMPI_SKIP_MPICXX=1
MPI_PATH =
MPI_LIB =
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC =
FFT_PATH =
FFT_LIB =
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
diff --git a/src/MAKE/OPTIONS/Makefile.mpich_native_g++ b/src/MAKE/OPTIONS/Makefile.mpich_native_g++
index 03984459d..90e738fe6 100755
--- a/src/MAKE/OPTIONS/Makefile.mpich_native_g++
+++ b/src/MAKE/OPTIONS/Makefile.mpich_native_g++
@@ -1,113 +1,115 @@
# mpich_native_g++ = g++ compiler, native MPICH w/out wrapper
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = g++
CCFLAGS = -g -O3
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = g++
LINKFLAGS = -g -O
LIB =
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -DMPICH_SKIP_MPICXX -DOMPI_SKIP_MPICXX=1 -I/usr/local/include
MPI_PATH = -L/usr/local/lib
MPI_LIB = -lmpich -lmpl -lpthread
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC =
FFT_PATH =
FFT_LIB =
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
diff --git a/src/MAKE/OPTIONS/Makefile.mpich_native_icc b/src/MAKE/OPTIONS/Makefile.mpich_native_icc
index a7f855903..6f3f26646 100755
--- a/src/MAKE/OPTIONS/Makefile.mpich_native_icc
+++ b/src/MAKE/OPTIONS/Makefile.mpich_native_icc
@@ -1,113 +1,115 @@
# mpich_native_icc = icc compiler, native MPICH w/out wrapper
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = icc
CCFLAGS = -g -O3
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = icc
LINKFLAGS = -g -O
LIB =
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -DMPICH_SKIP_MPICXX -DOMPI_SKIP_MPICXX=1 -I/usr/local/include
MPI_PATH = -L/usr/local/lib
MPI_LIB = -lmpich -lmpl -lpthread
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC =
FFT_PATH =
FFT_LIB =
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
diff --git a/src/MAKE/OPTIONS/Makefile.omp b/src/MAKE/OPTIONS/Makefile.omp
index 674adc587..56310f284 100755
--- a/src/MAKE/OPTIONS/Makefile.omp
+++ b/src/MAKE/OPTIONS/Makefile.omp
@@ -1,113 +1,115 @@
# omp = USER-OMP package with default MPI compiler, default MPI
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = mpicxx
CCFLAGS = -g -O3 -restrict -fopenmp
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = mpicxx
LINKFLAGS = -g -O -fopenmp
LIB =
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -DMPICH_SKIP_MPICXX -DOMPI_SKIP_MPICXX=1
MPI_PATH =
MPI_LIB =
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC =
FFT_PATH =
FFT_LIB =
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
diff --git a/src/MAKE/OPTIONS/Makefile.ompi_g++ b/src/MAKE/OPTIONS/Makefile.ompi_g++
index e29f6b014..1b447689f 100755
--- a/src/MAKE/OPTIONS/Makefile.ompi_g++
+++ b/src/MAKE/OPTIONS/Makefile.ompi_g++
@@ -1,114 +1,116 @@
# ompi_g++ = g++ compiler and OpenMPI via MPI wrapper
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
OMPI_CXX := g++
CC = mpicxx
CCFLAGS = -g -O3
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = mpicxx
LINKFLAGS = -g -O
LIB =
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -DMPICH_SKIP_MPICXX -DOMPI_SKIP_MPICXX=1
MPI_PATH =
MPI_LIB =
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC =
FFT_PATH =
FFT_LIB =
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
diff --git a/src/MAKE/OPTIONS/Makefile.ompi_icc b/src/MAKE/OPTIONS/Makefile.ompi_icc
index 71aa95661..a5db57c95 100755
--- a/src/MAKE/OPTIONS/Makefile.ompi_icc
+++ b/src/MAKE/OPTIONS/Makefile.ompi_icc
@@ -1,114 +1,116 @@
# ompi_icc = icc compiler and OpenMPI via MPI wrapper
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
OMPI_CXX := icc
CC = mpicxx
CCFLAGS = -g -O3
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = mpicxx
LINKFLAGS = -g -O
LIB =
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -DMPICH_SKIP_MPICXX -DOMPI_SKIP_MPICXX=1
MPI_PATH =
MPI_LIB =
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC =
FFT_PATH =
FFT_LIB =
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
diff --git a/src/MAKE/OPTIONS/Makefile.ompi_native_g++ b/src/MAKE/OPTIONS/Makefile.ompi_native_g++
index 426fda6c2..01c4c2fac 100755
--- a/src/MAKE/OPTIONS/Makefile.ompi_native_g++
+++ b/src/MAKE/OPTIONS/Makefile.ompi_native_g++
@@ -1,113 +1,115 @@
# ompi_native_g++ = g++ compiler, native OpenMPI w/out wrapper
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = g++
CCFLAGS = -g -O3
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = g++
LINKFLAGS = -g -O
LIB =
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -DMPICH_SKIP_MPICXX -DOMPI_SKIP_MPICXX=1 -I/usr/local/include
MPI_PATH = -L/usr/local/lib
MPI_LIB = -lmpi -lmpi_cxx
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC =
FFT_PATH =
FFT_LIB =
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
diff --git a/src/MAKE/OPTIONS/Makefile.ompi_native_icc b/src/MAKE/OPTIONS/Makefile.ompi_native_icc
index 3f8dd19f8..89e7df222 100755
--- a/src/MAKE/OPTIONS/Makefile.ompi_native_icc
+++ b/src/MAKE/OPTIONS/Makefile.ompi_native_icc
@@ -1,113 +1,115 @@
# ompi_native_icc = icc compiler, native OpenMPI w/out wrapper
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = icc
CCFLAGS = -g -O3
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = icc
LINKFLAGS = -g -O
LIB =
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -DMPICH_SKIP_MPICXX -DOMPI_SKIP_MPICXX=1 -I/usr/local/include
MPI_PATH = -L/usr/local/lib
MPI_LIB = -lmpi -lmpi_cxx
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC =
FFT_PATH =
FFT_LIB =
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
diff --git a/src/MAKE/OPTIONS/Makefile.opt b/src/MAKE/OPTIONS/Makefile.opt
index a8d6ecb9a..49e93fb7c 100755
--- a/src/MAKE/OPTIONS/Makefile.opt
+++ b/src/MAKE/OPTIONS/Makefile.opt
@@ -1,113 +1,115 @@
# opt = OPT package with default MPI compiler, default MPI
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = mpicxx
CCFLAGS = -g -O3 -restrict
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = mpicxx
LINKFLAGS = -g -O
LIB =
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -DMPICH_SKIP_MPICXX -DOMPI_SKIP_MPICXX=1
MPI_PATH =
MPI_LIB =
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC =
FFT_PATH =
FFT_LIB =
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
diff --git a/src/MAKE/OPTIONS/Makefile.pgi b/src/MAKE/OPTIONS/Makefile.pgi
index e8aaba378..89faf682a 100644
--- a/src/MAKE/OPTIONS/Makefile.pgi
+++ b/src/MAKE/OPTIONS/Makefile.pgi
@@ -1,114 +1,116 @@
# pgi = Portland Group compiler pgCC, MPICH, FFTW
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = pgCC
CCFLAGS = -g -fast
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = pgCC
LINKFLAGS = -g
LIB =
#LIB = -lstdc++
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -I/usr/local/mpich-1.2.6/pg/include
MPI_PATH = -L/usr/local/mpich-1.2.6/pg/lib
MPI_LIB = -lmpich
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC = -DFFT_FFTW3
FFT_PATH =
FFT_LIB = -lfftw3
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
diff --git a/src/MAKE/OPTIONS/Makefile.png b/src/MAKE/OPTIONS/Makefile.png
index 30d07f316..26e25583d 100755
--- a/src/MAKE/OPTIONS/Makefile.png
+++ b/src/MAKE/OPTIONS/Makefile.png
@@ -1,113 +1,115 @@
# png = default MPI compiler, default MPI, PNG support
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = mpicxx
CCFLAGS = -g -O3
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = mpicxx
LINKFLAGS = -g -O
LIB =
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP -DLAMMPS_PNG
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -DMPICH_SKIP_MPICXX -DOMPI_SKIP_MPICXX=1
MPI_PATH =
MPI_LIB =
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC =
FFT_PATH =
FFT_LIB =
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC = -I/usr/include
JPG_PATH = -L/usr/lib
JPG_LIB = -lpng
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
diff --git a/src/MAKE/OPTIONS/Makefile.serial_icc b/src/MAKE/OPTIONS/Makefile.serial_icc
index eaf04ce3a..ddc9a2c14 100755
--- a/src/MAKE/OPTIONS/Makefile.serial_icc
+++ b/src/MAKE/OPTIONS/Makefile.serial_icc
@@ -1,113 +1,115 @@
# serial_icc = icc compiler, no MPI
SHELL = /bin/sh
# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler
CC = icc
CCFLAGS = -g -O3
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = icc
LINKFLAGS = -g -O
LIB =
SIZE = size
ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL
# specify settings for LAMMPS features you will use
# if you change any -D setting, do full re-compile after "make clean"
# LAMMPS ifdef settings
# see possible settings in Section 2.2 (step 4) of manual
LMP_INC = -DLAMMPS_GZIP
# MPI library
# see discussion in Section 2.2 (step 5) of manual
# MPI wrapper compiler/linker can provide this info
# can point to dummy MPI library in src/STUBS as in Makefile.serial
# use -D MPICH and OMPI settings in INC to avoid C++ lib conflicts
# INC = path for mpi.h, MPI compiler settings
# PATH = path for MPI library
# LIB = name of MPI library
MPI_INC = -I../STUBS
MPI_PATH = -L../STUBS
MPI_LIB = -lmpi_stubs
# FFT library
# see discussion in Section 2.2 (step 6) of manual
# can be left blank to use provided KISS FFT library
# INC = -DFFT setting, e.g. -DFFT_FFTW, FFT compiler settings
# PATH = path for FFT library
# LIB = name of FFT library
FFT_INC =
FFT_PATH =
FFT_LIB =
# JPEG and/or PNG library
# see discussion in Section 2.2 (step 7) of manual
# only needed if -DLAMMPS_JPEG or -DLAMMPS_PNG listed with LMP_INC
# INC = path(s) for jpeglib.h and/or png.h
# PATH = path(s) for JPEG library and/or PNG library
# LIB = name(s) of JPEG library and/or PNG library
JPG_INC =
JPG_PATH =
JPG_LIB =
# ---------------------------------------------------------------------
# build rules and dependencies
# do not edit this section
include Makefile.package.settings
include Makefile.package
EXTRA_INC = $(LMP_INC) $(PKG_INC) $(MPI_INC) $(FFT_INC) $(JPG_INC) $(PKG_SYSINC)
EXTRA_PATH = $(PKG_PATH) $(MPI_PATH) $(FFT_PATH) $(JPG_PATH) $(PKG_SYSPATH)
EXTRA_LIB = $(PKG_LIB) $(MPI_LIB) $(FFT_LIB) $(JPG_LIB) $(PKG_SYSLIB)
+EXTRA_CPP_DEPENDS = $(PKG_CPP_DEPENDS)
+EXTRA_LINK_DEPENDS = $(PKG_LINK_DEPENDS)
# Path to src files
vpath %.cpp ..
vpath %.h ..
# Link target
-$(EXE): $(OBJ)
+$(EXE): $(OBJ) $(EXTRA_LINK_DEPENDS)
$(LINK) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(EXTRA_LIB) $(LIB) -o $(EXE)
$(SIZE) $(EXE)
# Library targets
-lib: $(OBJ)
+lib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)
-shlib: $(OBJ)
+shlib: $(OBJ) $(EXTRA_LINK_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(SHLIBFLAGS) $(EXTRA_PATH) -o $(EXE) \
$(OBJ) $(EXTRA_LIB) $(LIB)
# Compilation rules
-%.o:%.cpp
+%.o:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
-%.d:%.cpp
+%.d:%.cpp $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(EXTRA_INC) $(DEPFLAGS) $< > $@
-%.o:%.cu
+%.o:%.cu $(EXTRA_CPP_DEPENDS)
$(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c $<
# Individual dependencies
DEPENDS = $(OBJ:.o=.d)
sinclude $(DEPENDS)
diff --git a/src/Makefile.package.empty b/src/Makefile.package.empty
index 0d8e6c175..d421877e2 100644
--- a/src/Makefile.package.empty
+++ b/src/Makefile.package.empty
@@ -1,10 +1,12 @@
# Settings for libraries used by specific LAMMPS packages
# this file is auto-edited when those packages are included/excluded
PKG_INC =
PKG_PATH =
PKG_LIB =
+PKG_CPP_DEPENDS =
+PKG_LINK_DEPENDS =
PKG_SYSINC =
PKG_SYSLIB =
PKG_SYSPATH =
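
Makefile.package.empty now carries empty defaults for the two new hooks, so a freshly reset Makefile.package stays compatible with the updated machine makefiles. A hypothetical example of what a package install step might write into Makefile.package (package name and paths are made up for illustration; the patch itself only adds the empty defaults):

PKG_INC  = -I../../lib/mypkg
PKG_PATH = -L../../lib/mypkg
PKG_LIB  = -lmypkg
# new hooks: rebuild objects when the package header changes,
# relink when the prebuilt package library changes
PKG_CPP_DEPENDS  = ../../lib/mypkg/mypkg.h
PKG_LINK_DEPENDS = ../../lib/mypkg/libmypkg.a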
diff --git a/src/RIGID/fix_rigid.cpp b/src/RIGID/fix_rigid.cpp
index 05819550d..4c3c6b8f2 100644
--- a/src/RIGID/fix_rigid.cpp
+++ b/src/RIGID/fix_rigid.cpp
@@ -1,2594 +1,2606 @@
/* ----------------------------------------------------------------------
LAMMPS - Large-scale Atomic/Molecular Massively Parallel Simulator
http://lammps.sandia.gov, Sandia National Laboratories
Steve Plimpton, sjplimp@sandia.gov
Copyright (2003) Sandia Corporation. Under the terms of Contract
DE-AC04-94AL85000 with Sandia Corporation, the U.S. Government retains
certain rights in this software. This software is distributed under
the GNU General Public License.
See the README file in the top-level LAMMPS directory.
------------------------------------------------------------------------- */
#include "math.h"
#include "stdio.h"
#include "stdlib.h"
#include "string.h"
#include "fix_rigid.h"
#include "math_extra.h"
#include "atom.h"
#include "atom_vec_ellipsoid.h"
#include "atom_vec_line.h"
#include "atom_vec_tri.h"
#include "domain.h"
#include "update.h"
#include "respa.h"
#include "modify.h"
#include "group.h"
#include "comm.h"
#include "random_mars.h"
#include "force.h"
#include "output.h"
#include "math_const.h"
#include "memory.h"
#include "error.h"
using namespace LAMMPS_NS;
using namespace FixConst;
using namespace MathConst;
enum{SINGLE,MOLECULE,GROUP};
enum{NONE,XYZ,XY,YZ,XZ};
enum{ISO,ANISO,TRICLINIC};
#define MAXLINE 1024
#define CHUNK 1024
-#define ATTRIBUTE_PERBODY 17
+#define ATTRIBUTE_PERBODY 20
#define TOLERANCE 1.0e-6
#define EPSILON 1.0e-7
#define SINERTIA 0.4 // moment of inertia prefactor for sphere
#define EINERTIA 0.4 // moment of inertia prefactor for ellipsoid
#define LINERTIA (1.0/12.0) // moment of inertia prefactor for line segment
/* ---------------------------------------------------------------------- */
FixRigid::FixRigid(LAMMPS *lmp, int narg, char **arg) :
Fix(lmp, narg, arg)
{
int i,ibody;
scalar_flag = 1;
extscalar = 0;
time_integrate = 1;
rigid_flag = 1;
virial_flag = 1;
create_attribute = 1;
dof_flag = 1;
MPI_Comm_rank(world,&me);
MPI_Comm_size(world,&nprocs);
// perform initial allocation of atom-based arrays
// register with Atom class
extended = orientflag = dorientflag = 0;
body = NULL;
xcmimage = NULL;
displace = NULL;
eflags = NULL;
orient = NULL;
dorient = NULL;
grow_arrays(atom->nmax);
atom->add_callback(0);
// parse args for rigid body specification
// set nbody and body[i] for each atom
if (narg < 4) error->all(FLERR,"Illegal fix rigid command");
int iarg;
mol2body = NULL;
body2mol = NULL;
// single rigid body
// nbody = 1
// all atoms in fix group are part of body
if (strcmp(arg[3],"single") == 0) {
rstyle = SINGLE;
iarg = 4;
nbody = 1;
int *mask = atom->mask;
int nlocal = atom->nlocal;
for (i = 0; i < nlocal; i++) {
body[i] = -1;
if (mask[i] & groupbit) body[i] = 0;
}
// each molecule in fix group is a rigid body
// maxmol = largest molecule ID
// ncount = # of atoms in each molecule (have to sum across procs)
// nbody = # of non-zero ncount values
// use nall as incremented ptr to set body[] values for each atom
} else if (strcmp(arg[3],"molecule") == 0) {
rstyle = MOLECULE;
iarg = 4;
if (atom->molecule_flag == 0)
error->all(FLERR,"Fix rigid molecule requires atom attribute molecule");
int *mask = atom->mask;
tagint *molecule = atom->molecule;
int nlocal = atom->nlocal;
tagint maxmol_tag = -1;
for (i = 0; i < nlocal; i++)
if (mask[i] & groupbit) maxmol_tag = MAX(maxmol_tag,molecule[i]);
tagint itmp;
MPI_Allreduce(&maxmol_tag,&itmp,1,MPI_LMP_TAGINT,MPI_MAX,world);
if (itmp+1 > MAXSMALLINT)
error->all(FLERR,"Too many molecules for fix rigid");
maxmol = (int) itmp;
int *ncount;
memory->create(ncount,maxmol+1,"rigid:ncount");
for (i = 0; i <= maxmol; i++) ncount[i] = 0;
for (i = 0; i < nlocal; i++)
if (mask[i] & groupbit) ncount[molecule[i]]++;
memory->create(mol2body,maxmol+1,"rigid:mol2body");
MPI_Allreduce(ncount,mol2body,maxmol+1,MPI_INT,MPI_SUM,world);
nbody = 0;
for (i = 0; i <= maxmol; i++)
if (mol2body[i]) mol2body[i] = nbody++;
else mol2body[i] = -1;
memory->create(body2mol,nbody,"rigid:body2mol");
nbody = 0;
for (i = 0; i <= maxmol; i++)
if (mol2body[i] >= 0) body2mol[nbody++] = i;
for (i = 0; i < nlocal; i++) {
body[i] = -1;
if (mask[i] & groupbit) body[i] = mol2body[molecule[i]];
}
memory->destroy(ncount);
// each listed group is a rigid body
// check if all listed groups exist
// an atom must belong to fix group and listed group to be in rigid body
// error if atom belongs to more than 1 rigid body
} else if (strcmp(arg[3],"group") == 0) {
if (narg < 5) error->all(FLERR,"Illegal fix rigid command");
rstyle = GROUP;
nbody = force->inumeric(FLERR,arg[4]);
if (nbody <= 0) error->all(FLERR,"Illegal fix rigid command");
if (narg < 5+nbody) error->all(FLERR,"Illegal fix rigid command");
iarg = 5+nbody;
int *igroups = new int[nbody];
for (ibody = 0; ibody < nbody; ibody++) {
igroups[ibody] = group->find(arg[5+ibody]);
if (igroups[ibody] == -1)
error->all(FLERR,"Could not find fix rigid group ID");
}
int *mask = atom->mask;
int nlocal = atom->nlocal;
int flag = 0;
for (i = 0; i < nlocal; i++) {
body[i] = -1;
if (mask[i] & groupbit)
for (ibody = 0; ibody < nbody; ibody++)
if (mask[i] & group->bitmask[igroups[ibody]]) {
if (body[i] >= 0) flag = 1;
body[i] = ibody;
}
}
int flagall;
MPI_Allreduce(&flag,&flagall,1,MPI_INT,MPI_SUM,world);
if (flagall)
error->all(FLERR,"One or more atoms belong to multiple rigid bodies");
delete [] igroups;
} else error->all(FLERR,"Illegal fix rigid command");
// error check on nbody
if (nbody == 0) error->all(FLERR,"No rigid bodies defined");
// create all nbody-length arrays
memory->create(nrigid,nbody,"rigid:nrigid");
memory->create(masstotal,nbody,"rigid:masstotal");
memory->create(xcm,nbody,3,"rigid:xcm");
memory->create(vcm,nbody,3,"rigid:vcm");
memory->create(fcm,nbody,3,"rigid:fcm");
memory->create(inertia,nbody,3,"rigid:inertia");
memory->create(ex_space,nbody,3,"rigid:ex_space");
memory->create(ey_space,nbody,3,"rigid:ey_space");
memory->create(ez_space,nbody,3,"rigid:ez_space");
memory->create(angmom,nbody,3,"rigid:angmom");
memory->create(omega,nbody,3,"rigid:omega");
memory->create(torque,nbody,3,"rigid:torque");
memory->create(quat,nbody,4,"rigid:quat");
memory->create(imagebody,nbody,"rigid:imagebody");
memory->create(fflag,nbody,3,"rigid:fflag");
memory->create(tflag,nbody,3,"rigid:tflag");
memory->create(langextra,nbody,6,"rigid:langextra");
memory->create(sum,nbody,6,"rigid:sum");
memory->create(all,nbody,6,"rigid:all");
memory->create(remapflag,nbody,4,"rigid:remapflag");
// initialize force/torque flags to default = 1.0
// for 2d: fz, tx, ty = 0.0
array_flag = 1;
size_array_rows = nbody;
size_array_cols = 15;
global_freq = 1;
extarray = 0;
for (i = 0; i < nbody; i++) {
fflag[i][0] = fflag[i][1] = fflag[i][2] = 1.0;
tflag[i][0] = tflag[i][1] = tflag[i][2] = 1.0;
if (domain->dimension == 2) fflag[i][2] = tflag[i][0] = tflag[i][1] = 0.0;
}
// number of linear rigid bodies is counted later
nlinear = 0;
// parse optional args
int seed;
langflag = 0;
tstat_flag = 0;
pstat_flag = 0;
allremap = 1;
id_dilate = NULL;
t_chain = 10;
t_iter = 1;
t_order = 3;
p_chain = 10;
infile = NULL;
pcouple = NONE;
pstyle = ANISO;
dimension = domain->dimension;
for (int i = 0; i < 3; i++) {
p_start[i] = p_stop[i] = p_period[i] = 0.0;
p_flag[i] = 0;
}
while (iarg < narg) {
if (strcmp(arg[iarg],"force") == 0) {
if (iarg+5 > narg) error->all(FLERR,"Illegal fix rigid command");
int mlo,mhi;
force->bounds(arg[iarg+1],nbody,mlo,mhi);
double xflag,yflag,zflag;
if (strcmp(arg[iarg+2],"off") == 0) xflag = 0.0;
else if (strcmp(arg[iarg+2],"on") == 0) xflag = 1.0;
else error->all(FLERR,"Illegal fix rigid command");
if (strcmp(arg[iarg+3],"off") == 0) yflag = 0.0;
else if (strcmp(arg[iarg+3],"on") == 0) yflag = 1.0;
else error->all(FLERR,"Illegal fix rigid command");
if (strcmp(arg[iarg+4],"off") == 0) zflag = 0.0;
else if (strcmp(arg[iarg+4],"on") == 0) zflag = 1.0;
else error->all(FLERR,"Illegal fix rigid command");
if (domain->dimension == 2 && zflag == 1.0)
error->all(FLERR,"Fix rigid z force cannot be on for 2d simulation");
int count = 0;
for (int m = mlo; m <= mhi; m++) {
fflag[m-1][0] = xflag;
fflag[m-1][1] = yflag;
fflag[m-1][2] = zflag;
count++;
}
if (count == 0) error->all(FLERR,"Illegal fix rigid command");
iarg += 5;
} else if (strcmp(arg[iarg],"torque") == 0) {
if (iarg+5 > narg) error->all(FLERR,"Illegal fix rigid command");
int mlo,mhi;
force->bounds(arg[iarg+1],nbody,mlo,mhi);
double xflag,yflag,zflag;
if (strcmp(arg[iarg+2],"off") == 0) xflag = 0.0;
else if (strcmp(arg[iarg+2],"on") == 0) xflag = 1.0;
else error->all(FLERR,"Illegal fix rigid command");
if (strcmp(arg[iarg+3],"off") == 0) yflag = 0.0;
else if (strcmp(arg[iarg+3],"on") == 0) yflag = 1.0;
else error->all(FLERR,"Illegal fix rigid command");
if (strcmp(arg[iarg+4],"off") == 0) zflag = 0.0;
else if (strcmp(arg[iarg+4],"on") == 0) zflag = 1.0;
else error->all(FLERR,"Illegal fix rigid command");
if (domain->dimension == 2 && (xflag == 1.0 || yflag == 1.0))
error->all(FLERR,"Fix rigid xy torque cannot be on for 2d simulation");
int count = 0;
for (int m = mlo; m <= mhi; m++) {
tflag[m-1][0] = xflag;
tflag[m-1][1] = yflag;
tflag[m-1][2] = zflag;
count++;
}
if (count == 0) error->all(FLERR,"Illegal fix rigid command");
iarg += 5;
} else if (strcmp(arg[iarg],"langevin") == 0) {
if (iarg+5 > narg) error->all(FLERR,"Illegal fix rigid command");
if (strcmp(style,"rigid") != 0 && strcmp(style,"rigid/nve") != 0 &&
strcmp(style,"rigid/omp") != 0 && strcmp(style,"rigid/nve/omp") != 0)
error->all(FLERR,"Illegal fix rigid command");
langflag = 1;
t_start = force->numeric(FLERR,arg[iarg+1]);
t_stop = force->numeric(FLERR,arg[iarg+2]);
t_period = force->numeric(FLERR,arg[iarg+3]);
seed = force->inumeric(FLERR,arg[iarg+4]);
if (t_period <= 0.0)
error->all(FLERR,"Fix rigid langevin period must be > 0.0");
if (seed <= 0) error->all(FLERR,"Illegal fix rigid command");
iarg += 5;
} else if (strcmp(arg[iarg],"temp") == 0) {
if (iarg+4 > narg) error->all(FLERR,"Illegal fix rigid command");
if (strcmp(style,"rigid/nvt") != 0 && strcmp(style,"rigid/npt") != 0 &&
strcmp(style,"rigid/nvt/omp") != 0 &&
strcmp(style,"rigid/npt/omp") != 0)
error->all(FLERR,"Illegal fix rigid command");
tstat_flag = 1;
t_start = force->numeric(FLERR,arg[iarg+1]);
t_stop = force->numeric(FLERR,arg[iarg+2]);
t_period = force->numeric(FLERR,arg[iarg+3]);
iarg += 4;
} else if (strcmp(arg[iarg],"iso") == 0) {
if (iarg+4 > narg) error->all(FLERR,"Illegal fix rigid command");
if (strcmp(style,"rigid/npt") != 0 && strcmp(style,"rigid/nph") != 0 &&
strcmp(style,"rigid/npt/omp") != 0 &&
strcmp(style,"rigid/nph/omp") != 0)
error->all(FLERR,"Illegal fix rigid command");
pcouple = XYZ;
p_start[0] = p_start[1] = p_start[2] = force->numeric(FLERR,arg[iarg+1]);
p_stop[0] = p_stop[1] = p_stop[2] = force->numeric(FLERR,arg[iarg+2]);
p_period[0] = p_period[1] = p_period[2] =
force->numeric(FLERR,arg[iarg+3]);
p_flag[0] = p_flag[1] = p_flag[2] = 1;
if (dimension == 2) {
p_start[2] = p_stop[2] = p_period[2] = 0.0;
p_flag[2] = 0;
}
iarg += 4;
} else if (strcmp(arg[iarg],"aniso") == 0) {
if (iarg+4 > narg) error->all(FLERR,"Illegal fix rigid command");
if (strcmp(style,"rigid/npt") != 0 && strcmp(style,"rigid/nph") != 0 &&
strcmp(style,"rigid/npt/omp") != 0 &&
strcmp(style,"rigid/nph/omp") != 0)
error->all(FLERR,"Illegal fix rigid command");
p_start[0] = p_start[1] = p_start[2] = force->numeric(FLERR,arg[iarg+1]);
p_stop[0] = p_stop[1] = p_stop[2] = force->numeric(FLERR,arg[iarg+2]);
p_period[0] = p_period[1] = p_period[2] =
force->numeric(FLERR,arg[iarg+3]);
p_flag[0] = p_flag[1] = p_flag[2] = 1;
if (dimension == 2) {
p_start[2] = p_stop[2] = p_period[2] = 0.0;
p_flag[2] = 0;
}
iarg += 4;
} else if (strcmp(arg[iarg],"x") == 0) {
if (iarg+4 > narg) error->all(FLERR,"Illegal fix rigid command");
if (strcmp(style,"rigid/npt") != 0 && strcmp(style,"rigid/nph") != 0 &&
strcmp(style,"rigid/npt/omp") != 0 &&
strcmp(style,"rigid/nph/omp") != 0)
error->all(FLERR,"Illegal fix rigid command");
p_start[0] = force->numeric(FLERR,arg[iarg+1]);
p_stop[0] = force->numeric(FLERR,arg[iarg+2]);
p_period[0] = force->numeric(FLERR,arg[iarg+3]);
p_flag[0] = 1;
iarg += 4;
} else if (strcmp(arg[iarg],"y") == 0) {
if (iarg+4 > narg) error->all(FLERR,"Illegal fix rigid command");
if (strcmp(style,"rigid/npt") != 0 && strcmp(style,"rigid/nph") != 0 &&
strcmp(style,"rigid/npt/omp") != 0 &&
strcmp(style,"rigid/nph/omp") != 0)
error->all(FLERR,"Illegal fix rigid command");
p_start[1] = force->numeric(FLERR,arg[iarg+1]);
p_stop[1] = force->numeric(FLERR,arg[iarg+2]);
p_period[1] = force->numeric(FLERR,arg[iarg+3]);
p_flag[1] = 1;
iarg += 4;
} else if (strcmp(arg[iarg],"z") == 0) {
if (iarg+4 > narg) error->all(FLERR,"Illegal fix rigid command");
if (strcmp(style,"rigid/npt") != 0 && strcmp(style,"rigid/nph") != 0 &&
strcmp(style,"rigid/npt/omp") != 0 &&
strcmp(style,"rigid/nph/omp") != 0)
error->all(FLERR,"Illegal fix rigid command");
p_start[2] = force->numeric(FLERR,arg[iarg+1]);
p_stop[2] = force->numeric(FLERR,arg[iarg+2]);
p_period[2] = force->numeric(FLERR,arg[iarg+3]);
p_flag[2] = 1;
iarg += 4;
} else if (strcmp(arg[iarg],"couple") == 0) {
if (iarg+2 > narg) error->all(FLERR,"Illegal fix rigid command");
if (strcmp(arg[iarg+1],"xyz") == 0) pcouple = XYZ;
else if (strcmp(arg[iarg+1],"xy") == 0) pcouple = XY;
else if (strcmp(arg[iarg+1],"yz") == 0) pcouple = YZ;
else if (strcmp(arg[iarg+1],"xz") == 0) pcouple = XZ;
else if (strcmp(arg[iarg+1],"none") == 0) pcouple = NONE;
else error->all(FLERR,"Illegal fix rigid command");
iarg += 2;
} else if (strcmp(arg[iarg],"dilate") == 0) {
if (iarg+2 > narg)
error->all(FLERR,"Illegal fix rigid npt/nph command");
if (strcmp(arg[iarg+1],"all") == 0) allremap = 1;
else {
allremap = 0;
delete [] id_dilate;
int n = strlen(arg[iarg+1]) + 1;
id_dilate = new char[n];
strcpy(id_dilate,arg[iarg+1]);
int idilate = group->find(id_dilate);
if (idilate == -1)
error->all(FLERR,
"Fix rigid npt/nph dilate group ID does not exist");
}
iarg += 2;
} else if (strcmp(arg[iarg],"tparam") == 0) {
if (iarg+4 > narg) error->all(FLERR,"Illegal fix rigid command");
if (strcmp(style,"rigid/nvt") != 0 && strcmp(style,"rigid/npt") != 0 &&
strcmp(style,"rigid/nvt/omp") != 0 &&
strcmp(style,"rigid/npt/omp") != 0)
error->all(FLERR,"Illegal fix rigid command");
t_chain = force->inumeric(FLERR,arg[iarg+1]);
t_iter = force->inumeric(FLERR,arg[iarg+2]);
t_order = force->inumeric(FLERR,arg[iarg+3]);
iarg += 4;
} else if (strcmp(arg[iarg],"pchain") == 0) {
if (iarg+2 > narg) error->all(FLERR,"Illegal fix rigid command");
if (strcmp(style,"rigid/npt") != 0 && strcmp(style,"rigid/nph") != 0 &&
strcmp(style,"rigid/npt/omp") != 0 &&
strcmp(style,"rigid/nph/omp") != 0)
error->all(FLERR,"Illegal fix rigid command");
p_chain = force->inumeric(FLERR,arg[iarg+1]);
iarg += 2;
} else if (strcmp(arg[iarg],"infile") == 0) {
if (iarg+2 > narg) error->all(FLERR,"Illegal fix rigid command");
delete [] infile;
int n = strlen(arg[iarg+1]) + 1;
infile = new char[n];
strcpy(infile,arg[iarg+1]);
restart_file = 1;
iarg += 2;
} else error->all(FLERR,"Illegal fix rigid command");
}
// set pstat_flag
pstat_flag = 0;
for (int i = 0; i < 3; i++)
if (p_flag[i]) pstat_flag = 1;
if (pcouple == XYZ || (dimension == 2 && pcouple == XY)) pstyle = ISO;
else pstyle = ANISO;
// initialize Marsaglia RNG with processor-unique seed
if (langflag) random = new RanMars(lmp,seed + me);
else random = NULL;
// initialize vector output quantities in case accessed before run
for (i = 0; i < nbody; i++) {
xcm[i][0] = xcm[i][1] = xcm[i][2] = 0.0;
vcm[i][0] = vcm[i][1] = vcm[i][2] = 0.0;
fcm[i][0] = fcm[i][1] = fcm[i][2] = 0.0;
torque[i][0] = torque[i][1] = torque[i][2] = 0.0;
}
// nrigid[n] = # of atoms in Nth rigid body
// error if one or zero atoms
int *ncount = new int[nbody];
for (ibody = 0; ibody < nbody; ibody++) ncount[ibody] = 0;
int nlocal = atom->nlocal;
for (i = 0; i < nlocal; i++)
if (body[i] >= 0) ncount[body[i]]++;
MPI_Allreduce(ncount,nrigid,nbody,MPI_INT,MPI_SUM,world);
delete [] ncount;
for (ibody = 0; ibody < nbody; ibody++)
if (nrigid[ibody] <= 1) error->all(FLERR,"One or zero atoms in rigid body");
// bitmasks for properties of extended particles
POINT = 1;
SPHERE = 2;
ELLIPSOID = 4;
LINE = 8;
TRIANGLE = 16;
DIPOLE = 32;
OMEGA = 64;
ANGMOM = 128;
TORQUE = 256;
MINUSPI = -MY_PI;
TWOPI = 2.0*MY_PI;
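// eflags[i] (set in setup_bodies_static()) is a bitwise OR of the flags above,
// e.g. a finite-size sphere in a body gets SPHERE | OMEGA | TORQUE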
// wait to setup bodies until first init() using current atom properties
setupflag = 0;
// print statistics
int nsum = 0;
for (ibody = 0; ibody < nbody; ibody++) nsum += nrigid[ibody];
if (me == 0) {
if (screen) fprintf(screen,"%d rigid bodies with %d atoms\n",nbody,nsum);
if (logfile) fprintf(logfile,"%d rigid bodies with %d atoms\n",nbody,nsum);
}
}
/* ---------------------------------------------------------------------- */
FixRigid::~FixRigid()
{
// unregister callbacks to this fix from Atom class
atom->delete_callback(id,0);
delete random;
delete [] infile;
memory->destroy(mol2body);
memory->destroy(body2mol);
// delete locally stored per-atom arrays
memory->destroy(body);
memory->destroy(xcmimage);
memory->destroy(displace);
memory->destroy(eflags);
memory->destroy(orient);
memory->destroy(dorient);
// delete nbody-length arrays
memory->destroy(nrigid);
memory->destroy(masstotal);
memory->destroy(xcm);
memory->destroy(vcm);
memory->destroy(fcm);
memory->destroy(inertia);
memory->destroy(ex_space);
memory->destroy(ey_space);
memory->destroy(ez_space);
memory->destroy(angmom);
memory->destroy(omega);
memory->destroy(torque);
memory->destroy(quat);
memory->destroy(imagebody);
memory->destroy(fflag);
memory->destroy(tflag);
memory->destroy(langextra);
memory->destroy(sum);
memory->destroy(all);
memory->destroy(remapflag);
}
/* ---------------------------------------------------------------------- */
int FixRigid::setmask()
{
int mask = 0;
mask |= INITIAL_INTEGRATE;
mask |= FINAL_INTEGRATE;
if (langflag) mask |= POST_FORCE;
mask |= PRE_NEIGHBOR;
mask |= INITIAL_INTEGRATE_RESPA;
mask |= FINAL_INTEGRATE_RESPA;
return mask;
}
/* ---------------------------------------------------------------------- */
void FixRigid::init()
{
int i,ibody;
triclinic = domain->triclinic;
// atom style pointers to particles that store extra info
avec_ellipsoid = (AtomVecEllipsoid *) atom->style_match("ellipsoid");
avec_line = (AtomVecLine *) atom->style_match("line");
avec_tri = (AtomVecTri *) atom->style_match("tri");
// warn if more than one rigid fix
int count = 0;
for (i = 0; i < modify->nfix; i++)
if (strcmp(modify->fix[i]->style,"rigid") == 0) count++;
if (count > 1 && me == 0) error->warning(FLERR,"More than one fix rigid");
// error if npt,nph fix comes before rigid fix
for (i = 0; i < modify->nfix; i++) {
if (strcmp(modify->fix[i]->style,"npt") == 0) break;
if (strcmp(modify->fix[i]->style,"nph") == 0) break;
}
if (i < modify->nfix) {
for (int j = i; j < modify->nfix; j++)
if (strcmp(modify->fix[j]->style,"rigid") == 0)
error->all(FLERR,"Rigid fix must come before NPT/NPH fix");
}
// timestep info
dtv = update->dt;
dtf = 0.5 * update->dt * force->ftm2v;
dtq = 0.5 * update->dt;
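// dtv = full step for xcm updates, dtf = half step scaled by ftm2v so that
// force/mass integrates to velocity, dtq = half step passed to the
// Richardson quaternion update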
if (strstr(update->integrate_style,"respa"))
step_respa = ((Respa *) update->integrate)->step;
// setup rigid bodies, using current atom info
// only do initialization once, b/c properties may not be re-computable
// especially if overlapping particles
// do not do dynamic init if read body properties from infile
// this is b/c the infile defines the static and dynamic properties,
// which may not be computable if the body contains overlapping particles
// setup_bodies_static() reads infile itself
if (!setupflag) {
setup_bodies_static();
if (!infile) setup_bodies_dynamic();
setupflag = 1;
}
// temperature scale factor
double ndof = 0.0;
for (ibody = 0; ibody < nbody; ibody++) {
ndof += fflag[ibody][0] + fflag[ibody][1] + fflag[ibody][2];
ndof += tflag[ibody][0] + tflag[ibody][1] + tflag[ibody][2];
}
ndof -= nlinear;
if (ndof > 0.0) tfactor = force->mvv2e / (ndof * force->boltz);
else tfactor = 0.0;
}
/* ----------------------------------------------------------------------
invoke pre_neighbor() to ensure body xcmimage flags are reset
needed if Verlet::setup::pbc() has remapped/migrated atoms for 2nd run
------------------------------------------------------------------------- */
void FixRigid::setup_pre_neighbor()
{
pre_neighbor();
}
/* ----------------------------------------------------------------------
compute initial fcm and torque on bodies, also initial virial
reset all particle velocities to be consistent with vcm and omega
------------------------------------------------------------------------- */
void FixRigid::setup(int vflag)
{
int i,n,ibody;
// fcm = force on center-of-mass of each rigid body
double **f = atom->f;
int nlocal = atom->nlocal;
for (ibody = 0; ibody < nbody; ibody++)
for (i = 0; i < 6; i++) sum[ibody][i] = 0.0;
for (i = 0; i < nlocal; i++) {
if (body[i] < 0) continue;
ibody = body[i];
sum[ibody][0] += f[i][0];
sum[ibody][1] += f[i][1];
sum[ibody][2] += f[i][2];
}
MPI_Allreduce(sum[0],all[0],6*nbody,MPI_DOUBLE,MPI_SUM,world);
for (ibody = 0; ibody < nbody; ibody++) {
fcm[ibody][0] = all[ibody][0];
fcm[ibody][1] = all[ibody][1];
fcm[ibody][2] = all[ibody][2];
}
// torque = torque on each rigid body
double **x = atom->x;
double dx,dy,dz;
double unwrap[3];
for (ibody = 0; ibody < nbody; ibody++)
for (i = 0; i < 6; i++) sum[ibody][i] = 0.0;
for (i = 0; i < nlocal; i++) {
if (body[i] < 0) continue;
ibody = body[i];
domain->unmap(x[i],xcmimage[i],unwrap);
dx = unwrap[0] - xcm[ibody][0];
dy = unwrap[1] - xcm[ibody][1];
dz = unwrap[2] - xcm[ibody][2];
sum[ibody][0] += dy * f[i][2] - dz * f[i][1];
sum[ibody][1] += dz * f[i][0] - dx * f[i][2];
sum[ibody][2] += dx * f[i][1] - dy * f[i][0];
}
// extended particles add their torque to torque of body
if (extended) {
double **torque_one = atom->torque;
for (i = 0; i < nlocal; i++) {
if (body[i] < 0) continue;
ibody = body[i];
if (eflags[i] & TORQUE) {
sum[ibody][0] += torque_one[i][0];
sum[ibody][1] += torque_one[i][1];
sum[ibody][2] += torque_one[i][2];
}
}
}
MPI_Allreduce(sum[0],all[0],6*nbody,MPI_DOUBLE,MPI_SUM,world);
for (ibody = 0; ibody < nbody; ibody++) {
torque[ibody][0] = all[ibody][0];
torque[ibody][1] = all[ibody][1];
torque[ibody][2] = all[ibody][2];
}
// zero langextra in case Langevin thermostat not used
// no point in calling post_force() here since langextra
// is only added to fcm/torque in final_integrate()
for (ibody = 0; ibody < nbody; ibody++)
for (i = 0; i < 6; i++) langextra[ibody][i] = 0.0;
// virial setup before call to set_v
if (vflag) v_setup(vflag);
else evflag = 0;
// set velocities from angmom & omega
for (ibody = 0; ibody < nbody; ibody++)
MathExtra::angmom_to_omega(angmom[ibody],ex_space[ibody],ey_space[ibody],
ez_space[ibody],inertia[ibody],omega[ibody]);
set_v();
// guesstimate virial as 2x the set_v contribution
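// set_v() tallies only half of each atom's constraint-force virial
// (the other half normally comes from the integrate half-steps),
// so doubling approximates the full virial for the setup step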
if (vflag_global)
for (n = 0; n < 6; n++) virial[n] *= 2.0;
if (vflag_atom) {
for (i = 0; i < nlocal; i++)
for (n = 0; n < 6; n++)
vatom[i][n] *= 2.0;
}
}
/* ---------------------------------------------------------------------- */
void FixRigid::initial_integrate(int vflag)
{
double dtfm;
for (int ibody = 0; ibody < nbody; ibody++) {
// update vcm by 1/2 step
dtfm = dtf / masstotal[ibody];
vcm[ibody][0] += dtfm * fcm[ibody][0] * fflag[ibody][0];
vcm[ibody][1] += dtfm * fcm[ibody][1] * fflag[ibody][1];
vcm[ibody][2] += dtfm * fcm[ibody][2] * fflag[ibody][2];
// update xcm by full step
xcm[ibody][0] += dtv * vcm[ibody][0];
xcm[ibody][1] += dtv * vcm[ibody][1];
xcm[ibody][2] += dtv * vcm[ibody][2];
// update angular momentum by 1/2 step
angmom[ibody][0] += dtf * torque[ibody][0] * tflag[ibody][0];
angmom[ibody][1] += dtf * torque[ibody][1] * tflag[ibody][1];
angmom[ibody][2] += dtf * torque[ibody][2] * tflag[ibody][2];
// compute omega at 1/2 step from angmom at 1/2 step and current q
// update quaternion a full step via Richardson iteration
// returns new normalized quaternion, also updated omega at 1/2 step
// update ex,ey,ez to reflect new quaternion
MathExtra::angmom_to_omega(angmom[ibody],ex_space[ibody],ey_space[ibody],
ez_space[ibody],inertia[ibody],omega[ibody]);
MathExtra::richardson(quat[ibody],angmom[ibody],omega[ibody],
inertia[ibody],dtq);
MathExtra::q_to_exyz(quat[ibody],
ex_space[ibody],ey_space[ibody],ez_space[ibody]);
}
// virial setup before call to set_xv
if (vflag) v_setup(vflag);
else evflag = 0;
// set coords/orient and velocity/rotation of atoms in rigid bodies
// from quaternion and omega
set_xv();
}
/* ----------------------------------------------------------------------
apply Langevin thermostat to all 6 DOF of rigid bodies
computed by proc 0, broadcast to other procs
unlike fix langevin, this stores extra force in extra arrays,
which are added in when final_integrate() calculates a new fcm/torque
------------------------------------------------------------------------- */
void FixRigid::post_force(int vflag)
{
if (me == 0) {
double gamma1,gamma2;
double delta = update->ntimestep - update->beginstep;
if (delta != 0.0) delta /= update->endstep - update->beginstep;
t_target = t_start + delta * (t_stop-t_start);
double tsqrt = sqrt(t_target);
double boltz = force->boltz;
double dt = update->dt;
double mvv2e = force->mvv2e;
double ftm2v = force->ftm2v;
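// per body: gamma1*vcm = drag force, gamma2*(uniform()-0.5) = random kick;
// the 24 inside the sqrt compensates for the 1/12 variance of a uniform
// deviate on [-0.5,0.5], giving the kick strength required by the
// fluctuation-dissipation relation for damping time t_period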
for (int i = 0; i < nbody; i++) {
gamma1 = -masstotal[i] / t_period / ftm2v;
gamma2 = sqrt(masstotal[i]) * tsqrt *
sqrt(24.0*boltz/t_period/dt/mvv2e) / ftm2v;
langextra[i][0] = gamma1*vcm[i][0] + gamma2*(random->uniform()-0.5);
langextra[i][1] = gamma1*vcm[i][1] + gamma2*(random->uniform()-0.5);
langextra[i][2] = gamma1*vcm[i][2] + gamma2*(random->uniform()-0.5);
gamma1 = -1.0 / t_period / ftm2v;
gamma2 = tsqrt * sqrt(24.0*boltz/t_period/dt/mvv2e) / ftm2v;
langextra[i][3] = inertia[i][0]*gamma1*omega[i][0] +
sqrt(inertia[i][0])*gamma2*(random->uniform()-0.5);
langextra[i][4] = inertia[i][1]*gamma1*omega[i][1] +
sqrt(inertia[i][1])*gamma2*(random->uniform()-0.5);
langextra[i][5] = inertia[i][2]*gamma1*omega[i][2] +
sqrt(inertia[i][2])*gamma2*(random->uniform()-0.5);
}
}
MPI_Bcast(&langextra[0][0],6*nbody,MPI_DOUBLE,0,world);
}
/* ---------------------------------------------------------------------- */
void FixRigid::final_integrate()
{
int i,ibody;
double dtfm;
// sum over atoms to get force and torque on rigid body
double **x = atom->x;
double **f = atom->f;
int nlocal = atom->nlocal;
double dx,dy,dz;
double unwrap[3];
for (ibody = 0; ibody < nbody; ibody++)
for (i = 0; i < 6; i++) sum[ibody][i] = 0.0;
for (i = 0; i < nlocal; i++) {
if (body[i] < 0) continue;
ibody = body[i];
sum[ibody][0] += f[i][0];
sum[ibody][1] += f[i][1];
sum[ibody][2] += f[i][2];
domain->unmap(x[i],xcmimage[i],unwrap);
dx = unwrap[0] - xcm[ibody][0];
dy = unwrap[1] - xcm[ibody][1];
dz = unwrap[2] - xcm[ibody][2];
sum[ibody][3] += dy*f[i][2] - dz*f[i][1];
sum[ibody][4] += dz*f[i][0] - dx*f[i][2];
sum[ibody][5] += dx*f[i][1] - dy*f[i][0];
}
// extended particles add their torque to torque of body
if (extended) {
double **torque_one = atom->torque;
for (i = 0; i < nlocal; i++) {
if (body[i] < 0) continue;
ibody = body[i];
if (eflags[i] & TORQUE) {
sum[ibody][3] += torque_one[i][0];
sum[ibody][4] += torque_one[i][1];
sum[ibody][5] += torque_one[i][2];
}
}
}
MPI_Allreduce(sum[0],all[0],6*nbody,MPI_DOUBLE,MPI_SUM,world);
// update vcm and angmom
// include Langevin thermostat forces
// fflag,tflag = 0 for some dimensions in 2d
for (ibody = 0; ibody < nbody; ibody++) {
fcm[ibody][0] = all[ibody][0] + langextra[ibody][0];
fcm[ibody][1] = all[ibody][1] + langextra[ibody][1];
fcm[ibody][2] = all[ibody][2] + langextra[ibody][2];
torque[ibody][0] = all[ibody][3] + langextra[ibody][3];
torque[ibody][1] = all[ibody][4] + langextra[ibody][4];
torque[ibody][2] = all[ibody][5] + langextra[ibody][5];
// update vcm by 1/2 step
dtfm = dtf / masstotal[ibody];
vcm[ibody][0] += dtfm * fcm[ibody][0] * fflag[ibody][0];
vcm[ibody][1] += dtfm * fcm[ibody][1] * fflag[ibody][1];
vcm[ibody][2] += dtfm * fcm[ibody][2] * fflag[ibody][2];
// update angular momentum by 1/2 step
angmom[ibody][0] += dtf * torque[ibody][0] * tflag[ibody][0];
angmom[ibody][1] += dtf * torque[ibody][1] * tflag[ibody][1];
angmom[ibody][2] += dtf * torque[ibody][2] * tflag[ibody][2];
MathExtra::angmom_to_omega(angmom[ibody],ex_space[ibody],ey_space[ibody],
ez_space[ibody],inertia[ibody],omega[ibody]);
}
// set velocity/rotation of atoms in rigid bodies
// virial is already setup from initial_integrate
set_v();
}
/* ---------------------------------------------------------------------- */
void FixRigid::initial_integrate_respa(int vflag, int ilevel, int iloop)
{
dtv = step_respa[ilevel];
dtf = 0.5 * step_respa[ilevel] * force->ftm2v;
dtq = 0.5 * step_respa[ilevel];
if (ilevel == 0) initial_integrate(vflag);
else final_integrate();
}
/* ---------------------------------------------------------------------- */
void FixRigid::final_integrate_respa(int ilevel, int iloop)
{
dtf = 0.5 * step_respa[ilevel] * force->ftm2v;
final_integrate();
}
/* ----------------------------------------------------------------------
remap xcm of each rigid body back into periodic simulation box
done during pre_neighbor so will be after call to pbc()
and after fix_deform::pre_exchange() may have flipped box
use domain->remap() in case xcm is far away from box
due to first-time definition of rigid body in setup_bodies_static()
or due to box flip
also adjust imagebody = rigid body image flags, due to xcm remap
also reset body xcmimage flags of all atoms in bodies
xcmimage flags are relative to xcm so that body can be unwrapped
if don't do this, would need xcm to move with true image flags
then a body could end up very far away from box
set_xv() will then compute huge displacements every step to
reset coords of all body atoms to be back inside the box,
ditto for triclinic box flip, which causes numeric problems
------------------------------------------------------------------------- */
void FixRigid::pre_neighbor()
{
for (int ibody = 0; ibody < nbody; ibody++)
domain->remap(xcm[ibody],imagebody[ibody]);
image_shift();
}
/* ----------------------------------------------------------------------
reset body xcmimage flags of atoms in bodies
xcmimage flags are relative to xcm so that body can be unwrapped
xcmimage = true image flag - imagebody flag
------------------------------------------------------------------------- */
void FixRigid::image_shift()
{
int ibody;
imageint tdim,bdim,xdim[3];
imageint *image = atom->image;
int nlocal = atom->nlocal;
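// per dimension: extract the shift count of the atom (tdim) and of its
// body (bdim), both stored offset by IMGMAX, and re-pack their difference
// so that xcmimage unwraps the atom relative to the body's xcm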
for (int i = 0; i < nlocal; i++) {
if (body[i] < 0) continue;
ibody = body[i];
tdim = image[i] & IMGMASK;
bdim = imagebody[ibody] & IMGMASK;
xdim[0] = IMGMAX + tdim - bdim;
tdim = (image[i] >> IMGBITS) & IMGMASK;
bdim = (imagebody[ibody] >> IMGBITS) & IMGMASK;
xdim[1] = IMGMAX + tdim - bdim;
tdim = image[i] >> IMG2BITS;
bdim = imagebody[ibody] >> IMG2BITS;
xdim[2] = IMGMAX + tdim - bdim;
xcmimage[i] = (xdim[2] << IMG2BITS) | (xdim[1] << IMGBITS) | xdim[0];
}
}
/* ----------------------------------------------------------------------
count # of DOF removed by rigid bodies for atoms in igroup
return total count of DOF
------------------------------------------------------------------------- */
int FixRigid::dof(int tgroup)
{
// cannot count DOF correctly unless setup_bodies_static() has been called
if (!setupflag) {
if (comm->me == 0)
error->warning(FLERR,"Cannot count rigid body degrees-of-freedom "
"before bodies are initialized");
return 0;
}
int tgroupbit = group->bitmask[tgroup];
// nall = # of point particles in each rigid body
// mall = # of finite-size particles in each rigid body
// particles must also be in temperature group
int *mask = atom->mask;
int nlocal = atom->nlocal;
int *ncount = new int[nbody];
int *mcount = new int[nbody];
for (int ibody = 0; ibody < nbody; ibody++)
ncount[ibody] = mcount[ibody] = 0;
for (int i = 0; i < nlocal; i++)
if (body[i] >= 0 && mask[i] & tgroupbit) {
// do not count point particles or point dipoles as extended particles
// a spheroid dipole will be counted as extended
if (extended && (eflags[i] & ~(POINT | DIPOLE))) mcount[body[i]]++;
else ncount[body[i]]++;
}
int *nall = new int[nbody];
int *mall = new int[nbody];
MPI_Allreduce(ncount,nall,nbody,MPI_INT,MPI_SUM,world);
MPI_Allreduce(mcount,mall,nbody,MPI_INT,MPI_SUM,world);
// warn if nall+mall != nrigid for any body included in temperature group
int flag = 0;
for (int ibody = 0; ibody < nbody; ibody++) {
if (nall[ibody]+mall[ibody] > 0 &&
nall[ibody]+mall[ibody] != nrigid[ibody]) flag = 1;
}
if (flag && me == 0)
error->warning(FLERR,"Computing temperature of portions of rigid bodies");
// remove appropriate DOFs for each rigid body wholly in temperature group
// N = # of point particles in body
// M = # of finite-size particles in body
// 3d body has 3N + 6M dof to start with
// 2d body has 2N + 3M dof to start with
// 3d point-particle body with all non-zero I should have 6 dof, remove 3N-6
// 3d point-particle body (linear) with a 0 I should have 5 dof, remove 3N-5
// 2d point-particle body should have 3 dof, remove 2N-3
// 3d body with any finite-size M should have 6 dof, remove (3N+6M) - 6
// 2d body with any finite-size M should have 3 dof, remove (2N+3M) - 3
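// example: a 3d body of 3 point particles with all non-zero principal
// moments starts with 3N = 9 DOF; 6 rigid-body DOF remain, so 3 are removed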
int n = 0;
nlinear = 0;
if (domain->dimension == 3) {
for (int ibody = 0; ibody < nbody; ibody++)
if (nall[ibody]+mall[ibody] == nrigid[ibody]) {
n += 3*nall[ibody] + 6*mall[ibody] - 6;
if (inertia[ibody][0] == 0.0 || inertia[ibody][1] == 0.0 ||
inertia[ibody][2] == 0.0) {
n++;
nlinear++;
}
}
} else if (domain->dimension == 2) {
for (int ibody = 0; ibody < nbody; ibody++)
if (nall[ibody]+mall[ibody] == nrigid[ibody])
n += 2*nall[ibody] + 3*mall[ibody] - 3;
}
delete [] ncount;
delete [] mcount;
delete [] nall;
delete [] mall;
return n;
}
/* ----------------------------------------------------------------------
adjust xcm of each rigid body due to box deformation
called by various fixes that change box size/shape
flag = 0/1 means map from box to lamda coords or vice versa
------------------------------------------------------------------------- */
void FixRigid::deform(int flag)
{
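  // lamda coords = dimensionless coords spanning the (possibly triclinic) box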
if (flag == 0)
for (int ibody = 0; ibody < nbody; ibody++)
domain->x2lamda(xcm[ibody],xcm[ibody]);
else
for (int ibody = 0; ibody < nbody; ibody++)
domain->lamda2x(xcm[ibody],xcm[ibody]);
}
/* ----------------------------------------------------------------------
set space-frame coords and velocity of each atom in each rigid body
set orientation and rotation of extended particles
x = Q displace + Xcm, mapped back to periodic box
v = Vcm + (W cross (x - Xcm))
------------------------------------------------------------------------- */
void FixRigid::set_xv()
{
int ibody;
int xbox,ybox,zbox;
double x0,x1,x2,v0,v1,v2,fc0,fc1,fc2,massone;
double xy,xz,yz;
double ione[3],exone[3],eyone[3],ezone[3],vr[6],p[3][3];
double **x = atom->x;
double **v = atom->v;
double **f = atom->f;
double *rmass = atom->rmass;
double *mass = atom->mass;
int *type = atom->type;
int nlocal = atom->nlocal;
double xprd = domain->xprd;
double yprd = domain->yprd;
double zprd = domain->zprd;
if (triclinic) {
xy = domain->xy;
xz = domain->xz;
yz = domain->yz;
}
// set x and v of each atom
for (int i = 0; i < nlocal; i++) {
if (body[i] < 0) continue;
ibody = body[i];
xbox = (xcmimage[i] & IMGMASK) - IMGMAX;
ybox = (xcmimage[i] >> IMGBITS & IMGMASK) - IMGMAX;
zbox = (xcmimage[i] >> IMG2BITS) - IMGMAX;
// save old positions and velocities for virial
if (evflag) {
if (triclinic == 0) {
x0 = x[i][0] + xbox*xprd;
x1 = x[i][1] + ybox*yprd;
x2 = x[i][2] + zbox*zprd;
} else {
x0 = x[i][0] + xbox*xprd + ybox*xy + zbox*xz;
x1 = x[i][1] + ybox*yprd + zbox*yz;
x2 = x[i][2] + zbox*zprd;
}
v0 = v[i][0];
v1 = v[i][1];
v2 = v[i][2];
}
// x = displacement from center-of-mass, based on body orientation
// v = vcm + omega around center-of-mass
MathExtra::matvec(ex_space[ibody],ey_space[ibody],
ez_space[ibody],displace[i],x[i]);
v[i][0] = omega[ibody][1]*x[i][2] - omega[ibody][2]*x[i][1] +
vcm[ibody][0];
v[i][1] = omega[ibody][2]*x[i][0] - omega[ibody][0]*x[i][2] +
vcm[ibody][1];
v[i][2] = omega[ibody][0]*x[i][1] - omega[ibody][1]*x[i][0] +
vcm[ibody][2];
// add center of mass to displacement
// map back into periodic box via xbox,ybox,zbox
// for triclinic, add in box tilt factors as well
if (triclinic == 0) {
x[i][0] += xcm[ibody][0] - xbox*xprd;
x[i][1] += xcm[ibody][1] - ybox*yprd;
x[i][2] += xcm[ibody][2] - zbox*zprd;
} else {
x[i][0] += xcm[ibody][0] - xbox*xprd - ybox*xy - zbox*xz;
x[i][1] += xcm[ibody][1] - ybox*yprd - zbox*yz;
x[i][2] += xcm[ibody][2] - zbox*zprd;
}
// virial = unwrapped coords dotted into body constraint force
// body constraint force = implied force due to v change minus f external
// assume f does not include forces internal to body
// 1/2 factor b/c final_integrate contributes other half
// assume per-atom contribution is due to constraint force on that atom
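// vr[0..5] = per-atom virial in LAMMPS order: xx,yy,zz,xy,xz,yz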
if (evflag) {
if (rmass) massone = rmass[i];
else massone = mass[type[i]];
fc0 = massone*(v[i][0] - v0)/dtf - f[i][0];
fc1 = massone*(v[i][1] - v1)/dtf - f[i][1];
fc2 = massone*(v[i][2] - v2)/dtf - f[i][2];
vr[0] = 0.5*x0*fc0;
vr[1] = 0.5*x1*fc1;
vr[2] = 0.5*x2*fc2;
vr[3] = 0.5*x0*fc1;
vr[4] = 0.5*x0*fc2;
vr[5] = 0.5*x1*fc2;
v_tally(1,&i,1.0,vr);
}
}
// set orientation, omega, angmom of each extended particle
if (extended) {
double theta_body,theta;
double *shape,*quatatom,*inertiaatom;
AtomVecEllipsoid::Bonus *ebonus;
if (avec_ellipsoid) ebonus = avec_ellipsoid->bonus;
AtomVecLine::Bonus *lbonus;
if (avec_line) lbonus = avec_line->bonus;
AtomVecTri::Bonus *tbonus;
if (avec_tri) tbonus = avec_tri->bonus;
double **omega_one = atom->omega;
double **angmom_one = atom->angmom;
double **mu = atom->mu;
int *ellipsoid = atom->ellipsoid;
int *line = atom->line;
int *tri = atom->tri;
for (int i = 0; i < nlocal; i++) {
if (body[i] < 0) continue;
ibody = body[i];
if (eflags[i] & SPHERE) {
omega_one[i][0] = omega[ibody][0];
omega_one[i][1] = omega[ibody][1];
omega_one[i][2] = omega[ibody][2];
} else if (eflags[i] & ELLIPSOID) {
shape = ebonus[ellipsoid[i]].shape;
quatatom = ebonus[ellipsoid[i]].quat;
MathExtra::quatquat(quat[ibody],orient[i],quatatom);
MathExtra::qnormalize(quatatom);
ione[0] = EINERTIA*rmass[i] * (shape[1]*shape[1] + shape[2]*shape[2]);
ione[1] = EINERTIA*rmass[i] * (shape[0]*shape[0] + shape[2]*shape[2]);
ione[2] = EINERTIA*rmass[i] * (shape[0]*shape[0] + shape[1]*shape[1]);
MathExtra::q_to_exyz(quatatom,exone,eyone,ezone);
MathExtra::omega_to_angmom(omega[ibody],exone,eyone,ezone,ione,
angmom_one[i]);
} else if (eflags[i] & LINE) {
if (quat[ibody][3] >= 0.0) theta_body = 2.0*acos(quat[ibody][0]);
else theta_body = -2.0*acos(quat[ibody][0]);
theta = orient[i][0] + theta_body;
while (theta <= MINUSPI) theta += TWOPI;
while (theta > MY_PI) theta -= TWOPI;
lbonus[line[i]].theta = theta;
omega_one[i][0] = omega[ibody][0];
omega_one[i][1] = omega[ibody][1];
omega_one[i][2] = omega[ibody][2];
} else if (eflags[i] & TRIANGLE) {
inertiaatom = tbonus[tri[i]].inertia;
quatatom = tbonus[tri[i]].quat;
MathExtra::quatquat(quat[ibody],orient[i],quatatom);
MathExtra::qnormalize(quatatom);
MathExtra::q_to_exyz(quatatom,exone,eyone,ezone);
MathExtra::omega_to_angmom(omega[ibody],exone,eyone,ezone,
inertiaatom,angmom_one[i]);
}
if (eflags[i] & DIPOLE) {
MathExtra::quat_to_mat(quat[ibody],p);
MathExtra::matvec(p,dorient[i],mu[i]);
MathExtra::snormalize3(mu[i][3],mu[i],mu[i]);
}
}
}
}
/* ----------------------------------------------------------------------
set space-frame velocity of each atom in a rigid body
set omega and angmom of extended particles
v = Vcm + (W cross (x - Xcm))
------------------------------------------------------------------------- */
void FixRigid::set_v()
{
int xbox,ybox,zbox;
double x0,x1,x2,v0,v1,v2,fc0,fc1,fc2,massone;
double xy,xz,yz;
double ione[3],exone[3],eyone[3],ezone[3],delta[3],vr[6];
double **x = atom->x;
double **v = atom->v;
double **f = atom->f;
double *rmass = atom->rmass;
double *mass = atom->mass;
int *type = atom->type;
int nlocal = atom->nlocal;
double xprd = domain->xprd;
double yprd = domain->yprd;
double zprd = domain->zprd;
if (triclinic) {
xy = domain->xy;
xz = domain->xz;
yz = domain->yz;
}
// set v of each atom
for (int i = 0; i < nlocal; i++) {
if (body[i] < 0) continue;
const int ibody = body[i];
MathExtra::matvec(ex_space[ibody],ey_space[ibody],
ez_space[ibody],displace[i],delta);
// save old velocities for virial
if (evflag) {
v0 = v[i][0];
v1 = v[i][1];
v2 = v[i][2];
}
v[i][0] = omega[ibody][1]*delta[2] - omega[ibody][2]*delta[1] +
vcm[ibody][0];
v[i][1] = omega[ibody][2]*delta[0] - omega[ibody][0]*delta[2] +
vcm[ibody][1];
v[i][2] = omega[ibody][0]*delta[1] - omega[ibody][1]*delta[0] +
vcm[ibody][2];
// virial = unwrapped coords dotted into body constraint force
// body constraint force = implied force due to v change minus f external
// assume f does not include forces internal to body
// 1/2 factor b/c initial_integrate contributes other half
// assume per-atom contribution is due to constraint force on that atom
if (evflag) {
if (rmass) massone = rmass[i];
else massone = mass[type[i]];
fc0 = massone*(v[i][0] - v0)/dtf - f[i][0];
fc1 = massone*(v[i][1] - v1)/dtf - f[i][1];
fc2 = massone*(v[i][2] - v2)/dtf - f[i][2];
xbox = (xcmimage[i] & IMGMASK) - IMGMAX;
ybox = (xcmimage[i] >> IMGBITS & IMGMASK) - IMGMAX;
zbox = (xcmimage[i] >> IMG2BITS) - IMGMAX;
if (triclinic == 0) {
x0 = x[i][0] + xbox*xprd;
x1 = x[i][1] + ybox*yprd;
x2 = x[i][2] + zbox*zprd;
} else {
x0 = x[i][0] + xbox*xprd + ybox*xy + zbox*xz;
x1 = x[i][1] + ybox*yprd + zbox*yz;
x2 = x[i][2] + zbox*zprd;
}
vr[0] = 0.5*x0*fc0;
vr[1] = 0.5*x1*fc1;
vr[2] = 0.5*x2*fc2;
vr[3] = 0.5*x0*fc1;
vr[4] = 0.5*x0*fc2;
vr[5] = 0.5*x1*fc2;
v_tally(1,&i,1.0,vr);
}
}
// set omega, angmom of each extended particle
if (extended) {
double *shape,*quatatom,*inertiaatom;
AtomVecEllipsoid::Bonus *ebonus;
if (avec_ellipsoid) ebonus = avec_ellipsoid->bonus;
AtomVecTri::Bonus *tbonus;
if (avec_tri) tbonus = avec_tri->bonus;
double **omega_one = atom->omega;
double **angmom_one = atom->angmom;
int *ellipsoid = atom->ellipsoid;
int *tri = atom->tri;
for (int i = 0; i < nlocal; i++) {
if (body[i] < 0) continue;
const int ibody = body[i];
if (eflags[i] & SPHERE) {
omega_one[i][0] = omega[ibody][0];
omega_one[i][1] = omega[ibody][1];
omega_one[i][2] = omega[ibody][2];
} else if (eflags[i] & ELLIPSOID) {
shape = ebonus[ellipsoid[i]].shape;
quatatom = ebonus[ellipsoid[i]].quat;
ione[0] = EINERTIA*rmass[i] * (shape[1]*shape[1] + shape[2]*shape[2]);
ione[1] = EINERTIA*rmass[i] * (shape[0]*shape[0] + shape[2]*shape[2]);
ione[2] = EINERTIA*rmass[i] * (shape[0]*shape[0] + shape[1]*shape[1]);
MathExtra::q_to_exyz(quatatom,exone,eyone,ezone);
MathExtra::omega_to_angmom(omega[ibody],exone,eyone,ezone,ione,
angmom_one[i]);
} else if (eflags[i] & LINE) {
omega_one[i][0] = omega[ibody][0];
omega_one[i][1] = omega[ibody][1];
omega_one[i][2] = omega[ibody][2];
} else if (eflags[i] & TRIANGLE) {
inertiaatom = tbonus[tri[i]].inertia;
quatatom = tbonus[tri[i]].quat;
MathExtra::q_to_exyz(quatatom,exone,eyone,ezone);
MathExtra::omega_to_angmom(omega[ibody],exone,eyone,ezone,
inertiaatom,angmom_one[i]);
}
}
}
}
/* ----------------------------------------------------------------------
one-time initialization of static rigid body attributes
sets extended flags, masstotal, center-of-mass
sets Cartesian and diagonalized inertia tensor
sets body image flags
may read some properties from infile
------------------------------------------------------------------------- */
void FixRigid::setup_bodies_static()
{
int i,ibody;
// extended = 1 if any particle in a rigid body is finite size
// or has a dipole moment
extended = orientflag = dorientflag = 0;
AtomVecEllipsoid::Bonus *ebonus;
if (avec_ellipsoid) ebonus = avec_ellipsoid->bonus;
AtomVecLine::Bonus *lbonus;
if (avec_line) lbonus = avec_line->bonus;
AtomVecTri::Bonus *tbonus;
if (avec_tri) tbonus = avec_tri->bonus;
double **mu = atom->mu;
double *radius = atom->radius;
double *rmass = atom->rmass;
double *mass = atom->mass;
int *ellipsoid = atom->ellipsoid;
int *line = atom->line;
int *tri = atom->tri;
int *type = atom->type;
int nlocal = atom->nlocal;
if (atom->radius_flag || atom->ellipsoid_flag || atom->line_flag ||
atom->tri_flag || atom->mu_flag) {
int flag = 0;
for (i = 0; i < nlocal; i++) {
if (body[i] < 0) continue;
if (radius && radius[i] > 0.0) flag = 1;
if (ellipsoid && ellipsoid[i] >= 0) flag = 1;
if (line && line[i] >= 0) flag = 1;
if (tri && tri[i] >= 0) flag = 1;
if (mu && mu[i][3] > 0.0) flag = 1;
}
MPI_Allreduce(&flag,&extended,1,MPI_INT,MPI_MAX,world);
}
// grow extended arrays and set extended flags for each particle
// orientflag = 4 if any particle stores ellipsoid or tri orientation
// orientflag = 1 if any particle stores line orientation
// dorientflag = 1 if any particle stores dipole orientation
if (extended) {
if (atom->ellipsoid_flag) orientflag = 4;
if (atom->line_flag) orientflag = 1;
if (atom->tri_flag) orientflag = 4;
if (atom->mu_flag) dorientflag = 1;
grow_arrays(atom->nmax);
for (i = 0; i < nlocal; i++) {
eflags[i] = 0;
if (body[i] < 0) continue;
// set to POINT or SPHERE or ELLIPSOID or LINE
if (radius && radius[i] > 0.0) {
eflags[i] |= SPHERE;
eflags[i] |= OMEGA;
eflags[i] |= TORQUE;
} else if (ellipsoid && ellipsoid[i] >= 0) {
eflags[i] |= ELLIPSOID;
eflags[i] |= ANGMOM;
eflags[i] |= TORQUE;
} else if (line && line[i] >= 0) {
eflags[i] |= LINE;
eflags[i] |= OMEGA;
eflags[i] |= TORQUE;
} else if (tri && tri[i] >= 0) {
eflags[i] |= TRIANGLE;
eflags[i] |= ANGMOM;
eflags[i] |= TORQUE;
} else eflags[i] |= POINT;
// set DIPOLE if atom->mu and mu[3] > 0.0
if (atom->mu_flag && mu[i][3] > 0.0)
eflags[i] |= DIPOLE;
}
}
// set body xcmimage flags = true image flags
imageint *image = atom->image;
for (i = 0; i < nlocal; i++)
if (body[i] >= 0) xcmimage[i] = image[i];
else xcmimage[i] = 0;
// compute masstotal & center-of-mass of each rigid body
// error if image flag is not 0 in a non-periodic dim
double **x = atom->x;
int *periodicity = domain->periodicity;
double xprd = domain->xprd;
double yprd = domain->yprd;
double zprd = domain->zprd;
double xy = domain->xy;
double xz = domain->xz;
double yz = domain->yz;
for (ibody = 0; ibody < nbody; ibody++)
for (i = 0; i < 6; i++) sum[ibody][i] = 0.0;
int xbox,ybox,zbox;
double massone,xunwrap,yunwrap,zunwrap;
for (i = 0; i < nlocal; i++) {
if (body[i] < 0) continue;
ibody = body[i];
xbox = (xcmimage[i] & IMGMASK) - IMGMAX;
ybox = (xcmimage[i] >> IMGBITS & IMGMASK) - IMGMAX;
zbox = (xcmimage[i] >> IMG2BITS) - IMGMAX;
if (rmass) massone = rmass[i];
else massone = mass[type[i]];
if ((xbox && !periodicity[0]) || (ybox && !periodicity[1]) ||
(zbox && !periodicity[2]))
error->one(FLERR,"Fix rigid atom has non-zero image flag "
"in a non-periodic dimension");
if (triclinic == 0) {
xunwrap = x[i][0] + xbox*xprd;
yunwrap = x[i][1] + ybox*yprd;
zunwrap = x[i][2] + zbox*zprd;
} else {
xunwrap = x[i][0] + xbox*xprd + ybox*xy + zbox*xz;
yunwrap = x[i][1] + ybox*yprd + zbox*yz;
zunwrap = x[i][2] + zbox*zprd;
}
sum[ibody][0] += xunwrap * massone;
sum[ibody][1] += yunwrap * massone;
sum[ibody][2] += zunwrap * massone;
sum[ibody][3] += massone;
}
MPI_Allreduce(sum[0],all[0],6*nbody,MPI_DOUBLE,MPI_SUM,world);
for (ibody = 0; ibody < nbody; ibody++) {
masstotal[ibody] = all[ibody][3];
xcm[ibody][0] = all[ibody][0]/masstotal[ibody];
xcm[ibody][1] = all[ibody][1]/masstotal[ibody];
xcm[ibody][2] = all[ibody][2]/masstotal[ibody];
}
// set vcm, angmom = 0.0 in case infile is used
// and doesn't overwrite all body's values
// since setup_bodies_dynamic() will not be called
for (ibody = 0; ibody < nbody; ibody++) {
vcm[ibody][0] = vcm[ibody][1] = vcm[ibody][2] = 0.0;
angmom[ibody][0] = angmom[ibody][1] = angmom[ibody][2] = 0.0;
}
- // overwrite masstotal and center-of-mass with file values
+ // set rigid body image flags to default values
+
+ for (ibody = 0; ibody < nbody; ibody++)
+ imagebody[ibody] = ((imageint) IMGMAX << IMG2BITS) |
+ ((imageint) IMGMAX << IMGBITS) | IMGMAX;
+
+ // overwrite masstotal, center-of-mass, image flags with file values
// inbody[i] = 0/1 if Ith rigid body is initialized by file
int *inbody;
if (infile) {
memory->create(inbody,nbody,"rigid:inbody");
for (ibody = 0; ibody < nbody; ibody++) inbody[ibody] = 0;
- readfile(0,masstotal,xcm,vcm,angmom,inbody);
+ readfile(0,masstotal,xcm,vcm,angmom,imagebody,inbody);
}
- // set rigid body image flags to default values
- // then remap the xcm of each body back into simulation box
+ // remap the xcm of each body back into simulation box
// and reset body and atom xcmimage flags via pre_neighbor()
- for (ibody = 0; ibody < nbody; ibody++)
- imagebody[ibody] = ((imageint) IMGMAX << IMG2BITS) |
- ((imageint) IMGMAX << IMGBITS) | IMGMAX;
-
pre_neighbor();
// compute 6 moments of inertia of each body in Cartesian reference frame
// dx,dy,dz = coords relative to center-of-mass
// symmetric 3x3 inertia tensor stored in Voigt notation as 6-vector
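// components 0-5 of sum/all map to the tensor as (xx,yy,zz,yz,xz,xy),
// matching how they are unpacked into tensor[][] below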
double dx,dy,dz;
for (ibody = 0; ibody < nbody; ibody++)
for (i = 0; i < 6; i++) sum[ibody][i] = 0.0;
for (i = 0; i < nlocal; i++) {
if (body[i] < 0) continue;
ibody = body[i];
xbox = (xcmimage[i] & IMGMASK) - IMGMAX;
ybox = (xcmimage[i] >> IMGBITS & IMGMASK) - IMGMAX;
zbox = (xcmimage[i] >> IMG2BITS) - IMGMAX;
if (triclinic == 0) {
xunwrap = x[i][0] + xbox*xprd;
yunwrap = x[i][1] + ybox*yprd;
zunwrap = x[i][2] + zbox*zprd;
} else {
xunwrap = x[i][0] + xbox*xprd + ybox*xy + zbox*xz;
yunwrap = x[i][1] + ybox*yprd + zbox*yz;
zunwrap = x[i][2] + zbox*zprd;
}
dx = xunwrap - xcm[ibody][0];
dy = yunwrap - xcm[ibody][1];
dz = zunwrap - xcm[ibody][2];
if (rmass) massone = rmass[i];
else massone = mass[type[i]];
sum[ibody][0] += massone * (dy*dy + dz*dz);
sum[ibody][1] += massone * (dx*dx + dz*dz);
sum[ibody][2] += massone * (dx*dx + dy*dy);
sum[ibody][3] -= massone * dy*dz;
sum[ibody][4] -= massone * dx*dz;
sum[ibody][5] -= massone * dx*dy;
}
// extended particles may contribute extra terms to moments of inertia
if (extended) {
double ivec[6];
double *shape,*quatatom,*inertiaatom;
double length,theta;
for (i = 0; i < nlocal; i++) {
if (body[i] < 0) continue;
ibody = body[i];
if (rmass) massone = rmass[i];
else massone = mass[type[i]];
if (eflags[i] & SPHERE) {
sum[ibody][0] += SINERTIA*massone * radius[i]*radius[i];
sum[ibody][1] += SINERTIA*massone * radius[i]*radius[i];
sum[ibody][2] += SINERTIA*massone * radius[i]*radius[i];
} else if (eflags[i] & ELLIPSOID) {
shape = ebonus[ellipsoid[i]].shape;
quatatom = ebonus[ellipsoid[i]].quat;
MathExtra::inertia_ellipsoid(shape,quatatom,massone,ivec);
sum[ibody][0] += ivec[0];
sum[ibody][1] += ivec[1];
sum[ibody][2] += ivec[2];
sum[ibody][3] += ivec[3];
sum[ibody][4] += ivec[4];
sum[ibody][5] += ivec[5];
} else if (eflags[i] & LINE) {
length = lbonus[line[i]].length;
theta = lbonus[line[i]].theta;
MathExtra::inertia_line(length,theta,massone,ivec);
sum[ibody][0] += ivec[0];
sum[ibody][1] += ivec[1];
sum[ibody][2] += ivec[2];
sum[ibody][3] += ivec[3];
sum[ibody][4] += ivec[4];
sum[ibody][5] += ivec[5];
} else if (eflags[i] & TRIANGLE) {
inertiaatom = tbonus[tri[i]].inertia;
quatatom = tbonus[tri[i]].quat;
MathExtra::inertia_triangle(inertiaatom,quatatom,massone,ivec);
sum[ibody][0] += ivec[0];
sum[ibody][1] += ivec[1];
sum[ibody][2] += ivec[2];
sum[ibody][3] += ivec[3];
sum[ibody][4] += ivec[4];
sum[ibody][5] += ivec[5];
}
}
}
MPI_Allreduce(sum[0],all[0],6*nbody,MPI_DOUBLE,MPI_SUM,world);
// overwrite Cartesian inertia tensor with file values
- if (infile) readfile(1,NULL,all,NULL,NULL,inbody);
+ if (infile) readfile(1,NULL,all,NULL,NULL,NULL,inbody);
// diagonalize inertia tensor for each body via Jacobi rotations
// inertia = 3 eigenvalues = principal moments of inertia
// evectors and ex/ey/ez_space = 3 evectors = principal axes of rigid body
int ierror;
double cross[3];
double tensor[3][3],evectors[3][3];
for (ibody = 0; ibody < nbody; ibody++) {
tensor[0][0] = all[ibody][0];
tensor[1][1] = all[ibody][1];
tensor[2][2] = all[ibody][2];
tensor[1][2] = tensor[2][1] = all[ibody][3];
tensor[0][2] = tensor[2][0] = all[ibody][4];
tensor[0][1] = tensor[1][0] = all[ibody][5];
ierror = MathExtra::jacobi(tensor,inertia[ibody],evectors);
if (ierror) error->all(FLERR,
"Insufficient Jacobi rotations for rigid body");
ex_space[ibody][0] = evectors[0][0];
ex_space[ibody][1] = evectors[1][0];
ex_space[ibody][2] = evectors[2][0];
ey_space[ibody][0] = evectors[0][1];
ey_space[ibody][1] = evectors[1][1];
ey_space[ibody][2] = evectors[2][1];
ez_space[ibody][0] = evectors[0][2];
ez_space[ibody][1] = evectors[1][2];
ez_space[ibody][2] = evectors[2][2];
// if any principal moment < scaled EPSILON, set to 0.0
double max;
max = MAX(inertia[ibody][0],inertia[ibody][1]);
max = MAX(max,inertia[ibody][2]);
if (inertia[ibody][0] < EPSILON*max) inertia[ibody][0] = 0.0;
if (inertia[ibody][1] < EPSILON*max) inertia[ibody][1] = 0.0;
if (inertia[ibody][2] < EPSILON*max) inertia[ibody][2] = 0.0;
// enforce 3 evectors as a right-handed coordinate system
// flip 3rd vector if needed
MathExtra::cross3(ex_space[ibody],ey_space[ibody],cross);
if (MathExtra::dot3(cross,ez_space[ibody]) < 0.0)
MathExtra::negate3(ez_space[ibody]);
// create initial quaternion
MathExtra::exyz_to_q(ex_space[ibody],ey_space[ibody],ez_space[ibody],
quat[ibody]);
}
// displace = initial atom coords in basis of principal axes
// set displace = 0.0 for atoms not in any rigid body
// for extended particles, set their orientation wrt the rigid body
double qc[4],delta[3];
double *quatatom;
double theta_body;
for (i = 0; i < nlocal; i++) {
if (body[i] < 0) {
displace[i][0] = displace[i][1] = displace[i][2] = 0.0;
continue;
}
ibody = body[i];
xbox = (xcmimage[i] & IMGMASK) - IMGMAX;
ybox = (xcmimage[i] >> IMGBITS & IMGMASK) - IMGMAX;
zbox = (xcmimage[i] >> IMG2BITS) - IMGMAX;
if (triclinic == 0) {
xunwrap = x[i][0] + xbox*xprd;
yunwrap = x[i][1] + ybox*yprd;
zunwrap = x[i][2] + zbox*zprd;
} else {
xunwrap = x[i][0] + xbox*xprd + ybox*xy + zbox*xz;
yunwrap = x[i][1] + ybox*yprd + zbox*yz;
zunwrap = x[i][2] + zbox*zprd;
}
delta[0] = xunwrap - xcm[ibody][0];
delta[1] = yunwrap - xcm[ibody][1];
delta[2] = zunwrap - xcm[ibody][2];
MathExtra::transpose_matvec(ex_space[ibody],ey_space[ibody],
ez_space[ibody],delta,displace[i]);
if (extended) {
if (eflags[i] & ELLIPSOID) {
quatatom = ebonus[ellipsoid[i]].quat;
MathExtra::qconjugate(quat[ibody],qc);
MathExtra::quatquat(qc,quatatom,orient[i]);
MathExtra::qnormalize(orient[i]);
} else if (eflags[i] & LINE) {
if (quat[ibody][3] >= 0.0) theta_body = 2.0*acos(quat[ibody][0]);
else theta_body = -2.0*acos(quat[ibody][0]);
orient[i][0] = lbonus[line[i]].theta - theta_body;
while (orient[i][0] <= MINUSPI) orient[i][0] += TWOPI;
while (orient[i][0] > MY_PI) orient[i][0] -= TWOPI;
if (orientflag == 4) orient[i][1] = orient[i][2] = orient[i][3] = 0.0;
} else if (eflags[i] & TRIANGLE) {
quatatom = tbonus[tri[i]].quat;
MathExtra::qconjugate(quat[ibody],qc);
MathExtra::quatquat(qc,quatatom,orient[i]);
MathExtra::qnormalize(orient[i]);
} else if (orientflag == 4) {
orient[i][0] = orient[i][1] = orient[i][2] = orient[i][3] = 0.0;
} else if (orientflag == 1)
orient[i][0] = 0.0;
if (eflags[i] & DIPOLE) {
MathExtra::transpose_matvec(ex_space[ibody],ey_space[ibody],
ez_space[ibody],mu[i],dorient[i]);
MathExtra::snormalize3(mu[i][3],dorient[i],dorient[i]);
} else if (dorientflag)
dorient[i][0] = dorient[i][1] = dorient[i][2] = 0.0;
}
}
// test for valid principal moments & axes
// recompute moments of inertia around new axes
// 3 diagonal moments should equal principal moments
// 3 off-diagonal moments should be 0.0
// extended particles may contribute extra terms to moments of inertia
for (ibody = 0; ibody < nbody; ibody++)
for (i = 0; i < 6; i++) sum[ibody][i] = 0.0;
for (i = 0; i < nlocal; i++) {
if (body[i] < 0) continue;
ibody = body[i];
if (rmass) massone = rmass[i];
else massone = mass[type[i]];
sum[ibody][0] += massone *
(displace[i][1]*displace[i][1] + displace[i][2]*displace[i][2]);
sum[ibody][1] += massone *
(displace[i][0]*displace[i][0] + displace[i][2]*displace[i][2]);
sum[ibody][2] += massone *
(displace[i][0]*displace[i][0] + displace[i][1]*displace[i][1]);
sum[ibody][3] -= massone * displace[i][1]*displace[i][2];
sum[ibody][4] -= massone * displace[i][0]*displace[i][2];
sum[ibody][5] -= massone * displace[i][0]*displace[i][1];
}
if (extended) {
double ivec[6];
double *shape,*inertiaatom;
double length;
for (i = 0; i < nlocal; i++) {
if (body[i] < 0) continue;
ibody = body[i];
if (rmass) massone = rmass[i];
else massone = mass[type[i]];
if (eflags[i] & SPHERE) {
sum[ibody][0] += SINERTIA*massone * radius[i]*radius[i];
sum[ibody][1] += SINERTIA*massone * radius[i]*radius[i];
sum[ibody][2] += SINERTIA*massone * radius[i]*radius[i];
} else if (eflags[i] & ELLIPSOID) {
shape = ebonus[ellipsoid[i]].shape;
MathExtra::inertia_ellipsoid(shape,orient[i],massone,ivec);
sum[ibody][0] += ivec[0];
sum[ibody][1] += ivec[1];
sum[ibody][2] += ivec[2];
sum[ibody][3] += ivec[3];
sum[ibody][4] += ivec[4];
sum[ibody][5] += ivec[5];
} else if (eflags[i] & LINE) {
length = lbonus[line[i]].length;
MathExtra::inertia_line(length,orient[i][0],massone,ivec);
sum[ibody][0] += ivec[0];
sum[ibody][1] += ivec[1];
sum[ibody][2] += ivec[2];
sum[ibody][3] += ivec[3];
sum[ibody][4] += ivec[4];
sum[ibody][5] += ivec[5];
} else if (eflags[i] & TRIANGLE) {
inertiaatom = tbonus[tri[i]].inertia;
MathExtra::inertia_triangle(inertiaatom,orient[i],massone,ivec);
sum[ibody][0] += ivec[0];
sum[ibody][1] += ivec[1];
sum[ibody][2] += ivec[2];
sum[ibody][3] += ivec[3];
sum[ibody][4] += ivec[4];
sum[ibody][5] += ivec[5];
}
}
}
MPI_Allreduce(sum[0],all[0],6*nbody,MPI_DOUBLE,MPI_SUM,world);
// error check that re-computed moments of inertia match diagonalized ones
// do not do test for bodies with params read from infile
double norm;
for (ibody = 0; ibody < nbody; ibody++) {
if (infile && inbody[ibody]) continue;
if (inertia[ibody][0] == 0.0) {
if (fabs(all[ibody][0]) > TOLERANCE)
error->all(FLERR,"Fix rigid: Bad principal moments");
} else {
if (fabs((all[ibody][0]-inertia[ibody][0])/inertia[ibody][0]) >
TOLERANCE) error->all(FLERR,"Fix rigid: Bad principal moments");
}
if (inertia[ibody][1] == 0.0) {
if (fabs(all[ibody][1]) > TOLERANCE)
error->all(FLERR,"Fix rigid: Bad principal moments");
} else {
if (fabs((all[ibody][1]-inertia[ibody][1])/inertia[ibody][1]) >
TOLERANCE) error->all(FLERR,"Fix rigid: Bad principal moments");
}
if (inertia[ibody][2] == 0.0) {
if (fabs(all[ibody][2]) > TOLERANCE)
error->all(FLERR,"Fix rigid: Bad principal moments");
} else {
if (fabs((all[ibody][2]-inertia[ibody][2])/inertia[ibody][2]) >
TOLERANCE) error->all(FLERR,"Fix rigid: Bad principal moments");
}
norm = (inertia[ibody][0] + inertia[ibody][1] + inertia[ibody][2]) / 3.0;
if (fabs(all[ibody][3]/norm) > TOLERANCE ||
fabs(all[ibody][4]/norm) > TOLERANCE ||
fabs(all[ibody][5]/norm) > TOLERANCE)
error->all(FLERR,"Fix rigid: Bad principal moments");
}
if (infile) memory->destroy(inbody);
}
/* ----------------------------------------------------------------------
one-time initialization of dynamic rigid body attributes
set vcm and angmom, computed explicitly from constituent particles
not done if body properties are read from file, e.g. for overlapping particles
------------------------------------------------------------------------- */
void FixRigid::setup_bodies_dynamic()
{
int i,ibody;
double massone,radone;
// vcm = velocity of center-of-mass of each rigid body
// angmom = angular momentum of each rigid body
double **x = atom->x;
double **v = atom->v;
double *rmass = atom->rmass;
double *mass = atom->mass;
int *type = atom->type;
int nlocal = atom->nlocal;
double dx,dy,dz;
double unwrap[3];
for (ibody = 0; ibody < nbody; ibody++)
for (i = 0; i < 6; i++) sum[ibody][i] = 0.0;
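// sum[ibody][0..2] accumulate linear momentum m*v,
// sum[ibody][3..5] accumulate angular momentum r x m*v about the body's xcm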
for (i = 0; i < nlocal; i++) {
if (body[i] < 0) continue;
ibody = body[i];
if (rmass) massone = rmass[i];
else massone = mass[type[i]];
sum[ibody][0] += v[i][0] * massone;
sum[ibody][1] += v[i][1] * massone;
sum[ibody][2] += v[i][2] * massone;
domain->unmap(x[i],xcmimage[i],unwrap);
dx = unwrap[0] - xcm[ibody][0];
dy = unwrap[1] - xcm[ibody][1];
dz = unwrap[2] - xcm[ibody][2];
sum[ibody][3] += dy * massone*v[i][2] - dz * massone*v[i][1];
sum[ibody][4] += dz * massone*v[i][0] - dx * massone*v[i][2];
sum[ibody][5] += dx * massone*v[i][1] - dy * massone*v[i][0];
}
// extended particles add their rotation to angmom of body
if (extended) {
AtomVecLine::Bonus *lbonus;
if (avec_line) lbonus = avec_line->bonus;
double **omega_one = atom->omega;
double **angmom_one = atom->angmom;
double *radius = atom->radius;
int *line = atom->line;
for (i = 0; i < nlocal; i++) {
if (body[i] < 0) continue;
ibody = body[i];
if (eflags[i] & OMEGA) {
if (eflags[i] & SPHERE) {
radone = radius[i];
sum[ibody][3] += SINERTIA*rmass[i] * radone*radone * omega_one[i][0];
sum[ibody][4] += SINERTIA*rmass[i] * radone*radone * omega_one[i][1];
sum[ibody][5] += SINERTIA*rmass[i] * radone*radone * omega_one[i][2];
} else if (eflags[i] & LINE) {
radone = lbonus[line[i]].length;
sum[ibody][5] += LINERTIA*rmass[i] * radone*radone * omega_one[i][2];
}
}
if (eflags[i] & ANGMOM) {
sum[ibody][3] += angmom_one[i][0];
sum[ibody][4] += angmom_one[i][1];
sum[ibody][5] += angmom_one[i][2];
}
}
}
MPI_Allreduce(sum[0],all[0],6*nbody,MPI_DOUBLE,MPI_SUM,world);
// normalize velocity of COM
for (ibody = 0; ibody < nbody; ibody++) {
vcm[ibody][0] = all[ibody][0]/masstotal[ibody];
vcm[ibody][1] = all[ibody][1]/masstotal[ibody];
vcm[ibody][2] = all[ibody][2]/masstotal[ibody];
angmom[ibody][0] = all[ibody][3];
angmom[ibody][1] = all[ibody][4];
angmom[ibody][2] = all[ibody][5];
}
}
/* ----------------------------------------------------------------------
read per rigid body info from user-provided file
which = 0 to read everything except 6 moments of inertia
which = 1 to read 6 moments of inertia
flag inbody = 0 for bodies whose info is read from file
nlines = # of lines of rigid body info
one line = rigid-ID mass xcm ycm zcm ixx iyy izz ixy ixz iyz
- vxcm vycm vzcm lx ly lz
+ vxcm vycm vzcm lx ly lz ix iy iz
------------------------------------------------------------------------- */
void FixRigid::readfile(int which, double *vec,
double **array1, double **array2, double **array3,
- int *inbody)
+ imageint *ivec, int *inbody)
{
- int j,nchunk,id,eofflag;
+ int j,nchunk,id,eofflag,xbox,ybox,zbox;
int nlines;
FILE *fp;
char *eof,*start,*next,*buf;
char line[MAXLINE];
if (me == 0) {
fp = fopen(infile,"r");
if (fp == NULL) {
char str[128];
sprintf(str,"Cannot open fix rigid infile %s",infile);
error->one(FLERR,str);
}
while (1) {
eof = fgets(line,MAXLINE,fp);
if (eof == NULL) error->one(FLERR,"Unexpected end of fix rigid file");
start = &line[strspn(line," \t\n\v\f\r")];
if (*start != '\0' && *start != '#') break;
}
sscanf(line,"%d",&nlines);
}
MPI_Bcast(&nlines,1,MPI_INT,0,world);
if (nlines == 0) error->all(FLERR,"Fix rigid file has no lines");
char *buffer = new char[CHUNK*MAXLINE];
char **values = new char*[ATTRIBUTE_PERBODY];
int nread = 0;
while (nread < nlines) {
nchunk = MIN(nlines-nread,CHUNK);
eofflag = comm->read_lines_from_file(fp,nchunk,MAXLINE,buffer);
if (eofflag) error->all(FLERR,"Unexpected end of fix rigid file");
buf = buffer;
next = strchr(buf,'\n');
*next = '\0';
int nwords = atom->count_words(buf);
*next = '\n';
if (nwords != ATTRIBUTE_PERBODY)
error->all(FLERR,"Incorrect rigid body format in fix rigid file");
// loop over lines of rigid body attributes
// tokenize the line into values
// id = rigid body ID
// use ID as-is for SINGLE, as mol-ID for MOLECULE, as-is for GROUP
- // for which = 0, store all but inertia in vec and arrays
+ // for which = 0, store all but inertia in vecs and arrays
// for which = 1, store inertia tensor array; off-diagonal values 8-10
// (ixy,ixz,iyz) are stored in reverse order into Voigt slots 5,4,3
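// values[] per line: 0 = ID, 1 = mass, 2-4 = xcm, 5-10 = inertia
// (ixx iyy izz ixy ixz iyz), 11-13 = vcm, 14-16 = angmom, 17-19 = image flags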
for (int i = 0; i < nchunk; i++) {
next = strchr(buf,'\n');
values[0] = strtok(buf," \t\n\r\f");
for (j = 1; j < nwords; j++)
values[j] = strtok(NULL," \t\n\r\f");
id = atoi(values[0]);
if (rstyle == MOLECULE) {
if (id <= 0 || id > maxmol)
error->all(FLERR,"Invalid rigid body ID in fix rigid file");
id = mol2body[id];
} else id--;
if (id < 0 || id >= nbody)
error->all(FLERR,"Invalid rigid body ID in fix rigid file");
inbody[id] = 1;
if (which == 0) {
vec[id] = atof(values[1]);
array1[id][0] = atof(values[2]);
array1[id][1] = atof(values[3]);
array1[id][2] = atof(values[4]);
array2[id][0] = atof(values[11]);
array2[id][1] = atof(values[12]);
array2[id][2] = atof(values[13]);
array3[id][0] = atof(values[14]);
array3[id][1] = atof(values[15]);
array3[id][2] = atof(values[16]);
+ xbox = atoi(values[17]);
+ ybox = atoi(values[18]);
+ zbox = atoi(values[19]);
+ ivec[id] = ((imageint) (xbox + IMGMAX) & IMGMASK) |
+ (((imageint) (ybox + IMGMAX) & IMGMASK) << IMGBITS) |
+ (((imageint) (zbox + IMGMAX) & IMGMASK) << IMG2BITS);
} else {
array1[id][0] = atof(values[5]);
array1[id][1] = atof(values[6]);
array1[id][2] = atof(values[7]);
array1[id][3] = atof(values[10]);
array1[id][4] = atof(values[9]);
array1[id][5] = atof(values[8]);
}
buf = next + 1;
}
nread += nchunk;
}
if (me == 0) fclose(fp);
delete [] buffer;
delete [] values;
}
/* ----------------------------------------------------------------------
- write out restart info for mass, COM, inertia tensor to file
+ write out restart info for mass, COM, inertia tensor, image flags to file
identical format to infile option, so info can be read in when restarting
only proc 0 writes list of global bodies to file
------------------------------------------------------------------------- */
void FixRigid::write_restart_file(char *file)
{
if (me) return;
char outfile[128];
sprintf(outfile,"%s.rigid",file);
FILE *fp = fopen(outfile,"w");
if (fp == NULL) {
char str[128];
sprintf(str,"Cannot open fix rigid restart file %s",outfile);
error->one(FLERR,str);
}
fprintf(fp,"# fix rigid mass, COM, inertia tensor info for "
"%d bodies on timestep " BIGINT_FORMAT "\n\n",
nbody,update->ntimestep);
fprintf(fp,"%d\n",nbody);
// compute I tensor against xyz axes from diagonalized I and current quat
// Ispace = P Idiag P_transpose
// P is stored column-wise in exyz_space
+ int xbox,ybox,zbox;
double p[3][3],pdiag[3][3],ispace[3][3];
int id;
for (int i = 0; i < nbody; i++) {
if (rstyle == SINGLE || rstyle == GROUP) id = i;
else id = body2mol[i];
MathExtra::col2mat(ex_space[i],ey_space[i],ez_space[i],p);
MathExtra::times3_diag(p,inertia[i],pdiag);
MathExtra::times3_transpose(pdiag,p,ispace);
+ xbox = (imagebody[i] & IMGMASK) - IMGMAX;
+ ybox = (imagebody[i] >> IMGBITS & IMGMASK) - IMGMAX;
+ zbox = (imagebody[i] >> IMG2BITS) - IMGMAX;
+
fprintf(fp,"%d %-1.16e %-1.16e %-1.16e %-1.16e "
- "%-1.16e %-1.16e %-1.16e %-1.16e %-1.16e %-1.16e\n",
+ "%-1.16e %-1.16e %-1.16e %-1.16e %-1.16e %-1.16e %d %d %d\n",
id,masstotal[i],xcm[i][0],xcm[i][1],xcm[i][2],
ispace[0][0],ispace[1][1],ispace[2][2],
- ispace[0][1],ispace[0][2],ispace[1][2]);
+ ispace[0][1],ispace[0][2],ispace[1][2],xbox,ybox,zbox);
}
fclose(fp);
}
/* ----------------------------------------------------------------------
memory usage of local atom-based arrays
------------------------------------------------------------------------- */
double FixRigid::memory_usage()
{
int nmax = atom->nmax;
double bytes = nmax * sizeof(int);
bytes += nmax * sizeof(imageint);
bytes += nmax*3 * sizeof(double);
bytes += maxvatom*6 * sizeof(double); // vatom
if (extended) {
bytes += nmax * sizeof(int);
if (orientflag) bytes += nmax*orientflag * sizeof(double);
if (dorientflag) bytes += nmax*3 * sizeof(double);
}
return bytes;
}
/* ----------------------------------------------------------------------
allocate local atom-based arrays
------------------------------------------------------------------------- */
void FixRigid::grow_arrays(int nmax)
{
memory->grow(body,nmax,"rigid:body");
memory->grow(xcmimage,nmax,"rigid:xcmimage");
memory->grow(displace,nmax,3,"rigid:displace");
if (extended) {
memory->grow(eflags,nmax,"rigid:eflags");
if (orientflag) memory->grow(orient,nmax,orientflag,"rigid:orient");
if (dorientflag) memory->grow(dorient,nmax,3,"rigid:dorient");
}
// check for regrow of vatom
// must be done whether per-atom virial is accumulated on this step or not
// b/c this is the only time grow_arrays() may be called
// need to regrow b/c vatom is calculated before and after atom migration
if (nmax > maxvatom) {
maxvatom = atom->nmax;
memory->grow(vatom,maxvatom,6,"fix:vatom");
}
}
/* ----------------------------------------------------------------------
copy values within local atom-based arrays
------------------------------------------------------------------------- */
void FixRigid::copy_arrays(int i, int j, int delflag)
{
body[j] = body[i];
xcmimage[j] = xcmimage[i];
displace[j][0] = displace[i][0];
displace[j][1] = displace[i][1];
displace[j][2] = displace[i][2];
if (extended) {
eflags[j] = eflags[i];
for (int k = 0; k < orientflag; k++)
orient[j][k] = orient[i][k];
if (dorientflag) {
dorient[j][0] = dorient[i][0];
dorient[j][1] = dorient[i][1];
dorient[j][2] = dorient[i][2];
}
}
// must also copy vatom if per-atom virial calculated on this timestep
// since vatom is calculated before and after atom migration
if (vflag_atom)
for (int k = 0; k < 6; k++)
vatom[j][k] = vatom[i][k];
}
/* ----------------------------------------------------------------------
initialize one atom's array values, called when atom is created
------------------------------------------------------------------------- */
void FixRigid::set_arrays(int i)
{
body[i] = -1;
xcmimage[i] = 0;
displace[i][0] = 0.0;
displace[i][1] = 0.0;
displace[i][2] = 0.0;
// must also zero vatom if per-atom virial calculated on this timestep
// since vatom is calculated before and after atom migration
if (vflag_atom)
for (int k = 0; k < 6; k++)
vatom[i][k] = 0.0;
}
/* ----------------------------------------------------------------------
pack values in local atom-based arrays for exchange with another proc
------------------------------------------------------------------------- */
int FixRigid::pack_exchange(int i, double *buf)
{
buf[0] = ubuf(body[i]).d;
buf[1] = ubuf(xcmimage[i]).d;
buf[2] = displace[i][0];
buf[3] = displace[i][1];
buf[4] = displace[i][2];
if (!extended) return 5;
int m = 5;
buf[m++] = eflags[i];
for (int j = 0; j < orientflag; j++)
buf[m++] = orient[i][j];
if (dorientflag) {
buf[m++] = dorient[i][0];
buf[m++] = dorient[i][1];
buf[m++] = dorient[i][2];
}
// must also pack vatom if per-atom virial calculated on this timestep
// since vatom is calculated before and after atom migration
if (vflag_atom)
for (int k = 0; k < 6; k++)
buf[m++] = vatom[i][k];
return m;
}
/* ----------------------------------------------------------------------
unpack values in local atom-based arrays from exchange with another proc
------------------------------------------------------------------------- */
int FixRigid::unpack_exchange(int nlocal, double *buf)
{
body[nlocal] = (int) ubuf(buf[0]).i;
xcmimage[nlocal] = (imageint) ubuf(buf[1]).i;
displace[nlocal][0] = buf[2];
displace[nlocal][1] = buf[3];
displace[nlocal][2] = buf[4];
if (!extended) return 5;
int m = 5;
eflags[nlocal] = static_cast<int> (buf[m++]);
for (int j = 0; j < orientflag; j++)
orient[nlocal][j] = buf[m++];
if (dorientflag) {
dorient[nlocal][0] = buf[m++];
dorient[nlocal][1] = buf[m++];
dorient[nlocal][2] = buf[m++];
}
// must also unpack vatom if per-atom virial calculated on this timestep
// since vatom is calculated before and after atom migration
if (vflag_atom)
for (int k = 0; k < 6; k++)
vatom[nlocal][k] = buf[m++];
return m;
}
/* ---------------------------------------------------------------------- */
void FixRigid::reset_dt()
{
dtv = update->dt;
dtf = 0.5 * update->dt * force->ftm2v;
dtq = 0.5 * update->dt;
}
/* ----------------------------------------------------------------------
zero linear momentum of each rigid body
set Vcm to 0.0, then reset velocities of particles via set_v()
------------------------------------------------------------------------- */
void FixRigid::zero_momentum()
{
for (int ibody = 0; ibody < nbody; ibody++)
vcm[ibody][0] = vcm[ibody][1] = vcm[ibody][2] = 0.0;
evflag = 0;
set_v();
}
/* ----------------------------------------------------------------------
zero angular momentum of each rigid body
set angmom/omega to 0.0, then reset velocities of particles via set_v()
------------------------------------------------------------------------- */
void FixRigid::zero_rotation()
{
for (int ibody = 0; ibody < nbody; ibody++) {
angmom[ibody][0] = angmom[ibody][1] = angmom[ibody][2] = 0.0;
omega[ibody][0] = omega[ibody][1] = omega[ibody][2] = 0.0;
}
evflag = 0;
set_v();
}
/* ----------------------------------------------------------------------
return temperature of collection of rigid bodies
non-active DOF are removed by fflag/tflag and in tfactor
------------------------------------------------------------------------- */
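// sketch of the sum accumulated below (before scaling):
//   t = sum_i [ m_i * vcm_i^2 over active fflag dims + I_i,k * wbody_i,k^2 over active tflag dims ]
// tfactor, computed elsewhere in this fix, presumably folds in the usual
// mvv2e / (N_dof * kB) conversion so that compute_scalar() returns a temperature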
double FixRigid::compute_scalar()
{
double wbody[3],rot[3][3];
double t = 0.0;
for (int i = 0; i < nbody; i++) {
t += masstotal[i] * (fflag[i][0]*vcm[i][0]*vcm[i][0] +
fflag[i][1]*vcm[i][1]*vcm[i][1] +
fflag[i][2]*vcm[i][2]*vcm[i][2]);
// wbody = angular velocity in body frame
MathExtra::quat_to_mat(quat[i],rot);
MathExtra::transpose_matvec(rot,angmom[i],wbody);
if (inertia[i][0] == 0.0) wbody[0] = 0.0;
else wbody[0] /= inertia[i][0];
if (inertia[i][1] == 0.0) wbody[1] = 0.0;
else wbody[1] /= inertia[i][1];
if (inertia[i][2] == 0.0) wbody[2] = 0.0;
else wbody[2] /= inertia[i][2];
t += tflag[i][0]*inertia[i][0]*wbody[0]*wbody[0] +
tflag[i][1]*inertia[i][1]*wbody[1]*wbody[1] +
tflag[i][2]*inertia[i][2]*wbody[2]*wbody[2];
}
t *= tfactor;
return t;
}
/* ---------------------------------------------------------------------- */
void *FixRigid::extract(const char *str, int &dim)
{
if (strcmp(str,"body") == 0) {
dim = 1;
return body;
}
if (strcmp(str,"masstotal") == 0) {
dim = 1;
return masstotal;
}
if (strcmp(str,"t_target") == 0) {
dim = 0;
return &t_target;
}
return NULL;
}
/* ----------------------------------------------------------------------
return translational KE for all rigid bodies
KE = 1/2 M Vcm^2
------------------------------------------------------------------------- */
double FixRigid::extract_ke()
{
double ke = 0.0;
for (int i = 0; i < nbody; i++)
ke += masstotal[i] *
(vcm[i][0]*vcm[i][0] + vcm[i][1]*vcm[i][1] + vcm[i][2]*vcm[i][2]);
return 0.5*ke;
}
/* ----------------------------------------------------------------------
return rotational KE for all rigid bodies
Erotational = 1/2 I wbody^2
------------------------------------------------------------------------- */
double FixRigid::extract_erotational()
{
double wbody[3],rot[3][3];
double erotate = 0.0;
for (int i = 0; i < nbody; i++) {
// wbody = angular velocity in body frame
MathExtra::quat_to_mat(quat[i],rot);
MathExtra::transpose_matvec(rot,angmom[i],wbody);
if (inertia[i][0] == 0.0) wbody[0] = 0.0;
else wbody[0] /= inertia[i][0];
if (inertia[i][1] == 0.0) wbody[1] = 0.0;
else wbody[1] /= inertia[i][1];
if (inertia[i][2] == 0.0) wbody[2] = 0.0;
else wbody[2] /= inertia[i][2];
erotate += inertia[i][0]*wbody[0]*wbody[0] +
inertia[i][1]*wbody[1]*wbody[1] + inertia[i][2]*wbody[2]*wbody[2];
}
return 0.5*erotate;
}
/* ----------------------------------------------------------------------
return attributes of a rigid body
15 values per body
xcm = 0,1,2; vcm = 3,4,5; fcm = 6,7,8; torque = 9,10,11; image = 12,13,14
------------------------------------------------------------------------- */
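// e.g. compute_array(0,12) returns xbox, the x image (unwrapping) count of the
// first body, matching the three extra columns written by write_restart_file()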
double FixRigid::compute_array(int i, int j)
{
if (j < 3) return xcm[i][j];
if (j < 6) return vcm[i][j-3];
if (j < 9) return fcm[i][j-6];
if (j < 12) return torque[i][j-9];
if (j == 12) return (imagebody[i] & IMGMASK) - IMGMAX;
if (j == 13) return (imagebody[i] >> IMGBITS & IMGMASK) - IMGMAX;
return (imagebody[i] >> IMG2BITS) - IMGMAX;
}
diff --git a/src/RIGID/fix_rigid.h b/src/RIGID/fix_rigid.h
index aef310393..50314bd79 100644
--- a/src/RIGID/fix_rigid.h
+++ b/src/RIGID/fix_rigid.h
@@ -1,265 +1,266 @@
/* -*- c++ -*- ----------------------------------------------------------
LAMMPS - Large-scale Atomic/Molecular Massively Parallel Simulator
http://lammps.sandia.gov, Sandia National Laboratories
Steve Plimpton, sjplimp@sandia.gov
Copyright (2003) Sandia Corporation. Under the terms of Contract
DE-AC04-94AL85000 with Sandia Corporation, the U.S. Government retains
certain rights in this software. This software is distributed under
the GNU General Public License.
See the README file in the top-level LAMMPS directory.
------------------------------------------------------------------------- */
#ifdef FIX_CLASS
FixStyle(rigid,FixRigid)
#else
#ifndef LMP_FIX_RIGID_H
#define LMP_FIX_RIGID_H
#include "fix.h"
namespace LAMMPS_NS {
class FixRigid : public Fix {
public:
FixRigid(class LAMMPS *, int, char **);
virtual ~FixRigid();
virtual int setmask();
virtual void init();
virtual void setup(int);
virtual void initial_integrate(int);
void post_force(int);
virtual void final_integrate();
void initial_integrate_respa(int, int, int);
void final_integrate_respa(int, int);
void write_restart_file(char *);
virtual double compute_scalar();
virtual int modify_param(int, char **) {return 0;}
double memory_usage();
void grow_arrays(int);
void copy_arrays(int, int, int);
void set_arrays(int);
int pack_exchange(int, double *);
int unpack_exchange(int, double *);
void setup_pre_neighbor();
void pre_neighbor();
int dof(int);
void deform(int);
void reset_dt();
void zero_momentum();
void zero_rotation();
virtual void *extract(const char*, int &);
double extract_ke();
double extract_erotational();
double compute_array(int, int);
protected:
int me,nprocs;
double dtv,dtf,dtq;
double *step_respa;
int triclinic;
double MINUSPI,TWOPI;
char *infile; // file to read rigid body attributes from
int rstyle; // SINGLE,MOLECULE,GROUP
int setupflag; // 1 if body properties are setup, else 0
int dimension; // # of dimensions
int nbody; // # of rigid bodies
int nlinear; // # of linear rigid bodies
int *nrigid; // # of atoms in each rigid body
int *mol2body; // convert mol-ID to rigid body index
int *body2mol; // convert rigid body index to mol-ID
int maxmol; // size of mol2body = max mol-ID
int *body; // which body each atom is part of (-1 if none)
double **displace; // displacement of each atom in body coords
double *masstotal; // total mass of each rigid body
double **xcm; // coords of center-of-mass of each rigid body
double **vcm; // velocity of center-of-mass of each
double **fcm; // force on center-of-mass of each
double **inertia; // 3 principal components of inertia of each
double **ex_space,**ey_space,**ez_space;
// principal axes of each in space coords
double **angmom; // angular momentum of each in space coords
double **omega; // angular velocity of each in space coords
double **torque; // torque on each rigid body in space coords
double **quat; // quaternion of each rigid body
imageint *imagebody; // image flags of xcm of each rigid body
double **fflag; // flag for on/off of center-of-mass force
double **tflag; // flag for on/off of center-of-mass torque
double **langextra; // Langevin thermostat forces and torques
double **sum,**all; // work vectors for each rigid body
int **remapflag; // PBC remap flags for each rigid body
int extended; // 1 if any particles have extended attributes
int orientflag; // 1 if particles store spatial orientation
int dorientflag; // 1 if particles store dipole orientation
imageint *xcmimage; // internal image flags for atoms in rigid bodies
// set relative to in-box xcm of each body
int *eflags; // flags for extended particles
double **orient; // orientation vector of particle wrt rigid body
double **dorient; // orientation of dipole mu wrt rigid body
double tfactor; // scale factor on temperature of rigid bodies
int langflag; // 0/1 = no/yes Langevin thermostat
int tstat_flag; // NVT settings
double t_start,t_stop,t_target;
double t_period,t_freq;
int t_chain,t_iter,t_order;
int pstat_flag; // NPT settings
double p_start[3],p_stop[3];
double p_period[3],p_freq[3];
int p_flag[3];
int pcouple,pstyle;
int p_chain;
int allremap; // remap all atoms
int dilate_group_bit; // mask for dilation group
char *id_dilate; // group name to dilate
class RanMars *random;
class AtomVecEllipsoid *avec_ellipsoid;
class AtomVecLine *avec_line;
class AtomVecTri *avec_tri;
int POINT,SPHERE,ELLIPSOID,LINE,TRIANGLE,DIPOLE; // bitmasks for eflags
int OMEGA,ANGMOM,TORQUE;
void image_shift();
void set_xv();
void set_v();
void setup_bodies_static();
void setup_bodies_dynamic();
- void readfile(int, double *, double **, double **, double **, int *);
+ void readfile(int, double *, double **, double **, double **,
+ imageint *, int *);
};
}
#endif
#endif
/* ERROR/WARNING messages:
E: Illegal ... command
Self-explanatory. Check the input script syntax and compare to the
documentation for the command. You can use -echo screen as a
command-line option when running LAMMPS to see the offending line.
E: Fix rigid molecule requires atom attribute molecule
Self-explanatory.
E: Too many molecules for fix rigid
The limit is 2^31 = ~2 billion molecules.
E: Could not find fix rigid group ID
A group ID used in the fix rigid command does not exist.
E: One or more atoms belong to multiple rigid bodies
Two or more rigid bodies defined by the fix rigid command cannot
contain the same atom.
E: No rigid bodies defined
The fix specification did not end up defining any rigid bodies.
E: Fix rigid z force cannot be on for 2d simulation
Self-explanatory.
E: Fix rigid xy torque cannot be on for 2d simulation
Self-explanatory.
E: Fix rigid langevin period must be > 0.0
Self-explanatory.
E: Fix rigid npt/nph dilate group ID does not exist
Self-explanatory.
E: One or zero atoms in rigid body
Any rigid body defined by the fix rigid command must contain 2 or more
atoms.
W: More than one fix rigid
It is not efficient to use fix rigid more than once.
E: Rigid fix must come before NPT/NPH fix
NPT/NPH fix must be defined in input script after all rigid fixes,
else the rigid fix contribution to the pressure virial is
incorrect.
W: Cannot count rigid body degrees-of-freedom before bodies are initialized
This means the temperature associated with the rigid bodies may be
incorrect on this timestep.
W: Computing temperature of portions of rigid bodies
The group defined by the temperature compute does not encompass all
the atoms in one or more rigid bodies, so the change in
degrees-of-freedom for the atoms in those partial rigid bodies will
not be accounted for.
E: Fix rigid atom has non-zero image flag in a non-periodic dimension
Image flags for non-periodic dimensions should not be set.
E: Insufficient Jacobi rotations for rigid body
Eigensolve for rigid body was not sufficiently accurate.
E: Fix rigid: Bad principal moments
The principal moments of inertia computed for a rigid body
are not within the required tolerances.
E: Cannot open fix rigid infile %s
The specified file cannot be opened. Check that the path and name are
correct.
E: Unexpected end of fix rigid file
A read operation from the file failed.
E: Fix rigid file has no lines
Self-explanatory.
E: Incorrect rigid body format in fix rigid file
The number of fields per line is not what is expected.
E: Invalid rigid body ID in fix rigid file
The ID does not match the number of an existing ID of rigid bodies
that are defined by the fix rigid command.
E: Cannot open fix rigid restart file %s
The specified file cannot be opened. Check that the path and name are
correct.
*/
diff --git a/src/RIGID/fix_rigid_small.cpp b/src/RIGID/fix_rigid_small.cpp
index e6e537bc5..1d7569bcf 100644
--- a/src/RIGID/fix_rigid_small.cpp
+++ b/src/RIGID/fix_rigid_small.cpp
@@ -1,3507 +1,3520 @@
/* ----------------------------------------------------------------------
LAMMPS - Large-scale Atomic/Molecular Massively Parallel Simulator
http://lammps.sandia.gov, Sandia National Laboratories
Steve Plimpton, sjplimp@sandia.gov
Copyright (2003) Sandia Corporation. Under the terms of Contract
DE-AC04-94AL85000 with Sandia Corporation, the U.S. Government retains
certain rights in this software. This software is distributed under
the GNU General Public License.
See the README file in the top-level LAMMPS directory.
------------------------------------------------------------------------- */
#include "math.h"
#include "stdio.h"
#include "stdlib.h"
#include "string.h"
#include "fix_rigid_small.h"
#include "math_extra.h"
#include "atom.h"
#include "atom_vec_ellipsoid.h"
#include "atom_vec_line.h"
#include "atom_vec_tri.h"
#include "molecule.h"
#include "domain.h"
#include "update.h"
#include "respa.h"
#include "modify.h"
#include "group.h"
#include "comm.h"
#include "force.h"
#include "output.h"
#include "random_mars.h"
#include "math_const.h"
#include "memory.h"
#include "error.h"
#include <map>
using namespace LAMMPS_NS;
using namespace FixConst;
using namespace MathConst;
// allocate space for static class variable
FixRigidSmall *FixRigidSmall::frsptr;
#define MAXLINE 1024
#define CHUNK 1024
-#define ATTRIBUTE_PERBODY 17
+#define ATTRIBUTE_PERBODY 20
#define TOLERANCE 1.0e-6
#define EPSILON 1.0e-7
#define BIG 1.0e20
#define SINERTIA 0.4 // moment of inertia prefactor for sphere
#define EINERTIA 0.4 // moment of inertia prefactor for ellipsoid
#define LINERTIA (1.0/12.0) // moment of inertia prefactor for line segment
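// (SINERTIA = 2/5 is the solid-sphere I = 2/5 m r^2 prefactor; LINERTIA = 1/12
// is the thin-rod I = 1/12 m L^2 prefactor about its center)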
#define DELTA_BODY 10000
enum{NONE,XYZ,XY,YZ,XZ}; // same as in FixRigid
enum{ISO,ANISO,TRICLINIC}; // same as in FixRigid
enum{FULL_BODY,INITIAL,FINAL,FORCE_TORQUE,VCM_ANGMOM,XCM_MASS,ITENSOR,DOF};
/* ---------------------------------------------------------------------- */
FixRigidSmall::FixRigidSmall(LAMMPS *lmp, int narg, char **arg) :
Fix(lmp, narg, arg)
{
int i;
scalar_flag = 1;
extscalar = 0;
global_freq = 1;
time_integrate = 1;
rigid_flag = 1;
virial_flag = 1;
create_attribute = 1;
dof_flag = 1;
MPI_Comm_rank(world,&me);
MPI_Comm_size(world,&nprocs);
// perform initial allocation of atom-based arrays
// register with Atom class
extended = orientflag = dorientflag = 0;
bodyown = NULL;
bodytag = NULL;
atom2body = NULL;
xcmimage = NULL;
displace = NULL;
eflags = NULL;
orient = NULL;
dorient = NULL;
grow_arrays(atom->nmax);
atom->add_callback(0);
// parse args for rigid body specification
if (narg < 4) error->all(FLERR,"Illegal fix rigid/small command");
if (strcmp(arg[3],"molecule") != 0)
error->all(FLERR,"Illegal fix rigid/small command");
if (atom->molecule_flag == 0)
error->all(FLERR,"Fix rigid/small requires atom attribute molecule");
if (atom->map_style == 0)
error->all(FLERR,"Fix rigid/small requires an atom map, see atom_modify");
// maxmol = largest molecule #
int *mask = atom->mask;
tagint *molecule = atom->molecule;
int nlocal = atom->nlocal;
maxmol = -1;
for (i = 0; i < nlocal; i++)
if (mask[i] & groupbit) maxmol = MAX(maxmol,molecule[i]);
tagint itmp;
MPI_Allreduce(&maxmol,&itmp,1,MPI_LMP_TAGINT,MPI_MAX,world);
maxmol = itmp;
// number of linear molecules is counted later
nlinear = 0;
// parse optional args
int seed;
langflag = 0;
infile = NULL;
onemols = NULL;
tstat_flag = 0;
pstat_flag = 0;
allremap = 1;
id_dilate = NULL;
t_chain = 10;
t_iter = 1;
t_order = 3;
p_chain = 10;
pcouple = NONE;
pstyle = ANISO;
for (int i = 0; i < 3; i++) {
p_start[i] = p_stop[i] = p_period[i] = 0.0;
p_flag[i] = 0;
}
int iarg = 4;
while (iarg < narg) {
if (strcmp(arg[iarg],"langevin") == 0) {
if (iarg+5 > narg) error->all(FLERR,"Illegal fix rigid/small command");
if (strcmp(style,"rigid/small") != 0)
error->all(FLERR,"Illegal fix rigid/small command");
langflag = 1;
t_start = force->numeric(FLERR,arg[iarg+1]);
t_stop = force->numeric(FLERR,arg[iarg+2]);
t_period = force->numeric(FLERR,arg[iarg+3]);
seed = force->inumeric(FLERR,arg[iarg+4]);
if (t_period <= 0.0)
error->all(FLERR,"Fix rigid/small langevin period must be > 0.0");
if (seed <= 0) error->all(FLERR,"Illegal fix rigid/small command");
iarg += 5;
} else if (strcmp(arg[iarg],"infile") == 0) {
if (iarg+2 > narg) error->all(FLERR,"Illegal fix rigid/small command");
delete [] infile;
int n = strlen(arg[iarg+1]) + 1;
infile = new char[n];
strcpy(infile,arg[iarg+1]);
restart_file = 1;
iarg += 2;
} else if (strcmp(arg[iarg],"mol") == 0) {
if (iarg+2 > narg) error->all(FLERR,"Illegal fix rigid/small command");
int imol = atom->find_molecule(arg[iarg+1]);
if (imol == -1)
error->all(FLERR,"Molecule template ID for "
"fix rigid/small does not exist");
onemols = &atom->molecules[imol];
nmol = onemols[0]->nset;
restart_file = 1;
iarg += 2;
} else if (strcmp(arg[iarg],"temp") == 0) {
if (iarg+4 > narg) error->all(FLERR,"Illegal fix rigid/small command");
if (strcmp(style,"rigid/nvt/small") != 0 &&
strcmp(style,"rigid/npt/small") != 0)
error->all(FLERR,"Illegal fix rigid command");
tstat_flag = 1;
t_start = force->numeric(FLERR,arg[iarg+1]);
t_stop = force->numeric(FLERR,arg[iarg+2]);
t_period = force->numeric(FLERR,arg[iarg+3]);
iarg += 4;
} else if (strcmp(arg[iarg],"iso") == 0) {
if (iarg+4 > narg) error->all(FLERR,"Illegal fix rigid/small command");
if (strcmp(style,"rigid/npt/small") != 0 &&
strcmp(style,"rigid/nph/small") != 0)
error->all(FLERR,"Illegal fix rigid/small command");
pcouple = XYZ;
p_start[0] = p_start[1] = p_start[2] = force->numeric(FLERR,arg[iarg+1]);
p_stop[0] = p_stop[1] = p_stop[2] = force->numeric(FLERR,arg[iarg+2]);
p_period[0] = p_period[1] = p_period[2] =
force->numeric(FLERR,arg[iarg+3]);
p_flag[0] = p_flag[1] = p_flag[2] = 1;
if (domain->dimension == 2) {
p_start[2] = p_stop[2] = p_period[2] = 0.0;
p_flag[2] = 0;
}
iarg += 4;
} else if (strcmp(arg[iarg],"aniso") == 0) {
if (iarg+4 > narg) error->all(FLERR,"Illegal fix rigid/small command");
if (strcmp(style,"rigid/npt/small") != 0 &&
strcmp(style,"rigid/nph/small") != 0)
error->all(FLERR,"Illegal fix rigid/small command");
p_start[0] = p_start[1] = p_start[2] = force->numeric(FLERR,arg[iarg+1]);
p_stop[0] = p_stop[1] = p_stop[2] = force->numeric(FLERR,arg[iarg+2]);
p_period[0] = p_period[1] = p_period[2] =
force->numeric(FLERR,arg[iarg+3]);
p_flag[0] = p_flag[1] = p_flag[2] = 1;
if (domain->dimension == 2) {
p_start[2] = p_stop[2] = p_period[2] = 0.0;
p_flag[2] = 0;
}
iarg += 4;
} else if (strcmp(arg[iarg],"x") == 0) {
if (iarg+4 > narg) error->all(FLERR,"Illegal fix rigid/small command");
p_start[0] = force->numeric(FLERR,arg[iarg+1]);
p_stop[0] = force->numeric(FLERR,arg[iarg+2]);
p_period[0] = force->numeric(FLERR,arg[iarg+3]);
p_flag[0] = 1;
iarg += 4;
} else if (strcmp(arg[iarg],"y") == 0) {
if (iarg+4 > narg) error->all(FLERR,"Illegal fix rigid/small command");
p_start[1] = force->numeric(FLERR,arg[iarg+1]);
p_stop[1] = force->numeric(FLERR,arg[iarg+2]);
p_period[1] = force->numeric(FLERR,arg[iarg+3]);
p_flag[1] = 1;
iarg += 4;
} else if (strcmp(arg[iarg],"z") == 0) {
if (iarg+4 > narg) error->all(FLERR,"Illegal fix rigid/small command");
p_start[2] = force->numeric(FLERR,arg[iarg+1]);
p_stop[2] = force->numeric(FLERR,arg[iarg+2]);
p_period[2] = force->numeric(FLERR,arg[iarg+3]);
p_flag[2] = 1;
iarg += 4;
} else if (strcmp(arg[iarg],"couple") == 0) {
if (iarg+2 > narg) error->all(FLERR,"Illegal fix rigid/small command");
if (strcmp(arg[iarg+1],"xyz") == 0) pcouple = XYZ;
else if (strcmp(arg[iarg+1],"xy") == 0) pcouple = XY;
else if (strcmp(arg[iarg+1],"yz") == 0) pcouple = YZ;
else if (strcmp(arg[iarg+1],"xz") == 0) pcouple = XZ;
else if (strcmp(arg[iarg+1],"none") == 0) pcouple = NONE;
else error->all(FLERR,"Illegal fix rigid/small command");
iarg += 2;
} else if (strcmp(arg[iarg],"dilate") == 0) {
if (iarg+2 > narg)
error->all(FLERR,"Illegal fix rigid/small nvt/npt/nph command");
if (strcmp(arg[iarg+1],"all") == 0) allremap = 1;
else {
allremap = 0;
delete [] id_dilate;
int n = strlen(arg[iarg+1]) + 1;
id_dilate = new char[n];
strcpy(id_dilate,arg[iarg+1]);
int idilate = group->find(id_dilate);
if (idilate == -1)
error->all(FLERR,"Fix rigid/small nvt/npt/nph dilate group ID "
"does not exist");
}
iarg += 2;
} else if (strcmp(arg[iarg],"tparam") == 0) {
if (iarg+4 > narg) error->all(FLERR,"Illegal fix rigid/small command");
if (strcmp(style,"rigid/nvt/small") != 0 &&
strcmp(style,"rigid/npt/small") != 0)
error->all(FLERR,"Illegal fix rigid/small command");
t_chain = force->numeric(FLERR,arg[iarg+1]);
t_iter = force->numeric(FLERR,arg[iarg+2]);
t_order = force->numeric(FLERR,arg[iarg+3]);
iarg += 4;
} else if (strcmp(arg[iarg],"pchain") == 0) {
if (iarg+2 > narg) error->all(FLERR,"Illegal fix rigid/small command");
if (strcmp(style,"rigid/npt/small") != 0 &&
strcmp(style,"rigid/nph/small") != 0)
error->all(FLERR,"Illegal fix rigid/small command");
p_chain = force->numeric(FLERR,arg[iarg+1]);
iarg += 2;
} else error->all(FLERR,"Illegal fix rigid/small command");
}
// error check and further setup for Molecule template
if (onemols) {
for (int i = 0; i < nmol; i++) {
if (onemols[i]->xflag == 0)
error->all(FLERR,"Fix rigid/small molecule must have coordinates");
if (onemols[i]->typeflag == 0)
error->all(FLERR,"Fix rigid/small molecule must have atom types");
// fix rigid/small uses center, masstotal, COM, inertia of molecule
onemols[i]->compute_center();
onemols[i]->compute_mass();
onemols[i]->compute_com();
onemols[i]->compute_inertia();
}
}
// set pstat_flag
pstat_flag = 0;
for (int i = 0; i < 3; i++)
if (p_flag[i]) pstat_flag = 1;
if (pcouple == XYZ || (domain->dimension == 2 && pcouple == XY)) pstyle = ISO;
else pstyle = ANISO;
// create rigid bodies based on molecule ID
// sets bodytag for owned atoms
// body attributes are computed later by setup_bodies()
create_bodies();
// set nlocal_body and allocate bodies I own
tagint *tag = atom->tag;
nlocal_body = nghost_body = 0;
for (i = 0; i < nlocal; i++)
if (bodytag[i] == tag[i]) nlocal_body++;
nmax_body = 0;
while (nmax_body < nlocal_body) nmax_body += DELTA_BODY;
body = (Body *) memory->smalloc(nmax_body*sizeof(Body),
"rigid/small:body");
// set bodyown for owned atoms
nlocal_body = 0;
for (i = 0; i < nlocal; i++)
if (bodytag[i] == tag[i]) {
body[nlocal_body].ilocal = i;
bodyown[i] = nlocal_body++;
} else bodyown[i] = -1;
// bodysize = sizeof(Body) in doubles
bodysize = sizeof(Body)/sizeof(double);
if (bodysize*sizeof(double) != sizeof(Body)) bodysize++;
// set max comm sizes needed by this fix
comm_forward = 1 + bodysize;
comm_reverse = 6;
// bitmasks for properties of extended particles
POINT = 1;
SPHERE = 2;
ELLIPSOID = 4;
LINE = 8;
TRIANGLE = 16;
DIPOLE = 32;
OMEGA = 64;
ANGMOM = 128;
TORQUE = 256;
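// the values above are distinct powers of two, so one eflags entry can OR
// together several properties; e.g. a finite-size ellipsoid whose angular
// state is stored as angular momentum would carry
//   eflags[i] = ELLIPSOID | ANGMOM | TORQUE;
// and the per-property tests later use bitwise AND, as in (eflags[i] & TORQUE)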
MINUSPI = -MY_PI;
TWOPI = 2.0*MY_PI;
// atom style pointers to particles that store extra info
avec_ellipsoid = (AtomVecEllipsoid *) atom->style_match("ellipsoid");
avec_line = (AtomVecLine *) atom->style_match("line");
avec_tri = (AtomVecTri *) atom->style_match("tri");
// print statistics
int one = 0;
bigint atomone = 0;
for (int i = 0; i < nlocal; i++) {
if (bodyown[i] >= 0) one++;
if (bodytag[i] > 0) atomone++;
}
MPI_Allreduce(&one,&nbody,1,MPI_INT,MPI_SUM,world);
bigint atomall;
MPI_Allreduce(&atomone,&atomall,1,MPI_LMP_BIGINT,MPI_SUM,world);
if (me == 0) {
if (screen) {
fprintf(screen,"%d rigid bodies with " BIGINT_FORMAT " atoms\n",
nbody,atomall);
fprintf(screen," %g = max distance from body owner to body atom\n",
maxextent);
}
if (logfile) {
fprintf(logfile,"%d rigid bodies with " BIGINT_FORMAT " atoms\n",
nbody,atomall);
fprintf(logfile," %g = max distance from body owner to body atom\n",
maxextent);
}
}
// initialize Marsaglia RNG with processor-unique seed
maxlang = 0;
langextra = NULL;
random = NULL;
if (langflag) random = new RanMars(lmp,seed + comm->me);
// mass vector for granular pair styles
mass_body = NULL;
nmax_mass = 0;
// wait to setup bodies until comm stencils are defined
setupflag = 0;
}
/* ---------------------------------------------------------------------- */
FixRigidSmall::~FixRigidSmall()
{
// unregister callbacks to this fix from Atom class
atom->delete_callback(id,0);
// delete locally stored arrays
memory->sfree(body);
memory->destroy(bodyown);
memory->destroy(bodytag);
memory->destroy(atom2body);
memory->destroy(xcmimage);
memory->destroy(displace);
memory->destroy(eflags);
memory->destroy(orient);
memory->destroy(dorient);
delete random;
delete [] infile;
memory->destroy(langextra);
memory->destroy(mass_body);
}
/* ---------------------------------------------------------------------- */
int FixRigidSmall::setmask()
{
int mask = 0;
mask |= INITIAL_INTEGRATE;
mask |= FINAL_INTEGRATE;
if (langflag) mask |= POST_FORCE;
mask |= PRE_NEIGHBOR;
mask |= INITIAL_INTEGRATE_RESPA;
mask |= FINAL_INTEGRATE_RESPA;
return mask;
}
/* ---------------------------------------------------------------------- */
void FixRigidSmall::init()
{
int i;
triclinic = domain->triclinic;
// warn if more than one rigid fix
int count = 0;
for (i = 0; i < modify->nfix; i++)
if (strcmp(modify->fix[i]->style,"rigid") == 0) count++;
if (count > 1 && me == 0) error->warning(FLERR,"More than one fix rigid");
// error if npt,nph fix comes before rigid fix
for (i = 0; i < modify->nfix; i++) {
if (strcmp(modify->fix[i]->style,"npt") == 0) break;
if (strcmp(modify->fix[i]->style,"nph") == 0) break;
}
if (i < modify->nfix) {
for (int j = i; j < modify->nfix; j++)
if (strcmp(modify->fix[j]->style,"rigid") == 0)
error->all(FLERR,"Rigid fix must come before NPT/NPH fix");
}
// timestep info
dtv = update->dt;
dtf = 0.5 * update->dt * force->ftm2v;
dtq = 0.5 * update->dt;
if (strstr(update->integrate_style,"respa"))
step_respa = ((Respa *) update->integrate)->step;
}
/* ----------------------------------------------------------------------
setup static/dynamic properties of rigid bodies, using current atom info
only do initialization once, b/c properties may not be re-computable
especially if overlapping particles or bodies inserted from mol template
do not do dynamic init if read body properties from infile
this is b/c the infile defines the static and dynamic properties
and may not be computable if they contain overlapping particles
setup_bodies_static() reads infile itself
cannot do this until now, b/c requires comm->setup() to have setup stencil
invoke pre_neighbor() to insure body xcmimage flags are reset
needed if Verlet::setup::pbc() has remapped/migrated atoms for 2nd run
setup_bodies_static() invokes pre_neighbor itself
------------------------------------------------------------------------- */
void FixRigidSmall::setup_pre_neighbor()
{
if (!setupflag) setup_bodies_static();
else pre_neighbor();
if (!setupflag && !infile) setup_bodies_dynamic();
setupflag = 1;
}
/* ----------------------------------------------------------------------
compute initial fcm and torque on bodies, also initial virial
reset all particle velocities to be consistent with vcm and omega
------------------------------------------------------------------------- */
void FixRigidSmall::setup(int vflag)
{
int i,n,ibody;
//check(1);
// sum fcm, torque across all rigid bodies
// fcm = force on COM
// torque = torque around COM
double **x = atom->x;
double **f = atom->f;
int nlocal = atom->nlocal;
double *xcm,*fcm,*tcm;
double dx,dy,dz;
double unwrap[3];
for (ibody = 0; ibody < nlocal_body+nghost_body; ibody++) {
fcm = body[ibody].fcm;
fcm[0] = fcm[1] = fcm[2] = 0.0;
tcm = body[ibody].torque;
tcm[0] = tcm[1] = tcm[2] = 0.0;
}
for (i = 0; i < nlocal; i++) {
if (atom2body[i] < 0) continue;
Body *b = &body[atom2body[i]];
fcm = b->fcm;
fcm[0] += f[i][0];
fcm[1] += f[i][1];
fcm[2] += f[i][2];
domain->unmap(x[i],xcmimage[i],unwrap);
xcm = b->xcm;
dx = unwrap[0] - xcm[0];
dy = unwrap[1] - xcm[1];
dz = unwrap[2] - xcm[2];
tcm = b->torque;
tcm[0] += dy * f[i][2] - dz * f[i][1];
tcm[1] += dz * f[i][0] - dx * f[i][2];
tcm[2] += dx * f[i][1] - dy * f[i][0];
}
// extended particles add their rotation/torque to angmom/torque of body
if (extended) {
double **torque = atom->torque;
for (i = 0; i < nlocal; i++) {
if (atom2body[i] < 0) continue;
Body *b = &body[atom2body[i]];
if (eflags[i] & TORQUE) {
tcm = b->torque;
tcm[0] += torque[i][0];
tcm[1] += torque[i][1];
tcm[2] += torque[i][2];
}
}
}
// reverse communicate fcm, torque of all bodies
commflag = FORCE_TORQUE;
comm->reverse_comm_fix(this,6);
// virial setup before call to set_v
if (vflag) v_setup(vflag);
else evflag = 0;
// compute and forward communicate vcm and omega of all bodies
for (ibody = 0; ibody < nlocal_body; ibody++) {
Body *b = &body[ibody];
MathExtra::angmom_to_omega(b->angmom,b->ex_space,b->ey_space,
b->ez_space,b->inertia,b->omega);
}
commflag = FINAL;
comm->forward_comm_fix(this,10);
// set velocity/rotation of atoms in rigid bodies
set_v();
// guesstimate virial as 2x the set_v contribution
if (vflag_global)
for (n = 0; n < 6; n++) virial[n] *= 2.0;
if (vflag_atom) {
for (i = 0; i < nlocal; i++)
for (n = 0; n < 6; n++)
vatom[i][n] *= 2.0;
}
}
/* ---------------------------------------------------------------------- */
void FixRigidSmall::initial_integrate(int vflag)
{
double dtfm;
//check(2);
for (int ibody = 0; ibody < nlocal_body; ibody++) {
Body *b = &body[ibody];
// update vcm by 1/2 step
dtfm = dtf / b->mass;
b->vcm[0] += dtfm * b->fcm[0];
b->vcm[1] += dtfm * b->fcm[1];
b->vcm[2] += dtfm * b->fcm[2];
// update xcm by full step
b->xcm[0] += dtv * b->vcm[0];
b->xcm[1] += dtv * b->vcm[1];
b->xcm[2] += dtv * b->vcm[2];
// update angular momentum by 1/2 step
b->angmom[0] += dtf * b->torque[0];
b->angmom[1] += dtf * b->torque[1];
b->angmom[2] += dtf * b->torque[2];
// compute omega at 1/2 step from angmom at 1/2 step and current q
// update quaternion a full step via Richardson iteration
// returns new normalized quaternion, also updated omega at 1/2 step
// update ex,ey,ez to reflect new quaternion
MathExtra::angmom_to_omega(b->angmom,b->ex_space,b->ey_space,
b->ez_space,b->inertia,b->omega);
MathExtra::richardson(b->quat,b->angmom,b->omega,b->inertia,dtq);
MathExtra::q_to_exyz(b->quat,b->ex_space,b->ey_space,b->ez_space);
}
// virial setup before call to set_xv
if (vflag) v_setup(vflag);
else evflag = 0;
// forward communicate updated info of all bodies
commflag = INITIAL;
comm->forward_comm_fix(this,26);
// set coords/orient and velocity/rotation of atoms in rigid bodies
set_xv();
}
/* ----------------------------------------------------------------------
apply Langevin thermostat to all 6 DOF of rigid bodies I own
unlike fix langevin, this stores extra force in extra arrays,
which are added in when final_integrate() calculates a new fcm/torque
------------------------------------------------------------------------- */
void FixRigidSmall::post_force(int vflag)
{
double gamma1,gamma2;
// grow langextra if needed
if (nlocal_body > maxlang) {
memory->destroy(langextra);
maxlang = nlocal_body + nghost_body;
memory->create(langextra,maxlang,6,"rigid/small:langextra");
}
double delta = update->ntimestep - update->beginstep;
delta /= update->endstep - update->beginstep;
double t_target = t_start + delta * (t_stop-t_start);
double tsqrt = sqrt(t_target);
double boltz = force->boltz;
double dt = update->dt;
double mvv2e = force->mvv2e;
double ftm2v = force->ftm2v;
double *vcm,*omega,*inertia;
for (int ibody = 0; ibody < nlocal_body; ibody++) {
vcm = body[ibody].vcm;
omega = body[ibody].omega;
inertia = body[ibody].inertia;
gamma1 = -body[ibody].mass / t_period / ftm2v;
gamma2 = sqrt(body[ibody].mass) * tsqrt *
sqrt(24.0*boltz/t_period/dt/mvv2e) / ftm2v;
langextra[ibody][0] = gamma1*vcm[0] + gamma2*(random->uniform()-0.5);
langextra[ibody][1] = gamma1*vcm[1] + gamma2*(random->uniform()-0.5);
langextra[ibody][2] = gamma1*vcm[2] + gamma2*(random->uniform()-0.5);
gamma1 = -1.0 / t_period / ftm2v;
gamma2 = tsqrt * sqrt(24.0*boltz/t_period/dt/mvv2e) / ftm2v;
langextra[ibody][3] = inertia[0]*gamma1*omega[0] +
sqrt(inertia[0])*gamma2*(random->uniform()-0.5);
langextra[ibody][4] = inertia[1]*gamma1*omega[1] +
sqrt(inertia[1])*gamma2*(random->uniform()-0.5);
langextra[ibody][5] = inertia[2]*gamma1*omega[2] +
sqrt(inertia[2])*gamma2*(random->uniform()-0.5);
}
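// consistency note on the noise amplitude above: random->uniform()-0.5 has
// variance 1/12, so the factor 24 inside sqrt(24.0*boltz/t_period/dt/mvv2e)
// makes the random-force variance ~ 2 kB T m / (t_period * dt) per component
// (up to the mvv2e/ftm2v unit conversions), matching the drag gamma1 = -m/t_period,
// i.e. the usual Langevin fluctuation-dissipation balance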
}
/* ---------------------------------------------------------------------- */
void FixRigidSmall::final_integrate()
{
int i,ibody;
double dtfm;
//check(3);
// sum over atoms to get force and torque on rigid body
double **x = atom->x;
double **f = atom->f;
int nlocal = atom->nlocal;
double dx,dy,dz;
double unwrap[3];
double *xcm,*fcm,*tcm;
for (ibody = 0; ibody < nlocal_body+nghost_body; ibody++) {
fcm = body[ibody].fcm;
fcm[0] = fcm[1] = fcm[2] = 0.0;
tcm = body[ibody].torque;
tcm[0] = tcm[1] = tcm[2] = 0.0;
}
for (i = 0; i < nlocal; i++) {
if (atom2body[i] < 0) continue;
Body *b = &body[atom2body[i]];
fcm = b->fcm;
fcm[0] += f[i][0];
fcm[1] += f[i][1];
fcm[2] += f[i][2];
domain->unmap(x[i],xcmimage[i],unwrap);
xcm = b->xcm;
dx = unwrap[0] - xcm[0];
dy = unwrap[1] - xcm[1];
dz = unwrap[2] - xcm[2];
tcm = b->torque;
tcm[0] += dy*f[i][2] - dz*f[i][1];
tcm[1] += dz*f[i][0] - dx*f[i][2];
tcm[2] += dx*f[i][1] - dy*f[i][0];
}
// extended particles add their torque to torque of body
if (extended) {
double **torque = atom->torque;
for (i = 0; i < nlocal; i++) {
if (atom2body[i] < 0) continue;
if (eflags[i] & TORQUE) {
tcm = body[atom2body[i]].torque;
tcm[0] += torque[i][0];
tcm[1] += torque[i][1];
tcm[2] += torque[i][2];
}
}
}
// reverse communicate fcm, torque of all bodies
commflag = FORCE_TORQUE;
comm->reverse_comm_fix(this,6);
// include Langevin thermostat forces and torques
if (langflag) {
for (int ibody = 0; ibody < nlocal_body; ibody++) {
fcm = body[ibody].fcm;
fcm[0] += langextra[ibody][0];
fcm[1] += langextra[ibody][1];
fcm[2] += langextra[ibody][2];
tcm = body[ibody].torque;
tcm[0] += langextra[ibody][3];
tcm[1] += langextra[ibody][4];
tcm[2] += langextra[ibody][5];
}
}
// update vcm and angmom, recompute omega
for (int ibody = 0; ibody < nlocal_body; ibody++) {
Body *b = &body[ibody];
// update vcm by 1/2 step
dtfm = dtf / b->mass;
b->vcm[0] += dtfm * b->fcm[0];
b->vcm[1] += dtfm * b->fcm[1];
b->vcm[2] += dtfm * b->fcm[2];
// update angular momentum by 1/2 step
b->angmom[0] += dtf * b->torque[0];
b->angmom[1] += dtf * b->torque[1];
b->angmom[2] += dtf * b->torque[2];
MathExtra::angmom_to_omega(b->angmom,b->ex_space,b->ey_space,
b->ez_space,b->inertia,b->omega);
}
// forward communicate updated info of all bodies
commflag = FINAL;
comm->forward_comm_fix(this,10);
// set velocity/rotation of atoms in rigid bodies
// virial is already setup from initial_integrate
set_v();
}
/* ---------------------------------------------------------------------- */
void FixRigidSmall::initial_integrate_respa(int vflag, int ilevel, int iloop)
{
dtv = step_respa[ilevel];
dtf = 0.5 * step_respa[ilevel] * force->ftm2v;
dtq = 0.5 * step_respa[ilevel];
if (ilevel == 0) initial_integrate(vflag);
else final_integrate();
}
/* ---------------------------------------------------------------------- */
void FixRigidSmall::final_integrate_respa(int ilevel, int iloop)
{
dtf = 0.5 * step_respa[ilevel] * force->ftm2v;
final_integrate();
}
/* ----------------------------------------------------------------------
remap xcm of each rigid body back into periodic simulation box
done during pre_neighbor so will be after call to pbc()
and after fix_deform::pre_exchange() may have flipped box
use domain->remap() in case xcm is far away from box
due to first-time definition of rigid body in setup_bodies_static()
or due to box flip
also adjust imagebody = rigid body image flags, due to xcm remap
then communicate bodies so other procs will know of changes to body xcm
then adjust xcmimage flags of all atoms in bodies via image_shift()
for two effects
(1) change in true image flags due to pbc() call during exchange
(2) change in imagebody due to xcm remap
xcmimage flags are always -1,0,+1 so that body can be unwrapped
around in-box xcm and stay close to simulation box
if unwrapping were instead inferred from the true atom image flags,
a body could end up very far away
when unwrapped by those flags
then set_xv() would compute huge displacements every step to reset coords of
all the body atoms to be back inside the box, ditto for triclinic box flip
note: so the point is just to avoid that numeric problem?
------------------------------------------------------------------------- */
void FixRigidSmall::pre_neighbor()
{
for (int ibody = 0; ibody < nlocal_body; ibody++) {
Body *b = &body[ibody];
domain->remap(b->xcm,b->image);
}
nghost_body = 0;
commflag = FULL_BODY;
comm->forward_comm_fix(this);
reset_atom2body();
//check(4);
image_shift();
}
/* ----------------------------------------------------------------------
reset body xcmimage flags of atoms in bodies
xcmimage flags are relative to xcm so that body can be unwrapped
xcmimage = true image flag - imagebody flag
------------------------------------------------------------------------- */
void FixRigidSmall::image_shift()
{
imageint tdim,bdim,xdim[3];
imageint *image = atom->image;
int nlocal = atom->nlocal;
for (int i = 0; i < nlocal; i++) {
if (atom2body[i] < 0) continue;
Body *b = &body[atom2body[i]];
tdim = image[i] & IMGMASK;
bdim = b->image & IMGMASK;
xdim[0] = IMGMAX + tdim - bdim;
tdim = (image[i] >> IMGBITS) & IMGMASK;
bdim = (b->image >> IMGBITS) & IMGMASK;
xdim[1] = IMGMAX + tdim - bdim;
tdim = image[i] >> IMG2BITS;
bdim = b->image >> IMG2BITS;
xdim[2] = IMGMAX + tdim - bdim;
xcmimage[i] = (xdim[2] << IMG2BITS) | (xdim[1] << IMGBITS) | xdim[0];
}
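// worked example of the packing above: if an atom's true x image count is one
// box higher than its body's (tdim = bdim + 1), then xdim[0] = IMGMAX + 1 and
// the y,z fields stay at IMGMAX, so domain->unmap(x,xcmimage) shifts that atom
// by +1 box in x relative to the in-box xcm, keeping the unwrapped body compact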
}
/* ----------------------------------------------------------------------
count # of DOF removed by rigid bodies for atoms in igroup
return total count of DOF
------------------------------------------------------------------------- */
int FixRigidSmall::dof(int tgroup)
{
int i,j;
// cannot count DOF correctly unless setup_bodies_static() has been called
if (!setupflag) {
if (comm->me == 0)
error->warning(FLERR,"Cannot count rigid body degrees-of-freedom "
"before bodies are fully initialized");
return 0;
}
int tgroupbit = group->bitmask[tgroup];
// counts = 3 values per rigid body I own
// 0 = # of point particles in rigid body and in temperature group
// 1 = # of finite-size particles in rigid body and in temperature group
// 2 = # of particles in rigid body, disregarding temperature group
memory->create(counts,nlocal_body+nghost_body,3,"rigid/small:counts");
for (int i = 0; i < nlocal_body+nghost_body; i++)
counts[i][0] = counts[i][1] = counts[i][2] = 0;
// tally counts from my owned atoms
// 0 = # of point particles in rigid body and in temperature group
// 1 = # of finite-size particles in rigid body and in temperature group
// 2 = # of particles in rigid body, disregarding temperature group
int *mask = atom->mask;
int nlocal = atom->nlocal;
for (i = 0; i < nlocal; i++) {
if (atom2body[i] < 0) continue;
j = atom2body[i];
counts[j][2]++;
if (mask[i] & tgroupbit) {
if (extended && eflags[i]) counts[j][1]++;
else counts[j][0]++;
}
}
commflag = DOF;
comm->reverse_comm_fix(this,3);
// nall = count0 = # of point particles in each rigid body
// mall = count1 = # of finite-size particles in each rigid body
// warn if nall+mall != nrigid for any body included in temperature group
int flag = 0;
for (int ibody = 0; ibody < nlocal_body; ibody++) {
if (counts[ibody][0]+counts[ibody][1] > 0 &&
counts[ibody][0]+counts[ibody][1] != counts[ibody][2]) flag = 1;
}
int flagall;
MPI_Allreduce(&flag,&flagall,1,MPI_INT,MPI_MAX,world);
if (flagall && me == 0)
error->warning(FLERR,"Computing temperature of portions of rigid bodies");
// remove appropriate DOFs for each rigid body wholly in temperature group
// N = # of point particles in body
// M = # of finite-size particles in body
// 3d body has 3N + 6M dof to start with
// 2d body has 2N + 3M dof to start with
// 3d point-particle body with all non-zero I should have 6 dof, remove 3N-6
// 3d point-particle body (linear) with a 0 I should have 5 dof, remove 3N-5
// 2d point-particle body should have 3 dof, remove 2N-3
// 3d body with any finite-size M should have 6 dof, remove (3N+6M) - 6
// 2d body with any finite-size M should have 3 dof, remove (2N+3M) - 3
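// worked example of the bookkeeping below: a 3d body of 4 point particles
// (all inertia components non-zero) that lies fully inside the temperature
// group starts with 3*4 = 12 DOF and keeps 6, so n += 12 - 6 = 6 removed;
// if one principal inertia component is zero (linear body) it keeps only 5,
// and the extra n++ below brings the removal to 12 - 5 = 7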
double *inertia;
int n = 0;
nlinear = 0;
if (domain->dimension == 3) {
for (int ibody = 0; ibody < nlocal_body; ibody++) {
if (counts[ibody][0]+counts[ibody][1] == counts[ibody][2]) {
n += 3*counts[ibody][0] + 6*counts[ibody][1] - 6;
inertia = body[ibody].inertia;
if (inertia[0] == 0.0 || inertia[1] == 0.0 || inertia[2] == 0.0) {
n++;
nlinear++;
}
}
}
} else if (domain->dimension == 2) {
for (int ibody = 0; ibody < nlocal_body; ibody++)
if (counts[ibody][0]+counts[ibody][1] == counts[ibody][2])
n += 2*counts[ibody][0] + 3*counts[ibody][1] - 3;
}
memory->destroy(counts);
int nall;
MPI_Allreduce(&n,&nall,1,MPI_INT,MPI_SUM,world);
return nall;
}
/* ----------------------------------------------------------------------
adjust xcm of each rigid body due to box deformation
called by various fixes that change box size/shape
flag = 0/1 means map from box to lamda coords or vice versa
------------------------------------------------------------------------- */
void FixRigidSmall::deform(int flag)
{
if (flag == 0)
for (int ibody = 0; ibody < nlocal_body; ibody++)
domain->x2lamda(body[ibody].xcm,body[ibody].xcm);
else
for (int ibody = 0; ibody < nlocal_body; ibody++)
domain->lamda2x(body[ibody].xcm,body[ibody].xcm);
}
/* ----------------------------------------------------------------------
set space-frame coords and velocity of each atom in each rigid body
set orientation and rotation of extended particles
x = Q displace + Xcm, mapped back to periodic box
v = Vcm + (W cross (x - Xcm))
------------------------------------------------------------------------- */
void FixRigidSmall::set_xv()
{
int xbox,ybox,zbox;
double x0,x1,x2,v0,v1,v2,fc0,fc1,fc2,massone;
double ione[3],exone[3],eyone[3],ezone[3],vr[6],p[3][3];
double xprd = domain->xprd;
double yprd = domain->yprd;
double zprd = domain->zprd;
double xy = domain->xy;
double xz = domain->xz;
double yz = domain->yz;
double **x = atom->x;
double **v = atom->v;
double **f = atom->f;
double *rmass = atom->rmass;
double *mass = atom->mass;
int *type = atom->type;
int nlocal = atom->nlocal;
// set x and v of each atom
for (int i = 0; i < nlocal; i++) {
if (atom2body[i] < 0) continue;
Body *b = &body[atom2body[i]];
xbox = (xcmimage[i] & IMGMASK) - IMGMAX;
ybox = (xcmimage[i] >> IMGBITS & IMGMASK) - IMGMAX;
zbox = (xcmimage[i] >> IMG2BITS) - IMGMAX;
// save old positions and velocities for virial
if (evflag) {
if (triclinic == 0) {
x0 = x[i][0] + xbox*xprd;
x1 = x[i][1] + ybox*yprd;
x2 = x[i][2] + zbox*zprd;
} else {
x0 = x[i][0] + xbox*xprd + ybox*xy + zbox*xz;
x1 = x[i][1] + ybox*yprd + zbox*yz;
x2 = x[i][2] + zbox*zprd;
}
v0 = v[i][0];
v1 = v[i][1];
v2 = v[i][2];
}
// x = displacement from center-of-mass, based on body orientation
// v = vcm + omega around center-of-mass
MathExtra::matvec(b->ex_space,b->ey_space,b->ez_space,displace[i],x[i]);
v[i][0] = b->omega[1]*x[i][2] - b->omega[2]*x[i][1] + b->vcm[0];
v[i][1] = b->omega[2]*x[i][0] - b->omega[0]*x[i][2] + b->vcm[1];
v[i][2] = b->omega[0]*x[i][1] - b->omega[1]*x[i][0] + b->vcm[2];
// add center of mass to displacement
// map back into periodic box via xbox,ybox,zbox
// for triclinic, add in box tilt factors as well
if (triclinic == 0) {
x[i][0] += b->xcm[0] - xbox*xprd;
x[i][1] += b->xcm[1] - ybox*yprd;
x[i][2] += b->xcm[2] - zbox*zprd;
} else {
x[i][0] += b->xcm[0] - xbox*xprd - ybox*xy - zbox*xz;
x[i][1] += b->xcm[1] - ybox*yprd - zbox*yz;
x[i][2] += b->xcm[2] - zbox*zprd;
}
// virial = unwrapped coords dotted into body constraint force
// body constraint force = implied force due to v change minus f external
// assume f does not include forces internal to body
// 1/2 factor b/c final_integrate contributes other half
// assume per-atom contribution is due to constraint force on that atom
if (evflag) {
if (rmass) massone = rmass[i];
else massone = mass[type[i]];
fc0 = massone*(v[i][0] - v0)/dtf - f[i][0];
fc1 = massone*(v[i][1] - v1)/dtf - f[i][1];
fc2 = massone*(v[i][2] - v2)/dtf - f[i][2];
vr[0] = 0.5*x0*fc0;
vr[1] = 0.5*x1*fc1;
vr[2] = 0.5*x2*fc2;
vr[3] = 0.5*x0*fc1;
vr[4] = 0.5*x0*fc2;
vr[5] = 0.5*x1*fc2;
v_tally(1,&i,1.0,vr);
}
}
// set orientation, omega, angmom of each extended particle
if (extended) {
double theta_body,theta;
double *shape,*quatatom,*inertiaatom;
AtomVecEllipsoid::Bonus *ebonus;
if (avec_ellipsoid) ebonus = avec_ellipsoid->bonus;
AtomVecLine::Bonus *lbonus;
if (avec_line) lbonus = avec_line->bonus;
AtomVecTri::Bonus *tbonus;
if (avec_tri) tbonus = avec_tri->bonus;
double **omega = atom->omega;
double **angmom = atom->angmom;
double **mu = atom->mu;
int *ellipsoid = atom->ellipsoid;
int *line = atom->line;
int *tri = atom->tri;
for (int i = 0; i < nlocal; i++) {
if (atom2body[i] < 0) continue;
Body *b = &body[atom2body[i]];
if (eflags[i] & SPHERE) {
omega[i][0] = b->omega[0];
omega[i][1] = b->omega[1];
omega[i][2] = b->omega[2];
} else if (eflags[i] & ELLIPSOID) {
shape = ebonus[ellipsoid[i]].shape;
quatatom = ebonus[ellipsoid[i]].quat;
MathExtra::quatquat(b->quat,orient[i],quatatom);
MathExtra::qnormalize(quatatom);
ione[0] = EINERTIA*rmass[i] * (shape[1]*shape[1] + shape[2]*shape[2]);
ione[1] = EINERTIA*rmass[i] * (shape[0]*shape[0] + shape[2]*shape[2]);
ione[2] = EINERTIA*rmass[i] * (shape[0]*shape[0] + shape[1]*shape[1]);
MathExtra::q_to_exyz(quatatom,exone,eyone,ezone);
MathExtra::omega_to_angmom(b->omega,exone,eyone,ezone,ione,angmom[i]);
} else if (eflags[i] & LINE) {
if (b->quat[3] >= 0.0) theta_body = 2.0*acos(b->quat[0]);
else theta_body = -2.0*acos(b->quat[0]);
theta = orient[i][0] + theta_body;
while (theta <= MINUSPI) theta += TWOPI;
while (theta > MY_PI) theta -= TWOPI;
lbonus[line[i]].theta = theta;
omega[i][0] = b->omega[0];
omega[i][1] = b->omega[1];
omega[i][2] = b->omega[2];
} else if (eflags[i] & TRIANGLE) {
inertiaatom = tbonus[tri[i]].inertia;
quatatom = tbonus[tri[i]].quat;
MathExtra::quatquat(b->quat,orient[i],quatatom);
MathExtra::qnormalize(quatatom);
MathExtra::q_to_exyz(quatatom,exone,eyone,ezone);
MathExtra::omega_to_angmom(b->omega,exone,eyone,ezone,
inertiaatom,angmom[i]);
}
if (eflags[i] & DIPOLE) {
MathExtra::quat_to_mat(b->quat,p);
MathExtra::matvec(p,dorient[i],mu[i]);
MathExtra::snormalize3(mu[i][3],mu[i],mu[i]);
}
}
}
}
/* ----------------------------------------------------------------------
set space-frame velocity of each atom in a rigid body
set omega and angmom of extended particles
v = Vcm + (W cross (x - Xcm))
------------------------------------------------------------------------- */
void FixRigidSmall::set_v()
{
int xbox,ybox,zbox;
double x0,x1,x2,v0,v1,v2,fc0,fc1,fc2,massone;
double ione[3],exone[3],eyone[3],ezone[3],delta[3],vr[6];
double xprd = domain->xprd;
double yprd = domain->yprd;
double zprd = domain->zprd;
double xy = domain->xy;
double xz = domain->xz;
double yz = domain->yz;
double **x = atom->x;
double **v = atom->v;
double **f = atom->f;
double *rmass = atom->rmass;
double *mass = atom->mass;
int *type = atom->type;
int nlocal = atom->nlocal;
// set v of each atom
for (int i = 0; i < nlocal; i++) {
if (atom2body[i] < 0) continue;
Body *b = &body[atom2body[i]];
MathExtra::matvec(b->ex_space,b->ey_space,b->ez_space,displace[i],delta);
// save old velocities for virial
if (evflag) {
v0 = v[i][0];
v1 = v[i][1];
v2 = v[i][2];
}
v[i][0] = b->omega[1]*delta[2] - b->omega[2]*delta[1] + b->vcm[0];
v[i][1] = b->omega[2]*delta[0] - b->omega[0]*delta[2] + b->vcm[1];
v[i][2] = b->omega[0]*delta[1] - b->omega[1]*delta[0] + b->vcm[2];
// virial = unwrapped coords dotted into body constraint force
// body constraint force = implied force due to v change minus f external
// assume f does not include forces internal to body
// 1/2 factor b/c initial_integrate contributes other half
// assume per-atom contribution is due to constraint force on that atom
if (evflag) {
if (rmass) massone = rmass[i];
else massone = mass[type[i]];
fc0 = massone*(v[i][0] - v0)/dtf - f[i][0];
fc1 = massone*(v[i][1] - v1)/dtf - f[i][1];
fc2 = massone*(v[i][2] - v2)/dtf - f[i][2];
xbox = (xcmimage[i] & IMGMASK) - IMGMAX;
ybox = (xcmimage[i] >> IMGBITS & IMGMASK) - IMGMAX;
zbox = (xcmimage[i] >> IMG2BITS) - IMGMAX;
if (triclinic == 0) {
x0 = x[i][0] + xbox*xprd;
x1 = x[i][1] + ybox*yprd;
x2 = x[i][2] + zbox*zprd;
} else {
x0 = x[i][0] + xbox*xprd + ybox*xy + zbox*xz;
x1 = x[i][1] + ybox*yprd + zbox*yz;
x2 = x[i][2] + zbox*zprd;
}
vr[0] = 0.5*x0*fc0;
vr[1] = 0.5*x1*fc1;
vr[2] = 0.5*x2*fc2;
vr[3] = 0.5*x0*fc1;
vr[4] = 0.5*x0*fc2;
vr[5] = 0.5*x1*fc2;
v_tally(1,&i,1.0,vr);
}
}
// set omega, angmom of each extended particle
if (extended) {
double *shape,*quatatom,*inertiaatom;
AtomVecEllipsoid::Bonus *ebonus;
if (avec_ellipsoid) ebonus = avec_ellipsoid->bonus;
AtomVecTri::Bonus *tbonus;
if (avec_tri) tbonus = avec_tri->bonus;
double **omega = atom->omega;
double **angmom = atom->angmom;
int *ellipsoid = atom->ellipsoid;
int *tri = atom->tri;
for (int i = 0; i < nlocal; i++) {
if (atom2body[i] < 0) continue;
Body *b = &body[atom2body[i]];
if (eflags[i] & SPHERE) {
omega[i][0] = b->omega[0];
omega[i][1] = b->omega[1];
omega[i][2] = b->omega[2];
} else if (eflags[i] & ELLIPSOID) {
shape = ebonus[ellipsoid[i]].shape;
quatatom = ebonus[ellipsoid[i]].quat;
ione[0] = EINERTIA*rmass[i] * (shape[1]*shape[1] + shape[2]*shape[2]);
ione[1] = EINERTIA*rmass[i] * (shape[0]*shape[0] + shape[2]*shape[2]);
ione[2] = EINERTIA*rmass[i] * (shape[0]*shape[0] + shape[1]*shape[1]);
MathExtra::q_to_exyz(quatatom,exone,eyone,ezone);
MathExtra::omega_to_angmom(b->omega,exone,eyone,ezone,ione,
angmom[i]);
} else if (eflags[i] & LINE) {
omega[i][0] = b->omega[0];
omega[i][1] = b->omega[1];
omega[i][2] = b->omega[2];
} else if (eflags[i] & TRIANGLE) {
inertiaatom = tbonus[tri[i]].inertia;
quatatom = tbonus[tri[i]].quat;
MathExtra::q_to_exyz(quatatom,exone,eyone,ezone);
MathExtra::omega_to_angmom(b->omega,exone,eyone,ezone,
inertiaatom,angmom[i]);
}
}
}
}
/* ----------------------------------------------------------------------
one-time identification of which atoms are in which rigid bodies
set bodytag for all owned atoms
------------------------------------------------------------------------- */
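// overview of the flow below: three comm->ring() passes are used
//   (1) ring_bbox() accumulates a bounding box for each body my atoms belong to
//   (2) ring_nearest() picks the atom closest to each bbox center (smaller ID
//       if tied) as the body owner, stored in bodytag
//   (3) ring_farthest() finds the max distance from each owner atom to any atom
//       of its body, which becomes maxextent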
void FixRigidSmall::create_bodies()
{
int i,m,n;
double unwrap[3];
// error check on image flags of atoms in rigid bodies
imageint *image = atom->image;
int *mask = atom->mask;
int nlocal = atom->nlocal;
int *periodicity = domain->periodicity;
int xbox,ybox,zbox;
int flag = 0;
for (i = 0; i < nlocal; i++) {
if (!(mask[i] & groupbit)) continue;
xbox = (image[i] & IMGMASK) - IMGMAX;
ybox = (image[i] >> IMGBITS & IMGMASK) - IMGMAX;
zbox = (image[i] >> IMG2BITS) - IMGMAX;
if ((xbox && !periodicity[0]) || (ybox && !periodicity[1]) ||
(zbox && !periodicity[2])) flag = 1;
}
int flagall;
MPI_Allreduce(&flag,&flagall,1,MPI_INT,MPI_SUM,world);
if (flagall) error->all(FLERR,"Fix rigid/small atom has non-zero image flag "
"in a non-periodic dimension");
// allocate buffer for passing messages around ring of procs
// percount = max number of values to put in buffer for each of ncount
int ncount = 0;
for (i = 0; i < nlocal; i++)
if (mask[i] & groupbit) ncount++;
int percount = 5;
double *buf;
memory->create(buf,ncount*percount,"rigid/small:buf");
// create map hash for storing unique molecule IDs of my atoms
// key = molecule ID
// value = index into per-body data structure
// n = # of entries in hash
hash = new std::map<tagint,int>();
hash->clear();
// setup hash
// key = body ID
// value = index into N-length data structure
// n = count of unique bodies my atoms are part of
tagint *molecule = atom->molecule;
n = 0;
for (i = 0; i < nlocal; i++) {
if (!(mask[i] & groupbit)) continue;
if (hash->find(molecule[i]) == hash->end()) (*hash)[molecule[i]] = n++;
}
// bbox = bounding box of each rigid body my atoms are part of
memory->create(bbox,n,6,"rigid/small:bbox");
for (i = 0; i < n; i++) {
bbox[i][0] = bbox[i][2] = bbox[i][4] = BIG;
bbox[i][1] = bbox[i][3] = bbox[i][5] = -BIG;
}
// pack my atoms into buffer as molecule ID, unwrapped coords
double **x = atom->x;
m = 0;
for (i = 0; i < nlocal; i++) {
if (!(mask[i] & groupbit)) continue;
domain->unmap(x[i],image[i],unwrap);
buf[m++] = molecule[i];
buf[m++] = unwrap[0];
buf[m++] = unwrap[1];
buf[m++] = unwrap[2];
}
// pass buffer around ring of procs
// func = update bbox with atom coords from every proc
// when done, have full bbox for every rigid body my atoms are part of
frsptr = this;
comm->ring(m,sizeof(double),buf,1,ring_bbox,NULL);
// check if any bbox is size 0.0, meaning rigid body is a single particle
flag = 0;
for (i = 0; i < n; i++)
if (bbox[i][0] == bbox[i][1] && bbox[i][2] == bbox[i][3] &&
bbox[i][4] == bbox[i][5]) flag = 1;
MPI_Allreduce(&flag,&flagall,1,MPI_INT,MPI_SUM,world);
if (flagall)
error->all(FLERR,"One or more rigid bodies are a single particle");
// ctr = center pt of each rigid body my atoms are part of
memory->create(ctr,n,3,"rigid/small:ctr");
for (i = 0; i < n; i++) {
ctr[i][0] = 0.5 * (bbox[i][0] + bbox[i][1]);
ctr[i][1] = 0.5 * (bbox[i][2] + bbox[i][3]);
ctr[i][2] = 0.5 * (bbox[i][4] + bbox[i][5]);
}
// idclose = ID of atom in body closest to center pt (smaller ID if tied)
// rsqclose = distance squared from idclose to center pt
memory->create(idclose,n,"rigid/small:idclose");
memory->create(rsqclose,n,"rigid/small:rsqclose");
for (i = 0; i < n; i++) rsqclose[i] = BIG;
// pack my atoms into buffer as molecule ID, atom ID, unwrapped coords
tagint *tag = atom->tag;
m = 0;
for (i = 0; i < nlocal; i++) {
if (!(mask[i] & groupbit)) continue;
domain->unmap(x[i],image[i],unwrap);
buf[m++] = molecule[i];
buf[m++] = ubuf(tag[i]).d;
buf[m++] = unwrap[0];
buf[m++] = unwrap[1];
buf[m++] = unwrap[2];
}
// pass buffer around ring of procs
// func = update idclose,rsqclose with atom IDs from every proc
// when done, have idclose for every rigid body my atoms are part of
frsptr = this;
comm->ring(m,sizeof(double),buf,2,ring_nearest,NULL);
// set bodytag of all owned atoms, based on idclose
// find max value of rsqclose across all procs
double rsqmax = 0.0;
for (i = 0; i < nlocal; i++) {
bodytag[i] = 0;
if (!(mask[i] & groupbit)) continue;
m = hash->find(molecule[i])->second;
bodytag[i] = idclose[m];
rsqmax = MAX(rsqmax,rsqclose[m]);
}
// pack my atoms into buffer as bodytag of owning atom, unwrapped coords
m = 0;
for (i = 0; i < nlocal; i++) {
if (!(mask[i] & groupbit)) continue;
domain->unmap(x[i],image[i],unwrap);
buf[m++] = ubuf(bodytag[i]).d;
buf[m++] = unwrap[0];
buf[m++] = unwrap[1];
buf[m++] = unwrap[2];
}
// pass buffer around ring of procs
// func = update rsqfar for atoms belonging to bodies I own
// when done, have rsqfar for all atoms in bodies I own
rsqfar = 0.0;
frsptr = this;
comm->ring(m,sizeof(double),buf,3,ring_farthest,NULL);
// find maxextent of rsqfar across all procs
// if defined, include molecule->maxextent
MPI_Allreduce(&rsqfar,&maxextent,1,MPI_DOUBLE,MPI_MAX,world);
maxextent = sqrt(maxextent);
if (onemols) {
for (int i = 0; i < nmol; i++)
maxextent = MAX(maxextent,onemols[i]->maxextent);
}
// clean up
delete hash;
memory->destroy(buf);
memory->destroy(bbox);
memory->destroy(ctr);
memory->destroy(idclose);
memory->destroy(rsqclose);
}
/* ----------------------------------------------------------------------
process rigid body atoms from another proc
update bounding box for rigid bodies my atoms are part of
------------------------------------------------------------------------- */
void FixRigidSmall::ring_bbox(int n, char *cbuf)
{
std::map<tagint,int> *hash = frsptr->hash;
double **bbox = frsptr->bbox;
double *buf = (double *) cbuf;
int ndatums = n/4;
int j,imol;
double *x;
int m = 0;
for (int i = 0; i < ndatums; i++, m += 4) {
imol = static_cast<int> (buf[m]);
if (hash->find(imol) != hash->end()) {
j = hash->find(imol)->second;
x = &buf[m+1];
bbox[j][0] = MIN(bbox[j][0],x[0]);
bbox[j][1] = MAX(bbox[j][1],x[0]);
bbox[j][2] = MIN(bbox[j][2],x[1]);
bbox[j][3] = MAX(bbox[j][3],x[1]);
bbox[j][4] = MIN(bbox[j][4],x[2]);
bbox[j][5] = MAX(bbox[j][5],x[2]);
}
}
}
/* ----------------------------------------------------------------------
process rigid body atoms from another proc
update nearest atom to body center for rigid bodies my atoms are part of
------------------------------------------------------------------------- */
void FixRigidSmall::ring_nearest(int n, char *cbuf)
{
std::map<tagint,int> *hash = frsptr->hash;
double **ctr = frsptr->ctr;
tagint *idclose = frsptr->idclose;
double *rsqclose = frsptr->rsqclose;
double *buf = (double *) cbuf;
int ndatums = n/5;
int j,imol;
tagint tag;
double delx,dely,delz,rsq;
double *x;
int m = 0;
for (int i = 0; i < ndatums; i++, m += 5) {
imol = static_cast<int> (buf[m]);
if (hash->find(imol) != hash->end()) {
j = hash->find(imol)->second;
tag = (tagint) ubuf(buf[m+1]).i;
x = &buf[m+2];
delx = x[0] - ctr[j][0];
dely = x[1] - ctr[j][1];
delz = x[2] - ctr[j][2];
rsq = delx*delx + dely*dely + delz*delz;
if (rsq <= rsqclose[j]) {
if (rsq == rsqclose[j] && tag > idclose[j]) continue;
idclose[j] = tag;
rsqclose[j] = rsq;
}
}
}
}
/* ----------------------------------------------------------------------
process rigid body atoms from another proc
update rsqfar = max squared distance from a body's owning atom to any atom in that body
------------------------------------------------------------------------- */
void FixRigidSmall::ring_farthest(int n, char *cbuf)
{
double **x = frsptr->atom->x;
imageint *image = frsptr->atom->image;
int nlocal = frsptr->atom->nlocal;
double *buf = (double *) cbuf;
int ndatums = n/4;
int iowner;
tagint tag;
double delx,dely,delz,rsq;
double *xx;
double unwrap[3];
int m = 0;
for (int i = 0; i < ndatums; i++, m += 4) {
tag = (tagint) ubuf(buf[m]).i;
iowner = frsptr->atom->map(tag);
if (iowner < 0 || iowner >= nlocal) continue;
frsptr->domain->unmap(x[iowner],image[iowner],unwrap);
xx = &buf[m+1];
delx = xx[0] - unwrap[0];
dely = xx[1] - unwrap[1];
delz = xx[2] - unwrap[2];
rsq = delx*delx + dely*dely + delz*delz;
frsptr->rsqfar = MAX(frsptr->rsqfar,rsq);
}
}
/* ----------------------------------------------------------------------
one-time initialization of rigid body attributes
sets extended flags, masstotal, center-of-mass
sets Cartesian and diagonalized inertia tensor
sets body image flags
may read some properties from infile
------------------------------------------------------------------------- */
void FixRigidSmall::setup_bodies_static()
{
int i,ibody;
// extended = 1 if any particle in a rigid body is finite size
// or has a dipole moment
extended = orientflag = dorientflag = 0;
AtomVecEllipsoid::Bonus *ebonus;
if (avec_ellipsoid) ebonus = avec_ellipsoid->bonus;
AtomVecLine::Bonus *lbonus;
if (avec_line) lbonus = avec_line->bonus;
AtomVecTri::Bonus *tbonus;
if (avec_tri) tbonus = avec_tri->bonus;
double **mu = atom->mu;
double *radius = atom->radius;
double *rmass = atom->rmass;
double *mass = atom->mass;
int *ellipsoid = atom->ellipsoid;
int *line = atom->line;
int *tri = atom->tri;
int *type = atom->type;
int nlocal = atom->nlocal;
if (atom->radius_flag || atom->ellipsoid_flag || atom->line_flag ||
atom->tri_flag || atom->mu_flag) {
int flag = 0;
for (i = 0; i < nlocal; i++) {
if (bodytag[i] == 0) continue;
if (radius && radius[i] > 0.0) flag = 1;
if (ellipsoid && ellipsoid[i] >= 0) flag = 1;
if (line && line[i] >= 0) flag = 1;
if (tri && tri[i] >= 0) flag = 1;
if (mu && mu[i][3] > 0.0) flag = 1;
}
MPI_Allreduce(&flag,&extended,1,MPI_INT,MPI_MAX,world);
}
// extended = 1 if using molecule template with finite-size particles
// require all molecules in template to have consistent radiusflag
if (onemols) {
int radiusflag = onemols[0]->radiusflag;
for (i = 1; i < nmol; i++) {
if (onemols[i]->radiusflag != radiusflag)
error->all(FLERR,"Inconsistent use of finite-size particles "
"by molecule template molecules");
}
if (radiusflag) extended = 1;
}
// grow extended arrays and set extended flags for each particle
// orientflag = 4 if any particle stores ellipsoid or tri orientation
// orientflag = 1 if any particle stores line orientation
// dorientflag = 1 if any particle stores dipole orientation
if (extended) {
if (atom->ellipsoid_flag) orientflag = 4;
if (atom->line_flag) orientflag = 1;
if (atom->tri_flag) orientflag = 4;
if (atom->mu_flag) dorientflag = 1;
grow_arrays(atom->nmax);
for (i = 0; i < nlocal; i++) {
eflags[i] = 0;
if (bodytag[i] == 0) continue;
// set to POINT or SPHERE or ELLIPSOID or LINE
if (radius && radius[i] > 0.0) {
eflags[i] |= SPHERE;
eflags[i] |= OMEGA;
eflags[i] |= TORQUE;
} else if (ellipsoid && ellipsoid[i] >= 0) {
eflags[i] |= ELLIPSOID;
eflags[i] |= ANGMOM;
eflags[i] |= TORQUE;
} else if (line && line[i] >= 0) {
eflags[i] |= LINE;
eflags[i] |= OMEGA;
eflags[i] |= TORQUE;
} else if (tri && tri[i] >= 0) {
eflags[i] |= TRIANGLE;
eflags[i] |= ANGMOM;
eflags[i] |= TORQUE;
} else eflags[i] |= POINT;
// set DIPOLE if atom->mu and mu[3] > 0.0
if (atom->mu_flag && mu[i][3] > 0.0)
eflags[i] |= DIPOLE;
}
}
// set body xcmimage flags = true image flags
imageint *image = atom->image;
for (i = 0; i < nlocal; i++)
if (bodytag[i] >= 0) xcmimage[i] = image[i];
else xcmimage[i] = 0;
// acquire ghost bodies via forward comm
// set atom2body for ghost atoms via forward comm
// set atom2body for other owned atoms via reset_atom2body()
nghost_body = 0;
commflag = FULL_BODY;
comm->forward_comm_fix(this);
reset_atom2body();
// compute mass & center-of-mass of each rigid body
double **x = atom->x;
double *xcm;
for (ibody = 0; ibody < nlocal_body+nghost_body; ibody++) {
xcm = body[ibody].xcm;
xcm[0] = xcm[1] = xcm[2] = 0.0;
body[ibody].mass = 0.0;
}
double unwrap[3];
double massone;
for (i = 0; i < nlocal; i++) {
if (atom2body[i] < 0) continue;
Body *b = &body[atom2body[i]];
if (rmass) massone = rmass[i];
else massone = mass[type[i]];
domain->unmap(x[i],xcmimage[i],unwrap);
xcm = b->xcm;
xcm[0] += unwrap[0] * massone;
xcm[1] += unwrap[1] * massone;
xcm[2] += unwrap[2] * massone;
b->mass += massone;
}
// reverse communicate xcm, mass of all bodies
commflag = XCM_MASS;
comm->reverse_comm_fix(this,4);
for (ibody = 0; ibody < nlocal_body; ibody++) {
xcm = body[ibody].xcm;
xcm[0] /= body[ibody].mass;
xcm[1] /= body[ibody].mass;
xcm[2] /= body[ibody].mass;
}
// set vcm, angmom = 0.0 in case infile is used
// and doesn't overwrite every body's values
// since setup_bodies_dynamic() will not be called
double *vcm,*angmom;
for (ibody = 0; ibody < nlocal_body; ibody++) {
vcm = body[ibody].vcm;
vcm[0] = vcm[1] = vcm[2] = 0.0;
angmom = body[ibody].angmom;
angmom[0] = angmom[1] = angmom[2] = 0.0;
}
- // overwrite masstotal and center-of-mass with file values
+ // set rigid body image flags to default values
+
+ for (ibody = 0; ibody < nlocal_body; ibody++)
+ body[ibody].image = ((imageint) IMGMAX << IMG2BITS) |
+ ((imageint) IMGMAX << IMGBITS) | IMGMAX;
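+ // an imageint packs the periodic image count of each dimension into a
+ //   bit field of width IMGBITS (x in the low bits, y shifted by IMGBITS,
+ //   z shifted by IMG2BITS), each offset by IMGMAX so that image 0 is
+ //   stored as IMGMAX; the expression above is thus the (0,0,0) image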
+
+ // overwrite masstotal, center-of-mass, image flags with file values
// inbody[i] = 0/1 if Ith rigid body is initialized by file
int *inbody;
if (infile) {
memory->create(inbody,nlocal_body,"rigid/small:inbody");
for (ibody = 0; ibody < nlocal_body; ibody++) inbody[ibody] = 0;
readfile(0,NULL,inbody);
}
- // set rigid body image flags to default values
- // then remap the xcm of each body back into simulation box
+ // remap the xcm of each body back into simulation box
// and reset body and atom xcmimage flags via pre_neighbor()
- for (ibody = 0; ibody < nlocal_body; ibody++)
- body[ibody].image = ((imageint) IMGMAX << IMG2BITS) |
- ((imageint) IMGMAX << IMGBITS) | IMGMAX;
-
pre_neighbor();
// compute 6 moments of inertia of each body in Cartesian reference frame
// dx,dy,dz = coords relative to center-of-mass
// symmetric 3x3 inertia tensor stored in Voigt notation as 6-vector
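// Voigt ordering used here: elements 0-5 = Ixx, Iyy, Izz, Iyz, Ixz, Ixy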
memory->create(itensor,nlocal_body+nghost_body,6,"rigid/small:itensor");
for (ibody = 0; ibody < nlocal_body+nghost_body; ibody++)
for (i = 0; i < 6; i++) itensor[ibody][i] = 0.0;
double dx,dy,dz;
double *inertia;
for (i = 0; i < nlocal; i++) {
if (atom2body[i] < 0) continue;
Body *b = &body[atom2body[i]];
domain->unmap(x[i],xcmimage[i],unwrap);
xcm = b->xcm;
dx = unwrap[0] - xcm[0];
dy = unwrap[1] - xcm[1];
dz = unwrap[2] - xcm[2];
if (rmass) massone = rmass[i];
else massone = mass[type[i]];
inertia = itensor[atom2body[i]];
inertia[0] += massone * (dy*dy + dz*dz);
inertia[1] += massone * (dx*dx + dz*dz);
inertia[2] += massone * (dx*dx + dy*dy);
inertia[3] -= massone * dy*dz;
inertia[4] -= massone * dx*dz;
inertia[5] -= massone * dx*dy;
}
// extended particles may contribute extra terms to moments of inertia
if (extended) {
double ivec[6];
double *shape,*quatatom,*inertiaatom;
double length,theta;
for (i = 0; i < nlocal; i++) {
if (atom2body[i] < 0) continue;
inertia = itensor[atom2body[i]];
if (rmass) massone = rmass[i];
else massone = mass[type[i]];
if (eflags[i] & SPHERE) {
inertia[0] += SINERTIA*massone * radius[i]*radius[i];
inertia[1] += SINERTIA*massone * radius[i]*radius[i];
inertia[2] += SINERTIA*massone * radius[i]*radius[i];
} else if (eflags[i] & ELLIPSOID) {
shape = ebonus[ellipsoid[i]].shape;
quatatom = ebonus[ellipsoid[i]].quat;
MathExtra::inertia_ellipsoid(shape,quatatom,massone,ivec);
inertia[0] += ivec[0];
inertia[1] += ivec[1];
inertia[2] += ivec[2];
inertia[3] += ivec[3];
inertia[4] += ivec[4];
inertia[5] += ivec[5];
} else if (eflags[i] & LINE) {
length = lbonus[line[i]].length;
theta = lbonus[line[i]].theta;
MathExtra::inertia_line(length,theta,massone,ivec);
inertia[0] += ivec[0];
inertia[1] += ivec[1];
inertia[2] += ivec[2];
inertia[3] += ivec[3];
inertia[4] += ivec[4];
inertia[5] += ivec[5];
} else if (eflags[i] & TRIANGLE) {
inertiaatom = tbonus[tri[i]].inertia;
quatatom = tbonus[tri[i]].quat;
MathExtra::inertia_triangle(inertiaatom,quatatom,massone,ivec);
inertia[0] += ivec[0];
inertia[1] += ivec[1];
inertia[2] += ivec[2];
inertia[3] += ivec[3];
inertia[4] += ivec[4];
inertia[5] += ivec[5];
}
}
}
// reverse communicate inertia tensor of all bodies
commflag = ITENSOR;
comm->reverse_comm_fix(this,6);
// overwrite Cartesian inertia tensor with file values
if (infile) readfile(1,itensor,inbody);
// diagonalize inertia tensor for each body via Jacobi rotations
// inertia = 3 eigenvalues = principal moments of inertia
// evectors and exyz_space = 3 evectors = principal axes of rigid body
int ierror;
double cross[3];
double tensor[3][3],evectors[3][3];
double *ex,*ey,*ez;
for (ibody = 0; ibody < nlocal_body; ibody++) {
tensor[0][0] = itensor[ibody][0];
tensor[1][1] = itensor[ibody][1];
tensor[2][2] = itensor[ibody][2];
tensor[1][2] = tensor[2][1] = itensor[ibody][3];
tensor[0][2] = tensor[2][0] = itensor[ibody][4];
tensor[0][1] = tensor[1][0] = itensor[ibody][5];
inertia = body[ibody].inertia;
ierror = MathExtra::jacobi(tensor,inertia,evectors);
if (ierror) error->all(FLERR,
"Insufficient Jacobi rotations for rigid body");
ex = body[ibody].ex_space;
ex[0] = evectors[0][0];
ex[1] = evectors[1][0];
ex[2] = evectors[2][0];
ey = body[ibody].ey_space;
ey[0] = evectors[0][1];
ey[1] = evectors[1][1];
ey[2] = evectors[2][1];
ez = body[ibody].ez_space;
ez[0] = evectors[0][2];
ez[1] = evectors[1][2];
ez[2] = evectors[2][2];
// if any principal moment < scaled EPSILON, set to 0.0
double max;
max = MAX(inertia[0],inertia[1]);
max = MAX(max,inertia[2]);
if (inertia[0] < EPSILON*max) inertia[0] = 0.0;
if (inertia[1] < EPSILON*max) inertia[1] = 0.0;
if (inertia[2] < EPSILON*max) inertia[2] = 0.0;
// enforce 3 evectors as a right-handed coordinate system
// flip 3rd vector if needed
MathExtra::cross3(ex,ey,cross);
if (MathExtra::dot3(cross,ez) < 0.0) MathExtra::negate3(ez);
// create initial quaternion
MathExtra::exyz_to_q(ex,ey,ez,body[ibody].quat);
}
// forward communicate updated info of all bodies
commflag = INITIAL;
comm->forward_comm_fix(this,26);
// displace = initial atom coords in basis of principal axes
// set displace = 0.0 for atoms not in any rigid body
// for extended particles, set their orientation wrt to rigid body
double qc[4],delta[3];
double *quatatom;
double theta_body;
for (i = 0; i < nlocal; i++) {
if (atom2body[i] < 0) {
displace[i][0] = displace[i][1] = displace[i][2] = 0.0;
continue;
}
Body *b = &body[atom2body[i]];
domain->unmap(x[i],xcmimage[i],unwrap);
xcm = b->xcm;
delta[0] = unwrap[0] - xcm[0];
delta[1] = unwrap[1] - xcm[1];
delta[2] = unwrap[2] - xcm[2];
MathExtra::transpose_matvec(b->ex_space,b->ey_space,b->ez_space,
delta,displace[i]);
if (extended) {
if (eflags[i] & ELLIPSOID) {
quatatom = ebonus[ellipsoid[i]].quat;
MathExtra::qconjugate(b->quat,qc);
MathExtra::quatquat(qc,quatatom,orient[i]);
MathExtra::qnormalize(orient[i]);
} else if (eflags[i] & LINE) {
if (b->quat[3] >= 0.0) theta_body = 2.0*acos(b->quat[0]);
else theta_body = -2.0*acos(b->quat[0]);
orient[i][0] = lbonus[line[i]].theta - theta_body;
while (orient[i][0] <= MINUSPI) orient[i][0] += TWOPI;
while (orient[i][0] > MY_PI) orient[i][0] -= TWOPI;
if (orientflag == 4) orient[i][1] = orient[i][2] = orient[i][3] = 0.0;
} else if (eflags[i] & TRIANGLE) {
quatatom = tbonus[tri[i]].quat;
MathExtra::qconjugate(b->quat,qc);
MathExtra::quatquat(qc,quatatom,orient[i]);
MathExtra::qnormalize(orient[i]);
} else if (orientflag == 4) {
orient[i][0] = orient[i][1] = orient[i][2] = orient[i][3] = 0.0;
} else if (orientflag == 1)
orient[i][0] = 0.0;
if (eflags[i] & DIPOLE) {
MathExtra::transpose_matvec(b->ex_space,b->ey_space,b->ez_space,
mu[i],dorient[i]);
MathExtra::snormalize3(mu[i][3],dorient[i],dorient[i]);
} else if (dorientflag)
dorient[i][0] = dorient[i][1] = dorient[i][2] = 0.0;
}
}
// test for valid principal moments & axes
// recompute moments of inertia around new axes
// 3 diagonal moments should equal principal moments
// 3 off-diagonal moments should be 0.0
// extended particles may contribute extra terms to moments of inertia
for (ibody = 0; ibody < nlocal_body+nghost_body; ibody++)
for (i = 0; i < 6; i++) itensor[ibody][i] = 0.0;
for (i = 0; i < nlocal; i++) {
if (atom2body[i] < 0) continue;
inertia = itensor[atom2body[i]];
if (rmass) massone = rmass[i];
else massone = mass[type[i]];
inertia[0] += massone *
(displace[i][1]*displace[i][1] + displace[i][2]*displace[i][2]);
inertia[1] += massone *
(displace[i][0]*displace[i][0] + displace[i][2]*displace[i][2]);
inertia[2] += massone *
(displace[i][0]*displace[i][0] + displace[i][1]*displace[i][1]);
inertia[3] -= massone * displace[i][1]*displace[i][2];
inertia[4] -= massone * displace[i][0]*displace[i][2];
inertia[5] -= massone * displace[i][0]*displace[i][1];
}
if (extended) {
double ivec[6];
double *shape,*inertiaatom;
double length;
for (i = 0; i < nlocal; i++) {
if (atom2body[i] < 0) continue;
inertia = itensor[atom2body[i]];
if (rmass) massone = rmass[i];
else massone = mass[type[i]];
if (eflags[i] & SPHERE) {
inertia[0] += SINERTIA*massone * radius[i]*radius[i];
inertia[1] += SINERTIA*massone * radius[i]*radius[i];
inertia[2] += SINERTIA*massone * radius[i]*radius[i];
} else if (eflags[i] & ELLIPSOID) {
shape = ebonus[ellipsoid[i]].shape;
MathExtra::inertia_ellipsoid(shape,orient[i],massone,ivec);
inertia[0] += ivec[0];
inertia[1] += ivec[1];
inertia[2] += ivec[2];
inertia[3] += ivec[3];
inertia[4] += ivec[4];
inertia[5] += ivec[5];
} else if (eflags[i] & LINE) {
length = lbonus[line[i]].length;
MathExtra::inertia_line(length,orient[i][0],massone,ivec);
inertia[0] += ivec[0];
inertia[1] += ivec[1];
inertia[2] += ivec[2];
inertia[3] += ivec[3];
inertia[4] += ivec[4];
inertia[5] += ivec[5];
} else if (eflags[i] & TRIANGLE) {
inertiaatom = tbonus[tri[i]].inertia;
MathExtra::inertia_triangle(inertiaatom,orient[i],massone,ivec);
inertia[0] += ivec[0];
inertia[1] += ivec[1];
inertia[2] += ivec[2];
inertia[3] += ivec[3];
inertia[4] += ivec[4];
inertia[5] += ivec[5];
}
}
}
// reverse communicate inertia tensor of all bodies
commflag = ITENSOR;
comm->reverse_comm_fix(this,6);
// error check that re-computed moments of inertia match diagonalized ones
// do not do test for bodies with params read from infile
double norm;
for (ibody = 0; ibody < nlocal_body; ibody++) {
if (infile && inbody[ibody]) continue;
inertia = body[ibody].inertia;
if (inertia[0] == 0.0) {
if (fabs(itensor[ibody][0]) > TOLERANCE)
error->all(FLERR,"Fix rigid: Bad principal moments");
} else {
if (fabs((itensor[ibody][0]-inertia[0])/inertia[0]) >
TOLERANCE) error->all(FLERR,"Fix rigid: Bad principal moments");
}
if (inertia[1] == 0.0) {
if (fabs(itensor[ibody][1]) > TOLERANCE)
error->all(FLERR,"Fix rigid: Bad principal moments");
} else {
if (fabs((itensor[ibody][1]-inertia[1])/inertia[1]) >
TOLERANCE) error->all(FLERR,"Fix rigid: Bad principal moments");
}
if (inertia[2] == 0.0) {
if (fabs(itensor[ibody][2]) > TOLERANCE)
error->all(FLERR,"Fix rigid: Bad principal moments");
} else {
if (fabs((itensor[ibody][2]-inertia[2])/inertia[2]) >
TOLERANCE) error->all(FLERR,"Fix rigid: Bad principal moments");
}
norm = (inertia[0] + inertia[1] + inertia[2]) / 3.0;
if (fabs(itensor[ibody][3]/norm) > TOLERANCE ||
fabs(itensor[ibody][4]/norm) > TOLERANCE ||
fabs(itensor[ibody][5]/norm) > TOLERANCE)
error->all(FLERR,"Fix rigid: Bad principal moments");
}
// clean up
memory->destroy(itensor);
if (infile) memory->destroy(inbody);
}
/* ----------------------------------------------------------------------
one-time initialization of dynamic rigid body attributes
vcm and angmom, computed explicitly from constituent particles
not done if body properties are read from file, e.g. for overlapping particles
------------------------------------------------------------------------- */
void FixRigidSmall::setup_bodies_dynamic()
{
int i,ibody;
double massone,radone;
// sum vcm, angmom across all rigid bodies
// vcm = velocity of COM
// angmom = angular momentum around COM
double **x = atom->x;
double **v = atom->v;
double *rmass = atom->rmass;
double *mass = atom->mass;
int *type = atom->type;
int nlocal = atom->nlocal;
double *xcm,*vcm,*acm;
double dx,dy,dz;
double unwrap[3];
for (ibody = 0; ibody < nlocal_body+nghost_body; ibody++) {
vcm = body[ibody].vcm;
vcm[0] = vcm[1] = vcm[2] = 0.0;
acm = body[ibody].angmom;
acm[0] = acm[1] = acm[2] = 0.0;
}
for (i = 0; i < nlocal; i++) {
if (atom2body[i] < 0) continue;
Body *b = &body[atom2body[i]];
if (rmass) massone = rmass[i];
else massone = mass[type[i]];
vcm = b->vcm;
vcm[0] += v[i][0] * massone;
vcm[1] += v[i][1] * massone;
vcm[2] += v[i][2] * massone;
domain->unmap(x[i],xcmimage[i],unwrap);
xcm = b->xcm;
dx = unwrap[0] - xcm[0];
dy = unwrap[1] - xcm[1];
dz = unwrap[2] - xcm[2];
acm = b->angmom;
acm[0] += dy * massone*v[i][2] - dz * massone*v[i][1];
acm[1] += dz * massone*v[i][0] - dx * massone*v[i][2];
acm[2] += dx * massone*v[i][1] - dy * massone*v[i][0];
}
// extended particles add their rotation to angmom of body
if (extended) {
AtomVecLine::Bonus *lbonus;
if (avec_line) lbonus = avec_line->bonus;
double **omega = atom->omega;
double **angmom = atom->angmom;
double *radius = atom->radius;
int *line = atom->line;
for (i = 0; i < nlocal; i++) {
if (atom2body[i] < 0) continue;
Body *b = &body[atom2body[i]];
if (eflags[i] & OMEGA) {
if (eflags[i] & SPHERE) {
radone = radius[i];
acm = b->angmom;
acm[0] += SINERTIA*rmass[i] * radone*radone * omega[i][0];
acm[1] += SINERTIA*rmass[i] * radone*radone * omega[i][1];
acm[2] += SINERTIA*rmass[i] * radone*radone * omega[i][2];
} else if (eflags[i] & LINE) {
radone = lbonus[line[i]].length;
b->angmom[2] += LINERTIA*rmass[i] * radone*radone * omega[i][2];
}
}
if (eflags[i] & ANGMOM) {
acm = b->angmom;
acm[0] += angmom[i][0];
acm[1] += angmom[i][1];
acm[2] += angmom[i][2];
}
}
}
// reverse communicate vcm, angmom of all bodies
commflag = VCM_ANGMOM;
comm->reverse_comm_fix(this,6);
// normalize velocity of COM
for (ibody = 0; ibody < nlocal_body; ibody++) {
vcm = body[ibody].vcm;
vcm[0] /= body[ibody].mass;
vcm[1] /= body[ibody].mass;
vcm[2] /= body[ibody].mass;
}
}
/* ----------------------------------------------------------------------
read per rigid body info from user-provided file
which = 0 to read everything except the 6 moments of inertia
which = 1 to read just 6 moments of inertia
flag inbody = 0 for local bodies this proc initializes from file
nlines = # of lines of rigid body info, 0 is OK
one line = rigid-ID mass xcm ycm zcm ixx iyy izz ixy ixz iyz
vxcm vycm vzcm lx ly lz ix iy iz
where rigid-ID = mol-ID for fix rigid/small
------------------------------------------------------------------------- */
void FixRigidSmall::readfile(int which, double **array, int *inbody)
{
- int i,j,m,nchunk,eofflag,nlines;
+ int i,j,m,nchunk,eofflag,nlines,xbox,ybox,zbox;
tagint id;
FILE *fp;
char *eof,*start,*next,*buf;
char line[MAXLINE];
// create local hash with key/value pairs
// key = mol ID of bodies my atoms own
// value = index into local body array
int nlocal = atom->nlocal;
hash = new std::map<tagint,int>();
for (i = 0; i < nlocal; i++)
if (bodyown[i] >= 0) (*hash)[atom->molecule[i]] = bodyown[i];
// open file and read header
if (me == 0) {
fp = fopen(infile,"r");
if (fp == NULL) {
char str[128];
sprintf(str,"Cannot open fix rigid/small infile %s",infile);
error->one(FLERR,str);
}
while (1) {
eof = fgets(line,MAXLINE,fp);
if (eof == NULL)
error->one(FLERR,"Unexpected end of fix rigid/small file");
start = &line[strspn(line," \t\n\v\f\r")];
if (*start != '\0' && *start != '#') break;
}
sscanf(line,"%d",&nlines);
}
MPI_Bcast(&nlines,1,MPI_INT,0,world);
char *buffer = new char[CHUNK*MAXLINE];
char **values = new char*[ATTRIBUTE_PERBODY];
int nread = 0;
while (nread < nlines) {
nchunk = MIN(nlines-nread,CHUNK);
eofflag = comm->read_lines_from_file(fp,nchunk,MAXLINE,buffer);
if (eofflag) error->all(FLERR,"Unexpected end of fix rigid/small file");
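// read_lines_from_file() has proc 0 read the next nchunk lines and
//   broadcast them; every proc then scans the chunk and keeps only the
//   lines for bodies it owns (via the hash lookup below)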
buf = buffer;
next = strchr(buf,'\n');
*next = '\0';
int nwords = atom->count_words(buf);
*next = '\n';
if (nwords != ATTRIBUTE_PERBODY)
error->all(FLERR,"Incorrect rigid body format in fix rigid/small file");
// loop over lines of rigid body attributes
// tokenize the line into values
// id = rigid body ID = mol-ID
// for which = 0, store all but inertia directly in body struct
// for which = 1, store inertia tensor in array, reversing the order of the
//   3 off-diagonal values to match the Voigt storage (yz,xz,xy)
for (int i = 0; i < nchunk; i++) {
next = strchr(buf,'\n');
values[0] = strtok(buf," \t\n\r\f");
for (j = 1; j < nwords; j++)
values[j] = strtok(NULL," \t\n\r\f");
id = ATOTAGINT(values[0]);
if (id <= 0 || id > maxmol)
error->all(FLERR,"Invalid rigid body ID in fix rigid/small file");
if (hash->find(id) == hash->end()) {
buf = next + 1;
continue;
}
m = (*hash)[id];
inbody[m] = 1;
if (which == 0) {
body[m].mass = atof(values[1]);
body[m].xcm[0] = atof(values[2]);
body[m].xcm[1] = atof(values[3]);
body[m].xcm[2] = atof(values[4]);
body[m].vcm[0] = atof(values[11]);
body[m].vcm[1] = atof(values[12]);
body[m].vcm[2] = atof(values[13]);
body[m].angmom[0] = atof(values[14]);
body[m].angmom[1] = atof(values[15]);
body[m].angmom[2] = atof(values[16]);
+ xbox = atoi(values[17]);
+ ybox = atoi(values[18]);
+ zbox = atoi(values[19]);
+ body[m].image = ((imageint) (xbox + IMGMAX) & IMGMASK) |
+ (((imageint) (ybox + IMGMAX) & IMGMASK) << IMGBITS) |
+ (((imageint) (zbox + IMGMAX) & IMGMASK) << IMG2BITS);
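+ // xbox/ybox/zbox are the signed per-dimension image counts from the file;
+ //   each is offset by IMGMAX, masked to IMGBITS bits via IMGMASK, and
+ //   packed into its field of the imageint (x low bits, y shifted by
+ //   IMGBITS, z shifted by IMG2BITS)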
} else {
array[m][0] = atof(values[5]);
array[m][1] = atof(values[6]);
array[m][2] = atof(values[7]);
array[m][3] = atof(values[10]);
array[m][4] = atof(values[9]);
array[m][5] = atof(values[8]);
}
buf = next + 1;
}
nread += nchunk;
}
if (me == 0) fclose(fp);
delete [] buffer;
delete [] values;
delete hash;
}
/* ----------------------------------------------------------------------
write out restart info for mass, COM, inertia tensor, image flags to file
identical format to infile option, so info can be read in when restarting
each proc contributes info for rigid bodies it owns
------------------------------------------------------------------------- */
void FixRigidSmall::write_restart_file(char *file)
{
FILE *fp;
// do not write file if bodies have not yet been initialized
if (!setupflag) return;
// proc 0 opens file and writes header
if (me == 0) {
char outfile[128];
sprintf(outfile,"%s.rigid",file);
fp = fopen(outfile,"w");
if (fp == NULL) {
char str[128];
sprintf(str,"Cannot open fix rigid restart file %s",outfile);
error->one(FLERR,str);
}
fprintf(fp,"# fix rigid mass, COM, inertia tensor info for "
"%d bodies on timestep " BIGINT_FORMAT "\n\n",
nbody,update->ntimestep);
fprintf(fp,"%d\n",nbody);
}
// communication buffer for all my rigid body info
// max_size = largest buffer needed by any proc
// ncol = # of values per line in output file
int ncol = ATTRIBUTE_PERBODY;
int sendrow = nlocal_body;
int maxrow;
MPI_Allreduce(&sendrow,&maxrow,1,MPI_INT,MPI_MAX,world);
double **buf;
if (me == 0) memory->create(buf,MAX(1,maxrow),ncol,"rigid/small:buf");
else memory->create(buf,MAX(1,sendrow),ncol,"rigid/small:buf");
// pack my rigid body info into buf
// compute I tensor against xyz axes from diagonalized I and current quat
// Ispace = P Idiag P_transpose
// P is stored column-wise in exyz_space
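// the full space-frame tensor (not the principal moments) is written so the
//   file matches the infile format and readfile() can re-diagonalize it on restart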
double p[3][3],pdiag[3][3],ispace[3][3];
for (int i = 0; i < nlocal_body; i++) {
MathExtra::col2mat(body[i].ex_space,body[i].ey_space,body[i].ez_space,p);
MathExtra::times3_diag(p,body[i].inertia,pdiag);
MathExtra::times3_transpose(pdiag,p,ispace);
buf[i][0] = atom->molecule[body[i].ilocal];
buf[i][1] = body[i].mass;
buf[i][2] = body[i].xcm[0];
buf[i][3] = body[i].xcm[1];
buf[i][4] = body[i].xcm[2];
buf[i][5] = ispace[0][0];
buf[i][6] = ispace[1][1];
buf[i][7] = ispace[2][2];
buf[i][8] = ispace[0][1];
buf[i][9] = ispace[0][2];
buf[i][10] = ispace[1][2];
buf[i][11] = body[i].vcm[0];
buf[i][12] = body[i].vcm[1];
buf[i][13] = body[i].vcm[2];
buf[i][14] = body[i].angmom[0];
buf[i][15] = body[i].angmom[1];
buf[i][16] = body[i].angmom[2];
+ buf[i][17] = (body[i].image & IMGMASK) - IMGMAX;
+ buf[i][18] = (body[i].image >> IMGBITS & IMGMASK) - IMGMAX;
+ buf[i][19] = (body[i].image >> IMG2BITS) - IMGMAX;
}
// write one chunk of rigid body info per proc to file
// proc 0 pings each proc, receives its chunk, writes to file
// all other procs wait for ping, send their chunk to proc 0
int tmp,recvrow;
if (me == 0) {
MPI_Status status;
MPI_Request request;
for (int iproc = 0; iproc < nprocs; iproc++) {
if (iproc) {
MPI_Irecv(&buf[0][0],maxrow*ncol,MPI_DOUBLE,iproc,0,world,&request);
MPI_Send(&tmp,0,MPI_INT,iproc,0,world);
MPI_Wait(&request,&status);
MPI_Get_count(&status,MPI_DOUBLE,&recvrow);
recvrow /= ncol;
} else recvrow = sendrow;
for (int i = 0; i < recvrow; i++)
fprintf(fp,"%d %-1.16e %-1.16e %-1.16e %-1.16e "
"%-1.16e %-1.16e %-1.16e %-1.16e %-1.16e %-1.16e "
- "%-1.16e %-1.16e %-1.16e %-1.16e %-1.16e %-1.16e\n",
+ "%-1.16e %-1.16e %-1.16e %-1.16e %-1.16e %-1.16e %d %d %d\n",
static_cast<int> (buf[i][0]),buf[i][1],
buf[i][2],buf[i][3],buf[i][4],
buf[i][5],buf[i][6],buf[i][7],buf[i][8],buf[i][9],buf[i][10],
buf[i][11],buf[i][12],buf[i][13],
- buf[i][14],buf[i][15],buf[i][16]);
+ buf[i][14],buf[i][15],buf[i][16],
+ static_cast<int> (buf[i][17]),
+ static_cast<int> (buf[i][18]),
+ static_cast<int> (buf[i][19]));
}
} else {
MPI_Recv(&tmp,0,MPI_INT,0,0,world,MPI_STATUS_IGNORE);
MPI_Rsend(&buf[0][0],sendrow*ncol,MPI_DOUBLE,0,0,world);
}
// clean up and close file
memory->destroy(buf);
if (me == 0) fclose(fp);
}
/* ----------------------------------------------------------------------
allocate local atom-based arrays
------------------------------------------------------------------------- */
void FixRigidSmall::grow_arrays(int nmax)
{
memory->grow(bodyown,nmax,"rigid/small:bodyown");
memory->grow(bodytag,nmax,"rigid/small:bodytag");
memory->grow(atom2body,nmax,"rigid/small:atom2body");
memory->grow(xcmimage,nmax,"rigid/small:xcmimage");
memory->grow(displace,nmax,3,"rigid/small:displace");
if (extended) {
memory->grow(eflags,nmax,"rigid/small:eflags");
if (orientflag) memory->grow(orient,nmax,orientflag,"rigid/small:orient");
if (dorientflag) memory->grow(dorient,nmax,3,"rigid/small:dorient");
}
// check for regrow of vatom
// must be done whether per-atom virial is accumulated on this step or not
// b/c this is the only time grow_arrays() may be called
// need to regrow b/c vatom is calculated before and after atom migration
if (nmax > maxvatom) {
maxvatom = atom->nmax;
memory->grow(vatom,maxvatom,6,"fix:vatom");
}
}
/* ----------------------------------------------------------------------
copy values within local atom-based arrays
------------------------------------------------------------------------- */
void FixRigidSmall::copy_arrays(int i, int j, int delflag)
{
bodytag[j] = bodytag[i];
xcmimage[j] = xcmimage[i];
displace[j][0] = displace[i][0];
displace[j][1] = displace[i][1];
displace[j][2] = displace[i][2];
if (extended) {
eflags[j] = eflags[i];
for (int k = 0; k < orientflag; k++)
orient[j][k] = orient[i][k];
if (dorientflag) {
dorient[j][0] = dorient[i][0];
dorient[j][1] = dorient[i][1];
dorient[j][2] = dorient[i][2];
}
}
// must also copy vatom if per-atom virial calculated on this timestep
// since vatom is calculated before and after atom migration
if (vflag_atom)
for (int k = 0; k < 6; k++)
vatom[j][k] = vatom[i][k];
// if deleting atom J via delflag and J owns a body, then delete it
if (delflag && bodyown[j] >= 0) {
bodyown[body[nlocal_body-1].ilocal] = bodyown[j];
memcpy(&body[bodyown[j]],&body[nlocal_body-1],sizeof(Body));
nlocal_body--;
}
// if atom I owns a body, reset I's body.ilocal to loc J
// do NOT do this if self-copy (I=J) since I's body is already deleted
if (bodyown[i] >= 0 && i != j) body[bodyown[i]].ilocal = j;
bodyown[j] = bodyown[i];
}
/* ----------------------------------------------------------------------
initialize one atom's array values, called when atom is created
------------------------------------------------------------------------- */
void FixRigidSmall::set_arrays(int i)
{
bodyown[i] = -1;
bodytag[i] = 0;
atom2body[i] = -1;
xcmimage[i] = 0;
displace[i][0] = 0.0;
displace[i][1] = 0.0;
displace[i][2] = 0.0;
// must also zero vatom if per-atom virial calculated on this timestep
// since vatom is calculated before and after atom migration
if (vflag_atom)
for (int k = 0; k < 6; k++)
vatom[i][k] = 0.0;
}
/* ----------------------------------------------------------------------
initialize a molecule inserted by another fix, e.g. deposit or pour
called when molecule is created
nlocalprev = # of atoms on this proc before molecule inserted
tagprev = atom ID previous to new atoms in the molecule
xgeom = geometric center of new molecule
vcm = COM velocity of new molecule
quat = rotation of new molecule (around geometric center)
relative to template in Molecule class
------------------------------------------------------------------------- */
void FixRigidSmall::set_molecule(int nlocalprev, tagint tagprev, int imol,
double *xgeom, double *vcm, double *quat)
{
int m;
double ctr2com[3],ctr2com_rotate[3];
double rotmat[3][3];
// increment total # of rigid bodies
nbody++;
// loop over atoms I added for the new body
int nlocal = atom->nlocal;
if (nlocalprev == nlocal) return;
tagint *tag = atom->tag;
for (int i = nlocalprev; i < nlocal; i++) {
bodytag[i] = tagprev + onemols[imol]->comatom;
if (tag[i]-tagprev == onemols[imol]->comatom) bodyown[i] = nlocal_body;
m = tag[i] - tagprev-1;
displace[i][0] = onemols[imol]->dxbody[m][0];
displace[i][1] = onemols[imol]->dxbody[m][1];
displace[i][2] = onemols[imol]->dxbody[m][2];
if (extended) {
eflags[i] = 0;
if (onemols[imol]->radiusflag) {
eflags[i] |= SPHERE;
eflags[i] |= OMEGA;
eflags[i] |= TORQUE;
}
}
if (bodyown[i] >= 0) {
if (nlocal_body == nmax_body) grow_body();
Body *b = &body[nlocal_body];
b->mass = onemols[imol]->masstotal;
// new COM = Q (onemols[imol]->xcm - onemols[imol]->center) + xgeom
// Q = rotation matrix associated with quat
MathExtra::quat_to_mat(quat,rotmat);
MathExtra::sub3(onemols[imol]->com,onemols[imol]->center,ctr2com);
MathExtra::matvec(rotmat,ctr2com,ctr2com_rotate);
MathExtra::add3(ctr2com_rotate,xgeom,b->xcm);
b->vcm[0] = vcm[0];
b->vcm[1] = vcm[1];
b->vcm[2] = vcm[2];
b->inertia[0] = onemols[imol]->inertia[0];
b->inertia[1] = onemols[imol]->inertia[1];
b->inertia[2] = onemols[imol]->inertia[2];
// final quat is product of insertion quat and original quat
// true even if insertion rotation was not around COM
MathExtra::quatquat(quat,onemols[imol]->quat,b->quat);
MathExtra::q_to_exyz(b->quat,b->ex_space,b->ey_space,b->ez_space);
b->angmom[0] = b->angmom[1] = b->angmom[2] = 0.0;
b->omega[0] = b->omega[1] = b->omega[2] = 0.0;
b->image = ((imageint) IMGMAX << IMG2BITS) |
((imageint) IMGMAX << IMGBITS) | IMGMAX;
b->ilocal = i;
nlocal_body++;
}
}
}
/* ----------------------------------------------------------------------
pack values in local atom-based arrays for exchange with another proc
------------------------------------------------------------------------- */
int FixRigidSmall::pack_exchange(int i, double *buf)
{
buf[0] = ubuf(bodytag[i]).d;
buf[1] = ubuf(xcmimage[i]).d;
buf[2] = displace[i][0];
buf[3] = displace[i][1];
buf[4] = displace[i][2];
// extended attribute info
int m = 5;
if (extended) {
buf[m++] = eflags[i];
for (int j = 0; j < orientflag; j++)
buf[m++] = orient[i][j];
if (dorientflag) {
buf[m++] = dorient[i][0];
buf[m++] = dorient[i][1];
buf[m++] = dorient[i][2];
}
}
// atom not in a rigid body
if (!bodytag[i]) return m;
// must also pack vatom if per-atom virial calculated on this timestep
// since vatom is calculated before and after atom migration
if (vflag_atom)
for (int k = 0; k < 6; k++)
buf[m++] = vatom[i][k];
// atom does not own its rigid body
if (bodyown[i] < 0) {
buf[m++] = 0;
return m;
}
// body info for atom that owns a rigid body
buf[m++] = 1;
memcpy(&buf[m],&body[bodyown[i]],sizeof(Body));
m += bodysize;
return m;
}
/* ----------------------------------------------------------------------
unpack values in local atom-based arrays from exchange with another proc
------------------------------------------------------------------------- */
int FixRigidSmall::unpack_exchange(int nlocal, double *buf)
{
bodytag[nlocal] = (tagint) ubuf(buf[0]).i;
xcmimage[nlocal] = (imageint) ubuf(buf[1]).i;
displace[nlocal][0] = buf[2];
displace[nlocal][1] = buf[3];
displace[nlocal][2] = buf[4];
// extended attribute info
int m = 5;
if (extended) {
eflags[nlocal] = static_cast<int> (buf[m++]);
for (int j = 0; j < orientflag; j++)
orient[nlocal][j] = buf[m++];
if (dorientflag) {
dorient[nlocal][0] = buf[m++];
dorient[nlocal][1] = buf[m++];
dorient[nlocal][2] = buf[m++];
}
}
// atom not in a rigid body
if (!bodytag[nlocal]) {
bodyown[nlocal] = -1;
return m;
}
// must also unpack vatom if per-atom virial calculated on this timestep
// since vatom is calculated before and after atom migration
if (vflag_atom)
for (int k = 0; k < 6; k++)
vatom[nlocal][k] = buf[m++];
// atom does not own its rigid body
bodyown[nlocal] = static_cast<int> (buf[m++]);
if (bodyown[nlocal] == 0) {
bodyown[nlocal] = -1;
return m;
}
// body info for atom that owns a rigid body
if (nlocal_body == nmax_body) grow_body();
memcpy(&body[nlocal_body],&buf[m],sizeof(Body));
m += bodysize;
body[nlocal_body].ilocal = nlocal;
bodyown[nlocal] = nlocal_body++;
return m;
}
/* ----------------------------------------------------------------------
only pack body info if own or ghost atom owns the body
for FULL_BODY, send 0/1 flag with every atom
------------------------------------------------------------------------- */
int FixRigidSmall::pack_forward_comm(int n, int *list, double *buf,
int pbc_flag, int *pbc)
{
int i,j;
double *xcm,*vcm,*quat,*omega,*ex_space,*ey_space,*ez_space,*conjqm;
int m = 0;
if (commflag == INITIAL) {
for (i = 0; i < n; i++) {
j = list[i];
if (bodyown[j] < 0) continue;
xcm = body[bodyown[j]].xcm;
buf[m++] = xcm[0];
buf[m++] = xcm[1];
buf[m++] = xcm[2];
vcm = body[bodyown[j]].vcm;
buf[m++] = vcm[0];
buf[m++] = vcm[1];
buf[m++] = vcm[2];
quat = body[bodyown[j]].quat;
buf[m++] = quat[0];
buf[m++] = quat[1];
buf[m++] = quat[2];
buf[m++] = quat[3];
omega = body[bodyown[j]].omega;
buf[m++] = omega[0];
buf[m++] = omega[1];
buf[m++] = omega[2];
ex_space = body[bodyown[j]].ex_space;
buf[m++] = ex_space[0];
buf[m++] = ex_space[1];
buf[m++] = ex_space[2];
ey_space = body[bodyown[j]].ey_space;
buf[m++] = ey_space[0];
buf[m++] = ey_space[1];
buf[m++] = ey_space[2];
ez_space = body[bodyown[j]].ez_space;
buf[m++] = ez_space[0];
buf[m++] = ez_space[1];
buf[m++] = ez_space[2];
conjqm = body[bodyown[j]].conjqm;
buf[m++] = conjqm[0];
buf[m++] = conjqm[1];
buf[m++] = conjqm[2];
buf[m++] = conjqm[3];
}
} else if (commflag == FINAL) {
for (i = 0; i < n; i++) {
j = list[i];
if (bodyown[j] < 0) continue;
vcm = body[bodyown[j]].vcm;
buf[m++] = vcm[0];
buf[m++] = vcm[1];
buf[m++] = vcm[2];
omega = body[bodyown[j]].omega;
buf[m++] = omega[0];
buf[m++] = omega[1];
buf[m++] = omega[2];
conjqm = body[bodyown[j]].conjqm;
buf[m++] = conjqm[0];
buf[m++] = conjqm[1];
buf[m++] = conjqm[2];
buf[m++] = conjqm[3];
}
} else if (commflag == FULL_BODY) {
for (i = 0; i < n; i++) {
j = list[i];
if (bodyown[j] < 0) buf[m++] = 0;
else {
buf[m++] = 1;
memcpy(&buf[m],&body[bodyown[j]],sizeof(Body));
m += bodysize;
}
}
}
return m;
}
/* ----------------------------------------------------------------------
only ghost atoms are looped over
for FULL_BODY, store a new ghost body if this atom owns it
for other commflag values, only unpack body info if atom owns it
------------------------------------------------------------------------- */
void FixRigidSmall::unpack_forward_comm(int n, int first, double *buf)
{
int i,j,last;
double *xcm,*vcm,*quat,*omega,*ex_space,*ey_space,*ez_space,*conjqm;
int m = 0;
last = first + n;
if (commflag == INITIAL) {
for (i = first; i < last; i++) {
if (bodyown[i] < 0) continue;
xcm = body[bodyown[i]].xcm;
xcm[0] = buf[m++];
xcm[1] = buf[m++];
xcm[2] = buf[m++];
vcm = body[bodyown[i]].vcm;
vcm[0] = buf[m++];
vcm[1] = buf[m++];
vcm[2] = buf[m++];
quat = body[bodyown[i]].quat;
quat[0] = buf[m++];
quat[1] = buf[m++];
quat[2] = buf[m++];
quat[3] = buf[m++];
omega = body[bodyown[i]].omega;
omega[0] = buf[m++];
omega[1] = buf[m++];
omega[2] = buf[m++];
ex_space = body[bodyown[i]].ex_space;
ex_space[0] = buf[m++];
ex_space[1] = buf[m++];
ex_space[2] = buf[m++];
ey_space = body[bodyown[i]].ey_space;
ey_space[0] = buf[m++];
ey_space[1] = buf[m++];
ey_space[2] = buf[m++];
ez_space = body[bodyown[i]].ez_space;
ez_space[0] = buf[m++];
ez_space[1] = buf[m++];
ez_space[2] = buf[m++];
conjqm = body[bodyown[i]].conjqm;
conjqm[0] = buf[m++];
conjqm[1] = buf[m++];
conjqm[2] = buf[m++];
conjqm[3] = buf[m++];
}
} else if (commflag == FINAL) {
for (i = first; i < last; i++) {
if (bodyown[i] < 0) continue;
vcm = body[bodyown[i]].vcm;
vcm[0] = buf[m++];
vcm[1] = buf[m++];
vcm[2] = buf[m++];
omega = body[bodyown[i]].omega;
omega[0] = buf[m++];
omega[1] = buf[m++];
omega[2] = buf[m++];
conjqm = body[bodyown[i]].conjqm;
conjqm[0] = buf[m++];
conjqm[1] = buf[m++];
conjqm[2] = buf[m++];
conjqm[3] = buf[m++];
}
} else if (commflag == FULL_BODY) {
for (i = first; i < last; i++) {
bodyown[i] = static_cast<int> (buf[m++]);
if (bodyown[i] == 0) bodyown[i] = -1;
else {
j = nlocal_body + nghost_body;
if (j == nmax_body) grow_body();
memcpy(&body[j],&buf[m],sizeof(Body));
m += bodysize;
body[j].ilocal = i;
bodyown[i] = j;
nghost_body++;
}
}
}
}
/* ----------------------------------------------------------------------
only ghost atoms are looped over
only pack body info if atom owns it
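ghost copies of a body hold partial sums (force/torque, vcm/angmom,
xcm/mass, inertia terms) accumulated from this proc's atoms; reverse comm
returns them to the proc that owns the body, where unpack_reverse_comm()
adds them into the owning copy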
------------------------------------------------------------------------- */
int FixRigidSmall::pack_reverse_comm(int n, int first, double *buf)
{
int i,j,m,last;
double *fcm,*torque,*vcm,*angmom,*xcm;
m = 0;
last = first + n;
if (commflag == FORCE_TORQUE) {
for (i = first; i < last; i++) {
if (bodyown[i] < 0) continue;
fcm = body[bodyown[i]].fcm;
buf[m++] = fcm[0];
buf[m++] = fcm[1];
buf[m++] = fcm[2];
torque = body[bodyown[i]].torque;
buf[m++] = torque[0];
buf[m++] = torque[1];
buf[m++] = torque[2];
}
} else if (commflag == VCM_ANGMOM) {
for (i = first; i < last; i++) {
if (bodyown[i] < 0) continue;
vcm = body[bodyown[i]].vcm;
buf[m++] = vcm[0];
buf[m++] = vcm[1];
buf[m++] = vcm[2];
angmom = body[bodyown[i]].angmom;
buf[m++] = angmom[0];
buf[m++] = angmom[1];
buf[m++] = angmom[2];
}
} else if (commflag == XCM_MASS) {
for (i = first; i < last; i++) {
if (bodyown[i] < 0) continue;
xcm = body[bodyown[i]].xcm;
buf[m++] = xcm[0];
buf[m++] = xcm[1];
buf[m++] = xcm[2];
buf[m++] = body[bodyown[i]].mass;
}
} else if (commflag == ITENSOR) {
for (i = first; i < last; i++) {
if (bodyown[i] < 0) continue;
j = bodyown[i];
buf[m++] = itensor[j][0];
buf[m++] = itensor[j][1];
buf[m++] = itensor[j][2];
buf[m++] = itensor[j][3];
buf[m++] = itensor[j][4];
buf[m++] = itensor[j][5];
}
} else if (commflag == DOF) {
for (i = first; i < last; i++) {
if (bodyown[i] < 0) continue;
j = bodyown[i];
buf[m++] = counts[j][0];
buf[m++] = counts[j][1];
buf[m++] = counts[j][2];
}
}
return m;
}
/* ----------------------------------------------------------------------
only unpack body info if own or ghost atom owns the body
------------------------------------------------------------------------- */
void FixRigidSmall::unpack_reverse_comm(int n, int *list, double *buf)
{
int i,j,k;
double *fcm,*torque,*vcm,*angmom,*xcm;
int m = 0;
if (commflag == FORCE_TORQUE) {
for (i = 0; i < n; i++) {
j = list[i];
if (bodyown[j] < 0) continue;
fcm = body[bodyown[j]].fcm;
fcm[0] += buf[m++];
fcm[1] += buf[m++];
fcm[2] += buf[m++];
torque = body[bodyown[j]].torque;
torque[0] += buf[m++];
torque[1] += buf[m++];
torque[2] += buf[m++];
}
} else if (commflag == VCM_ANGMOM) {
for (i = 0; i < n; i++) {
j = list[i];
if (bodyown[j] < 0) continue;
vcm = body[bodyown[j]].vcm;
vcm[0] += buf[m++];
vcm[1] += buf[m++];
vcm[2] += buf[m++];
angmom = body[bodyown[j]].angmom;
angmom[0] += buf[m++];
angmom[1] += buf[m++];
angmom[2] += buf[m++];
}
} else if (commflag == XCM_MASS) {
for (i = 0; i < n; i++) {
j = list[i];
if (bodyown[j] < 0) continue;
xcm = body[bodyown[j]].xcm;
xcm[0] += buf[m++];
xcm[1] += buf[m++];
xcm[2] += buf[m++];
body[bodyown[j]].mass += buf[m++];
}
} else if (commflag == ITENSOR) {
for (i = 0; i < n; i++) {
j = list[i];
if (bodyown[j] < 0) continue;
k = bodyown[j];
itensor[k][0] += buf[m++];
itensor[k][1] += buf[m++];
itensor[k][2] += buf[m++];
itensor[k][3] += buf[m++];
itensor[k][4] += buf[m++];
itensor[k][5] += buf[m++];
}
} else if (commflag == DOF) {
for (i = 0; i < n; i++) {
j = list[i];
if (bodyown[j] < 0) continue;
k = bodyown[j];
counts[k][0] += static_cast<int> (buf[m++]);
counts[k][1] += static_cast<int> (buf[m++]);
counts[k][2] += static_cast<int> (buf[m++]);
}
}
}
/* ----------------------------------------------------------------------
grow body data structure
------------------------------------------------------------------------- */
void FixRigidSmall::grow_body()
{
nmax_body += DELTA_BODY;
body = (Body *) memory->srealloc(body,nmax_body*sizeof(Body),
"rigid/small:body");
}
/* ----------------------------------------------------------------------
reset atom2body for all owned atoms
do this via bodyown of atom that owns the body the owned atom is in
atom2body values can point to original body or any image of the body
------------------------------------------------------------------------- */
void FixRigidSmall::reset_atom2body()
{
int iowner;
// iowner = index of atom that owns the body that atom I is in
int nlocal = atom->nlocal;
for (int i = 0; i < nlocal; i++) {
atom2body[i] = -1;
if (bodytag[i]) {
iowner = atom->map(bodytag[i]);
if (iowner == -1) {
char str[128];
sprintf(str,
"Rigid body atoms " TAGINT_FORMAT " " TAGINT_FORMAT
" missing on proc %d at step " BIGINT_FORMAT,
atom->tag[i],bodytag[i],comm->me,update->ntimestep);
error->one(FLERR,str);
}
atom2body[i] = bodyown[iowner];
}
}
}
/* ---------------------------------------------------------------------- */
void FixRigidSmall::reset_dt()
{
dtv = update->dt;
dtf = 0.5 * update->dt * force->ftm2v;
dtq = 0.5 * update->dt;
}
/* ----------------------------------------------------------------------
zero linear momentum of each rigid body
set Vcm to 0.0, then reset velocities of particles via set_v()
------------------------------------------------------------------------- */
void FixRigidSmall::zero_momentum()
{
double *vcm;
for (int ibody = 0; ibody < nlocal_body+nghost_body; ibody++) {
vcm = body[ibody].vcm;
vcm[0] = vcm[1] = vcm[2] = 0.0;
}
// forward communicate of vcm to all ghost copies
commflag = FINAL;
comm->forward_comm_fix(this,10);
// set velocity of atoms in rigid bodies
evflag = 0;
set_v();
}
/* ----------------------------------------------------------------------
zero angular momentum of each rigid body
set angmom/omega to 0.0, then reset velocities of particles via set_v()
------------------------------------------------------------------------- */
void FixRigidSmall::zero_rotation()
{
double *angmom,*omega;
for (int ibody = 0; ibody < nlocal_body+nghost_body; ibody++) {
angmom = body[ibody].angmom;
angmom[0] = angmom[1] = angmom[2] = 0.0;
omega = body[ibody].omega;
omega[0] = omega[1] = omega[2] = 0.0;
}
// forward communicate of omega to all ghost copies
commflag = FINAL;
comm->forward_comm_fix(this,10);
// set velocity of atoms in rigid bodies
evflag = 0;
set_v();
}
/* ---------------------------------------------------------------------- */
void *FixRigidSmall::extract(const char *str, int &dim)
{
if (strcmp(str,"body") == 0) {
dim = 1;
return atom2body;
}
if (strcmp(str,"onemol") == 0) {
dim = 0;
return onemols;
}
// return vector of rigid body masses, for owned+ghost bodies
// used by granular pair styles, indexed by atom2body
if (strcmp(str,"masstotal") == 0) {
dim = 1;
if (nmax_mass < nmax_body) {
memory->destroy(mass_body);
nmax_mass = nmax_body;
memory->create(mass_body,nmax_mass,"rigid:mass_body");
}
int n = nlocal_body + nghost_body;
for (int i = 0; i < n; i++)
mass_body[i] = body[i].mass;
return mass_body;
}
return NULL;
}
/* ----------------------------------------------------------------------
return translational KE for all rigid bodies
KE = 1/2 M Vcm^2
sum local body results across procs
------------------------------------------------------------------------- */
double FixRigidSmall::extract_ke()
{
double *vcm;
double ke = 0.0;
for (int i = 0; i < nlocal_body; i++) {
vcm = body[i].vcm;
ke += body[i].mass * (vcm[0]*vcm[0] + vcm[1]*vcm[1] + vcm[2]*vcm[2]);
}
double keall;
MPI_Allreduce(&ke,&keall,1,MPI_DOUBLE,MPI_SUM,world);
return 0.5*keall;
}
/* ----------------------------------------------------------------------
return rotational KE for all rigid bodies
Erotational = 1/2 I wbody^2
------------------------------------------------------------------------- */
double FixRigidSmall::extract_erotational()
{
double wbody[3],rot[3][3];
double *inertia;
double erotate = 0.0;
for (int i = 0; i < nlocal_body; i++) {
// for Iw^2 rotational term, need wbody = angular velocity in body frame
// not omega = angular velocity in space frame
inertia = body[i].inertia;
MathExtra::quat_to_mat(body[i].quat,rot);
MathExtra::transpose_matvec(rot,body[i].angmom,wbody);
if (inertia[0] == 0.0) wbody[0] = 0.0;
else wbody[0] /= inertia[0];
if (inertia[1] == 0.0) wbody[1] = 0.0;
else wbody[1] /= inertia[1];
if (inertia[2] == 0.0) wbody[2] = 0.0;
else wbody[2] /= inertia[2];
erotate += inertia[0]*wbody[0]*wbody[0] + inertia[1]*wbody[1]*wbody[1] +
inertia[2]*wbody[2]*wbody[2];
}
double erotateall;
MPI_Allreduce(&erotate,&erotateall,1,MPI_DOUBLE,MPI_SUM,world);
return 0.5*erotateall;
}
/* ----------------------------------------------------------------------
return temperature of collection of rigid bodies
non-active DOF are removed by fflag/tflag and in tfactor
------------------------------------------------------------------------- */
double FixRigidSmall::compute_scalar()
{
double wbody[3],rot[3][3];
double *vcm,*inertia;
double t = 0.0;
for (int i = 0; i < nlocal_body; i++) {
vcm = body[i].vcm;
t += body[i].mass * (vcm[0]*vcm[0] + vcm[1]*vcm[1] + vcm[2]*vcm[2]);
// for Iw^2 rotational term, need wbody = angular velocity in body frame
// not omega = angular velocity in space frame
inertia = body[i].inertia;
MathExtra::quat_to_mat(body[i].quat,rot);
MathExtra::transpose_matvec(rot,body[i].angmom,wbody);
if (inertia[0] == 0.0) wbody[0] = 0.0;
else wbody[0] /= inertia[0];
if (inertia[1] == 0.0) wbody[1] = 0.0;
else wbody[1] /= inertia[1];
if (inertia[2] == 0.0) wbody[2] = 0.0;
else wbody[2] /= inertia[2];
t += inertia[0]*wbody[0]*wbody[0] + inertia[1]*wbody[1]*wbody[1] +
inertia[2]*wbody[2]*wbody[2];
}
double tall;
MPI_Allreduce(&t,&tall,1,MPI_DOUBLE,MPI_SUM,world);
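// temperature normalization: each rigid body has 6 degrees of freedom
//   (3 translational + 3 rotational); nlinear corrects for linear bodies,
//   which have only 5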
double tfactor = force->mvv2e / ((6.0*nbody - nlinear) * force->boltz);
tall *= tfactor;
return tall;
}
/* ----------------------------------------------------------------------
memory usage of local atom-based arrays
------------------------------------------------------------------------- */
double FixRigidSmall::memory_usage()
{
int nmax = atom->nmax;
double bytes = nmax*2 * sizeof(int);
bytes += nmax * sizeof(imageint);
bytes += nmax*3 * sizeof(double);
bytes += maxvatom*6 * sizeof(double); // vatom
if (extended) {
bytes += nmax * sizeof(int);
if (orientflag) bytes += nmax*orientflag * sizeof(double);
if (dorientflag) bytes += nmax*3 * sizeof(double);
}
bytes += nmax_body * sizeof(Body);
return bytes;
}
/* ----------------------------------------------------------------------
debug method for sanity checking of atom/body data pointers
------------------------------------------------------------------------- */
/*
void FixRigidSmall::check(int flag)
{
for (int i = 0; i < atom->nlocal; i++) {
if (bodyown[i] >= 0) {
if (bodytag[i] != atom->tag[i]) {
printf("Proc %d, step %ld, flag %d\n",comm->me,update->ntimestep,flag);
errorx->one(FLERR,"BAD AAA");
}
if (bodyown[i] < 0 || bodyown[i] >= nlocal_body) {
printf("Proc %d, step %ld, flag %d\n",comm->me,update->ntimestep,flag);
errorx->one(FLERR,"BAD BBB");
}
if (atom2body[i] != bodyown[i]) {
printf("Proc %d, step %ld, flag %d\n",comm->me,update->ntimestep,flag);
errorx->one(FLERR,"BAD CCC");
}
if (body[bodyown[i]].ilocal != i) {
printf("Proc %d, step %ld, flag %d\n",comm->me,update->ntimestep,flag);
errorx->one(FLERR,"BAD DDD");
}
}
}
for (int i = 0; i < atom->nlocal; i++) {
if (bodyown[i] < 0 && bodytag[i] > 0) {
if (atom2body[i] < 0 || atom2body[i] >= nlocal_body+nghost_body) {
printf("Proc %d, step %ld, flag %d\n",comm->me,update->ntimestep,flag);
errorx->one(FLERR,"BAD EEE");
}
if (bodytag[i] != atom->tag[body[atom2body[i]].ilocal]) {
printf("Proc %d, step %ld, flag %d\n",comm->me,update->ntimestep,flag);
errorx->one(FLERR,"BAD FFF");
}
}
}
for (int i = atom->nlocal; i < atom->nlocal + atom->nghost; i++) {
if (bodyown[i] >= 0) {
if (bodyown[i] < nlocal_body ||
bodyown[i] >= nlocal_body+nghost_body) {
printf("Values %d %d: %d %d %d\n",
i,atom->tag[i],bodyown[i],nlocal_body,nghost_body);
printf("Proc %d, step %ld, flag %d\n",comm->me,update->ntimestep,flag);
errorx->one(FLERR,"BAD GGG");
}
if (body[bodyown[i]].ilocal != i) {
printf("Proc %d, step %ld, flag %d\n",comm->me,update->ntimestep,flag);
errorx->one(FLERR,"BAD HHH");
}
}
}
for (int i = 0; i < nlocal_body; i++) {
if (body[i].ilocal < 0 || body[i].ilocal >= atom->nlocal) {
printf("Proc %d, step %ld, flag %d\n",comm->me,update->ntimestep,flag);
errorx->one(FLERR,"BAD III");
}
if (bodytag[body[i].ilocal] != atom->tag[body[i].ilocal] ||
bodyown[body[i].ilocal] != i) {
printf("Proc %d, step %ld, flag %d\n",comm->me,update->ntimestep,flag);
errorx->one(FLERR,"BAD JJJ");
}
}
for (int i = nlocal_body; i < nlocal_body + nghost_body; i++) {
if (body[i].ilocal < atom->nlocal ||
body[i].ilocal >= atom->nlocal + atom->nghost) {
printf("Proc %d, step %ld, flag %d\n",comm->me,update->ntimestep,flag);
errorx->one(FLERR,"BAD KKK");
}
if (bodyown[body[i].ilocal] != i) {
printf("Proc %d, step %ld, flag %d\n",comm->me,update->ntimestep,flag);
errorx->one(FLERR,"BAD LLL");
}
}
}
*/
diff --git a/src/version.h b/src/version.h
index ecff1ced9..604404a0a 100644
--- a/src/version.h
+++ b/src/version.h
@@ -1 +1 @@
-#define LAMMPS_VERSION "8 Jul 2015"
+#define LAMMPS_VERSION "15 Jul 2015"
