fix race condition on rho
the main bug here is the use of a local
rho_i accumulator which later gets assigned
back to rho[i].
in parallel, atomic additions can happen to
rho[i] while the local accumulator is held;
those atomic additions are lost when
the accumulator is atomically assigned.
we instead initialize the accumulator to zero
and atomically add it back to rho[i].