include explicit 8x loop unrolling for faster thread data reduction
code contributed by Paul Coffman (IBM)
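
For illustration only, a minimal sketch of what an explicit 8x unrolled reduction over per-thread data can look like. This is not the contributed code: the function name `reduce_thread_copies`, the buffer layout (`nthreads` consecutive blocks of `nvals` doubles, with the result accumulated into the first block), and the serial loop over threads are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the actual change): sums the per-thread
// copies of an array into the first copy.
// Assumed layout: buf holds nthreads consecutive blocks of nvals doubles.
void reduce_thread_copies(double *buf, std::size_t nvals, int nthreads)
{
  assert(buf != nullptr && nthreads >= 1);

  // Largest multiple of 8 that does not exceed nvals.
  const std::size_t nunrolled = nvals - (nvals % 8);

  for (int t = 1; t < nthreads; ++t) {
    const double *src = buf + static_cast<std::size_t>(t) * nvals;
    std::size_t i = 0;

    // Main loop: eight independent additions per iteration.
    for (; i < nunrolled; i += 8) {
      buf[i]     += src[i];
      buf[i + 1] += src[i + 1];
      buf[i + 2] += src[i + 2];
      buf[i + 3] += src[i + 3];
      buf[i + 4] += src[i + 4];
      buf[i + 5] += src[i + 5];
      buf[i + 6] += src[i + 6];
      buf[i + 7] += src[i + 7];
    }

    // Remainder loop: handles the last nvals % 8 elements.
    for (; i < nvals; ++i) buf[i] += src[i];
  }
}
```

In a threaded setting, each thread would presumably run such a loop over its own disjoint index range; the sketch stays serial to keep the unrolling itself easy to see.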