add small optimization for lj/charmm/coul/long/omp
accumulate the force on atom i in the local variables fxtmp/fytmp/fztmp inside the inner loop and add them to f[i][0/1/2] only once at the end.
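
a minimal sketch of the pattern, not the actual kernel: the function name, the neighbor-list layout and the placeholder 1/r^4 pair term are assumptions made for illustration. only the fxtmp/fytmp/fztmp accumulation and the single update of f[i] reflect this change.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Vec3 = std::array<double, 3>;

// Hypothetical stand-in for the pair loop. neighbors[i] is assumed to be a
// half neighbor list: each pair appears once and never contains i itself.
void accumulate_forces(const std::vector<Vec3> &x,
                       const std::vector<std::vector<std::size_t>> &neighbors,
                       std::vector<Vec3> &f)
{
  for (std::size_t i = 0; i < x.size(); ++i) {
    const double xtmp = x[i][0], ytmp = x[i][1], ztmp = x[i][2];

    // Per-atom accumulators kept in locals for the whole inner loop.
    double fxtmp = 0.0, fytmp = 0.0, fztmp = 0.0;

    for (std::size_t j : neighbors[i]) {
      const double delx = xtmp - x[j][0];
      const double dely = ytmp - x[j][1];
      const double delz = ztmp - x[j][2];
      const double rsq = delx * delx + dely * dely + delz * delz;

      // Placeholder pair term (assumption), not the lj/charmm/coul/long force.
      const double fpair = 1.0 / (rsq * rsq);

      fxtmp += delx * fpair;
      fytmp += dely * fpair;
      fztmp += delz * fpair;

      // Newton's third law: atom j still receives its share immediately.
      f[j][0] -= delx * fpair;
      f[j][1] -= dely * fpair;
      f[j][2] -= delz * fpair;
    }

    // f[i] is touched only once per atom, after the inner loop.
    f[i][0] += fxtmp;
    f[i][1] += fytmp;
    f[i][2] += fztmp;
  }
}
```

the locals can stay in registers, so the inner loop no longer reads and writes f[i][0/1/2] on every pair.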