\section{Floating-Point Implementation} \label{sec:fp}
% Explain code logic, setup
In this \namecref{sec:fp}, the implementation of \cref{eq:T4lorenzo,eq:t5jorge} in Fortran is discussed. Fortran (FORmula TRANslator) is a compiled programming language designed for scientific computing. Separately, Wolfram's Mathematica software is used to verify that the code produces accurate output. It should be noted, however, that although this software provides a priori exact results for the target computations, its performance prevents computing a large number of elements in a reasonable time.
As the computation of these coefficients is also expected to be lengthy, the program should write them to a table saved on the host machine, rather than act as an interface providing subroutines that return coefficients computed ``on the fly''.
This table can then be loaded into memory (totally or partially) when required by other parts of the program.
We expect $p,j \leq 60$ in \cref{eq:T4lorenzo} to be reasonable bounds to reproduce usual plasma distribution functions $f_a$ accurately, and $p,j\leq25$ in \cref{eq:t5jorge}.
\paragraph{Validation with Mathematica}
Both formulas \labelcref{eq:T4lorenzo,eq:t5jorge} are first implemented within Mathematica. In \cref{fig:mathematica_t45_checks}, it is verified that these expressions perform the basis transformations \labelcref{eq:LGL2HL,eq:GLGL2HL} accurately: the left- and right-hand sides of these equations are computed independently, and the absolute error between the two is computed.
%
\begin{figure}[hb]
\begin{subfigure}{.45\linewidth}
\centering
\includegraphics[width=\linewidth]{figs/bigbluesquare.png}
\subcaption{Basis transformation \labelcref{eq:LGL2HL}, considering $l=6$ and $k=6$.}
\end{subfigure}
\hfill
\begin{subfigure}{.45\linewidth}
\centering
\includegraphics[width=\linewidth]{figs/bigbluesquare.png}
\subcaption{Basis transformation \labelcref{eq:GLGL2HL}, considering $l,m=2$, and $k=1$.}
\end{subfigure}
\caption{On these plots, the absolute error between left- and right-hand sides of basis transformations \labelcref{eq:LGL2HL,eq:GLGL2HL}, as computed by Mathematica, is reported. The error is negligible to machine precision. Similar plots can be obtained for other indices.}
\label{fig:mathematica_t45_checks}
\end{figure}
%
It can be seen that this error is negligible with respect to the machine precision of the host computer ($\epsilon=\num{1.16e-16}$). Thus, the results obtained through Mathematica are deemed accurate and can be used to validate the output of the Fortran code.
\paragraph{Direct implementation}
% Give results about 1st numerical implementation
Coefficients \labelcref{eq:T4lorenzo,eq:t5jorge} are then implemented in double-precision (\num{64}-bit, DP) floating-point (FP) arithmetic. The first implementation ($T_{lk}^{pj}$ only) uses the intrinsic \lstinline{gamma} function to compute factorials of integers and half-integers, as well as binomials. Its structure is straightforward, and can be found in \cref{lst:t4_v1}.
\begin{lstlisting}[float,%
caption={Procedure for computing $T_{lk}^{pj}$},%
label={lst:t4_v1}]
do i = 0, l/2
  do m = 0, k
    do r = 0, m+i
      num(1) = Binomial(2*(l-i), l)
      ! Compute numerators...
      num(8) = Factorial(m+i-r)
      den(1) = 2**(2*(l-i+r))
      ! Compute denominators...
      den(6) = Factorial(p)
      return_value = return_value + product(num)/product(den)
    end do
  end do
end do
\end{lstlisting}
Although a speedup of $s_4\sim\num{10}$ is already obtained compared to the Mathematica implementation\footnote{The `4' subscript refers to the number of indices in the coefficient $T_{lk}^{pj}$. When referring to $T_{lkm}^{pj}$, `5' will be used.}\footnote{Performance benchmarks of the code can be found in \cref{app:benchmarks}. Here, we refer to the ``Mathematica'' and ``73f3d2e'' benchmarks.}, two issues are immediately noticed:
\begin{enumerate*}[label={(\arabic*)}]
\item when $l,k,p,j$ indices are sufficiently high, the \lstinline|return_value| variable may overflow and
\item most of the execution time is spent within calls to \lstinline|gamma|.
\end{enumerate*}
The second issue is solved by implementing a custom factorial function, valid for integer and half-integer inputs, which relies on a cache array (with size defined at compile time). This provides an additional speedup of about \num{14}, such that $s_4\sim\num{e3}$ and $s_5\sim\num{e4}$ (\cref{app:benchmarks}, ``1e91750'').
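A minimal Python sketch of this caching strategy (the production code is Fortran; the cache size, tolerance and function name here are illustrative assumptions, not taken from the actual implementation) could read:

```python
from math import gamma

# Hypothetical sketch: precompute x! = Gamma(x+1) for all integer and
# half-integer x up to a fixed bound, as the Fortran cache does at
# compile time. CACHE_SIZE is an illustrative value.
CACHE_SIZE = 301                      # covers x = 0, 0.5, 1.0, ..., 150
_cache = [gamma(n / 2.0 + 1.0) for n in range(CACHE_SIZE)]

def half_factorial(x):
    """Factorial of a non-negative integer or half-integer x."""
    n = int(round(2 * x))             # cache index: twice the argument
    if n < 0 or abs(2 * x - n) > 1e-12:
        raise ValueError("x must be a non-negative (half-)integer")
    return _cache[n]                  # array lookup, no gamma call
```

After the one-time precomputation, every call is a single array lookup, which is what replaces the costly repeated \lstinline{gamma} evaluations.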
Two solutions to the first issue can be imagined: either increase the precision of the considered variables, that is, use quadruple-precision (\num{128}-bit, QP) FP arithmetic instead of DP, or rewrite the internal summation such that the risk of encountering large terms is reduced. For example, one could write \lstinline{return_value = return_value + product(num/den)}, pairing each numerator with a denominator before multiplying.
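The effect of this pairing can be illustrated with a deliberately extreme Python example (the values are artificial, chosen near the DP range, and are not actual $T_{lk}^{pj}$ terms):

```python
import math

# Artificial values near the double-precision overflow threshold:
# forming the full numerator and denominator products overflows to
# inf, while pairing each numerator with a denominator keeps every
# intermediate result of order one.
num = [1e200, 1e200]
den = [1e200, 1e200]

naive = (num[0] * num[1]) / (den[0] * den[1])   # inf / inf -> nan
paired = (num[0] / den[0]) * (num[1] / den[1])  # stays finite: 1.0
```

The pairing does not change the mathematical value, only the order of operations, which is why it mitigates (without eliminating) the overflow risk.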
We note that using QP arithmetic has a strong impact on performance, with computation times about \num{10} times longer, because most modern processors' floating-point units (FPUs) are designed to handle DP variables (\cref{app:benchmarks}, ``875839a'').
Using \num{128}-bit variables thus requires splitting each variable across different pipelines and leads to dependent computations, which is the main reason for the slowdown.
One can also argue that neither of the two options above is satisfactory, as neither prevents overflows from occurring for sufficiently high indices in $T_{lk}^{pj}$ and $T_{lkm}^{pj}$.
Moreover, when comparing the outputs of the code with reference values obtained from Mathematica, large errors appear well before any overflow is noticed, see \cref{fig:roundoffs}. We shall see that these errors arise from the FP arithmetic itself. Since they become significant already when computing the $T_{lk}^{pj}$ coefficients, the computation of $T_{lkm}^{pj}$, which depends on $T_{lk}^{pj}$, is left aside in the next paragraph.
\begin{figure}
\centering
\begin{subfigure}{0.45\linewidth}
\centering
\includegraphics[width=\linewidth]{figs/t4_default.png}
\subcaption{DP error, $\delta T_{30,30}^{30,30}\approx\num{-3.42e-9}$.}
\label{subfig:roundoffs_dbl}
\end{subfigure}
\hfill
\begin{subfigure}{0.45\linewidth}
\centering
\includegraphics[width=\linewidth]{figs/t4_quadprc.png}
\subcaption{QP error. $\delta T_{30,30}^{30,30}$ is negligible to machine precision.}
\label{subfig:roundoffs_quad}
\end{subfigure}
\caption{$T_{30,30}^{p,j}$ table as output by the code in DP (\labelcref{subfig:roundoffs_dbl}) and QP (\labelcref{subfig:roundoffs_quad}). All coefficients are supposed to be zero, except $T_{30,30}^{30,30}\approx\num{-696.46}$.}
\label{fig:roundoffs}
\end{figure}
\paragraph{Floating-point issues}
The errors presented in the previous paragraph can be attributed to round-off errors from FP arithmetic. Consider, for example, the computation of $T_{30,30}^{0,0}$. This coefficient is supposed to be zero, as computed by Mathematica. However, the code yields \num{6.04296e27}, a significant error.
By plotting the intermediate terms that are successively summed in \cref{lst:t4_v1}, \cref{fig:roundoffs_intrm}, we see that terms of widely varying magnitudes appear.
\begin{figure}
\centering
\includegraphics[scale=0.6]{figs/t4_default_intrm.png}
\caption{Plot of the magnitude ($\log_{10}$) of the intermediate terms intervening in the sum computing $T_{30,30}^{0,0}=0$. Positive terms are shown on the left, negative ones on the right.}
\label{fig:roundoffs_intrm}
\end{figure}
In general, a FP number is represented as
\begin{equation}
\pm d_0.d_1d_2\ldots d_{p-1} \times \beta^e = (d_0 + d_1\beta^{-1} + \ldots + d_{p-1}\beta^{-(p-1)}) \beta^e,
\label{eq:fpformat}
\end{equation}
with $0\leq d_i<\beta$, where $\beta$ is the base, $\{d_i\}_i$ the fraction, $\pm$ the sign and $e$ the exponent. According to the IEEE-754 standard, $\beta=2$, and the exponent and fraction are represented by $11$ and $52$ bits respectively, which, together with the sign bit, compose the $64$-bit DP format. Thus, the best relative precision one gets from FP arithmetic amounts to $2^{-53}\sim\num{1.110223e-16}$.
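These numbers can be verified directly in any IEEE-754 environment; for instance, a short Python check (illustrative only, independent of the Fortran code):

```python
import sys
from math import frexp

# 53 significand bits (52 stored + 1 implicit leading bit), hence a
# unit round-off of 2**-53 in double precision.
mant_bits = sys.float_info.mant_dig  # 53
eps = 2.0**-53                       # ~1.11e-16
below = (1.0 + eps == 1.0)           # True: eps is rounded away

# frexp exposes the fraction/exponent decomposition (normalised to
# 0.5 <= fraction < 1 rather than to a leading digit d0):
m, e = frexp(6.5)                    # 6.5 = 0.8125 * 2**3
```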
At this point, two conclusions can be drawn: \begin{enumerate*}[label=(\arabic*)]
\item the intermediate terms themselves suffer from a rounding error, as they are likely not exactly representable in the format \labelcref{eq:fpformat}, and
\item digits are rounded away when summing terms with a large magnitude difference.
\end{enumerate*}
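Conclusion (2) can be reproduced in a few lines of Python (the values are chosen for demonstration only):

```python
# At magnitude 2**53, adjacent doubles are spaced by 2, so adding 1.0
# changes nothing: its digits are entirely rounded away.
big = 2.0**53
lost = (big + 1.0) - big             # 0.0: the 1.0 disappeared

# Summation order therefore matters when magnitudes differ widely.
terms = [big] + [1.0] * 64
fwd = 0.0
for t in terms:                      # large term first: each 1.0 is lost
    fwd += t
bwd = 0.0
for t in reversed(terms):            # small terms first: their sum survives
    bwd += t
```

Here the forward sum equals `big` exactly, while the reversed sum correctly accumulates the 64 small terms first, mirroring how small intermediate terms are lost in \cref{lst:t4_v1}.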
Thus, considering our algorithm once again, it becomes evident that many intermediate terms in \cref{fig:roundoffs_intrm} will be lost to round-off errors.
Note that \cref{fig:roundoffs,fig:roundoffs_intrm} are consistent, since the magnitudes of the maximal intermediate terms and of the errors in \cref{subfig:roundoffs_dbl} are separated by about \num{16} orders of magnitude (roughly machine precision).
Using QP improves these estimates by another \num{16} orders of magnitude, \cref{subfig:roundoffs_quad}, as expected.
In an attempt to reduce the rounding errors, the so-called \emph{Kahan summation} technique (also known as compensated summation) is investigated \cite{kahan1965}. To reduce summation errors in FP arithmetic, a compensation term is introduced and updated in parallel with the sum variable. Each time an intermediate result is added to the sum variable, any round-off error is estimated and accumulated in the compensation term. We specifically consider the variant by \textcite{neumaier1974}, where the compensation term is added only once, after the summation is complete. It also accounts for cases where the partial sum is small compared to a newly added term; see \cref{alg:kahan_improved} for pseudocode.
\begin{algorithm}
\begin{algorithmic}
\Function{NeumaierSum}{input}
\State $s,c\gets\num{0.0}$
\For{$v \in \text{input}$}
\State $t \gets s + v$
\If{$\abs{s} \geq \abs{v}$}
\State $c \gets c + (s-t) + v$ \Comment{Low-order digits of $v$ are lost.}
\Else
\State $c \gets c + (v-t) + s$ \Comment{Low-order digits of $s$ are lost.}
\EndIf
\State $s \gets t$
\EndFor
\State \Return $s+c$ \Comment{Add compensation when returning}
\EndFunction
\end{algorithmic}
\caption{Improved Kahan summation from Neumaier.}
\label{alg:kahan_improved}
\end{algorithm}
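The pseudocode in \cref{alg:kahan_improved} transcribes almost line for line into executable form; as a sketch, here in Python (the production implementation is in Fortran):

```python
def neumaier_sum(values):
    """Neumaier's variant of Kahan summation: the compensation c
    accumulates the round-off lost at each step and is added back
    once, when returning."""
    s = 0.0
    c = 0.0
    for v in values:
        t = s + v
        if abs(s) >= abs(v):
            c += (s - t) + v         # low-order digits of v were lost
        else:
            c += (v - t) + s         # low-order digits of s were lost
        s = t
    return s + c

# Stress case with terms of very different magnitudes: naive summation
# returns 0.0, the compensated sum recovers the exact result 2.0.
data = [1.0, 1e100, 1.0, -1e100]
```

Each compensation update exploits the fact that, once `t = s + v` has been rounded, the quantity `(s - t) + v` (or `(v - t) + s`) is exactly representable and equals the committed round-off.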
On \cref{fig:roundoffs_kahan}, we compare the numerical errors produced by the original code and the version using the improved Kahan summation.
\begin{figure}
\centering
\begin{subfigure}{.45\linewidth}
\centering
\includegraphics[width=\linewidth]{figs/t4_default.png}
\subcaption{$T_{30,30}^{pj}$ coefficients, without compensation}
\label{subfig:kahan_matrix_orig}
\end{subfigure}
\hfill
\begin{subfigure}{.45\linewidth}
\centering
\includegraphics[width=\linewidth]{figs/t4_neumaier.png}
\subcaption{$T_{30,30}^{pj}$ coefficients, Neumaier summation}
\label{subfig:kahan_matrix_comp}
\end{subfigure}
\begin{subfigure}{.45\linewidth}
\centering
\includegraphics[width=\linewidth]{figs/t4_default_err.png}
\subcaption{Basis reconstruction ($l,k=6$), no compensation}
\label{subfig:kahan_basis_orig}
\end{subfigure}
\hfill
\begin{subfigure}{.45\linewidth}
\centering
\includegraphics[width=\linewidth]{figs/t4_neumaier_err.png}
\subcaption{Basis reconstruction ($l,k=6$), Neumaier summation}
\label{subfig:kahan_basis_comp}
\end{subfigure}
\caption{Numerical errors obtained when running the original code, compared to an implementation using the Neumaier compensated summation. Errors on $T_{lk}^{pj}$ coefficients are considered in \cref{subfig:kahan_matrix_orig,subfig:kahan_matrix_comp}, as well as the error after basis reconstruction, \cref{subfig:kahan_basis_orig,subfig:kahan_basis_comp}. Marginal improvements are made, but are clearly outperformed by the QP implementation, \cref{subfig:roundoffs_quad}.}
\label{fig:roundoffs_kahan}
\end{figure}
One may observe that the error improves in most regions by an order of magnitude. However, a large number of coefficients still display a significant absolute error, such that the reconstruction of high-order basis elements will surely lead to wrong estimates.
These considerations suggest that FP arithmetic is not suited to the application at hand. All the solutions considered so far (increasing the variable size to QP, rearranging the computation of inner sum terms, and alternative summation algorithms) have proven either detrimental to performance or only able to partially solve the issues initially encountered (round-off errors, overflows and computation speed). In the next \namecref{sec:fp}, we present an alternative that removes the first two problems at the cost of additional computational power.
