The computation of coefficients \labelcref{eq:T4lorenzo,eq:t5jorge} proved to be numerically unstable using FP arithmetic, specifically due to the summation of intermediate terms much larger than the expected final result.
In such situations, it was seen that increasing the number of significant digits used in the computations, from \num{16} in DP to \num{32} in QP, could help overcome the said instability.
As such problems are common, a variety of libraries extending popular programming languages to \emph{multiple-precision} (MP) arithmetic, and this is no exception in Fortran \cite{bailey_fortran_1995, brent_fortran_1978, smith_algorithm_1991}.
We choose to rewrite our implementation using the \emph{Floating-point Multiple-precision} package (abbreviated FMLIB) of \textcite{smith_algorithm_1991} to achieve better accuracy, which implements stable algorithms for precisions of no more than a few thousand digits \cite{smith_efficient_1989}.
Custom routines computing factorials and binomials of positive integers and half-integers are also rewritten for performance concerns.
\paragraph{Comparison with FP implementation}
The accuracy of the new code is first validated against Mathematica, as performed in \cref{sec:fp}, to ensure correct implementation, see \cref{fig:mathematica_t45_checks_fm}. Unless otherwise mentioned, a precision of \num{50} significant digits is considered. It can be seen that the errors affecting the $T_{lk}^{pj}$ coefficients have negligible impact on the reconstruction.
\caption{Error of the basis reconstruction, using \num{50} significant digits precision.}
\label{fig:mathematica_t45_checks_fm}
\end{figure}
While these results are satisfying, an important drop in performance of the code is noticed (\cref{app:benchmarks}, ``36b3b7a''). Improvements have been yielded by avoiding duplicating computation of intermediate terms in the sum\footnote{Consider \cref{lst:t4_v1}, this would account to moving individual elements of \lstinline{num} and \lstinline{den} to the most outer loops possible.} as well as restricting the range of the loops based on mathematical constraints (diverging terms in the denominator, modulo condition, \dots).
Based on observations of the output, the following postulate is suggested:
\begin{equation}
T_{lk}^{pj}
=
\begin{cases}
0 &\text{if } l+2k \neq p+2j, \\
\text{\cref{eq:T4lorenzo}}&\text{otherwise.}
\end{cases}
\label{eq:t4postulate}
\end{equation}
It is likely that this result can be derived from symmetry considerations in \cref{eq:t4integral}, but delay this justification to future works, due to time constraints.
Considering the aforementioned modifications, the performance for computing $T_{lk}^{pj}$ in MP arithmetic is greatly improved (\cref{app:benchmarks}, ``74c1d49''), and removes the possibility of numerical errors on most coefficients altogether.
Such modifications are however difficult to replicate when considering the $T_{lkm}^{pj}$ coefficients, as defined in \cref{eq:t5jorge}. Although avoiding work duplication when computing intermediate terms of the sum is a straightforward task, other improvements such as limiting the range of the loops become much more obscure. This observation is due to the number of loops (accounting for the ones hidden behind $T_{l'k'}^{pj}$) required to compute these coefficients.
As it seems impractical to improve the performance from a programming point of view, an attempt to find a simpler expression, independent from the $T_{lk}^{pj}$, is presented in the next paragraph.
\paragraph{Alternative derivation of \texorpdfstring{$T_{lkm}^{pj}$}{Tpjlkm}}
The former expression of the $T_{lkm}^{pj}$ coefficients, \cref{eq:t5jorge}, is obtained by substituting a series of Legendre polynomial in place of the associated Legendre polynomial, $P_l^m(\xi)$, in \cref{eq:t5integral}. Then, the problem can be mapped to make use of the $T_{lk}^{pj}$ coefficients, as developed in \cref{eq:T4lorenzo}. However, as we have seen it, the result is computationally expensive, and we seek an easier form for the MP implementation.
Consider starting from \cref{eq:t5integral}. We rewrite it in terms of integration variables $x=v_\parallel$ and $y=v_\perp^2$,
The closed-form of the generalized Laguerre and associated Legendre polynomials, see \cref{eq:laguerre_gen_poly,eq:legendre_gen_poly}, can then be substituted,
Shifting the $i$-sum such that it starts at zero, one obtains
\begin{multline}
T_{lkm}^{pj}
=
\frac{(-1)^m 2^l}{\sqrt{\pi} 2^p p!}
\int_{\mathbb{R}}\dd{x}\int_{\mathbb{R^+}}\dd{y}
\sum_{i=0}^{l-m}\sum_{r=0}^{k}
\binom{l}{i+m}\binom{(l+i+m-1)/2}{l}
\times\\
\binom{k+l+1/2}{k-r}
\frac{(-1)^r (i+m)!}{i! r!}
x^i (x^2+y)^{r+(l-m-i)/2}
H_p(x) L_j(y) e^{-x^2-y}.
\end{multline}
One can see that contrarily to \cref{eq:t4integralsplitting}, a coupling term $(x^2+y)^{r+(l-m-i)/2}$ remains, such that the integration cannot clearly be separated using the binomial identity, \cref{eq:newtonbinom}.
In half of the cases, when $l-m-i$ is even, it is possible to use the binomial theorem, and in the other, it is necessary to find a workaround.
In the following, the two following cases will be distinguished: when $l-m=2\Delta$ and when $l-m=2\Delta+1$, where $\Delta\in\mathbb{N}$. Only the 1\textsuperscript{st} situation is studied here, since solving for the second one is of very similar nature. As suggested above, the $i$-sum is split in even and odd terms\footnote{A factor two is introduced: $\sum_{i}\mapsto\sum_{2i}+\sum_{2i+1}$.},
\begin{multline}
T_{lkm}^{pj}
=
\frac{(-1)^m 2^l}{\sqrt{\pi} 2^p p!}
\int_{\mathbb{R}}\dd{x}\int_{\mathbb{R^+}}\dd{y}
\sum_{r=0}^{k}
\biggl\lbrace
\sum_{i=0}^{\Delta}
\binom{l}{2i+m}\binom{i+(l+m-1)/2}{l}
\times\\
\binom{k+l+1/2}{k-r}
\frac{(-1)^r (2i+m)!}{(2i)! r!}
x^{2i} (x^2+y)^{r+\Delta-i}
+
\sum_{i=0}^{\Delta-1}
\binom{l}{2i+m+1}
\times\\
\binom{i+(l+m)/2}{l}\binom{k+l+1/2}{k-r}
\frac{(-1)^r (2i+m+1)!}{(2i+1)! r!}
x^{2i+1} (x^2+y)^{r+\Delta-i-1/2}
\biggr\rbrace
\times\\
H_p(x) L_j(y) e^{-x^2-y}.
\label{eq:t5integralChapter01}
\end{multline}
The first term can readily be integrated, as performed in \cite{perrone_4-dimensional_2018}, by using the binomial theorem \labelcref{eq:newtonbinom}.
Thus, splitting \cref{eq:t5integralChapter01} in two terms (from the inner sum), $T_{lkm}^{pk}=\tau_1+\tau_2$,
where relations \labelcref{eq:laguerre_x_integral,eq:hermite_x_integral} were use to perform the integrals. Similarly to the $T_{lk}^{pj}$ coefficients \eqref{eq:T4lorenzo}, we retrieve a modulo factor.
To compute the second term, $\tau_2$, consider rewriting the coupled term as
where the first term may be expanded using the binomial theorem
%NOTE The exponent for which the binomial theorem is applied is problematic. This term could potentially be negative, but using $\Delta=l+m$ doesn't avoid these issues. If problems arise, one should notice them when $m>l$, ie. when $\Delta<0$.
and the following Taylor expansion is considered for the square root term,
If this expansion is inserted in the integral, it can be noticed that the infinite sum will be truncated as powers of $x$ are lowered (with increasing $\eta$), when the Hermite integration is performed, see \cref{eq:hermite_x_integral}.
where one Kronecker delta (coming from the Hermite integral) was used to bound the $\eta$-sum.
The integration of the case where $l-m=2\Delta+1$ is performed very similarly, with the only notable difference being that both $i$-sums range up to $\Delta$. Finally, one is left with
where we have $\theta_1=1-\theta_2=\bmod(l-m,2)$, used to remove the modulo terms that we got previously from even and odd $l-m$ computations, and $\eta_L^{isplm}=i+s+(1+\theta_2-p)/2$.
This equation is clearly cheaper to compute than \cref{eq:t5jorge}, since we need to compute at most four sums and because several conditions for cancellations (modulo and Kronecker terms) were introduced by expansions \labelcref{eq:laguerre_x_integral,eq:hermite_x_integral}.
\paragraph{Validation}
The implementation of \cref{eq:t5stenger} using the FMLIB package is conducted. Similarly to the implementation of the $T_{lk}^{pj}$ coefficients, when possible, terms from the numerator and denominator belonging to intermediate terms are computed in the most-outer loop possible, and the range of the loops is limited accordingly with the $\delta$ terms.
Validation of this new formulation is performed in \cref{fig:mathematica_t45_checks_fm2}, where the coefficients are computed accurately to the 50\textsuperscript{th} significant digit.
It can be seen that the reconstruction is correct, displaying negligible error.
\caption{Error of the basis reconstruction, using \num{50} significant digits precision and $l,m=2$, $k=1$.}
\label{fig:mathematica_t45_checks_fm2}
\end{figure}
Performance-wise, an improvement is already observed, but remains insufficient for computing large arrays of such coefficients. Based on the output data, the following postulate is considered:
Similarly to postulate \labelcref{eq:t4postulate}, it is suggested that symmetry considerations in \cref{eq:t5integral} lead to this result, but leave this explicit derivation to future work due to time constraints.
It can be noted that taking $m=0$ (for which $T_{lk0}^{pj}=T_{lk}^{pj}$) retrieves a structure close to postulate \labelcref{eq:t4postulate}. The modulo term is included in \cref{eq:T4lorenzo}. The fact that an inequality, $l\geq p+2j-2k$ remains in place of a previously obtained equality suggests that an additional condition may exist.
It can be verified that the error presented in \cref{fig:mathematica_t45_checks_fm2} remains invariant.
With this new condition, the computation time for large coefficient matrices sizes is drastically reduced (\cref{app:benchmarks}, ``74c1d49'').