This project proposes a way to increase the performance of a sequential code using classical parallelism techniques. It is based on a Matlab code by \textit{Nicolas Richart} that simulates the evolution of a tsunami. The application is written in C++ and uses the industry standards MPI and OpenMP. The purpose of this report is to describe the current state of the project and how the sequential code will be parallelized.
}
\section{Scientific Background}
A parallel iterative finite-volume method for solving the two-dimensional shallow water wave equations on a distributed system using the Message Passing Interface (MPI) and OpenMP is presented. The method is based on a domain decomposition in which the global 2D domain is divided into multiple sub-domains along the horizontal axis, one sub-domain per process. The shallow water wave equations are solved by an explicit iterative scheme, and global quantities such as the time-step constraint are shared by all processes.
\subsection{Introducing the Model}
The shallow water wave equations form a non-linear hyperbolic system of coupled partial differential equations often used to model various wave phenomena. Simulation of ocean waves, river flows, hydraulic engineering and atmospheric modelling are among the many areas of application. We use the two-dimensional version of the equations together with a right-hand-side source term accounting for the topography as our model.
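In standard conservative form (reproduced here from the literature; the exact formulation used in the original Matlab code may differ in minor details), the system reads
\begin{equation}
\frac{\partial}{\partial t}
\left( \begin{array}{c} h \\ hu \\ hv \end{array} \right)
+ \frac{\partial}{\partial x}
\left( \begin{array}{c} hu \\ hu^2 + \frac{1}{2} g h^2 \\ huv \end{array} \right)
+ \frac{\partial}{\partial y}
\left( \begin{array}{c} hv \\ huv \\ hv^2 + \frac{1}{2} g h^2 \end{array} \right)
=
\left( \begin{array}{c} 0 \\ - g h \, \partial z / \partial x \\ - g h \, \partial z / \partial y \end{array} \right)
\end{equation}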
In the above, $h := h(x,y,t)$ denotes the water height, $u := u(x,y,t)$ and $v := v(x,y,t)$ the water velocity in the $x$ and $y$ directions respectively, $z := z(x,y)$ the topography, and $g = 9.82\,\mathrm{m/s^2}$ the gravitational constant.
\subsection{Introducing the Numerical Scheme}
Finite volume schemes are a popular approach for computing an approximate solution to hyperbolic equations, as the underlying physics is represented in a natural way. Let $ I_{i,j} = \big[ x_{i-1/2} , x_{i+1/2} \big] \times \big[ y_{j-1/2} , y_{j+1/2} \big] $
define a cell of a structured rectangular uniform mesh. In a finite-volume scheme, one seeks the cell average $\bar{q}_{i,j} (t_n)$ that approximates $q(x_i,y_j,t_n)$ at every cell $I_{i,j}$ for a given time-step $t_n$, in the sense
\begin{equation}
\bar{q}_{i,j} (t_n) = \frac{1}{\Delta x \Delta y} \int_{I_{i,j}} q(x,y,t_n)\,dy\,dx
\end{equation}
We use the Lax-Friedrichs method to discretize the problem. The method is explicit, meaning that the solution $\bar{q}_{i,j}^{n+1}$ for all $i, j$ at time-step $t_{n+1}$ may be computed directly from the solution $\bar{q}_{i,j}^{n}$ at the previous time-step, without solving any system of equations, using three update equations, one for each conserved variable.
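In its standard form (quoted here for reference; the momentum updates additionally contain the pressure fluxes and the topography source term), the Lax-Friedrichs update of the water height reads
\begin{equation}
\bar{h}^{\,n+1}_{i,j} = \frac{1}{4}\left( \bar{h}^{\,n}_{i+1,j} + \bar{h}^{\,n}_{i-1,j} + \bar{h}^{\,n}_{i,j+1} + \bar{h}^{\,n}_{i,j-1} \right)
- \frac{\Delta t}{2 \Delta x}\left( \overline{hu}^{\,n}_{i+1,j} - \overline{hu}^{\,n}_{i-1,j} \right)
- \frac{\Delta t}{2 \Delta y}\left( \overline{hv}^{\,n}_{i,j+1} - \overline{hv}^{\,n}_{i,j-1} \right)
\end{equation}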
The momentum components $\overline{hu}^{\,n+1}_{i,j}$ and $\overline{hv}^{\,n+1}_{i,j}$ are calculated in a similar manner.
After each simulation time-step, we need to compute a new time-step length to make sure that the CFL (Courant-Friedrichs-Lewy) condition is satisfied. The time-step $\Delta t^n$ is therefore recomputed at every iteration from the maximum wave speed over the whole grid.
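A usual estimate for explicit schemes of this type (stated here as an assumption; the exact constant depends on the scheme) is
\begin{equation}
\Delta t^{n} \le \frac{\min(\Delta x, \Delta y)}{\sqrt{2}\; \displaystyle\max_{i,j}\left( \sqrt{u^2 + v^2} + \sqrt{g h} \right)^{n}_{i,j}}
\end{equation}
which is why the time-step computation reduces to finding a maximum over the whole grid.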
\begin{figure}[h]
\centering
% graphic of the simulated wave height goes here
\caption{\label{Tsunamifigure} Simulation of a tsunami with $N=4001$, $T_{end} = 0.2$}
\end{figure}
\section{Implementations}
The application is implemented in C++. We will investigate two implementations of the same solver:
\begin{itemize}
\item A shared memory version using the OpenMP paradigm~\cite{openmp}
\item A distributed memory version using the MPI library~\cite{mpi}
\end{itemize}
The code will be fully debugged using the \texttt{gdb} debugger, and the \texttt{Valgrind} tool will be used to detect and remove memory leaks.
\subsection{Description of the Code}
The code is arranged in the following way. The algorithm computes the evolution of the height of the wave $H(x,y)$, based on the knowledge of the topography and of the evolution of the speed of the wave in the $x$ and $y$ directions. To start the computation, the algorithm first reads the initial conditions for the different variables ($H$, $HU$, $HV$, $Zdx$, $Zdy$) from binary files. The duration of the simulation $T_{end}$ is specified by the user. While this time is not reached, the algorithm computes the next variable time step and then the evolution of the height and the speed: $H$, $HU$, $HV$. The update of the variables only needs information from the direct neighbours of a point $(x,y)$. When the simulation is over, the height of the wave at the final time step is stored in a file.
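The overall structure of the sequential code can be summarised by the following sketch. The function names \texttt{compute\_step} and \texttt{compute\_mu\_and\_set\_dt} are taken from the profiling table [TAB. \ref{Table2}], but their signatures, the file names and the kernel bodies are assumptions made purely for illustration.
\begin{verbatim}
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

using Field = std::vector<double>;   // flattened N x N grid, stored row-major

// Read / write a whole field from / to a raw binary file (sequential version).
Field read_field(const std::string &name, std::size_t n) {
  Field f(n * n);
  std::ifstream in(name, std::ios::binary);
  in.read(reinterpret_cast<char *>(f.data()), f.size() * sizeof(double));
  return f;
}
void write_field(const std::string &name, const Field &f) {
  std::ofstream out(name, std::ios::binary);
  out.write(reinterpret_cast<const char *>(f.data()), f.size() * sizeof(double));
}

// Numerical kernels, reduced to placeholders in this sketch.
double compute_mu_and_set_dt(const Field &h, const Field &hu, const Field &hv) {
  (void)h; (void)hu; (void)hv;
  return 1e-3;                       // placeholder for the CFL-limited time step
}
void compute_step(Field &h, Field &hu, Field &hv,
                  const Field &zdx, const Field &zdy, std::size_t n, double dt) {
  (void)h; (void)hu; (void)hv; (void)zdx; (void)zdy; (void)n; (void)dt;
  // Lax-Friedrichs update of H, HU and HV (omitted here).
}

int main() {
  const std::size_t n = 2001;        // grid resolution
  const double t_end = 0.2;          // simulation end time T_end

  // Initial conditions read from binary files (file names are illustrative).
  Field h   = read_field("H.bin", n),  hu  = read_field("HU.bin", n);
  Field hv  = read_field("HV.bin", n), zdx = read_field("Zdx.bin", n);
  Field zdy = read_field("Zdy.bin", n);

  double t = 0.0;
  while (t < t_end) {
    double dt = std::min(compute_mu_and_set_dt(h, hu, hv), t_end - t);
    compute_step(h, hu, hv, zdx, zdy, n, dt);   // update H, HU, HV
    t += dt;
  }
  write_field("H_final.bin", h);     // store the final water height
  return 0;
}
\end{verbatim}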
\subsection{Parallelization}
The sequential code will be parallelized using \texttt{MPI} and \texttt{OpenMP}. The reader, the writer and the time-step computation, as well as the update of the height and the speed ($H$, $HU$, $HV$), will all be parallelized.
The parallelization of the reader and writer will make use of the parallel I/O facilities of MPI. Computing the next time step requires finding the maximum of a quantity over the whole grid. This can be done in parallel with the following design: as \texttt{C++} stores arrays in row-major order, the grid will be split between nodes in blocks of $\frac{N}{P}$ complete rows. Each process will then compute the maximum over its sub-grid, and the final maximum will be computed at the end on a single node. The parallelization of the update of $H$, $HU$ and $HV$ will be approached in a similar manner: the grid will be split as before, and a row of ghost cells will be added before and after each sub-grid (except before the first one and after the last one) so that the stencil can be evaluated at the sub-domain boundaries. Most of the code will therefore be parallelized, as illustrated in the sketch below.
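A minimal sketch of these building blocks, with assumed names, signatures and data layout, is given below. It uses \texttt{MPI\_File\_read\_at\_all} for the parallel reader, an \texttt{MPI\_Reduce} followed by an \texttt{MPI\_Bcast} for the global maximum (the final reduction happens on a single rank, as described above), \texttt{MPI\_Sendrecv} for the ghost-row exchange, and an OpenMP \texttt{parallel for} over the local rows for the shared-memory version.
\begin{verbatim}
#include <mpi.h>
#include <algorithm>
#include <vector>

// Each rank owns n_loc = N/P complete rows (N assumed divisible by P).
// Local storage has n_loc + 2 rows: row 0 and row n_loc + 1 are ghost rows.

// Collective read of this rank's block of rows from a raw binary file of doubles.
void read_block(const char *name, std::vector<double> &f, int n_loc, int n, int rank) {
  MPI_File fh;
  MPI_File_open(MPI_COMM_WORLD, name, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
  MPI_Offset offset = (MPI_Offset)rank * n_loc * n * sizeof(double);
  MPI_File_read_at_all(fh, offset, &f[1 * n], n_loc * n, MPI_DOUBLE,
                       MPI_STATUS_IGNORE);
  MPI_File_close(&fh);
}

// Global maximum needed for the time step: local max, reduction on rank 0,
// then a broadcast so that every rank uses the same dt.
double global_max(const std::vector<double> &local_values) {
  double loc = *std::max_element(local_values.begin(), local_values.end());
  double glob = 0.0;
  MPI_Reduce(&loc, &glob, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
  MPI_Bcast(&glob, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
  return glob;
}

// Exchange of the ghost rows with the upper and lower neighbours.
void exchange_ghost_rows(std::vector<double> &f, int n_loc, int n, int rank, int size) {
  int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
  int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;
  // first real row -> up, bottom ghost row <- down
  MPI_Sendrecv(&f[1 * n],           n, MPI_DOUBLE, up,   0,
               &f[(n_loc + 1) * n], n, MPI_DOUBLE, down, 0,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  // last real row -> down, top ghost row <- up
  MPI_Sendrecv(&f[n_loc * n],       n, MPI_DOUBLE, down, 1,
               &f[0],               n, MPI_DOUBLE, up,   1,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

// Update of the local rows; the outer loop is shared between OpenMP threads.
void update_block(std::vector<double> &h_new, const std::vector<double> &h_old,
                  int n_loc, int n) {
  #pragma omp parallel for
  for (int i = 1; i <= n_loc; ++i)
    for (int j = 1; j < n - 1; ++j)
      h_new[i * n + j] = h_old[i * n + j];   // Lax-Friedrichs stencil omitted

}

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int n = 2048;                       // illustrative size, divisible by P
  const int n_loc = n / size;               // rows owned by this rank
  std::vector<double> h((n_loc + 2) * n, 0.0), h_new(h);

  read_block("H.bin", h, n_loc, n, rank);   // parallel read of the initial height
  exchange_ghost_rows(h, n_loc, n, rank, size);
  double mu = global_max(h);                // stands in for the wave-speed maximum
  (void)mu;
  update_block(h_new, h, n_loc, n);

  MPI_Finalize();
  return 0;
}
\end{verbatim}
An \texttt{MPI\_Allreduce} with the \texttt{MPI\_MAX} operation would combine the reduction and the broadcast into a single collective call and is a natural alternative.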
\section{Prior results}
Translating the code from \texttt{Matlab} to \texttt{C++} already allows us to compute the solution approximately 5 times faster [TAB. \ref{Table1}]. This performance will be improved drastically using parallelization techniques.
\begin{table}[h]
\begin{center}
\begin{tabular}{l*{3}{r}r}
Grid size & Matlab [s] & C++ [s] & \\
\hline
$2001 \times 2001$ & 1233.068 & 226.741 & \\
$4001 \times 4001$ & 10625.835 & 2003.880 & \\
\end{tabular}
\caption{Total time in seconds for a run on a 3.1 GHz Intel i5}
\label{Table1}
\end{center}
\end{table}
\subsection{Strong scaling}
The speed-up of a given parallel problem depends mostly on the fraction of the sequential code that can be parallelized. In the case of this problem, most of the code can be parallelized. A first estimate can be deduced from the profiling table for the $2001 \times 2001$ grid [TAB. \ref{Table2}]. Summing up the three main functions that can be parallelized (\texttt{compute\_step}, \texttt{compute\_mu\_and\_set\_dt} and \texttt{imposeTolerances}) shows that 98.06\% of the runtime is spent in parallelizable code. The speed-up is then given by Amdahl's law:
\begin{equation}
S_p = \frac{1}{\alpha + \frac{1-\alpha}{p}}
\end{equation}
where $\alpha$ is the fraction of non-parallelizable runtime, in this case 1.94\%, and $p$ is the number of cores used to compute the simulation.
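For example, with $\alpha = 0.0194$, Amdahl's law predicts a speed-up of $S_{16} = 1/(0.0194 + 0.9806/16) \approx 12.4$ on 16 cores, and an asymptotic limit of $S_\infty = 1/\alpha \approx 51.5$.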
Computing the speed-up using strong scaling for this problem gives the curve shown in [FIG. \ref{SpeedUp}]. This theoretical result will be compared with the measured one once the problem has been parallelized.
\subsection{Weak scaling}
Weak scaling is another way to measure the speed-up. The main difference with strong scaling is that in this case the size of the problem grows with the number of processes, so that the work per process stays roughly constant. The speed-up is given by Gustafson's law:
\begin{equation}
S_p = p-\beta(n)(p-1)
\end{equation}
where $\beta(n)$ is the fraction of non-parallelizable code, which may depend on the problem size $n$, and $p$ is the number of cores used to compute the simulation. We have information on three different setups in the project, corresponding to the grids $2001 \times 2001$, $4001 \times 4001$ and $8001 \times 8001$. As our approach splits the grid only by rows, we can expect the workload of each process to increase as $n$ for the computation on a grid. Furthermore, adding more precision reduces $\Delta x$, leading to a smaller $\Delta t$; we can expect this to increase the workload per process by another factor of order $n$. These hypotheses and the exact speed-up will be verified in the final report.
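For example, assuming that $\beta$ stays close to the measured value of 1.94\% for all grid sizes, Gustafson's law predicts $S_{16} = 16 - 0.0194 \times 15 \approx 15.7$ on 16 cores; whether this assumption holds for the larger grids will be checked against the measurements.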
\section{Resources budget}
\textbf{\texttt{ This part has not been modified but will be completed for the final report }} \\ \\
In order to fulfill the requirements of our project, we present hereafter the resource budget.
\subsection{Computing Power}
\textbf{\texttt{!! FIXME !!}}
\subsection{Raw storage}
\textbf{\texttt{!! FIXME !!}}
\subsection{Grand Total}
\begin{center}
\begin{tabular}{| l | l |}
\hline
Total number of requested cores & \textbf{\texttt{!! FIXME !!}}\\
\hline
Minimum total memory & \textbf{\texttt{!! FIXME !!}} \\
\hline
Maximum total memory & \textbf{\texttt{!! FIXME !!}} \\
\hline
Temporary disk space for a single run & \textbf{\texttt{!! FIXME !!}} \\
\hline
Permanent disk space for the entire project & \textbf{\texttt{!! FIXME !!}} \\
\hline
Communications & \textbf{\texttt{!! FIXME !!}} \\
\hline
License & own code (BSD) \\
\hline
Code publicly available? & \textbf{\texttt{!! FIXME !!}} \\
\hline
Architectures where code ran & \textbf{\texttt{!! FIXME !!}} \\
\hline
\end{tabular}
\end{center}
\section{Scientific outcome}
This work will lead to a fast implementation for the simulation of tsunamis. It will make it possible to run large simulations without waiting too long. It will also teach the principal investigator how to parallelize code in a clean and efficient way.
\begin{thebibliography}{1}
\bibitem{openmp} Dagum L. and Menon R.,{\em OpenMP: An Industry-Standard API for Shared-Memory Programming}, IEEE Computational Science \& Engineering, Volume 5 Issue 1, pp 46-55, January 1998
\bibitem{mpi} The MPI Forum, {\em MPI: A Message-Passing Interface Standard}, Technical Report, 1994
\bibitem{scitas} Scientific IT and Application Support, \url{http://scitas.epfl.ch}, 2015
\end{thebibliography}
\newpage
\appendix
\section{Profiling}
\begin{center}
\begin{table}[h]
\begin{tabular}{l*{7}{r}r}
\% & cumulative & self & & self & total & \\
time & seconds & seconds & calls & ms/call & ms/call & name \\