\frametitle{Before you start your parallel implementation}
\begin{itemize}
\item{\bf You have no serial code:} design your application in a parallel way from scratch
\item{\bf You have a serial code:} follow a Debugging-Profiling-Optimization cycle before any parallelization
\end{itemize}
\end{frame}
\subsubsection{Debugging}
\begin{frame}
\frametitle{Debugging?}
\begin{itemize}
\item Find and correct bugs within an application
\item Bugs can be of many kinds: division by zero, buffer overflow, null-pointer dereference, infinite loops, etc.
\item The compiler is (very) rarely able to recognize a bug at compile time, and its error messages (``syntax error'') are (very) rarely explicit about the actual bug
\item Use standard tools like {\tt gdb} (a minimal session is sketched on the next slide)
\item A multi-threaded code can be tricky to debug (race conditions, deadlocks, etc.)
\item (Complex) tools exist for parallel debugging: {\tt TotalView}, {\tt Allinea DDT} or, more recently, {\tt Eclipse PTP}
\end{itemize}
\end{frame}
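\begin{frame}[containsverbatim]
\frametitle{Debugging: a minimal gdb session}
\begin{block}{}
A minimal sketch ({\tt bug.c} and its content are made up for illustration): compile with debug symbols, then let {\tt gdb} point at the faulty line.
\end{block}
\begin{lstlisting}[language=C,frame=lines]
/* bug.c -- hypothetical example with a classic null-pointer bug */
#include <stdio.h>
int main(void) {
  int *p = NULL;
  *p = 42;            /* crashes here with SIGSEGV */
  printf("%d\n", *p);
  return 0;
}
\end{lstlisting}
\begin{lstlisting}[language=bash,frame=lines]
$ gcc -g -O0 bug.c -o bug
$ gdb ./bug
(gdb) run         # stops at the segmentation fault
(gdb) backtrace   # shows the faulty line in main()
\end{lstlisting}
\end{frame}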
\subsubsection{Profiling}
\begin{frame}
\frametitle{Profiling?}
Where does my application spend most of its time?
\begin{itemize}
\item (good) using tools like {\tt gprof} or {\tt Intel Amplifier}
\item (bad) ``by hand'', instrumenting the code with timers and {\tt printf}'s (a sketch follows on the next slide)
\end{itemize}
\end{frame}
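\begin{frame}[containsverbatim]
\frametitle{Profiling ``by hand'': a sketch}
\begin{block}{}
A minimal sketch of the manual approach ({\tt compute()} is a stand-in for a hot routine): workable for one routine, tedious and error-prone for a whole application, hence the tools above.
\end{block}
\begin{lstlisting}[language=C,frame=lines]
#include <stdio.h>
#include <time.h>

static void compute(void) {     /* stand-in for a hot routine */
  volatile double x = 0.0;
  for (long i = 0; i < 100000000L; i++) x += 1.0;
}

int main(void) {
  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  compute();
  clock_gettime(CLOCK_MONOTONIC, &t1);
  double s = (t1.tv_sec - t0.tv_sec)
           + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
  printf("compute(): %.6f s\n", s);
  return 0;
}
\end{lstlisting}
\end{frame}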
\begin{frame}
\frametitle{Profiling?}
What should be profiled?
\begin{itemize}
\item TTS (Time To Solution)
\item best usage of resources (storage, memory, etc.)
\item scaling behavior of the application (definitions on the next slide)
\item ...
\end{itemize}
\end{frame}
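\begin{frame}
\frametitle{Scaling: standard definitions}
\begin{block}{}
As a reminder (standard definitions, added for completeness): with $T(n)$ the time to solution on $n$ cores,
\end{block}
\[
S(n) = \frac{T(1)}{T(n)}, \qquad E(n) = \frac{S(n)}{n}
\]
\begin{itemize}
\item {\bf strong scaling}: fixed total problem size, increasing $n$
\item {\bf weak scaling}: fixed problem size per core, increasing $n$
\end{itemize}
\end{frame}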
\begin{frame}[containsverbatim]
\frametitle{Profiling: an example with gprof}
\begin{itemize}
\item{{\bf MiniFE as a test application}
\begin{itemize}
\item 3D implicit finite-elements on an unstructured mesh
\item mini-application written in C++
\item\url{http://www.mantevo.org}
\end{itemize}
}
\item compile with {\tt -pg -g -O3 -ftree-vectorize}
\item run it: a {\tt gmon.out} file is produced (the {\tt gprof} invocation is shown on the next slide)
\end{itemize}
Empirical ``80--20 rule'' (also known as the Pareto principle): roughly 80\% of the time is spent in 20\% of the code
\begin{itemize}
\item concentrate on these 20\% of the code
\end{itemize}
Example with MiniFE: 80\% of the time is spent in the solver (63\%), the boundary conditions (8\%) and reordering (5\%).
\end{frame}
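\begin{frame}[containsverbatim]
\frametitle{Profiling: reading the gprof output}
\begin{block}{}
A minimal sketch (the binary name {\tt miniFE.x} is an assumption; use your actual executable): {\tt gprof} combines the binary and {\tt gmon.out} into a flat profile and a call graph.
\end{block}
\begin{lstlisting}[language=bash,frame=lines]
$ ./miniFE.x                        # produces gmon.out
$ gprof ./miniFE.x gmon.out > profile.txt
$ less profile.txt                  # flat profile, then call graph
\end{lstlisting}
\end{frame}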
\subsubsection{Parallelization}
\begin{frame}
\frametitle{Parallelization?}
Only when your sequential code has {\bf no bugs} and is {\bf optimized}:
\begin{enumerate}
\item Is it worth parallelizing my code? Does my algorithm scale? (a first estimate: Amdahl's law, next slide)
\item Performance prediction?
\item Timing diagram?
\item Bottlenecks?
\item Which parallel paradigm should I choose? What is the target architecture (SMP, cluster, GPU, hybrid, etc.)?
\end{enumerate}
\end{enumerate}
\end{frame}
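\begin{frame}
\frametitle{A first scaling estimate: Amdahl's law}
\begin{block}{}
A first-order answer to points 1 and 2 (a model, not a measurement): if a fraction $p$ of the runtime can be parallelized, the serial remainder bounds the speedup on $n$ cores.
\end{block}
\[
S(n) = \frac{1}{(1-p) + p/n} \;\le\; \frac{1}{1-p}
\]
\begin{block}{}
Example: with $p = 0.8$ (the Pareto figure above), the speedup can never exceed $1/(1-0.8) = 5$, whatever the number of cores.
\end{block}
\end{frame}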
\subsection{A few words on high performance}
\begin{frame}
\frametitle{Parallelization of an unoptimized code}
% \framesubtitle{(... or parallelization of a non-appropriate algorithm)}
\begin{columns}
% \begin{column}[l]{7cm}
\begin{column}{7cm}
Back in 1991, David H. Bailey (then at NASA Ames Research Center, later at Lawrence Berkeley National Laboratory) published a famous paper in Supercomputing Review: \textit{``Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers''}.
\begin{block}{}
Number 6 was: \textit{``Compare your results against scalar, unoptimized code on Crays.''}
\end{block}
\end{column}
\end{columns}
\end{frame}
\subsubsection{The choice of the (right) compiler}
\begin{frame}[containsverbatim]
\frametitle{The compiler issue}
\label{compilerissue}
\begin{block}{}
The choice of the compiler is \textbf{very} important: different compilers, and even different versions of the same compiler, can lead to different performance on the same code.
\end{block}
\begin{lstlisting}[language=C,frame=lines]
for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    for (k = 0; k < N; k++) {
      C[i][j] = C[i][j] + A[i][k] * B[k][j];
    }
  }
}
\end{lstlisting}
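\begin{block}{}
For instance, the \texttt{i-k-j} variant below (a sketch: same arithmetic, different memory access pattern) makes the inner loop traverse \texttt{C[i][.]} and \texttt{B[k][.]} contiguously, which is usually cache-friendlier in C:
\end{block}
\begin{lstlisting}[language=C,frame=lines]
for (i = 0; i < N; i++) {
  for (k = 0; k < N; k++) {
    for (j = 0; j < N; j++) {   /* contiguous access to C and B */
      C[i][j] = C[i][j] + A[i][k] * B[k][j];
    }
  }
}
\end{lstlisting}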
\begin{block}{}
We change the loop order (\texttt{i-j-k}, \texttt{i-k-j}, \texttt{j-i-k}, \texttt{j-k-i}, \texttt{k-i-j}, \texttt{k-j-i}) and the compiler: \texttt{gcc-4.8.2} and \texttt{icc 15}.