\item $\qty{21.33}{\gibi\byte\per\second}$ for 1 channel
\item Maximum of $\qty{128}{\gibi\byte\per\second}$
\end{itemize}
\pause
\item Or use software that estimates it
\end{itemize}
\begin{itemize}
\item A corollary of ``theoretical'': this bandwidth is not achievable in practice!
\end{itemize}
\end{frame}
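\begin{frame}[t,fragile]
\frametitle{Roofline model}
\framesubtitle{Estimating the achievable bandwidth (sketch)}
\begin{itemize}
\item A minimal, STREAM-like sketch of how such an estimate can be obtained, assuming a single-threaded triad kernel timed with \texttt{std::chrono}. The helper name \texttt{triad\_bandwidth} is hypothetical and not part of any library
\begin{cxxcode}{}
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: bandwidth in bytes per second for one pass of
// c[i] = a[i] + alpha * b[i] over n doubles (n must be > 0).
double triad_bandwidth(std::size_t n, double alpha = 3.0) {
  std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);
  auto t0 = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < n; ++i) {
    c[i] = a[i] + alpha * b[i];
  }
  auto t1 = std::chrono::steady_clock::now();
  // Read one result so the loop is not trivially removed
  volatile double sink = c[n / 2];
  (void)sink;
  double seconds = std::chrono::duration<double>(t1 - t0).count();
  // Counted traffic: 2 loads + 1 store of 8 bytes per iteration
  double bytes = 3.0 * sizeof(double) * static_cast<double>(n);
  return bytes / seconds;
}
\end{cxxcode}
\item A single run on one core will typically report less than the theoretical value; dedicated tools such as the STREAM benchmark repeat the measurement and use all cores
\end{itemize}
\end{frame}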
\begin{frame}[t,fragile]
\frametitle{Roofline model}
\framesubtitle{How to find arithmetic intensity}
\begin{itemize}
\item For very simple algorithms, you can compute the AI
\item Let's revisit the DAXPY example
\begin{cxxcode}{}
const int N = 100000000; // 1e8 elements
for (int i = 0; i < N; ++i) {
  c[i] = a[i] + alpha * b[i];
}
\end{cxxcode}
\item Each iteration performs 2 floating-point operations (1 add and 1 mul)
\item Each iteration also performs three 8-byte memory operations (2 loads and 1 store), i.e. 24 bytes
\item The AI is then $2/24 = 1/12$ FLOP/byte
\pause
\item For more complex algorithms, use a tool, e.g. Intel Advisor
\end{itemize}
\end{frame}
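\begin{frame}[t,fragile]
\frametitle{Roofline model}
\framesubtitle{From arithmetic intensity to a performance bound (sketch)}
\begin{itemize}
\item The hand computation above can be written as a small sketch. It assumes the standard roofline bound $\min(P_{\text{peak}}, \text{AI} \times \text{bandwidth})$; the function names are hypothetical
\begin{cxxcode}{}
#include <algorithm>

// Arithmetic intensity in FLOP/byte
double arithmetic_intensity(double flops, double bytes) {
  return flops / bytes;
}

// Roofline bound in FLOP/s: the lower of the compute roof (peak_flops)
// and the memory roof (ai * bandwidth, with bandwidth in bytes/s)
double attainable_flops(double peak_flops, double bandwidth, double ai) {
  return std::min(peak_flops, ai * bandwidth);
}
\end{cxxcode}
\item For DAXPY, \texttt{arithmetic\_intensity(2, 24)} gives $1/12$, so the kernel is memory bound on any machine whose peak exceeds one twelfth of its bandwidth
\end{itemize}
\end{frame}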
\subsection{Profiling}
\label{sec:profiling}
\begin{frame}
\frametitle{Profiling}
\framesubtitle{A precious ally for optimization}
\begin{itemize}
\item Where is my application spending most of its time?
\begin{itemize}
\item (bad) measure time ``by hand'' with timers and print statements
\item (good) use a tool made for this, e.g. Intel VTune (formerly Amplifier),
Score-P, gprof
\end{itemize}
\end{itemize}
\vfill
\begin{itemize}
\item In addition to timings, profilers give you much more information, such as
\begin{itemize}
\item Memory usage
\item Hardware counters
\item CPU activity
\item MPI communications
\item etc.
\end{itemize}
\end{itemize}
\end{frame}
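\begin{frame}[t,fragile]
\frametitle{Profiling}
\framesubtitle{What timing ``by hand'' looks like (sketch)}
\begin{itemize}
\item To illustrate the ``by hand'' approach, here is a minimal sketch of a hand-rolled region timer. The class \texttt{RegionTimer} is hypothetical; it only accumulates wall time per named region
\begin{cxxcode}{}
#include <chrono>
#include <map>
#include <string>

// Hypothetical hand-rolled profiler: wall time accumulated per region
class RegionTimer {
public:
  void start(const std::string &name) {
    starts_[name] = std::chrono::steady_clock::now();
  }
  // Throws std::out_of_range if start(name) was never called
  void stop(const std::string &name) {
    auto elapsed = std::chrono::steady_clock::now() - starts_.at(name);
    totals_[name] += std::chrono::duration<double>(elapsed).count();
  }
  // Accumulated seconds for a region (0 if never timed)
  double total(const std::string &name) const {
    auto it = totals_.find(name);
    return it == totals_.end() ? 0.0 : it->second;
  }
private:
  std::map<std::string, std::chrono::steady_clock::time_point> starts_;
  std::map<std::string, double> totals_;
};
\end{cxxcode}
\item Every region must be instrumented manually, and nothing is learned about memory, hardware counters or MPI, which is why a real profiler is preferable
\end{itemize}
\end{frame}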
\begin{frame}[fragile]
\frametitle{Profiling}
\framesubtitle{Interactive demonstration}
\begin{itemize}
\item For this exercise, we will use miniFE
\begin{itemize}
\item 3D implicit finite elements on an unstructured mesh
\item C++ mini application
\item \url{https://github.com/Mantevo/miniFE}
\end{itemize}
\item We will use Intel VTune, part of the \href{https://www.intel.com/content/www/us/en/developer/tools/oneapi/toolkits.html\#base-kit}{oneAPI Base Toolkit (free)}