\item\textbf{\texttt{private(\textit{list})}}, \textbf{\texttt{firstprivate(\textit{list})}}, \textbf{\texttt{shared(\textit{list})}} or \textbf{\texttt{copyin(\textit{list})}}: data scoping
\frametitle{Example: the \texttt{if} conditional clause}
\begin{block}{The \texttt{if} clause}
An \texttt{if} clause specifies whether a parallel region is executed in parallel or serially:
\begin{verbatim}
if (n < 2) then
    execute test(n) serially
else
    execute test(n) in parallel
endif
\end{verbatim}
\end{block}
\begin{lstlisting}[language=C,frame=lines]
#pragma omp parallel if (n>1)
test(n);
\end{lstlisting}
\end{frame}
\begin{frame}[containsverbatim]
\frametitle{The \texttt{if} clause [output]}
\begin{verbatim}
vkeller@mathicsepc13:~/OpenMP/exercises/C$ ./ex3
var = 1 : Code is executed by only one thread
Parallelized with 2 threads : 2
Parallelized with 3 threads : 3
Parallelized with 4 threads : 4
\end{verbatim}
\end{frame}
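\begin{frame}[containsverbatim]
\frametitle{The \texttt{if} clause [a possible implementation]}
The source of \texttt{ex3} is not shown here; the following is a minimal sketch that produces output of the same shape (the function name \texttt{run\_test} and the use of \texttt{num\_threads(n)} are our assumptions):
\begin{lstlisting}[language=C,frame=lines]
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* the region is parallel only when n > 1 */
int run_test(int n)
{
    int nthreads = 1;
    #pragma omp parallel if (n > 1) num_threads(n)
    {
#ifdef _OPENMP
        #pragma omp single
        nthreads = omp_get_num_threads();
#endif
    }
    if (nthreads == 1)
        printf("var = %d : Code is executed by only one thread\n", n);
    else
        printf("Parallelized with %d threads : %d\n", n, nthreads);
    return nthreads;
}

int main(void)
{
    for (int n = 1; n <= 4; n++)
        run_test(n);
    return 0;
}
\end{lstlisting}
\end{frame}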
\begin{frame}[containsverbatim]
\frametitle{Data scoping}
What is data scoping?
\begin{itemize}
\item{The most common source of errors}
\item{Determines which variables are {\bf private} to a thread and which are {\bf shared} among all the threads}
\item{For a private variable: what is its value when entering the parallel region ({\bf firstprivate}), and what is its value when leaving it ({\bf lastprivate})}
\item{The default scope (if none is specified) is \textbf{shared}}
\item{The most difficult part of OpenMP}
\end{itemize}
\end{frame}
\begin{frame}[fragile]
\frametitle{The data sharing-attributes \texttt{shared} and \texttt{private}}
\begin{exampleblock}{Syntax}
These attributes determine the scope (visibility) of a single variable or a list of variables
\begin{lstlisting}[language=C,frame=lines]
shared(list1) private(list2)
\end{lstlisting}
\begin{itemize}
\item{The \verb+private+ attribute: the data is private to each thread and uninitialized. Each thread has its own copy. Example: \verb+#pragma omp parallel private(i)+}
\item{The \verb+shared+ attribute: the data is shared among all the threads. It is accessible by (and unprotected from) all the threads simultaneously. Example: \verb+#pragma omp parallel shared(array)+}
\end{itemize}
\end{exampleblock}
\end{frame}
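\begin{frame}[containsverbatim]
\frametitle{\texttt{shared} and \texttt{private} [example]}
A minimal runnable sketch of the two attributes (the helper name \texttt{fill\_squares} is ours): the array is shared among the threads while the loop index is private to each of them:
\begin{lstlisting}[language=C,frame=lines]
#include <assert.h>
#include <stdio.h>

/* array is shared among the threads; the index i is private */
void fill_squares(int *array, int n)
{
    int i;
    #pragma omp parallel for shared(array) private(i)
    for (i = 0; i < n; i++)
        array[i] = i * i;  /* each thread writes distinct elements */
}

int main(void)
{
    int array[8];
    fill_squares(array, 8);
    for (int i = 0; i < 8; i++) {
        printf("array[%d] = %d\n", i, array[i]);
        assert(array[i] == i * i);
    }
    return 0;
}
\end{lstlisting}
\end{frame}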
\begin{frame}[containsverbatim]
\frametitle{The data sharing-attributes \texttt{firstprivate} and \texttt{lastprivate}}
\begin{exampleblock}{Syntax}
These clauses determine the attributes of the variables within a parallel region:
\begin{lstlisting}[language=C,frame=lines]
firstprivate(list1) lastprivate(list2)
\end{lstlisting}
\begin{itemize}
\item{\texttt{firstprivate}: like {\tt private}, but each copy is initialized with the value the variable had before the parallel region}
\item{\texttt{lastprivate}: like {\tt private}, but the value of the sequentially last iteration is copied back to the original variable after the parallel region}
\end{itemize}
\end{exampleblock}
\end{frame}

\begin{frame}[containsverbatim]
\frametitle{The \texttt{single} construct}
Only one thread (usually the first one to reach the region) executes the \textbf{\texttt{single}} region. The others wait for its completion, unless the \textbf{\texttt{nowait}} clause is specified.
\end{frame}
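\begin{frame}[containsverbatim]
\frametitle{\texttt{firstprivate} and \texttt{lastprivate} [example]}
A minimal sketch of both clauses (the helper name \texttt{last\_value} is ours): \texttt{offset} enters each thread initialized, and \texttt{last} leaves the loop with the value of the sequentially last iteration:
\begin{lstlisting}[language=C,frame=lines]
#include <assert.h>
#include <stdio.h>

/* offset is initialized in every thread (firstprivate); last is
   copied back from the sequentially last iteration (lastprivate) */
int last_value(int offset, int n)
{
    int last = -1;
    int i;
    #pragma omp parallel for firstprivate(offset) lastprivate(last)
    for (i = 0; i < n; i++)
        last = offset + i;
    return last;
}

int main(void)
{
    printf("last_value(100, 4) = %d\n", last_value(100, 4)); /* 103 */
    assert(last_value(100, 4) == 103);
    return 0;
}
\end{lstlisting}
\end{frame}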
\begin{frame}[containsverbatim]
\frametitle{The \texttt{master} construct}
\begin{itemize}
\item{Only the master thread executes the section. It can be used in any OpenMP construct}
\end{itemize}
\end{frame}

\begin{frame}[containsverbatim]
\frametitle{The \texttt{reduction} clause}
\begin{block}{A solution with the \texttt{reduction(...)} clause}
\begin{verbatim}
vec = (int*) malloc (size_vec*sizeof(int));
global_sum = 0;
#pragma omp parallel for reduction(+:global_sum)
for (i=0;i<size_vec;i++){
global_sum += vec[i];
}
\end{verbatim}
But other solutions exist!
\end{block}
\end{frame}
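\begin{frame}[containsverbatim]
\frametitle{An alternative to \texttt{reduction} [sketch]}
One of the other solutions: each thread accumulates a private partial sum and adds it to the shared total with \texttt{atomic}. This is a minimal self-contained sketch (the function name \texttt{sum\_vector} is ours, not from the exercises):
\begin{lstlisting}[language=C,frame=lines]
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Alternative to reduction(+:global_sum): private partial sums,
   combined at the end with a single atomic update per thread. */
long sum_vector(const int *vec, int n)
{
    long global_sum = 0;
    #pragma omp parallel
    {
        long local_sum = 0;        /* private to each thread */
        #pragma omp for
        for (int i = 0; i < n; i++)
            local_sum += vec[i];
        #pragma omp atomic         /* protect the shared update */
        global_sum += local_sum;
    }
    return global_sum;
}

int main(void)
{
    enum { N = 1000 };
    int *vec = malloc(N * sizeof *vec);
    for (int i = 0; i < N; i++)
        vec[i] = i + 1;
    long s = sum_vector(vec, N);
    printf("sum = %ld\n", s);      /* 1 + 2 + ... + 1000 = 500500 */
    assert(s == 500500);
    free(vec);
    return 0;
}
\end{lstlisting}
\end{frame}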
\begin{frame}[containsverbatim]
\frametitle{The \texttt{schedule} clause}
\begin{block}{}
Load-balancing
\end{block}
\begin{center}
\begin{tabular}{|l|l|}
\hline
\textbf{clause}&\textbf{behavior}\\
\hline
\hline
\textit{schedule(static [, chunk\_size])}&
iterations are divided into chunks \\
&of size \textit{chunk\_size}, assigned to \\
& the threads in a round-robin fashion. \\
& If \textit{chunk\_size} is not specified, \\
& the iterations are split into chunks \\
& of roughly equal size, one per thread. \\
\hline
\textit{schedule(dynamic [, chunk\_size])}&
iterations are divided into chunks \\
&of size \textit{chunk\_size}, assigned to \\
& the threads as they request them, \\
& until no chunk remains to be \\
& distributed. If \textit{chunk\_size} is \\
& not specified, the default is 1. \\
\hline
\end{tabular}
\end{center}
\end{frame}
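\begin{frame}[containsverbatim]
\frametitle{The \texttt{schedule} clause [example]}
A small sketch of how \texttt{schedule(static, chunk\_size)} distributes iterations (the helper name \texttt{record\_owners} is ours; the \texttt{\#ifdef} guard only makes the example runnable without OpenMP):
\begin{lstlisting}[language=C,frame=lines]
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* record which thread executes each iteration */
void record_owners(int *owner, int n)
{
    #pragma omp parallel for schedule(static, 4)
    for (int i = 0; i < n; i++) {
#ifdef _OPENMP
        owner[i] = omp_get_thread_num();
#else
        owner[i] = 0;  /* compiled without OpenMP: one thread */
#endif
    }
}

int main(void)
{
    int owner[16];
    record_owners(owner, 16);
    for (int i = 0; i < 16; i++)
        printf("iteration %2d -> thread %d\n", i, owner[i]);
    /* with schedule(static,4), iterations 0-3 form one chunk */
    assert(owner[0] == owner[3]);
    return 0;
}
\end{lstlisting}
\end{frame}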
\begin{frame}[containsverbatim]
\frametitle{The \texttt{schedule} clause}
\begin{center}
\begin{tabular}{|l|l|}
\hline
\textbf{clause}&\textbf{behavior}\\
\hline
\hline
\textit{schedule(guided [, chunk\_size])}&
iterations are assigned to the \\
&threads in chunks as they request \\
& them. The size of each chunk is \\
& proportional to the number of \\
& remaining iterations divided by \\
& the number of threads, and \\
& decreases towards \textit{chunk\_size} \\
& (1 if not specified). \\
\hline
\textit{schedule(auto)}&
The decision is delegated to the \\
& compiler and/or the runtime system \\
\hline
\textit{schedule(runtime)}&
The decision is delegated to the \\
& runtime system (e.g. through the \\
& \texttt{OMP\_SCHEDULE} environment variable) \\
\hline
\end{tabular}
\end{center}
\end{frame}
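\begin{frame}[containsverbatim]
\frametitle{\texttt{schedule(runtime)} [example]}
With \texttt{schedule(runtime)} the schedule is chosen when the program runs. A minimal sketch (the helper name \texttt{sum\_to} is ours): calling \texttt{omp\_set\_schedule} has the same effect as exporting \texttt{OMP\_SCHEDULE="dynamic,2"} before the run:
\begin{lstlisting}[language=C,frame=lines]
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

double sum_to(int n)
{
    double sum = 0.0;
    /* the actual schedule is picked up at run time */
    #pragma omp parallel for schedule(runtime) reduction(+:sum)
    for (int i = 1; i <= n; i++)
        sum += (double)i;
    return sum;
}

int main(void)
{
#ifdef _OPENMP
    /* same effect as OMP_SCHEDULE="dynamic,2" in the environment */
    omp_set_schedule(omp_sched_dynamic, 2);
#endif
    double s = sum_to(100);
    printf("sum = %.0f\n", s);   /* 1 + 2 + ... + 100 = 5050 */
    assert(s == 5050.0);
    return 0;
}
\end{lstlisting}
\end{frame}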
\begin{frame}
\frametitle{The \texttt{schedule} clause}
\begin{center}
{\input{day1/images/schedule-decision.tex}}
\end{center}
\end{frame}
\begin{frame}[containsverbatim]
\frametitle{A parallel \texttt{for} example}
\begin{block}{How to...}
... parallelize the dense matrix multiplication $C = AB$ (triple \texttt{for} loop, $C_{ij} = C_{ij} + A_{ik} B_{kj}$). What happens with the different \texttt{schedule} clauses?
\end{block}
\begin{itemize}
\item Use the \texttt{collapse} clause to increase the total number of iterations partitioned across the available OpenMP threads: it reduces the granularity of the work done by each thread.
\item If possible, avoid using the collapsed-loop indices inside the collapsed loop nest: the compiler has to recreate them from the collapsed iteration index with divide/modulo operations, and such uses are often too complicated to be removed by dead-code elimination.
\item The \textit{n} collapsed loops must be perfectly nested and rectangular (nothing like \texttt{do i=1,N ... do j=1,f(i)}), and their upper bounds should be ``small''.
\end{itemize}
\end{frame}
%\item{The directive must appear after the declaration of listed variables/common blocks}
%\item{The values of data in the threadprivate variables of non-initial threads are guaranteed to persist between two consecutive active \texttt{parallel} regions if:
% \begin{itemize}
% \item{No nested \texttt{parallel} regions}
% \item{Number of threads for both \texttt{parallel} regions is the same}
% \item{\texttt{dyn-var} ICV is false for both \texttt{parallel} regions}
% \end{itemize}
%}
%\item{A \texttt{threadprivate} variable is affected by a \texttt{copyin} clause if it appears in the list}
%\item{A \texttt{threadprivate} variable is \textbf{NOT} affected by a \texttt{copyin} clause if it as the \texttt{allocatable} (not initially allocated) or the \texttt{pointer} (no initial association) attributes}
%\end{itemize}
%\end{exampleblock}
%
%\end{frame}
%\begin{frame}[containsverbatim]
%\frametitle{A \texttt{copyin} clause}
%
%\begin{exampleblock}{Properties}
%\begin{itemize}
%\item{The \texttt{copyin} clause provides a mechanism to copy the value of the master thread's \texttt{threadprivate} variable to the \texttt{threadprivate} variable of each other member of the team executing the \texttt{parallel}region. }
%\item{If the original list item has the \texttt{POINTER} attribute, each copy receives the same association status of the master thread's copy as if by pointer assignment. }
%\item{If the original list item does not have the \texttt{POINTER} attribute, each copy becomes defined with the value of the master thread's copy as if by intrinsic assignment, unless it has the allocation status of not currently allocated, in which case each copy will have the same status. }
%\end{itemize}
%\end{exampleblock}
%\end{frame}
%\begin{frame}[containsverbatim]
%\frametitle{A \texttt{copyprivate} clause}
%
%\begin{exampleblock}{Properties}
%\begin{itemize}
%\item{The \texttt{copyprivate} clause provides a mechanism to use a private variable to broadcast a value from the data environment of one implicit task to the data environments of the other implicit tasks belonging to the \texttt{parallel} region.}
%\item{To avoid race conditions, concurrent reads or updates of the list item must be synchronized with the update of the list item that occurs as a result of the \texttt{copyprivate} clause.}
%\end{itemize}
%\end{exampleblock}
%\end{frame}
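\begin{frame}[containsverbatim]
\frametitle{The \texttt{collapse} clause [example]}
The conditions above can be sketched as follows: two perfectly nested, rectangular loops collapsed into one iteration space of $N \times M$ iterations (the helper name \texttt{fill} and the array sizes are ours):
\begin{lstlisting}[language=C,frame=lines]
#include <assert.h>
#include <stdio.h>

#define N 4
#define M 8

int a[N][M];

/* both loops are perfectly nested and rectangular, so they can
   be collapsed into a single iteration space of N*M iterations */
void fill(void)
{
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            a[i][j] = i * M + j;
}

int main(void)
{
    fill();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            assert(a[i][j] == i * M + j);
    printf("collapse(2) over %d iterations: OK\n", N * M);
    return 0;
}
\end{lstlisting}
\end{frame}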
\subsubsection{Nesting}
\begin{frame}
\frametitle{Nesting regions}
\begin{exampleblock}{Nesting}
It is possible to nest parallel regions inside a parallel region, subject to restrictions (cf. sec. 2.10, p. 111, \textit{OpenMP: Specifications ver. 3.1})
\end{exampleblock}
\end{frame}
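\begin{frame}[containsverbatim]
\frametitle{Nesting regions [example]}
A minimal sketch of a nested region: each outer thread spawns its own inner team, so the inner region body runs $2 \times 2 = 4$ times when nesting is enabled (the counter is ours; the \texttt{\#ifdef} guard only makes the example runnable without OpenMP):
\begin{lstlisting}[language=C,frame=lines]
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int count = 0;

int main(void)
{
#ifdef _OPENMP
    omp_set_nested(1);   /* allow nested parallel regions */
#endif
    #pragma omp parallel num_threads(2)
    {
        /* each outer thread creates its own inner team */
        #pragma omp parallel num_threads(2)
        {
            #pragma omp atomic
            count++;
        }
    }
    /* 4 with nesting enabled, 2 with nesting disabled,
       1 when compiled without OpenMP */
    printf("inner region executed %d times\n", count);
    assert(count >= 1);
    return 0;
}
\end{lstlisting}
\end{frame}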
\subsection{Runtime Library routines}
\begin{frame}
\frametitle{Runtime Library routines}
\begin{exampleblock}{Usage}
\begin{itemize}
\item{The functions/subroutines are defined in the runtime library (\texttt{libomp.so} / \texttt{libgomp.so}). Don't
forget to \texttt{\#include <omp.h>}}
\item{These functions can be called anywhere in your programs}
\item{ Support for new devices (\verb+Intel Phi+, \verb+GPU+, ...) with \verb+omp target+, enabling offloading to those devices }
\item{ Hardware agnostic}
\item{ A league of threads with \verb+omp teams+, distributing a loop over the team with \verb+omp distribute+ }
\item{ SIMD support for vectorization with \verb+omp simd+ }
\item{ Task management enhancements (cancellation of a task, groups of tasks, task-to-task synchronization)}
\item{ Set thread affinity in a more portable way than \verb+KMP_AFFINITY+, with the concepts of \verb+places+ (a thread, a core, a socket), \verb+policies+ (spread, close, master) and the new \verb+proc_bind+ clause}