Thread divergence arises from conditional branching among threads belonging to the same warp. Because all threads of a warp share a single instruction stream, the warp can issue only one path at a time, meaning that all active threads execute that path.
If a thread branches away from the common instruction flow, the remaining threads are stalled and have to wait until the branched path rejoins the main one.
Such a thread is called a diverging thread.
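\\
As a minimal sketch of this behaviour (the kernel below is purely illustrative and not taken from any assignment code), the following CUDA kernel diverges because even- and odd-indexed threads of the same warp take different branches, so the warp has to execute the two paths one after the other:
\begin{verbatim}
__global__ void divergent_kernel(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads of the same warp take different branches depending on the
    // parity of their index, so the warp serializes the two paths: while
    // one path runs, the threads on the other path are masked off.
    if (threadIdx.x % 2 == 0)
        out[i] = 2.0f * i;      // path taken by even-indexed threads
    else
        out[i] = -1.0f * i;     // path taken by odd-indexed threads
}
\end{verbatim}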
\end{homeworkSection}
\begin{homeworkSection}{(3)}
There are different types of memory on a GPU.
The main one is global memory, which is common to all processing units and therefore allows blocks to communicate with each other. Part of this device memory is also used as local memory, a per-thread spill space for variables that do not fit in registers. Being the highest level of the hierarchy, its latency is larger than that of the levels below.
Each block owns a shared memory: all threads running in that block have access to it, and its latency is much smaller (around $5$ ns).
Each streaming multiprocessor owns an L1 cache, and there is also an L2 cache shared by all of them; these generally cache local or global memory accesses.
Finally, there are registers, constant memory and texture memory.
Registers are the fastest memory on the whole GPU and hold the variables declared inside the GPU kernel.
Constant and texture memories are additional dedicated spaces: they are separate regions of global memory with their own caches, introduced in order to reduce the traffic on global memory.
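\\
As a hedged illustration of where these spaces show up in code (the names and the block size of $256$ threads are assumptions made for the example), a CUDA kernel can touch most of them explicitly:
\begin{verbatim}
// Assumed coefficient kept in constant memory (read-only, cached).
__constant__ float c_scale;

__global__ void memory_spaces(const float *g_in, float *g_out, int n)
{
    // Shared memory: one buffer per block, visible to all its threads.
    __shared__ float tile[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Register: a plain local variable normally lives in a register;
    // g_in and g_out point to global memory.
    float x = (i < n) ? g_in[i] : 0.0f;

    tile[threadIdx.x] = x;
    __syncthreads();            // make the tile visible to the whole block

    if (i < n)
        g_out[i] = c_scale * tile[threadIdx.x];
}
\end{verbatim}
On the host side, \texttt{c\_scale} would be filled with \texttt{cudaMemcpyToSymbol} before launching the kernel.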
\\
The first thing a programmer should consider is to use shared memory whenever possible, unless bank conflicts arise: a conflicting access pattern is serialized and, in the worst case, can cost as much latency as a global memory access.
It is therefore important to pay attention to how threads access the banks and to always aim for a one-to-one thread-to-bank mapping, as in the sketch below.
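\\
A classic sketch of this (assuming $32$ memory banks, a square matrix whose side is a multiple of $32$, and $32 \times 32$ thread blocks) is a shared-memory tile used for a matrix transpose: padding each row by one element shifts the column-wise accesses onto different banks and removes the conflict:
\begin{verbatim}
#define TILE 32

__global__ void transpose_tile(const float *in, float *out, int width)
{
    // The extra +1 column pads each row so that reading the tile by
    // columns hits different banks instead of always the same one.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];
}
\end{verbatim}
Without the padding, the $32$ threads of a warp reading one column of the tile would all address the same bank and the access would be fully serialized.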
\\
In principle, further optimizations are reached when the GPU's ALUs are kept uniformly busy: with a balanced workload no thread finishes much earlier than the others, which also reduces the time spent waiting at the final thread synchronization.
So the GPU must be kept fully occupied at all times for maximum performance.
In order not to saturate the traffic towards the faster memories, it is good practice to minimize memory accesses and to use variables defined inside the kernel (which are kept in registers) whenever possible.
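\\
A minimal sketch of that last point (the kernel is hypothetical): a value that is needed repeatedly is read from global memory once into a local variable, which the compiler keeps in a register, instead of being re-read at every loop iteration:
\begin{verbatim}
__global__ void scale_rows(const float *factors, float *data,
                           int rows, int cols)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;

    // Read the per-row factor from global memory once and keep it in a
    // register instead of loading factors[row] at every iteration.
    float f = factors[row];

    for (int j = 0; j < cols; ++j)
        data[row * cols + j] *= f;
}
\end{verbatim}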