\paragraph{SWMR}: Single Writer, Multiple Readers. For each memory block,
and during any given time interval, either several units may read the stored value or a single unit is allowed
to write it. Principle: the last writer and every subsequent reader hold a valid copy of the stored value.
The first difference is that a core does not have the authority to write to its own cache in the default state \texttt{S},
which makes data coherence safer (a write is allowed in $1/3$ of the states instead of $1/2$).
A \texttt{PWr} operation does not necessarily imply a \texttt{DataWB}: a newly written value
is broadcast only when it is needed again.
Performance improvement: if a value is managed by only one cache, then
\texttt{BusInv} (invalidate all other caches), which is an expensive operation, can be omitted.
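
The write rules above can be sketched as a small state function. This is only an illustrative model, not a full protocol: the state names, the type \texttt{write\_result} and the function \texttt{on\_processor\_write} are hypothetical, and the ``exclusive'' state is an assumed name for the case in which a value is held by only one cache.

```c
#include <stdbool.h>

/* Hypothetical state names; the text itself only names state S and the
 * operations PWr, BusInv and DataWB. */
typedef enum { ST_INVALID, ST_SHARED, ST_EXCLUSIVE, ST_MODIFIED } line_state;

typedef struct {
    line_state next;  /* state of the local line after the write */
    bool bus_inv;     /* true if other caches must be invalidated first */
} write_result;

/* Effect of a processor write (PWr) on the local copy of a line.
 * No write-back is modelled here: the new value stays in the cache. */
write_result on_processor_write(line_state current)
{
    write_result r = { ST_MODIFIED, false };

    switch (current) {
    case ST_INVALID:   /* simplified: fetching with intent to write is
                          modelled as a single invalidation flag */
    case ST_SHARED:    /* other caches may hold a copy */
        r.bus_inv = true;
        break;
    case ST_EXCLUSIVE: /* only this cache holds the line: BusInv omitted */
    case ST_MODIFIED:  /* already the single writer */
        break;
    }
    return r;
}
```

Under these assumptions, a write in state \texttt{S} requires an invalidation, while a write to a line held by a single cache moves silently to the modified state.
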
The primary purpose of a directory is to restrict an action sent by a core to the smallest group of caches that
are involved in that action. In this sense, a directory-based system manages coherence through many
small groups of caches instead of through the bus, which acts on the global scope.
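
One possible way to picture this restriction is a directory entry that records which cores hold a copy of a block. The sketch below is an assumption-laden illustration: the bit-vector layout, the constant \texttt{NUM\_CORES} and the names \texttt{dir\_entry} and \texttt{dir\_handle\_write} are hypothetical and do not come from the text.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CORES 4  /* assumed number of cores for this sketch */

/* Hypothetical directory entry: one presence bit per core. */
typedef struct {
    uint8_t sharers;  /* bit c is set if core c holds a copy of the block */
} dir_entry;

/* Core `writer` wants to write the block. Returns the bitmask of cores
 * that must receive an invalidation (only the current sharers, never all
 * caches), and records the writer as the sole holder. */
uint8_t dir_handle_write(dir_entry *e, unsigned writer)
{
    assert(writer < NUM_CORES);

    uint8_t writer_bit = (uint8_t)(1u << writer);
    uint8_t targets = (uint8_t)(e->sharers & ~writer_bit);

    e->sharers = writer_bit;
    return targets;
}
```

For example, if cores 0, 1 and 3 share a block and core 1 writes it, only cores 0 and 3 are contacted, whereas a bus-based invalidation would reach every cache.
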
Sparse directory buffers have, in principle, the same size as non-sparse directories, but
they group the cache ways together as a function of the last bits of the \texttt{tag} field.
When no tag bits are used, the associativity is maximal.
Suppose each cache has an associativity of $A_{cache} = 4$ ways; then, for a fixed cache index,
a directory managing $4$ cores (hence $4$ caches) has to place $16$ way blocks in total.
Imagine now that the directory organizes those blocks with an associativity of $A_{dir} = 8$, using
the last three bits of the \texttt{tag} field as identifier ($3 = \log_2(8)$); then it is enough
that $3$ blocks share the same value of these three bits to cause a conflict. The conflict arises
because the directory supports at most $2 = \frac{16}{8}$ blocks with equal values in those last three bits.
More generally:
\[
\#bit_{max} = \frac{\#cores \; \cdot \; A_{cache}}{A_{dir}}
\]
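
The numbers of this example can be checked with a short sketch. The function names \texttt{max\_blocks\_per\_group} and \texttt{has\_conflict} are hypothetical, and the code assumes that $A_{dir}$ is a power of two so that the last $\log_2(A_{dir})$ tag bits can be extracted with a mask.

```c
#include <assert.h>

/* Maximum number of blocks that may share the same low tag bits:
 * (#cores * A_cache) / A_dir, as in the formula above. */
unsigned max_blocks_per_group(unsigned cores, unsigned a_cache, unsigned a_dir)
{
    assert(a_dir != 0);
    return (cores * a_cache) / a_dir;
}

/* Returns 1 if more than `limit` of the n tags share the same value in
 * their last log2(a_dir) bits, 0 otherwise. */
int has_conflict(const unsigned *tags, unsigned n, unsigned a_dir, unsigned limit)
{
    assert(a_dir != 0 && (a_dir & (a_dir - 1)) == 0);  /* power of two */
    unsigned mask = a_dir - 1;

    for (unsigned i = 0; i < n; ++i) {
        unsigned same = 0;
        for (unsigned j = 0; j < n; ++j) {
            if ((tags[i] & mask) == (tags[j] & mask))
                ++same;
        }
        if (same > limit)
            return 1;
    }
    return 0;
}
```

With $4$ cores, $A_{cache} = 4$ and $A_{dir} = 8$, the limit is $2$: the tags \texttt{0x15}, \texttt{0x0D} and \texttt{0x25} all end in the bits \texttt{101}, so the third one causes a conflict.
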
False sharing is a coherence cache miss and occurs when two cores write to different data items located in the same cache block.
It is harmful because the MSI protocol then generates an excess of invalidations and
memory updates: in order to maintain coherence, additional \texttt{BusInv} and \texttt{DataWB} operations are necessary.
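
A common remedy, sketched here under assumptions, is to place data written by different cores in different cache blocks. The block size \texttt{CACHE\_LINE} of $64$ bytes, the two structures and the helper \texttt{same\_block} are hypothetical; the helper also assumes that the structure itself starts on a block boundary.

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed cache block size in bytes */

/* Both counters are adjacent, so they normally share one cache block:
 * writes by two cores to a and b would invalidate each other. */
struct counters_packed {
    long a;  /* written by core 0 */
    long b;  /* written by core 1 */
};

/* Each counter is aligned to its own cache block, at the cost of memory. */
struct counters_padded {
    alignas(CACHE_LINE) long a;  /* written by core 0 */
    alignas(CACHE_LINE) long b;  /* written by core 1 */
};

/* Returns 1 if two byte offsets inside a block-aligned structure fall
 * into the same cache block, 0 otherwise. */
int same_block(size_t off_a, size_t off_b)
{
    return off_a / CACHE_LINE == off_b / CACHE_LINE;
}
```

In the packed layout both fields fall into the same block, while in the padded layout they are $64$ bytes apart, so a write to one no longer invalidates the block that holds the other.
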
\paragraph{Loop fusion}: when several independent loops share the same iteration pattern in their header \texttt{for(int i=0; i < N; ++i)},
the loop overhead and the resulting cache misses can be reduced by merging them into a single loop.
\begin{verbatim}
/* Before optimization: two separate passes over the same range */
float a = 3.0;
float b = 2.0;
for (int i = 0; i < N; ++i) {
    a += 0.5 * i;
}
for (int i = 0; i < N; ++i) {
    b += 0.2 * i;
}

/* After optimization: one fused pass updates both accumulators */
float a = 3.0;
float b = 2.0;
for (int i = 0; i < N; ++i) {
    a += 0.5 * i;
    b += 0.2 * i;
}
\end{verbatim}
True sharing can be reduced by minimizing the data dependence between threads.
Parallelizing only independent tasks totally avoids true sharing.
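
As an illustration of reducing the dependence between threads, a sum can be split into independent chunks, each accumulated in a private variable and combined only once. This is a minimal sketch: the names \texttt{WORKERS}, \texttt{partial\_sum} and \texttt{chunked\_sum} are hypothetical, and the chunks are processed one after another only to keep the example self-contained; in a real program each call to \texttt{partial\_sum} would run on its own thread.

```c
#include <stddef.h>

#define WORKERS 4  /* assumed number of workers for this sketch */

/* Sums data[begin..end) into a private accumulator. The function writes
 * no shared data, so workers running it do not interfere. */
static double partial_sum(const double *data, size_t begin, size_t end)
{
    double local = 0.0;
    for (size_t i = begin; i < end; ++i)
        local += data[i];
    return local;
}

/* Splits the array into WORKERS contiguous chunks and combines the
 * partial results. The shared total is updated once per worker instead
 * of once per element. */
double chunked_sum(const double *data, size_t n)
{
    double total = 0.0;
    for (size_t w = 0; w < WORKERS; ++w) {
        size_t begin = n * w / WORKERS;
        size_t end = n * (w + 1) / WORKERS;
        total += partial_sum(data, begin, end);
    }
    return total;
}
```

The only truly shared value is \texttt{total}, which is touched \texttt{WORKERS} times rather than \texttt{n} times.
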

