The \textit{locking} process is performed using a boolean shared flag $X$ ($X =1\implies$ locked)
which must be coherently accessible from all cores.
The \texttt{TS} operation issues a \texttt{BusRdX} transaction on the bus, which reads
the value of $X$ with the intent to modify it, invalidating every other core's cached copy.
\\
Suppose that $P_0, P_1, \dots, P_n$ attempt to acquire the lock in order to execute a critical section, and that $P_0$ gets there before all the others.
While $P_0$ is executing its own critical section, $P_1, \dots, P_n$ continuously attempt \texttt{TS} in order to check the state of $X$ and see whether the lock is free.
This checking scheme generates high coherence traffic, because every time a processor $P_i$ (for $1\le i \le n$) executes \texttt{TS}, all other caches get invalidated, causing a miss on $X$ for the next processor attempting to read it.
With $n$ waiting processors, each round of checks therefore causes $\mathcal{O}(n)$ invalidations, resulting in coherence traffic in the order of $\mathcal{O}(n^2)$.
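A minimal sketch of this spinning scheme, assuming C11 atomics with \texttt{atomic\_exchange} standing in for the hardware \texttt{TS} primitive (the function names are illustrative, not part of the exercise):
\begin{verbatim}
#include <stdatomic.h>

atomic_int X = 0;                /* shared lock flag: 1 => locked */

void ts_lock(void)
{
    /* Every atomic_exchange is one TS: it issues a BusRdX and
       invalidates all other cached copies of X, even while the
       lock is still held by another core. */
    while (atomic_exchange(&X, 1) == 1)
        ;                        /* spin: retry TS until it reads 0 */
}

void ts_unlock(void)
{
    atomic_store(&X, 0);         /* release: the next TS reads 0 */
}
\end{verbatim}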
Taking the previous example, the main purpose of introducing \texttt{TTS} is to make $P_1, P_2, \dots, P_n$ perform \texttt{BusRd} on $X$
instead of \texttt{BusRdX} while $P_0$ holds the lock.
This small difference allows each $P_i$ to check $X$ locally without invalidating the other caches,
because the read value does not need to be overwritten until the lock is released.
\\
A local cache read works because once $P_0$ stores $1$ in $X$, the next store
should be attempted only after the execution of the critical section. In the meantime,
$P_1, \dots, P_n$ repeatedly read $X$: the first read misses, and all subsequent ones hit.
When $P_0$ releases the lock, $X$ is set to $0$; once this value is read by another processor $P_i$,
$P_i$ performs \texttt{TS} and eventually acquires ownership of the lock.
\\
This way, the coherence traffic is reduced to $\mathcal{O}(n)$ operations.
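Under the same assumptions as the \texttt{TS} sketch in the previous point, a minimal \texttt{TTS} version could look as follows; only the inner read loop differs:
\begin{verbatim}
#include <stdatomic.h>

atomic_int X = 0;                /* shared lock flag: 1 => locked */

void tts_lock(void)
{
    for (;;) {
        /* Inner test: plain loads (BusRd) that spin in the local
           cache; after the first miss the line stays valid until
           the holder releases the lock. */
        while (atomic_load(&X) == 1)
            ;
        /* Lock appears free: attempt the real TS (BusRdX). */
        if (atomic_exchange(&X, 1) == 0)
            return;              /* TS read 0: lock acquired */
    }
}
\end{verbatim}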
\end{homeworkSection}
\begin{homeworkSection}{(3)}
Exponential back-off builds on the same model as \texttt{TS}, but with the addition of an exponentially increasing waiting time after each failed check.
At first glance this looks like good practice, because the invalidations are not all performed at the same time, which reduces traffic.
\\
However, the method leads to a severe fairness problem, and consequently to high latency under contention: processors that started checking later can obtain the lock earlier than ones that have been waiting longer.
\\
Assume the same situation as in the previous points, and suppose a processor $P_i$ performs its first check after $\Delta t$ and fails;
its next check is then scheduled after $2\cdot\Delta t$. If in the meantime $P_0$ releases the lock and another
processor $P_j$ performs a check, then $P_j$ overtakes $P_i$ and takes ownership of the lock before it.
If $m$ other processors act like $P_j$, the time $P_i$ must wait before its $m^{\text{th}}$ check grows to $2^m\cdot\Delta t$,
which can be long enough for all the other processors to take precedence over $P_i$, even though it was the first to request the lock.
In the worst case, under contention, this forces $P_i$ to wait until all other cores have finished their critical sections, an unneeded serialization that causes huge latency.
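A sketch of the back-off variant under the same C11 assumptions; the initial delay and the cap are arbitrary illustrative values:
\begin{verbatim}
#include <stdatomic.h>

#define DELAY_MAX (1u << 16)     /* arbitrary cap on the back-off */

atomic_int X = 0;                /* shared lock flag: 1 => locked */

void backoff_lock(void)
{
    unsigned delay = 1;          /* plays the role of delta t */
    while (atomic_exchange(&X, 1) == 1) {
        for (volatile unsigned i = 0; i < delay; i++)
            ;                    /* idle wait between two checks */
        if (delay < DELAY_MAX)
            delay *= 2;          /* delta t, 2 delta t, 4 delta t, ... */
    }
}
\end{verbatim}
Capping the delay bounds the unfairness described above, but does not eliminate it.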
\end{homeworkSection}
\begin{homeworkSection}{(4)}
Supposing $X$ is the lock flag and that its block is initially missing from both processors' caches, then, in order:
\begin{enumerate}
\item$P_0$ broadcasts \texttt{BusRdX}, reads $0$, stores $1$ in its cache and then enters the critical section.
\item$P_1$ broadcasts \texttt{BusRdX}, but the block holding $X$ must be supplied from $P_0$'s cache.
\item$P_0$ broadcasts \texttt{DataWB} as requested by $P_1$, which then reads $1$.
\item$P_1$ writes $1$, broadcasting \texttt{BusInv} and thus invalidating $P_0$'s copy. \label{repeat}
\item$P_1$ checks $X$ by broadcasting \texttt{BusRdX} and repeats from (\ref{repeat}) while $X =1$. \label{check}
\item$P_0$ writes $0$ to $X$; $P_1$'s next \texttt{BusRdX} then finds $X =0$ and it exits the loop.
\item$P_1$ writes $1$ to $X$, invalidating $P_0$'s cached copy.
\item$P_1$ executes the critical section.
\end{enumerate}
Steps \ref{repeat} and \ref{check} correspond to the waiting loop of $P_1$.
Its attempt to enter the critical section fails because $X \neq0$ throughout the waiting loop, so
each performed store-conditional operation simply writes $1$ back to $X$.
When $P_0$ rewrites $0$ on $X$, the copy in $P_1$'s cache gets invalidated, meaning that the next test would
miss on $X$ and consequently $P_1$ would read $0$ with the intent to write $1$. At that moment the operation succeeds.
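C11 exposes no load-linked/store-conditional pair directly, so the waiting loop of $P_1$ can only be approximated here with a compare-and-swap playing the role of the test plus conditional store (an analogy, not the primitive assumed by the exercise):
\begin{verbatim}
#include <stdatomic.h>

atomic_int X = 0;                /* shared lock flag: 1 => locked */

void llsc_style_lock(void)
{
    int expected;
    do {
        expected = 0;            /* only try to install 1 over 0 */
        /* Fails, like the waiting loop above, as long as X == 1;
           on LL/SC hardware this would be a load-linked of X
           followed by a store-conditional of 1. */
    } while (!atomic_compare_exchange_weak(&X, &expected, 1));
    /* Success: X was 0 and is now 1, the critical section begins. */
}
\end{verbatim}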