\paragraph{Vertical:} it occurs when no instruction is assigned to an entire cycle, that is, the cycle is empty.
For example, a thread waiting on another for mutual exclusion under exponential backoff will
leave several cycles empty while it waits. More generally, when a long-latency event occurs,
the whole pipeline stalls until an instruction can actually retire and start filling the next cycles again.
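As an illustration of the exponential-backoff example above, the following sketch (all names and delay values are hypothetical, chosen only for demonstration) shows a thread that probes a lock and doubles its pause after each failed attempt; every pause is time in which the core issues no useful instruction, i.e. vertically wasted cycles.

```python
import time

def acquire_with_backoff(lock_is_free, max_delay=0.01):
    """Spin until the lock is free, doubling the wait after each failure.

    While the thread sleeps, it issues no instructions: from the core's
    point of view these are entirely empty (vertically wasted) cycles.
    """
    delay = 1e-6  # start with a one-microsecond pause (illustrative value)
    while not lock_is_free():
        time.sleep(delay)                   # the pipeline runs nothing useful here
        delay = min(delay * 2, max_delay)   # exponential backoff

# Toy usage: the "lock" frees itself after a few probes.
probes = {"count": 0}
def lock_is_free():
    probes["count"] += 1
    return probes["count"] > 3

acquire_with_backoff(lock_is_free)
print(probes["count"])  # prints 4
```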
\paragraph{Horizontal:} it occurs when a cycle of the pipeline is not completely filled, so part of the unit waits
for a single instruction to advance. This issue is particularly relevant to superscalar processors, because their buffers and queues release an instruction only when coherence and consistency are guaranteed, causing some units to wait while others proceed.
This waste can be generated by poorly optimized software that does not expose enough instruction-level parallelism to the hardware.
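A minimal sketch of the software side of this problem (function names are hypothetical): the first loop carries a serial dependency, so a superscalar core cannot fill its issue slots; the second exposes four independent accumulators the hardware can advance in the same cycle.

```python
# Low ILP: each operation depends on the previous result, so a superscalar
# core cannot issue them in the same cycle -- slots stay empty (horizontal waste).
def chained(xs):
    acc = 0
    for x in xs:
        acc = (acc + x) * 2   # serial dependency on acc
    return acc

# Higher ILP: four independent accumulators can be advanced in parallel
# by the hardware, filling more issue slots per cycle.
def unrolled(xs):
    a = b = c = d = 0
    for i in range(0, len(xs) - 3, 4):
        a += xs[i]; b += xs[i + 1]; c += xs[i + 2]; d += xs[i + 3]
    return a + b + c + d

print(unrolled(list(range(8))))  # 0+1+...+7 = 28
```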
\paragraph{Coarse grain multi-threading:} when a vertical waste occurs, a switch to another thread is performed in order to quickly fill the waiting cycles. The critical decision of this method is when to switch, that is, how to anticipate a long-latency event in the pipeline. To be effective, it requires that the thread switch time plus the pipeline fill time be much smaller than the event latency. On the other hand, it does not require major changes to the hardware, but single-thread performance decreases dramatically. It specifically addresses vertical wastes without dealing with horizontal ones.
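The efficiency condition above can be made concrete with a back-of-envelope check; the cycle counts below are invented figures, used only to show the shape of the comparison.

```python
def switch_pays_off(switch_cycles, refill_cycles, event_latency_cycles):
    """Coarse-grain switching helps only if the cycles spent switching and
    refilling the pipeline stay well below the stall being hidden."""
    overhead = switch_cycles + refill_cycles
    return overhead < event_latency_cycles

# Hypothetical figures: a 300-cycle memory stall vs. a 10-cycle switch
# plus a 15-cycle pipeline refill -- switching clearly pays off.
print(switch_pays_off(10, 15, 300))   # True
# A short 20-cycle stall does not justify the same overhead.
print(switch_pays_off(10, 15, 20))    # False
```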
\paragraph{Fine grain multi-threading:} each cycle is executed by a different thread in a periodic (round-robin) fashion: a fixed number of consecutive cycles, equal to the number of running threads, is shared among them. In this case, once the threads are initialized they are kept alive until the end of the program, which removes the initialization cost since it is paid only once at the beginning. The critical decision of the coarse-grain approach also disappears, since cycles are distributed evenly so that each thread executes its own vertical section of the pipeline. Although it can manage vertical waste, this is not sufficient to face horizontal waste efficiently, because the method organizes thread execution in vertical sections: a thread would still be blocked by a horizontal waste, and, concerning vertical ones, there must be enough threads to eliminate them effectively.
Compared to coarse-grain multi-threading, fine-grain single-thread performance is reasonable, and it is particularly effective for short latencies, while coarse grain better handles long ones. This method also requires more complex hardware to track dependencies, which can introduce further latencies that effectively offset the theoretical gain of fine-grain multi-threading.
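The round-robin interleaving described above can be sketched as a small simulation (thread names and instruction labels are hypothetical): each simulated cycle is handed to the next thread in a fixed rotation, so no single thread's stall can empty consecutive cycles.

```python
from collections import deque

def fine_grain_schedule(threads, cycles):
    """Round-robin one cycle per thread, as in fine-grain multi-threading.

    `threads` maps a thread name to its list of instruction labels; each
    simulated cycle is given to the next thread in a fixed rotation.
    """
    queue = deque(threads.items())
    trace = []
    for _ in range(cycles):
        if not queue:
            break
        name, instrs = queue.popleft()
        trace.append((name, instrs.pop(0)))   # issue one instruction
        if instrs:                            # rotate the thread back in
            queue.append((name, instrs))
    return trace

# Hypothetical three-thread workload: cycles interleave T0, T1, T2, T0, ...
threads = {"T0": ["i0", "i1"], "T1": ["j0", "j1"], "T2": ["k0", "k1"]}
print(fine_grain_schedule(threads, 6))
```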
\paragraph{Simultaneous multi-threading:} allows multiple threads to issue instructions within the same cycle.
The method is specifically designed to address horizontal waste because, if well implemented, it can reduce all the waiting time at a given depth level of the pipeline. A good implementation associates a register map table with each thread, because once names are physically mapped the program no longer suffers from consistency problems. Since the pipeline structures are shared, a policy is needed to decide which thread owns which instruction slot; this point is handled differently depending on the CPU model. This is the critical decision: which fetch-interleaving policy to use. The main resource requirement of this method is similar to fine grain: having enough running threads to properly hide the wastes.
Simultaneous multi-threading manages both vertical and horizontal wastes at the same time, potentially in an efficient way, but it is not trivial to implement.
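One well-known fetch-interleaving policy is ICOUNT, which fetches from the thread with the fewest instructions already in flight; the sketch below is a heavily simplified model of that idea (the occupancy numbers are invented).

```python
def icount_fetch(in_flight):
    """Pick the thread with the fewest instructions already in the pipeline.

    A simplified version of the ICOUNT fetch policy used in SMT designs:
    favouring under-represented threads keeps the issue slots of the
    shared pipeline evenly filled across threads.
    """
    return min(in_flight, key=in_flight.get)

# Hypothetical occupancy counts for three hardware threads.
print(icount_fetch({"T0": 12, "T1": 3, "T2": 7}))  # T1
```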
\end{homeworkSection}
\begin{homeworkSection}{(3)}
The OS operates in software while coarse-grain multi-threading is implemented in hardware, so thread switches happen in different ways. The OS can schedule a thread switch when a process blocks and adjust thread priorities accordingly, while coarse-grain multi-threading acts directly on the pipeline and quickly resolves vertical wastes.
First, modern hardware does not necessarily expose cache-miss detection, which implies that
a coarse-grain multi-threading implementation is not always possible; when this requirement is missing,
only the OS can manage such a case.
Second, OS preemption operations at runtime can take far longer than the waste latency itself (on the order of microseconds), which would annihilate their usefulness.