\chapter{Audio file formats}
When naively digitizing an analog audio signal, one converts a continuous wave into a list of numbers. The standard way of doing so is PCM, as described in the chapter about D/A conversion: it saves $f_s$ samples per second, each quantized over $b$ bits. The file size of a digital audio file of length $t$ seconds sampled with PCM is then obtained by simply multiplying all terms
\begin{align*}
\text{file size} = t \cdot f_s \cdot b \text{ [bits]}.
\end{align*}
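For instance, a 3-minute mono track sampled at $f_s = 44100$ Hz with $b = 16$ bits per sample occupies
\begin{align*}
180 \cdot 44100 \cdot 16 = 127\,008\,000 \text{ [bits]} \approx 15.9 \text{ MB},
\end{align*}
and twice as much in stereo, since each channel is stored separately.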
Such a digital audio file is said to be \textbf{uncompressed}, as it simply stores all the measurements in a row on the computer without any processing. PCM creates very large files, which can be inconvenient for uses such as streaming or storing on small devices. \textbf{Audio compression} is hence the domain of sound processing that consists of reducing the size of the data needed to represent a track of music. Many methods have been proposed to reduce the file size of an audio track while keeping a maximal quality of sound.\\
It is important to distinguish between two concepts in audio encoding and compression: format and codec. An \textbf{(audio file) format} is a standard for storing audio in a file on a computer. It has an associated file extension, and a precise syntax describing how the data is organized in the file. On the other hand, a \textbf{codec (or compression algorithm)} is a mathematical method for representing the audio information in a desired way. Compression codecs can be sorted into two categories: lossy and lossless.
\begin{itemize}
\item \textbf{Lossy compression} codecs lose information about the sound during the compression process. Discarding this information reduces the amount of data to be saved, and the information must be chosen wisely so that its loss is as unnoticeable as possible. The compressed audio hence cannot be restored to its original, uncompressed version.
\item On the other hand, \textbf{lossless compression} codecs do not lose any information during the compression process, and hence allow a total recovery of the original audio when playing the file. Unsurprisingly, they achieve lower compression rates than lossy codecs, resulting in larger files.
\end{itemize}
While some audio formats make exclusive use of one unique associated codec, other formats can contain different codecs, in which case the choice of the codec must be encoded in the metadata of the file. In this chapter, we focus on 3 popular audio formats: WAV, FLAC, and MP3.
\section{The WAV format}
The Waveform Audio File Format (WAV) is an audio format developed by IBM and Microsoft in 1991. WAV is almost always used for storing uncompressed audio, although it is compatible with some lossy audio codecs. It is hence generally considered a lossless audio format.
\subsection{The RIFF container}
In 1991, Microsoft and IBM developed a generic file container for media, called the Resource Interchange File Format (RIFF). WAV was proposed the same year as an instance of RIFF for support on Windows 3.1. A RIFF file is exclusively composed of nested \textbf{chunks} that all have the following format:
\begin{itemize}
\item an ASCII id (4 bytes)
\item the length of the chunk's data (4 bytes)
\item the chunk's data (variable length)
\item (an optional padding byte, if the total length is not even).
\end{itemize}
Chunks can contain other chunks, so that any RIFF file is organized in a tree fashion, with one big chunk containing all the data in smaller chunks. In the WAV instance of RIFF, the main chunk has id \texttt{RIFF} and directly contains a chunk with id \texttt{WAVE}. It is followed by a series of format chunks containing the general information about the file. Other chunk types include \texttt{INFO}, containing metadata about the artist, track name, etc. However, RIFF and WAV do not specify a precise syntax for the chunks containing the data itself.
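As an illustration, this chunk structure can be traversed with a few lines of code. The following sketch (a minimal reader written for this chapter, not part of any standard library) prints the id and size of each top-level chunk of a well-formed WAV file:
\begin{verbatim}
import struct

def list_riff_chunks(filename):
    """Print the id and size of each top-level chunk of a WAV file."""
    with open(filename, "rb") as f:
        # RIFF header: chunk id, chunk size, then the form type "WAVE"
        riff_id, riff_size, wave_id = struct.unpack("<4sI4s", f.read(12))
        assert riff_id == b"RIFF" and wave_id == b"WAVE"
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            chunk_id, size = struct.unpack("<4sI", header)
            print(chunk_id.decode("ascii"), size)
            f.seek(size + (size % 2), 1)  # skip data + optional pad byte
\end{verbatim}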
\subsection{Associated codecs}
The main codec generally used within the WAV format is linear PCM (LPCM). LPCM is a PCM codec where the amplitude measured for each sample is quantized linearly: the amplitude range is divided into $2^b$ slots of equal size, where $b$ is the number of quantization bits, and each sample is simply encoded using the number of the slot it falls in. Lossy codecs handled by WAV notably include $\mu$-law and A-law.
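A minimal sketch of this quantization scheme (assuming samples already normalized to $[-1, 1]$; actual converters perform this step in hardware):
\begin{verbatim}
import numpy as np

def lpcm_quantize(samples, b):
    """Assign each sample in [-1, 1] to one of 2**b equal slots."""
    levels = 2 ** b
    slots = np.floor((np.asarray(samples) + 1.0) / 2.0 * levels)
    return np.clip(slots, 0, levels - 1).astype(int)

print(lpcm_quantize([-1.0, 0.0, 0.999], b=2))  # [0 2 3]
\end{verbatim}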
\subsection{WAV is not the format for CDs}
It is sometimes believed that WAV is the format used on CDs, since both mainly use the LPCM codec. While indeed similar, CDs actually use the Compact Disc Digital Audio (CDDA) format, also known as \textbf{Red Book}. The CDDA format specifies a sampling frequency of 44.1kHz with 16 bits per sample over 2 channels, which produces 1411 kbit/s. The LPCM data is first encoded in frames of 192 bits. CDDA makes use of an error correction code called \textbf{CIRC}, allowing the CD to be playable even after being scratched; this adds 64 bits of redundancy to each frame.
\section{The FLAC format}
The Free Lossless Audio Codec (FLAC) is a lossless audio format and compression algorithm, introduced in 2001 by the Xiph.org Foundation. It can reduce the size of a WAV file by up to 50\% without losing any information. It is distributed under a GNU GPL license, meaning that all the specifications and the description of the algorithm are published for free on the official website\footnote{\url{https://xiph.org/flac/}}.\\
\begin{figure}[h]
\begin{minipage}{\textwidth}
\centering
\includegraphics[width=.2\linewidth]{figures/flac_logo.jpg}
\end{minipage}
\end{figure}
While WAV is generally preferred for working on audio production locally, FLAC achieves relatively large compression rates for a lossless format, making it a popular choice for storing and sharing high-quality music online. Being a quality competitor to other lossless formats such as WAV (Microsoft) or ALAC (Apple), FLAC was long left unsupported by Windows and macOS, and instead gained popularity online for sharing music (often illegally). With most audio players gradually implementing support for FLAC, Microsoft finally added support in Windows 10, and Apple in macOS High Sierra and iOS 11\footnote{See \url{https://www.cnet.com/news/what-is-flac-the-high-def-mp3-explained/}.}.
\subsection{Compression algorithm}
FLAC follows a scheme used by most lossless compression algorithms. The data is first split into chunks called \textbf{blocks} of fixed length. If the signal is stereo, it is \textbf{decorrelated}, which reduces the number of bits needed for encoding. Then, each block's samples are fit to a model that produces approximate estimates of the samples, and the difference from the actual samples, called the residual error, is computed. This is the \textbf{prediction} stage. Finally, the model's parameters and the residual errors are encoded using \textbf{Rice codes}. A summary of all stages is shown in figure \ref{flac_pipelinea} and each stage is described below.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[h]
\center
\begin{dspBlocks}{.7}{.6}
% 1 2 3 4 5
PCM~~ & \BDfilter{Blocking} & \BDfilterMulti{Inter-channel\\decorrelation} & \BDfilter{Prediction} & \BDfilterMulti{Rice\\coding} & ~~bit stream
\ncline{->}{1,1}{1,2}\ncline{->}{1,2}{1,3}
\ncline{->}{1,3}{1,4}\ncline{->}{1,4}{1,5}\ncline{->}{1,5}{1,6}
\end{dspBlocks}
\caption{Pipeline of the stages of a FLAC encoder.}
\label{flac_pipelinea}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{Blocking}
Blocking is the action of separating the input data stream into blocks. In the FLAC standard, all blocks of a file must have the same size; this size can however vary from one file to another, as it is optimized based on the sample rate. The maximum size of a block is 65535 samples.
\subsubsection{Decorrelation}
Inter-channel decorrelation takes advantage of the redundancy between the left and right signals: it encodes the average of the left and right signals on one channel (center), and their difference on another channel (side). The difference being typically small, it requires fewer bits to encode, which saves space. Conversion to mono is also simple, as one just uses the center signal. Note that no information is lost during this process. This is shown in figure \ref{decorrelation}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[h]
\center
\begin{dspBlocks}{1.2}{.6}
% 1 2 3 4 5
Left~~ & \BDsplit & \BDadd & ~~Center (mono)\\
Right~~ & \BDsplit & \BDadd & ~~Side
\ncline{1,1}{1,2}\ncline{->}{1,2}{1,3}
\ncline{->}{1,3}{1,4}^{$0.5$~~~~~~~~~~~~}
\ncline{2,1}{2,2}\ncline{->}{2,2}{2,3}_{$-$}
\ncline{->}{2,3}{2,4}
\ncline{->}{2,2}{1,3}
\ncline{->}{1,2}{2,3}
\end{dspBlocks}
\caption{Inter-channel decorrelation.}
\label{decorrelation}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
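In code, the decorrelation of figure \ref{decorrelation} amounts to one sum and one difference per sample pair. Below is a minimal sketch (using floating-point arithmetic, whereas FLAC actually operates on integers with an exactly reversible variant):
\begin{verbatim}
import numpy as np

def decorrelate(left, right):
    """Mid/side decorrelation: center = average, side = difference."""
    return 0.5 * (left + right), left - right

def recorrelate(center, side):
    """Exact inverse of decorrelate: no information is lost."""
    return center + 0.5 * side, center - 0.5 * side
\end{verbatim}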
\subsubsection{Prediction}
FLAC can make use of one of 4 different models $\hat{x}$ for fitting the data $x$, and can use a different one for each channel of each block. The residual error is then computed as the difference between the actual data and the estimate,
\begin{align*}
e[i] = x[i] - \hat{x}[i]
\end{align*}
for each sample $i$. The simplest model, called \textbf{Verbatim}, simply estimates all the samples to be zero,
\begin{align*}
\hat{x}[i] = 0, \forall i .
\end{align*}
The residual error is then equal to the signal itself. This model is used when the audio data is too random to be fit by anything simpler than this. The second model is used for silent parts of the audio. It simply encodes one \textbf{constant} value for each channel,
\begin{align*}
\hat{x}[i] = c, \forall i.
\end{align*}
The two last predictors, called the \textbf{Fixed Linear Predictor} and the \textbf{FIR linear predictor} respectively, are described in Shorten\footnote{\url{ftp://svr-ftp.eng.cam.ac.uk/pub/reports/auto-pdf/robinson_tr156.pdf}}. They use Linear Predictive Coding (LPC): for an input decorrelated block of samples $x$, LPC approximates the $i$th sample $x[i]$ by a weighted sum of the past $N$ samples in the block, as
\begin{align*}
\hat{x}[i] = \sum_{k=1}^N a_k x[i-k] .
\end{align*}
The number of coefficients $N$ is chosen optimally for each block and can be at most 32. The residual error is subsequently computed. The fixed linear predictor uses a predefined fixed set of weights $a$ for estimating the samples, so only the residual error needs to be encoded, while the FIR model allows any $a$ and must encode $a$ as well.
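As a small illustration, the sketch below computes the residual of the order-2 fixed predictor $\hat{x}[i] = 2x[i-1] - x[i-2]$ (one of the predefined weight sets of the fixed linear predictor), with zero-padding on the left:
\begin{verbatim}
import numpy as np

def fixed_predictor_residual(x):
    """Residual of the order-2 fixed predictor 2x[i-1] - x[i-2]."""
    x = np.asarray(x)
    padded = np.concatenate(([0, 0], x))    # zero-padding on the left
    x_hat = 2 * padded[1:-1] - padded[:-2]  # prediction for each sample
    return x - x_hat

print(fixed_predictor_residual([4, 6, 9, 11, 12]))  # [ 4 -2  1 -1 -1]
\end{verbatim}
A smooth signal yields small residuals, which is exactly what makes the subsequent Rice coding effective.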
\subsubsection{Rice codes}
The last step is to encode $a$ and $e$. This is done using Rice codes, which are popular for their efficiency at encoding small numbers (hence the choice of computing a residual error in the first place). To encode a non-negative integer $n$, Rice codes use one global tunable non-negative integer parameter $k$ and proceed as follows:
\begin{enumerate}
\item Compute $q = \left \lfloor{n / 2^k}\right \rfloor$
\item Concatenate $q$ binary ones
\item Concatenate a binary zero
\item Concatenate the last $k$ bits of $n$, written as an unsigned binary number.
\end{enumerate}
Below are listed the Rice codes for various values of $k$ and $n$. FLAC chooses the optimal $k$ for each block to minimize the total length of the generated code.
\begin{figure}[h]
\begin{center}
\begin{tabular}{ |c |c c c| }
\hline
$n$ & $k=0$ & $k=2$ & $k=4$ \\
\hline
\hline
0 & 0 & 000 & 00000\\
1 & 10 &001 &00001\\
2 &110 &010 &00010\\
3 &1110 &011 &00011\\
4 &11110 &1000 &00100\\
5 &111110 &1001 &00101\\
6 &1111110 &1010 &00110\\
7 &11111110 &1011 &00111\\
8 &111111110 &11000 &01000\\
\hline
\end{tabular}
\end{center}
\end{figure}
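A direct transcription of the four steps above (assuming a non-negative input $n$; FLAC actually maps signed residuals to non-negative integers first, a detail omitted here):
\begin{verbatim}
def rice_encode(n, k):
    """Rice code of a non-negative integer n with parameter k."""
    q = n >> k                       # 1. q = floor(n / 2**k)
    unary = "1" * q + "0"            # 2.-3. q ones, then a zero
    rem = format(n & ((1 << k) - 1), "0{}b".format(k)) if k else ""
    return unary + rem               # 4. last k bits of n

print(rice_encode(4, 2))  # '1000', matching the table above
\end{verbatim}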
\subsection{File structure}
A FLAC file is composed of
\begin{itemize}
\item The marker string "\texttt{fLaC}"
\item The metadata block STREAMINFO
\item (Optional additional metadata blocks)
\item The audio frames.
\end{itemize}
The STREAMINFO block contains the global information necessary to decode the file, such as its sample rate, number of channels, etc. The optional metadata may notably contain the album artwork, artist name, album name, and cue sheet (used to encode the start time of each track when a whole CD is stored in one file). The rest of the file consists of audio frames, which contain the Rice codes and the predictor's parameters, if any. Note that frames do not necessarily coincide with the blocks. The format is entirely described on the official webpage: \url{https://xiph.org/flac/format.html#prediction}.
\section{The MP3 format}
MPEG-1 Audio Layer III (MP3) is an umbrella term covering both an audio format and a lossy compression codec, proposed in 1993 by the Fraunhofer Society. MP3 takes advantage of several independent properties of sound and of human perception, each implemented in a different stage of the MP3 encoding pipeline. The pipeline is shown in figure \ref{mp3_pipelinea}.\\
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[h]
\center
\begin{dspBlocks}{.7}{.6}
% 1 2 3 4 5
PCM~~ & \BDsplit & \BDfilterMulti{Hybrid\\filter bank} & \BDfilterMulti{Quantization\\and coding} & \BDfilterMulti{Bit stream\\encoder} & ~~bit stream\\
&&\BDfilterMulti{Perceptual coding} & &
\ncline{1,1}{1,2}\ncline{->}{1,2}{1,3}
\ncline{->}{1,3}{1,4}\ncline{->}{1,4}{1,5}\ncline{->}{1,5}{1,6}
\ncline{1,2}{2,2}\ncline{->}{2,2}{2,3}\ncline{2,3}{2,4}\ncline{->}{2,4}{1,4} \ncline{->}{2,3}{1,3}
\end{dspBlocks}
\caption{Pipeline of the stages of a MP3 encoder.}
\label{mp3_pipelinea}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
In this section, we first present a short history of the MP3 standard, and list what parameters must be set before encoding audio into MP3. Then, we review each of the 4 stages of the encoder pipeline sequentially. A precise implementation and mathematical derivations of each encoding stage can be found in \cite{Jacaba2001AUDIOCU}.
\subsection{Brief history}
\begin{figure}[h]
\begin{minipage}{\textwidth}
\centering
\includegraphics[width=.2\linewidth]{figures/MP3_logo.png}
\end{minipage}
\end{figure}
In 1988, the Moving Picture Experts Group (MPEG) called for proposals for a lossy audio encoding standard. 14 algorithms were submitted by companies such as Philips, Sony, and AT\&T. One proposed implementation, called MUSICAM, was chosen for its error robustness and its efficiency. Ideas from other submitted algorithms were also included, and the standard was approved in 1991. The standard, named MPEG-1, was published in 1993 and featured specifications for video, audio, and systems. The audio part featured 3 levels of compression, named Layer I, II, and III, respectively, with the third level (Layer III) achieving the highest compression rate of 1:12 and gaining mainstream success. The long \textit{Moving Picture Experts Group 1 Audio Layer III} was then shortened to simply \textbf{MP3}. \\
The standard proposes a precise syntax for encoding a file, as well as a general structure for compressing the data. However, details of the encoder implementation were deliberately left out, motivating distributors of MP3 encoders to develop competing algorithms. In 1994, the MPEG group released MPEG-2 Audio Layer III, also shortened to MP3, featuring improvements to the encoding algorithm. A third version, MPEG-2.5, was later proposed in 2000. In this section however, we focus on the MPEG-1 implementation.
\subsection{Parameters of the encoder}
The first step to encoding PCM data into an MP3 file is to set several parameters for the encoder.
\subsubsection{Channel mode}
The \textbf{channel mode} defines how the different audio channels are encoded. MP3 supports a maximum of two audio channels and features 4 different types of channel encoding: mono, dual, stereo, and joint stereo. Mono simply encodes one channel. Dual encodes two independent channels (for instance two languages of the same movie) that are not meant to be played together. Stereo also encodes two channels, which are meant to be played together (left and right). Joint stereo works in the same way as the inter-channel decorrelation used in FLAC.
\subsubsection{Sampling frequency}
The \textbf{sampling frequency} of the signal can be set among the values 32kHz, 44.1kHz, and 48kHz, with the default being 44.1kHz.
\subsubsection{Bit rate}
The \textbf{bit rate} of the compressed output audio can be chosen between 8kbps and 320kbps, with 128kbps being the default value. MP3 specifies two types of bit rates: Constant Bit Rate (CBR) and Variable Bit Rate (VBR). MP3 divides the audio signal into temporal frames during encoding. CBR encodes each frame with a fixed number of bits, while VBR allots more bits to frames containing more complexity. While VBR is preferred for compact storage as it achieves higher compression rates than CBR, CBR is preferred for applications such as streaming, where a variable bit rate can cause timing difficulties during decoding.
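As an illustration of how a CBR setting fixes the frame size, a commonly used formula for the byte length of an MPEG-1 Layer III frame is sketched below: each frame codes 1152 samples, giving $1152/8 = 144$ bytes per unit of bit rate over sample rate, plus an optional padding byte signaled in the frame header.
\begin{verbatim}
def frame_size_bytes(bitrate, sample_rate, padding=0):
    """Byte length of an MPEG-1 Layer III frame under CBR."""
    return 144 * bitrate // sample_rate + padding

print(frame_size_bytes(128_000, 44_100))  # 417
\end{verbatim}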
\subsection{The encoder: perceptual coding}
The human ear and brain do not hear all audio frequencies at the same level. Perceptual coding groups the strategies that take advantage of the limitations of human hearing to discard imperceptible audio data and reduce the file size. MP3 makes use of 3 such characteristics: the absolute threshold of hearing, simultaneous masking, and temporal masking.
\subsubsection{Absolute threshold of hearing}
The Absolute Threshold of Hearing (ATH) is a function of frequency giving the minimum Sound Pressure Level (SPL) required to hear a sound at that frequency. Frequencies below 20Hz and above 20kHz are inaudible at any age, while frequencies towards the upper end of this range become harder to hear with age. Frequencies around 1kHz are perceived with the most sensitivity. The average ATH is shown in figure \ref{ATH}. \\
\begin{figure}[h]
\begin{minipage}{\textwidth}
\centering
\includegraphics[width=.7\linewidth]{figures/ATH.png}
\caption[Caption for LOF]
{Average human ATH\protect\footnotemark.}
\label{ATH}
\end{minipage}
\end{figure}
\footnotetext{\cite{ATH}}
MP3 takes advantage of this property by using more data to encode frequencies that are heard with the most sensitivity, while using coarser-grained coding for frequencies that are heard less. MP3 also uses a bandpass filter to discard all frequencies below 20Hz and above 20kHz.
\subsubsection{Simultaneous masking}
Simultaneous masking is the effect that occurs when two sounds with close frequencies play at the same time. The louder sound has a masking effect shaped like a triangle centered on its frequency: everything inside this triangle is not heard, as shown in figure \ref{sim_masking}.\\
\begin{figure}[h]
\begin{minipage}{\textwidth}
\centering
\includegraphics[width=.6\linewidth]{figures/masking_sim.png}
\caption[Caption for LOF]
{Simultaneous masking example\protect\footnotemark.}
\label{sim_masking}
\end{minipage}
\end{figure}
\footnotetext{\cite{masking}}
The physical reason for this phenomenon is that the human ear has approximately 24 frequency bands, within each of which roughly one frequency can be heard at a time. When 2 frequencies belonging to the same band are played together, the louder one masks the quieter one. MP3 hence detects the loudest frequencies and need not encode the masked neighboring frequencies.
\subsubsection{Temporal masking}
Similarly to simultaneous masking, temporal masking occurs when two sounds are played close to each other in time. Temporal masking can be divided into pre-masking and post-masking. \textbf{Post-masking} occurs after a loud sound is heard: the ear needs between 50 and 300 ms to recover its full sensitivity, and during this interval the hearing is less sensitive. \textbf{Pre-masking} is the opposite effect: just before a loud sound is heard, quieter sounds blend into the later, louder sound. One can observe the effect in figure \ref{temp_masking}. Again, MP3 takes advantage of this by not encoding quiet sounds occurring shortly before or after louder sounds.
\begin{figure}[h]
\begin{minipage}{\textwidth}
\centering
\includegraphics[width=.6\linewidth]{figures/masking_temp.png}
\caption[Caption for LOF]
{Temporal masking example\protect\footnotemark.}
\label{temp_masking}
\end{minipage}
\end{figure}
\footnotetext{\cite{masking}}
\subsubsection{1024-point FFT}
In practice, the perceptual coder computes a 1024-point FFT directly on the input PCM signal. For each frequency bin, it then computes the masking threshold using the 3 phenomena described above. The exact computation is not defined in the MP3 standard and is left to each implementation of the codec. The masking threshold information for each of the 1024 bins is then sent to the quantization block.
\subsection{The encoder: hybrid filter bank}
In parallel to the perceptual coder, the input PCM data is sent through a hybrid filter bank that separates it into frequency bands. The filter bank is called hybrid because the filtering is done in two steps, as described below.
\subsubsection{Step 1: Polyphase filter bank}
The PCM data first enters a Polyphase Quadrature Filter Bank (PFB)\cite{polyphase} that divides the signal into 32 frequency bands. A PFB is a filter bank designed to avoid spectral leakage\footnote{\url{https://en.wikipedia.org/wiki/Spectral_leakage}}, the leakage of a frequency's power over the neighboring bands. One can observe the effect of spectral leakage in figure \ref{spectral_leakage}: the graph is shaped like a circus tent instead of featuring one precise, sharp delta function. Since MP3 relies on the relationship between adjacent bands, as we will see later, this effect must be avoided. The PFB allows an almost perfect reconstruction of the signal with a flat frequency response, obtained by adding the bands back together.\\
\begin{figure}[h]
\begin{minipage}{\textwidth}
\centering
\includegraphics[width=.6\linewidth]{figures/sine_fft.pdf}
\caption[Caption for LOF]
{Spectral leakage occurring when computing the FFT of a sine wave with a single frequency.}
\label{spectral_leakage}
\end{minipage}
\end{figure}
The PFB reads the signal 1152 samples at a time and separates it into 32 bands of equal frequency width, distributed between $0$ and $f_s/2$. Each band (containing 1152 samples) is then downsampled by a factor of 32, resulting in 36 samples for each of the 32 bands. No information is lost during the downsampling phase since each band's highest frequency is at most 1/32nd of the original Nyquist frequency.
\subsubsection{Step 2: Modified Discrete Cosine Transform}
Each of the 32 bands is then passed through a window and then through a Modified Discrete Cosine Transform (MDCT). The window is meant to reduce the temporal aliasing between two subsequent segments of samples. The choice of the window is made by the perceptual coding block, according to the difference between the previous audio segment and the current one: small differences lead to choosing a large window with more overlap, while large differences motivate choosing a smaller window with almost no overlap. A total of 4 different windows can be chosen from, in order to minimize time aliasing.\\
Then, for each band, the windowed signal of 36 samples is sent to the MDCT. The MDCT transforms a vector $x$ of size $2N$ into a vector $X$ of size $N$ through the formula
\begin{align*}
X_k = \sum_{n=0}^{2N-1} x_n \cos \left[\frac{\pi}{N} \left(n+\frac{1}{2}+\frac{N}{2}\right) \left(k+\frac{1}{2}\right) \right] .
\end{align*}
The output of the MDCT hence contains only 18 samples per band, which sums up to a total of 576 samples for all bands together. These 576 samples are then sent to the \textbf{quantization and coding block}. The diagram of the hybrid filter bank block is shown in figure \ref{filter_bank}.
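A direct (quadratic-time) transcription of the formula is sketched below; real encoders use fast algorithms, but the output is the same:
\begin{verbatim}
import numpy as np

def mdct(x):
    """MDCT of a length-2N vector, returning N coefficients."""
    two_n = len(x)
    half = two_n // 2  # N
    n = np.arange(two_n)
    k = np.arange(half)
    basis = np.cos(np.pi / half *
                   np.outer(n + 0.5 + half / 2, k + 0.5))
    return x @ basis

segment = np.sin(2 * np.pi * 5 * np.arange(36) / 36)  # one windowed band
print(mdct(segment).shape)  # (18,)
\end{verbatim}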
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{center}
\begin{figure}[h]
\center
\begin{dspBlocks}{2.5}{.4}
% 1 2 3 4 5
PCM~~ & \BDsplit & \BDfilter{BP} & \BDdwsmp{32} & \BDfilter{MDCT} & \\
& & \BDfilter{BP} & \BDdwsmp{32} & \BDfilter{MDCT} & \\
& & & & & \\
& & \BDfilter{BP} & \BDdwsmp{32} & \BDfilter{MDCT} &
\ncline{1,1}{1,2}^{$1152$ samples} \ncline{->}{1,2}{1,3} \ncline{->}{1,3}{1,4}^{$1152$ samples} \ncline{->}{1,4}{1,5}^{$36$ samples~~~~} \ncline[linewidth=4pt]{1,5}{1,6}^{~~~~~~~~~~$18$ coefficients}
\ncline{->}{2,2}{2,3} \ncline{->}{2,3}{2,4}^{$1152$ samples} \ncline{->}{2,4}{2,5}^{$36$ samples~~~~} \ncline[linewidth=4pt]{2,5}{2,6}^{~~~~~~~~~~$18$ coefficients}
\ncline{->}{4,2}{4,3} \ncline{->}{4,3}{4,4}^{$1152$ samples} \ncline{->}{4,4}{4,5}^{$36$ samples~~~~} \ncline[linewidth=4pt]{4,5}{4,6}^{~~~~~~~~~~$18$ coefficients}
\ncline{1,2}{2,2}
\ncline[linestyle=dotted]{2,2}{3,2} \ncline{3,2}{4,2}
\end{dspBlocks}
\caption{Diagram of the hybrid filter bank. There are 32 parallel processing lines.}
\label{filter_bank}
\end{figure}
\end{center}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{The encoder: quantization and coding}
The third step is to quantize the output of the hybrid filter bank and to encode it into bits.
\subsubsection{Scalefactor bands}
First, the 576 values output by the hybrid filter bank are grouped into new bands, called \textbf{scalefactor bands}. The scalefactor bands roughly follow a Bark scale based on the human perceptual bands. Depending on the window type set in the previous block, a total of either 21 or 12 scalefactor bands is used. The bandwidth and center frequency of each band are stored in lookup tables defined in the standard. The motivation behind scalefactor bands is to be able to rescale the content of each band based on the perceptual coding model.
\subsubsection{Double loop structure}
The quantization and coding of the data within one band is not done in one pass; instead, two nested loops are used to optimize the choice of the quantization precision and the code words for each sample. First, a default \textbf{scalefactor} of value 1 is used for each band.\\
\begin{figure}[h]
\begin{minipage}{\textwidth}
\centering
\includegraphics[width=.7\linewidth]{figures/double_loop.png}
\caption[Caption for LOF]
{Diagram of the double loop\protect\footnotemark.}
\label{double_loop}
\end{minipage}
\end{figure}
\footnotetext{Diagram by Mark Handley, \url{http://www0.cs.ucl.ac.uk/teaching/GZ05/05-music-coding.pdf}.}
The scaled data of all bands together then enters the \textbf{inner iteration loop}, where it is non-uniformly quantized: it is first raised to the power 0.75, so that small values are given more precision than large values. Then, the data is encoded using a \textbf{Huffman code}. The choice of the Huffman code is explained in the dedicated subsection below. If the code produced by the Huffman coder is longer than a threshold computed from the bit rate set by the user, it is rejected. The algorithm then increases the step size of the quantization, reducing its precision and shortening the resulting Huffman code, and repeats until the code length goes below the maximum bit rate threshold.\\
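The non-uniform quantization step can be sketched as follows (a simplification: real encoders also subtract a small offset before rounding):
\begin{verbatim}
import numpy as np

def inner_loop_quantize(x, step):
    """Power-law quantization: small values keep more precision."""
    return np.sign(x) * np.floor((np.abs(x) / step) ** 0.75)
\end{verbatim}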
When the bit rate threshold is met, the algorithm exits the inner loop, computes the quantization noise for each scalefactor band, and compares this value to the masking threshold of the band. This is done using the information provided by the perceptual coding block, using the absolute threshold and the masking effects: given the center frequency of the band, and the amplitude of the loudest signal in this band in the present, past, and future set of samples, the perceptual coder can find the minimum amplitude that will be heard in the scalefactor band (i.e. the masking threshold). If the quantization noise is louder than the masking threshold, the scalefactor of the band is increased and the algorithm goes back to re-computing the inner loop. This is the \textbf{outer iteration loop} of the algorithm.\\
In figure \ref{scale_factors}, one can observe the outcome of the double loop. The scalefactor of each band has been increased until the quantization noise dropped below the masking threshold, so that it is not heard anymore. A more detailed explanation of the relationship between the signal level, masking threshold, and quantization noise is provided in \cite{herre1999temporal}.
\begin{figure}[h]
\begin{minipage}{\textwidth}
\centering
\includegraphics[width=.4\linewidth]{figures/scalefactors.png}
\caption[Caption for LOF]
{Example diagram of the outcome of the double loop. The x-axis represents the 576 coefficients from the MDCT, grouped into scalefactor bands, and the y-axis represents the amplitude of each coefficient. The blue curve is the masking threshold computed by the perceptual coder, the green staircase shows the scalefactors computed for each band, and the coefficient levels are shown in red.\protect\footnotemark}
\label{scale_factors}
\end{minipage}
\end{figure}
\footnotetext{Diagram by Prof. Karlheinz Brandenburg \url{https://www.tu-ilmenau.de/fileadmin/media/mt/lehre/ma_mt/audiocoding/Lecture_WS2012_13/05_bdg_Quantization_WS1213.pdf}.}
\subsubsection{Huffman coding}
In the inner loop, the data is quantized using the optimal step size after being scaled by the scalefactor and non-linearized. The data is then cut into 5 frequency bands, under the assumption that higher frequency bands contain only smaller, or even zero, values. The 3 lowest frequency bands are called the \textbf{big values} and are each encoded using the best Huffman code out of a list of 10 predefined Huffman codes stored in lookup tables\footnote{See the list of the codes here: \url{https://www.tu-ilmenau.de/fileadmin/media/mt/lehre/ma_mt/audiocoding/Lecture_WS2012_13/05_bdg_Quantization_WS1213.pdf}}. These 10 Huffman codes have different sizes so as to best match the quantization step size. The fourth band, called \textbf{count 1}, contains only quantized values in the set $\{-1, 0, 1\}$: only 2 bits are needed per value, and one of two special Huffman codes is used. The last band, called \textbf{Rzero}, contains only zeros. The limits between the bands are determined after inspection of the quantized values and are set before computing the Huffman codes.
\begin{figure}[h]
\begin{center}
\begin{tabular}{|c|c|c|c|c|}
\hline
Big values 1 & Big values 2& Big values 3& Count 1 & Rzero \\
\hline
\end{tabular}
\end{center}
\caption{Division of the 576 coefficients into 5 bands for Huffman coding.}
\label{huffman_group}
\end{figure}
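Although MP3 selects among predefined Huffman tables rather than building a code per file, the principle of Huffman coding can be sketched generically (a textbook construction, not MP3's actual tables):
\begin{verbatim}
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a prefix code where frequent symbols get short words."""
    heap = [(count, i, {sym: ""})
            for i, (sym, count) in enumerate(Counter(symbols).items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        c0, _, code0 = heapq.heappop(heap)  # two rarest subtrees
        c1, _, code1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in code0.items()}
        merged.update({s: "1" + w for s, w in code1.items()})
        heapq.heappush(heap, (c0 + c1, i, merged))
        i += 1
    return heap[0][2]

print(huffman_code([0, 0, 0, 0, 1, 1, -1, 2]))
\end{verbatim}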
\subsection{The encoder: bit stream encoder}
The last step is to store the code output by the Huffman coder into an actual MP3 file that can be read by a decoder. An MP3 file is divided into time fragments called \textbf{frames}, each coding 1152 samples, as before. The size in bits of a frame hence depends on the selected bit rate. A frame is composed of 5 distinct blocks, as shown in figure \ref{frame_layout}. Let us review the 3 main blocks of a frame.\footnote{Read \url{http://www.mp3-tech.org/programmer/docs/mp3_theory.pdf} for detailed information about all the data stored during encoding.}
\begin{figure}[h]
\begin{center}
\begin{tabular}{ |c|c|c|c|c| }
\hline
Header & (CRC) & Side information & Main data & (Ancillary data) \\
\hline
\end{tabular}
\end{center}
\caption{Composition of a frame. The blocks in parentheses are optional and will not be reviewed.}
\label{frame_layout}
\end{figure}
\subsubsection{Header}
The header is 32 bits long and contains information about the frame, primarily synchronization information, which makes it possible to navigate through an MP3 file and to broadcast it without shuffling the order of the frames. Additional information notably includes the bit rate choice (4 bits), the sampling frequency (2 bits), and the channel mode (2 bits).
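For illustration, the fixed bit positions of these fields can be extracted as follows (a sketch following the MPEG-1 header layout; the returned indices must still be looked up in the standard's bit rate and sampling frequency tables):
\begin{verbatim}
def parse_mp3_header(header):
    """Extract a few fields from a 32-bit MPEG-1 Layer III header."""
    assert (header >> 21) & 0x7FF == 0x7FF, "missing sync word"
    return {
        "bitrate_index":    (header >> 12) & 0xF,  # 4 bits
        "samplerate_index": (header >> 10) & 0x3,  # 2 bits
        "channel_mode":     (header >> 6) & 0x3,   # 2 bits
    }
\end{verbatim}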
\subsubsection{Main data}
As its name suggests, the main data contains the actual encoded music. It is divided into two parts, called \textbf{granules}, of 576 samples each, as shown in figure \ref{main_data}. Each granule is decomposed into either one or two channels, depending on the channel mode, and each channel is itself decomposed into two blocks: the scalefactor block and the Huffman code block.\\
\begin{figure}[h]
\begin{center}
\begin{tabular}{ |c|c|c|c|c|c|c|c| }
\hline
\multicolumn{8}{|c|}{\textbf{Main data}} \\
\hline
\multicolumn{4}{|c|}{Granule 0} & \multicolumn{4}{|c|}{Granule 1} \\
\hline
\multicolumn{2}{|c|}{Left channel} & \multicolumn{2}{|c|}{Right channel} & \multicolumn{2}{|c|}{Left channel} & \multicolumn{2}{|c|}{Right channel} \\
\hline
Scale F & Huffman & Scale F & Huffman & Scale F & Huffman & Scale F & Huffman\\
\hline
\end{tabular}
\end{center}
\caption{Organization of a main data block, with Stereo channel mode.}
\label{main_data}
\end{figure}
For each channel of each granule, the scalefactor block contains the list of the (either 21 or 12) scalefactor values for each scalefactor band, obtained in the double loop. The Huffman block contains the output of the Huffman trees for the frame. All the information about the selection of the scalefactor bands and the Huffman trees is contained in the side information.
\subsubsection{Side information}
The side information block describes how to decode the main data block. Most of the information is given for each granule, notably
\begin{itemize}
\item The size of the Big values (1,2,3), Count 1, and Rzero bands.
\item The global gain of the granule.
\item The number of bits used to encode the value of each scalefactor (can be different for each).
\item The type of window used (out of 4 types), implicitly indicating the number of scalefactor bands.
\item The choice of the Huffman code to use for decoding the big values, and count 1.
\end{itemize}
Finally, a diagram summarizing the implementation of each block can be seen in figure \ref{mp3_precise}.
\begin{figure}[h]
\begin{minipage}{\textwidth}
\centering
\includegraphics[width=\linewidth]{figures/mp3_precise.png}
\caption[Caption for LOF]
{Complete implementation of an MP3 encoder\protect\footnotemark.}
\label{mp3_precise}
\end{minipage}
\end{figure}
\footnotetext{Diagram from \cite{Brandenburg1999MP3AA}.}
\section{Exercises}
\begin{enumerate}
\item Compute the FIR linear predictor coefficients $a$ and residual error $e$ for the signal
\begin{align*}
x = [11, 3, 5, 8, 13, 8, 5, 9].
\end{align*}
We assume that $N = 2$, that 4 unsigned bits are used to encode each sample $x[i]$ and residual error $e[i]$, and that 2 signed (two's-complement) bits are used to encode each LPC coefficient $a_k$. Zero-padding is used on the left.
\item Compute the value of $k$ that minimizes the length of the Rice code produced after encoding the signal
$$ x = [1,0,7,5,2,2,2,0] .$$
\item Compute one optimal Huffman code for the signal of the previous exercise.
\end{enumerate}
