\documentclass[a4paper,oneside]{article}
%\usepackage[francais]{babel}
%\usepackage[utf8]{inputenc}
%\usepackage[T1]{fontenc}
\usepackage{subfig}
\usepackage{graphicx}
\graphicspath{{../fig/}}
\usepackage{amsmath}
\usepackage{verbatim}
\usepackage{epstopdf}
\usepackage[colorlinks,bookmarks=false,citecolor=black,linkcolor=blue,urlcolor=blue]{hyperref}
\hypersetup{%
colorlinks = true,
linkcolor = black
}
\usepackage{wrapfig}
\usepackage{enumitem}
%\usepackage{siunitx}
\usepackage{nameref}
\usepackage{amssymb}
\paperheight=297mm
\paperwidth=210mm
\setlength{\textheight}{235mm}
\setlength{\topmargin}{-1.2cm} % to center the page vertically
%\setlength{\footskip}{5mm}
\setlength{\textwidth}{15cm}
\setlength{\oddsidemargin}{0.56cm}
\setlength{\evensidemargin}{0.56cm}
\pagestyle{plain}
% some useful abbreviations
\def \be {\begin{equation}}
\def \ee {\end{equation}}
\def \dd {{\rm d}}
\newcommand{\mail}[1]{{\href{mailto:#1}{#1}}}
\newcommand{\ftplink}[1]{{\href{ftp://#1}{#1}}}
\newcommand{\fracOne}[1]{\frac{1}{#1}}
\newcommand{\diff}[1]{\mathrm{d}#1}
\newcommand{\grad}{\vec{\nabla}}
\newcommand{\abs}[1]{\left| #1 \right|}
\newcommand{\abssq}[1]{\abs{#1}^2}
\newcommand{\matr}[1]{{#1}}
\begin{document}
% Title, author and date
%%\title{\Huge Rapport d'échange Erasmus\\
%% \huge Technische Universität Wien}
%%\date{\today}
%%\author{Nicolas Rime\\
%%{\small \mail{n.rime@hotmail.com}}}
%%\maketitle
%%\tableofcontents % Table des matieres
% Some options for the line spacing, the indentation of
% new paragraphs, and the spacing between paragraphs
\baselineskip=16pt
\parindent=15pt
\parskip=5pt
\newpage
\section{Introduction}
Since the very beginning of science, physicists have tried to model the evolution of systems in order to give a reasonable explanation of the dynamics of objects. Newton was the first to state a compact law of motion in the form of an ODE. %which describes the variation of momentum of a particle as a consequence of the forces applied.
%Since then new concepts were elaborated to improve the laws of physics. Special relativity for example was introduced to leave the speed of light invariant under a change of frame. It was a major breakthrough as before 1905, time was considered like a perfect independent variable on which the spacial evolution of a particle was depending.
Since then, many other models have been based on ODEs or PDEs. For reasonably simple systems, we can verify experimentally that the models are conclusive, and theories were developed to explain the nature underlying these models, since it was long impossible to observe the structure of matter such as molecules or the interior of an atom. For more complex systems, however, we generally fail to derive a general law. Data are generated experimentally or through simulations, but as mechanical analysis has shown, it is sometimes impossible to predict the evolution of a system accurately because of a lack of conserved quantities. The idea is therefore to find a method other than analysis to predict the time evolution. Machine Learning (ML) is an emerging alternative approach: by optimising parameters, it can provide convenient answers to difficult problems such as time-series prediction. Nonetheless, the dynamics of such algorithms is not well understood. It is a rising challenge for physicists to find a theory explaining exactly how ML works and to use it to predict complex systems.
\section{Basic ML algorithm}
One of the most basic ML algorithms is linear regression. Let us consider a discrete series with value $x_n$ at step $n = 1,\dots,N$. Given $x_n$, we want to estimate the next time step $x_{n+1}$ using two parameters, called weights $w_0, w_1$, such that $\hat{x}_n = w_0 + w_1\cdot x_n$, where $\hat{x}_n$ is the estimator of $x_{n+1}$. The weight $w_0$ is called the bias. We now split the data into a training set containing $m+1$ points and a validation set containing $N-m-1$ points. For example, take $N = 100$ and $m = 60$, and arbitrarily put the first 61 values of the series in the training set and the last 39 in the validation set. We then train the algorithm using only the 61 values of the training set. For clarity, a vectorial notation is used:
\be
X =
\begin{pmatrix}
1 & x_1 \\
\vdots & \vdots\\
1 & x_{60} \\
\end{pmatrix}
,
\quad
\vec w =
\begin{pmatrix}
w_0\\
w_1 \\
\end{pmatrix}
,
\quad
\vec y =
\begin{pmatrix}
x_2 \\
\vdots\\
x_{61} \\
\end{pmatrix}
\quad
\mathrm{and}
\quad
\hat{x} = X\vec w
\ee
The goal is to find the values of the weights that minimize the distance between $\hat{x}$ and $\vec y$. Thus we define the error function, or loss function:
\be
L(\vec w) = \frac{1}{2m}\lVert\hat{x}(\vec w) - \vec y\rVert ^2
\ee
An algorithm called gradient descent is used to minimize the loss function. The gradient of $L$ is given by:
\be
\vec \nabla _w L = \frac{1}{m}X^\top (\hat{x}(\vec w) - \vec y)
\ee
and gradient descent readjusts the weights iteratively until the gradient of $L$ is close to zero, i.e. until the loss function is at a local minimum. The program therefore computes
\be
\vec w \leftarrow \vec w - \alpha \vec \nabla _w L
\ee
until $\lVert\vec \nabla _w L\rVert$ is sufficiently small. The factor $\alpha$ is called the learning rate; typically of order $10^{-2}$, it prevents excessively large variations of the weights. The algorithm is now trained and we can use it to predict the values of the validation set recursively:
\be
\begin{aligned}
\hat{x}_{61} &= w_0 + w_1\cdot x_{61}\\
\hat{x}_{n+1} &= w_0 + w_1\cdot \hat{x}_{n}, \quad n = 61,\dots,N-2
\end{aligned}
\ee
To compare this algorithm with others, we also compute the validation error:
\be
L_{val} = \frac{1}{2(N - m - 1)}\lVert\hat{x}_{val} - \vec x_{val}\rVert ^2
\ee
where $val$ denotes the validation set, i.e. the estimators $\hat{x}_{61},\dots,\hat{x}_{N-1}$ of the targets $x_{62},\dots,x_{N}$. \\
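The whole procedure above can be sketched in Python (a minimal example on an artificial linear series; all variable names and values are illustrative):

```python
import numpy as np

# Artificial linear series x_n = 2 + 0.5 n for n = 1, ..., N
N = 100
series = 2.0 + 0.5 * np.arange(1, N + 1)

# Training set: the first m + 1 = 61 values give m = 60 input/target pairs
m = 60
X = np.column_stack([np.ones(m), series[:m]])  # rows (1, x_n) for n = 1..60
y = series[1:m + 1]                            # targets x_2, ..., x_61

# Gradient descent on L(w) = ||X w - y||^2 / (2m)
w = np.zeros(2)
alpha = 4e-3  # learning rate, small enough here to keep the updates stable
for _ in range(50000):
    grad = X.T @ (X @ w - y) / m
    w -= alpha * grad
train_loss = np.sum((X @ w - y) ** 2) / (2 * m)

# Recursive prediction over the validation set: each estimate feeds the next
preds = []
last = series[m]             # x_61, the last value of the training set
for _ in range(N - m - 1):   # 39 estimates of x_62, ..., x_100
    last = w[0] + w[1] * last
    preds.append(last)
val_loss = np.sum((np.array(preds) - series[m + 1:]) ** 2) / (2 * (N - m - 1))
```

For this linear series the weights converge to $(0.5, 1)$, since $x_{n+1} = x_n + 0.5$, and both errors vanish.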
Note that instead of gradient descent we could have used the normal equation, which gives the optimal weights directly:
\be
\vec w = (X^\top X)^{-1} (X^\top \vec y)
\ee
but the concept of gradient descent will be useful in the next sections. Concerning the choice of $m$, it is typically taken between $0.6N$ and $0.7N$. We could also have used more than one previous point; here is the general notation for $p$ previous points:
\be
X =
\begin{pmatrix}
1 & x_1 & \dots & x_p\\
\vdots & \vdots & & \vdots\\
1 & x_{m} & \dots & x_{m+p-1}\\
\end{pmatrix}
,
\quad
\vec w =
\begin{pmatrix}
w_0\\
\vdots\\
w_p \\
\end{pmatrix}
,
\quad
\vec y =
\begin{pmatrix}
x_{1+p} \\
\vdots\\
x_{m+p} \\
\end{pmatrix}
\quad
\mathrm{and}
\quad
\hat{x} = X\vec w
\ee
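This general construction can be sketched in one step via the normal equation (a toy sine series; least squares is used in place of the explicit inverse because $X^\top X$ can be singular, and all names are illustrative):

```python
import numpy as np

# Toy series and lag order (illustrative values)
series = np.sin(0.1 * np.arange(1, 101))
p, m = 3, 60

# Lagged design matrix: row i is (1, x_i, ..., x_{i+p-1}), target x_{i+p}
X = np.column_stack([np.ones(m)] + [series[i:i + m] for i in range(p)])
y = series[p:p + m]

# Normal-equation solution w = (X^T X)^{-1} X^T y, computed by least squares
# since the lag columns of a pure sinusoid are linearly dependent
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# One-step prediction of the value just after the training window
x_next_hat = np.concatenate(([1.0], series[m:m + p])) @ w
```

A sinusoid obeys an exact linear recurrence of order two, so the fitted weights reproduce the series and the one-step prediction matches the true next value.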
This algorithm gives accurate predictions for linear or exponential functions, as the error tends to zero. In fact, one can show that if $w_0$ and $w_1$ are independent of $n$ then the error tends to zero. For completely arbitrary series, however, more sophisticated approaches exist. An architecture called Long Short-Term Memory (LSTM) has been shown to give very accurate predictions for irregular series. But before introducing LSTM, I need to introduce a subset of Machine Learning called Deep Learning (DL), and more precisely the Artificial Neural Network (ANN), or shortly Neural Network (NN).
\section{Welcome into your brain's machinery: Neural Network (NN)}
In recent years, the efficiency of NNs has been constantly improving. Even though the architecture was invented a few decades ago, only modern computers have made it practical to run. NNs excel at pattern-recognition tasks such as recognizing a handwritten digit or a voice. Despite their incredible accuracy in pattern recognition, they are not well suited to series prediction, but other architectures better in this field are derived from NNs. To begin with, let us consider a problem with a simple NN.\\
Imagine you are a new robot with a specific task: you need to throw a ball into a basket. As a new robot, you have no idea how to accomplish the task. Luckily, you can learn it by trying. To make the problem easier, suppose you only need to learn how hard to throw the ball in a fixed direction. This means converting electrical energy into mechanical energy: you can choose the electrical tension you give to your arm to throw the ball, exactly as a human brain, following your will, orders your arm to move. If the tension is too low or too high, the ball misses the basket. To spare you from making the throws yourself, another robot exactly like you, a prototype that we will call Robot0, has already made them for you. You only receive the results of Robot0, and from these you can optimise your brain. Obviously, we could try to determine the values analytically using the laws of motion, but it would take more time and would not yield the best values, since a real robot is not perfect like the theoretical model.\\
Suppose you receive a set of inputs of $N$ throws with $m$ features. To simplify, take $N = 50$ and $m = 1$; the only feature is then the tension allocated to the arm. You also receive the set of binary results of the $N$ throws: 0 is a miss and 1 is a success. A vectorial notation is again used:
\be
X =
\begin{pmatrix}
1 & x_1 \\
\vdots & \vdots\\
1 & x_{50} \\
\end{pmatrix}
,
\quad
\vec y =
\begin{pmatrix}
y_1 \\
\vdots\\
y_{50} \\
\end{pmatrix}
\ee
where a bias was added again to the input. The diagram of the simple NN is presented in fig. %\ref{NN}.
!!Figure of the scheme\\
As you can see on the diagram, the NN is composed of three layers. The first layer is called the input layer. It possesses 2 neurons: the first neuron is the bias (always 1) and the second is the input $x_i$ of the $i$-th throw. The second layer is called the hidden layer. A NN can have many hidden layers, but for simplicity this example has only one. Its first neuron is again a bias and the two following neurons are hidden neurons. The third layer is called the output layer. It has no bias and, in our case, only one neuron: the output.\\
The neurons are connected through synapses (the lines), each with a different weight. Moreover, each neuron (except those in the input layer) possesses an activation function. This is the main difference from a linear algorithm, as the activation function is not linear. Two activation functions can be used. The first one is the sigmoid function $\sigma (z) = \frac{1}{1 + e^{-z}}$ and the second one is $\tanh (z)$. As you can see in fig. (graph of both functions), $\sigma (z)$ goes from 0 to 1 and is exactly 0.5 when $z = 0$. $\tanh (z)$ has the same shape but goes from -1 to 1 and is 0 when $z = 0$. We will only use the sigmoid function $\sigma$ from now on. These two functions are an improvement over a step function (or Heaviside function), which is 0 when the input is negative and 1 when it is 0 or positive, as they are continuous and differentiable.\\
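A quick numerical check of these properties of the activation functions (a minimal sketch):

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps the reals to (0, 1), with value 0.5 at z = 0
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))   # 0.5
print(np.tanh(0.0))   # 0.0
# Far from the origin, sigma saturates towards 0 and 1 respectively
print(sigmoid(-10.0), sigmoid(10.0))
```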
\subsection{Compute the output: the feed-forward method}
To keep it simple, let us consider the $i$-th throw and write $x = x_i$, spelling the different variables out explicitly. Let us define:
\be
a^{1} = \begin{pmatrix}
1 \\
x \\
\end{pmatrix}
\quad
z^{2} = W^1 a^1 =
\begin{pmatrix}
w_{11}^1 & w_{12}^1 \\
w_{21}^1 & w_{22}^1 \\
\end{pmatrix}
\begin{pmatrix}
1 \\
x \\
\end{pmatrix}
=
\begin{pmatrix}
w_{11}^1 + w_{12}^1x \\
w_{21}^1 + w_{22}^1x \\
\end{pmatrix}
\ee
The vector $z$ contains the values before the activation function (before going through the neurons) and the vector $a$ the values after the activation function (after going through the neurons). The upper index indicates the layer, the lower one the component of the vector. Then:
\be
a^2 =
\begin{pmatrix}
1 \\
\sigma (z^2) \\
\end{pmatrix}
=
\begin{pmatrix}
1 \\
(1 + e^{-z_1^2})^{-1} \\
(1 + e^{-z_2^2})^{-1} \\
\end{pmatrix}
\quad
z^3 = W^2a^2 =
\begin{pmatrix}
w_1^2 & w_2^2 & w_3^2 \\
\end{pmatrix}
\begin{pmatrix}
1 \\
(1 + e^{-z_1^2})^{-1} \\
(1 + e^{-z_2^2})^{-1} \\
\end{pmatrix}
\ee
and finally the output:
\be
a^3 = \sigma (z^3)
\ee
%general
Generally, for $N$ throws, $m$ features and $l$ layers:
\be
a^1 = X^\top = \begin{pmatrix}
1 & \dots & 1 \\
x_{1;1} & \dots & x_{1;N}\\
\vdots & & \vdots \\
x_{m;1} & \dots & x_{m;N}\\
\end{pmatrix}
\ee
\be
z^{j+1} = W^j a^j;
\quad
a^{j+1} = \begin{pmatrix}
1 \\
\sigma (z^{j+1}) \\
\end{pmatrix}
\quad
j=1,\dots ,l-2;
\quad
a^{l} = \begin{pmatrix}
\sigma (z^{l-1}) \\
\end{pmatrix}
\ee
where the number of columns of $W^j$ is equal to the number of neurons of the $j$-th layer including the bias; for $j = 1$ this is the number of features plus one, since the number of neurons of the first layer is determined by the number of features plus the bias. The number of rows of $W^j$ is equal to the number of neurons of the $(j+1)$-th layer \underline{without} the bias neuron. The number of neurons of the output layer is equal to the desired number $K$ of outputs. The vector $\vec y$ contains the output data. It is necessary to encode it as a binary vector of 0s and 1s, with one component per each of the $K$ types of data. For example, for digit recognition, $K = 10$.\\
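The feed-forward pass for the small 2--2--1 network of the example can be sketched as follows (the weight values are arbitrary and purely illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative weights: W1 maps the input layer (bias + 1 feature) to the
# 2 hidden neurons; W2 maps the hidden layer (bias + 2 neurons) to the output.
W1 = np.array([[0.1, 0.4],
               [-0.3, 0.2]])      # shape (2, 2): hidden neurons x (bias + feature)
W2 = np.array([[0.5, -0.2, 0.7]])  # shape (1, 3): output x (bias + 2 hidden)

def feed_forward(x):
    a1 = np.array([1.0, x])                    # input layer with its bias
    z2 = W1 @ a1                               # pre-activations of the hidden layer
    a2 = np.concatenate(([1.0], sigmoid(z2)))  # bias prepended after activation
    z3 = W2 @ a2
    a3 = sigmoid(z3)                           # network output, in (0, 1)
    return a3[0]

out = feed_forward(0.8)
```

Note how the matrix shapes follow the rule above: the columns of each $W^j$ match the biased layer $j$, the rows match the unbiased layer $j+1$.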
Now that we know how to compute the output, we need the error. The cost function is different from that of linear regression:
\be
L(W) = \frac{1}{N}\sum_{i=1}^{N} \sum_{k = 1}^{K} [-y_{i,k}\ln a^l_{i,k} - (1-y_{i,k})\ln(1-a^l_{i,k})]
\ee
As in our example $K = 1$, this reduces to:
\be
L(W) = \frac{1}{N}\sum_{i=1}^{N} [-y_i\ln a^l - (1-y_i)\ln(1-a^l)] \quad \mathrm{(for~our~example)}
\ee
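For $K = 1$ this cost is the average binary cross-entropy; a minimal numerical sketch (toy targets and outputs, all values illustrative):

```python
import numpy as np

# Toy targets and network outputs for N = 4 throws
y = np.array([0.0, 1.0, 1.0, 0.0])
a = np.array([0.1, 0.9, 0.8, 0.3])   # outputs a^l of the network, in (0, 1)

# L(W) = (1/N) sum_i [-y_i ln a_i - (1 - y_i) ln(1 - a_i)]
N_throws = len(y)
L = np.sum(-y * np.log(a) - (1.0 - y) * np.log(1.0 - a)) / N_throws

# The cost shrinks as the outputs approach the targets
a_good = np.array([0.01, 0.99, 0.99, 0.01])
L_good = np.sum(-y * np.log(a_good) - (1.0 - y) * np.log(1.0 - a_good)) / N_throws
```

Each term penalizes a confident wrong output heavily (the logarithm diverges as $a \to 0$ or $a \to 1$ on the wrong side) and vanishes for a correct one.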
The idea now is to minimize this cost function. The gradient can no longer be written in a single closed form as in linear regression; instead, it is computed layer by layer with an algorithm called back-propagation.
\subsection{Optimizing the weights: the back-propagation algorithm}
Back-propagation is applied to one example at a time, here the $i$-th; a loop over all examples is then necessary.
\begin{enumerate}
\item Apply the feed-forward method, keeping all $a^j$ and $z^j$, $j=1,\dots ,l$.
\item For each of the $K$ output neurons of the output layer, compute:
\be
\delta_k^l = a_k^l - y_k
\ee
\end{enumerate}
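Steps 1 and 2 can be sketched for the 2--2--1 example network as follows (weights and input are illustrative; the remaining steps of back-propagation, which carry $\delta$ backwards through the hidden layers, are not shown here):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative weights of the 2-2-1 example network
W1 = np.array([[0.1, 0.4], [-0.3, 0.2]])
W2 = np.array([[0.5, -0.2, 0.7]])

# Step 1: feed-forward, keeping all a^j and z^j
x, y = 0.8, 1.0                            # i-th example: input and target
a1 = np.array([1.0, x])
z2 = W1 @ a1
a2 = np.concatenate(([1.0], sigmoid(z2)))
z3 = W2 @ a2
a3 = sigmoid(z3)

# Step 2: error of the output layer, delta^l = a^l - y
delta3 = a3 - y
```

With $y = 1$ and an output $a^l$ in $(0, 1)$, the error $\delta^l$ is negative: the output must be pushed up.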
%Bibliography====================================================================================================
%\begin{thebibliography}{99}
%\bibitem{ex}
%\end{thebibliography}
%\vfill
%\newpage
%\section{Annex}
%partieA==================================================================================
\end{document}