\begin{frame}{Training the Neural Network: Adjoint Method}
We aim to minimize $J:\R^p\to\R$, $$J(\yy,t_f,\te)=J\left(\yy(t_0)+\int_{t_0}^{t_f}f(\yy,t,\te)\diff t \right)=J(\text{\texttt{\alert{ODE\_Solver}}}(f,\yy(t_0),\te,t_0=0,t_f=1)).$$
\begin{enumerate}
\item There is no notion of layers, since we are in the continuous limit.
\end{enumerate}
How does $J$ depend on $\te$ through $\yy(t)$ at each instant $t$?
Don't use back-prop, but rather the \tinto{adjoint-state method} (Pontryagin et al., 1962).
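As a minimal sketch of the forward pass above: the network output is just an ODE solve, $\yy(t_f)=\yy(t_0)+\int_{t_0}^{t_f}f(\yy,t,\te)\diff t$. Here \texttt{euler\_solve} is an assumed stand-in for the black-box \texttt{ODE\_Solver}, and the toy dynamics $f(\yy,t,\te)=\te\yy$ is illustrative, not the method's actual network.

```python
# Minimal sketch: the forward pass of a neural ODE is an ODE solve,
#   y(t_f) = y(t_0) + \int_{t_0}^{t_f} f(y, t, theta) dt.
# euler_solve is an illustrative stand-in for the black-box ODE_Solver.

def euler_solve(f, y0, theta, t0=0.0, tf=1.0, n_steps=1000):
    """Fixed-step forward Euler integration of dy/dt = f(y, t, theta)."""
    h = (tf - t0) / n_steps
    y, t = y0, t0
    for _ in range(n_steps):
        y = y + h * f(y, t, theta)
        t += h
    return y

# Toy "continuous layer" dy/dt = theta * y, whose exact solution is
# y(tf) = y(t0) * exp(theta * (tf - t0)).
f = lambda y, t, theta: theta * y

y_tf = euler_solve(f, y0=1.0, theta=0.5)
```

Any fixed- or adaptive-step solver can replace \texttt{euler\_solve}; the point is only that the "network" is the integrator.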
\end{frame}
%%
%%
%%
\begin{frame}{Training the Neural Network: Adjoint Method}
Define first $$G(\yy,t_f,\te):=\int_{t_0}^{t_f} J(\yy,t,\te)\diff t, \quad\frac{\diff}{\diff t_f}G(\yy,t_f,\te)=J(\yy,t_f,\te)$$ and the Lagrangian $$L=G(\yy,t_f,\te)+\int_{t_0}^{t_f}\aat(t)\left(\dot{\yy}(t,\te)-f(\yy,t,\te)\right)\diff t$$
Then,
\begin{align}
\frac{\partial L}{\partial\te}=\int_{t_0}^{t_f}\left(\frac{\partial J}{\partial\yy}\frac{\partial\yy}{\partial\te} +\frac{\partial J}{\partial\te}\right)\diff t +\int_{t_0}^{t_f}\aat(t)\left( \blue{\frac{\partial\dot{\yy}}{\partial\te}}-\frac{\partial f}{\partial\yy}\frac{\partial\yy}{\partial\te}- \frac{\partial f}{\partial\te}\right)\diff t
\end{align}
Integrating by parts (IBP):
\begin{align}
\int_{t_0}^{t_f}\aat(t)\blue{\frac{\partial\dot{\yy}}{\partial\te}}\diff t=\aat(t)\frac{\partial{\yy}}{\partial\te}\rvert_{t_0}^{t_f}-\int_{t_0}^{t_f}\dat(t)\blue{\frac{\partial{\yy}}{\partial\te}}\diff t,
\end{align}
so that
\begin{align}
\frac{\partial L}{\partial\te}&=\int_{t_0}^{t_f}\left(\frac{\partial\yy}{\partial\te}\right)\alert{\left(\pd{J}{\yy} -\aat\pd{f}{\yy}-\dat\right)}\diff t+\int_{t_0}^{t_f}\left(-\aat\pd{f}{\te}+\pink{\pd{J}{\te}}\right)\diff t +\left(\aat\pd{\yy}{\te}\right)_{t_0}^{t_f}
\end{align}
Setting $\alert{\left(\frac{\partial J}{\partial\yy} -\aat\pd{f}{\yy}-\dat\right)}=0$, $\aat(t_f)=0$, one gets
\begin{align}
\frac{\partial L}{\partial\te}&=\int_{t_0}^{t_f}\left(-\aat\pd{f}{\te}+\pink{\pd{J}{\te}}\right)\diff t +\left(\aat\pd{\yy}{\te}\right)_{\blue{t_0}}^{t_f}\\&=\int_{t_0}^{t_f}\left(-\aat\pd{f}{\te}+\pink{\pd{J}{\te}}\right)\diff t -\aat(t_0) \pd{\yy}{\te}(t_0)
\end{align}
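The derivation above can be checked numerically. The sketch below is a toy instance, assuming dynamics $f(\yy,t,\te)=\te\yy$ and running cost $J(\yy)=\yy^2$ (both illustrative): it solves the forward ODE, integrates the adjoint ODE $\dat=\pd{J}{\yy}-\aat\pd{f}{\yy}$ backward from $\aat(t_f)=0$, and accumulates $\pd{L}{\te}=\int(-\aat\pd{f}{\te}+\pd{J}{\te})\diff t$; the function name is assumed, not from the paper.

```python
# Numerical check of the slide's adjoint formulas on the toy problem
#   dy/dt = f(y,t,theta) = theta * y,   y(0) = 1,
#   G(theta) = \int_0^1 J(y) dt  with running cost J(y) = y^2.
# Adjoint ODE (from setting the highlighted bracket to zero):
#   dlambda/dt = dJ/dy - lambda * df/dy,   lambda(t_f) = 0,
# then dL/dtheta = \int ( -lambda * df/dtheta + dJ/dtheta ) dt.

def adjoint_gradient(theta, t0=0.0, tf=1.0, n_steps=10_000):
    h = (tf - t0) / n_steps
    # Forward pass: store the trajectory y(t) (Euler steps).
    ys = [1.0]
    for _ in range(n_steps):
        ys.append(ys[-1] + h * (theta * ys[-1]))   # f = theta * y
    # Backward pass: integrate the adjoint from lambda(tf) = 0.
    lam, grad = 0.0, 0.0
    for k in range(n_steps, 0, -1):
        y = ys[k]
        grad += h * (-lam * y)                     # df/dtheta = y, dJ/dtheta = 0
        lam -= h * (2.0 * y - lam * theta)         # dlambda/dt = 2y - lam*theta
    return grad

g = adjoint_gradient(0.5)
```

Here $G(\te)=(e^{2\te}-1)/(2\te)$ in closed form, so the result can be compared against the analytic derivative $G'(0.5)=2$.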
\end{frame}
%%
%%
%%
\begin{frame}{Adjoint method (cont'd)}
From $J(\yy,t_f,\te)=\frac{\diff}{\diff t_f} G(\yy,t_f,\te)$ one then gets
\begin{align}
\pd{J}{\te}&=\frac{\partial}{\partial t_f}\left(\int_{t_0}^{t_f}\left(-\aat\pd{f}{\te}+\pink{\pd{J}{\te}}\right)\diff t -\aat(t_0) \pd{\yy}{\te}(t_0)\right).
\end{align}
\uncover<3->{ Thus, if one knows \tinto{$f_\te$} and can compute the \blue{$\det$} of its Jacobian, one can evaluate the transformed density $p_1$. }
\uncover<3->{
This has applications in Bayesian inference, image generation, etc.
}
\uncover<3->{\textbf{Issues }
\begin{enumerate}
\item Needs invertible \tinto{$f_\te$}.
\item Computing $\blue{\det}$ costs, at worst, $\mathcal{O}(n^3)$ for $\yy\in\R^n$.
\end{enumerate}
}
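The cost contrast behind the triangular trick can be sketched in a few lines: for a triangular Jacobian the determinant is the product of the diagonal, an $\mathcal{O}(n)$ computation, while a general determinant needs $\mathcal{O}(n^3)$ elimination. Both routine names and the example matrix are illustrative.

```python
# Why a triangular Jacobian helps: its determinant is the product of the
# diagonal (O(n)), while a general n x n determinant needs O(n^3)
# Gaussian elimination. Pure-Python sketch with illustrative names.

def det_general(A):
    """O(n^3) determinant by Gaussian elimination (no pivoting safeguards)."""
    A = [row[:] for row in A]          # work on a copy
    n, det = len(A), 1.0
    for i in range(n):
        det *= A[i][i]
        for j in range(i + 1, n):
            r = A[j][i] / A[i][i]
            for k in range(i, n):
                A[j][k] -= r * A[i][k]
    return det

def det_triangular(A):
    """O(n) determinant: product of the diagonal entries."""
    det = 1.0
    for i in range(len(A)):
        det *= A[i][i]
    return det

# A lower-triangular Jacobian: both routines agree, at very different cost.
J = [[2.0, 0.0, 0.0],
     [1.0, 3.0, 0.0],
     [4.0, 5.0, 0.5]]
```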
\uncover<4->{ One solution is to take \tinto{$f_\te$} triangular, but this reduces the expressivity of the transformation.}
\uncover<5->{\pink{Continuous normalizing flows as an alternative}}
\end{frame}
%%
%%
%%
\begin{frame}{Change of variable formula via continuous transformation}
\textbf{Idea:} Don't consider a ``one-shot'' transformation, but a continuous one.
\textbf{Theorem:}
Consider a \alert{continuous-in-time} transformation $\yy(t,\te)=\yy_\te(t)$ given by $$\frac{\diff\yy_\te}{\diff t}(t)=f\left(t, \yy_\te(t),\te\right)=\tinto{f_\te}\left(\yy_\te(t),t\right)$$
Then, under the assumption that $f_\te$ is uniformly Lipschitz continuous in $\yy_\te$ and continuous in $t$, the change in log-probability is given by: $$\frac{\partial\log(p(\yy_\te(t)))}{\partial t}=-\text{Tr}\left(\frac{\partial\tinto{f_\te}}{\partial\yy_\te }(\yy_\te(t),t)\right).$$
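The theorem can be verified in closed form on a linear 1-D flow (an illustrative toy case, not from the source): for $f_\te(\yy)=\te\yy$ we have $\pd{f_\te}{\yy}=\te$, so the formula predicts $\log p_1(\yy_1)=\log p_0(\yy_0)-\te\,(t_f-t_0)$, which matches the pushforward of $\mathcal{N}(0,1)$ under $\yy\mapsto \yy e^{\te}$ being $\mathcal{N}(0,e^{2\te})$.

```python
import math

# 1-D check of the instantaneous change-of-variables formula for the
# linear flow dy/dt = f_theta(y) = theta * y, where df/dy = theta, so
#   d log p / dt = -Tr(df/dy) = -theta,
# i.e.  log p1(y1) = log p0(y0) - theta * (tf - t0).

theta, t0, tf = 0.7, 0.0, 1.0
y0 = 1.3                                   # arbitrary starting point
y1 = y0 * math.exp(theta * (tf - t0))      # exact flow map

def log_normal_pdf(y, sigma):
    """Log-density of a zero-mean normal with standard deviation sigma."""
    return -0.5 * (y / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

# Base density p0 = N(0,1); its pushforward under y -> y * e^theta
# is N(0, e^{2 theta}), so p1 is zero-mean normal with sigma = e^theta.
lhs = log_normal_pdf(y1, math.exp(theta))          # log p1(y1)
rhs = log_normal_pdf(y0, 1.0) - theta * (tf - t0)  # log p0(y0) - theta
```

The two sides agree exactly here because the trace of a $1\times1$ Jacobian is the Jacobian itself.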
\uncover<2->{
Notice that:
\begin{enumerate}
\item It involves a \pink{trace} instead of a \blue{determinant}$\implies$ cheaper.
\item $f_\te$ need not be bijective; if the ODE solution is unique, then the whole transformation is bijective.
\end{enumerate}
}
Given a \tinto{target} $p$, we construct a \alert{flow} $q$ minimizing $J=\text{KL}(q\lVert p):=\int\log\left(\frac{q(\te)}{p(\te)}\right)q(\te)\diff\te$ (assuming we can evaluate both $p$ and $q$).
This paper can be seen more from a computational perspective than the previous one. The aim is to consider the time-continuous limit of the DNN and its interpretation as an ODE. Using this, one can use \alert{black-box} ODE-solving routines.
\begin{enumerate}
\item There is no notion of layers. Use number of function evaluations as a measure of depth.
\item Can trade off accuracy against cost.
\item No control during the training phase (due to the black-box nature); more expensive than an equivalent ResNet.
\item Constant memory cost.
\item Nice applications for density transport and continuous-time models.