
\section{Convergence Rate \label{sec:cvg rate}}
In this section, we investigate the convergence rate of Algorithm~\ref{Algo:2} for finding first-order and second-order stationary points, along with the iteration complexity results.
All the proofs are deferred to Appendix~\ref{sec:theory} for the clarity of exposition.
Theorem~\ref{thm:main} below characterizes the convergence rate of Algorithm~\ref{Algo:2} for finding first- and second-order stationary points, in terms of the number of outer iterations.
\begin{theorem}\textbf{\emph{(convergence rate)}}
\label{thm:main}
Suppose that $f$ and $A$ are smooth in the sense defined in (\ref{eq:smoothness basic}). For $\rho'>0$, let
\begin{align}
\lambda'_f = \max_{\|x\|\le \rho'} \|\nabla f(x)\|,
~~
\lambda'_A = \max_{\|x\| \le \rho'} \|DA(x)\|,
\label{eq:defn_restricted_lipsichtz}
\end{align}
be the (restricted) Lipschitz constants of $f$ and $A$, respectively.
For sufficiently large integers $k_0\le k_1$, consider the interval $K=[k_0:k_1]$. Let $\{x_k\}_{k\in K}$ be the output sequence of Algorithm~\ref{Algo:2} on the interval $K$ and, for $\nu>0$, assume that
\begin{align}
\nu \|A(x_k)\|
& \le \dist\left( -DA(x_k)^\top A(x_k) , \frac{\partial g(x_k)}{ \b_{k-1}} \right),
\label{eq:regularity}
\end{align}
for every $k\in K$. We have two cases:
\begin{itemize}[leftmargin=*]
\item If a first-order solver is used in Step~2, then $x_k$ is an $(\epsilon_{k,f},\b_k)$ first-order stationary point of Program~(\ref{prob:01}) with
%\begin{align}
%& \dist^2( -\nabla_x \L_{\b_{k-1}}(x_k,y_k),\partial g(x_k)) + \| A(x_k)\|^2 \nonumber\\
%& = \frac{1}{\b_{k-1}^2}\left( \frac{8 \lambda'_A\s_1^2 + 4}{ \nu^2} ( \lambda'_f+\lambda'_A y_{\max})^2 + 2\right) \nonumber \\
%&=: \frac{Q(f,g,A)}{\beta_{k-1}^2},
%\end{align}
\begin{align}
\epsilon_{k,f} & = \frac{1}{\beta_{k-1}^2} \left( \frac{8 \lambda'_A\s_1^2 + 4}{ \nu^2} ( \lambda'_f+\lambda'_A y_{\max})^2 + 2 \right) \nonumber \\
&=: \frac{Q}{\beta_{k-1}^2},
\label{eq:stat_prec_first}
\end{align}
for every $k\in K$, where the expression for $y_{\max}=y_{\max}(x_1,y_0,\s_1)$ is given in (\ref{eq:dual growth}), and the dependence of $Q$ on $f,g,A,\s_1$ is suppressed for brevity.
\item If a second-order solver is used in Step~2, then $x_k$ is an $(\epsilon_{k,f}, \epsilon_{k,s},\b_k)$ second-order stationary point of Program~(\ref{prob:01}) with $\epsilon_{k,f}$ specified above and with
\begin{align}
\epsilon_{k,s} & = \frac{C}{\beta_{k-1}}+ \epsilon_{k-1}.
\end{align}
\end{itemize}
\end{theorem}
%A few remarks are in order about Theorem~\ref{thm:main}.
%\textbf{Tradeoff.}
%Roughly speaking, Theorem~\ref{thm:main} states that the iterates of Algorithm~\ref{Algo:2} approach first-order stationarity for~\eqref{prob:01} at the rate of $1/{\b_k}$, where $\b_k$ is the penalty weight in AL in iteration $k$ as in~\eqref{eq:Lagrangian}. Note the inherent tradeoff here: For a faster sequence $\{\b_k\}_k$, the iterates approach stationarity faster but the computational cost of the inexact primal subroutine (Step 2) increases since a higher accuracy is required and Lipschitz constant of the unconstrained problem is also affected by $\beta_k$. In words, there is a tradeoff between the number of outer and inner loop iterations of Algorithm~\ref{Algo:2}. We highlight this tradeoff for a number of subroutines below.
Loosely speaking, Theorem~\ref{thm:main} states that the iterates of Algorithm~\ref{Algo:2} approach first-order stationarity for Program~\eqref{prob:01} at the rate of $1/\b_k^2$ and second-order stationarity at the rate of $1/\b_k$. A few remarks about Theorem~\ref{thm:main} are in order.
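For instance, with the geometric penalty schedule $\b_k=b^k$ for some $b>1$, used in the corollaries below, \eqref{eq:stat_prec_first} reads
\begin{align*}
\epsilon_{k,f} = \frac{Q}{\b_{k-1}^2} = \frac{Q}{b^{2(k-1)}},
\end{align*}
namely, the first-order stationarity measure decays geometrically with the outer iteration counter $k$.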
\textbf{Regularity.} The key condition in Theorem~\ref{thm:main} is \eqref{eq:regularity} which, broadly speaking, controls the geometry of the problem.
As the penalty weight $\b_k$ grows, the primal solver (Step~2) places an increasing emphasis on reducing the feasibility gap and \eqref{eq:regularity} formalizes this intuition.
In contrast to most of the conditions in the nonconvex optimization literature~\cite{flores2012complete}, condition~\eqref{eq:regularity} appears to be easier to check. Indeed, we verify this condition rigorously in several applications in Section~\ref{sec:experiments}.
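As a simple illustration, suppose for a moment that $g=0$ and that $A(x)=Bx$ is a linear operator. Then $\partial g(x_k)=\{0\}$ and \eqref{eq:regularity} reduces to
\begin{align*}
\nu \|B x_k\| \le \|B^\top B x_k\|,
\end{align*}
which holds for every $x_k$ if we take $\nu$ to be the smallest nonzero singular value of $B$, since $B x_k$ always lies in the range of $B$.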
Let us provide an example to argue that such a condition is necessary for controlling the feasibility gap of Program~\eqref{prob:01} and ensuring the success of Algorithm~\ref{Algo:2}. Consider the case where $f=0$, $g=\delta_\mathcal{X}$ is the indicator function of a convex set $\mathcal{X}$, and $A$ is a linear operator. In this case, solving Program~\eqref{prob:01} amounts to finding a point in $\mathcal{X}\cap \text{null}(A)$, where $\text{null}(A)=\{ x\in\RR^d: A(x) = 0 \}$ is the null space of $A$.
Here, Slater's condition requires that
\begin{equation}
\text{relint} (\mathcal{X}) \cap \text{null}(A)\neq \emptyset.
\end{equation}
In general, Slater's condition plays a key role in convex optimization as a sufficient condition for strong duality and, as a result, guarantees the success of a variety of primal-dual algorithms for linearly-constrained convex programs. Intuitively, Slater's condition here rules out pathological cases by ensuring that the subspace $\text{null}(A)$ is not tangent to $\mathcal{X}$, see Figure~\ref{fig:convex_slater}. In such pathological cases, solving Program~\eqref{prob:01}, namely, finding a point in $\mathcal{X}\cap \text{null}(A)$, can be particularly difficult. For instance, the alternating projection algorithm (which iteratively projects onto $\mathcal{X}$ and $\text{null}(A)$) can converge arbitrarily slowly, see Figure~\ref{fig:convex_slater}.
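As a concrete instance of such a pathological case, take $d=2$, $\mathcal{X}=\{x\in\RR^2: (x_1-1)^2+x_2^2\le 1\}$, and $A(x)=x_1$. Then $\text{null}(A)$ is the vertical axis, which is tangent to the disk $\mathcal{X}$ at the origin: $\mathcal{X}\cap\text{null}(A)=\{0\}$, whereas $\text{relint}(\mathcal{X})\cap\text{null}(A)=\emptyset$ and Slater's condition fails.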
\begin{figure}
\begin{center}
\includegraphics[scale=.5]{Slater.pdf}
\end{center}
\caption{Some regularity is required for Algorithm~\ref{Algo:2} to succeed, as detailed after Theorem~\ref{thm:main}. \label{fig:convex_slater}}
\end{figure}
\textbf{Computational complexity.} Theorem~\ref{thm:main} specifies the number of (outer) iterations of Algorithm~\ref{Algo:2}, and hence the number of calls to the solver in Step~2, required to reach a near-stationary point of Program~\eqref{prob:01} with a prescribed precision. Theorem~\ref{thm:main} does not, however, fully capture the computational complexity of Algorithm~\ref{Algo:2}, as it does not account for the computational cost of each call to the solver in Step~2.
To better understand the total complexity of Algorithm~\ref{Algo:2}, we consider two scenarios in what follows. In the first scenario, we take the solver in Step~2 to be the Accelerated Proximal Gradient Method (APGM), a well-known first-order algorithm~\cite{ghadimi2016accelerated}. In the second scenario, we take the solver in Step~2 to be the second-order trust region method developed in~\cite{cartis2012complexity}.
\subsection{First-Order Solver}
Let us first consider the case where the solver in Step~2 is the well-known Accelerated Proximal Gradient Method (APGM)~\cite{ghadimi2016accelerated}. Roughly speaking, at every (inner) iteration APGM takes a gradient step with respect to the smooth part of the objective, followed by a proximal step with respect to $g$, and combines these steps with Nesterov-type acceleration; its iteration complexity for nonconvex composite problems is analyzed in~\cite{ghadimi2016accelerated}.
%First, we characterize the iteration complexity of Algorithm~\ref{Algo:2} for finding a first-order stationary point of~\eqref{prob:01}.
%We propose to use the standard accelerated proximal method (APGM), guarantees of which are analyzed in~\cite{ghadimi2016accelerated} for nonconvex problems of the form~\eqref{e:exac}.
Suppose that $g=\delta_\mathcal{X}$ is the indicator function of a bounded convex set $\mathcal{X}\subset \RR^d$ and let
\begin{align}
x_{\max}= \max_{x\in \mathcal{X}}\|x\|
\end{align}
be the radius of the smallest ball centered at the origin that contains $\mathcal{X}$.
Then, adapting the results in~\cite{ghadimi2016accelerated} to our setup, APGM reaches $x_{k}$ in Step 2 of Algorithm~\ref{Algo:2} after
\begin{equation}
\mathcal{O}\left ( \frac{\lambda_{\beta_{k-1}}^2 x_{\max}^2 }{\epsilon_k} \right)
\label{eq:iter_1storder}
\end{equation}
(inner) iterations, where $\lambda_{\beta_{k-1}}$ denotes the Lipschitz constant of $\nabla_x{\mathcal{L}_{\beta_{k-1}}(x, y)}$, bounded in~\eqref{eq:smoothness of Lagrangian}.
Using \eqref{eq:iter_1storder}, we derive the following corollary, which describes the total computational complexity of Algorithm~\ref{Algo:2} in terms of the number of calls to the first-order oracle in APGM, namely, the number of gradient evaluations and proximal operations.
\begin{corollary}\label{cor:first}
For $b>1$, let $\beta_k =b^k $ for every $k$. If we use APGM from~\cite{ghadimi2016accelerated} for Step~2 of Algorithm~\ref{Algo:2}, then Algorithm~\ref{Algo:2} finds an $(\epsilon_f,\b_k)$ first-order stationary point of Program~\eqref{prob:01}, see \eqref{eq:inclu3},
after at most $T$ calls to the first-order oracle, where
%
\begin{equation}
T = \mathcal{O}\left( \frac{Q^{\frac{3}{2}+\frac{1}{2b}} x_{\max}^2}{\epsilon_f^{\frac{3}{2}+\frac{1}{2b}}} \right) = \tilde{\mathcal{O}}\left( \frac{Q^{\frac{3}{2}} x_{\max}^2}{\epsilon_f^{\frac{3}{2}}} \right).
\end{equation}
\end{corollary}
To reach a near-stationary point of Program~\eqref{prob:01} with accuracy $\epsilon_f$ in the sense of \eqref{eq:inclu3} at the lowest computational cost, one would therefore run a single (outer) iteration of Algorithm~\ref{Algo:2} with $\b_1=b$ chosen as a function of $\epsilon_f$ according to \eqref{eq:stat_prec_first} in Theorem~\ref{thm:main}. However, the constants in this bound are unknown in general, and such an approach is consequently intractable. Instead, the homotopy approach taken by Algorithm~\ref{Algo:2} achieves the desired accuracy by gradually increasing the penalty weight, which increases the computational cost only by a factor logarithmic in $\epsilon_f$, as detailed in the proof of Corollary~\ref{cor:first}.
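To see where this logarithmic factor comes from, note that by \eqref{eq:stat_prec_first} reaching accuracy $\epsilon_f$ requires $Q/\b_{k-1}^2\le \epsilon_f$ which, with the schedule $\b_k=b^k$, holds as soon as
\begin{align*}
k \ge 1+\frac{1}{2}\log_b \frac{Q}{\epsilon_f},
\end{align*}
namely, after $\mathcal{O}(\log_b(Q/\epsilon_f))$ outer iterations of Algorithm~\ref{Algo:2}.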
\subsection{Second-Order Solver}
In this section, we use a second-order solver in Step~2 of Algorithm~\ref{Algo:2}, which in turn allows us to find approximate second-order stationary points of Program~\eqref{prob:01}.
That is, in every (outer) iteration of Algorithm~\ref{Algo:2}, the solver in Step~2 finds an approximate second-order stationary point of Program~\eqref{e:exac}, in the sense specified in~\eqref{eq:sec_opt}.
As shown in~\cite{nouiehed2018convergence}, finding approximate second-order stationary points of convex-constrained problems is in general NP-hard. For this reason, we focus in this section on the special case of Program~\eqref{prob:01} with $g=0$.
%We first give a theorem to show the convergence rate of the algorithm for second order stationarity: \\
%{\color{red}{Ahmet: I think that it is possible to remove sufficiently large k0 assumption here. }} \textbf{AE: not worth it really}
%\begin{corollary}
%\label{thm:main_second}
%Under the assumptions of Theorem~\ref{thm:main}, the output of Algorithm~\ref{Algo:2} satisfies
%\begin{align}
%%& \dist^2( -\nabla_x \L_{\b_{k-1}}(x_k,y_k),\partial g(x_k)) + \| A(x_k)\|^2 \nonumber\\
%%& = \frac{1}{\b_{k-1}^2}\left( \frac{8 \lambda'_A\s_k^2 + 4}{ \nu^2} ( \lambda'_f+\lambda'_A y_{\max})^2 + 2\right) \nonumber \\
%%&=: \frac{Q}{\beta_{k-1}^2}.
%\lambda_{\text{min}}(\nabla _{xx}\mathcal{L}_{\beta_{k-1}}(x_k, y_k)) \geq - \frac{C}{\beta_{k-1}} - \epsilon_{k-1}.
%\end{align}
%\end{corollary}
%
For Step~2 of Algorithm~\ref{Algo:2}, we consider the trust region method developed in~\cite{cartis2012complexity}. Roughly speaking, at every (inner) iteration this method minimizes a quadratic model of the objective over a ball whose radius is adjusted adaptively from one iteration to the next; its complexity for reaching approximate second-order stationary points is analyzed in~\cite{cartis2012complexity}.
Let us compute the total computational complexity of Algorithm~\ref{Algo:2} with the trust region method in Step~2, in terms of the number of calls made to the second-order oracle. By adapting the result in~\cite{cartis2012complexity} to our setup, we find that the number of (inner) iterations required in Step~2 of Algorithm~\ref{Algo:2} to produce $x_k$ is
%
\begin{equation}
\mathcal{O}\left ( \frac{\lambda_{\beta_{k-1}, H}^2 (\mathcal{L}_{\beta_{k-1}}(x_1, y) - \min_{x}\mathcal{L}_{\beta_{k-1}}(x, y))}{\epsilon_k^3} \right),
\label{eq:sec_inn_comp}
\end{equation}
%
where $\lambda_{\beta, H}$ is the Lipschitz constant of the Hessian of the augmented Lagrangian, which is of the order of $\beta$.
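In particular, since $\lambda_{\beta_{k-1}, H}$ is of the order of $\beta_{k-1}$, the bound \eqref{eq:sec_inn_comp} scales as
\begin{align*}
\mathcal{O}\left ( \frac{\beta_{k-1}^2 \left(\mathcal{L}_{\beta_{k-1}}(x_1, y) - \min_{x}\mathcal{L}_{\beta_{k-1}}(x, y)\right)}{\epsilon_k^3} \right).
\end{align*}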
In~\cite{cartis2012complexity}, the term $\mathcal{L}_{\beta}(x_1, y) - \min_{x}\mathcal{L}_{\beta}(x, y)$ is bounded by a constant independent of $\epsilon$; we make the analogous assumption for every $\beta_k$ in Corollary~\ref{cor:second} below. Using \eqref{eq:sec_inn_comp} and Theorem~\ref{thm:main}, we arrive at the following result.
%
\begin{corollary}\label{cor:second}
For $b>1$, let $\beta_k = b^k$ for every $k$. Suppose that we use the trust region method from~\cite{cartis2012complexity} for Step~2 of Algorithm~\ref{Algo:2} and that
\begin{equation}
\mathcal{L}_{\beta_k}(x_1, y) - \min_{x}\mathcal{L}_{\beta_k}(x, y)
\end{equation}
is bounded by a constant independent of $\epsilon$, for every $k$.
Then Algorithm~\ref{Algo:2} finds an $\epsilon$-second-order stationary point of Program~\eqref{prob:01} in $T_K$ total calls to the second-order oracle, where
%
\begin{equation}
T_K = \tilde{\mathcal{O}}\left( \frac{Q^{\frac{5}{2}}}{\epsilon^{5}} \right).
\end{equation}
\end{corollary}
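Analogously to the first-order case, the number of outer iterations needed here can be read off from Theorem~\ref{thm:main}: roughly speaking, to guarantee $\epsilon_{k,s}\le \epsilon$ it suffices that $C/\b_{k-1}\le \epsilon/2$ and that the accuracy of the solver in Step~2 satisfies $\epsilon_{k-1}\le \epsilon/2$ which, with the schedule $\b_k=b^k$, holds as soon as
\begin{align*}
k \ge 1+\log_b \frac{2C}{\epsilon},
\end{align*}
namely, after $\mathcal{O}(\log_b(C/\epsilon))$ outer iterations.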
%
%
