diff --git a/ICML19/Arxiv version/iALM_main.pdf b/ICML19/Arxiv version/iALM_main.pdf index f89bd31..9ca3356 100644 Binary files a/ICML19/Arxiv version/iALM_main.pdf and b/ICML19/Arxiv version/iALM_main.pdf differ diff --git a/ICML19/Arxiv version/iALM_main.tex b/ICML19/Arxiv version/iALM_main.tex index faa0f93..7f5d131 100644 --- a/ICML19/Arxiv version/iALM_main.tex +++ b/ICML19/Arxiv version/iALM_main.tex @@ -1,80 +1,85 @@ \documentclass[a4paper]{article} %% Language and font encodings \usepackage[english]{babel} \usepackage[utf8x]{inputenc} \usepackage[T1]{fontenc} \usepackage{amsfonts} %% Sets page size and margins \usepackage[a4paper,top=3cm,bottom=2cm,left=3cm,right=3cm,marginparwidth=1.75cm]{geometry} %%% Useful packages %\usepackage{amsmath} %\usepackage{graphicx} %\usepackage[colorinlistoftodos]{todonotes} %\usepackage[colorlinks=true, allcolors=blue]{hyperref} %\usepackage{algorithm, algorithmic} \usepackage{amsthm} \usepackage{amsmath} \usepackage{amssymb} \usepackage{graphicx} \usepackage{algorithm} \usepackage{algorithmic} \usepackage{color} \usepackage{authblk} %\usepackage{cite} + \usepackage{enumitem} + +\newcommand{\anote}[1]{\textbf{{\color{red} AE: #1}}} + + \input{preamble} \title{An Inexact Augmented Lagrangian Framework \\ for Non-Convex Optimization with Nonlinear Constraints} -\author{Mehmet Fatih Sahin\thanks{mehmet.sahin@epfl.ch} +\author{Mehmet Fatih Sahin $\qquad$Armin Eftekhari $\qquad$Ahmet Alacaoglu\\ Fabian Latorre -$\qquad$Volkan Cevher\\ +$\qquad$Volkan Cevher\thanks{mehmet.sahin@epfl.ch \anote{Fatih please add all emails!}}\\ ~\\ LIONS, Ecole Polytechnique F\'ed\'erale de Lausanne, Switzerland } \date{} \begin{document} \maketitle \input{sections/abstract.tex} \input{sections/introduction.tex} \input{sections/preliminaries.tex} \input{sections/AL.tex} \input{sections/guarantees.tex} %\input{sections/guarantees2.tex} \input{sections/related_works.tex} \input{sections/experiments.tex} \section*{Acknowledgements} This project has 
received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement n$^o$ 725594 - time-data), and was supported by the Swiss National Science Foundation (SNSF) under grant number 200021\_178865 / 1. %\section{Conclusion and Future Work} %We studied a non-convex inexact augmented Lagrangian framework for solving nonlinearly constrained problems and showed that when coupled with a first order method iALM finds a first-order stationary point with $\tilde{\mathcal{O}}(1/\epsilon^3)$ calls to the first-order oracle. %Similarly, when a second-order solver is used for the inner iterates, we prove that iALM finds a second-order stationary point with $\tilde{\mathcal{O}}(1/\epsilon^5)$ calls to the second-order oracle. %%Numerical experiments for the basis pursuit problem may lead to an interesting direction. %Stochastic version of Algorithm $(1)$ is a natural direction. %\bibliographystyle{alpha} \bibliographystyle{plain} \bibliography{bibliography.bib} \newpage \appendix \input{sections/AL_theory.tex} \input{sections/appendix.tex} \end{document} \ No newline at end of file diff --git a/ICML19/Arxiv version/sections/AL.tex b/ICML19/Arxiv version/sections/AL.tex index 423ca8f..d6c94a1 100644 --- a/ICML19/Arxiv version/sections/AL.tex +++ b/ICML19/Arxiv version/sections/AL.tex @@ -1,56 +1,56 @@ \section{Our optimization framework \label{sec:AL algorithm}} -To solve the formulation presented in \eqref{eq:minmax}, we propose the inexact ALM (iALM), detailed in Algorithm~\ref{Algo:2}. +To solve the equivalent formulation of \eqref{prob:01}, introduced in \eqref{eq:minmax}, we propose the inexact Augmented Lagrangian Method (iALM), detailed in Algorithm~\ref{Algo:2}.
-At the iteration $k$, Algorithm~\ref{Algo:2} calls in Step 2 a solver that finds an approximate stationary point of the augmented Lagrangian $\L_{\b_k}(\cdot,y_k)$ with the accuracy of $\epsilon_{k+1}$, and this accuracy gradually increases in a controlled fashion. +At the $k$th iteration, Step 2 of Algorithm~\ref{Algo:2} calls a solver that finds an approximate stationary point of the augmented Lagrangian $\L_{\b_k}(\cdot,y_k)$ with the accuracy of $\epsilon_{k+1}$; this tolerance is gradually tightened throughout the iterations in a controlled fashion. -The increasing sequence of penalty weights $\{\b_k\}_k$ and the dual update (Steps 4 and 5) are responsible for continuously enforcing the constraints in~\eqref{prob:01}. As we will see in the convergence analysis, the particular choice of the dual step size $\s_k$ in Algorithm~\ref{Algo:2} ensures that the dual variable $y_k$ remains bounded; see~\cite{bertsekas1976penalty} for a precedent in the ALM literature where a similar choice for $\sigma_k$ is considered. +The increasing sequence of penalty weights $\{\b_k\}_k$ and the dual update (Steps 4 and 5 of Algorithm~\ref{Algo:2}) are responsible for continuously enforcing the constraints in~\eqref{prob:01}. As we will see in the convergence analysis, the particular choice of the dual step size $\s_k$ in Algorithm~\ref{Algo:2} ensures that the dual variable $y_k$ remains bounded; see~\cite{bertsekas1976penalty} for a precedent in the ALM literature where a similar choice for $\sigma_k$ is considered. Step 3 of Algorithm~\ref{Algo:2} removes pathological cases with divergent iterates. As an example, suppose that $g=\delta_\mathcal{X}$ in \eqref{prob:01} is the indicator function for a bounded convex set $\mathcal{X}\subset \RR^d$ and take $\rho' > \max_{x\in \mathcal{X}} \|x\|$. Then, for sufficiently large $k$, it is not difficult to verify that all the iterates of Algorithm~\ref{Algo:2} automatically satisfy $\|x_k\|\le \rho'$ without the need to execute Step 3.
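The outer loop just described can be illustrated in a few lines of code. The following is a minimal Python sketch, not the paper's implementation: it treats the smooth case $g=0$ (so Step 3 is vacuous for bounded iterates), substitutes plain gradient descent for the Step 2 solver, and takes a caller-supplied bound `lip` on the Lipschitz constant of the augmented-Lagrangian gradient; all of these simplifications are our own.

```python
import numpy as np

def ialm(grad_f, A, jacT, x1, y0, lip, sigma1=1.0, beta1=1.0, b=2.0,
         tau_f=1e-2, max_outer=50, max_inner=50_000):
    """Sketch of the iALM outer loop for min f(x) s.t. A(x) = 0, with g = 0.

    Plain gradient descent with step size 1/lip(beta) stands in for the
    first-order solver assumed in Step 2; `jacT(x)` returns DA(x)^T.
    """
    x = np.asarray(x1, dtype=float).copy()
    y = np.asarray(y0, dtype=float).copy()
    Ax1_norm = np.linalg.norm(A(x))
    for k in range(1, max_outer + 1):
        beta = beta1 * b ** (k - 1)            # geometric penalty schedule
        eps = 1.0 / beta                       # Step 1: tolerance update
        for _ in range(max_inner):             # Step 2: inexact primal solve
            g = grad_f(x) + jacT(x) @ (y + beta * A(x))
            if np.linalg.norm(g) <= eps:       # grad of L_beta small enough
                break
            x = x - g / lip(beta)
        Ax = A(x)
        nAx = np.linalg.norm(Ax)
        # Step 4: dual step size; chosen so the dual iterates stay bounded
        ratio = np.inf if nAx == 0 else (
            Ax1_norm * np.log(2) ** 2 / (nAx * (k + 1) * np.log(k + 2) ** 2))
        sigma = sigma1 * min(ratio, 1.0)
        y = y + sigma * Ax                     # Step 5: dual ascent
        # Step 6: first-order stopping criterion
        if np.linalg.norm(g) + sigma * nAx <= tau_f:
            break
    return x, y
```

On a toy instance (project $c$ onto $\{x: \sum_i x_i = 0\}$, a linearly constrained strongly convex problem) the sketch recovers $c - \operatorname{mean}(c)\mathbf{1}$ to within the requested tolerance.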
\begin{algorithm}[h!] \begin{algorithmic} -\STATE \textbf{Input:} $\rho,\rho',\rho''>0$. A non-decreasing, positive, unbounded sequence $\{\b_k\}_{k\ge 1}$, stopping thresholds $\tau_f$ and $\tau_s$. \vspace{2pt} +\STATE \textbf{Input:} $\rho,\rho',\rho''>0$, a non-decreasing, positive, unbounded sequence $\{\b_k\}_{k\ge 1}$, stopping thresholds $\tau_f$ and $\tau_s$. \vspace{2pt} \STATE \textbf{Initialization:} $x_{1}\in \RR^d$ such that $\|A(x_1)\|\le \rho$ and $\|x_1\|\le \rho'$, $y_0\in \RR^m$, $\s_1$. %For $k=1,2,\ldots$, execute\\ \vspace{2pt} \FOR{$k=1,2,\dots$} \STATE \begin{enumerate}[leftmargin=*] \item \textbf{(Update tolerance)} $\epsilon_{k+1} = 1/\b_k$. %\begin{align} %\beta_k = \frac{\beta_{1}}{\log 2}\sqrt{k}\log^2(k+1) . %\end{align} \item \textbf{(Inexact primal solution)} Obtain $x_{k+1}\in \RR^d$ such that \begin{equation*} \dist(-\nabla_x \L_{\beta_k} (x_{k+1},y_k), \partial g(x_{k+1}) ) \le \epsilon_{k+1} \end{equation*} for first-order stationarity and, in addition, \begin{equation*} \lambda_{\text{min}}(\nabla _{xx}\mathcal{L}_{\beta_k}(x_{k+1}, y_k)) \ge -\epsilon_{k+1} \end{equation*} for second-order-stationarity. \item \textbf{(Control)} If necessary, project $x_{k+1}$ to ensure that $\|x_{k+1}\|\le \rho'$.\\ \item \textbf{(Update dual step size)} \begin{align*} \s_{k+1} & = \s_{1} \min\Big( \frac{\|A(x_1)\| \log^2 2 }{\|A(x_{k+1})\| (k+1)\log^2(k+2)} ,1 \Big). \end{align*} \item \textbf{(Dual ascent)} $y_{k+1} = y_{k} + \sigma_{k+1}A(x_{k+1})$. 
\item \textbf{(Stopping criterion)} If \begin{align*} -& \dist(-\nabla_x \L_{\b_k}(x_{k+1}),\partial g(x_{k+1})) \nonumber\\ -& \qquad + \s_{k+1} \|A(x_{k+1})\| \le \tau_f, + \dist(-\nabla_x \L_{\b_k}(x_{k+1},y_k),\partial g(x_{k+1})) ++ \s_{k+1} \|A(x_{k+1})\| \le \tau_f, \end{align*} for first-order stationarity and if also $\lambda_{\text{min}}(\nabla _{xx}\mathcal{L}_{\beta_{k}}(x_{k+1}, y_k)) \geq -\tau_s$ for second-order stationarity, then quit and return $x_{k+1}$ as an (approximate) stationary point of~\eqref{prob:01}. \end{enumerate} \ENDFOR \end{algorithmic} \caption{Inexact AL for solving~\eqref{prob:01}} \label{Algo:2} \end{algorithm} diff --git a/ICML19/Arxiv version/sections/guarantees.tex b/ICML19/Arxiv version/sections/guarantees.tex index d2dd7c1..c3ed030 100644 --- a/ICML19/Arxiv version/sections/guarantees.tex +++ b/ICML19/Arxiv version/sections/guarantees.tex @@ -1,192 +1,187 @@ %!TEX root = ../iALM_main.tex \section{Convergence Rate \label{sec:cvg rate}} -In this section, we detail the convergence rate of Algorithm~\ref{Algo:2} for finding first-order and second-order stationary points, along with the iteration complexity results. +In this section, we derive the convergence rate of Algorithm~\ref{Algo:2} for finding first-order and second-order stationary points of \eqref{prob:01}, and present the corresponding total iteration complexity results. All the proofs are deferred to Appendix~\ref{sec:theory} for clarity. -Theorem~\ref{thm:main} below characterizes the convergence rate of Algorithm~\ref{Algo:2} for finding stationary points in terms of the number of outer iterations.\\ +Let us now turn to the details.
Theorem~\ref{thm:main} below, proved in Appendix~\ref{sec:theory}, characterizes the convergence rate of Algorithm~\ref{Algo:2} for finding stationary points in terms of the number of outer iterations.\\ %{\color{red}{Ahmet: Maybe instead of sufficiently large k0 we can say k0 such that beta is bigger than 1, it makes the proof go thru, it would be a bit more explanatory}} \begin{theorem}\textbf{\emph{(convergence rate)}} \label{thm:main} -Suppose that $f$ and $A$ are smooth in the sense specified in (\ref{eq:smoothness basic}). For $\rho'>0$, let +Suppose that $f$ and $A$ are smooth in the sense specified in~(\ref{eq:smoothness basic}). For $\rho'>0$, let \begin{align} \lambda'_f = \max_{\|x\|\le \rho'} \|\nabla f(x)\|, ~~ \lambda'_A = \max_{\|x\| \le \rho'} \|DA(x)\|, \label{eq:defn_restricted_lipsichtz} \end{align} be the (restricted) Lipschitz constants of $f$ and $A$, respectively. For sufficiently large integers $k_0\le k_1$, consider the interval $K=[k_0:k_1]$, and let $\{x_k\}_{k\in K}$ be the output sequence of Algorithm~\ref{Algo:2} on the interval $K$. For $\nu>0$, assume that \begin{align} \nu \|A(x_k)\| & \le \dist\left( -DA(x_k)^\top A(x_k) , \frac{\partial g(x_k)}{ \b_{k-1}} \right), \label{eq:regularity} \end{align} for every $k\in K$.
We consider two cases: %{\color{red} Ahmet: I removed the squares and showed the rate on dist + feasibility, since it is the measure for the stationarity, using squares confuses complexity analysis...} \begin{itemize}[leftmargin=*] \item If a first-order solver is used in Step~2, then $x_k$ is an $(\epsilon_{k,f},\b_k)$ first-order stationary point of (\ref{prob:01}) with %\begin{align} %& \dist^2( -\nabla_x \L_{\b_{k-1}}(x_k,y_k),\partial g(x_k)) + \| A(x_k)\|^2 \nonumber\\ %& = \frac{1}{\b_{k-1}^2}\left( \frac{8 \lambda'_A\s_1^2 + 4}{ \nu^2} ( \lambda'_f+\lambda'_A y_{\max})^2 + 2\right) \nonumber \\ %&=: \frac{Q(f,g,A)}{\beta_{k-1}^2}, %\end{align} \begin{align} %\epsilon_{k,f} & = \frac{1}{\beta_{k-1}} \sqrt{ \frac{8 \lambda'_A\s_1^2 + 4}{ \nu^2} ( \lambda'_f+\lambda'_A y_{\max})^2 + 2 } \nonumber \\ %&=: \frac{Q(f,g,A,\s_1)}{\beta_{k-1}}, \epsilon_{k,f} & = \frac{1}{\beta_{k-1}} \left(\frac{2(\lambda'_f+\lambda'_A y_{\max}) (1+\lambda_A' \sigma_k)}{\nu}+1\right) \nonumber \\ &=: \frac{Q(f,g,A,\s_1)}{\beta_{k-1}}, \label{eq:stat_prec_first} \end{align} for every $k\in K$, where the expression for $y_{\max}=y_{\max}(x_1,y_0,\s_1)$ is given in (\ref{eq:dual growth}). \item If a second-order solver is used in Step~2, then $x_k$ is an $(\epsilon_{k,f}, \epsilon_{k,s},\b_k)$ second-order stationary point of~(\ref{prob:01}) with $\epsilon_{k,f}$ specified above and with \begin{align} \epsilon_{k,s} &= \epsilon_{k-1} + \sigma_k \sqrt{m} \lambda_A \frac{ 2\lambda'_f +2 \lambda'_A y_{\max} }{\nu \b_{k-1}} \nonumber\\ &= \frac{\nu + 2\sigma_k \sqrt{m} \lambda_A (\lambda'_f + \lambda'_A y_{\max}) }{\nu \b_{k-1}} =: \frac{Q'}{\beta_{k-1}}. \end{align} \end{itemize} \end{theorem} %A few remarks are in order about Theorem~\ref{thm:main}.
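To make the rate concrete: under the geometric penalty schedule $\b_k = b^k$ (for $b>1$), the choice adopted in the corollaries below, the bound \eqref{eq:stat_prec_first} reaches a target accuracy $\epsilon_f$ after only logarithmically many outer iterations; we spell out this simple consequence here:

```latex
\begin{align*}
\epsilon_{k,f} = \frac{Q}{\b_{k-1}} \le \epsilon_f
\quad \Longleftrightarrow \quad
k \ge \log_b\left(\frac{Q}{\epsilon_f}\right) + 1 .
\end{align*}
```

The bulk of the computational cost therefore lies in the inner solves of Step~2, which is quantified next.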
%\textbf{Tradeoff.} %Roughly speaking, Theorem~\ref{thm:main} states that the iterates of Algorithm~\ref{Algo:2} approach first-order stationarity for~\eqref{prob:01} at the rate of $1/{\b_k}$, where $\b_k$ is the penalty weight in AL in iteration $k$ as in~\eqref{eq:Lagrangian}. Note the inherent tradeoff here: For a faster sequence $\{\b_k\}_k$, the iterates approach stationarity faster but the computational cost of the inexact primal subroutine (Step 2) increases since a higher accuracy is required and Lipschitz constant of the unconstrained problem is also affected by $\beta_k$. In words, there is a tradeoff between the number of outer and inner loop iterations of Algorithm~\ref{Algo:2}. We highlight this tradeoff for a number of subroutines below. -Loosely speaking, Theorem~\ref{thm:main} states that Algorithm~\ref{Algo:2} converges to a (first- or second-) order stationary point of \eqref{prob:01} at the rate of $1/\b_k$. +\noindent Loosely speaking, Theorem~\ref{thm:main} states that Algorithm~\ref{Algo:2} converges to a (first- or second-) order stationary point of \eqref{prob:01} at the rate of $1/\b_k$. A few remarks are in order. -A few remarks are in order. +\paragraph{Regularity.} The key condition in Theorem~\ref{thm:main} is \eqref{eq:regularity} which, broadly speaking, controls the geometry of the problem. -\textbf{Regularity.} The key condition in Theorem~\ref{thm:main} is \eqref{eq:regularity} which, broadly speaking, controls the problem geometry. +As the penalty weight $\b_k$ grows, the primal solver in Step~2 places an increasing emphasis on reducing the feasibility gap and \eqref{eq:regularity} formalizes this intuition. +In contrast to most conditions in the nonconvex optimization literature, such as~\cite{flores2012complete} \anote{Fatih, please cite here all papers with nonconvex Slater conditions}, our condition in~\eqref{eq:regularity} appears to be easier to verify, as we see in Section~\ref{sec:experiments}.
-As the penalty weight $\b_k$ grows, the primal solver in Step~2 places an increasing emphasis on reducing the feasibility gap and \eqref{eq:regularity} formalizes this intuition. -In contrast to most conditions in the nonconvex optimization literature, such as~\cite{flores2012complete}, our condition in~\eqref{eq:regularity} appears to be easier to verify, as we see in Section~\ref{sec:experiments}. - -We now argue that such a condition is necessary for controlling the feasibility gap of~\eqref{prob:01} and ensuring the success of Algorithm~\ref{Algo:2}. Consider the case where $f=0$ and $g=\delta_\mathcal{X}$, where $\mathcal{X}$ is a convex set, $A$ is a linear operator. In this case, solving \eqref{prob:01} finds a point in $\mathcal{X}\cap \text{null}(A)$, where the subspace $\text{null}(A)=\{ x\in\mathbb{R}^d: A(x) = 0 \} \subset \RR^d$ is the null space of $A$. -Here, the Slater's condition requires that +We now argue that such a condition is necessary for controlling the feasibility gap of~\eqref{prob:01} and ensuring the success of Algorithm~\ref{Algo:2}. For instance, consider the case where $f=0$ and $g=\delta_\mathcal{X}$, where $\mathcal{X}$ is a convex set and $A$ is a linear operator. In this case, solving \eqref{prob:01} finds a point in $\mathcal{X}\cap \text{null}(A)$, where the subspace $\text{null}(A)=\{ x\in\mathbb{R}^d: A(x) = 0 \} \subset \RR^d$ is the null space of $A$. +Here, \eqref{prob:01} is convex and Slater's condition requires that \begin{equation} \text{relint} (\mathcal{X}) \cap \text{null}(A)\neq \emptyset. \end{equation} -In general, the Slater's condition plays a key role in convex optimization as a sufficient condition for strong duality and, as a result, guarantees the success of a variety of primal-dual algorithms for linearly-constrained convex programs~\cite{bauschke2011convex}.
- -Intuitively, the Slater's condition here removes any pathological cases by ensuring that the subspace $\text{null}(A)$ is not tangent to $\mathcal{X}$, see Figure~\ref{fig:convex_slater}. In such pathological cases, solving~\eqref{prob:01}, namely, finding a point in $\mathcal{X}\cap \text{null}(A)$, can be particularly difficult. For instance, the alternating projection algorithm (which iteratively projects onto $\mathcal{X}$ and $\text{null}(A)$) has arbitrarily slow convergence, see Figure~\ref{fig:convex_slater}. +In general, Slater's condition plays a key role in convex optimization as a sufficient condition for strong duality and, as a result, guarantees the success of a variety of primal-dual algorithms for linearly-constrained convex programs~\cite{bauschke2011convex}. Intuitively, Slater's condition here removes any pathological cases by ensuring that the subspace $\text{null}(A)$ is not tangent to $\mathcal{X}$; see Figure~\ref{fig:convex_slater}. In such pathological cases, solving~\eqref{prob:01}, namely, finding a point in $\mathcal{X}\cap \text{null}(A)$, can be particularly difficult. For instance, the alternating projection algorithm (which iteratively projects onto $\mathcal{X}$ and $\text{null}(A)$) has arbitrarily slow convergence; see Figure~\ref{fig:convex_slater}. %{\color{red} Ahmet: a reference would be cool here} \begin{figure} \begin{center} \includegraphics[scale=.5]{Slater.pdf} \end{center} -\caption{Solving \eqref{prob:01} can be particularly difficult, even when \eqref{prob:01} is a convex program. We present a pathological geometry where the Slater's condition does not apply. See the first remark after Theorem~\ref{thm:main} for more details. \label{fig:convex_slater}} +\caption{Solving \eqref{prob:01} can be particularly difficult, even when it is a convex program. We present a pathological geometry where Slater's condition does not apply. See the first remark after Theorem~\ref{thm:main} for more details. \label{fig:convex_slater}} \end{figure} -\textbf{Computational complexity.} Theorem~\ref{thm:main} allows us to specify the number of iterations that Algorithm~\ref{Algo:2} requires to reach a near-stationary point of Program~\eqref{prob:01} with a prescribed precision and, in particular, specifies the number of calls made to the solver in Step~2. In this sense, Theorem~\ref{thm:main} does not fully capture the computational complexity of Algorithm~\ref{Algo:2}, as it does not take into account the computational cost of the solver in Step~2. +\paragraph{Computational complexity.} Theorem~\ref{thm:main} allows us to specify the number of iterations that Algorithm~\ref{Algo:2} requires to reach a near-stationary point of~\eqref{prob:01} with a prescribed precision and, in particular, specifies the number of calls made to the solver in its Step~2. In this sense, Theorem~\ref{thm:main} does not fully capture the computational complexity of Algorithm~\ref{Algo:2}, as it does not take into account the computational cost of the solver in its Step~2.
-To better understand the total complexity of Algorithm~\ref{Algo:2}, we consider two scenarios in the following. In the first scenario, we take the solver in Step~2 to be the Accelerated Proximal Gradient Method (APGM), a well-known first-order algorithm~\cite{ghadimi2016accelerated}. In the second scenario, we will use the second-order trust region method developed in~\cite{cartis2012complexity}. +To understand the total complexity of Algorithm~\ref{Algo:2}, we consider two scenarios in the following. In the first scenario, we take the solver in Step~2 to be the Accelerated Proximal Gradient Method (APGM), a well-known first-order algorithm~\cite{ghadimi2016accelerated}. In the second scenario, we will use the second-order trust region method developed in~\cite{cartis2012complexity}. \subsection{First-Order Optimality} %\textbf{AE: we go back and forth between "subroutine" and "solver". for consistency, I'm just using "solver" everywhere.} Let us first consider the first-order optimality case where the solver in Step~2 is APGM~\cite{ghadimi2016accelerated}. -APGM makes use of $\nabla_x \mathcal{L}_{\beta}(x,y)$, $\text{prox}_g$ and classical Nesterov acceleration step for the iterates~\cite{nesterov1983method} to obtain first order stationarity guarantees for solving~\eqref{e:exac}. +APGM makes use of $\nabla_x \mathcal{L}_{\beta}(x,y)$, the proximal operator $\text{prox}_g$, and classical Nesterov acceleration~\cite{nesterov1983method} to obtain first-order stationarity guarantees for solving the subproblem~\eqref{e:exac}. % \textbf{AE: Ahmet to give a brief description of APGM here. } %First, we characterize the iteration complexity of Algorithm~\ref{Algo:2} for finding a first-order stationary point of~\eqref{prob:01}. %We propose to use the standard accelerated proximal method (APGM), guarantees of which are analyzed in~\cite{ghadimi2016accelerated} for nonconvex problems of the form~\eqref{e:exac}.
-Suppose that $g=\delta_\mathcal{X}$ is the indicator function on a bounded convex set $\mathcal{X}\subset \RR^d$ and let +Suppose that $g=\delta_\mathcal{X}$ is the indicator function of a bounded convex set $\mathcal{X}\subset \RR^d$, and let \begin{align} x_{\max}= \max_{x\in \mathcal{X}}\|x\|, \end{align} be the radius of a ball centered at the origin that includes $\mathcal{X}$. Then, adapting the results in~\cite{ghadimi2016accelerated} to our setup, APGM reaches $x_{k}$ in Step 2 of Algorithm~\ref{Algo:2} after \begin{equation} \mathcal{O}\left ( \frac{\lambda_{\beta_{k}}^2 x_{\max}^2 }{\epsilon_{k+1}} \right) \label{eq:iter_1storder} \end{equation} -(inner) iterations, where $\lambda_{\beta_{k}}$ denotes the Lipschitz constant of $\nabla_x{\mathcal{L}_{\beta_{k}}(x, y)}$, bounded in~\eqref{eq:smoothness of Lagrangian}. We note that for simplicity, we use a looser bound in \eqref{eq:iter_1storder} than~\cite{ghadimi2016accelerated}. +(inner) iterations, where $\lambda_{\beta_{k}}$ denotes the Lipschitz constant of $\nabla_x{\mathcal{L}_{\beta_{k}}(x, y)}$, bounded in~\eqref{eq:smoothness of Lagrangian}. We note that, for simplicity, we use a looser bound in \eqref{eq:iter_1storder} than the original in~\cite{ghadimi2016accelerated}. Using \eqref{eq:iter_1storder}, we derive the following corollary, describing the total computational complexity of Algorithm~\ref{Algo:2} in terms of the calls to the first-order oracle in APGM. %\textbf{AE: we haven't talked about oracle before.} \begin{corollary}\label{cor:first} -For $b>1$, let $\beta_k =b^k $ for every $k$. If we use APGM from~\cite{ghadimi2016accelerated} for Step~2 of Algorithm~\ref{Algo:2}, the algorithm finds an $(\epsilon_f,\b_k)$ first-order stationary point, see (\ref{eq:inclu3}), - after $T$ calls to the first-order oracle, where +For $b>1$, let $\beta_k =b^k $ for every $k$. 
If we use APGM from~\cite{ghadimi2016accelerated} for Step~2 of Algorithm~\ref{Algo:2}, the algorithm finds an $(\epsilon_f,\b_k)$ first-order stationary point, see (\ref{eq:inclu3}), after $T$ calls to the first-order oracle, where % \begin{equation} T = \mathcal{O}\left( \frac{Q^3 x_{\max}^2}{\epsilon^{3}}\log_b{\left( \frac{Q}{\epsilon} \right)} \right) = \tilde{\mathcal{O}}\left( \frac{Q^{3} x_{\max}^2}{\epsilon^{3}} \right). \end{equation} \end{corollary} -For Algorithm~\ref{Algo:2} to reach a near-stationary point with an accuracy of $\epsilon_f$ in the sense of \eqref{eq:inclu3} and with the lowest computational cost, we therefore need to perform only one iteration of Algorithm~\ref{Algo:2}, with $\b_1$ specified as a function of $\epsilon_f$ by \eqref{eq:stat_prec_first} in Theorem~\ref{thm:main}. In general, however, the constants in \eqref{eq:stat_prec_first} are unknown and this approach is intractable. Instead, the homotopy approach taken by Algorithm~\ref{Algo:2} ensures achieving the desired accuracy by gradually increasing the penalty weight. This homotopy approach increases the computational cost of Algorithm~\ref{Algo:2} only by a factor logarithmic in the $\epsilon_f$, as detailed in the proof of Corollary~\ref{cor:first}. +\noindent For Algorithm~\ref{Algo:2} to reach a near-stationary point with an accuracy of $\epsilon_f$ in the sense of \eqref{eq:inclu3} and with the lowest computational cost, we therefore need to perform only one iteration of Algorithm~\ref{Algo:2}, with $\b_1$ specified as a function of $\epsilon_f$ by \eqref{eq:stat_prec_first} in Theorem~\ref{thm:main}. In general, however, the constants in \eqref{eq:stat_prec_first} are unknown and this approach is intractable. Instead, the homotopy approach taken by Algorithm~\ref{Algo:2} ensures achieving the desired accuracy by gradually increasing the penalty weight. 
This homotopy approach increases the computational cost of Algorithm~\ref{Algo:2} only by a factor logarithmic in the $\epsilon_f$, as detailed in the proof of Corollary~\ref{cor:first}. %\textbf{AE: if we find some time, i'll add a couple of lines for how 1st order opt implies 2nd order opt for smooth constraints.} \subsection{Second-Order Optimality} %{\color{red}{Ahmet (note to myself): not sure of the constants of trust-region, check again}} \\ Let us now consider the second-order optimality case where the solver in Step~2 is the trust region method developed in~\cite{cartis2012complexity}. -Trust region method minimizes quadratic approximation of the function within a dynamically updated trust-region radius. -Second-order trust region method that we consider in this section makes use of Hessian (or an approximation of Hessian) of the augmented Lagrangian in addition to first order oracles. +The trust region method minimizes a quadratic approximation of the function within a dynamically updated trust-region radius. +The particular second-order trust region method that we consider in this section makes use of the Hessian (or an approximation of the Hessian) of the augmented Lagrangian in addition to first-order oracles. As shown in~\cite{nouiehed2018convergence}, finding approximate second-order stationary points of convex-constrained problems is in general NP-hard. For this reason, we focus in this section on the special case of~\eqref{prob:01} with $g=0$. %We first give a theorem to show the convergence rate of the algorithm for second order stationarity: \\ %{\color{red}{Ahmet: I think that it is possible to remove sufficiently large k0 assumption here.
}} \textbf{AE: not worth it really} %\begin{corollary} %\label{thm:main_second} %Under the assumptions of Theorem~\ref{thm:main}, the output of Algorithm~\ref{Algo:2} satisfies %\begin{align} %%& \dist^2( -\nabla_x \L_{\b_{k-1}}(x_k,y_k),\partial g(x_k)) + \| A(x_k)\|^2 \nonumber\\ %%& = \frac{1}{\b_{k-1}^2}\left( \frac{8 \lambda'_A\s_k^2 + 4}{ \nu^2} ( \lambda'_f+\lambda'_A y_{\max})^2 + 2\right) \nonumber \\ %%&=: \frac{Q}{\beta_{k-1}^2}. %\lambda_{\text{min}}(\nabla _{xx}\mathcal{L}_{\beta_{k-1}}(x_k, y_k)) \geq - \frac{C}{\beta_{k-1}} - \epsilon_{k-1}. %\end{align} %\end{corollary} % %We consider \textbf{AE: Ahmet to add a brief description of the algorithm.} Let us compute the total computational complexity of Algorithm~\ref{Algo:2} with the trust region method in Step~2, in terms of the number of calls made to the second-order oracle. By adapting the result in~\cite{cartis2012complexity} to our setup, we find that the number of (inner) iterations required in Step~2 of Algorithm~\ref{Algo:2} to produce $x_{k+1}$ is % \begin{equation} \mathcal{O}\left ( \frac{\lambda_{\beta_{k}, H}^2 (\mathcal{L}_{\beta_{k}}(x_1, y) - \min_{x}\mathcal{L}_{\beta_k}(x, y))}{\epsilon_k^3} \right), \label{eq:sec_inn_comp} \end{equation} % where $\lambda_{\beta, H}$ is the Lipschitz constant of the Hessian of the augmented Lagrangian, which is of the order of $\beta$, as can be proven similarly to Lemma~\ref{lem:smoothness}, and $x_1$ is the initial iterate of the given outer loop. In~\cite{cartis2012complexity}, the term $\mathcal{L}_{\beta}(x_1, y) - \min_{x}\mathcal{L}_{\beta}(x, y)$ is bounded by a constant independent of $\epsilon$. -We assume a uniform bound for this quantity $\forall \beta_k$, instead of for one value of $\beta_k$ as in~\cite{cartis2012complexity}. Using \eqref{eq:sec_inn_comp} and Theorem~\ref{thm:main}, we arrive at the following: +We assume a uniform bound for this quantity for all $\beta_k$, instead of for one value of $\beta_k$ as in~\cite{cartis2012complexity}. Using \eqref{eq:sec_inn_comp} and Theorem~\ref{thm:main}, we arrive at the following result. The proof is very similar to that of Corollary~\ref{cor:first} and hence omitted here. % \begin{corollary}\label{cor:second} For $b>1$, let $\beta_k =b^k $ for every $k$. We assume that \begin{equation} -\mathcal{L}_{\beta}(x_1, y) - \min_{x}\mathcal{L}_{\beta}(x, y) \leq L_{u},~~ \forall \beta. +\mathcal{L}_{\beta}(x_1, y) - \min_{x}\mathcal{L}_{\beta}(x, y) \leq L_{u},~~ \forall \beta . \end{equation} If we use the trust region method from~\cite{cartis2012complexity} for Step~2 of Algorithm~\ref{Algo:2}, the algorithm finds an $\epsilon$-second-order stationary point of~(\ref{prob:01}) in $T$ calls to the second-order oracle where % \begin{equation} T \leq \mathcal{O}\left( \frac{L_u Q'^{5}}{\epsilon^{5}} \log_b{\left( \frac{Q'}{\epsilon} \right)} \right) = \widetilde{\mathcal{O}}\left( \frac{L_u Q'^{5}}{\epsilon^{5}} \right). \end{equation} \end{corollary} % -Before closing this section, we note that the remark after Corollary~\ref{cor:first} applies here as well.
+\noindent Before closing this section, we note that the remark after Corollary~\ref{cor:first} applies here as well. % diff --git a/ICML19/Arxiv version/sections/introduction.tex b/ICML19/Arxiv version/sections/introduction.tex index 0dd8385..f178bec 100644 --- a/ICML19/Arxiv version/sections/introduction.tex +++ b/ICML19/Arxiv version/sections/introduction.tex @@ -1,91 +1,90 @@ %!TEX root = ../main.tex \vspace{-5mm} \section{Introduction} \label{intro} -We study the following nonconvex optimization problem +We study the nonconvex optimization problem % \begin{equation} \label{prob:01} \begin{cases} \underset{x\in \mathbb{R}^d}{\min}\,\, f(x)+g(x)\\ A(x) = b, \end{cases} \end{equation} % -where $f:\mathbb{R}^d\rightarrow\mathbb{R}$ is possibly non-convex and $A:\mathbb{R}^d\rightarrow\mathbb{R}^m$ is a nonlinear operator and $b\in\mathbb{R}^m$. +where $f:\mathbb{R}^d\rightarrow\mathbb{R}$ is possibly nonconvex, $A:\mathbb{R}^d\rightarrow\mathbb{R}^m$ is a nonlinear operator, and $b\in\mathbb{R}^m$. For clarity of notation, we take $b=0$ in the sequel; the extension to any $b$ is trivial. -We assume that $g:\mathbb{R}^d\rightarrow\mathbb{R}$ is proximal-friendly (possibly nonsmooth) convex function. % (not necessarily smooth). -%For instance, for a convex set $\mathcal{X}\subset\RR^d$, letting $g=\delta_\mathcal{X}$ to be the indicator function on $\mathcal{X}$, ~\eqref{prob:01} is a nonconvex optimization problem with the convex constraint $x\in \mathcal{X}$. +We assume that $g:\mathbb{R}^d\rightarrow\mathbb{R}$ is a proximal-friendly (but possibly nonsmooth) convex function. % (not necessarily smooth). %For instance, for a convex set $\mathcal{X}\subset\RR^d$, letting $g=\delta_\mathcal{X}$ to be the indicator function on $\mathcal{X}$,~\eqref{prob:01} is a nonconvex optimization problem with the convex constraint $x\in \mathcal{X}$. %Our other requirement for $g$ is that computing its proximal operator is inexpensive, see Section~\ref{sec:preliminaries}.
-A host of problems in computer science \cite{khot2011grothendieck, lovasz2003semidefinite}, machine learning \cite{mossel2015consistency, song2007dependence}, and signal processing \cite{singer2011angular, singer2011three} naturally fall under the template of ~\eqref{prob:01}, including max-cut, clustering, generalized eigenvalue, as well as community detection. % \textbf{AE: We need to add some more specifics to this paragraph to convince people that this is an important template. What are the names of the problems you're citing? Max-cut, sparse regression, statistical learning with deep networks as prior \cite{Bora2017}, what else we can think of?} +A host of problems in computer science \cite{khot2011grothendieck, lovasz2003semidefinite}, machine learning \cite{mossel2015consistency, song2007dependence}, and signal processing~\cite{singer2011angular, singer2011three} naturally fall under the template of~\eqref{prob:01}, including max-cut, clustering, generalized eigenvalue, and community detection. % \textbf{AE: We need to add some more specifics to this paragraph to convince people that this is an important template. What are the names of the problems you're citing? Max-cut, sparse regression, statistical learning with deep networks as prior \cite{Bora2017}, what else we can think of?} -To address these applications, this paper builds up on the vast literature on the classical inexact augmented Lagrangian framework and proposes a simple, intuitive as well as easy-to-implement algorithm with total complexity results for ~\eqref{prob:01} under an interpretable geometric condition. 
Before we elaborate on the results, let us first motivate ~\eqref{prob:01} with an important application to semidefinite programming (SDP): +To address these applications, this paper builds upon the vast literature on the classical inexact augmented Lagrangian framework and proposes a simple, intuitive and easy-to-implement algorithm that enjoys total complexity results for~\eqref{prob:01}, under an interpretable geometric condition detailed below. Before we elaborate on the results, let us first motivate~\eqref{prob:01} with an important application to Semidefinite Programming (SDP): \vspace{-3mm} \paragraph{Vignette: Burer-Monteiro splitting.} -A powerful convex relaxation for max-cut, clustering, and several other problems above is provided by the SDP +A powerful convex relaxation for max-cut, clustering, and several other applications we listed above is provided by the SDP \begin{equation} \label{eq:sdp} \begin{cases} \underset{X\in\mathbb{S}^{d \times d}}{\min} \langle C, X \rangle \\ B(X) = b, \,\, X \succeq 0, \end{cases} \end{equation} % where $C\in \RR^{d\times d}$ and $X$ is a positive semidefinite and symmetric $d\times d$ matrix, %$\mathbb{S}^{d \times d}$ denotes the set of real symmetric matrices, and ${B}: \mathbb{S}^{d\times d} \to \mathbb{R}^m$ is a linear operator. If the unique-games conjecture is true, SDPs achieve the best approximation for the underlying discrete problem~\cite{raghavendra2008optimal}. -Since $d$ is often large, many first- and second-order methods for solving such SDP's are immediately ruled out, not only due to their high computational complexity, but also due to their storage requirements, which are $\mathcal{O}(d^2)$. +Since $d$ is often large, many first- and second-order methods for solving such SDPs are immediately ruled out, not only due to their high computational complexity, but also due to their storage requirements, which are $\mathcal{O}(d^2)$.
-A contemporary challenge in optimization therefore is to solve SDP's in small space and in a scalable fashion. A recent algorithm, i.e., homotopy conditional gradient method (HCGM) based on Linear Minimization Oracles (LMO), can address this template in small space via sketching \cite{yurtsever2018conditional}; however, such LMO-based methods are extremely slow in obtaining accurate solutions. +A contemporary challenge in optimization is therefore to solve SDPs in small space and in a scalable fashion. A recent algorithm, the Homotopy Conditional Gradient Method (HCGM), based on Linear Minimization Oracles (LMO), can address the template \eqref{eq:sdp} in a small space via sketching~\cite{yurtsever2018conditional}; however, such LMO-based methods are extremely slow in obtaining accurate solutions. -%In practice, $d$ is often very large which makes interior point methods, with their poor scaling in $d$, an unattractive option for solving ~\eqref{eq:sdp}. Attempts to resolve this issue prompted extensive research in computationally- and memory-efficient SDP solvers. The first such solvers relied on the so-called Linear Minimization Oracle (LMO), reviewed in Section~\ref{sec:related work}, alongside other scalabe SDP solvers. - -A key approach for solving \eqref{prob:01}, dating back to~\cite{burer2003nonlinear, burer2005local}, is the so-called Burer-Monteiro (BR) splitting $X=UU^\top$, where $U\in\mathbb{R}^{d\times r}$ and $r$ is selected according to the guidelines in~\cite{pataki1998rank, barvinok1995problems}. -It has been shown that these bounds on the rank, which are shown to be optimal~\cite{waldspurger2018rank}, under some assumptions removing the spurious local minima of the nonconvex factorized problem~\cite{boumal2016non}. +%In practice, $d$ is often very large which makes interior point methods, with their poor scaling in $d$, an unattractive option for solving~\eqref{eq:sdp}.
Attempts to resolve this issue prompted extensive research in computationally- and memory-efficient SDP solvers. The first such solvers relied on the so-called Linear Minimization Oracle (LMO), reviewed in Section~\ref{sec:related work}, alongside other scalable SDP solvers. +Alternatively, a powerful approach for solving \eqref{prob:01}, dating back to~\cite{burer2003nonlinear, burer2005local}, is the so-called Burer-Monteiro (BM) factorization $X=UU^\top$, where $U\in\mathbb{R}^{d\times r}$ and $r$ is selected according to the guidelines in~\cite{pataki1998rank, barvinok1995problems}. +Even though the resulting factorized problem is nonconvex, it has been shown that these bounds on the rank $r$, which are known to be optimal~\cite{waldspurger2018rank}, guarantee the removal of all spurious local minima, under mild additional assumptions~\cite{boumal2016non}. %so as to remove spurious local minima of the nonconvex factorized problem. Evidently, BM splitting has the advantage of lower storage and faster iterations. - -This splitting results in the following non-convex problem +This factorization results in the nonconvex problem \begin{equation} \label{prob:nc} \begin{cases} \underset{U\in\mathbb{R}^{d \times r}}{\min} \langle C, UU^\top \rangle \\ B(UU^\top) = b, \end{cases} \end{equation} -which can be written in the form of ~\eqref{prob:01}. - +which can be written in the form of~\eqref{prob:01}. %\textbf{Vignette 2: Latent group lasso} - %See Section ? for several examples. \textbf{We should explain this better and give more high level examples to motivate this .} %An example of our template in \eqref{prob:01} is semi-definite programming which provides powerful relaxations to above problems.
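To make the factorized template concrete, the following numerical sketch (our own toy max-cut-style instance, not taken from the paper) instantiates \eqref{prob:nc} with $B(X) = \operatorname{diag}(X)$ and $b = \mathbf{1}$, so the constraint $B(UU^\top) = b$ reduces to unit-norm rows of $U$:

```python
import numpy as np

# Toy illustration of the Burer-Monteiro factorization X = U U^T (our own
# example, not from the paper). For a max-cut-style SDP, B(X) = diag(X) and
# b = 1, so the factorized constraint B(UU^T) = b reads ||u_i|| = 1 for
# every row u_i of U.
rng = np.random.default_rng(0)
d, r = 6, 3

W = rng.random((d, d))
W = (W + W.T) / 2
np.fill_diagonal(W, 0)
L = np.diag(W.sum(axis=1)) - W       # graph Laplacian of a random weighted graph
C = -L / 4                           # max-cut objective <C, X>, in minimization form

U = rng.standard_normal((d, r))
U /= np.linalg.norm(U, axis=1, keepdims=True)   # normalize rows: diag(UU^T) = 1

X = U @ U.T
objective = np.trace(C @ X)                     # <C, UU^T>
residual = np.diag(X) - np.ones(d)              # B(UU^T) - b

print(np.allclose(residual, 0.0))               # True: U is feasible by construction
print(U.size, X.size)                           # 18 vs 36: U needs dr numbers, X needs d^2
```

The storage advantage ($dr$ versus $d^2$ numbers) is precisely the appeal of the splitting when $d$ is large and $r \ll d$.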
Denote the space of $d'\times d'$ symmetric matrices by $\mathbb{S}^{d'\times d'}$ and consider % %\begin{equation} % \label{sdp} -% \min_x \{h'(x): A'(x) = b', ~~x \in \mathcal{C'}~~\text{and}~~x\succeq0 \} +% \min_x \{h'(x): A'(x) = b',~~x \in \mathcal{C'}~~\text{and}~~x\succeq0 \} %\end{equation} % %where $h': \mathbb{S}^{d'\times d'} \to \RR$, $A'\colon\mathbb{S}^{d'\times d'}\to\RR^m$, $b'\in\RR^m$, and $C' \subseteq \mathbb{R}^{d'\times d'}$. This template clearly can be put to the form of \eqref{prob:01} by using \emph{Burer-Monteiro factorization} \cite{burer2003nonlinear, burer2005local}. -To solve \eqref{prob:nc}, the inexact Augmented Lagrangian Method (iALM) is widely used~\cite{burer2003nonlinear, burer2005local, kulis2007fast}, due to its cheap per iteration cost and also its empirical success in practice. Every (outer) iteration of iALM calls a solver to inexactly solve an intermediate augmented Lagrangian subproblem to near stationarity, and the user has freedom in choosing this solver, which could use first-order (say,~proximal gradient descent \cite{parikh2014proximal}) or second-order information, such as BFGS \cite{nocedal2006numerical}. -We argue that unlike its convex counterpart~\cite{nedelcu2014computational,lan2016iteration,xu2017inexact}, the convergence rate and the complexity of iALM for ~\eqref{prob:nc} are not well-understood, see Section~\ref{sec:related work} for a review of the related literature. Indeed, addressing this important theoretical gap is one of the contributions of our work. +To solve \eqref{prob:nc}, the inexact Augmented Lagrangian Method (iALM) is widely used~\cite{burer2003nonlinear, burer2005local, kulis2007fast}, due to its cheap per-iteration cost and also its empirical success in practice. 
Every (outer) iteration of iALM calls a solver to inexactly solve an intermediate augmented Lagrangian subproblem to near stationarity, and the user has freedom in choosing this solver, which could be a first-order algorithm (say,~proximal gradient descent \cite{parikh2014proximal}) or a second-order algorithm, such as BFGS \cite{nocedal2006numerical}. +We argue that, unlike its convex counterpart~\cite{nedelcu2014computational,lan2016iteration,xu2017inexact}, the convergence rate and the complexity of iALM for~\eqref{prob:nc} are not well-understood, as detailed in Section~\ref{sec:related work}. Indeed, addressing this important theoretical gap is one of the key contributions of the present work. -{\textbf{A brief summary of our contributions:} \\[1mm] -$\triangleright$ Our framework is future-proof in the sense that we obtain the convergence rate of iALM for ~\eqref{prob:01} with an arbitrary solver for finding first- and second-order stationary points. \\[2mm] -$\triangleright$ We investigate using different solvers for augmented Lagrangian subproblems and provide overall iteration complexity bounds for finding first- and second-order stationary points of ~\eqref{prob:01}. Our complexity bounds match the best theoretical complexity results in optimization, see Section~\ref{sec:related work}. \\[2mm] -$\triangleright$ We propose a novel geometric condition that simplifies the algorithmic analysis for iALM. We verify the condition for key problems described in Section~\ref{sec:experiments}. +\vspace{7pt} +\noindent {\textbf{Summary of contributions:} \\[1mm] +$\triangleright$ Our framework is future-proof in the sense that we obtain the convergence rate of iALM for~\eqref{prob:01} with an arbitrary solver for finding first- and second-order stationary points of each intermediate subproblem.
\\[2mm] +$\triangleright$ We investigate the use of different solvers for augmented Lagrangian subproblems and provide overall iteration complexity bounds for finding first- and second-order stationary points of~\eqref{prob:01}. Our complexity bounds match the best theoretical complexity results in optimization; see Section~\ref{sec:related work}. \\[2mm] +$\triangleright$ We propose a novel geometric condition that simplifies the algorithmic analysis of iALM. We verify the condition for a few key problems described in Section~\ref{sec:experiments}. -\textbf{Roadmap.} Section~\ref{sec:preliminaries} collects the main tools and our notation. % that we utilize. -We present the iALM in Section~\ref{sec:AL algorithm} and obtain its convergence rate to first- and second-order stationary points in Section~\ref{sec:cvg rate}, alongside their iteration complexities. -We provide a comprehensive review of the literature and highlight our key differences in % with the precedents in +%\vspace{7pt} +%\noindent {\textbf{Roadmap:} \\[1mm] +\paragraph{Roadmap.} Section~\ref{sec:preliminaries} collects the technical tools and notation. % that we utilize. +We present iALM in Section~\ref{sec:AL algorithm} and obtain its convergence rate to first- and second-order stationary points in Section~\ref{sec:cvg rate}, alongside their total iteration complexities. +A comprehensive review of the literature, highlighting the key distinctions of our framework, is presented in % with the precedents in Section~\ref{sec:related work}. -Section~\ref{sec:experiments} presents the numerical evidence and comparisons with the state-of-the-art. % methods. +Finally, Section~\ref{sec:experiments} provides the numerical evidence and comparisons with the state-of-the-art. % methods.
diff --git a/ICML19/Arxiv version/sections/preliminaries.tex b/ICML19/Arxiv version/sections/preliminaries.tex index 9577be1..e42d790 100644 --- a/ICML19/Arxiv version/sections/preliminaries.tex +++ b/ICML19/Arxiv version/sections/preliminaries.tex @@ -1,180 +1,176 @@ \vspace{-3mm} \section{Preliminaries \label{sec:preliminaries}} \paragraph{\textbf{{Notation.}}} We use the notation $\langle\cdot ,\cdot \rangle $ and $\|\cdot\|$ for the {standard inner} product and the norm on $\RR^d$. For matrices, $\|\cdot\|$ and $\|\cdot\|_F$ denote the spectral and the Frobenius norms, respectively. %For a matrix $M$, its row and null spaces are denoted by $\row(A)$ and $\text{null}(A)$, respectively. For a convex function $g:\RR^d\rightarrow\RR$, the subdifferential set at $x\in \RR^d$ is denoted by $\partial g(x)$ and we will occasionally use the notation $\partial g(x)/\b = \{ z/\b : z\in \partial g(x)\}$. -When presenting iteration complexity results, we often use $\widetilde{O}(\cdot)$ which suppresses the logarithmic dependencies. +When presenting iteration complexity results, we often use $\widetilde{O}(\cdot)$ to suppress the logarithmic dependencies. %The proximal operator $P_g:\RR^d\rightarrow\RR^d$ takes $x$ to %\begin{align} %P_g(x) = \underset{y}{\argmin} \, g(y) + \frac{1}{2}\|x-y\|^2, %\label{eq:defn of prox} %\end{align} %and, if $g=1_C$ is the indicator function of a convex set $C$, we will use the shorthand $P_C=P_{1_C}$ to show the orthogonal projection onto $C$. -We use the indicator function $\delta_\mathcal{X}:\RR^d\rightarrow\RR$ of a set $\mathcal{X}\subset\RR^d$, which takes $x$ to +Throughout, the indicator function $\delta_\mathcal{X}:\RR^d\rightarrow\RR$ of a set $\mathcal{X}\subset\RR^d$ takes $x$ to \begin{align} \delta_\mathcal{X}(x) = \begin{cases} 0 & x \in \mathcal{X}\\ \infty & x\notin \mathcal{X}. 
\end{cases} \label{eq:indicator} \end{align} - %The tangent cone of $C$ at $x\in C$ is denoted by $T_C(x)$ and The distance function from a point $x$ to $\mathcal{X}$ is denoted by $\dist(x,\mathcal{X}) = \min_{z\in \mathcal{X}} \|x-z\|$. For integers $k_0 \le k_1$, we denote $[k_0:k_1]=\{k_0,\ldots,k_1\}$. - For an operator $A:\RR^d\rightarrow\RR^m$ with components $\{A_i\}_{i=1}^m$, we let $DA(x) \in \mathbb{R}^{m\times d}$ denote the Jacobian of $A$, where the $i$th row of $DA(x)$ is the gradient vector $\nabla A_i(x) \in \RR^d$. %We denote the Hessian of the augmented Lagrangian with respect to $x$ as $\nabla _{xx} \mathcal{L}_{\beta}(x,y)$. \\ %{\color{red} define $A_i$ rigorously.} -\textbf{Smoothness.} -We require $f:\RR^d\rightarrow\RR$ and $A:\RR^d\rightarrow \RR^m$ in~\eqref{prob:01} to be smooth; i.e., there exists $\lambda_f,\lambda_A\ge 0$ such that +\paragraph{\textbf{Smoothness.}} +We require $f:\RR^d\rightarrow\RR$ and $A:\RR^d\rightarrow \RR^m$ in~\eqref{prob:01} to be smooth; namely, there exist $\lambda_f,\lambda_A\ge 0$ such that % %\vspace{-.3cm} \begin{align} \| \nabla f(x) - \nabla f(x')\| & \le \lambda_f \|x-x'\|, \nonumber\\ \| DA(x) - DA(x') \| & \le \lambda_A \|x-x'\|, \label{eq:smoothness basic} \end{align} for every $x,x'\in \mathbb{R}^d$. -\textbf{The augmented Lagrangian method (ALM)}. -ALM is a classical algorithm, first appeared in~\cite{hestenes1969multiplier, powell1969method} and extensively studied in~\cite{bertsekas2014constrained}. -For solving \eqref{prob:01}, ALM suggests solving the problem +\paragraph{\textbf{Augmented Lagrangian Method (ALM)}.} +ALM is a classical algorithm, first appearing in~\cite{hestenes1969multiplier, powell1969method} and then extensively studied in~\cite{bertsekas2014constrained}.
+Instead of \eqref{prob:01}, ALM proposes to solve the equivalent problem % %\vspace{-.3cm} \begin{equation} \min_{x} \max_y \,\,\mathcal{L}_\beta(x,y) + g(x), \label{eq:minmax} \end{equation} %\vspace{-.5cm} % where, for $\b>0$, $\mathcal{L}_\b$ is the corresponding augmented Lagrangian, defined as \begin{align} \label{eq:Lagrangian} \mathcal{L}_\beta(x,y) := f(x) + \langle A(x), y \rangle + \frac{\beta}{2}\|A(x)\|^2. \end{align} - %\textbf{AE: Adding the bias $b$ doesn't seem to add much except making the presentation more clunky.} %\vspace{-.5cm} - -The minimax formulation in \eqref{eq:minmax} naturally suggests the following algorithm for solving \eqref{prob:01}. For dual step sizes $\{\s_k\}_k$, consider the iterates +The minimax formulation in \eqref{eq:minmax} naturally suggests the following algorithm for solving \eqref{prob:01}: For dual step sizes $\{\s_k\}_k$, consider the iterates \begin{equation}\label{e:exac} x_{k+1} \in \underset{x}{\argmin} \,\, \mathcal{L}_{\beta}(x,y_k)+g(x), \end{equation} \begin{equation*} y_{k+1} = y_k+\s_k A(x_{k+1}). \end{equation*} -However, updating $x_{k+1}$ above requires solving the nonconvex problem~\eqref{e:exac} to optimality, which is typically intractable. Instead, it is often easier to find an approximate first- or second-order stationary point of~\eqref{e:exac}. +Unfortunately, updating $x_{k+1}$ above requires solving the nonconvex problem~\eqref{e:exac} to optimality, which is typically intractable. Instead, it is often easier to find an approximate first- or second-order stationary point of~\eqref{e:exac}. %We therefore consider an algorithm that only requires approximate stationarity in every iteration. -Hence, we argue that by gradually improving the stationarity precision and increasing the penalty weight $\b$ above, we can reach a stationary point of the main problem in~\eqref{prob:01}, as detailed in Section~\ref{sec:AL algorithm}. 
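For intuition only, the following self-contained sketch mimics these iterates on a toy instance of our own choosing ($g = 0$, $f(x) = \|x\|^2/2$, and the single nonlinear constraint $A(x) = \|x\|^2 - 1$; neither the instance nor the step sizes come from the paper), replacing the intractable argmin in \eqref{e:exac} by a fixed number of gradient steps:

```python
import numpy as np

# Sketch of the (inexact) ALM iterates from the text, on a toy instance of
# our own choosing: f(x) = ||x||^2/2, g = 0, A(x) = ||x||^2 - 1 (nonlinear).
# The primal argmin is replaced by a fixed number of gradient steps; the
# dual update is y_{k+1} = y_k + sigma_k * A(x_{k+1}).

def A(x):
    return np.array([x @ x - 1.0])

def DA(x):
    return 2.0 * x[None, :]                    # Jacobian of A, shape (1, d)

def grad_L(x, y, beta):
    # grad_x L_beta(x, y) = grad f(x) + DA(x)^T (y + beta * A(x))
    return x + DA(x).T @ (y + beta * A(x))

def inexact_alm(x, y, beta=10.0, sigma=1.0, outer=50, inner=200, step=1e-2):
    for _ in range(outer):
        for _ in range(inner):                 # approximate primal solve
            x = x - step * grad_L(x, y, beta)
        y = y + sigma * A(x)                   # dual ascent step
    return x, y

x, y = inexact_alm(np.array([1.0, 0.5]), np.zeros(1))
print(abs(A(x)[0]))    # near-feasibility: the constraint residual is tiny
```

On this instance one can check by hand that the dual variable approaches $y=-1/2$ and the iterates approach the unit circle; a fixed penalty weight happens to suffice here, whereas Algorithm~\ref{Algo:2} increases $\beta_k$ and tightens the inner accuracy across outer iterations.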
+Our key contribution is that, by gradually improving the stationarity precision and increasing the penalty weight $\b$ above, we can reach a stationary point of the main problem in~\eqref{prob:01}, as detailed in Section~\ref{sec:AL algorithm}. -{\textbf{{{Optimality conditions.}}} \label{sec:opt cnds}} +\paragraph{\textbf{Optimality conditions.} \label{sec:opt cnds}} {First-order necessary optimality conditions} for \eqref{prob:01} are well-understood. {Indeed, $x\in \RR^d$ is a first-order stationary point of~\eqref{prob:01} if there exists $y\in \RR^m$ such that \begin{align} \begin{cases} -\nabla f(x) - DA(x)^\top y \in \partial g(x)\\ A(x) = 0, \end{cases} \label{e:inclu1} \end{align} where $DA(x)$ is the Jacobian of $A$ at $x$. Recalling \eqref{eq:Lagrangian}, we observe that \eqref{e:inclu1} is equivalent to \begin{align} \begin{cases} -\nabla_x \mathcal{L}_\beta(x,y) \in \partial g(x)\\ A(x) = 0, \end{cases} \label{e:inclu2} \end{align} which is in turn the necessary optimality condition for \eqref{eq:minmax}. } %Note that \eqref{e:inclu2} is equivalent to %\begin{align} %\left[ %\begin{array}{c} %\nabla_x \L_\b(x,y) \\ %\nabla_y \L_\b(x,y) %\end{array} %\right] %\in %\left\{ %\left[ %\begin{array}{c} % v\\ %0 %\end{array} %\right] %: v\in \partial g(x) \right\} %, %\end{align} %which rephrases the stationarity condition in terms of the gradient of the duality gap of Program~\eqref{eq:Lagrangian}. -Inspired by this, we say that $x$ is an $(\epsilon_f,\b)$ first-order stationary point if +Inspired by this, we say that $x$ is an $(\epsilon_f,\b)$ first-order stationary point of \eqref{prob:01} if \begin{align} \begin{cases} \dist(-\nabla_x \mathcal{L}_\beta(x,y), \partial g(x)) \leq \epsilon_f \\ \| A(x) \| \leq \epsilon_f, \end{cases} \label{eq:inclu3} \end{align} -for $\epsilon_f\ge 0$. +where $\epsilon_f\ge 0$. 
In light of \eqref{eq:inclu3}, a suitable metric for evaluating the stationarity of a pair $(x,y)\in \RR^d\times \RR^m$ is \begin{align} \dist\left(-\nabla_x \mathcal{L}_\beta(x,y), \partial g(x) \right) + \|A(x)\| , \label{eq:cvg metric} \end{align} -which we use as the first-order stopping criterion. +which we will use as the first-order stopping criterion. %When $g=0$, it is also not difficult to verify that the expression in \eqref{eq:cvg metric} is the norm of the gradient of the duality gap in \eqref{eq:minmax}. As an example, for a convex set $\mathcal{X}\subset\RR^d$, suppose that $g = \delta_\mathcal{X}$ is the indicator function on $\mathcal{X}$.} Let also $T_\mathcal{X}(x) \subseteq \RR^d$ denote the tangent cone to $\mathcal{X}$ at $x$, and with $P_{T_\mathcal{X}(x)}:\RR^d\rightarrow\RR^d$, we denote the orthogonal projection onto this tangent cone. Then, for $u\in\RR^d$, it is not difficult to verify that \begin{align}\label{eq:dist_subgrad} \dist\left(u, \partial g(x) \right) = \| P_{T_\mathcal{X}(x)}(u) \|. \end{align} % When $g = 0$, a first-order stationary point $x\in \RR^d$ of \eqref{prob:01} is also second-order stationary if % %\vspace{-.5cm} \begin{equation} \lambda_{\text{min}}(\nabla _{xx} \mathcal{L}_{\beta}(x,y)) > 0, \end{equation} %\vspace{-.5cm} % where $\nabla_{xx}\mathcal{L}_\b$ is the Hessian with respect to $x$, and $\lambda_{\text{min}}(\cdot)$ returns the smallest eigenvalue of its argument. Analogously, $x$ is an $(\epsilon_f, \epsilon_s,\b)$ second-order stationary point if, in addition to \eqref{eq:inclu3}, it holds that % %\vspace{-.5cm} \begin{equation}\label{eq:sec_opt} \lambda_{\text{min}}(\nabla _{xx} \mathcal{L}_{\beta}(x,y)) \ge -\epsilon_s, \end{equation} for $\epsilon_s>0$. %\vspace{-.5cm} % -Naturally, for second-order stationarity, we use $\lambda_{\text{min}}(\nabla _{xx} \mathcal{L}_{\beta}(x,y))$ as the stopping criterion. 
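The identity \eqref{eq:dist_subgrad} can be sanity-checked numerically; the snippet below (a hypothetical example of ours, with $\mathcal{X}$ the unit ball in $\RR^2$) uses the fact that at a boundary point $x$ the tangent cone is the halfspace $\{v : \langle v, x\rangle \le 0\}$ and the subdifferential of $\delta_\mathcal{X}$ is the normal cone $\{t x : t \ge 0\}$:

```python
import numpy as np

# Check dist(u, subdifferential of g at x) = ||P_{T_X(x)}(u)|| for
# g = indicator of the unit ball (our own toy example, not from the paper).
# At a boundary point x (||x|| = 1), the tangent cone is {v : <v, x> <= 0};
# projecting onto it removes the outward radial component of u.

def proj_tangent(u, x):
    # projection onto the halfspace {v : <v, x> <= 0}, assuming ||x|| = 1
    return u - max(u @ x, 0.0) * x

x = np.array([1.0, 0.0])     # boundary point of the unit ball
u = np.array([2.0, 3.0])     # stand-in for -grad_x L_beta(x, y)

# direct distance to the normal cone {t x : t >= 0}, by a fine grid search
lhs = min(np.linalg.norm(u - t * x) for t in np.linspace(0.0, 10.0, 100001))
rhs = np.linalg.norm(proj_tangent(u, x))

print(lhs, rhs)              # both equal 3.0 (up to grid resolution)
```

This is exactly the quantity entering the stopping criterion \eqref{eq:cvg metric} when $g$ is an indicator function.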
+Naturally, for second-order stationarity, we will use $\lambda_{\text{min}}(\nabla _{xx} \mathcal{L}_{\beta}(x,y))$ as the stopping criterion. \paragraph{{\textbf{Smoothness lemma.}}} This next result controls the smoothness of $\L_\b(\cdot,y)$ for a fixed $y$. The proof is standard but nevertheless included in Appendix~\ref{sec:proof of smoothness lemma} for completeness. \begin{lemma}[\textbf{smoothness}]\label{lem:smoothness} For fixed $y\in \RR^m$ and $\rho,\rho'\ge 0$, it holds that \begin{align} \| \nabla_x \mathcal{L}_{\beta}(x, y)- \nabla_x \mathcal{L}_{\beta}(x', y) \| \le \lambda_\b \|x-x'\|, \end{align} for every $x,x' \in \{ x'': \|A(x'') \|\le \rho, \|x''\|\le \rho'\}$, where \begin{align} \lambda_\beta & \le \lambda_f + \sqrt{m}\lambda_A \|y\| + (\sqrt{m}\lambda_A\rho + d \lambda'^2_A )\b \nonumber\\ & =: \lambda_f + \sqrt{m}\lambda_A \|y\| + \lambda''(A,\rho,\rho') \b. \label{eq:smoothness of Lagrangian} \end{align} Above, $\lambda_f,\lambda_A$ were defined in (\ref{eq:smoothness basic}) and \begin{align} \lambda'_A := \max_{\|x\|\le \rho'}\|DA(x)\|. \end{align} \end{lemma} diff --git a/ICML19/Arxiv version/sections/related_works.tex b/ICML19/Arxiv version/sections/related_works.tex index 00f7969..453f069 100644 --- a/ICML19/Arxiv version/sections/related_works.tex +++ b/ICML19/Arxiv version/sections/related_works.tex @@ -1,87 +1,85 @@ %!TEX root = ../main.tex \section{Related works \label{sec:related work}} ALM has a long history in the optimization literature, dating back to~\cite{hestenes1969multiplier, powell1969method}. In the special case of~\eqref{prob:01} with a convex function $f$ and a linear operator $A$, standard, inexact and linearized versions of ALM have been extensively studied~\cite{lan2016iteration,nedelcu2014computational,tran2018adaptive,xu2017inexact}. 
-Classical works on ALM focused on the general template of~\eqref{prob:01} with nonconvex $f$ and nonlinear $A$, with arguably stronger assumptions and required exact solutions to the subproblems of the form \eqref{e:exac}, which appear in Step 2 of Algorithm~\ref{Algo:2}, see for instance \cite{bertsekas2014constrained}. +Classical works on ALM focused on the general template of~\eqref{prob:01} with nonconvex $f$ and nonlinear $A$, but under arguably stronger assumptions, and they required exact solutions to the subproblems of the form~\eqref{e:exac}, which appear in Step 2 of Algorithm~\ref{Algo:2}; see for instance \cite{bertsekas2014constrained}. A similar analysis was conducted in~\cite{fernandez2012local} for the general template of~\eqref{prob:01}. The authors considered inexact ALM and proved convergence rates for the outer iterates, under specific assumptions on the initialization of the dual variable. -However, unlike our results, the authors did not analyze how to solve the subproblems inexactly and they did not provide total complexity results and verifiable conditions. +However, unlike our results, the authors did not analyze how to solve the subproblems inexactly and they did not provide total complexity results and verifiable conditions; see Section~\ref{sec:experiments}. %\textbf{AE: Mention if this paper was convex or nonconvex?} -Problem~\eqref{prob:01} with similar assumptions to us is also studied in~\cite{birgin2016evaluation} and~\cite{cartis2018optimality} for first-order and second-order stationarity, respectively, with explicit iteration complexity analysis. -As we have mentioned in Section~\ref{sec:cvg rate}, our iteration complexity results matches these theoretical algorithms with a simpler algorithm and a simpler analysis. -In addition, these algorithms require setting final accuracies since they utilize this information in the algorithm.
In contrast to \cite{birgin2016evaluation,cartis2018optimality}, Algorithm~\ref{Algo:2} does not set accuracies a priori. +Problem~\eqref{prob:01} is also studied in~\cite{birgin2016evaluation} and~\cite{cartis2018optimality} for first-order and second-order stationarity, respectively, with explicit iteration complexity analysis and assumptions similar to ours. +As we have mentioned in Section~\ref{sec:cvg rate}, our iteration complexity results match those of these theoretical works, with a far simpler algorithm and a straightforward analysis. +In addition, these algorithms require setting final accuracies in advance, since they utilize this information in the algorithm. In contrast to \cite{birgin2016evaluation,cartis2018optimality}, Algorithm~\ref{Algo:2} does not set the final accuracy a priori. \cite{cartis2011evaluation} also considers the same template~\eqref{prob:01} for first-order stationarity with a penalty-type method instead of ALM. Even though the authors show $\mathcal{O}(1/\epsilon^2)$ complexity, this result is obtained by assuming that the penalty parameter remains bounded. We note that such an assumption can also be used to match our complexity results. -\cite{bolte2018nonconvex} studies the general template~\eqref{prob:01} with specific assumptions involving local error bound conditions for the~\eqref{prob:01}. -These conditions are studied in detail in~\cite{bolte2017error}, but their validity for general SDPs~\eqref{eq:sdp} has never been established. This work also lacks the total iteration complexity analysis presented here. % that we can compare. +\cite{bolte2018nonconvex} studies the general template~\eqref{prob:01} with specific assumptions involving local error bound conditions for~\eqref{prob:01}. +These conditions are studied in detail in~\cite{bolte2017error}, but their validity for general SDPs in~\eqref{eq:sdp} has never been established. This work also lacks the total iteration complexity analysis presented here. % that we can compare.
Another work~\cite{clason2018acceleration} focused on solving~\eqref{prob:01} by adapting the primal-dual method of Chambolle and Pock~\cite{chambolle2011first}. The authors proved the convergence of the method and provided a convergence rate by imposing error bound conditions on the objective function that do not hold for standard SDPs.%~\eqref{eq:sdp}. %Some works also considered the application of ALM/ADMM to nonconvex problems~\cite{wang2015global, liu2017linearized}. %These works assume that the operator in~\eqref{prob:01} is linear, therefore, they do not apply to our setting. %since we have a nonlinear constraint in addition to a nonconvex objective function. \cite{burer2003nonlinear, burer2005local} is the first work that proposes the splitting $X=UU^\top$ for solving SDPs of the form~\eqref{eq:sdp}. -Following these works, the literature on Burer-Monteiro (BM) splitting for the large part focused on using ALM for solving the reformulated problem~\eqref{prob:nc}. - -However, this approach has a few drawbacks: First, it requires exact solutions in Step 2 of Algorithm~\ref{Algo:2} in theory, which in practice is replaced with inexact solutions. Second, their results only establish convergence without providing the rates. In this sense, our work provides a theoretical understanding of the BM splitting with inexact solutions to Step 2 of Algorithm~\ref{Algo:2} and complete iteration complexities. +Following these works, the literature on Burer-Monteiro (BM) splitting has for the most part focused on using ALM for solving the reformulated problem~\eqref{prob:nc}. However, this approach has a few drawbacks: First, it requires exact solutions in Step 2 of Algorithm~\ref{Algo:2} in theory, which in practice is replaced with inexact solutions. Second, their results only establish convergence without providing the rates.
In this sense, our work provides a theoretical understanding of the BM splitting with inexact solutions to Step 2 of Algorithm~\ref{Algo:2} and complete iteration complexities. %\textbf{AE: Ahmet you discuss global optimality below but are you sure that's relevant to our work? in the sense that we just give an algorithm to solve to local optimality.} %Their practical stopping condition is also not analyzed theoretically. %In addition, Burer and Monteiro in~\cite{burer2005local} needed to bound the primal iterates they have to put an artificial bound to the primal domain which will be ineffective in practice; which is impossible to do without knowing the norm of the solution. % and their results do not extend to the case where projection in primal domain are required in each iteration. \cite{bhojanapalli2016dropping, park2016provable} are among the earliest efforts to show convergence rates for BM splitting, focusing on the special case of SDPs without any linear constraints. For these specific problems, they prove the convergence of gradient descent to global optima with convergence rates, assuming favorable initialization. These results, however, do not apply to general SDPs of the form~\eqref{eq:sdp} where the difficulty arises due to the linear constraints. %{\color{red} Ahmet: I have to cite this result bc it is very relevant.} %\textbf{AE: also if you do want to talk about global optimality you could also talk about the line of works that show no spurious local optima.} %In their followup work~, the authors considered projected gradient descent, but only for restricted strongly convex functions. %Their results are not able to extend to linear constraints and general convex functions. %, therefore not applicable to general SDPs. %They do not apply to the general semidefinite programming since $f(U)=\langle C, UU^\ast \rangle$. %, these conditions are not satisfied. 
%{\color{red} Ahmet: I have to cite this result bc it is relevant and the reviewers will complain if I don't} \cite{bhojanapalli2018smoothed} focused on the quadratic penalty formulation of~\eqref{prob:01}, namely, \begin{equation}\label{eq:pen_sdp} \min_{X\succeq 0} \langle C, X\rangle + \frac{\mu}{2} \| B(X)-b\|^2, \end{equation} -which after BM splitting becomes +which after BM factorization becomes \begin{equation}\label{eq:pen_nc} \min_{U\in\mathbb{R}^{d\times r}} \langle C, UU^\top \rangle + \frac{\mu}{2} \|B(UU^\top)-b\|^2, \end{equation} for which they study the optimality of the second-order stationary points. These results establish a connection between the stationary points of~\eqref{eq:pen_nc} and global optima of~\eqref{eq:pen_sdp}. In contrast, we focus on the relation of the stationary points of~\eqref{eq:minmax} to the constrained problem~\eqref{prob:01}. %\textbf{AE: so i'm not sure why we are comparing with them. their work establish global optimality of local optima but we just find local optima. it seems to add to our work rather compare against it. you also had a paragraph for global optimality with close initialization earlier. could you put the two next to each other?} Another popular approach for solving SDPs is due to~\cite{boumal2014manopt, boumal2016global, boumal2016non}, focusing on the case where the constraints in~\eqref{prob:01} can be written as a Riemannian manifold after BM splitting. In this case, the authors apply the Riemannian gradient descent and Riemannian trust region methods for obtaining first- and second-order stationary points, respectively. They obtain~$\mathcal{O}(1/\epsilon^2)$ complexity for finding first-order stationary points and~$\mathcal{O}(1/\epsilon^3)$ complexity for finding second-order stationary points. While these complexities appear better than ours, the smooth manifold requirement in these works is indeed restrictive.
In particular, this requirement holds for max-cut and generalized eigenvalue problems, but it is not satisfied for other important SDPs such as quadratic assignment problems (QAP), optimal power flow, and clustering with general affine constraints. In addition, as noted in~\cite{boumal2016global}, the per-iteration cost of their method for the max-cut problem is an astronomical~$\mathcal{O}({d^6})$. % which is astronomical. %ly larger than our cost of $\mathcal{O}(n^2r)$ where $r \ll n$. %\cite{birgin2016evaluation} can handle the same problem but their algorithm is much more complicated than ours. Lastly, there also exists a line of work for solving SDPs in their original convex formulation, in a storage-efficient way~\cite{nesterov2009primal, yurtsever2015universal, yurtsever2018conditional}. These works have global optimality guarantees by virtue of directly solving the convex formulation. On the downside, these works require the use of eigenvalue routines and exhibit significantly slower convergence as compared to nonconvex approaches~\cite{jaggi2013revisiting}.