diff --git a/ICML19/Drafts/sections/preliminaries.tex b/ICML19/Drafts/sections/preliminaries.tex index 106db1e..43975b8 100644 --- a/ICML19/Drafts/sections/preliminaries.tex +++ b/ICML19/Drafts/sections/preliminaries.tex @@ -1,179 +1,179 @@ \vspace{-3mm} \section{Preliminaries \label{sec:preliminaries}} \paragraph{\textbf{{Notation.}}} We use the notation $\langle\cdot ,\cdot \rangle $ and $\|\cdot\|$ for the {standard inner} product and the norm on $\RR^d$. For matrices, $\|\cdot\|$ and $\|\cdot\|_F$ denote the spectral and the Frobenius norms, respectively. %For a matrix $M$, its row and null spaces are denoted by $\row(A)$ and $\text{null}(A)$, respectively. For a convex function $g:\RR^d\rightarrow\RR$, the subdifferential set at $x\in \RR^d$ is denoted by $\partial g(x)$ and we will occasionally use the notation $\partial g(x)/\b = \{ z/\b : z\in \partial g(x)\}$. When presenting iteration complexity results, we use $\tilde{O}(\cdot)$ which suppresses the logarithmic dependencies. %The proximal operator $P_g:\RR^d\rightarrow\RR^d$ takes $x$ to %\begin{align} %P_g(x) = \underset{y}{\argmin} \, g(y) + \frac{1}{2}\|x-y\|^2, %\label{eq:defn of prox} %\end{align} %and, if $g=1_C$ is the indicator function of a convex set $C$, we will use the shorthand $P_C=P_{1_C}$ to show the orthogonal projection onto $C$. We use the indicator function $\delta_\mathcal{X}:\RR^d\rightarrow\RR$ of a set $\mathcal{X}\subset\RR^d$ \begin{align} g(x) = \delta_\mathcal{X}(x) = \begin{cases} 0 & x \in \mathcal{X}\\ \infty & x\notin \mathcal{X}. \end{cases} \label{eq:indicator} \end{align} %The tangent cone of $C$ at $x\in C$ is denoted by $T_C(x)$ and The distance function from a point $x$ to $\mathcal{X}$ is denoted by $\dist(x,\mathcal{X}) = \min_{z\in \mathcal{X}} \|x-z\|$. For integers $k_0 \le k_1$, we denote $[k_0:k_1]=\{k_0,\ldots,k_1\}$. 
For an operator $A:\RR^d\rightarrow\RR^m$ with components $A_i:\RR^d\rightarrow\RR$ for $i\in[1:m]$, we let $DA(x) \in \mathbb{R}^{m\times d}$ denote the Jacobian of $A$ at $x$, where the $i$th row of $DA(x)$ is the gradient vector $\nabla A_i(x)$. We denote the Hessian of the augmented Lagrangian with respect to $x$ by $\nabla _{xx} \mathcal{L}_{\beta}(x,y)$. \textbf{Smoothness.} We require $f:\RR^d\rightarrow\RR$ and $A:\RR^d\rightarrow \RR^m$ to be smooth; i.e., there exist $\lambda_f,\lambda_A\ge 0$ such that % %\vspace{-.3cm} \begin{align} \| \nabla f(x) - \nabla f(x')\| & \le \lambda_f \|x-x'\|, \nonumber\\ \| DA(x) - DA(x') \| & \le \lambda_A \|x-x'\|, \label{eq:smoothness basic} \end{align} for every $x,x'\in \mathbb{R}^d$. \textbf{Augmented Lagrangian method}. ALM is a classical algorithm, proposed in~\cite{hestenes1969multiplier, powell1969method} and further studied in~\cite{bertsekas2014constrained}. For solving Program~\eqref{prob:01}, ALM suggests solving the equivalent problem % %\vspace{-.3cm} \begin{equation} \min_{x} \max_y \,\,\mathcal{L}_\beta(x,y) + g(x), \label{eq:minmax} \end{equation} %\vspace{-.5cm} where $\mathcal{L}_\b$ is the augmented Lagrangian corresponding to Program~\eqref{prob:01}, defined as \begin{align} \label{eq:Lagrangian} \mathcal{L}_\beta(x,y) := f(x) + \langle A(x), y \rangle + \frac{\beta}{2}\|A(x)\|^2. \end{align} %\vspace{-.5cm} The minimax formulation in \eqref{eq:minmax} naturally suggests the following algorithm to solve \eqref{prob:01}. For dual step sizes $\{\s_k\}_k$, consider the following iterative updates: \begin{equation}\label{e:exac} x_{k+1} \in \underset{x}{\argmin} \,\, \mathcal{L}_{\beta}(x,y_k)+g(x), \end{equation} \begin{equation*} y_{k+1} = y_k+\s_k A(x_{k+1}). 
\end{equation*} -However, updating $x$ above requires solving the non-convex Program~\eqref{e:exac} to global optimality, which is typically intractable. It is instead often easier to find an approximate first- or second-order stationary point of Program~\eqref{e:exac}. +However, updating $x$ above requires solving the non-convex Program~\eqref{e:exac} to global optimality, which is typically intractable. Instead, it is often easier to find an approximate first- or second-order stationary point of Program~\eqref{e:exac}. %We therefore consider an algorithm that only requires approximate stationarity in every iteration. Hence, we argue that by gradually decreasing the stationarity precision and increasing the penalty weight $\b$ above, we can reach a stationary point of Program~\eqref{prob:01}, as in Section~\ref{sec:AL algorithm}. {\textbf{{{Optimality conditions.}}} \label{sec:opt cnds}} {First-order necessary optimality conditions} for Program~\eqref{prob:01} are well-understood. {Indeed, $x\in \RR^d$ is a first-order stationary point of Program \eqref{prob:01} if there exists $y\in \RR^m$ such that \begin{align} \begin{cases} -\nabla f(x) - DA(x)^\top y \in \partial g(x)\\ A(x) = 0, \end{cases} \label{e:inclu1} \end{align} where $DA(x)$ is the Jacobian of $A$ at $x$. Recalling \eqref{eq:Lagrangian}, we observe that \eqref{e:inclu1} is equivalent to \begin{align} \begin{cases} -\nabla_x \mathcal{L}_\beta(x,y) \in \partial g(x)\\ A(x) = 0, \end{cases} \label{e:inclu2} \end{align} which is in turn the necessary optimality condition for Program \eqref{eq:minmax}. } %Note that \eqref{e:inclu2} is equivalent to %\begin{align} %\left[ %\begin{array}{c} %\nabla_x \L_\b(x,y) \\ %\nabla_y \L_\b(x,y) %\end{array} %\right] %\in %\left\{ %\left[ %\begin{array}{c} % v\\ %0 %\end{array} %\right] %: v\in \partial g(x) \right\} %, %\end{align} %which rephrases the stationarity condition in terms of the gradient of the duality gap of Program~\eqref{eq:Lagrangian}. 
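To make the equivalence of \eqref{e:inclu1} and \eqref{e:inclu2} concrete, here is a minimal numerical sketch on a hypothetical toy instance ($f(x)=\|x\|^2/2$, a single constraint $A(x)=x_1^2+x_2-1$, and $g=0$; all names and values are illustrative, not part of the paper's method). It checks that on the feasible set the $\beta$-term of $\nabla_x\mathcal{L}_\beta$ vanishes, so the two conditions coincide.

```python
import numpy as np

# Toy instance (hypothetical, for illustration only):
#   f(x) = ||x||^2 / 2,   A(x) = [x_0^2 + x_1 - 1]   (d = 2, m = 1),   g = 0.
def f_grad(x):
    return x

def A(x):
    return np.array([x[0] ** 2 + x[1] - 1.0])

def DA(x):
    return np.array([[2.0 * x[0], 1.0]])  # 1 x 2 Jacobian; row i is grad A_i(x)

def grad_x_L(x, y, beta):
    # grad_x L_beta(x, y) = grad f(x) + DA(x)^T (y + beta * A(x))
    return f_grad(x) + DA(x).T @ (y + beta * A(x))

x = np.array([0.0, 1.0])   # feasible point: A(x) = 0
y = np.array([-1.0])
beta = 10.0

# When A(x) = 0 the beta-term vanishes, so (e:inclu2) reduces to (e:inclu1):
lhs = grad_x_L(x, y, beta)
rhs = f_grad(x) + DA(x).T @ y        # gradient appearing in (e:inclu1)
print(np.allclose(lhs, rhs))         # True
print(np.linalg.norm(lhs))           # 0.0: this (x, y) is first-order stationary
```

Note that the check is independent of $\beta$, exactly as the text asserts: the penalty term only contributes off the feasible set.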
Inspired by this, we say that $x$ is an $(\epsilon_f,\b)$ first-order stationary point if \begin{align} \begin{cases} \dist(-\nabla_x \mathcal{L}_\beta(x,y), \partial g(x)) \leq \epsilon_f \\ \| A(x) \| \leq \epsilon_f, \end{cases} \label{eq:inclu3} \end{align} for $\epsilon_f\ge 0$. In light of \eqref{eq:inclu3}, a suitable metric for evaluating the stationarity of a pair $(x,y)\in \RR^d\times \RR^m$ is \begin{align} \dist^2\left(-\nabla_x \mathcal{L}_\beta(x,y), \partial g(x) \right) + \|A(x)\|^2 , \label{eq:cvg metric} \end{align} which we use as the stopping criterion. When $g=0$, it is not difficult to verify that the expression in \eqref{eq:cvg metric} is the squared norm of the gradient of the duality gap in \eqref{eq:minmax}. As an example, for a convex set $\mathcal{X}\subset\RR^d$, suppose that $g = \delta_\mathcal{X}$ is the indicator function of $\mathcal{X}$. We also let $T_\mathcal{X}(x) \subseteq \RR^d$ denote the tangent cone to $\mathcal{X}$ at $x$, and use $P_{T_\mathcal{X}(x)}:\RR^d\rightarrow\RR^d$ to denote the orthogonal projection onto this tangent cone. Then, for $u\in\RR^d$, it is not difficult to verify that \begin{align}\label{eq:dist_subgrad} \dist\left(u, \partial g(x) \right) = \| P_{T_\mathcal{X}(x)}(u) \|. \end{align} % Moreover, when $g = 0$, a first-order stationary point $x\in \RR^d$ of \eqref{prob:01} is also a second-order stationary point if % %\vspace{-.5cm} \begin{equation} \lambda_{\text{min}}(\nabla _{xx} \mathcal{L}_{\beta}(x,y)) > 0, \end{equation} %\vspace{-.5cm} % where $\lambda_{\text{min}}(\cdot)$ returns the smallest eigenvalue of its argument. Analogously, $x$ is an $(\epsilon_f, \epsilon_s,\b)$ second-order stationary point if, in addition to \eqref{eq:inclu3}, it holds that % %\vspace{-.5cm} \begin{equation}\label{eq:sec_opt} \lambda_{\text{min}}(\nabla _{xx} \mathcal{L}_{\beta}(x,y)) \ge -\epsilon_s, \end{equation} for $\epsilon_s>0$. 
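As an illustration of evaluating the stopping criterion \eqref{eq:cvg metric} via \eqref{eq:dist_subgrad}, the following sketch assumes, purely for illustration, that $\mathcal{X}$ is the nonnegative orthant $\RR^d_+$, for which the tangent-cone projection has a simple closed form; the function names are hypothetical, not from the paper.

```python
import numpy as np

# Hypothetical example: g is the indicator of X = R^d_+, whose tangent cone at x is
#   T_X(x) = { v : v_i >= 0 for every active index i, i.e. x_i = 0 },
# and, by (eq:dist_subgrad), dist(u, dg(x)) = || P_{T_X(x)}(u) ||.
def proj_tangent_cone(u, x, tol=1e-12):
    """Orthogonal projection onto T_X(x) for X = R^d_+."""
    v = u.astype(float).copy()
    active = x <= tol                        # coordinates on the boundary of X
    v[active] = np.maximum(v[active], 0.0)   # only nonnegative directions are tangent there
    return v

def stationarity_measure(grad_L_x, Ax, x):
    """Squared metric (eq:cvg metric): dist^2(-grad_x L, dg(x)) + ||A(x)||^2."""
    u = -np.asarray(grad_L_x, dtype=float)
    return np.linalg.norm(proj_tangent_cone(u, x)) ** 2 + np.linalg.norm(Ax) ** 2

# Example point: coordinate 0 is active (x_0 = 0), coordinate 1 is free.
x = np.array([0.0, 2.0])
grad_L_x = np.array([3.0, -0.5])   # so -grad_L_x = (-3.0, 0.5)
Ax = np.array([0.1])
print(stationarity_measure(grad_L_x, Ax, x))   # = 0.5**2 + 0.1**2 = 0.26
```

The negative component $-3.0$ is clipped because it points into the normal cone at the active coordinate, so it does not count against stationarity; only the tangential residual $0.5$ and the feasibility gap $0.1$ contribute.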
%\vspace{-.5cm} % Naturally, for second-order stationarity, we use $\lambda_{\text{min}}(\nabla _{xx} \mathcal{L}_{\beta}(x,y))$ as the stopping criterion. -\paragraph{{\textbf{Smoothness lemma.}}} The following result controls the smoothness of $\L_\b(\cdot,y)$ for a fixed $y$. The proof is standard but is included in Appendix~\ref{sec:proof of smoothness lemma} for completeness. +\paragraph{{\textbf{Smoothness lemma.}}} The following result controls the smoothness of $\L_\b(\cdot,y)$ for fixed $y$; the proof is standard and is included in Appendix~\ref{sec:proof of smoothness lemma} for completeness. \begin{lemma}[\textbf{Smoothness}]\label{lem:smoothness} For fixed $y\in \RR^m$ and $\rho,\rho'\ge 0$, it holds that \begin{align} \| \nabla_x \mathcal{L}_{\beta}(x, y)- \nabla_x \mathcal{L}_{\beta}(x', y) \| \le \lambda_\b \|x-x'\|, \end{align} for every $x,x' \in \{ x'': \|A(x'') \|\le \rho, \|x''\|\le \rho'\}$, where \begin{align} \lambda_\beta & \le \lambda_f + \sqrt{m}\lambda_A \|y\| + (\sqrt{m}\lambda_A\rho + d \lambda'^2_A )\b \nonumber\\ & =: \lambda_f + \sqrt{m}\lambda_A \|y\| + \lambda''(A,x_1) \b. \label{eq:smoothness of Lagrangian} \end{align} Above, $\lambda_f,\lambda_A$ were defined in \eqref{eq:smoothness basic} and \begin{align} \lambda'_A := \max_{\|x\|\le \rho'}\|DA(x)\|. 
\end{align} \end{lemma} diff --git a/ICML19/Drafts/sections/related_works.tex b/ICML19/Drafts/sections/related_works.tex index f92d80c..1d3ae4d 100644 --- a/ICML19/Drafts/sections/related_works.tex +++ b/ICML19/Drafts/sections/related_works.tex @@ -1,92 +1,92 @@ \begin{figure*}[ht] % \includegraphics[width=0.4\textwidth,natwidth=1300,natheight=1300]{bp_fig1.pdf} \begin{center} {\includegraphics[width=1\columnwidth]{figs/bp_fig3_notconverged.pdf} \includegraphics[width=1\columnwidth]{figs/bp_fig2_small.pdf}} %\centerline{\includegraphics[width=1\columnwidth]{figs/bp_fig2_small.pdf}} \caption{(\textit{Left}) Effect of using different subsolvers for the aforementioned non-convex relaxation. APGM is a first-order method and does not guarantee convergence to a second-order critical point. LBFGS is a second-order method without convergence guarantees in nonconvex programming; hence, it can also get stuck in a local minimum if not initialized properly. (\textit{Right}) Comparison of different algorithms for solving the basis pursuit template. } \label{fig:bp1} %(left:$[n=100$, $d=10^4]$, right:$[n=34$, $d=10^2$])} \end{center} \end{figure*} \section{Related works \label{sec:related work}} ALM has a long history in the optimization literature, dating back to~\cite{hestenes1969multiplier, powell1969method}. In the special case of~\eqref{prob:01} with a convex $f$ and a linear operator $A$, standard, inexact and linearized versions of ALM have been extensively studied~\cite{lan2016iteration,nedelcu2014computational,tran2018adaptive,xu2017inexact}. -Classical works on ALM focused on the general template of~\eqref{prob:01} with nonconvex $f$ and nonlinear $A$, with relatively stronger assumptions and required exact solutions to the subproblem in Step 2 of Algorithm~\ref{Algo:2}, namely,~\eqref{e:exac} \cite{bertsekas2014constrained}. 
+Classical works on ALM focused on the general template of~\eqref{prob:01} with nonconvex $f$ and nonlinear $A$, under arguably stronger assumptions, and required exact solutions to the subproblem in Step 2 of Algorithm~\ref{Algo:2}, namely~\eqref{e:exac} \cite{bertsekas2014constrained}. A similar analysis was conducted in~\cite{fernandez2012local} under the same assumptions on~\eqref{prob:01}, where the authors established convergence rates for inexact ALM under specific assumptions on the initialization of the dual variable. However, unlike our results, they did not provide total complexity results with verifiable conditions. Problem~\eqref{prob:01} with similar assumptions is also studied in~\cite{birgin2016evaluation} and~\cite{cartis2018optimality} for first-order and second-order stationarity, respectively, with explicit iteration complexity analysis. As mentioned in Section~\ref{sec:cvg rate}, our iteration complexity results match those of these theoretical algorithms, with a simpler algorithm and a simpler analysis. In addition, these algorithms require the final accuracies to be set in advance, since this information is used within the algorithm. \cite{cartis2011evaluation} also considers the same template~\eqref{prob:01} for first-order stationarity, with a penalty-type method instead of ALM. Even though the authors show $\mathcal{O}(1/\epsilon^2)$ complexity, this result is obtained by assuming that the penalty parameter remains bounded; we note that such an assumption could also be used to improve our complexities. In contrast, our algorithm does not require setting accuracies a priori. \cite{bolte2018nonconvex} also considers the general template~\eqref{prob:01}, under specific assumptions involving local error bound conditions for~\eqref{prob:01}. 
These conditions are studied in detail in~\cite{bolte2017error}; however, there are no results establishing that these conditions hold for general SDPs~\eqref{eq:sdp}. Even though the authors point out that deriving convergence rates is straightforward, they do not present iteration complexity results. % that we can compare. Another recent line of work~\cite{clason2018acceleration} focused on solving the problem template~\eqref{prob:01} by adapting the primal-dual method of Chambolle and Pock~\cite{chambolle2011first}. The authors proved the convergence of the method, with rate guarantees, by assuming error bound conditions on the objective function that do not hold for standard SDPs~\eqref{eq:sdp}. %Some works also considered the application of ALM/ADMM to nonconvex problems~\cite{wang2015global, liu2017linearized}. %These works assume that the operator in~\eqref{prob:01} is linear, therefore, they do not apply to our setting. %since we have a nonlinear constraint in addition to a nonconvex objective function. \cite{burer2003nonlinear, burer2005local} were the first works to propose the splitting $X=UU^\top$ for solving SDPs of the form~\eqref{eq:sdp}. Following these works, the literature on Burer-Monteiro has largely focused on using ALM to solve the reformulated problem~\eqref{prob:nc}. However, this approach suffers from several drawbacks: first, these works analyze exact solutions to Step 2 of Algorithm~\ref{Algo:2} in theory, whereas inexact solutions are used in practice. In addition, their results establish convergence only, without any rate guarantees. Our work can be viewed as providing a theoretical understanding of this practical approach of using inexact solutions to Step 2 of Algorithm~\ref{Algo:2}, with explicit iteration complexities. %Their practical stopping condition is also not analyzed theoretically. 
%In addition, Burer and Monteiro in~\cite{burer2005local} needed to bound the primal iterates they have to put an artificial bound to the primal domain which will be ineffective in practice; which is impossible to do without knowing the norm of the solution. % and their results do not extend to the case where projection in primal domain are required in each iteration. Among the earliest efforts to show global optimality with the Burer-Monteiro splitting is the series of works~\cite{bhojanapalli2016dropping, park2016provable}. These works focused on the special case of SDPs without linear constraints. For these specific problems, they prove the global convergence of gradient descent with special initialization. Unfortunately, these results do not apply to general SDPs of the form~\eqref{eq:sdp}, where the difficulty arises from the linear constraints. %In their followup work~, the authors considered projected gradient descent, but only for restricted strongly convex functions. %Their results are not able to extend to linear constraints and general convex functions. %, therefore not applicable to general SDPs. Another popular approach for solving SDPs under specific assumptions is due to~\cite{boumal2014manopt, boumal2016global, boumal2016non}. These works focus on the case where the constraints in~\eqref{prob:01} form a smooth Riemannian manifold. In this case, the authors apply Riemannian gradient descent and Riemannian trust-region methods to obtain first- and second-order stationary points, respectively. They obtain $\mathcal{O}(1/\epsilon^2)$ complexity for finding first-order stationary points and $\mathcal{O}(1/\epsilon^3)$ complexity for finding second-order stationary points. Even though these complexities are better than those provided in Section~\ref{sec:cvg rate}, the smooth manifold assumption of these works seems restrictive. 
In particular, the requirement holds for max-cut and generalized eigenvalue problems, but it is not satisfied for other important SDPs such as the quadratic assignment problem (QAP), optimal power flow, and clustering. In addition, as noted in~\cite{boumal2016global}, the per-iteration cost of their method for the max-cut problem is an astronomical $\mathcal{O}({d^6})$ for solving~\eqref{prob:nc}. % which is astronomical. %ly larger than our cost of $\mathcal{O}(n^2r)$ where $r \ll n$. %They do not apply to the general semidefinite programming since $f(U)=\langle C, UU^\ast \rangle$. %, these conditions are not satisfied. \cite{bhojanapalli2018smoothed} focused on the quadratic penalty formulation of~\eqref{prob:01}: \begin{equation}\label{eq:pen_sdp} \min_{X\succeq 0} \langle C, X\rangle + \frac{\mu}{2} \| \mathcal{A}(X)-b\|^2. \end{equation} They study the optimality of the second-order stationary points of the factorized formulation: \begin{equation}\label{eq:pen_nc} \min_{U\in\mathbb{R}^{d\times r}} \langle C, UU^\top \rangle + \frac{\mu}{2} \| \mathcal{A}(UU^\top)-b\|^2. \end{equation} These results establish a connection between the stationary points of~\eqref{eq:pen_nc} and the global optima of~\eqref{eq:pen_sdp}. -In contrast, we focus on the relation to the original constrained problem~\eqref{eq:sdp}. +In contrast, we focus on the relation to the constrained problem~\eqref{eq:sdp}. %\cite{birgin2016evaluation} can handle the same problem but their algorithm is much more complicated than ours. There also exists a line of work on solving SDPs in their original convex formulation in a storage-efficient way~\cite{nesterov2009primal, yurtsever2015universal, yurtsever2018conditional}. These works enjoy global optimality guarantees since they solve convex formulations. Unfortunately, they require eigenvalue routines and exhibit significantly slower convergence compared to non-convex approaches.