diff --git a/NeurIPS 19/sections/abstract.tex b/NeurIPS 19/sections/abstract.tex index 23d9884..ad6c425 100644 --- a/NeurIPS 19/sections/abstract.tex +++ b/NeurIPS 19/sections/abstract.tex @@ -1,22 +1,21 @@ %!TEX root = ../main.tex \begin{abstract} %We consider a canonical nonlinear-constrained nonconvex problem template with broad applications in machine learning, theoretical computer science, and signal processing. %This template involves the Burer-Monteiro splitting for semidefinite programming as a special case. We propose a practical inexact augmented Lagrangian method (iALM) for nonconvex problems with nonlinear constraints. %involving nonlinear operators. We characterize the total computational complexity of our method subject to a verifiable geometric condition, which is closely related to the Polyak-{\L}ojasiewicz and Mangasarian-Fromovitz conditions. \vspace{1mm} -In particular, when a first-order solver is used for the inner iterates, we prove that iALM finds a first-order stationary point with {\color{blue} $\tilde{\mathcal{O}}(1/\epsilon^4)$} calls to the first-order oracle. {If, in addition, the problem is smooth and} a second-order solver is used for the inner iterates, iALM finds a second-order stationary point with $\tilde{\mathcal{O}}(1/\epsilon^5)$ calls to the second-order oracle. -These complexity results match the known theoretical results in the literature. % with a simple, implementable and versatile algorithm. +In particular, when a first-order solver is used for the inner iterates, we prove that iALM finds a first-order stationary point with {\color{blue} $\tilde{\mathcal{O}}(1/\epsilon^4)$} calls to the first-order oracle. {If, in addition, the problem is smooth and} a second-order solver is used for the inner iterates, iALM finds a second-order stationary point with $\tilde{\mathcal{O}}(1/\epsilon^5)$ calls to the second-order oracle, {\color{blue}which matches the known theoretical complexity result in the literature.} % with a simple, implementable and versatile algorithm. \vspace{1mm} We also provide strong numerical evidence on large-scale machine learning problems, including the Burer-Monteiro factorization of semidefinite programs, and a novel nonconvex relaxation of the standard basis pursuit template. For these examples, we also show how to verify our geometric condition. %For these problems and under suitable assumptions, our algorithm in fact achieves global optimality for the underlying convex SDP. {\color{blue} This paper was updated on 22 December 2020 due to an error in the proof of Corollary~\ref{cor:first}. The gradient complexity for obtaining a first-order stationary point is now $\tilde{\mathcal{O}}(1/\epsilon^4)$}. %\textbf{AE: we should add something about gans if we do that in time.} \end{abstract} diff --git a/NeurIPS 19/sections/related_works.tex b/NeurIPS 19/sections/related_works.tex index ee3096e..6aeb201 100644 --- a/NeurIPS 19/sections/related_works.tex +++ b/NeurIPS 19/sections/related_works.tex @@ -1,89 +1,89 @@ %!TEX root = ../main.tex \section{Related Work \label{sec:related work}} - +{\color{blue}{ ALM has a long history in the optimization literature, dating back to~\cite{hestenes1969multiplier, powell1969method}. In the special case of~\eqref{prob:01} with a convex function $f$ and a linear operator $A$, standard, inexact, and linearized versions of ALM have been extensively studied~\cite{lan2016iteration,nedelcu2014computational,tran2018adaptive,xu2017inexact}.
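For context, a minimal sketch of the augmented Lagrangian that underlies these methods; we assume here the equality-constrained form $A(x)=b$ of \eqref{prob:01}, a dual variable $y$, and a penalty weight $\beta$, so the notation is illustrative and may differ from the precise subproblem \eqref{e:exac} used in Step 2 of Algorithm~\ref{Algo:2}:
% Sketch only: the constraint $A(x)=b$, multiplier $y$, and penalty $\beta$ are assumed notation.
\begin{equation*}
	\mathcal{L}_{\beta}(x,y) := f(x) + \langle y, A(x) - b \rangle + \frac{\beta}{2} \left\| A(x) - b \right\|^2,
\end{equation*}
which each outer ALM iteration minimizes over $x$ (exactly or inexactly) before a dual update of the form $y^{+} = y + \beta \left( A(x) - b \right)$.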
Classical works on ALM considered the general template of~\eqref{prob:01} with nonconvex $f$ and nonlinear $A$, but under arguably stronger assumptions, and they required exact solutions to the subproblems of the form \eqref{e:exac} that appear in Step 2 of Algorithm~\ref{Algo:2}; see, for instance, \cite{bertsekas2014constrained}. A similar analysis was conducted in~\cite{fernandez2012local} for the general template of~\eqref{prob:01}. The authors considered inexact ALM and proved convergence rates for the outer iterates, under specific assumptions on the initialization of the dual variable. In contrast to our work, however, they did not analyze how to solve the subproblems inexactly and did not provide total complexity results with verifiable conditions. %\textbf{AE: Mention if this paper was convex or nonconvex?} Problem~\eqref{prob:01} is also studied, under assumptions similar to ours, in~\cite{birgin2016evaluation} and~\cite{cartis2018optimality} for first-order and second-order stationarity, respectively, with explicit iteration complexity analyses. As mentioned in Section~\ref{sec:cvg rate}, our iteration complexity results match those of these theoretical algorithms, with a simpler algorithm and a simpler analysis. In addition, these algorithms require the final accuracies to be set in advance, since they use this information internally, whereas our Algorithm~\ref{Algo:2} does not fix the accuracies a priori. \cite{cartis2011evaluation} also considers the template~\eqref{prob:01} for first-order stationarity, with a penalty-type method instead of ALM. Even though the authors show $\mathcal{O}(1/\epsilon^2)$ complexity, this result is obtained by assuming that the penalty parameter remains bounded. We note that such an assumption could likewise be used to improve our complexity results to match theirs. \cite{bolte2018nonconvex} studies the general template~\eqref{prob:01} under specific assumptions involving local error bound conditions for~\eqref{prob:01}. These conditions are studied in detail in~\cite{bolte2017error}, but their validity for general SDPs~\eqref{eq:sdp} has never been established. This work also lacks the total iteration complexity analysis presented here. % that we can compare. Another work~\cite{clason2018acceleration} focused on solving~\eqref{prob:01} by adapting the primal-dual method of Chambolle and Pock~\cite{chambolle2011first}. The authors proved the convergence of the method and provided a convergence rate by imposing error bound conditions on the objective function that do not hold for standard SDPs.%~\eqref{eq:sdp}. %Some works also considered the application of ALM/ADMM to nonconvex problems~\cite{wang2015global, liu2017linearized}. %These works assume that the operator in~\eqref{prob:01} is linear, therefore, they do not apply to our setting. %since we have a nonlinear constraint in addition to a nonconvex objective function. \cite{burer2003nonlinear, burer2005local} are the first works that propose the splitting $X=UU^\top$ for solving SDPs of the form~\eqref{eq:sdp}; this reformulation is sketched below. Following these works, the literature on Burer-Monteiro (BM) splitting has for the most part focused on using ALM to solve the reformulated problem~\eqref{prob:nc}. This proposal, however, has a few drawbacks: First, in theory it requires exact solutions in Step 2 of Algorithm~\ref{Algo:2}, which in practice are replaced with inexact solutions. Second, the corresponding results only establish convergence without providing rates.
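For concreteness, here is a minimal sketch of the BM reformulation mentioned above, assuming \eqref{eq:sdp} takes the standard linearly constrained form $\min_{X \succeq 0} \{ \langle C, X \rangle : B(X) = b \}$ (this notation is illustrative and may differ from the paper's statement of \eqref{eq:sdp} and \eqref{prob:nc}):
% Sketch of the splitting $X = UU^\top$ with a low-rank factor $U$.
\begin{equation*}
	\min_{X \succeq 0} \big\{ \langle C, X \rangle : B(X) = b \big\}
	\quad \longrightarrow \quad
	\min_{U \in \mathbb{R}^{d \times r}} \big\{ \langle C, UU^\top \rangle : B(UU^\top) = b \big\},
\end{equation*}
which removes the conic constraint $X \succeq 0$ at the price of a nonconvex objective and nonlinear constraints, i.e., an instance of the template \eqref{prob:01}.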
Our work addresses these drawbacks: it provides a theoretical understanding of the BM splitting with inexact solutions to Step 2 of Algorithm~\ref{Algo:2}, together with a complete iteration complexity analysis. %\textbf{AE: Ahmet you discuss global optimality below but are you sure that's relevant to our work? in the sense that we just give an algorithm to solve to local optimality.} %Their practical stopping condition is also not analyzed theoretically. %In addition, Burer and Monteiro in~\cite{burer2005local} needed to bound the primal iterates they have to put an artificial bound to the primal domain which will be ineffective in practice; which is impossible to do without knowing the norm of the solution. % and their results do not extend to the case where projection in primal domain are required in each iteration. \cite{bhojanapalli2016dropping, park2016provable} are among the earliest efforts to show convergence rates for BM splitting, focusing on the special case of SDPs without any linear constraints. For these specific problems, they prove convergence of gradient descent to global optima, with rates, assuming favorable initialization. These results, however, do not apply to general SDPs of the form~\eqref{eq:sdp}, where the difficulty arises from the linear constraints. %{\color{red} Ahmet: I have to cite this result bc it is very relevant.} %\textbf{AE: also if you do want to talk about global optimality you could also talk about the line of works that show no spurious local optima.} %In their followup work~, the authors considered projected gradient descent, but only for restricted strongly convex functions. %Their results are not able to extend to linear constraints and general convex functions. %, therefore not applicable to general SDPs. %They do not apply to the general semidefinite programming since $f(U)=\langle C, UU^\ast \rangle$. %, these conditions are not satisfied. %{\color{red} Ahmet: I have to cite this result bc it is relevant and the reviewers will complain if I don't} %\cite{bhojanapalli2018smoothed} focuses on the quadratic penalty formulation of~\eqref{prob:01}, namely, %\begin{equation}\label{eq:pen_sdp} %\min_{X\succeq 0} \langle C, X\rangle + \frac{\mu}{2} \| B(x)-b\|^2, %\end{equation} %which after BM splitting becomes %\begin{equation}\label{eq:pen_nc} %\min_{U\in\mathbb{R}^{d\times r}} \langle C, UU^\top \rangle + \frac{\mu}{2} \|B(UU^\top)-b\|^2, %\end{equation} %for which they study the optimality of the second-order stationary points. %These results are for establishing a connection between the stationary points of~\eqref{eq:pen_nc} and global optima of~\eqref{eq:pen_sdp}. %\edita{I think we should now remove this next sentence.} %In contrast, we focus on the relation of the stationary points of~\eqref{eq:minmax} to the constrained problem~\eqref{prob:01}. %\textbf{AE: so i'm not sure why we are comparing with them. their work establish global optimality of local optima but we just find local optima. it seems to add to our work rather compare against it. you also had a paragraph for global optimality with close initialization earlier. could you put the two next to each other?} Another popular line of methods for solving SDPs is due to~\cite{boumal2014manopt, boumal2016global, boumal2016non}, focusing on the case where the constraints in~\eqref{prob:01} can be written as a Riemannian manifold after BM splitting.
In this case, the authors apply Riemannian gradient descent and Riemannian trust-region methods to obtain first- and second-order stationary points, respectively. They obtain~$\mathcal{O}(1/\epsilon^2)$ complexity for finding first-order stationary points and~$\mathcal{O}(1/\epsilon^3)$ complexity for finding second-order stationary points. While these complexities appear better than ours, the smooth-manifold requirement in these works is restrictive. In particular, this requirement holds for max-cut and generalized eigenvalue problems, but it is not satisfied for other important SDPs such as the quadratic assignment problem (QAP), optimal power flow, and clustering with general affine constraints. In addition, as noted in~\cite{boumal2016global}, the per-iteration cost of their method for the max-cut problem is an astronomical~$\mathcal{O}(d^6)$. % which is astronomical. %ly larger than our cost of $\mathcal{O}(n^2r)$ where $r \ll n$. %\cite{birgin2016evaluation} can handle the same problem but their algorithm is much more complicated than ours. Lastly, there also exists a line of work for solving SDPs in their original convex formulation in a storage-efficient way~\cite{nesterov2009primal, yurtsever2015universal, yurtsever2018conditional}. These works have global optimality guarantees by virtue of directly solving the convex formulation. On the downside, these works require the use of eigenvalue routines and exhibit significantly slower convergence compared to nonconvex approaches~\cite{jaggi2013revisiting}. - - +} +}