Page MenuHomec4science

introduction.tex
No OneTemporary

File Metadata

Created
Mon, Nov 11, 00:19

introduction.tex

\newpage
\section{Introduction}
\label{intro}
We study the nonconvex optimization program
\begin{equation}
\label{prob:01}
\begin{cases}
\underset{x\in \mathbb{R}^d}{\min}\,\, f(x)+g(x)\\
\text{subject to } A(x) = 0,
\end{cases}
\end{equation}
where (possibly nonconvex) $f:\mathbb{R}^d\rightarrow\mathbb{R}$ and (possibly nonlinear) $A:\mathbb{R}^d\rightarrow\mathbb{R}^m$ are both twice-differentiable and moreover
satisfy
\begin{align}
\| \nabla f(x) - \nabla f(x')\| \le \lambda_f \|x-x'\|,
\quad
\| \D A(x) -\D A(x') \| \le \lambda_A \|x-x'\|,
\label{eq:smoothness basic}
\end{align}
for $\lambda_f,\lambda_A\ge 0$ and every $x,x'\in \mathbb{R}^d$. Above, $\nabla f(x)\in \mathbb{R}^d$ is the gradient of $f$ at $x$ and $\D A(x)\in \mathbb{R}^{m\times d}$ is the Jacobian of $A$ at $x$. Lastly, we assume that $g:\mathbb{R}^d\rightarrow\mathbb{R}\cup \{\infty\}$ is a proximal-friendly but possibly non-differentiable (proper closed) convex function.
%For instance, for a convex set $C\subset\RR^d$ and letting $g=1_C$ be the convex indicator function on $C$, Program~\eqref{prob:01} is a noncnovex optimization problem with the convex constraint $x\in C$.
A host of problems in computer science \cite{khot2011grothendieck, lovasz2003semidefinite,zhao1998semidefinite}, machine learning \cite{mossel2015consistency, song2007dependence}, and signal processing \cite{singer2011angular, singer2011three} naturally fall under the template of~\eqref{prob:01}, including max-cut, clustering, generalized eigenvalue, as well as community detection and Quadratic Assignment Problem (QAP).
%
%An example of our template in \eqref{prob:01} is semi-definite programming which provides powerful relaxations to above problems. Denote the space of $d'\times d'$ symmetric matrices by $\mathbb{S}^{d'\times d'}$ and consider
%
%\begin{equation}
% \label{sdp}
% \min_x \{h'(x): A'(x) = b', ~~x \in \mathcal{C'}~~\text{and}~~x\succeq0 \}
%\end{equation}
%
%where $h': \mathbb{S}^{d'\times d'} \to \RR$, $A'\colon\mathbb{S}^{d'\times d'}\to\RR^m$, $b'\in\RR^m$, and $C' \subseteq \mathbb{R}^{d'\times d'}$. This template clearly can be put to the form of \eqref{prob:01} by using \emph{Burer-Monteiro factorization} \cite{burer2003nonlinear, burer2005local}.
To address these applications, this paper builds on the classical ideas from the linearized augmented Lagrangian framework and proposes a simple and intuitive algorithm to solve~(\ref{prob:01}) with provable convergence rate and under an interpretable geometric condition. In this context, we also develop and analyze a variant of the Alternating Direction Method of Multipliers (ADMM). Before we elaborate turn to the details, let us first motivate~\eqref{prob:01} with an important application to Semi-Definite Programming (SDP):
\vspace{-3mm}
\paragraph{\textbf{Vignette: Burer-Monteiro splitting.}}
A powerful convex relaxation for max-cut, clustering, and several other problems mentioned above is provided by the SDP
\begin{equation}
\label{eq:sdp}
\begin{cases}
\underset{X\in\mathbb{S}^{d \times d}}{\min} \langle C, X \rangle \\
\text{subject to } B(X) = b, \,\, X \succeq 0,
\end{cases}
\end{equation}
%
where $C\in \mathbb{S}^{d\times d}$, $X\in \mathbb{S}^{d\times d}$ is positive semidefinite,
%$\mathbb{S}^{d \times d}$ denotes the set of real symmetric matrices,
and ${B}: \mathbb{S}^{d\times d} \to \mathbb{R}^m$ is a linear operator. Here, $\mathbb{S}^{d \times d}$ denotes the set of $d\times d$ real symmetric matrices.
If the unique-games conjecture is true, SDPs achieve the best approximation for the underlying discrete problem~\cite{raghavendra2008optimal}.
Since $d$ is often large, many first- and second-order methods for solving such SDPs are immediately ruled out, not only due to their high computational complexity, but also due to their storage requirements, which are $\mathcal{O}(d^2)$.
A contemporary challenge in optimization therefore is to solve SDPs with little storage and in a scalable fashion. For example, a homotopy conditional gradient method based on Linear Minimization Oracles (LMO) can solve \eqref{eq:sdp} with small storage via sketching \cite{yurtsever2018conditional}; however, such LMO-based methods are extremely slow in obtaining accurate solutions.
%In practice, $d$ is often very large which makes interior point methods, with their poor scaling in $d$, an unattractive option for solving ~\eqref{eq:sdp}. Attempts to resolve this issue prompted extensive research in computationally- and memory-efficient SDP solvers. The first such solvers relied on the so-called Linear Minimization Oracle (LMO), reviewed in Section~\ref{sec:related work}, alongside other scalabe SDP solvers.
A key approach for solving \eqref{prob:01}, dating back to~\cite{burer2003nonlinear, burer2005local}, is the so-called Burer-Monteiro (BR) factorization $X=UU^\top$, where $U\in\mathbb{R}^{d\times r}$ and $r$ is selected according to the guidelines in~\cite{pataki1998rank, barvinok1995problems}.
%\notea{maybe we should call this factorization vs splitting following the standard references like Global Optimality in Tensor Factorization, Deep Learning, and
%Beyond.}
%so as to remove spurious local minima of the nonconvex factorized problem. Evidently, BR splitting has the advantage of lower storage and faster iterations.
This factorization results in the nonconvex problem
\begin{equation}
\label{prob:nc}
\begin{cases}
\underset{U\in\mathbb{R}^{d \times r}}{\min} \langle C, UU^\top \rangle \\
\text{subject to }B(UU^\top) = b,
\end{cases}
\end{equation}
which can be written in the form of~\eqref{prob:01}. When $r$ is sufficiently large and under some additional assumptions, \eqref{prob:nc} provably does not have any spurious local minima~\cite{boumal2016non,waldspurger2018rank}.
The {augmented Lagrangian method} \cite{luenberger1984linear} provides a powerful framework to solve~\eqref{prob:01}.
% reviewed carefully in Section \ref{sec:related work}.
Indeed, for positive $\beta$, it is easy to verify that \eqref{prob:01} is equivalent to
\begin{align}
\min_{x} \max_y \,\,\mathcal{L}_\beta(x,y) + g(x),
\label{eq:minmax}
\end{align}
where
\begin{align}
\label{eq:Lagrangian}
\mathcal{L}_\beta(x,y) := f(x) + \langle A(x), y \rangle + \frac{\beta}{2}\|A(x)\|^2,
\end{align}
is loosely speaking the augmented Lagrangian corresponding to \eqref{prob:01}.\footnote{Strictly speaking, $g$ must be added to the right-hand side of \eqref{eq:Lagrangian} but the current definition of \eqref{eq:Lagrangian} will help keep the notation light and clarify the analysis throughout. } The equivalent formulation in \eqref{eq:minmax} naturally suggests the iterative algorithm
\begin{equation}\label{e:exac}
x_{k+1} \in \underset{x}{\argmin} \, \mathcal{L}_{\beta}(x,y_k)+g(x),
\end{equation}
\begin{equation}
y_{k+1} = y_k+\s_k A(x_{k+1}),
\label{eq:dual-update}
\end{equation}
to solve \eqref{prob:01}, where $\{\sigma_k\}_k$ are the dual step sizes.
Updating $x$ above requires solving the nonconvex problem \eqref{e:exac} to global optimality, which is often intractable. Instead, as in \cite{bertsekas2014constrained,fernandez2012local,birgin2016evaluation,cartis2018optimality,cartis2011evaluation,bolte2018nonconvex,bolte2017error
,clason2018acceleration,wang2015global,wang2018convergence,jiang2019structured,guo2017convergence,yang2017alternating,li2015global}, one may solve \eqref{e:exac} to (approximate) stationarity. For SDPs, similar ideas have also appeared in \cite{burer2003nonlinear, burer2005local,
bhojanapalli2016dropping, park2016provable,boumal2014manopt, boumal2016global, boumal2016non}.
In this paper, we take a different approach to tackle the intractability of (\ref{e:exac},\ref{eq:dual-update}) in the nonconvex setting. The key contribution of this paper is to provably and efficiently address this challenge by proposing and analyzing a linearized augmented Lagrangian algorithm, as well as its ADMM variant.
We note that the convergence (but not its rate) of a linearized and nonconvex variant of ADMM for \eqref{prob:01} has been studied in \cite{liu2019linearized}. Other variants of linearized ADMM in more limited settings have appeared in \cite{bot2018proximal,hong2016convergence}.
%%
%extend our results to solve the problem
%\begin{align}
%\begin{cases}
%\underset{x,z}{\min}\,\, f(x)+g(x)+h(z)+l(z)\\
%A(x)+B(z) = 0,
%\end{cases}
%\label{eq:admm}
%\end{align}
%by developing and analyzing an Alternating Direction Method of Multipliers (ADMM).
\paragraph{\emph{\textbf{Contributions.}} }
In order to solve \eqref{prob:01}, this paper proposes to replace the (intractable) problem \eqref{e:exac} with the simple update
\begin{equation}
x_{k+1} = P_g (x_k - \gamma_k \nabla \mathcal{L}_{\beta_k} (x_k,y_k)),
\label{eq:new update}
\end{equation}
for carefully selected sequences $\{\beta_k,\gamma_k\}_k$. Here, $P_g$ is the proximal operator corresponding to $g$, which is often computationally inexpensive~\cite{parikh2014proximal}.
Put differently, instead of fully solving~\eqref{e:exac}, this paper proposes to apply one iteration of the proximal gradient algorithm for every primal update, which is then followed by a dual update in \eqref{eq:dual-update} and an increase in the penalty weight $\b$ to gradually enforce the (nonlinear) constraints in \eqref{prob:01}.
We prove that this fast and scalable Homotopy Linearized Augmented Lagrangian (HoLAL) achieves first-order stationarity for \eqref{prob:01} at the rate of $1/\sqrt{k}$,
% and discuss briefly the (approximate) local optimality of the results.
%Under standard additional conditions, we also establish local optimality, namely, HoLAL achieves second-order stationarity for \eqref{prob:01}.
and also provide an ADMM variant of HoLAL, with the same convergence rate, which is better suited for a variety of problems that require variable splitting.
In comparison, the only linearized primal-dual nonconvex convergence rate we are aware of is the result in \cite[Theorem 19]{bot2018proximal}, which comes to a standstill after choosing $\theta=1/2$ therein for a fair comparison, even though their problem~(1) therein is less general than our problem \eqref{prob:01}. Moreover, compared with first-order inexact methods that solve \eqref{e:exac} to approximate stationarity (instead of linearizing it),
we improve over the state-of-the-art convergence rate of $1/k^{\frac{1}{3}}$ in~\cite[Corollary 4.2]{sahin2019inexact} and~\cite{cartis2018optimality} to first-order stationarity. \notea{is this claim correct about Cora's work? can someone check? if this paragraph changes, we should also change the corresponding remark after main theorem.}
%, such as a nonconvex solver for basis pursuit \cite{chen2001atomic}, presented in Section \ref{sec:experiments}. \notea{If we say this here, we should add a basis pursuit admm example to the experiments section.}
The success of of HoLAL relies on a geometric regularity condition, detailed in Section~\ref{sec:slater}, which is closely related to Polyak-Lojasiewicz~\cite{karimi2016linear}, Mangasarian-Fromovitz~\cite{bertsekas2014constrained}, and other existing conditions in the literature of nonconvex programming \cite{bolte2018nonconvex,xu2017globally,rockafellar1993lagrange,flores2012complete}.
%(a variant of) the \emph{uniform regularity} \notea{what to cite?}, a geometric condition akin to the well-established Slater's condition in convex optimization. In fact, we establish that uniform regularity, when limited to convex problems, is equivalent to the Slater's condition.
We also verify this regularity condition in several important examples.

Event Timeline