Let $K$ denote the number of (outer) iterations of Algorithm~\ref{Algo:2} and let $\epsilon_{f}$ denote the desired accuracy of Algorithm~\ref{Algo:2}; see \eqref{eq:inclu3}. Recalling Theorem~\ref{thm:main}, we can then write that
\begin{equation}
\epsilon_{f} = \frac{Q}{\b_{K}},
\label{eq:acc_to_b}
\end{equation}
or, equivalently, $\b_{K} = Q/\epsilon_{f}$.
We now count the total number of (inner) iterations $T$ of Algorithm~\ref{Algo:2} needed to reach the accuracy $\epsilon_{f}$. From \eqref{eq:smoothness of Lagrangian}, recall that, for sufficiently large $k$, $\lambda_{\b_k}\le \lambda'' \b_k$ is the smoothness parameter of the augmented Lagrangian. Then, from \eqref{eq:iter_1storder} and by summing over the outer iterations, we bound the total number of (inner) iterations of Algorithm~\ref{Algo:2} as
\begin{align}\label{eq: tk_bound}
T = \sum_{k=1}^K\mathcal{O}\left( \frac{\lambda_{\beta_{k-1}}^2 \rho'^2 }{\epsilon_k} \right).
\end{align}
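As a rough numerical illustration of the bound \eqref{eq: tk_bound}, the sketch below tallies the inner-iteration counts under an assumed geometric penalty schedule $\b_k = 2^k \b_1$, with $\epsilon_k = Q/\b_k$ as in \eqref{eq:acc_to_b} and $\lambda_{\b_k} = \lambda''\b_k$. The constants \texttt{Q}, \texttt{lam2}, \texttt{rho\_p} are illustrative placeholders, not values from the paper.

```python
# Illustrative tally of the total inner-iteration bound in eq. (tk_bound).
# All constants (Q, lam2, rho_p) are placeholder values, not from the paper.
Q, lam2, rho_p = 10.0, 1.0, 1.0

def total_inner_iters(eps_f, beta1=1.0):
    """Sum lam2^2 * beta_{k-1}^2 * rho_p^2 / eps_k over the outer iterations,
    with beta_k = beta1 * 2^k and eps_k = Q / beta_k, until eps_K <= eps_f."""
    T, K, beta_prev = 0.0, 0, beta1
    while Q / beta_prev > eps_f:
        beta_k = 2.0 * beta_prev
        eps_k = Q / beta_k
        T += (lam2 * beta_prev) ** 2 * rho_p ** 2 / eps_k
        beta_prev = beta_k
        K += 1
    return T, K  # total inner iterations, number of outer iterations

T, K = total_inner_iters(1e-3)
# Each term grows like 8^k, so halving eps_f (one more doubling of beta)
# multiplies T by roughly 8, consistent with T = O(Q^3 / eps_f^3).
```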
\section{Proof of Lemma \ref{lem:smoothness}\label{sec:proof of smoothness lemma}}
which completes the proof of Lemma \ref{lem:smoothness}.
\section{Clustering \label{sec:verification2}}
We only verify the condition in~\eqref{eq:regularity}.
Note that
\begin{align}
A(x) = VV^\top \mathbf{1}- \mathbf{1},
\end{align}
and
\begin{align}
DA(x) & =
\left[
\begin{array}{ccc}
w_{1,1} x_1^\top & \cdots & w_{1,n} x_{1}^\top\\
\vdots & & \vdots\\
w_{n,1}x_{n}^\top & \cdots & w_{n,n} x_{n}^\top
\end{array}
\right] \nonumber\\
& = \left[
\begin{array}{ccc}
V & \cdots & V
\end{array}
\right]
+
\left[
\begin{array}{ccc}
x_1^\top & \\
& \ddots & \\
& & x_n^\top
\end{array}
\right],
\label{eq:Jacobian clustering}
\end{align}
where $w_{i,i}=2$ and $w_{i,j}=1$ for $i\ne j$. In the last line above, $n$ copies of $V$ appear and the last matrix is block-diagonal. For $x_k$, define $V_k$ as in the example and let $x_{k,i}$ be the $i$th row of $V_k$.
Let us make a number of simplifying assumptions. First, we assume that $\|x_k\|< \sqrt{s}$, which can be easily enforced in the iterates. Under this assumption, it follows that
\begin{align}
(\partial g(x_k))_i =
\begin{cases}
0 & (x_k)_i > 0\\
\{a: a\le 0\} & (x_k)_i = 0,
\end{cases}
\qquad i\le d.
\label{eq:exp-subgrad-cluster}
\end{align}
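For concreteness, the distance to the subdifferential in \eqref{eq:exp-subgrad-cluster} can be evaluated coordinate-wise; the following is a minimal sketch (the function name and test vectors are ours, for illustration only).

```python
import math

def dist_to_subdiff(v, x):
    """Distance from v to the subdifferential in eq. (exp-subgrad-cluster):
    coordinate i contributes v_i when x_i > 0 (the only subgradient is 0),
    and max(v_i, 0) when x_i = 0 (any a <= 0 is a subgradient)."""
    total = 0.0
    for vi, xi in zip(v, x):
        if xi > 0:
            total += vi ** 2
        else:  # xi == 0
            total += max(vi, 0.0) ** 2
    return math.sqrt(total)

# Negative components are "free" on the active coordinates (x_i = 0):
print(dist_to_subdiff([0.3, -2.0], [1.0, 0.0]))  # 0.3
```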
Second, we assume that $V_k$ has nearly orthonormal columns, namely, $V_k^\top V_k \approx I_n$. This can also be easily enforced in each iterate of Algorithm~\ref{Algo:2} and naturally corresponds to well-separated clusters. While a more fine-tuned argument can remove these assumptions, they will help us simplify the presentation here. Under these assumptions, the (squared) right-hand side of \eqref{eq:regularity} becomes
Given a prescribed $\nu$, ensuring $\|x_{k,i}\| \ge \nu$ guarantees \eqref{eq:regularity}. This requirement corresponds again to well-separated clusters. When the clusters are sufficiently separated and the algorithm is initialized close enough to the constraint set, there is indeed no need to separately enforce this condition. In practice, often $n$ exceeds the number of true clusters and a more fine-tuned analysis is required to establish \eqref{eq:regularity} by restricting the argument to a particular subspace of $\RR^n$.
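In practice, the condition $\|x_{k,i}\| \ge \nu$ above can be monitored cheaply at every iterate; a small sketch (the function name and the example matrix are ours):

```python
import numpy as np

def rows_bounded_below(V_k, nu):
    """Check min_i ||x_{k,i}|| >= nu, where x_{k,i} are the rows of V_k."""
    return np.linalg.norm(V_k, axis=1).min() >= nu

V = np.array([[0.9, 0.1],
              [0.0, 1.0],
              [0.7, 0.6]])
print(rows_bounded_below(V, 0.5))  # True: the smallest row norm is ~0.906
```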
\section{Basis Pursuit \label{sec:verification1}}
We only verify the regularity condition in \eqref{eq:regularity} for \eqref{prob:01} with $f,A,g$ specified in \eqref{eq:bp-equiv}. Note that
\begin{align}
DA(x) = 2 \overline{B} \text{diag}(x),
\label{eq:jacob-bp}
\end{align}
where $\text{diag}(x)\in\RR^{2d\times 2d}$ is the diagonal matrix formed by $x$. The left-hand side of \eqref{eq:regularity} then reads as
To bound the last line above, let $x_*$ be a solution of~\eqref{prob:01} and note that $\overline{B} x_*^{\circ 2} = b$ by definition. Let also $z_k,z_*\in \RR^d$ denote the vectors corresponding to $x_k,x_*$. Corresponding to $x_k$, also define $u_{k,1},u_{k,2}$ naturally and let $|z_k| = u_{k,1}^{\circ 2} + u_{k,2}^{\circ 2} \in \RR^d$ be the vector of amplitudes of $z_k$. To simplify matters, let us also assume that $B$ is full-rank.
We then rewrite the last line of \eqref{eq:cnd-bp-pre} as
where $\eta_n(\cdot)$ returns the $n$th largest singular value of its argument. In the last line above, $B_T$ is the restriction of $B$ to the columns indexed by the set $T$ of size $n$. Moreover, $z_{k,(n)}$ is the $n$th largest entry of $z_k$ in magnitude.
Given a prescribed $\nu$, \eqref{eq:regularity} therefore holds if
for every iteration $k$. If Algorithm \ref{Algo:2} is initialized close enough to the solution $z_*$, there will be no need to directly enforce this condition.
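As a numerical sanity check of the Jacobian in \eqref{eq:jacob-bp}, one can compare $2\overline{B}\,\text{diag}(x)$ against centered finite differences of $x \mapsto \overline{B} x^{\circ 2} - b$; the dimensions below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 3, 4                              # arbitrary small dimensions
Bbar = rng.standard_normal((m, 2 * d))
b = rng.standard_normal(m)
x = rng.standard_normal(2 * d)

A = lambda x: Bbar @ (x ** 2) - b        # A(x) = Bbar x^{o2} - b
J = 2 * Bbar @ np.diag(x)                # Jacobian from eq. (jacob-bp)

# Centered finite differences, column by column (exact here up to rounding,
# since A is quadratic in x)
eps = 1e-6
J_fd = np.empty_like(J)
for i in range(2 * d):
    e = np.zeros(2 * d)
    e[i] = eps
    J_fd[:, i] = (A(x + e) - A(x - e)) / (2 * eps)

assert np.allclose(J, J_fd, atol=1e-6)
```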
\subsection{$\ell_\infty$ Denoising with a Generative Prior}
The authors of \citep{Ilyas2017} proposed projecting onto the range of a generative adversarial network (GAN) \citep{Goodfellow2014} as a way to defend against adversarial examples. For a given noisy observation $x^* + \eta$, they consider a projection in the $\ell_2$ norm. We instead propose to use our augmented Lagrangian method to denoise in the $\ell_\infty$ norm, a much harder task:
\caption{{\it{(Top)}} Objective convergence for computing the top generalized eigenvalue and eigenvector of $B$ and $C$. {\it{(Bottom)}} Eigenvalue structure of the matrices. For (i), (ii), and (iii), $C$ is positive semidefinite; for (iv), (v), and (vi), $C$ contains negative eigenvalues. {[(i): Generated by taking the symmetric part of an iid Gaussian matrix. (ii): Generated by randomly rotating $\mathrm{diag}(1^{-p}, 2^{-p}, \cdots, 1000^{-p})$ ($p=1$). (iii): Generated by randomly rotating $\mathrm{diag}(10^{-p}, 10^{-2p}, \cdots, 10^{-1000p})$ ($p=0.0025$).]}
}
\label{fig:geig1}
\end{figure}
The generalized eigenvalue problem has extensive applications in machine learning, statistics, and data analysis~\cite{ge2016efficient}.
The well-known nonconvex formulation of the problem is~\cite{boumal2016non}
\begin{align}
\begin{cases}
\underset{x\in\mathbb{R}^n}{\min} x^\top C x \\
x^\top B x = 1,
\end{cases}
\label{eq:eig}
\end{align}
where $B, C \in \RR^{n \times n}$ are symmetric matrices and $B$ is positive definite, i.e., $B \succ 0$.
The generalized eigenvector computation is equivalent to performing principal component analysis (PCA) of $C$ in the norm induced by $B$. Moreover, it is also equivalent to computing the top eigenvector of the symmetric matrix $S = B^{-1/2}CB^{-1/2}$ and multiplying the resulting eigenvector by $B^{-1/2}$. However, for sufficiently large $n$, computing $B^{-1/2}$ is extremely expensive.
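This equivalence is easy to check numerically: if $Su = \lambda u$ with $S = B^{-1/2}CB^{-1/2}$, then $x = B^{-1/2}u$ satisfies $Cx = \lambda Bx$ and $x^\top B x = 1$. A small sketch with synthetic $B$ and $C$ (sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
M = rng.standard_normal((n, n))
B = M @ M.T + n * np.eye(n)              # symmetric positive definite
C = rng.standard_normal((n, n))
C = (C + C.T) / 2                        # symmetric

# B^{-1/2} via the eigendecomposition of B
w, U = np.linalg.eigh(B)
B_inv_half = U @ np.diag(w ** -0.5) @ U.T

# Top eigenvector of S = B^{-1/2} C B^{-1/2}, mapped back by B^{-1/2}
S = B_inv_half @ C @ B_inv_half
ws, Us = np.linalg.eigh(S)
lam, u = ws[-1], Us[:, -1]
x = B_inv_half @ u

# x is a generalized eigenvector: C x = lam * B x, with x^T B x = 1
assert np.allclose(C @ x, lam * (B @ x), atol=1e-8)
assert abs(x @ (B @ x) - 1.0) < 1e-8
```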
The natural convex SDP relaxation of~\eqref{eq:eig} lifts $Y = xx^\top$ and removes the non-convex constraint $\mathrm{rank}(Y) = 1$.
Here, we solve~\eqref{eq:eig} directly, as it fits into our template with
\begin{align}
f(x) =& x^\top C x, \quad g(x) = 0\nonumber\\
A(x) =& x^\top B x - 1.
\label{eq:eig-equiv}
\end{align}
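To illustrate how \eqref{eq:eig-equiv} plugs into an augmented Lagrangian scheme, the following is a minimal sketch, not the implementation used in our experiments: backtracking gradient descent on the augmented Lagrangian as the inner solver, dual updates $y \leftarrow y + \b\,(x^\top B x - 1)$, and a geometrically increasing penalty. All problem sizes and schedules are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
M = rng.standard_normal((n, n))
B = M @ M.T + n * np.eye(n)          # B > 0
C = rng.standard_normal((n, n))
C = (C + C.T) / 2                    # symmetric

def aug_lag(x, y, beta):
    r = x @ (B @ x) - 1.0            # residual A(x) = x^T B x - 1
    return x @ (C @ x) + y * r + 0.5 * beta * r * r

def grad(x, y, beta):
    r = x @ (B @ x) - 1.0
    return 2 * C @ x + 2 * (y + beta * r) * (B @ x)

x, y, beta = rng.standard_normal(n), 0.0, 1.0
for _ in range(25):                  # outer loop
    for _ in range(500):             # inner loop: backtracking gradient descent
        g = grad(x, y, beta)
        if np.linalg.norm(g) < 1e-9:
            break
        t, fx = 1.0, aug_lag(x, y, beta)
        while aug_lag(x - t * g, y, beta) > fx - 0.5 * t * (g @ g) and t > 1e-14:
            t /= 2
        x = x - t * g
    r = x @ (B @ x) - 1.0
    y += beta * r                    # dual ascent step
    beta = min(2 * beta, 1e6)        # increase penalty

print(abs(x @ (B @ x) - 1.0))        # feasibility residual |x^T B x - 1|
```

At any stationary point of the inner problem, $2Cx + 2(y + \b r)Bx = 0$, so $x$ is a generalized eigenvector of $(C, B)$ with eigenvalue $-(y + \b r)$; the outer loop then drives $r \to 0$.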
We compare our approach against three other methods: the Riemannian gradient descent and Riemannian trust-region methods of~\cite{boumal2016global}, and the generalized eigenvector via linear system solver (abbreviated as GenELin) of~\cite{ge2016efficient}. For the manifold-based methods, we used the Manopt software package~\cite{manopt}; for GenELin, we used Matlab's backslash operator as the linear solver. The results are compiled in Figure~\ref{fig:geig1}.
Here, we verify the regularity condition in \eqref{eq:regularity} for problem \eqref{eq:eig}. Note that