\section{Proof of Lemma \ref{lem:smoothness}\label{sec:proof of smoothness lemma}}
which completes the proof of Lemma \ref{lem:smoothness}.
\end{proof}
\section{Clustering \label{sec:verification2}}
We only verify the condition in~\eqref{eq:regularity} here.
Note that
\begin{align}
A(x) = VV^\top \mathbf{1}- \mathbf{1},
\end{align}
\begin{align}
DA(x) & =
\left[
\begin{array}{ccc}
w_{1,1} x_1^\top & \cdots & w_{1,n} x_{1}^\top\\
\vdots & \ddots & \vdots\\
w_{n,1}x_{n}^\top & \cdots & w_{n,n} x_{n}^\top
\end{array}
\right] \nonumber\\
& = \left[
\begin{array}{ccc}
V & \cdots & V
\end{array}
\right]
+
\left[
\begin{array}{ccc}
x_1^\top & \\
& \ddots & \\
& & x_n^\top
\end{array}
\right],
\label{eq:Jacobian clustering}
\end{align}
where $w_{i,i}=2$ and $w_{i,j}=1$ for $i\ne j$. In the last line above, $n$ copies of $V$ appear and the last matrix above is block-diagonal. For $x_k$, define $V_k$ accordingly and let $x_{k,i}$ be the $i$th row of $V_k$.
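As a quick sanity check of the expressions above, the following minimal Python sketch (our own illustration; the dimensions, seed, and helper names are arbitrary) evaluates $A(x) = VV^\top\mathbf{1} - \mathbf{1}$ for a small nonnegative $V$ and approximates its Jacobian by central finite differences, which is a convenient way to validate whatever closed-form Jacobian an implementation uses.
\begin{verbatim}
import numpy as np

def A(V):
    # clustering constraint map: row sums of V V^T minus one
    n = V.shape[0]
    return V @ V.T @ np.ones(n) - np.ones(n)

def DA_fd(V, h=1e-6):
    # central finite-difference Jacobian of A w.r.t. x (rows of V stacked)
    n, r = V.shape
    J = np.zeros((n, n * r))
    for j in range(n * r):
        e = np.zeros(n * r); e[j] = h
        Vp = (V.reshape(-1) + e).reshape(n, r)
        Vm = (V.reshape(-1) - e).reshape(n, r)
        J[:, j] = (A(Vp) - A(Vm)) / (2 * h)
    return J

rng = np.random.default_rng(0)
V = np.abs(rng.standard_normal((5, 2)))        # small nonnegative V (n=5, r=2)
print(np.linalg.norm(A(V)), DA_fd(V).shape)    # residual norm and shape (n, n*r)
\end{verbatim}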
Let us make a number of simplifying assumptions. First, we assume that $\|x_k\|< \sqrt{s}$ (which can be enforced in the iterates by replacing $C$ with $(1-\epsilon)C$ for a small positive $\epsilon$ in the subproblems). Under this assumption, it follows that
\begin{align}
(\partial g(x_k))_i =
\begin{cases}
0 & (x_k)_i > 0\\
\{a: a\le 0\} & (x_k)_i = 0,
\end{cases}
\qquad i\le d.
\label{eq:exp-subgrad-cluster}
\end{align}
Second, we assume that $V_k$ has nearly orthonormal columns, namely, $V_k^\top V_k \approx I_n$. This can also be enforced in each iterate of Algorithm~\ref{Algo:2} and naturally corresponds to well-separated clusters. While a more fine-tuned argument can remove these assumptions, they will help us simplify the presentation here. Under these assumptions, the (squared) right-hand side of \eqref{eq:regularity} becomes
Therefore, given a prescribed $\nu$, ensuring $\min_i \|x_{k,i}\| \ge \nu$ guarantees \eqref{eq:regularity}. When the algorithm is initialized close enough to the constraint set, there is indeed no need to separately enforce \eqref{eq:final-cnd-cluster}. In practice, often $n$ exceeds the number of true clusters and a more intricate analysis is required to establish \eqref{eq:regularity} by restricting the argument to a particular subspace of $\RR^n$.
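As an aside, the norm bound and nonnegativity used above are straightforward to maintain inside the subproblems. Below is a minimal Python sketch of the corresponding projection, under our reading that the feasible set of $g$ is the intersection of the nonnegative orthant with a Euclidean ball of radius $\sqrt{s}$, shrunk by $1-\epsilon$ as mentioned earlier; the helper name is ours.
\begin{verbatim}
import numpy as np

def proj_C(x, s, eps=0.05):
    # projection onto {x >= 0, ||x|| <= (1 - eps) * sqrt(s)}:
    # clip the negative entries, then rescale into the shrunken ball if needed
    y = np.maximum(x, 0.0)
    radius = (1.0 - eps) * np.sqrt(s)
    norm_y = np.linalg.norm(y)
    return y if norm_y <= radius else y * (radius / norm_y)

x = np.array([1.5, -0.3, 2.0, 0.0])
print(proj_C(x, s=4.0))   # nonnegative, with norm at most (1 - eps) * 2
\end{verbatim}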
\section{Basis Pursuit}
Basis Pursuit (BP) finds the sparsest solutions of an under-determined system of linear equations by solving
\begin{align}
\min_{z} \|z\|_1 \subjto
Bz = b,
\label{eq:bp_main}
\end{align}
where $B \in \RR^{n \times d}$ and $b \in \RR^{n}$.
Various primal-dual convex optimization algorithms are available in the literature to solve BP, including~\cite{tran2018adaptive,chambolle2011first}.
We compare our algorithm against state-of-the-art primal-dual convex methods for solving \eqref{eq:bp_main}, namely, Chambolle-Pock~\cite{chambolle2011first}, ASGARD~\cite{tran2018smooth} and ASGARD-DL~\cite{tran2018adaptive}.
Here, we take a different approach and cast~(\ref{eq:bp_main}) as an instance of~\eqref{prob:01}. Note that any $z \in \RR^d$ can be decomposed as $z = z^+ - z^-$, where $z^+,z^-\in \RR^d$ are the positive and negative parts of $z$, respectively. Then consider the change of variables $z^+ = u_1^{\circ 2}$ and $z^-= u_2^{\circ 2} \in \RR^d$, where $\circ$ denotes element-wise power. Next, we concatenate $u_1$ and $u_2$ as $x := [ u_1^{\top}, u_2^{\top} ]^{\top} \in \RR^{2d}$ and define $\overline{B} := [B, -B] \in \RR^{n \times 2d}$. Then, \eqref{eq:bp_main} is equivalent to \eqref{prob:01} with
\begin{align}
f(x) =& \|x\|^2, \quad g(x) = 0,\nonumber\\
A(x) =& \overline{B}x^{\circ 2}- b.
\label{eq:bp-equiv}
\end{align}
We draw the entries of $B$ independently from a zero-mean, unit-variance Gaussian distribution. For a fixed sparsity level $k$, the support of $z_*\in \RR^d$ is chosen at random and its nonzero amplitudes are drawn from the standard Gaussian distribution.
The measurement vector is then created as $b = Bz_* + \epsilon$, where $\epsilon$ is a noise vector with entries drawn independently from a zero-mean Gaussian distribution with variance $\sigma^2=10^{-6}$.
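The following Python sketch (our own illustration; the dimensions and seed are arbitrary) generates data as described and confirms the two identities behind \eqref{eq:bp-equiv} at the planted signal: $f(x) = \|z_*\|_1$ and $A(x) = \overline{B}x^{\circ 2} - b = -\epsilon$.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 10, 30, 3
B = rng.standard_normal((n, d))
z_star = np.zeros(d)
support = rng.choice(d, size=k, replace=False)
z_star[support] = rng.standard_normal(k)
b = B @ z_star + np.sqrt(1e-6) * rng.standard_normal(n)   # noisy measurements

# splitting z = u1^2 - u2^2 (elementwise), x = [u1; u2], Bbar = [B, -B]
u1 = np.sqrt(np.maximum(z_star, 0.0))
u2 = np.sqrt(np.maximum(-z_star, 0.0))
x = np.concatenate([u1, u2])
B_bar = np.hstack([B, -B])

f = lambda v: np.dot(v, v)                  # f(x) = ||x||^2
A = lambda v: B_bar @ (v ** 2) - b          # constraint map of (eq:bp-equiv)
print(abs(f(x) - np.abs(z_star).sum()))     # ~ 0: f(x) equals ||z_*||_1
print(np.linalg.norm(A(x)))                 # ~ 1e-3: exactly the noise norm
\end{verbatim}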
The results are compiled in Figure~\ref{fig:bp1}. Clearly, the performance of Algorithm~\ref{Algo:2} with a second-order solver for BP is comparable to the rest.
It is indeed interesting that this type of nonconvex relaxation yields the solution of the convex problem and that first-order methods succeed.
The true potential of our reformulation is in dealing with more structured norms than $\ell_1$, for which computing the proximal operator is often intractable. One such case is the latent group lasso norm~\cite{obozinski2011group}, defined as
\begin{align*}
\|z\|_{\Omega} = \sum_{i=1}^I \| z_{\Omega_i} \|,
\end{align*}
where $\{\Omega_i\}_{i=1}^I$ are (not necessarily disjoint) index sets of $\{1,\cdots,d\}$. Although not studied here, we believe that the nonconvex framework presented in this paper can serve to solve more complicated problems, such as the latent group lasso. We leave this research direction for future work.
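For concreteness, the displayed penalty is simple to evaluate; here is a short Python illustration with hypothetical overlapping groups.
\begin{verbatim}
import numpy as np

def group_norm(z, groups):
    # sum of Euclidean norms of the (possibly overlapping) subvectors z_{Omega_i}
    return sum(np.linalg.norm(z[list(g)]) for g in groups)

z = np.array([1.0, -2.0, 0.0, 3.0])
print(group_norm(z, [[0, 1], [1, 2, 3]]))   # ||(1,-2)|| + ||(-2,0,3)||
\end{verbatim}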
\paragraph{Condition verification:}
In the sequel, we verify that Theorem~\ref{thm:main} indeed applies to~\eqref{prob:01} with the above $f,A,g$.
Note that
\begin{align}
DA(x) = 2 \overline{B} \text{diag}(x),
\label{eq:jacob-bp}
\end{align}
where $\text{diag}(x)\in\RR^{2d\times 2d}$ is the diagonal matrix formed by $x$. The left-hand side of \eqref{eq:regularity} then reads as
To bound the last line above, let $x_*$ be a solution of~\eqref{prob:01} and note that $\overline{B} x_*^{\circ 2} = b $ by definition. Let also $z_k,z_*\in \RR^d$ denote the vectors corresponding to $x_k,x_*$. Corresponding to $x_k$, also define $u_{k,1},u_{k,2}$ naturally and let $|z_k| = u_{k,1}^{\circ 2} + u_{k,2}^{\circ 2} \in \RR^d$ be the vector of amplitudes of $z_k$. To simplify matters, let us assume also that $B$ is full-rank.
We then rewrite the norm in the last line of \eqref{eq:cnd-bp-pre} as
where $\eta_n(\cdot)$ returns the $n$th largest singular value of its argument. In the last line above, $B_T$ is the restriction of $B$ to the columns indexed by the set $T$ of size $n$. Moreover, $z_{k,(n)}$ is the $n$th largest entry of $z_k$ in magnitude.
Given a prescribed $\nu$, \eqref{eq:regularity} therefore holds if
for every iteration $k$. If Algorithm \ref{Algo:2} is initialized close enough to the solution $z_*$ and the entries of $z_*$ are sufficiently large in magnitude, there will be no need to directly enforce \eqref{eq:final-bp-cnd}.
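As a sanity check of \eqref{eq:jacob-bp}, the closed-form Jacobian can be compared against finite differences; below is a minimal Python sketch with arbitrary small dimensions (the central differences are essentially exact here because $A$ is quadratic in $x$).
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
n, d = 5, 8
B = rng.standard_normal((n, d))
B_bar = np.hstack([B, -B])
b = rng.standard_normal(n)
A = lambda v: B_bar @ (v ** 2) - b

x = rng.standard_normal(2 * d)
DA = 2 * B_bar @ np.diag(x)                 # closed form (eq:jacob-bp)

h = 1e-6
DA_fd = np.column_stack([(A(x + h * e) - A(x - h * e)) / (2 * h)
                         for e in np.eye(2 * d)])
print(np.abs(DA - DA_fd).max())             # small: the two Jacobians agree
\end{verbatim}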
\caption{{\it{(Top)}} Objective convergence for calculating the top generalized eigenvalue and eigenvector of $B$ and $C$. {\it{(Bottom)}} Eigenvalue structure of the matrices. For (i), (ii) and (iii), $C$ is positive semidefinite; for (iv), (v) and (vi), $C$ contains negative eigenvalues. {[(i): Generated by taking the symmetric part of an iid Gaussian matrix. (ii): Generated by randomly rotating $\mathrm{diag}(1^{-p}, 2^{-p}, \cdots, 1000^{-p})$ ($p=1$). (iii): Generated by randomly rotating $\mathrm{diag}(10^{-p}, 10^{-2p}, \cdots, 10^{-1000p})$ ($p=0.0025$).]}
}
\label{fig:geig1}
\end{figure}
\section{Generalized Eigenvalue Problem}
The generalized eigenvalue problem has extensive applications in machine learning, statistics and data analysis~\cite{ge2016efficient}.
A well-known nonconvex formulation of the problem~\cite{boumal2016non} is given by
\begin{align}
\begin{cases}
\underset{x\in\mathbb{R}^n}{\min} x^\top C x \\
x^\top B x = 1,
\end{cases}
\label{eq:eig}
\end{align}
where $B, C \in \RR^{n \times n}$ are symmetric matrices and $B$ is positive definite, namely, $B \succ 0$.
The generalized eigenvector computation is equivalent to performing principal component analysis (PCA) of $C$ with respect to the norm induced by $B$. It is also equivalent to computing the top eigenvector of the symmetric matrix $S = B^{-1/2}CB^{-1/2}$ and multiplying the resulting vector by $B^{-1/2}$. However, for large values of $n$, computing $B^{-1/2}$ is extremely expensive.
The natural convex SDP relaxation for~\eqref{eq:eig} involves lifting $Y = xx^\top$ and removing the nonconvex rank$(Y) = 1$ constraint, namely,
Here, however, we opt to directly solve~\eqref{eq:eig} because it fits into our template with
\begin{align}
f(x) =& x^\top C x, \quad g(x) = 0,\nonumber\\
A(x) =& x^\top B x - 1.
\label{eq:eig-equiv}
\end{align}
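Before turning to the comparison, a small Python/SciPy sketch (our own illustration, not one of the solvers compared below) checks these template functions against a dense generalized eigensolver on a random instance:
\begin{verbatim}
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(3)
n = 50
M = rng.standard_normal((n, n))
C = (M + M.T) / 2                      # symmetric objective matrix
Bmat = M @ M.T + n * np.eye(n)         # positive definite constraint matrix

f = lambda x: x @ C @ x                # objective of (eq:eig)
A = lambda x: x @ Bmat @ x - 1.0       # constraint map of (eq:eig-equiv)

# dense reference: eigh returns B-orthonormal eigenvectors, so x^T B x = 1
vals, vecs = eigh(C, Bmat)
x_min = vecs[:, 0]                     # minimizer of (eq:eig)
print(abs(f(x_min) - vals[0]), abs(A(x_min)))   # both ~ 0
\end{verbatim}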
We compare our approach against three different methods: the manifold-based Riemannian gradient descent and Riemannian trust-region methods of \cite{boumal2016global}, and the linear system solver of~\cite{ge2016efficient}, abbreviated as GenELin. We used the Manopt software package \cite{manopt} for the manifold-based methods. For GenELin, we utilized Matlab's backslash operator as the linear solver. The results are compiled in Figure \ref{fig:geig1}.
\paragraph{Condition verification:}
Here, we verify the regularity condition in \eqref{eq:regularity} for problem \eqref{eq:eig}. Note that
where $\eta_{\min}(B)$ is the smallest eigenvalue of the positive definite matrix $B$.
Therefore, for a prescribed $\nu$, the regularity condition in \eqref{eq:regularity} holds provided that $\|x_k\| \ge \nu /\eta_{\min}(B)$ for every $k$. If the algorithm is initialized close enough to the constraint set, there will again be no need to directly enforce this latter condition.
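For reference, this bound rests on an elementary estimate that uses only $A(x) = x^\top B x - 1$ from \eqref{eq:eig-equiv} and $B \succeq \eta_{\min}(B)\, I$:
\begin{align*}
DA(x_k) = 2x_k^\top B,
\qquad
\|DA(x_k)\| = 2\|Bx_k\| \ge 2\,\eta_{\min}(B)\,\|x_k\|.
\end{align*}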
\subsection{$\ell_\infty$ Denoising with a Generative Prior\label{sec:gan}}
The authors of \cite{Samangouei2018,Ilyas2017} have proposed to
project onto the range of a Generative Adversarial Network (GAN)
\cite{Goodfellow2014} as a way to defend against adversarial examples. For a
given noisy observation $x^* + \eta$, they consider a projection in the
$\ell_2$ norm. We instead propose to use our augmented Lagrangian method to
denoise in the $\ell_\infty$ norm, a much harder task:
where $\otimes$ denotes the Kronecker product. Therefore, we can recast \eqref{eq:qap1} as
\begin{align}
\text{tr}((K\otimes L) Y) \,\,\,\,\text{subject to} \,\,\, Y = \text{vec}(P)\text{vec}(P)^\top,
\label{eq:qapkron}
\end{align}
where $P$ is a permutation matrix. We can relax the equality constraint in \eqref{eq:qapkron} to a semidefinite constraint and write it in an equivalent form as
\begin{align*}
X =
\begin{bmatrix}
1 & \text{vec}(P)^\top\\
\text{vec}(P) & Y
\end{bmatrix}
\succeq 0 \,\,\, \text{for a symmetric } X \in \mathbb{S}^{(n^2+1) \times (n^2+1)}.
\end{align*}
We now introduce the following constraints
to make sure that $X$ has the proper structure. Here, $B_k$ is a linear operator on $X$, and the total number of constraints is $m = \sum_{k} m_k$. Hence, the SDP relaxation of the quadratic assignment problem takes the form
\begin{align}
\max\,\,\, & \langle C, X \rangle \nonumber \\%\,\text{trace}(K\otimes L)
\text{trace}_1^*(T) = I \otimes T \,\,\,\,\,\, \text{and} \,\,\,\,\, \text{trace}_2^*(T)= T \otimes I
$$
The first set of equalities is due to the fact that permutation matrices are doubly stochastic. The second set of equalities ensures that permutation matrices are orthogonal, i.e., $PP^\top = P^\top P=I$. The third set of equalities enforces that every individual entry of the permutation matrix is either $0$ or $1$, i.e., $X_{1, i} = X_{i,i} \,\,\, \forall i \in [1, n^2+1]$. The trace constraint in the last line bounds the problem domain.
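These equality constraints can be checked numerically on a small instance. The following Python sketch (our own illustration; it takes the partial traces blockwise, consistently with the adjoints $\text{trace}_1^*(T) = I\otimes T$ and $\text{trace}_2^*(T) = T\otimes I$ above) builds the lifted matrix from a random permutation and verifies the structure described.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)
n = 6
P = np.eye(n)[rng.permutation(n)]          # random n x n permutation matrix
p = P.reshape(-1, order="F")               # vec(P), columns stacked
Y = np.outer(p, p)
X = np.block([[np.ones((1, 1)), p[None, :]],
              [p[:, None],      Y        ]])

print(np.linalg.eigvalsh(X).min() >= -1e-12)               # X is PSD (rank one)
print(np.allclose(X[0, 1:], np.diag(X)[1:]))               # X_{1,i} = X_{i,i}
print(np.allclose(P.sum(0), 1), np.allclose(P.sum(1), 1))  # doubly stochastic

# partial traces of Y, viewed as an n-by-n grid of n-by-n blocks
Yb = Y.reshape(n, n, n, n)
tr1 = np.einsum("abad->bd", Yb)            # sum of diagonal blocks = P P^T
tr2 = np.einsum("abcb->ac", Yb)            # blockwise trace        = P^T P
print(np.allclose(tr1, np.eye(n)), np.allclose(tr2, np.eye(n)))
\end{verbatim}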
By concatenating the $B_k$'s in \eqref{eq:qapcons}, we can rewrite \eqref{eq:qap_sdp} in standard SDP form as
\begin{align}
\max\,\,\, & \langle C, X \rangle \nonumber \\%\,\text{trace}(K\otimes L)
where $\mathcal{G}$ denotes the index set on which we impose the nonnegativity constraints. When $\mathcal{G}$ covers the whole set of indices, we obtain the best approximation to the original problem; however, this becomes computationally undesirable as the problem dimension increases. Hence, we remove the redundant nonnegativity constraints and enforce nonnegativity only at the indices where the Kronecker product of $K$ and $L$ is nonzero.
We penalize the nonnegativity constraints and add them to the augmented Lagrangian objective, since the approach of projecting onto the positive orthant in the low-rank space, as we did for clustering, does not work here.
We take \cite{ferreira2018semidefinite} as the baseline; it is an SDP-based approach for solving QAP instances with sparse graph structure.
We compare against the best feasible upper bounds reported in \cite{ferreira2018semidefinite} for the given instances. Here, the optimality gap is defined as
We used a (relatively) sparse graph data set from the QAP library
and ran our low-rank algorithm with different rank values; in each instance, $r_m$ denotes the smallest integer satisfying the Pataki bound~\cite{pataki1998rank, barvinok1995problems}.
The results are shown in Table \ref{tb:qap}.
Except for the last instance, \textit{esc128}, the primal feasibility values are below $10^{-5}$, and we obtained bounds at least as good as those reported in \cite{ferreira2018semidefinite} for these problems.
For \textit{esc128}, the primal feasibility is approximately $10^{-1}$; hence, we could not obtain a good optimality gap.