Paper Reading (87): Accelerated Proximal Gradient Methods for Nonconvex Programming

1 Overview

1.1 Topics

2015: Accelerated proximal gradient methods for nonconvex programming

The supplementary material and code are available at:

  1. https://zhouchenlin.github.io/Publications/2015-NIPS-APG_supp.pdf
  2. https://zhouchenlin.github.io/NIPS2015_code.zip

The experimental dataset is available at:

  1. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/real-sim.bz2
  2. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html

1.2 Summary

Non-convex and non-smooth problems arise widely in image and signal processing, and the accelerated proximal gradient (APG) method is an excellent tool for such problems. However, in non-convex programming it was previously unknown whether APG converges to a critical point.

This paper extends APG to general non-convex and non-smooth problems by introducing a monitor that satisfies the sufficient descent property, and proposes a monotone APG and a non-monotone APG. The non-monotone APG does not require the objective function to decrease monotonically and needs less computation per iteration.

To the best of the authors' knowledge, this is the first APG-type algorithm for general non-convex and non-smooth optimization that guarantees that every accumulation point is a critical point, while keeping the $O(\frac{1}{k^2})$ convergence rate when the problem is convex, where $k$ is the number of iterations. Numerical experiments verify the advantage of the algorithms in speed.

1.3 References

@article{Li:2015:19,
    author  = {Huan Li and Zhouchen Lin},
    title   = {Accelerated proximal gradient methods for nonconvex programming},
    journal = {{NeurIPS}},
    volume  = {28},
    pages   = {1--9},
    year    = {2015},
}

2 Introduction

In recent years, sparse and low-rank learning has received great attention and has been widely used in signal and image processing, statistics, and machine learning. The $l_1$ norm and the nuclear norm serve as continuous and convex surrogates of the $l_0$ norm and the rank, respectively. However, in many cases they are not optimal, since they can achieve sparsity and low rank only under fairly restrictive conditions. For this reason, non-convex regularizers have been proposed, such as the $l_p$ norm, the Capped-$l_1$ penalty, the Log-Sum Penalty, the Minimax Concave Penalty, the Geman penalty, the Smoothly Clipped Absolute Deviation, and the Schatten-$p$ norm. All of them lead to non-convex and non-smooth problems of the form:
$$\min_{\mathbf{x}\in\mathbb{R}^n}F(\mathbf{x})=f(\mathbf{x})+g(\mathbf{x}), \tag{1}$$
where $f$ is differentiable and possibly non-convex, and $g$ is non-convex and non-smooth.
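
As a concrete instance of Equation 1 (an illustration built from the regularizers listed above, not an example given in this section), take a least-squares loss with the $l_p$ penalty for $p=\frac{1}{2}$:
$$f(\mathbf{x})=\frac{1}{2}\|A\mathbf{x}-\mathbf{b}\|^2,\qquad g(\mathbf{x})=\lambda\sum_{i=1}^n |x_i|^{1/2}.$$
Here $f$ is smooth with Lipschitz continuous gradient, while $g$ is non-convex and non-smooth at $0$, so $F=f+g$ has exactly the form of Equation 1.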

Accelerated gradient methods are at the heart of convex optimization research, and several methods of this class have been proposed to deal with Equation 1. For these methods, $k$ iterations suffice to find a solution within an error of $O(\frac{1}{k^2})$ from the optimal objective value. Recently, Ghadimi and Lan proposed a unified accelerated gradient (UAG) method for convex, non-convex, and stochastic optimization. They proved that their algorithm converges in non-convex programming for non-convex $f$ and convex $g$, and that it keeps the accelerated $O(\frac{1}{k^2})$ rate in the convex case. They further analyzed the convergence rate in terms of the gradient mapping.

Attouch et al. proposed a unified framework to prove the convergence of general descent methods for solving Equation 1 using the Kurdyka-Łojasiewicz (KL) inequality. Frankel et al. studied the convergence rate of general descent methods when the desingularization function $\varphi$ in the KL property has the form $\frac{C}{\theta}t^\theta$. A typical example in their framework is the proximal gradient method. However, no existing work shows that an accelerated gradient method satisfies the conditions of their framework.

Other solutions to Equation 1 include the inertial forward-backward method (IFB), iPiano, the general iterative shrinkage and thresholding algorithm (GIST), gradient descent with proximal average (GDPA), and iteratively reweighted (IR) algorithms; Table 1 in the paper summarizes these methods.

The purpose of this paper is to extend the APG algorithm of Beck and Teboulle to make it suitable for non-convex and non-smooth problems. APG first combines the current point and the previous point to extrapolate a point $\mathbf{y}_k$, and then solves a proximal mapping problem. When extending APG to non-convex programming, the main difficulty is how to extrapolate $\mathbf{y}_k$. Without convexity, there is almost no control over $F(\mathbf{y}_k)$. In fact, when $\mathbf{y}_k$ is a bad extrapolation, $F(\mathbf{y}_k)$ can be arbitrarily larger than $F(\mathbf{x}_k)$. When $\mathbf{x}_{k+1}$ is computed by a proximal mapping at such a bad $\mathbf{y}_k$, $F(\mathbf{x}_{k+1})$ may also be arbitrarily larger than $F(\mathbf{x}_k)$. Beck and Teboulle's monotone APG ensures $F(\mathbf{x}_{k+1})\leq F(\mathbf{x}_k)$, but this alone does not ensure convergence to a critical point. For this reason, the paper introduces a monitor satisfying the sufficient descent property to prevent and correct bad $\mathbf{y}_k$.

3 Preliminaries

3.1 Basic assumptions

For a proper function $g:\mathbb{R}^n\to(-\infty,+\infty]$ with $\text{dom}\,g=\{\mathbf{x}\in\mathbb{R}^n : g(\mathbf{x})<+\infty\}\neq\emptyset$, if $\liminf_{\mathbf{x}\to\mathbf{x}_0} g(\mathbf{x}) \geq g(\mathbf{x}_0)$, then $g$ is lower semicontinuous at the point $\mathbf{x}_0$. In Equation 1, it is assumed that $f$ has a Lipschitz continuous gradient and that $g$ is a proper lower semicontinuous function. It is also assumed that $F(\mathbf{x})$ is coercive, i.e., $F$ is bounded from below and $F(\mathbf{x})\to\infty$ when $\|\mathbf{x}\|\to\infty$, where $\|\cdot\|$ is the $l_2$ norm.

3.2 KL inequality

Definition 1 (KL property): A function $f:\mathbb{R}^n\to(-\infty,+\infty]$ has the KL property at $\overline{\mathbf{u}}\in\text{dom}\,\partial f:=\{\mathbf{u}\in\mathbb{R}^n : \partial f(\mathbf{u})\neq\emptyset\}$ if there exist $\eta\in(0,+\infty]$, a neighborhood $U$ of $\overline{\mathbf{u}}$, and a function $\varphi\in\Phi_\eta$, such that for all $\mathbf{u}\in U\bigcap\{\mathbf{u}\in\mathbb{R}^n : f(\overline{\mathbf{u}})<f(\mathbf{u})<f(\overline{\mathbf{u}})+\eta\}$, the following inequality holds:
$$\varphi'\big(f(\mathbf{u})-f(\overline{\mathbf{u}})\big)\,\text{dist}\big(0,\partial f(\mathbf{u})\big)\geq 1, \tag{2}$$
where $\Phi_\eta$ denotes the class of functions $\varphi:[0,\eta)\to\mathbb{R}^+$ satisfying the following conditions:

  1. $\varphi$ is concave and continuously differentiable ($C^1$) on $(0,\eta)$;
  2. $\varphi$ is continuous at $0$ and $\varphi(0)=0$;
  3. $\varphi'(x)>0$ for all $x\in(0,\eta)$.

All semi-algebraic functions and subanalytic functions satisfy the KL property. A semi-algebraic function is one whose graph can be written as a finite union of sets defined by polynomial equalities and inequalities; subanalytic functions form a broader class defined analogously through analytic functions. In particular, for semi-algebraic functions the desingularization function $\varphi(t)$ can be chosen of the form $\frac{C}{\theta}t^\theta$, where $\theta\in(0,1]$. Typical semi-algebraic functions include polynomial functions, $\|\mathbf{x}\|_p$ with $p\geq 0$, $\text{rank}(X)$, and the indicator functions of PSD cones, Stiefel manifolds, and constant-rank matrices.
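
As a simple illustration of Definition 1 (my own example, not from the paper), take $f(u)=u^2$ and the critical point $\overline{u}=0$. Choosing the desingularization function $\varphi(t)=\frac{C}{\theta}t^\theta$ with $C=1$ and $\theta=\frac{1}{2}$, i.e. $\varphi(t)=2\sqrt{t}$, gives for every $u\neq 0$:
$$\varphi'\big(f(u)-f(\overline{u})\big)\,\text{dist}\big(0,\partial f(u)\big)=\frac{1}{\sqrt{u^2}}\cdot|2u|=2\geq 1,$$
so the KL inequality (Equation 2) holds; this $\theta=\frac{1}{2}$ corresponds to the linear-rate regime of Theorem 3 below.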

3.3 APG under convex programming

We first review APG under convex programming. Beck and Teboulle extended Nesterov's accelerated gradient method to the non-smooth case. The resulting method, named the accelerated proximal gradient (APG) method, consists of the following steps:
$$\mathbf{y}_k=\mathbf{x}_k+\frac{t_{k-1}-1}{t_k}(\mathbf{x}_k-\mathbf{x}_{k-1}), \tag{3}$$
$$\mathbf{x}_{k+1}=\text{prox}_{\alpha_k g}\big(\mathbf{y}_k-\alpha_k\nabla f(\mathbf{y}_k)\big), \tag{4}$$
$$t_{k+1}=\frac{\sqrt{4t_k^2+1}+1}{2}, \tag{5}$$
where the proximal mapping is defined as $\text{prox}_{\alpha g}(\mathbf{x})=\arg\min_{\mathbf{u}} g(\mathbf{u})+\frac{1}{2\alpha}\|\mathbf{x}-\mathbf{u}\|^2$. APG is not a monotone method, which means that $F(\mathbf{x}_{k+1})$ is not necessarily smaller than $F(\mathbf{x}_k)$. Therefore, Beck and Teboulle further proposed a monotone APG, which consists of the following steps:
$$\begin{aligned}
& \mathbf{y}_k=\mathbf{x}_k+\frac{t_{k-1}}{t_k}\left(\mathbf{z}_k-\mathbf{x}_k\right)+\frac{t_{k-1}-1}{t_k}\left(\mathbf{x}_k-\mathbf{x}_{k-1}\right), \\
& \mathbf{z}_{k+1}=\operatorname{prox}_{\alpha_k g}\left(\mathbf{y}_k-\alpha_k \nabla f\left(\mathbf{y}_k\right)\right), \\
& t_{k+1}=\frac{\sqrt{4t_k^2+1}+1}{2}, \\
& \mathbf{x}_{k+1}= \begin{cases}\mathbf{z}_{k+1}, &\text{if } F\left(\mathbf{z}_{k+1}\right) \leq F\left(\mathbf{x}_k\right), \\ \mathbf{x}_k, &\text{otherwise.}\end{cases}
\end{aligned} \tag{6--9}$$
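
To make the convex APG iteration (Equations 3–5) concrete, here is a minimal Python sketch for $g=\lambda\|\cdot\|_1$, whose proximal mapping is soft-thresholding; the monotone variant (Equations 6–9) only adds the $\mathbf{z}_k$ bookkeeping and the acceptance test. This is an illustrative sketch with an assumed LASSO-type example, not the authors' released code.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal mapping of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def apg_convex(grad_f, prox_g, x0, alpha, n_iter=500):
    """Convex APG: extrapolate y_k (Eq. 3), prox step (Eq. 4), update t_k (Eq. 5)."""
    x_prev, x = x0.copy(), x0.copy()
    t_prev, t = 1.0, 1.0                      # assumed initialization t_0 = t_1 = 1, so y_1 = x_1
    for _ in range(n_iter):
        y = x + (t_prev - 1.0) / t * (x - x_prev)                    # Eq. 3
        x_prev = x
        x = prox_g(y - alpha * grad_f(y), alpha)                     # Eq. 4
        t_prev, t = t, (np.sqrt(4.0 * t ** 2 + 1.0) + 1.0) / 2.0     # Eq. 5
    return x

# Assumed example: least squares f(x) = 0.5 * ||Ax - b||^2 plus an l1 penalty,
# with step size alpha = 1/L, where L = ||A||_2^2 is the Lipschitz constant of grad f.
rng = np.random.default_rng(0)
A, b, lam = rng.standard_normal((50, 100)), rng.standard_normal(50), 0.1
L = np.linalg.norm(A, 2) ** 2
x_hat = apg_convex(lambda x: A.T @ (A @ x - b),
                   lambda v, a: soft_threshold(v, a * lam),
                   np.zeros(100), 1.0 / L)
```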

4 APG for non-convex programming

In this section, two APG-based algorithms for non-convex and non-smooth programming are proposed. Convergence is established in the non-convex case, and the $O(\frac{1}{k^2})$ convergence rate is retained in the convex case. When the KL property is satisfied, stronger convergence rate results are provided.

4.1 Monotone APG

Two reasons make the convergence analysis of the usual APG difficult:

  1. $\mathbf{y}_k$ may be a poor extrapolation;
  2. Existing methods only guarantee the descent property $F(\mathbf{x}_{k+1})\leq F(\mathbf{x}_k)$.

To solve these problems, a monitor is needed to correct $\mathbf{y}_k$ when it fails, and the monitor should enforce the sufficient descent property, which is the key to converging to a critical point. The proximal gradient method is known to ensure sufficient descent, so the proximal gradient step is used as the monitor:
$$\begin{aligned}
& \mathbf{y}_k=\mathbf{x}_k+\frac{t_{k-1}}{t_k}\left(\mathbf{z}_k-\mathbf{x}_k\right)+\frac{t_{k-1}-1}{t_k}\left(\mathbf{x}_k-\mathbf{x}_{k-1}\right), \\
& \mathbf{z}_{k+1}=\operatorname{prox}_{\alpha_y g}\left(\mathbf{y}_k-\alpha_y \nabla f\left(\mathbf{y}_k\right)\right), \\
& \mathbf{v}_{k+1}=\operatorname{prox}_{\alpha_x g}\left(\mathbf{x}_k-\alpha_x \nabla f\left(\mathbf{x}_k\right)\right), \\
& t_{k+1}=\frac{\sqrt{4t_k^2+1}+1}{2}, \\
& \mathbf{x}_{k+1}= \begin{cases}\mathbf{z}_{k+1}, & \text{if } F\left(\mathbf{z}_{k+1}\right) \leq F\left(\mathbf{v}_{k+1}\right), \\ \mathbf{v}_{k+1}, & \text{otherwise.}\end{cases}
\end{aligned} \tag{10--14}$$
Here $\alpha_y<\frac{1}{L}$ and $\alpha_x<\frac{1}{L}$ are fixed constants, or they can be initialized by Barzilai-Borwein rules with a backtracking line search, where $L$ is the Lipschitz constant of $\nabla f$.

The algorithm is an extension of Beck and Teboulle's monotone APG. One difference is the extra variable $\mathbf{v}$, which acts as a monitor and serves as a correction step when $\mathbf{x}$ is updated. Another difference is that the algorithm ensures sufficient descent:
$$F(\mathbf{x}_{k+1})\leq F(\mathbf{x}_k)-\delta\|\mathbf{v}_{k+1}-\mathbf{x}_k\|^2, \tag{15}$$
where $\delta>0$ is a small constant. Note that the descent property alone does not guarantee convergence to a critical point in non-convex programming. The convergence result is therefore given in the following theorem.

Theorem 1: Let $f$ be a proper function with Lipschitz continuous gradient and $g$ be a proper lower semicontinuous function. For non-convex $f$ and non-smooth $g$, assume that $F(\mathbf{x})$ is coercive. Then the sequences $\{\mathbf{x}_k\}$ and $\{\mathbf{v}_k\}$ generated by Equations 10–14 are bounded. Let $\mathbf{x}^*$ be any accumulation point of $\{\mathbf{x}_k\}$; then $0\in\partial F(\mathbf{x}^*)$, i.e., $\mathbf{x}^*$ is a critical point.

Theorem 2: For convex $f$ and $g$, assume that $\nabla f$ is Lipschitz continuous and let $\mathbf{x}^*$ be any global optimal solution. Then the sequence $\{\mathbf{x}_k\}$ satisfies:
$$F(\mathbf{x}_{N+1})-F(\mathbf{x}^*)\leq\frac{2}{\alpha_y(N+1)^2}\|\mathbf{x}_0-\mathbf{x}^*\|^2. \tag{16}$$
When the objective function is locally convex in a neighborhood of a local minimum, Theorem 2 shows that APG retains the $O(\frac{1}{k^2})$ rate near that local minimum, thus speeding up convergence.

For a clearer presentation, the monotone APG procedure is summarized as Algorithm 1 in the paper (the algorithm figure is not reproduced here); a code sketch of the same steps is given below.
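
The following Python sketch restates Equations 10–14 with fixed step sizes $\alpha_x,\alpha_y<\frac{1}{L}$ and the initialization $t_0=0$, $t_1=1$, $\mathbf{z}_1=\mathbf{x}_1=\mathbf{x}_0$ (assumptions made for a minimal sketch; Barzilai-Borwein step sizes with backtracking would also be possible). It is an illustration, not the authors' released code.

```python
import numpy as np

def monotone_apg(F, grad_f, prox_g, x0, alpha_x, alpha_y, n_iter=500):
    """Monotone APG for nonconvex problems, following Eqs. 10-14.

    F      : full objective f + g (used in the comparison of Eq. 14)
    grad_f : gradient of the smooth part f
    prox_g : prox_g(v, alpha) = argmin_u g(u) + ||u - v||^2 / (2 * alpha)
    """
    x_prev, x, z = x0.copy(), x0.copy(), x0.copy()
    t_prev, t = 0.0, 1.0                                 # assumed init: t_0 = 0, t_1 = 1
    for _ in range(n_iter):
        # Eq. 10: extrapolation using z_k, x_k and x_{k-1}
        y = x + (t_prev / t) * (z - x) + ((t_prev - 1.0) / t) * (x - x_prev)
        # Eq. 11: accelerated proximal step from y_k
        z = prox_g(y - alpha_y * grad_f(y), alpha_y)
        # Eq. 12: monitor, a plain proximal gradient step from x_k
        v = prox_g(x - alpha_x * grad_f(x), alpha_x)
        # Eq. 13: update t_k
        t_prev, t = t, (np.sqrt(4.0 * t ** 2 + 1.0) + 1.0) / 2.0
        # Eq. 14: keep whichever candidate has the smaller objective value
        x_prev = x
        x = z if F(z) <= F(v) else v
    return x
```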

4.2 Convergence rate under the KL assumption

The KL property is a powerful tool. The usual APG does not satisfy the sufficient descent property, which is crucial for applying the KL property, so no conclusion was previously available under the KL assumption. On the other hand, because of the intermediate variables $\mathbf{y}_k$, $\mathbf{v}_k$, and $\mathbf{z}_k$, the proposed algorithm is more complex than general descent methods and does not satisfy their conditions directly. However, thanks to the monitor steps (Equations 12 and 14), some modified conditions are satisfied and satisfactory results can still be obtained under the KL assumption. Within the existing framework, we have the following theorem:

Theorem 3: Let $f$ be a proper function with Lipschitz continuous gradient and $g$ be a proper lower semicontinuous function. For non-convex $f$ and non-smooth $g$, assume that $F(\mathbf{x})$ is coercive. Further assume that $f$ and $g$ satisfy the KL property and that the desingularization function has the form $\varphi(t)=\frac{C}{\theta}t^\theta$ for some $C>0$ and $\theta\in(0,1]$. Then:

  1. If $\theta=1$, then there exists $k_1$ such that $F(\mathbf{x}_k)=F^*$ for all $k>k_1$, and the algorithm terminates in a finite number of steps;
  2. If $\theta\in\left[\frac{1}{2},1\right)$, then there exists $k_2$ such that for all $k>k_2$,
    $$F\left(\mathbf{x}_k\right)-F^* \leq\left(\frac{d_1 C^2}{1+d_1 C^2}\right)^{k-k_2} r_{k_2}; \tag{17}$$
  3. If $\theta\in\left(0,\frac{1}{2}\right)$, then there exists $k_3$ such that for all $k>k_3$,
    $$F\left(\mathbf{x}_k\right)-F^* \leq\left(\frac{C}{\left(k-k_3\right) d_2(1-2\theta)}\right)^{\frac{1}{1-2\theta}}, \tag{18}$$
    where $F^*$ denotes the common function value of the accumulation points of $\{\mathbf{x}_k\}$, $r_k=F(\mathbf{v}_k)-F^*$, and
    $$d_1=\left(\frac{1}{\alpha_x}+L\right)^2 \Big/\left(\frac{1}{2\alpha_x}-\frac{L}{2}\right) \text{ and } d_2=\min\left\{\frac{1}{2 d_1 C}, \frac{C}{1-2\theta}\left(2^{\frac{2\theta-1}{2\theta-2}}-1\right) r_0^{2\theta-1}\right\}.$$

When $F(\mathbf{x})$ is a semi-algebraic function, the desingularization function $\varphi(t)$ can be chosen of the form $\frac{C}{\theta}t^\theta$ with $\theta\in(0,1]$ [23]. In this case, by Theorem 3, the algorithm converges in a finite number of iterations when $\theta=1$, at a linear rate when $\theta\in[\frac{1}{2},1)$, and at a sublinear rate when $\theta\in(0,\frac{1}{2})$.

4.3 Non-monotone APG

Algorithm 1 is a monotone algorithm. When the problem is ill-conditioned, a monotone algorithm must creep along the bottom of a narrow, curved valley so that the objective value does not increase, which leads to short steps or even zigzagging and hence slow convergence. Removing the monotonicity requirement can improve convergence, because larger step sizes can be used in the line search.

On the other hand, Algorithm 1 computes both $\mathbf{z}_{k+1}$ and $\mathbf{v}_{k+1}$ in each iteration, using $\mathbf{v}_{k+1}$ as a monitor to correct $\mathbf{z}_{k+1}$. This is a conservative strategy: when $\mathbf{y}_k$ is a good extrapolation, $\mathbf{z}_{k+1}$ can in fact be accepted directly as $\mathbf{x}_{k+1}$, so that $\mathbf{v}_{k+1}$ only needs to be computed and accepted when a certain criterion fails. This reduces the average number of proximal mappings per iteration, which motivates the non-monotone APG discussed next.

In monotone APG, Equation 15 is guaranteed. In non-monotone APG, $\mathbf{x}_{k+1}$ is allowed to have a larger objective value than $F(\mathbf{x}_k)$. Specifically, $\mathbf{x}_{k+1}$ only needs to produce an objective value smaller than a relaxation $c_k$ of $F(\mathbf{x}_k)$, and $c_k$ should not be much larger than $F(\mathbf{x}_k)$. A weighted average of $F(\mathbf{x}_k),F(\mathbf{x}_{k-1}),\dots,F(\mathbf{x}_1)$ is therefore a good choice:
$$c_k=\frac{\sum_{j=1}^k \eta^{k-j} F\left(\mathbf{x}_j\right)}{\sum_{j=1}^k \eta^{k-j}}, \tag{19}$$
where $\eta\in[0,1)$ controls the degree of non-monotonicity. In practice, $c_k$ can be computed efficiently by the following recursion:
$$\begin{aligned} q_{k+1} &=\eta q_k+1, \\ c_{k+1} &=\frac{\eta q_k c_k+F\left(\mathbf{x}_{k+1}\right)}{q_{k+1}}, \end{aligned} \tag{20--21}$$
where $q_1=1$ and $c_1=F\left(\mathbf{x}_1\right)$.
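
A minimal Python check (with made-up objective values) that the recursion in Equations 20–21 reproduces the weighted average of Equation 19:

```python
import numpy as np

def c_direct(F_vals, eta):
    """Eq. 19: weighted average of the past objective values F(x_1), ..., F(x_k)."""
    k = len(F_vals)
    w = np.array([eta ** (k - j) for j in range(1, k + 1)])   # weights eta^{k-j}, j = 1..k
    return float(w @ np.array(F_vals) / w.sum())

def c_recursive(F_vals, eta):
    """Eqs. 20-21: q_{k+1} = eta*q_k + 1, c_{k+1} = (eta*q_k*c_k + F(x_{k+1})) / q_{k+1}."""
    q, c = 1.0, F_vals[0]                                     # q_1 = 1, c_1 = F(x_1)
    for Fk in F_vals[1:]:
        q_next = eta * q + 1.0
        c = (eta * q * c + Fk) / q_next
        q = q_next
    return c

F_vals = [3.0, 2.5, 2.7, 2.1, 2.2]    # hypothetical objective values F(x_1), ..., F(x_5)
eta = 0.8
print(c_direct(F_vals, eta), c_recursive(F_vals, eta))        # the two values agree
```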

Following Equation 14, the update of $\mathbf{x}_{k+1}$ is split into two cases. Accordingly, in non-monotone APG the following conditions are used in place of Equation 15:
$$\begin{aligned} & F\left(\mathbf{z}_{k+1}\right) \leq c_k-\delta\left\|\mathbf{z}_{k+1}-\mathbf{y}_k\right\|^2, \\ & F\left(\mathbf{v}_{k+1}\right) \leq c_k-\delta\left\|\mathbf{v}_{k+1}-\mathbf{x}_k\right\|^2. \end{aligned} \tag{22--23}$$
Equation 22 serves as the acceptance criterion described above: when it is satisfied, $\mathbf{y}_k$ is a good extrapolation and $\mathbf{z}_{k+1}$ is accepted directly; otherwise, a $\mathbf{v}_{k+1}$ satisfying Equation 23 is computed via the proximal gradient (monitor) step of Equation 12. When a backtracking line search is used, a $\mathbf{v}_{k+1}$ satisfying Equation 23 can be found within a finite number of trials.

Combining Equations 20–22 with $\mathbf{x}_{k+1}=\mathbf{z}_{k+1}$ gives:
$$c_{k+1} \leq c_k-\frac{\delta\left\|\mathbf{x}_{k+1}-\mathbf{y}_k\right\|^2}{q_{k+1}}.$$
Correspondingly, combining Equations 20–21 with Equation 23 and $\mathbf{x}_{k+1}=\mathbf{v}_{k+1}$ gives:
$$c_{k+1} \leq c_k-\frac{\delta\left\|\mathbf{x}_{k+1}-\mathbf{x}_k\right\|^2}{q_{k+1}}.$$
This means that the sufficient descent condition is now expressed in terms of $c_k$, replacing the condition on $F(\mathbf{x}_k)$ in Equation 15.

Algorithm 2 summarizes the non-monotone APG (the figure is not reproduced here; a code sketch follows). Similar to monotone APG, non-monotone APG converges in the non-convex case and achieves the $O\left(\frac{1}{k^2}\right)$ convergence rate in convex programming. The convergence result is given in Theorem 4, while Theorem 2 holds for Algorithm 2 without modification.
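
The sketch below assembles Equations 10–11, 14, and 19–23 into a non-monotone APG loop with fixed step sizes; the paper instead uses backtracking to guarantee Equation 23 within a finite number of trials, so this is only an illustration under that simplifying assumption, not the authors' released code.

```python
import numpy as np

def nonmonotone_apg(F, grad_f, prox_g, x0, alpha_x, alpha_y,
                    eta=0.8, delta=1e-4, n_iter=500):
    """Non-monotone APG sketch: accept z_{k+1} when Eq. 22 holds,
    otherwise fall back to the monitor step and the comparison of Eq. 14."""
    x_prev, x, z = x0.copy(), x0.copy(), x0.copy()
    t_prev, t = 0.0, 1.0
    q, c = 1.0, F(x0)                                        # q_1 = 1, c_1 = F(x_1)
    for _ in range(n_iter):
        # Eq. 10: extrapolation
        y = x + (t_prev / t) * (z - x) + ((t_prev - 1.0) / t) * (x - x_prev)
        # Eq. 11: accelerated proximal step
        z = prox_g(y - alpha_y * grad_f(y), alpha_y)
        x_prev = x
        if F(z) <= c - delta * np.linalg.norm(z - y) ** 2:    # Eq. 22
            x = z                                             # accept the accelerated step
        else:
            v = prox_g(x - alpha_x * grad_f(x), alpha_x)      # monitor step (Eq. 12)
            x = z if F(z) <= F(v) else v                      # Eq. 14
        # Eq. 13: update t; Eqs. 20-21: update q and c
        t_prev, t = t, (np.sqrt(4.0 * t ** 2 + 1.0) + 1.0) / 2.0
        q_next = eta * q + 1.0
        c = (eta * q * c + F(x)) / q_next
        q = q_next
    return x
```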

Let $\Omega_1=\left\{k_1, k_2, \cdots, k_j, \cdots\right\}$ and $\Omega_2=\left\{m_1, m_2, \cdots, m_j, \cdots\right\}$ be such that, in Algorithm 2, for all $k=k_j\in\Omega_1$ the update $\mathbf{x}_{k+1}=\mathbf{z}_{k+1}$ is executed, while for all $k=m_j\in\Omega_2$ Equation 22 does not hold and Equation 14 is executed instead. Then $\Omega_1\bigcap\Omega_2=\emptyset$ and $\Omega_1\bigcup\Omega_2=\{1,2,3,\cdots\}$, and we have the following theorem.

Theorem 4: Let $f$ be a proper function with Lipschitz continuous gradient and $g$ be a proper lower semicontinuous function. For non-convex $f$ and non-smooth $g$, assume that $F(\mathbf{x})$ is coercive. Then the sequences $\{\mathbf{x}_k\}$, $\{\mathbf{v}_k\}$, and $\{\mathbf{y}_{k_j}\}$ with $k_j\in\Omega_1$ generated by Algorithm 2 are bounded, and:

  1. If $\Omega_1$ or $\Omega_2$ is finite, then for any accumulation point $\mathbf{x}^*$ of $\{\mathbf{x}_k\}$, we have $0\in\partial F(\mathbf{x}^*)$.
  2. If both $\Omega_1$ and $\Omega_2$ are infinite, then $0\in\partial F(\mathbf{x}^*)$, $0\in\partial F(\mathbf{y}^*)$, and $0\in\partial F(\mathbf{v}^*)$ hold for any accumulation point $\mathbf{x}^*$ of $\left\{\mathbf{x}_{k_j+1}\right\}$ and any accumulation point $\mathbf{y}^*$ of $\left\{\mathbf{y}_{k_j}\right\}$ with $k_j\in\Omega_1$, as well as for any accumulation point $\mathbf{v}^*$ of $\left\{\mathbf{v}_{m_j+1}\right\}$ and any accumulation point $\mathbf{x}^*$ of $\left\{\mathbf{x}_{m_j}\right\}$ with $m_j\in\Omega_2$.

5 Numerical results

This section tests the performance of the algorithms with sparse logistic regression (LR). Sparse LR is an extension of LR that performs feature selection while reducing overfitting, and it is widely used in bioinformatics and text classification. The objective function optimized in the paper is sparse LR with a non-convex regularizer:
$$\min_{\mathbf{w}} \frac{1}{n} \sum_{i=1}^n \log\left(1+\exp\left(-y_i \mathbf{x}_i^T \mathbf{w}\right)\right)+r(\mathbf{w}), \tag{26}$$
where $r(\mathbf{w})$ is chosen as the capped-$l_1$ penalty:
$$r(\mathbf{w})=\lambda \sum_{i=1}^d \min\left(\left|w_i\right|, \theta\right), \quad \theta>0.$$
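
To connect this objective to the APG sketches above, here is a small Python sketch of the two ingredients they need for Equation 26: the logistic loss with its gradient (the smooth part $f$), and the proximal mapping of the capped-$l_1$ penalty, computed coordinate-wise by comparing the minimizer over $\{|u|\leq\theta\}$ with the one over $\{|u|\geq\theta\}$. The synthetic data at the end is a hypothetical stand-in for the real-sim dataset; this is not the authors' experiment code.

```python
import numpy as np

def logistic_loss_grad(w, X, y):
    """Smooth part f of Eq. 26: average logistic loss and its gradient.
    X has shape (n, d); labels y are in {-1, +1}."""
    margins = -y * (X @ w)
    loss = np.mean(np.logaddexp(0.0, margins))                 # log(1 + exp(margins)), stable
    grad = -(X.T @ (y / (1.0 + np.exp(-margins)))) / X.shape[0]
    return loss, grad

def prox_capped_l1(v, alpha, lam, theta):
    """prox of alpha * r, with r(w) = lam * sum_i min(|w_i|, theta), solved
    coordinate-wise by comparing two candidate minimizers."""
    # Candidate 1: restricted to |u| <= theta -> soft-thresholding, then clipping.
    u1 = np.clip(np.sign(v) * np.maximum(np.abs(v) - alpha * lam, 0.0), -theta, theta)
    # Candidate 2: restricted to |u| >= theta -> penalty is constant there.
    u2 = np.where(np.abs(v) >= theta, v, theta * np.sign(v))
    def obj(u):
        return lam * np.minimum(np.abs(u), theta) + (u - v) ** 2 / (2.0 * alpha)
    return np.where(obj(u1) <= obj(u2), u1, u2)

# Hypothetical synthetic data standing in for real-sim; lambda and theta follow the
# experiment settings below (lambda = 1e-4, theta = 0.1 * lambda).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = np.where(rng.standard_normal(200) > 0, 1.0, -1.0)
lam, theta = 1e-4, 0.1 * 1e-4
F = lambda w: logistic_loss_grad(w, X, y)[0] + lam * np.sum(np.minimum(np.abs(w), theta))
grad_f = lambda w: logistic_loss_grad(w, X, y)[1]
prox_g = lambda v, a: prox_capped_l1(v, a, lam, theta)
```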

The algorithms compared are: (1) monotone APG (mAPG); (2) non-monotone APG (nmAPG); (3) mGIST; (4) nmGIST; and (5) IFB. The experiments use the real-sim dataset, which contains 72,309 samples of dimension 20,958. The parameters are set as follows:

  1. $\lambda=0.0001$, $\theta=0.1\lambda$, and the starting point is the zero vector;
  2. For nmAPG, $\eta=0.8$;
  3. For IFB, $\beta=0.01$, and the Lipschitz constant is computed by backtracking;
  4. For fairness, mGIST is run first; it terminates when the relative difference between two successive objective values is less than $10^{-5}$ or the number of iterations exceeds $1000$;
  5. The other algorithms terminate when they reach an objective value smaller than that of mGIST or when they hit the maximum number of iterations;
  6. The ratio of the training set to the test set is 9:1;
  7. The experimental results are the average of 10 independent experiments.
