Machine Learning/Deep Learning - Study Notes: Concept Supplement (Middle)

Study time: 2022.05.11

Concept supplement (middle)

While studying machine learning and deep learning, some concepts remain relatively unfamiliar (perhaps because I never studied statistics, operations research, or probability theory systematically and in depth; perhaps also because the material I encounter is mostly practical, so my theoretical understanding is shallow). My grasp of some concepts is limited to what I have seen or used, and there are even concepts I can apply without knowing the underlying principles. So I want to study a few frequently appearing concepts more systematically. They mainly include:

  • Top: Maximum Likelihood Estimation, Bayesian, Fourier, Markov, Conditional Random Field;
  • Middle: convex sets, convex functions and convex optimization, optimization algorithms (optimizers), overfitting and underfitting;
  • Bottom: regularization & normalization, normalization, loss function and pseudo-labels, etc.

6. Convex Sets, Convex Functions and Convex Optimization

Reference content in this section: A detailed explanation of convex functions and convex optimization , convex optimization learning (1) , (2) , (3) , (4) .

When designing machine learning algorithms, we often want to optimize the value of some function. That is, given a function $f : \mathbb{R}^n \to \mathbb{R}$, we want to find the point $x \in \mathbb{R}^n$ that minimizes (or maximizes) $f(x)$. We have already seen several machine learning algorithms that involve optimization problems, such as least squares, logistic regression, and support vector machines, all of which can be formulated as optimization problems.

In general, finding the global optimum of a function turns out to be a very difficult task. However, for a special class of optimization problems, namely convex optimization problems, we can efficiently find the global optimal solution in many cases. Here, efficiency has both practical and theoretical implications: practically, it means we can solve real-world problems in a reasonable amount of time; theoretically, it means the problem can be solved in an amount of time that depends only polynomially on the size of the problem.

Importance of convex optimization problems:

  1. Convex optimization has good properties: for example, every local optimal solution is a global optimal solution, and convex optimization problems are solvable in polynomial time, e.g. linear programming problems;
  2. Many non-convex optimization or NP-hard problems can be transformed into convex optimization problems, using methods such as duality and relaxation (expanding the feasible region, removing some constraints). In the SVM algorithm, for instance, the Lagrange multiplier method, the dual problem, and slack variables are introduced in order to optimize the objective function.

6.1 Vector representation of geometry

Before introducing concepts such as convex sets, we first introduce the vector representation of spatial geometry. The vector representation of a line segment is used when defining a convex set, so let us use an example to understand how a line segment can be represented with vectors:

Given two fixed points A(5, 1) and B(2, 3) in the two-dimensional plane, the line segment AB can be expressed parametrically as
$$\begin{cases} x_1 = \theta \cdot 5 + (1-\theta)\cdot 2 \\ x_2 = \theta \cdot 1 + (1-\theta)\cdot 3 \end{cases}\qquad \theta \in [0,1],$$
then:

If point A is regarded as the vector a and point B as the vector b, the line segment AB can be written as $\vec{x} = \theta\vec{a} + (1-\theta)\vec{b},\ \ \theta \in [0,1]$;

The vector representation of the straight line through A and B is $\vec{x} = \theta\vec{a} + (1-\theta)\vec{b},\ \ \theta \in \mathbb{R}$.

From this derivation, generalizing to more points and higher dimensions, we obtain the vector representations of further geometric objects. The vector representation of a triangle (the convex combinations of three points) is $\vec{x} = \theta_1\vec{a_1} + \theta_2\vec{a_2} + \theta_3\vec{a_3},\ \ \theta_i \in [0,1],\ \sum\theta_i = 1$;

The vector representation of the plane through the three points (their affine combinations) is $\vec{x} = \theta_1\vec{a_1} + \theta_2\vec{a_2} + \theta_3\vec{a_3},\ \ \theta_i \in \mathbb{R},\ \sum\theta_i = 1$;

The vector representation of the convex hull of k points is $\vec{x} = \theta_1\vec{a_1} + \theta_2\vec{a_2} + \dots + \theta_k\vec{a_k},\ \ \theta_i \in [0,1],\ \sum\theta_i = 1$;

The vector representation of the affine hull (hyperplane) spanned by k points is $\vec{x} = \theta_1\vec{a_1} + \theta_2\vec{a_2} + \dots + \theta_k\vec{a_k},\ \ \theta_i \in \mathbb{R},\ \sum\theta_i = 1$.
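As a quick numerical illustration of these vector representations (a small sketch added here, not part of the original notes), the following Python code builds points on the segment AB from the example above and a convex combination of three points:

```python
import numpy as np

# Endpoints from the example above: A(5, 1) and B(2, 3).
a = np.array([5.0, 1.0])
b = np.array([2.0, 3.0])

# Points on the segment AB are convex combinations x = θ*a + (1-θ)*b, θ ∈ [0, 1].
for theta in np.linspace(0.0, 1.0, 5):
    x = theta * a + (1 - theta) * b
    print(f"theta={theta:.2f} -> x={x}")

# A convex combination of k points: θ_i ∈ [0, 1] and Σθ_i = 1.
points = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.0]])  # vertices of a triangle
theta = np.array([0.2, 0.5, 0.3])                        # weights sum to 1
x = theta @ points
print("convex combination of the triangle's vertices:", x)
```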

6.2 Convex Set

If the line segment between any two points of a set C also lies in C, then C is called a convex set; otherwise it is a non-convex set. Mathematically:
$$\forall x_1, x_2 \in C,\ \forall \theta \in [0,1]:\quad x = \theta x_1 + (1-\theta)x_2 \in C$$
This generalizes to k points:
$$\forall x_1, x_2, \dots, x_k \in C,\ \forall \theta_i \in [0,1] \text{ with } \sum_{i=1}^k \theta_i = 1:\quad x = \sum_{i=1}^k \theta_i x_i \in C$$
The vector representation of the line segment is used in this definition: if the points x1 and x2 are in the set C, then every point on the segment x1x2 is also in C. In addition, the intersection of convex sets is still a convex set. A few examples of convex sets are shown below:

[Figure: examples of convex and non-convex sets]

6.3 Convex Function

For a convex set D and a function $f: D \to (-\infty, \infty]$, for any $X_1 \in D$, $X_2 \in D$:

  • if for all λ ∈ [0, 1] it always holds that $f(\lambda X_1 + (1-\lambda)X_2) \le \lambda f(X_1) + (1-\lambda)f(X_2)$, then the function f(x) is a convex function (Convex Function) on the convex set D;
  • if for all λ ∈ (0, 1) and $X_1 \neq X_2$ it always holds that $f(\lambda X_1 + (1-\lambda)X_2) < \lambda f(X_1) + (1-\lambda)f(X_2)$, then the function f(x) is a strictly convex function on the convex set D;
  • Strongly convex functions are defined in: Convex Optimization Theory (2).

The intuitive understanding is as follows:

[Figure: the chord between two points on the graph of a convex function lies above the graph]

The chord AB, i.e. $\lambda f(X_1) + (1-\lambda)f(X_2)$, is greater than or equal to f(x) for every x between X1 and X2.

For example:

$f(x) = x^3$ is not a convex function;

$f(x) = x$ is a convex function, but not a strictly convex function and not a strongly convex function;

$f(x) = e^x$ is a convex function and a strictly convex function, but not a strongly convex function, because $\lim_{x \to -\infty} f''(x) = 0$;

$f(x) = x^2$ is a convex function, a strictly convex function, and a strongly convex function.
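The inequality in the definition can also be spot-checked numerically. The following Python sketch (an addition for illustration; the function and variable names are my own) samples random pairs $x_1, x_2$ and random λ and tests whether $f(\lambda x_1 + (1-\lambda)x_2) \le \lambda f(x_1) + (1-\lambda)f(x_2)$ ever fails:

```python
import numpy as np

def looks_convex(f, xs, n_pairs=2000, seed=0):
    """Numerically spot-check the convexity inequality
    f(λx1 + (1-λ)x2) <= λf(x1) + (1-λ)f(x2) on random pairs drawn from xs."""
    rng = np.random.default_rng(seed)
    for _ in range(n_pairs):
        x1, x2 = rng.choice(xs, size=2)
        lam = rng.uniform(0.0, 1.0)
        lhs = f(lam * x1 + (1 - lam) * x2)
        rhs = lam * f(x1) + (1 - lam) * f(x2)
        if lhs > rhs + 1e-9:          # tolerance for floating-point error
            return False              # found a violating pair: not convex on xs
    return True                       # no violation found (a spot check, not a proof)

xs = np.linspace(-3.0, 3.0, 601)
print(looks_convex(lambda x: x**3, xs))   # False: x^3 is not convex on [-3, 3]
print(looks_convex(lambda x: x**2, xs))   # True
print(looks_convex(np.exp, xs))           # True
```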

For the first-order and second-order conditions of convex functions, see: Convex Optimization Learning (3) .

Various properties of convex functions and their proofs can be found in: A detailed explanation of convex functions and convex optimization . Its properties include:

  • Property 1: Let $f: \mathbb{R}^n \to \mathbb{R}^1$ and let C be a convex set. If f is a convex function, then for any β the level set $D_\beta = \{x \mid f(x) \le \beta,\ x \in C\}$ is a convex set;
  • Property 2: The local minimum of a convex optimization problem is the global minimum ;
  • Property 3: The Hessian matrix of a convex function is positive semi-definite;
  • Property 4: If $x, y \in \mathbb{R}^n$ and Q is a positive semi-definite symmetric matrix, then $f(x) = x^TQx$ is a convex function (a numerical spot-check of this is sketched after this list);
  • ...
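A small numerical spot-check of Properties 3 and 4 (a sketch added here, not from the original notes): for $f(x) = x^TQx$ the Hessian is $2Q$, so convexity can be checked by confirming that the eigenvalues of the Hessian are non-negative.

```python
import numpy as np

# For f(x) = x^T Q x the Hessian is 2Q, so f is convex iff Q is positive semi-definite.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Q = A.T @ A                              # A^T A is always symmetric positive semi-definite

hessian = 2 * Q
eigvals = np.linalg.eigvalsh(hessian)    # eigvalsh: eigenvalues of a symmetric matrix
print("Hessian eigenvalues:", eigvals)
print("positive semi-definite:", bool(np.all(eigvals >= -1e-10)))
```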

6.4 Convex Optimization

A convex optimization problem has the form:
$$\text{minimize } f(x) \quad \text{subject to } x \in C$$
where f is a convex function, C is a convex set, and x is the optimization variable. Since writing it this way can be a bit unclear, we usually write it as:
$$\begin{aligned} \text{minimize } \ & f(x) \\ \text{subject to } \ & g_i(x) \le 0,\ i = 1, 2, \dots, m \\ & h_i(x) = 0,\ i = 1, 2, \dots, p \end{aligned}$$
where f is a convex function, each $g_i$ is a convex function (the constraints require $g_i(x) \le 0$), each $h_i$ is an affine function, and x is the optimization variable.

Affine set: an affine set contains, for any two distinct points in it, the whole line through those points (every affine set can be represented as the solution set of a system of linear equations), i.e. $x = \theta x_1 + (1-\theta)x_2,\ \ \theta \in \mathbb{R}$, as shown below:

Affine Set Example Diagram

What a convex optimization problem is can be understood through the following properties of convex optimization:

  • The purpose is to find the minimum value of the objective function;
  • The objective function f(x) and the inequality-constraint functions $g_i(x)$ are all convex functions, and the domain of definition/feasible region is a convex set;
  • If there are equality-constraint functions, then each equality-constraint function $h_i(x)$ is an affine function; an affine function is a polynomial function whose highest degree is 1, with general form $f(x) = Ax + b$, where A is an $m \times k$ matrix, x is a k-dimensional vector, and b is an m-dimensional vector;
  • The convex optimization problem has a good property that the local optimal solution is the global optimal solution.

It is important to note the direction of these inequalities: the convex functions $g_i(x)$ must be constrained to be less than or equal to zero. This is because the 0-sublevel set of a convex function $g_i(x)$ is a convex set, so the feasible region, being the intersection of many convex sets, is also a convex set. If instead we required $g_i(x) \ge 0$ for some convex function $g_i$, the feasible region would no longer be a convex set, and the algorithms we use to solve these problems would no longer be guaranteed to find the global optimal solution.

Also note that only affine functions are allowed as equality constraints. Intuitively, you can think of an equality constraint $h_i(x) = 0$ as equivalent to the two inequality constraints $h_i(x) \le 0$ and $h_i(x) \ge 0$. However, both of these constraints are convex only when $h_i(x)$ is both convex and concave, so $h_i(x)$ must be an affine function.

The optimal value of the optimization problem is denoted $p^*$ (sometimes $f^*$), and it equals the smallest possible value of the objective function over the feasible region:
$$p^* = \min\{f(x) : g_i(x) \le 0,\ i = 1, 2, \dots, m;\ h_i(x) = 0,\ i = 1, 2, \dots, p\}$$
When $f(x^*) = p^*$, we call $x^*$ an optimal point. Note that there can be multiple optimal points even though the optimal value is finite.
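To make the standard form concrete, here is a minimal sketch using the cvxpy library (cvxpy is not mentioned in the original notes and is used here only as an assumed tool for illustration): it minimizes a convex objective under one convex inequality constraint and one affine equality constraint, then reads off $p^*$ and an optimal point $x^*$.

```python
import cvxpy as cp
import numpy as np

# minimize   f(x)  = ||x - (3, 2)||_2^2       (convex objective)
# subject to g_1(x) = x_1 + x_2 - 1 <= 0      (convex inequality)
#            h_1(x) = x_1 - x_2      = 0      (affine equality)
x = cp.Variable(2)
target = np.array([3.0, 2.0])
problem = cp.Problem(
    cp.Minimize(cp.sum_squares(x - target)),
    [cp.sum(x) - 1 <= 0, x[0] - x[1] == 0],
)

p_star = problem.solve()          # optimal value p*
print("p* =", p_star)             # 8.5 for this instance
print("x* =", x.value)            # [0.5, 0.5]
```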

6.5 Common convex optimization problems (special cases of convex problems)

For various reasons, it is often convenient to consider special cases of general convex programming formulations. For these special cases, we can usually design very efficient algorithms to solve very large problems, and because of this, when people use convex optimization techniques, you may see these special cases:

  • Linear programming (linear program, LP) :
    • If the objective function f(x) and the inequality constraints $g_i(x)$ are all affine functions, then the convex optimization problem is a linear programming problem. In other words, these problems have the form:
    • minimize $c^Tx + d$,  subject to $Gx \preceq h$,  $Ax = b$
    • where $x \in \mathbb{R}^n$ is the optimization variable and $c \in \mathbb{R}^n$, $d \in \mathbb{R}$, $G \in \mathbb{R}^{m \times n}$, $h \in \mathbb{R}^m$, $A \in \mathbb{R}^{p \times n}$, $b \in \mathbb{R}^p$ are defined by the specific problem; the symbol ⪯ denotes elementwise (componentwise) inequality between vectors.
  • Quadratic Program (QP) :
    • If the inequality constraints are affine (as in linear programming) and the objective function f(x) is a convex quadratic function, then the convex optimization problem is a quadratic programming problem. In other words, these problems have the form:
    • minimize $\frac{1}{2}x^TPx + c^Tx + d$,  subject to $Gx \preceq h$,  $Ax = b$
    • The variables are defined as for linear programming, according to the specific problem; in addition there is a symmetric positive semi-definite matrix $P \in S_+^n$.
  • Quadratic constrained quadratic program (quadratically constrained quadratic program, QCQP) :
    • If both the objective function f(x) and the inequality constraints $g_i(x)$ are convex quadratic functions, then the convex optimization problem is a quadratically constrained quadratic programming problem, of the following form:
    • minimize $\frac{1}{2}x^TPx + c^Tx + d$,  subject to $\frac{1}{2}x^TQ_ix + r_i^Tx + s_i \le 0,\ i = 1, 2, \dots, m$,  $Ax = b$
    • The variables are interpreted as in quadratic programming, except that there are also $Q_i \in S_+^n$, $r_i \in \mathbb{R}^n$, $s_i \in \mathbb{R}$, where $i = 1, 2, \dots, m$.
  • Semidefinite Programming (SDP) :
    • The last example is more complicated than the previous ones, so don't worry if you don't understand it at first. However, semidefinite programming is becoming more and more popular in research in many areas of machine learning, so you may encounter these problems at some point in the future, so it is good to know what semidefinite programming is in advance.
    • We say that a convex optimization problem is a semidefinite program (SDP) if it has the following form:
    • minimize $\mathrm{tr}(CX)$,  subject to $\mathrm{tr}(A_iX) = b_i,\ i = 1, 2, \dots, p$,  $X \succeq 0$
    • where the symmetric matrix $X \in S^n$ is the optimization variable, the symmetric matrices $C, A_1, \dots, A_p \in S^n$ and the scalars $b_1, \dots, b_p \in \mathbb{R}$ are defined by the specific problem, and the constraint $X \succeq 0$ means that X is a positive semi-definite matrix.
    • The above looks a bit different from the problems we saw earlier, because the optimization variable is now a matrix instead of a vector. If you're curious why such a formulation might be useful, you should look at a more advanced course or book on convex optimization.

It is obvious from the definitions that quadratic programming is more general than linear programming (linear programming is just the special case of quadratic programming with P = 0), and likewise quadratically constrained quadratic programming is more general than quadratic programming. What is not obvious is that semidefinite programs are in fact more general than all of the previous types; that is, any quadratically constrained quadratic program (and hence any quadratic or linear program) can be expressed as a semidefinite program.
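As a hedged illustration of the LP-as-a-special-case-of-QP relationship (again using cvxpy, which is only an assumed tool here), the following sketch solves a small QP in the form above and then re-solves it with P = 0, which reduces it to an LP:

```python
import cvxpy as cp
import numpy as np

# A tiny QP in the form above: minimize 1/2 x^T P x + c^T x  s.t.  Gx <= h, Ax = b.
P = np.array([[2.0, 0.0], [0.0, 2.0]])            # symmetric positive semi-definite
c = np.array([-4.0, -6.0])
G = np.array([[1.0, 1.0]]); h = np.array([2.0])   # x1 + x2 <= 2
A = np.array([[1.0, -1.0]]); b = np.array([0.0])  # x1 = x2

x = cp.Variable(2)
qp = cp.Problem(
    cp.Minimize(0.5 * cp.quad_form(x, P) + c @ x),
    [G @ x <= h, A @ x == b],
)
qp.solve()
print("QP optimum:", x.value)

# Setting P = 0 turns the same problem into an LP, the special case noted above.
lp = cp.Problem(cp.Minimize(c @ x), [G @ x <= h, A @ x == b])
lp.solve()
print("LP optimum:", x.value)
```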

So far, we've discussed a lot of the boring math behind convex optimization, as well as the formal definitions. Next, we can finally get to the fun part: using these techniques to solve practical problems, see: Convex Optimization Learning (4) - Convex Optimization Problems .

Related content of duality, KKT conditions, and sensitivity analysis can be found in: Convex Optimization Study Notes: Duality, KKT Conditions, Sensitivity Analysis .

The optimization algorithms of convex optimization can be seen: convex optimization algorithm, unconstrained optimization algorithm, constrained optimization algorithm , and unconstrained convex optimization problem solving algorithm .

7. Optimization algorithm

Reference content in this section: This article understands various neural network optimization algorithms , what are the optimization algorithms of neural networks , principles and code interpretation from SGD to AdamW , optimization algorithms commonly used in neural networks, and ten optimizers of PyTorch .

This section is mainly organized around the 13 optimizers supported by torch.optim in PyTorch, plus the necessary extensions. However, since I had already read about the principles of some optimizers before writing this, and going through them all in detail would be rather cumbersome, here I mostly enumerate and introduce them; the detailed principles are only given as links.

In machine learning, in order to optimize the objective function, an appropriate optimization algorithm has to be designed so that the optimization is done accurately and efficiently. The role of the optimization algorithm is to minimize (or maximize) the loss function E(x) by improving the way training is done. The commonly used first-order optimization algorithm is gradient descent, and second-order optimization algorithms include Newton's method. The overall taxonomy is as follows (see the overview of optimization algorithms):

[Figure: overview/taxonomy of optimization algorithms]

Optimization algorithms fall into two broad categories:

  1. First-order optimization algorithms: these algorithms use the gradient values of the parameters to minimize or maximize the loss function E(x). The most commonly used first-order optimization algorithm is gradient descent. For a univariate function the derivative is used; for multivariate functions the gradient is used.
  2. Second-Order Optimization Algorithms: Second-order optimization algorithms use second-order derivatives (also called Hessian methods ) to minimize or maximize the loss function. This method is not widely used due to the computational cost of the second derivative.
    • Newton's method is a second-order optimization algorithm: like gradient descent it is iterative, but whereas gradient descent uses only the gradient (first-order information), Newton's method uses the inverse of the second-order Hessian matrix in its update. Newton's method converges in fewer iterations, but each iteration is more expensive than a gradient-descent step (the computation is costly, so in practice quasi-Newton methods are used instead);
    • Representative quasi-Newton algorithms are the DFP algorithm and the BFGS algorithm; among them, the **LBFGS (Limited-memory Broyden–Fletcher–Goldfarb–Shanno, L-BFGS)** optimizer in torch.optim is based on BFGS and further improves efficiency under limited memory through approximation, so its main characteristic is that it saves memory. A detailed introduction can be found in: Understand the L-BFGS algorithm in this article.

For neural networks, especially deep neural networks, which have complex structures and large numbers of parameters, the widely used optimizers (from SGD to Adam) are all different variants of gradient descent.

7.1 Traditional Gradient Descent

This part includes: Batch Gradient Descent, Mini Batch Gradient Descent, Stochastic Gradient Descent ( SGD ) and Averaged Stochastic Gradient Descent ( ASGD / SAG ) . Part of it should be mentioned in the previous study, which can be seen in detail: Neural Network Optimization Algorithm .

But using gradient descent also comes with some challenges:

  • It is difficult to choose an appropriate learning rate. A learning rate that is too small will cause the network to converge too slowly, while a learning rate that is too large may affect convergence and cause the loss function to fluctuate at the minimum value, or even to have gradient divergence.
  • Also, the same learning rate does not apply to all parameter updates. If the training set data is sparse, and the feature frequencies are very different, they should not all be updated to the same extent, but a larger update rate should be used for features that appear infrequently.
  • Another key challenge in minimizing the non-convex error functions of neural networks is avoiding getting trapped in local minima. In fact, the difficulty usually does not come from local minima but from saddle points, i.e. points where one dimension slopes up and another slopes down. These saddle points are usually surrounded by a plateau of nearly constant error, which makes it hard for the SGD algorithm to escape because the gradient is close to zero in all dimensions.

You can take a look at the comparison animation of SGD and other optimizers in advance . The last one that is not out of the local optimum is SGD:

[Animation: comparison of optimizer trajectories; SGD is the last to escape the local optimum]

Therefore, we need to further optimize gradient descent. The improvement of the optimization algorithm is nothing more than two directions: ① a more accurate gradient direction (adding momentum ); ② an adaptive learning rate (adding a decay rate ).

7.2 Momentum-based gradient descent

The first is the momentum-based method (Momentum): the idea is very simple, that is, to smooth the high-frequency tiny fluctuations during SGD optimization, thereby speeding up the optimization process. The method is derived from the momentum in physics. In the process of parameter optimization, a certain amount of kinetic energy is accumulated, so it is not easy to change the update direction. The final update direction and size are jointly determined by the past gradient and the current gradient.

It is like a small ball rolling downhill (i.e. the process of parameter optimization): if it reaches a flat region it does not stop immediately but keeps rolling forward. The momentum method not only helps to speed up learning, it also helps avoid getting stuck at points such as saddle points. For example: SGDM (SGD with Momentum, stochastic gradient descent with momentum).

[Figure: illustration of momentum-based updates]

However, the momentum-based update can easily overshoot the optimal point: its accumulated kinetic energy makes it oscillate back and forth around the optimum, or even roll straight past it. Therefore, an improved method was proposed: Nesterov Accelerated Gradient (NAG).

The improvement is that on the basis of Momentum, let the ball tentatively move forward one step with the existing kinetic energy, and then calculate the gradient after the movement, and use this gradient to update the current parameters. Thus, the ball has the information to perceive the surrounding environment in advance. This allows the ball to slow down when it is close to the optimum point.
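Below is a minimal usage sketch of SGDM and its Nesterov variant via torch.optim (the toy model and the random batch are assumptions added purely for illustration):

```python
import torch
import torch.nn as nn

# A toy model standing in for a real network (assumed for illustration only).
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()

# SGD with momentum (SGDM); set nesterov=True for the NAG variant described above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)

x = torch.randn(32, 10)          # dummy batch
y = torch.randn(32, 1)

optimizer.zero_grad()            # clear accumulated gradients
loss = loss_fn(model(x), y)
loss.backward()                  # compute gradients
optimizer.step()                 # momentum-based parameter update
```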

7.3 Gradient descent with adaptive learning rate

Different parameters differ in importance and in how much they should be updated each time. For parameters that are updated infrequently, we want a larger learning rate, so that more can be learned from the few samples that do affect them; for parameters that are updated frequently, a large amount of sample information has already been accumulated, so we do not want them to be too sensitive to a single sample and prefer smaller updates.

The most classic is **AdaGrad (Adaptive Gradient, the adaptive gradient algorithm)**, which uses the sum of squared historical gradients (i.e. the second-order momentum, a measure of how often a parameter has been updated) to adjust the learning rate η: infrequently updated parameters get larger updates and frequently updated parameters get smaller updates.

However, AdaGrad has a defect: as the accumulated squared gradient Vt keeps growing, the learning rate becomes smaller and smaller and the updates eventually stop. Therefore, several improved methods were proposed:

  • AdaDelta (adaptive learning-rate adjustment) changes the way the second-order momentum is computed (the basic idea is to approximate the second-order Newton method with a first-order method): instead of accumulating all historical gradients, it only looks at the descending gradients within a window over the recent past (this is the origin of the Delta in the name AdaDelta);

  • RMSprop (root mean square propagation) similarly improves how AdaGrad weights past gradients by introducing an exponentially weighted moving average, i.e. giving more weight to gradients closer to the present (RMS stands for root mean square); it is very similar to AdaDelta (see the usage sketch after this list).

    RMSProp is also an improvement of the RProp (Resilient Back Propagation) algorithm. The introduction and comparison of the two can be seen: Elastic Back Propagation (RProp) and Root Mean Square Back Propagation (RMSProp) .
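A short usage sketch of the adaptive learning-rate optimizers in torch.optim (the toy model is an assumption for illustration, and in practice only one optimizer would be created per model; the hyperparameter values shown are just the common defaults):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)   # toy model, assumed for illustration

# Adaptive learning-rate optimizers available in torch.optim:
adagrad = torch.optim.Adagrad(model.parameters(), lr=0.01)
adadelta = torch.optim.Adadelta(model.parameters(), rho=0.9)            # rho: decay of the squared-gradient window
rmsprop = torch.optim.RMSprop(model.parameters(), lr=0.01, alpha=0.99)  # alpha: smoothing constant
```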

7.4 The method of fusing momentum and adaptive learning rate

SGD-M adds first-order momentum to SGD, while AdaGrad and AdaDelta add second-order momentum to SGD. Using both first-order and second-order momentum gives the **Adam (Adaptive Moment Estimation) algorithm**: Adaptive + Momentum (it can also be viewed as Momentum + RMSProp).

  • SparseAdam in torch.optim is a stripped-down version of Adam intended for sparse tensors.

However, Adam has shortcomings: it may fail to converge and may miss the global optimal solution, so research on improved Adam variants has continued. The main ones are listed below (roughly in order of appearance):

  • Adamax : Adds a concept of a learning rate upper limit to Adam, so it is also called Adamax;
  • NAdam:Nesterov + Adam = Nadam;
  • ⭐ **AdamW**: Adam with decoupled weight decay (often summarized as Adam + L2 regularization); it is currently among the fastest-converging optimizers in practical applications (AdamW is used in BERT);
    • Correspondingly, SGDM with L2 regularization added in the same way is the SGDWM algorithm.
  • AMSGrad: in torch.optim, the AMSGrad algorithm can be enabled through Adam's parameter amsgrad=True. AMSGrad is an improvement to Adam that adds an extra constraint so that the effective learning rate never increases (though the experimental results reportedly are not ideal);
  • RAdam (Rectified Adam) : "Rectified Adam" (Rectified Adam), it can dynamically turn on or off the adaptive learning rate according to the variance dispersion, and provides a method that does not require adjustable parameter learning rate warm-up (For details, see the paper: RAdam , but some people say that the effect is average).
  • In addition, I saw AdaBound , PAdam, ZO-AdaMM , AdaShift , ACProp ... But there seems to be no encapsulation in torch.

Adam and SGDM, the two most widely used deep learning optimizers today, each have their own advantages in efficiency and accuracy: Adam optimizes faster in the early stage, while SGDM reaches higher accuracy in the later stage. Hence the idea of using Adam early in training and switching to SGDM later, for example the SWATS (Switching from Adam to SGD) algorithm.
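A usage sketch of the momentum-plus-adaptive optimizers discussed above, as exposed in torch.optim (the toy model is assumed for illustration, and only one optimizer would be used in real training):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)   # toy model, assumed for illustration

# Adam combines first- and second-order momentum; betas are the decay rates β1, β2.
adam = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

# The AMSGrad variant is enabled through a flag on Adam.
adam_amsgrad = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)

# AdamW decouples weight decay from the gradient-based update.
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```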

7.5 Other Improvement Methods

  1. Warm-up mechanism

Warm-up is a learning-rate warm-up method mentioned in the ResNet paper. At the beginning of training, a smaller learning rate is used for some number of steps (e.g. 15000 steps, see code example 1 in the referenced articles) or epochs (e.g. 5 epochs, see code example 2), after which the learning rate is switched to the preset value for the rest of training.

The point of warm-up is that at the initial stage of training the model is still unfamiliar with the data, so it needs a small learning rate to learn slowly and keep correcting the weight distribution. If a large learning rate is used from the start and the direction happens to be right, there is little harm; but once training goes wrong, it may take many epochs to pull it back, or it may never be pulled back, which directly leads to overfitting. After learning with a small learning rate for a while, the model has seen each batch of data several times and has formed some prior knowledge; at that point a large learning rate can be used to accelerate learning, and the previously learned prior knowledge keeps the model heading in the right direction.

The warm-up described above is constant warm-up; its disadvantage is that jumping from a small learning rate to a relatively large one may cause the training error to increase suddenly. To solve this, Facebook proposed gradual warm-up in 2018: start from the initial small learning rate and increase it a little at every step, until it reaches the originally set, relatively large learning rate, which is then used for the rest of training.

Related articles can be found: pytorch's warm-up warm-up learning strategy , [tuning method] - warmup , learning rate warm-up Warmup .

In PyTorch this is mainly implemented with the learning-rate scheduler. The lr_scheduler methods are described in: torch.optim of PyTorch source code interpretation: detailed explanation of optimization algorithm interface, warm-up warm-up learning strategy of pytorch.
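Below is a hedged sketch of gradual warm-up implemented with torch.optim.lr_scheduler.LambdaLR (the warm-up length and the flat schedule after warm-up are assumptions for illustration; real code would usually add a decay phase after warm-up):

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import LambdaLR

model = nn.Linear(10, 1)                     # toy model, assumed for illustration
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

warmup_steps = 1000                          # assumed warm-up length

def lr_lambda(step):
    # Gradual warm-up: scale the base lr linearly from ~0 to 1 over warmup_steps,
    # then keep it constant (a decay schedule could follow the warm-up in practice).
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    return 1.0

scheduler = LambdaLR(optimizer, lr_lambda=lr_lambda)

for step in range(5):                        # training-loop skeleton
    optimizer.step()                         # parameter update (after loss.backward())
    scheduler.step()                         # advance the warm-up schedule
    print(step, scheduler.get_last_lr())
```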

  2. Lookahead (k steps forward, 1 step back)

Lookahead (k steps forward, 1 step back) is essentially an optimizer built on top of other optimizers; any optimizer can be used as its inner optimizer. In the authors' words: a universal wrapper for all optimizers.

Lookahead works as follows: each time the inner optimizer takes k steps forward, a weighted average of the weights at step 1 and step k is taken (the k-2 intermediate steps are ignored). In the illustrative figure, the blue line is the k-step path taken by the inner optimizer; the segment connecting the start and end of the blue path (the red line) is then drawn, and a point on that red line (determined by α) is taken as the weights for step k+1.

Reasons why Lookahead works:

  • Standard optimization methods often require careful tuning of the learning rate to prevent oscillations and slow convergence, which is even more important in an SGD setting. Lookahead can alleviate this problem with a large inner loop learning rate;
  • When Lookahead oscillates along high-curvature directions, the fast weights still advance quickly along low-curvature directions, and the slow weights smooth out the oscillation through parameter interpolation. The combination of fast and slow weights improves learning in high-curvature directions, reduces variance, and lets Lookahead converge faster in practice;
  • On the other hand, Lookahead can also improve the convergence effect. While the fast weights are slowly exploring around the minima, the slow weight update prompts Lookahead to aggressively explore new and better regions, resulting in improved test accuracy. Such exploration may be a level that SGD may not be able to achieve even after 20 updates, thus effectively improving the model convergence effect.

Related articles can be found: New optimization method Lookahead , optimization algorithm: "Lookahead Optimizer: k steps forward, 1 step back" .
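To make the mechanism concrete, here is a simplified Lookahead wrapper sketch (my own minimal reimplementation of the idea, not the authors' reference code; it only supports zero_grad/step and wraps any torch.optim optimizer):

```python
import torch

class Lookahead:
    """Minimal sketch of Lookahead: let the inner optimizer take k fast steps,
    then move the slow weights a fraction alpha toward the fast weights."""

    def __init__(self, inner_optimizer, k=5, alpha=0.5):
        self.inner = inner_optimizer
        self.k = k
        self.alpha = alpha
        self.step_count = 0
        # Copy of the current ("slow") weights for every parameter.
        self.slow_weights = [
            [p.detach().clone() for p in group["params"]]
            for group in self.inner.param_groups
        ]

    def zero_grad(self):
        self.inner.zero_grad()

    def step(self):
        self.inner.step()                      # one fast step of the inner optimizer
        self.step_count += 1
        if self.step_count % self.k == 0:      # every k steps: slow-weight update
            for group, slows in zip(self.inner.param_groups, self.slow_weights):
                for p, slow in zip(group["params"], slows):
                    # slow <- slow + alpha * (fast - slow), then reset fast to slow
                    slow.add_(p.detach() - slow, alpha=self.alpha)
                    p.data.copy_(slow)

# Example usage with SGDM as the inner optimizer (model is any nn.Module):
# base = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# optimizer = Lookahead(base, k=5, alpha=0.5)
```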

8. Overfitting and Underfitting

The meaning of overfitting and underfitting is basically known, and the solution is actually mentioned in the previous article. Here is a summary.

8.1 Mitigating Underfitting

  • Feature engineering: underfitting comes from insufficient learning. Consider adding features and mining more features from the data; sometimes you need to transform features, use combined features, higher-order features, polynomial features, and so on (see the sketch after this list);
  • Model complexity: Simple models can also lead to underfitting. For example, linear models can only fit data with a single function. Trying to use more advanced models/non-linear models helps to solve underfitting, such as using SVMs, neural networks, etc.;
  • Regularization: The regularization parameter is used to prevent overfitting. In the case of underfitting, consider reducing the regularization parameter;
  • Tuning Hyperparameters: Hyperparameters include:
    • In neural networks: the learning rate, the learning-rate decay rate, the number of hidden layers, the number of units per hidden layer, the β1 and β2 parameters of the Adam optimization algorithm, the batch_size value, etc.;
    • Among other algorithms: the number of trees in random forest, the number of clusters in k-means, the regularization parameter λ, etc.
  • Adding training data is often useless: underfitting is the lack of learning ability of the model, and no matter how much data is added to train it, it will not be able to learn well;
  • Ensemble learning: Use ensemble learning methods, such as bagging multiple weak learners.
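The sketch referenced in the feature-engineering item above (an illustration I am adding, using scikit-learn): a plain linear model underfits quadratic data, and adding polynomial features fixes the underfit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Data generated from a quadratic function: a plain linear model underfits it.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + rng.normal(scale=0.2, size=200)

linear = LinearRegression().fit(X, y)
print("linear R^2:", linear.score(X, y))              # low: underfitting

# Adding polynomial features increases model capacity and removes the underfit.
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)
print("poly   R^2:", poly.score(X, y))                # close to 1
```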

8.2 Mitigating overfitting

  • Cross-validation: obtain better model parameters through cross-validation;
  • Early stop strategy: It is essentially a cross-validation strategy, selecting an appropriate number of training times to avoid the trained network overfitting the training data;
  • Ensemble learning: bagging multiple weak learners gives a much better effect;
  • Feature engineering: reduce the number of features or use fewer feature combinations, and increase the divided interval for features that are discretized by intervals;
  • Regularization: L1 and L2 regularization are commonly used (L1 regularization can also automatically perform feature selection). If there is a regular term, you can consider increasing the regular term parameter;
  • Simplify the model: For linear regression, we can reduce the degree of polynomial, for neural networks, we can reduce the number of hidden layers, nodes, etc., for decision trees, we can perform pruning operations, etc.;
  • Data set expansion: increase the amount of training data to improve the generalization ability of the model; or increase the noise data to improve the robustness of the model;
  • Resampling: Estimate data distribution parameters based on the current dataset, use that distribution to generate more data, etc.;
  • Dropout strategy: when training a DNN with forward and backward propagation, for each batch of data a random subset of hidden-layer neurons is temporarily removed from the fully connected network;
  • BN (Batch Normalization): a normalization method in which each hidden layer's inputs are normalized so that every layer of the network sees inputs with a consistent distribution; the input distribution is pulled back toward a standard distribution with mean 0 and variance 1, so that the inputs to the nonlinear activation fall into a region where the function is more sensitive to its input, which helps avoid the vanishing-gradient problem. A minimal sketch combining Dropout, BN, and L2 regularization is shown after this list.
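The sketch referenced above (an illustrative addition; the layer sizes and hyperparameters are assumptions): a small PyTorch network combining Dropout, BatchNorm, and L2 regularization via the optimizer's weight_decay.

```python
import torch
import torch.nn as nn

# A small fully connected network using Dropout and BatchNorm (BN) layers.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),      # normalize each hidden layer's inputs (BN)
    nn.ReLU(),
    nn.Dropout(p=0.5),       # randomly drop half of the hidden units per batch
    nn.Linear(64, 1),
)

# L2 regularization is applied through the optimizer's weight_decay parameter.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

model.train()                # Dropout/BN behave differently in train vs. eval mode
x = torch.randn(32, 20)
y = torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
optimizer.step()
```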


Origin blog.csdn.net/Morganfs/article/details/124732184