[Deep Learning] Overview of Machine Learning (2) Optimization Algorithms: Gradient Descent (Batch BGD, Stochastic SGD, Mini-batch)


1. Basic concepts

  Machine learning: through algorithms, machines learn patterns from large amounts of data and use them to make decisions on new samples.
  Put differently, machine learning is the process of learning (or "guessing") general rules from limited observed data and then generalizing those rules to unobserved samples.

2. Three elements of machine learning

  Machine learning methods can be roughly broken down into three basic elements: the model, the learning criterion, and the optimization algorithm.

1. Model

a. Linear model

$$f(\mathbf{x}; \boldsymbol{\theta}) = \mathbf{w}^T \mathbf{x} + b$$

b. Nonlinear model

  A generalized nonlinear model can be written as a linear combination of multiple nonlinear basis functions $\boldsymbol{\phi}(\mathbf{x})$:

$$f(\mathbf{x}; \boldsymbol{\theta}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) + b$$

where $\boldsymbol{\phi}(\mathbf{x}) = [\phi_1(\mathbf{x}), \phi_2(\mathbf{x}), \ldots, \phi_K(\mathbf{x})]^T$ is a vector composed of $K$ nonlinear basis functions, and the parameters $\boldsymbol{\theta}$ comprise the weight vector $\mathbf{w}$ and the bias $b$.
  If $\boldsymbol{\phi}(\mathbf{x})$ is itself a learnable basis function, for example:

$$\phi_k(\mathbf{x}) = h(\mathbf{w}_k^T \boldsymbol{\phi}'(\mathbf{x}) + b_k)$$

where $h(\cdot)$ is a nonlinear function, $\boldsymbol{\phi}'(\mathbf{x})$ is another set of basis functions, and $\mathbf{w}_k$ and $b_k$ are learnable parameters, then the model $f(\mathbf{x}; \boldsymbol{\theta})$ is equivalent to a neural network model.
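The construction above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: $h$ is taken to be `tanh`, $\boldsymbol{\phi}'(\mathbf{x})$ is taken to be $\mathbf{x}$ itself, and the names `basis`, `model`, `W`, and `c` (the hidden-layer weights and offsets) are hypothetical, not from the article.

```python
import math

def basis(x, W, c, h=math.tanh):
    """Return [phi_1(x), ..., phi_K(x)] with phi_k(x) = h(w_k^T x + b_k).

    W holds one weight row w_k per basis function, c holds the offsets b_k.
    Assumption: phi'(x) is the identity, so each basis function sees x directly.
    """
    return [h(sum(w_ki * x_i for w_ki, x_i in zip(w_k, x)) + b_k)
            for w_k, b_k in zip(W, c)]

def model(x, w, b, W, c):
    """f(x; theta) = w^T phi(x) + b, where theta = (w, b, W, c)."""
    phi = basis(x, W, c)
    return sum(w_k * phi_k for w_k, phi_k in zip(w, phi)) + b

# Hypothetical values: K = 2 basis functions on a 2-dimensional input.
x = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 0.0]]
c = [0.0, 0.0]
w = [2.0, 3.0]
b = 1.0
y = model(x, w, b, W, c)
```

With these values the first basis function evaluates to `tanh(1.0)` and the second to `tanh(0.0) = 0`, so `y` equals `2 * tanh(1.0) + 1`. Structurally this is a network with one hidden layer of $K$ units.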

2. Learning criterion

a. Loss function

b. Risk minimization criteria

See the previous article: [Deep Learning] Overview of Machine Learning (1) Three Elements of Machine Learning - Model, Learning Criteria, and Optimization Algorithm

3. Optimization

The machine learning problem is transformed into an optimization problem

  Once the training set $\mathcal{D}$, the hypothesis space $\mathcal{F}$, and the learning criterion are determined, the remaining task is to find the optimal model $f(\mathbf{x}, \boldsymbol{\theta}^*)$ with an optimization algorithm. The training process of machine learning is essentially the process of solving an optimization problem.

a. Parameters and hyperparameters

  Optimization can be divided into two aspects: parameter optimization and hyperparameter optimization:

  1. Parameter optimization: the $\boldsymbol{\theta}$ in the model $f(\mathbf{x}; \boldsymbol{\theta})$ are called the parameters of the model, and they are learned by the optimization algorithm. They are updated iteratively by algorithms such as gradient descent so as to minimize the loss function.

  2. Hyperparameter optimization: In addition to the learnable parameters $\boldsymbol{\theta}$, there is another class of parameters that define the model structure or the optimization strategy. These are called hyperparameters. For example, the number of clusters in a clustering algorithm, the learning rate in the gradient descent method, the coefficient of the regularization term, the number of layers of a neural network, and the kernel function in a support vector machine are all hyperparameters. Unlike learnable parameters, choosing hyperparameters is usually a combinatorial optimization problem that is difficult to solve automatically with an optimization algorithm. In practice, hyperparameters are usually set from experience, or by using a search method to try out a set of hyperparameter combinations and adjusting them by trial and error.
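The trial-and-error search described in item 2 can be sketched as a small grid search over one hyperparameter, the learning rate. Everything here is an assumption made for illustration: the toy data, the one-parameter model $y \approx w x$, the candidate values, and the helper names `train` and `mse`.

```python
def train(lr, data, steps=50):
    """Fit y ~ w * x by gradient descent on the mean squared error."""
    w = 0.0
    for _ in range(steps):
        # Gradient of mean((w*x - y)^2) with respect to w.
        grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def mse(w, data):
    """Mean squared error of the model y = w * x on a data set."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

# Hypothetical toy data generated from y = 2x.
train_set = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
valid_set = [(4.0, 8.0), (5.0, 10.0)]

# Try each candidate, score it on held-out data, keep the best one.
candidates = [0.001, 0.01, 0.1]
scores = {lr: mse(train(lr, train_set), valid_set) for lr in candidates}
best_lr = min(scores, key=scores.get)
```

The parameter `w` is learned inside `train`, while the learning rate is chosen by the outer loop over `candidates`, which is the distinction between parameters and hyperparameters drawn above. On this toy problem the search selects `0.1`.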

b. Gradient descent method

  In machine learning, one of the simplest and most commonly used optimization algorithms is the gradient descent method. Gradient descent is used to minimize a function, usually a loss function or risk function. The gradient of this function with respect to the model parameters (weights) points in the direction in which the function value increases fastest. The gradient descent method therefore moves the parameters a small step in the opposite direction at each iteration, so that the function value gradually decreases.

Iterative formula of gradient descent method

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \alpha \frac{\partial \mathcal{R}_{\mathcal{D}}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}$$

where:

  • $\boldsymbol{\theta}_t$ is the parameter value at the $t$-th iteration.
  • $\alpha$ is the learning rate, which controls the step size of the parameter update.
  • $\mathcal{R}_{\mathcal{D}}(\boldsymbol{\theta})$ is the risk function (or loss function) evaluated on the training set $\mathcal{D}$.

The goal of the gradient descent method is to minimize the risk function by iteratively adjusting parameters.
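A minimal sketch of this iteration, assuming a one-dimensional toy risk $\mathcal{R}(\theta) = (\theta - 3)^2$ chosen only because its minimizer $\theta^* = 3$ is known. The function name `gradient_descent` and all numeric values are hypothetical.

```python
def gradient_descent(grad, theta0, alpha, steps):
    """Repeat theta <- theta - alpha * grad(theta) for a fixed number of steps."""
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta

# Toy risk R(theta) = (theta - 3)^2, so dR/dtheta = 2 * (theta - 3).
def grad(theta):
    return 2.0 * (theta - 3.0)

theta_star = gradient_descent(grad, theta0=0.0, alpha=0.1, steps=100)
```

Each step here shrinks the distance to the minimizer by a factor of $1 - 2\alpha = 0.8$, so after 100 steps `theta_star` is very close to 3.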

Specific parameter update formula

The parameter update formula can be specified as:

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \alpha \frac{1}{N} \sum_{n=1}^{N} \frac{\partial \mathcal{L}(y^{(n)}, f(\mathbf{x}^{(n)}; \boldsymbol{\theta}))}{\partial \boldsymbol{\theta}}$$

where:

  • $N$ is the number of samples in the training set.
  • $\mathcal{L}(y^{(n)}, f(\mathbf{x}^{(n)}; \boldsymbol{\theta}))$ is the loss function, which measures the model's prediction error on sample $n$.
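
As a worked instance of this formula, assume (purely for illustration, since the article does not fix a model or a loss here) the linear model $f(\mathbf{x}; \boldsymbol{\theta}) = \mathbf{w}^T \mathbf{x} + b$ and the squared loss $\mathcal{L}(y, f) = \frac{1}{2}(y - f)^2$. The per-sample gradients are then

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = -\left(y^{(n)} - \mathbf{w}^T \mathbf{x}^{(n)} - b\right)\mathbf{x}^{(n)}, \qquad \frac{\partial \mathcal{L}}{\partial b} = -\left(y^{(n)} - \mathbf{w}^T \mathbf{x}^{(n)} - b\right)$$

and substituting them into the update gives

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \frac{1}{N} \sum_{n=1}^{N} \left(y^{(n)} - \mathbf{w}_t^T \mathbf{x}^{(n)} - b_t\right)\mathbf{x}^{(n)}, \qquad b_{t+1} = b_t + \alpha \frac{1}{N} \sum_{n=1}^{N} \left(y^{(n)} - \mathbf{w}_t^T \mathbf{x}^{(n)} - b_t\right)$$

Under these assumptions, each parameter moves in proportion to the average residual on the training set, weighted by the input in the case of $\mathbf{w}$.
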
Learning rate choice

  The learning rate $\alpha$ is a key hyperparameter that determines the step size of each parameter update, so it is important to choose it appropriately. A learning rate that is too small makes convergence slow, while a learning rate that is too large may cause the parameters to oscillate or diverge during optimization.
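The effect can be seen on a toy problem. The sketch below assumes the risk $\mathcal{R}(\theta) = \theta^2$ (gradient $2\theta$, minimum at 0); the three learning rates and the helper name `run` are hypothetical choices made to show each regime.

```python
def run(alpha, steps=20, theta=1.0):
    """Gradient descent on R(theta) = theta**2, whose gradient is 2 * theta."""
    for _ in range(steps):
        theta -= alpha * 2.0 * theta
    return theta

too_small = run(alpha=0.001)   # each step multiplies theta by 0.998
reasonable = run(alpha=0.1)    # each step multiplies theta by 0.8
too_large = run(alpha=1.1)     # each step multiplies theta by -1.2
```

After 20 steps the smallest rate has barely moved away from the starting value 1.0, the middle rate is close to the minimum, and the largest rate overshoots on every step so that the magnitude of `theta` grows instead of shrinking.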

  One improvement on the basic gradient descent method is to use adaptive learning rate variants such as Adagrad, RMSprop, and Adam. These algorithms automatically adjust the learning rate of each parameter according to its gradient history, and so adapt more flexibly to the update needs of different parameters.
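As one concrete case, here is a minimal sketch of the Adagrad update, in which each parameter's step is divided by the square root of its accumulated squared gradients. The toy objective, the function name `adagrad`, and the values of `alpha`, `eps`, and `steps` are assumptions for illustration; RMSprop and Adam use different accumulators and are not shown.

```python
import math

def adagrad(grad, theta, alpha=0.5, eps=1e-8, steps=200):
    """Adagrad: per-parameter step alpha * g / (sqrt(sum of past g^2) + eps)."""
    theta = list(theta)
    accum = [0.0] * len(theta)          # running sum of squared gradients
    for _ in range(steps):
        g = grad(theta)
        for i in range(len(theta)):
            accum[i] += g[i] ** 2
            theta[i] -= alpha * g[i] / (math.sqrt(accum[i]) + eps)
    return theta

# Toy objective R(theta) = 50 * theta_0^2 + 0.5 * theta_1^2:
# the two coordinates have gradients that differ in scale by a factor of 100.
def grad(th):
    return [100.0 * th[0], 1.0 * th[1]]

theta = adagrad(grad, [1.0, 1.0])
```

With a single fixed learning rate of 0.5, plain gradient descent would multiply the first coordinate by $1 - 0.5 \cdot 100 = -49$ at every step and diverge. Here the accumulated gradient history rescales each coordinate separately, and both converge toward 0.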

c. Stochastic gradient descent


Batch Gradient Descent (BGD)

  In batch gradient descent, every iteration computes the gradient over the entire training set before updating the model parameters, which leads to high computational cost and memory requirements on large-scale data sets. Its iterative update rule is as follows:

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \alpha \nabla \mathcal{R}_{\mathcal{D}}(\boldsymbol{\theta}_t)$$

where $\alpha$ is the learning rate and $\nabla \mathcal{R}_{\mathcal{D}}(\boldsymbol{\theta}_t)$ is the gradient of the loss function over the entire training set with respect to the parameters $\boldsymbol{\theta}_t$.
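A minimal sketch of batch gradient descent, assuming a one-dimensional linear model $y \approx w x + b$ with the squared loss $\frac{1}{2}(w x + b - y)^2$. The toy data and the name `batch_gradient_descent` are hypothetical.

```python
def batch_gradient_descent(data, alpha=0.05, epochs=2000):
    """Fit y ~ w*x + b; every update averages the gradient over ALL samples."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        # Full pass over the training set to compute one gradient.
        grad_w = sum((w * x + b - y) * x for x, y in data) / n
        grad_b = sum((w * x + b - y) for x, y in data) / n
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b

# Hypothetical noise-free data generated from y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]
w, b = batch_gradient_descent(data)
```

Each of the 2000 updates touches all five samples, which is cheap here but is exactly the cost that grows with the size of the training set.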

Stochastic Gradient Descent (SGD)

  Stochastic gradient descent reduces computational cost by estimating the gradient using only one sample in each iteration. Its iterative update rules are as follows:

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \alpha \nabla \mathcal{L}(\boldsymbol{\theta}_t, \mathbf{x}_i, y_i)$$

where $\nabla \mathcal{L}(\boldsymbol{\theta}_t, \mathbf{x}_i, y_i)$ is the gradient of the loss function on the randomly selected sample $(\mathbf{x}_i, y_i)$ with respect to the parameters $\boldsymbol{\theta}_t$.
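A minimal sketch of this single-sample update, again assuming a one-dimensional linear model $y \approx w x + b$ with the squared loss $\frac{1}{2}(w x + b - y)^2$. The toy data, the fixed seed, and the name `sgd` are hypothetical.

```python
import random

def sgd(data, alpha=0.02, steps=5000, seed=0):
    """Fit y ~ w*x + b; each update uses the gradient of ONE random sample."""
    rng = random.Random(seed)           # fixed seed only for reproducibility
    w, b = 0.0, 0.0
    for _ in range(steps):
        x, y = rng.choice(data)         # draw a single training sample
        err = w * x + b - y             # residual on that sample
        w -= alpha * err * x
        b -= alpha * err
    return w, b

# Hypothetical noise-free data generated from y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]
w, b = sgd(data)
```

Every update costs the same regardless of how large the training set is. The path taken by `w` and `b` is noisier than with the full gradient, but on this noise-free toy data it still settles close to the generating values.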

Mini-batch Gradient Descent

  To balance computational cost against the accuracy of the gradient estimate, mini-batch gradient descent is often used. This method estimates the gradient from a small batch of samples in each iteration, which is cheaper than a full pass over the training set and less noisy than a single-sample estimate.

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \alpha \nabla \mathcal{L}_{\text{batch}}(\boldsymbol{\theta}_t, \text{Batch})$$

where $\nabla \mathcal{L}_{\text{batch}}(\boldsymbol{\theta}_t, \text{Batch})$ is the gradient of the loss function on the mini-batch $\text{Batch}$ with respect to the parameters $\boldsymbol{\theta}_t$.
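A minimal sketch of the mini-batch variant, assuming the same one-dimensional linear model and squared loss as a toy setting. Reshuffling the data every epoch and slicing it into consecutive batches is one common way to form mini-batches, not the only one; the name `minibatch_gd` and all numeric values are hypothetical.

```python
import random

def minibatch_gd(data, batch_size=2, alpha=0.05, epochs=1000, seed=0):
    """Fit y ~ w*x + b; each update averages the gradient over one mini-batch."""
    rng = random.Random(seed)           # fixed seed only for reproducibility
    data = list(data)                   # copy so the caller's list is untouched
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)               # new random batch composition per epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad_w = sum((w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum((w * x + b - y) for x, y in batch) / len(batch)
            w -= alpha * grad_w
            b -= alpha * grad_b
    return w, b

# Hypothetical noise-free data generated from y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]
w, b = minibatch_gd(data)
```

Setting `batch_size=1` recovers single-sample SGD and setting it to `len(data)` recovers batch gradient descent, so the batch size interpolates between the two methods described above.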

Advantages of SGD
  1. Computational efficiency: Compared with batch gradient descent, SGD has a much lower computational cost per iteration, which makes it especially practical on large-scale data sets.

  2. Online learning: SGD needs only one sample per iteration, so it can be used for online learning, letting the model adapt gradually as new data arrives.

  3. Escaping local minima: Because each iteration uses a different sample, the update direction is noisy, which can help SGD escape local minima and makes it more likely (though not guaranteed) to reach a better solution.

Challenges of SGD
  1. Instability: The update of each iteration in SGD may be affected by a single sample, resulting in large fluctuations in the update direction.

  2. Learning rate adjustment: Choosing an appropriate learning rate is crucial to the performance of SGD. A learning rate that is too large may cause instability, while a learning rate that is too small may make the model converge slowly.

  3. Hyperparameter tuning required: The performance of SGD depends on hyperparameters such as the learning rate and the mini-batch size, which need to be tuned.

In practice, techniques such as learning rate decay and momentum methods are usually used to improve the performance of SGD.
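
Both techniques can be added to the single-sample update with a few extra lines. The sketch below assumes a one-dimensional linear model $y \approx w x + b$ with squared loss, the classical (heavy-ball) form of momentum, and an inverse-time decay schedule $\alpha_t = \alpha_0 / (1 + k t)$. These are one possible choice each, and the name `sgd_momentum` and all numeric values are hypothetical.

```python
import random

def sgd_momentum(data, alpha0=0.002, decay=1e-4, momentum=0.9,
                 steps=20000, seed=0):
    """SGD on y ~ w*x + b with heavy-ball momentum and inverse-time decay."""
    rng = random.Random(seed)           # fixed seed only for reproducibility
    w, b = 0.0, 0.0
    vw, vb = 0.0, 0.0                   # velocity terms, one per parameter
    for t in range(steps):
        alpha = alpha0 / (1.0 + decay * t)      # learning rate decay
        x, y = rng.choice(data)                 # single random sample
        err = w * x + b - y
        # Velocity keeps a decaying sum of past gradient steps.
        vw = momentum * vw - alpha * err * x
        vb = momentum * vb - alpha * err
        w += vw
        b += vb
    return w, b

# Hypothetical noise-free data generated from y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]
w, b = sgd_momentum(data)
```

The velocity averages recent single-sample gradients, which smooths the fluctuating update direction, while the decaying learning rate takes larger steps early and smaller ones later.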


Origin blog.csdn.net/m0_63834988/article/details/135022885