[Deep Learning] Overview of Machine Learning (1) Three Elements of Machine Learning - Model, Learning Criteria, and Optimization Algorithm


1. Basic concepts

  Machine learning: Through algorithms, machines can learn patterns from large amounts of data to make decisions on new samples.

2. Three elements of machine learning

  Machine learning learns (or "guesses") general rules from limited observed data and generalizes those rules to unobserved samples.
  A machine learning method can be roughly broken down into three basic elements: the model, the learning criterion, and the optimization algorithm.

1. Model

a. Linear model

  A linear model is a simple but widely used model whose hypothesis space is a family of parameterized linear functions (for classification problems, a generalized linear function built on top of it is generally used). Its expression is:
$$f(\mathbf{x}; \boldsymbol{\theta}) = \mathbf{w}^T \mathbf{x} + b$$

Here the parameters $\boldsymbol{\theta}$ comprise the weight vector $\mathbf{w}$ and the bias $b$. The output is produced by a linear combination of the input features $\mathbf{x}$ with the weight vector $\mathbf{w}$. The model is simple yet effective, especially for problems whose underlying relationship is linear.
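As a minimal sketch of the linear model above, the function below evaluates $\mathbf{w}^T \mathbf{x} + b$ in plain Python. The name `linear_model` and the numeric values are hypothetical, chosen only for illustration.

```python
def linear_model(x, w, b):
    """Return f(x; theta) = w^T x + b for a single input vector x."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b


# Hypothetical parameters and input, for illustration only.
w = [2.0, -1.0]
b = 0.5
x = [1.0, 3.0]
y_hat = linear_model(x, w, b)  # 2*1 + (-1)*3 + 0.5
```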

b. Nonlinear model

  A generalized nonlinear model can be written as a linear combination of multiple nonlinear basis functions $\boldsymbol{\phi}(\mathbf{x})$:

$$f(\mathbf{x}; \boldsymbol{\theta}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) + b$$

where $\boldsymbol{\phi}(\mathbf{x}) = [\phi_1(\mathbf{x}), \phi_2(\mathbf{x}), \ldots, \phi_K(\mathbf{x})]^T$ is a vector composed of $K$ nonlinear basis functions, and the parameters $\boldsymbol{\theta}$ comprise the weight vector $\mathbf{w}$ and the bias $b$.

  If $\boldsymbol{\phi}(\mathbf{x})$ is itself a learnable basis function, for example:

$$\phi_k(\mathbf{x}) = h(\mathbf{w}_k^T \boldsymbol{\phi}'(\mathbf{x}) + b_k)$$

where $h(\cdot)$ is a nonlinear function, $\boldsymbol{\phi}'(\mathbf{x})$ is another set of basis functions, and $\mathbf{w}_k$ and $b_k$ are learnable parameters, then the model $f(\mathbf{x}; \boldsymbol{\theta})$ is equivalent to a neural network model.

  By introducing the nonlinear basis functions $\boldsymbol{\phi}(\mathbf{x})$, a nonlinear model can adapt more flexibly to complex relationships in the data and capture richer feature information. Neural networks are an important way to implement nonlinear models.
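The learnable-basis construction can be sketched as a network with one hidden layer. The code below is an illustration under two assumptions not fixed by the text: $h$ is taken to be `tanh`, and the inner basis $\boldsymbol{\phi}'(\mathbf{x})$ is the identity. The names `tanh_basis`, `nonlinear_model`, `W`, and `c` are hypothetical.

```python
import math


def tanh_basis(x, W, c):
    """phi_k(x) = tanh(w_k^T x + c_k); W holds one row w_k per basis function."""
    return [
        math.tanh(sum(w_kj * x_j for w_kj, x_j in zip(w_k, x)) + c_k)
        for w_k, c_k in zip(W, c)
    ]


def nonlinear_model(x, W, c, w, b):
    """f(x; theta) = w^T phi(x) + b, a linear combination of the basis outputs."""
    phi = tanh_basis(x, W, c)
    return sum(w_k * phi_k for w_k, phi_k in zip(w, phi)) + b


# One basis function on a one-dimensional input: f(x) = tanh(x).
f1 = nonlinear_model([1.0], [[1.0]], [0.0], [1.0], 0.0)
f2 = nonlinear_model([2.0], [[1.0]], [0.0], [1.0], 0.0)
```

Doubling the input does not double the output here (`f2` differs from `2 * f1`), which is exactly what the nonlinear function $h$ contributes.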

2. Learning criterion

a. Loss function

1. 0-1 loss function

  The 0-1 loss function is the most intuitive loss function and measures the model's error rate on the training set. It is defined as:

$$\mathcal{L}(y, f(\mathbf{x}; \boldsymbol{\theta})) = \begin{cases} 0 & \text{if } y = f(\mathbf{x}; \boldsymbol{\theta}) \\ 1 & \text{if } y \neq f(\mathbf{x}; \boldsymbol{\theta}) \end{cases}$$

Or, expressed with the indicator function:

$$\mathcal{L}(y, f(\mathbf{x}; \boldsymbol{\theta})) = \mathbb{I}(y \neq f(\mathbf{x}; \boldsymbol{\theta}))$$

where $\mathbb{I}(\cdot)$ is the indicator function.

  Although the 0-1 loss function is intuitive, it is discontinuous and its derivative is zero wherever it exists, so it gives gradient-based optimization nothing to follow. It is therefore usually replaced by other continuously differentiable loss functions.

2. Square loss function (regression problem)

  The square loss function is often used in regression problems and is defined as:
$$\mathcal{L}(y, f(\mathbf{x}; \boldsymbol{\theta})) = \frac{1}{2}\left(y - f(\mathbf{x}; \boldsymbol{\theta})\right)^2$$

The square loss function is suitable for the task of predicting real-valued labels, but is generally not suitable for classification problems.
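The two losses defined so far translate directly into code. This is a minimal sketch with hypothetical function names, where `y_pred` stands for $f(\mathbf{x}; \boldsymbol{\theta})$.

```python
def zero_one_loss(y, y_pred):
    """0-1 loss: 0 when the predicted label equals the true label, else 1."""
    return 0 if y == y_pred else 1


def square_loss(y, y_pred):
    """Square loss: half the squared difference between target and prediction."""
    return 0.5 * (y - y_pred) ** 2


# Hypothetical values, for illustration only.
label_error = zero_one_loss(1, 0)           # wrong class
regression_error = square_loss(3.0, 1.0)    # 0.5 * (3 - 1)^2
```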

3. Cross-Entropy Loss

  Cross-entropy is used in classification tasks to measure the difference between two probability distributions.

The cross-entropy loss function is often used in classification problems. Assume the label $y$ of a sample is a discrete category in $\{1, \ldots, C\}$, and the output of the model is the conditional probability distribution over the category labels:

$$p(y = c \mid \mathbf{x}; \boldsymbol{\theta}) = f_c(\mathbf{x}; \boldsymbol{\theta})$$

where $f_c(\mathbf{x}; \boldsymbol{\theta})$ denotes the $c$-th dimension of the model's output vector. The cross-entropy loss function is defined as:

$$\mathcal{L}(\mathbf{y}, f(\mathbf{x}; \boldsymbol{\theta})) = -\sum_{c=1}^{C} y_c \log f_c(\mathbf{x}; \boldsymbol{\theta})$$

where $\mathbf{y}$ is the true label vector of the sample, that is, the one-hot vector representation of the category $y$.
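A minimal sketch of the cross-entropy loss follows. It assumes the model's probabilities come from a softmax over raw scores, which the text does not specify; `softmax`, `cross_entropy`, and the `eps` guard against `log(0)` are hypothetical helpers for this example.

```python
import math


def softmax(scores):
    """Turn raw scores into a probability distribution (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_onehot, probs, eps=1e-12):
    """L(y, f) = -sum_c y_c * log f_c, with probabilities clipped below at eps."""
    return -sum(y_c * math.log(max(p_c, eps)) for y_c, p_c in zip(y_onehot, probs))


# Three classes with equal scores: every class gets probability 1/3.
probs = softmax([0.0, 0.0, 0.0])
loss = cross_entropy([0, 1, 0], probs)
```

Because the label vector is one-hot, only the term for the true class survives, so the loss reduces to the negative log-probability that the model assigns to the correct category.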

4. Hinge loss function

  The hinge loss function is usually used for binary classification problems, for example in support vector machines. Assuming the label takes values $y \in \{-1, +1\}$ and $f(\mathbf{x}; \boldsymbol{\theta}) \in \mathbb{R}$, it is defined as:

$$\mathcal{L}(y, f(\mathbf{x}; \boldsymbol{\theta})) = \max\left(0, 1 - y f(\mathbf{x}; \boldsymbol{\theta})\right)$$

Or, written in the same form as the ReLU function:

$$\mathcal{L}(y, f(\mathbf{x}; \boldsymbol{\theta})) = \left[1 - y f(\mathbf{x}; \boldsymbol{\theta})\right]_+$$

where $[x]_+ = \max(0, x)$.
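The hinge loss is a one-liner in code. The sketch below assumes labels in $\{-1, +1\}$ and a real-valued score; `hinge_loss` is a hypothetical name.

```python
def hinge_loss(y, score):
    """max(0, 1 - y * score), with y in {-1, +1} and score = f(x; theta)."""
    return max(0.0, 1.0 - y * score)


# Hypothetical scores, for illustration only.
confident_correct = hinge_loss(+1, 2.0)   # margin y * score = 2 >= 1
inside_margin = hinge_loss(+1, 0.5)       # correct side, but margin below 1
wrong_side = hinge_loss(-1, 0.5)          # misclassified
```

The loss is zero only when the sample is classified correctly with a margin of at least 1; correct predictions inside the margin are still penalized.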

  These loss functions play a key role in different tasks and models. Choosing an appropriate loss function is an important decision in model design.

b. Risk minimization criteria

  In machine learning, the risk minimization criterion is to find a model that minimizes its expected error on unknown data. However, since we cannot directly calculate the expected risk, the empirical risk (average loss on the training set) is usually used instead. The following are related concepts:

1. Empirical risk minimization

  Empirical Risk is the average loss of the model on the training set, expressed as:

$$\mathcal{R}_{\text{emp}}(\boldsymbol{\theta}) = \frac{1}{N}\sum_{n=1}^{N}\mathcal{L}\left(y^{(n)}, f(\mathbf{x}^{(n)}; \boldsymbol{\theta})\right)$$

where $\mathcal{L}$ is the loss function and $N$ is the number of samples in the training set. The goal of the empirical risk minimization criterion is to find a set of parameters $\boldsymbol{\theta}^*$ that minimizes the empirical risk:

$$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} \mathcal{R}_{\text{emp}}(\boldsymbol{\theta})$$

This is called the Empirical Risk Minimization (ERM) criterion.
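The empirical risk is just an average of per-sample losses, as the sketch below shows. The helper names, the toy data set, and the two candidate models are hypothetical and serve only to make the formula concrete.

```python
def square_loss(y, y_pred):
    """Per-sample loss used in this example."""
    return 0.5 * (y - y_pred) ** 2


def empirical_risk(loss, model, data):
    """Average loss of `model` over `data`, a list of (x, y) pairs."""
    if not data:
        raise ValueError("data must be non-empty")
    return sum(loss(y, model(x)) for x, y in data) / len(data)


# Toy training set generated by y = 2 * x (illustrative values).
data = [([1.0], 2.0), ([2.0], 4.0), ([3.0], 6.0)]

risk_poor = empirical_risk(square_loss, lambda x: 1.5 * x[0], data)
risk_exact = empirical_risk(square_loss, lambda x: 2.0 * x[0], data)
```

Under the ERM criterion the second model is preferred, because its average training loss is lower.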

2. Overfitting problem

  Since the training set is usually limited and may contain noise, directly minimizing the empirical risk may cause the model to perform well on the training set but perform poorly on unknown data, that is, overfitting occurs.

3. Structural risk minimization

  To prevent overfitting, a regularization term can be added to the empirical risk, which gives the Structural Risk Minimization (SRM) criterion:

$$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} \left( \mathcal{R}_{\text{emp}}(\boldsymbol{\theta}) + \frac{\lambda}{2} \|\boldsymbol{\theta}\|_2^2 \right)$$

where $\lambda$ is a regularization parameter that controls the strength of regularization. The regularization term $\frac{\lambda}{2} \|\boldsymbol{\theta}\|_2^2$ limits the size of the parameters, which helps prevent overfitting.
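To see how the regularization term limits parameter size, consider a deliberately small case: a one-parameter model $f(x) = wx$ with the square loss. Setting the derivative of the structural risk to zero gives $w^* = \sum_n x_n y_n / (\sum_n x_n^2 + N\lambda)$. The code below is a sketch of that special case only; the function names and data are hypothetical.

```python
def structural_risk(w, data, lam):
    """Empirical risk (mean square loss) plus the penalty (lam / 2) * w^2."""
    n = len(data)
    emp = sum(0.5 * (y - w * x) ** 2 for x, y in data) / n
    return emp + 0.5 * lam * w ** 2


def ridge_solution(data, lam):
    """Closed-form minimizer of structural_risk for the model f(x) = w * x."""
    n = len(data)
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + n * lam)


# Toy data generated by y = 2 * x (illustrative values).
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w_unregularized = ridge_solution(data, 0.0)
w_regularized = ridge_solution(data, 1.0)
```

With $\lambda = 0$ the solution is the plain ERM fit; increasing $\lambda$ pulls the weight toward zero, trading some training error for a smaller parameter.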

4. Selection of regularization terms

  Other functions can also serve as the regularization term, such as the $\ell_1$ norm. Introducing regularization can often be interpreted as placing a prior distribution on the parameters, so that they do not depend entirely on the training data, which aids generalization. The regularization strength (for example $\lambda$) has to be tuned to balance the empirical risk against the regularization term.
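The prior-distribution reading can be made concrete with a short derivation. This is a sketch under two assumptions that the text does not state: the loss is the negative log-likelihood, $\mathcal{L}(y, f(\mathbf{x}; \boldsymbol{\theta})) = -\log p(y \mid \mathbf{x}; \boldsymbol{\theta})$, and the prior is an isotropic Gaussian $p(\boldsymbol{\theta}) = \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$, where $\sigma^2$ is a symbol introduced here. Since $\log p(\boldsymbol{\theta}) = -\frac{1}{2\sigma^2}\|\boldsymbol{\theta}\|_2^2 + \text{const}$, the maximum a posteriori estimate is:

$$
\begin{aligned}
\boldsymbol{\theta}_{\text{MAP}}
&= \arg\max_{\boldsymbol{\theta}} \left( \sum_{n=1}^{N} \log p(y^{(n)} \mid \mathbf{x}^{(n)}; \boldsymbol{\theta}) + \log p(\boldsymbol{\theta}) \right) \\
&= \arg\min_{\boldsymbol{\theta}} \left( \frac{1}{N} \sum_{n=1}^{N} \mathcal{L}\left(y^{(n)}, f(\mathbf{x}^{(n)}; \boldsymbol{\theta})\right) + \frac{1}{2N\sigma^2} \|\boldsymbol{\theta}\|_2^2 \right)
\end{aligned}
$$

Under these assumptions this is the structural risk with an $\ell_2$ penalty and $\lambda = 1/(N\sigma^2)$: a narrower prior (smaller $\sigma^2$) corresponds to stronger regularization. A Laplace prior leads to an $\ell_1$ penalty by the same argument.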

5. Underfitting

  The opposite situation is underfitting, which occurs when the model has insufficient capacity and cannot fit the training data well, resulting in a high error rate even on the training set.

  The goal of machine learning is not only to obtain a good fit on the training set, but also to perform well on unknown data, that is, to have a low generalization error. Therefore, the selection of learning criteria should take into account the generalization ability of the model to prevent overfitting and underfitting.

3. Optimization

The machine learning problem is transformed into an optimization problem

  Once the training set $\mathcal{D}$, the hypothesis space $\mathcal{F}$, and the learning criterion are determined, the next task is to find the optimal model $f(\mathbf{x}; \boldsymbol{\theta}^*)$ through an optimization algorithm. The training process of machine learning is essentially the process of solving an optimization problem.

a. Parameters and hyperparameters

  Optimization can be divided into two aspects: parameter optimization and hyperparameter optimization:

  1. Parameter optimization: The $\boldsymbol{\theta}$ in the model $f(\mathbf{x}; \boldsymbol{\theta})$ are called the parameters of the model and are learned by the optimization algorithm. They can be updated iteratively with algorithms such as gradient descent to minimize the loss function.

  2. Hyperparameter optimization: Besides the learnable parameters $\boldsymbol{\theta}$, there is another class of parameters that define the model structure or the optimization strategy; these are called hyperparameters. For example, the number of clusters in a clustering algorithm, the learning rate in gradient descent, the coefficient of the regularization term, the number of layers in a neural network, and the kernel function in a support vector machine are all hyperparameters. Unlike learnable parameters, choosing hyperparameters is usually a combinatorial optimization problem that is difficult to learn automatically with an optimization algorithm. In practice, hyperparameters are usually set from experience, or tuned by trial and error over a set of candidate combinations using a search method.
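The split between the two kinds of optimization can be sketched as a grid search: an outer loop tries candidate hyperparameter values, and an inner step learns the parameter for each one. The example assumes a one-parameter model $f(x) = wx$ with an $\ell_2$ penalty and a held-out validation split for comparing candidates; the validation split, the function names, and the data are assumptions of this sketch, not something the text prescribes.

```python
def fit_ridge_1d(train, lam):
    """Inner step: learn the parameter w for a fixed hyperparameter lam."""
    n = len(train)
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train)
    return sxy / (sxx + n * lam)


def mean_square_loss(w, data):
    """Average square loss of the model f(x) = w * x on `data`."""
    return sum(0.5 * (y - w * x) ** 2 for x, y in data) / len(data)


def grid_search(train, valid, candidates):
    """Outer loop: keep the candidate lam with the lowest loss on held-out data."""
    best_lam, best_loss = None, float("inf")
    for lam in candidates:
        w = fit_ridge_1d(train, lam)
        loss = mean_square_loss(w, valid)
        if loss < best_loss:
            best_lam, best_loss = lam, loss
    return best_lam, best_loss


# Noisy training data around y = 2 * x, and a small validation set (illustrative values).
train = [(1.0, 2.3), (2.0, 3.8), (3.0, 6.4)]
valid = [(1.5, 3.0), (2.5, 5.0)]
best_lam, best_loss = grid_search(train, valid, [0.0, 0.01, 0.1, 1.0])
```

The candidates are compared on data that was not used to fit `w`, since the training loss alone would always favor the weakest regularization.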

b. Optimization algorithm

  Commonly used optimization algorithms for training a model include gradient descent, stochastic gradient descent, and Newton's method. The core idea of these algorithms is to update the model parameters iteratively so that the loss function gradually decreases.

  1. Gradient Descent: The basic idea is to adjust the parameters in the direction opposite to the gradient of the loss function, which reduces the loss. The learning rate is an important hyperparameter that controls the step size of the parameter update in each iteration.

  2. Stochastic Gradient Descent (SGD): Similar to gradient descent, but only one randomly selected sample is used for the parameter update in each iteration, which usually makes it better suited to large-scale data sets.

  3. Newton’s Method: Use the second-order derivative information of the loss function to update parameters. The convergence speed is usually faster than the gradient descent method, but the computational cost is higher.

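  4. Conjugate Gradient: Particularly suitable for solving large systems of linear equations with a symmetric positive-definite matrix, which is equivalent to minimizing a quadratic function.

  5. Quasi-Newton: Approximates the Hessian matrix (or its inverse) from gradient information, which speeds up convergence relative to gradient descent while avoiding the cost of computing exact second derivatives.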

  6. Adaptive learning rate algorithms such as Adam, Adagrad, and RMSprop: These algorithms adaptively adjust the learning rate so that different parameters can have different learning speeds.
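The first three algorithms in the list can be compared on one small problem. The sketch below minimizes the mean square loss of a one-parameter model $f(x) = wx$; the function names, learning rate, step counts, and data are hypothetical choices for illustration, not recommended settings.

```python
import random


def grad(w, batch):
    """Derivative of (1 / (2n)) * sum (y - w * x)^2 with respect to w."""
    return -sum(x * (y - w * x) for x, y in batch) / len(batch)


def gradient_descent(data, lr=0.05, steps=200, w=0.0):
    """Full-batch gradient descent: every step uses the whole training set."""
    for _ in range(steps):
        w -= lr * grad(w, data)  # move against the gradient
    return w


def sgd(data, lr=0.05, steps=2000, w=0.0, seed=0):
    """Stochastic gradient descent: every step uses one randomly chosen sample."""
    rng = random.Random(seed)
    for _ in range(steps):
        sample = rng.choice(data)
        w -= lr * grad(w, [sample])
    return w


def newton_step(w, data):
    """One Newton update: gradient divided by the second derivative."""
    hessian = sum(x * x for x, _ in data) / len(data)  # constant for a quadratic loss
    return w - grad(w, data) / hessian


# Noise-free toy data generated by y = 2 * x (illustrative values).
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w_gd = gradient_descent(data)
w_sgd = sgd(data)
w_newton = newton_step(0.0, data)
```

On this quadratic problem a single Newton step lands on the minimizer, while gradient descent needs many small steps. SGD also converges here because the data are noise-free; on noisy data, SGD with a fixed learning rate would typically keep fluctuating around the minimizer.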

  Selecting the appropriate optimization algorithm and hyperparameters is a key issue in the training process. Experimentation and parameter adjustment are usually required to obtain the best performance.

Machine learning = optimization?

  Machine learning is not just about minimizing empirical risks


Origin blog.csdn.net/m0_63834988/article/details/135000630