Reading notes on Chapter 1 of "Statistical Learning Methods" (Li Hang): an overview of statistical learning, its classification, and its three elements

PS: These are my reading notes. If you want more detailed content, please buy the original book.

Chapter 1: Introduction to Statistical Learning and Supervised Learning

1.1 Overview of Statistical Learning

Statistical learning is the discipline in which computers build probabilistic and statistical models from data and use those models to predict and analyze data. It is also known as statistical machine learning.

The concept of "learning": if a system can improve its performance by executing some process, then it is learning. —— Herbert Simon

The object of statistical learning is data. The data must exhibit some statistical regularity rather than being completely random; only then can features be extracted from it and analysis and prediction be carried out.

The main methods of statistical learning: supervised learning, unsupervised learning, reinforcement learning, etc.

Summary of statistical learning methods:

Starting from a given, finite set of training data, and assuming the data are independently and identically distributed, the model to be learned is taken to belong to a set of functions called the hypothesis space. An evaluation criterion is applied to select from the hypothesis space an optimal model, one that gives the best predictions, under the given criterion, for both the known training data and unknown test data. The selection of this optimal model is carried out by an algorithm.

Three elements of statistical learning: model, strategy, and algorithm.

Statistical learning steps:

1. Obtain a finite set of training data.
2. Determine the hypothesis space containing all candidate models, i.e., the set of models to learn from.
3. Determine the criterion for model selection, i.e., the learning strategy.
4. Implement an algorithm for solving for the optimal model, i.e., the learning algorithm.
5. Select the optimal model by the learning method.
6. Use the learned optimal model to predict and analyze new data.
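
The following is a minimal sketch of these six steps in Python, for a toy 1-D regression task. The synthetic data, the polynomial hypothesis space, and the use of numpy are all assumptions of this example, not from the book:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: obtain a finite training set (a held-out split is used later for selection)
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 30)
x_tr, y_tr, x_va, y_va = x[:20], y[:20], x[20:], y[20:]

# Step 2: hypothesis space -- polynomials of degree 0..5
degrees = range(6)

# Step 3: strategy -- empirical risk under the squared loss
def risk(coef, xs, ys):
    return np.mean((np.polyval(coef, xs) - ys) ** 2)

# Step 4: algorithm -- least squares (np.polyfit) fits each candidate model
fits = {d: np.polyfit(x_tr, y_tr, d) for d in degrees}

# Step 5: select the model with the lowest held-out risk
best = min(degrees, key=lambda d: risk(fits[d], x_va, y_va))

# Step 6: use the learned optimal model to predict new data
print("chosen degree:", best, "| prediction at x=0.5:", np.polyval(fits[best], 0.5))
```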

1.2 Classification of Statistical Learning

1.2.1 Basic classification

1. Supervised learning

Supervised learning means that there is a clear correspondence between input and output, and the prediction model generates corresponding output for a given input. The essence of supervised learning is to learn the statistical law of the mapping from input to output.

Prediction tasks are given different names according to the types of the input and output variables: a prediction problem in which both the input and output variables are continuous is called a regression problem; a prediction problem in which the output variable takes a finite number of discrete values is called a classification problem; and a prediction problem in which both the input and output variables are sequences of variables is called a tagging problem (for example, part-of-speech tagging of a sentence).

2. Unsupervised learning

Unsupervised learning refers to the machine learning problem of learning a predictive model from unlabeled data, i.e., data obtained naturally without annotation. The essence of unsupervised learning is to learn the statistical regularities or latent structure in the data.

3. Reinforcement learning

Reinforcement learning refers to the machine learning problem in which an intelligent system learns an optimal behavioral policy through continuous interaction with its environment. Assuming that the interaction between the system and the environment follows a Markov decision process, what the system can observe is the data sequence obtained from that interaction. The essence of reinforcement learning is to learn optimal sequential decisions.

The interaction between the intelligent system and the environment proceeds as follows (the book illustrates it with a figure): at each step t, the system observes a state s_t and a reward r_t from the environment and takes an action a_t; the environment then determines the state s_{t+1} and reward r_{t+1} of step t+1 according to the chosen action. The goal of the system is not to maximize the short-term reward but to maximize the long-term cumulative reward. Through continual trial and error, the system learns the optimal policy.
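
As a concrete illustration, here is a minimal Python sketch of this interaction loop. The toy two-state environment and the random placeholder policy are invented for this example and are not from the book:

```python
import random

def env_step(state, action):
    """Toy dynamics: action 1 tends to lead to state 1, which pays reward 1."""
    next_state = 1 if (action == 1 and random.random() < 0.8) else 0
    reward = 1.0 if next_state == 1 else 0.0
    return next_state, reward

def policy(state):
    return random.choice([0, 1])  # placeholder; a learner would improve this

state, total_reward = 0, 0.0
for t in range(100):
    action = policy(state)                   # the system takes action a_t
    state, reward = env_step(state, action)  # environment returns s_{t+1}, r_{t+1}
    total_reward += reward                   # accumulate the long-term reward

print("cumulative reward over 100 steps:", total_reward)
```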

4. Semi-supervised learning and active learning

Semi-supervised learning refers to the machine learning problem of learning a predictive model from both labeled and unlabeled data, usually a small amount of labeled data and a large amount of unlabeled data. Semi-supervised learning aims to use the information in the unlabeled data to assist the labeled data in supervised learning, achieving better results at lower cost.

Active learning refers to the machine learning problem in which the machine continually and actively presents examples for a teacher to label, and then learns a prediction model from the labeled data. Ordinary supervised learning uses given labeled data, usually obtained at random, and can be regarded as "passive learning". The goal of active learning is to find the examples most helpful to learning and have the teacher label them, achieving better results at a small labeling cost.
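
A hedged sketch of one common form, pool-based active learning with uncertainty sampling. Scikit-learn's LogisticRegression and the `oracle` function standing in for the human teacher are assumptions of this example:

```python
# Pool-based active learning: repeatedly ask the "teacher" (the oracle below)
# to label the example the current model is least certain about.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
pool = list(rng.normal(size=(200, 2)))   # a large unlabeled pool
oracle = lambda x: int(x[0] + x[1] > 0)  # stands in for the human teacher

# Seed with one example of each class so the classifier can be fitted
labeled_X = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])]
labeled_y = [oracle(x) for x in labeled_X]

for _ in range(20):                           # 20 queries: a small labeling cost
    clf = LogisticRegression().fit(labeled_X, labeled_y)
    probs = clf.predict_proba(np.array(pool))[:, 1]
    i = int(np.argmin(np.abs(probs - 0.5)))   # the most uncertain example
    x = pool.pop(i)
    labeled_X.append(x)
    labeled_y.append(oracle(x))               # the teacher labels it
```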

1.2.2 Classification by model

1.2.3 Classification by algorithm

Online learning: receive one sample at a time, make a prediction for it, then learn from that sample to update the model, and repeat. Some scenarios require learning to be online, for example when the data arrive sequentially and cannot all be stored, so the system must process them in a timely manner.

Batch learning: receive all the data at once, learn the model, and then make predictions.
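
A short sketch contrasting the two regimes. Scikit-learn's SGDClassifier is used here only for illustration and is an assumption of this example:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(int)

# Batch learning: receive all the data at once, then learn the model
batch = SGDClassifier().fit(X, y)

# Online learning: receive one sample at a time, predict, then update
online = SGDClassifier()
for i in range(len(X)):
    xi, yi = X[i:i + 1], y[i:i + 1]
    if i > 0:
        _ = online.predict(xi)                  # predict first...
    online.partial_fit(xi, yi, classes=[0, 1])  # ...then learn from the sample
```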

1.3 Three elements of statistical learning methods

1.3.1 Model

A model is the conditional probability distribution or decision function to be learned; the hypothesis space contains all possible models. In this book, a model represented by a decision function is called a non-probabilistic model, and a model represented by a conditional probability distribution is called a probabilistic model.

1.3.2 Strategy

1. Loss function and risk function

In supervised learning, for a given input X, the corresponding output is given by f(X). This predicted value f(X) may or may not agree with the true value Y. A loss function or cost function is used to measure the degree of prediction error; it is a non-negative real-valued function of Y and f(X), denoted L(Y, f(X)).

Commonly used loss functions:

(1) 0-1 loss function

L(Y,f(X))=\begin{cases} 1, & Y \neq f(X) \\ 0, & Y = f(X) \end{cases}

(2) Square loss function

L(Y,f(X))=(Y-f(X))^2

(3) Absolute loss function

L(Y,f(X))= \left | Y-f(X) \right |

(4) Log loss function or log likelihood loss function

L(Y,P(Y|X))=-\log P(Y|X)
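
Direct numpy translations of these four losses (a sketch; the function names and vectorized form are my own):

```python
import numpy as np

def zero_one_loss(y, fx):
    return np.where(y != fx, 1.0, 0.0)  # 1 when prediction is wrong, else 0

def squared_loss(y, fx):
    return (y - fx) ** 2

def absolute_loss(y, fx):
    return np.abs(y - fx)

def log_loss(p_y_given_x):
    # p_y_given_x is the model's probability P(Y|X) of the true label
    return -np.log(p_y_given_x)
```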

The smaller the value of the loss function, the better the model. Since the model's input and output (X, Y) are random variables following the joint distribution P(X, Y), the expectation of the loss function is:

R_{exp}(f)=E_{P}[L(Y,f(X))]=\int_{\mathcal{X}\times\mathcal{Y}} L(y,f(x))\,P(x,y)\,dx\,dy

This is the average loss of the model f(X) with respect to the joint distribution P(X,Y), called the risk function or expected loss.

But the joint distribution P(X,Y) is unknown, and R_{exp}(f) cannot be calculated directly.

Given a training dataset:

T = \{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}

The average loss of the model f(X) on the training data set is called the empirical risk or empirical loss, denoted R_{emp}:

R_{emp}(f)=\frac{1}{N}\sum_{i=1}^{N}L(y_i,f(x_i))
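
In code, the empirical risk is just this average. A minimal sketch, using the squared loss as the example loss:

```python
import numpy as np

def empirical_risk(f, xs, ys):
    """Average squared loss of model f over the N training pairs."""
    return np.mean([(y - f(x)) ** 2 for x, y in zip(xs, ys)])

# Example: a linear model f(x) = 2x evaluated on three training pairs
print(empirical_risk(lambda x: 2 * x, [1, 2, 3], [2.1, 3.9, 6.2]))
```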

The expected risk R_{exp} is the model's expected loss with respect to the joint distribution, while the empirical risk R_{emp} is the model's average loss on the training set. By the law of large numbers, as N approaches infinity the empirical risk converges to the expected risk, so empirical risk minimization can be used to train the model.

However, minimizing the empirical risk alone easily leads to overfitting, so structural risk R_{srm} is introduced: it adds a regularization term (penalty term) J(f) to the empirical risk. It is defined as:

R_{srm}(f)=\frac{1}{N}\sum_{i=1}^{N}L(y_i,f(x_i))+\lambda J(f)

where J(f) is the complexity of the model, a functional defined on the hypothesis space F; the more complex the model, the larger J(f). The coefficient \lambda \geq 0 trades off the empirical risk against the model complexity.
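
A sketch of structural risk for a linear model, where the squared L2 norm of the weights stands in for J(f). This particular choice of complexity term is an assumption of the example (it matches ridge regression's objective):

```python
import numpy as np

def structural_risk(w, X, y, lam):
    empirical = np.mean((X @ w - y) ** 2)  # R_emp: average squared loss
    penalty = lam * np.sum(w ** 2)         # lambda * J(f), with J(f) = ||w||^2
    return empirical + penalty

# A larger lam penalizes complex (large-weight) models more heavily
w = np.array([1.0, -2.0])
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, -2.0, -1.0])
print(structural_risk(w, X, y, lam=0.1))  # empirical risk 0 here, so prints 0.5
```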

1.3.3 Algorithms

An algorithm is the concrete computational method used to solve for the optimal model, typically by solving the optimization problem that the model and strategy define.
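
For instance, when the model is a one-parameter linear function and the strategy is empirical squared-loss risk, gradient descent is one such algorithm. This concrete setup is my own illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
y = 3.0 * x + rng.normal(0, 0.1, 50)  # data generated with true w = 3.0

w, lr = 0.0, 0.5
for _ in range(200):
    grad = np.mean(2 * (w * x - y) * x)  # d/dw of the empirical squared-loss risk
    w -= lr * grad                       # step against the gradient
print("learned w:", w)                   # should approach 3.0
```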
