Deep Reinforcement Learning (Wang Shusen Edition) Study Notes (1) - Basics of Machine Learning

Foreword

        My work involves deep reinforcement learning, so I wanted an opportunity to review the subject. I happen to have this book on hand; its treatment is concise yet fairly comprehensive, and it comes with study materials. I am therefore starting this series to record my progress through the book. The accompanying materials (slides and source code) are available at: https://www.ituring.com.cn/book/2982

        First, following the order of the book, we cover some basic machine learning theory.

1.1 Linear model

         Linear models are the simplest class of supervised machine learning models and are often used for simple machine learning tasks. A linear model can be thought of as a neural network with a single layer.

Linear regression

        A linear model predicts the value of an output variable (the dependent variable) from a linear combination of a set of input variables (the independent variables). Here, a linear combination means multiplying each variable by its weight, summing the results, and adding a constant intercept. This can be written as:

f(\textbf{x};\mathbf{w},b)=w_1x_1+w_2x_2+...+w_nx_n+b

        where f(\mathbf{x};\mathbf{w},b) is the output variable, x_1, \cdots, x_n are the independent variables, b is the intercept, and w_1, \cdots, w_n are the weights of the corresponding variables. The linear combination can also be written in vector form:

f(\mathbf{x};\mathbf{w},b)=\mathbf{x}^T\mathbf{w}+b

        For example, in the house price problem, \mathbf{x} holds the features of a house (x_1 is its area, x_2 its location, and so on), and f(\mathbf{x};\mathbf{w},b) is the predicted price, which depends on both the features \mathbf{x} and the parameters \mathbf{w} and b. If x_1 is the area of the house, then w_1 is the contribution of the area to the price: the larger w_1 is, the more strongly the price depends on the area, which is why \mathbf{w} is called the weight. The term b is added to the linear function \mathbf{x}^T\mathbf{w} regardless of the features of the house, so it is called the offset (or intercept). The offset can be thought of as the average or median house price on the market; it has nothing to do with the features of the particular house being appraised.

        Linear regression is a machine learning algorithm based on the linear model: it seeks the best-fit line for predicting the value of the output variable. Specifically, linear regression uses the samples in the training set to determine the weights \mathbf{w} and the intercept b, written \hat{\mathbf{w}} and \hat{b} once learned, so that predictions can be made on new data:
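To make the formula f(x; w, b) = x^T w + b concrete, here is a minimal sketch in plain Python. The function name `linear_model`, the two features, and all the numbers are made up for illustration; they do not come from the book.

```python
def linear_model(x, w, b):
    """Return f(x; w, b) = x^T w + b for a single input vector x."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b


# Hypothetical house described by two numeric features.
x = [80.0, 3.0]    # e.g. area and number of rooms (assumed features)
w = [2.0, 10.0]    # made-up weights, one per feature
b = 50.0           # made-up offset
price = linear_model(x, w, b)  # 2*80 + 10*3 + 50
```

Changing `b` shifts every prediction by the same amount, while changing `w[0]` changes how much each extra unit of area adds to the price.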

f(\mathbf{x};\mathbf{\hat{w}},\hat{b})=\mathbf{x}^T\mathbf{\hat{w}}+\hat{b}

Least squares method

        Least squares is a common mathematical method used to fit data and solve linear regression models. Its main idea is to determine the unknown parameters in the model by minimizing the squared error between the predicted value and the true value.

        Specifically, suppose we have a set of data (\mathbf{x}_1,y_1),(\mathbf{x}_2,y_2),\cdots,(\mathbf{x}_n,y_n), and we wish to find a linear model of the form f(\mathbf{x};\mathbf{w},b)=\mathbf{x}^T\mathbf{w}+b to describe these data. We can use the least squares method to determine the values of the parameters \mathbf{w} and b so that the sum of the squared vertical distances from the data points to the fitted line is as small as possible.

         For example, in the house price problem, the model predicts the price of the i-th house as \hat{y}_i=f(\mathbf{x}_i;\mathbf{w},b), while the real price of that house is y_i. We want \hat{y}_i to be as close to y_i as possible, so the smaller the squared difference (\hat{y}_i-y_i)^2, the better. Define the loss function:

L(\mathbf{w},b)=\frac{1}{2n}\sum_{i=1}^{n}[f(\mathbf{x}_i;\mathbf{w},b)-y_i ]^2

        The least squares method seeks the \mathbf{w} and b that make the loss function as small as possible, that is, that make the model's predictions as accurate as possible. In practice, numerical optimization algorithms (such as stochastic gradient descent) are most commonly used to update \mathbf{w} and b iteratively.

        When the model has many parameters and there is not enough training data, regularization is often used to alleviate overfitting. After adding a regularization term, the least squares model above becomes:
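The loss L(w, b) defined above can be minimized numerically. The sketch below uses full-batch gradient descent rather than the stochastic variant, purely to keep the example deterministic; the function names, learning rate, step count, and toy data are all assumptions made for illustration.

```python
def predict(x, w, b):
    """f(x; w, b) = x^T w + b."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b


def least_squares_loss(X, y, w, b):
    """L(w, b) = 1/(2n) * sum_i (f(x_i; w, b) - y_i)^2."""
    n = len(X)
    return sum((predict(x_i, w, b) - y_i) ** 2 for x_i, y_i in zip(X, y)) / (2 * n)


def fit_least_squares(X, y, lr=0.1, steps=2000):
    """Minimize the loss with full-batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        residuals = [predict(x_i, w, b) - y_i for x_i, y_i in zip(X, y)]
        # dL/dw_j = 1/n * sum_i residual_i * x_ij ;  dL/db = 1/n * sum_i residual_i
        grad_w = [sum(r * x_i[j] for r, x_i in zip(residuals, X)) / n for j in range(d)]
        grad_b = sum(residuals) / n
        w = [w_j - lr * g_j for w_j, g_j in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1 without noise.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]
w_hat, b_hat = fit_least_squares(X, y)
```

On this noise-free data the fitted `w_hat[0]` and `b_hat` should approach 2 and 1. The learning rate has to be small enough for the iteration to be stable; 0.1 works for this particular data but is not a general recommendation.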

\min_{\mathbf{w},b} L(\mathbf{w},b)+\lambda R(\mathbf{w})

Here L(\mathbf{w},b) is the loss function, R(\mathbf{w}) is the regularization term, and \lambda is a hyperparameter that balances the two. The most commonly used regularization terms are the L1 term R(\mathbf{w})=\|\mathbf{w}\|_1 and the L2 term R(\mathbf{w})=\|\mathbf{w}\|_2^2.
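As a small illustration of the regularized objective L(w, b) + λR(w), the sketch below assumes the usual definitions of the two penalties: the L1 term is the sum of absolute values of the weights and the L2 term is the sum of their squares. The function names and numbers are hypothetical.

```python
def l1_penalty(w):
    """R(w) = ||w||_1, the sum of absolute values."""
    return sum(abs(w_j) for w_j in w)


def l2_penalty(w):
    """R(w) = ||w||_2^2, the sum of squares."""
    return sum(w_j ** 2 for w_j in w)


def regularized_objective(loss_value, w, lam, penalty=l2_penalty):
    """Return loss + lam * R(w); the intercept b is not penalized."""
    return loss_value + lam * penalty(w)


w = [3.0, -4.0]
obj_l1 = regularized_objective(1.0, w, lam=0.5, penalty=l1_penalty)  # 1 + 0.5 * 7
obj_l2 = regularized_objective(1.0, w, lam=0.5)                      # 1 + 0.5 * 25
```

A larger `lam` puts more pressure on the weights to stay small, at the cost of a worse fit to the training data.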

Logistic regression

        Logistic Regression is a machine learning algorithm for classification problems. It predicts labels for new data points by fitting the relationship between features and labels in the dataset. The output of logistic regression is a probability value representing the probability that the data point belongs to a certain class.

        The mathematical function used in logistic regression is the sigmoid function, which maps an input value to a probability value between 0 and 1. The formula for the sigmoid function is:

 sigmoid(z)=\frac{1}{1+exp(-z)}
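A direct translation of the formula overflows for large negative z, because exp(-z) becomes too large for a float. The sketch below is one common way to avoid that by branching on the sign of z; it is an implementation choice of these notes, not something prescribed by the book.

```python
import math


def sigmoid(z):
    """Numerically stable sigmoid(z) = 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # z < 0 here, so exp(z) cannot overflow
    return e / (1.0 + e)


half = sigmoid(0.0)
tails = (sigmoid(-1000.0), sigmoid(1000.0))
```

Both branches compute the same mathematical function; the second one just multiplies numerator and denominator by exp(z).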

        Before explaining the algorithm, let's introduce cross-entropy, a quantity that measures the difference between two probability distributions. In machine learning, cross-entropy is often used to measure the difference between a model's predictions and the actual results. Let the vectors

\mathbf{p}=[p_1,\cdots ,p_m]^T       and          \mathbf{q}=[q_1,\cdots ,q_m]^T

represent two m-dimensional discrete probability distributions: the elements of each vector are non-negative and sum to 1. The cross-entropy between them is defined as

H(\mathbf{p},\mathbf{q})=-\sum_{j=1}^{m}p_j\cdot \ln q_j

 Entropy is a special case of cross entropy:

H(\mathbf{p})=H(\mathbf{p},\mathbf{p})=-\sum_{j=1}^{m}p_j\cdot \ln p_j

        Closely related to cross-entropy is the KL divergence (Kullback-Leibler divergence), also known as relative entropy, which likewise measures the difference between two probability distributions. For discrete distributions, the KL divergence is defined as

KL(\mathbf{p},\mathbf{q})=\sum_{j=1}^{m}p_j\cdot \ln\frac{p_j}{q_j}

By convention, \ln\frac{0}{0}=0 here. The KL divergence is always non-negative, and it equals 0 if and only if \mathbf{p}=\mathbf{q}. This means that when two probability distributions coincide, their KL divergence attains its minimum value of 0. From the definitions of KL divergence and cross-entropy, it is not hard to see that

KL(\mathbf{p},\mathbf{q})=H(\mathbf{p},\mathbf{q})-H(\mathbf{p})

Since the entropy H(\mathbf{p}) does not depend on \mathbf{q}, once \mathbf{p} is fixed the KL divergence equals the cross-entropy plus a constant. With \mathbf{p} fixed, optimizing the KL divergence with respect to \mathbf{q} is therefore equivalent to optimizing the cross-entropy. In practice the data distribution \mathbf{p} is fixed, and we want the distribution \mathbf{q} predicted by the model to be as close to it as possible; that is, we minimize the KL divergence, which is equivalent to minimizing the cross-entropy. This is why cross-entropy is so often used as a loss function.

        Following the example in the book, collect n blood test reports together with their final diagnoses as the training set (\mathbf{x}_1,y_1),(\mathbf{x}_2,y_2),\cdots,(\mathbf{x}_n,y_n). The vector \mathbf{x}_i\in \mathbb{R}^d contains all the indicators in the i-th blood test report, and the binary label y_i=1 means positive while y_i=0 means negative. The classifier's prediction for the i-th report is \hat{y}_i=f(\mathbf{x}_i;\mathbf{w},b)=sigmoid(\mathbf{x}_i^T\mathbf{w}+b), whereas the true label is y_i. To measure the difference between the prediction and the true label with cross-entropy, both must be written as vectors:
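The identity KL(p, q) = H(p, q) - H(p) is easy to check numerically. The sketch below implements the three definitions in plain Python; it skips terms with p_j = 0 (which contribute nothing) and assumes q_j > 0 wherever p_j > 0. The two example distributions are made up.

```python
import math


def cross_entropy(p, q):
    """H(p, q) = -sum_j p_j * ln(q_j); terms with p_j = 0 contribute 0."""
    return -sum(p_j * math.log(q_j) for p_j, q_j in zip(p, q) if p_j > 0)


def entropy(p):
    """H(p) = H(p, p)."""
    return cross_entropy(p, p)


def kl_divergence(p, q):
    """KL(p, q) = sum_j p_j * ln(p_j / q_j); terms with p_j = 0 contribute 0."""
    return sum(p_j * math.log(p_j / q_j) for p_j, q_j in zip(p, q) if p_j > 0)


p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]
# Should be zero up to floating-point rounding.
gap = kl_divergence(p, q) - (cross_entropy(p, q) - entropy(p))
```

Because `entropy(p)` does not involve `q`, any `q` that lowers `cross_entropy(p, q)` lowers `kl_divergence(p, q)` by exactly the same amount.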

\begin{pmatrix} y_i\\ 1-y_i \end{pmatrix}         and        \begin{pmatrix} \hat{y}_i\\ 1-\hat{y}_i \end{pmatrix}

The first element of each vector is the probability of being positive, and the second is the probability of being negative. Since the labels y_i of the training samples are given, the closer the two vectors are, the smaller their cross-entropy. Define the loss function of the problem as the mean cross-entropy:

L(\mathbf{w},b)=\frac{1}{n}\sum_{i=1}^{n}H(\begin{pmatrix} y_i\\ 1-y_i \end{pmatrix},\begin{pmatrix} \hat{y}_i\\ 1-\hat{y}_i \end{pmatrix})

                      =-\frac{1}{n}\sum_{i=1}^{n}[y_i \ln\hat{y}_i+(1-y_i)\ln(1-\hat{y}_i)]

We want to find the \mathbf{w} and b that make the loss function as small as possible, that is, that make the classifier's predictions as accurate as possible:

\min_{\mathbf{w},b} L(\mathbf{w},b)+\lambda R(\mathbf{w})

As can be seen, the formulation parallels that of the regression problem; this type of optimization problem is called logistic regression. The parameters are usually updated iteratively with stochastic gradient descent.

Softmax classifier
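Putting the pieces together, here is a minimal sketch of logistic regression trained with stochastic gradient descent, without the regularization term. It relies on the standard fact that, for a sigmoid output with cross-entropy loss, the derivative of one sample's loss with respect to the pre-activation x^T w + b is simply the prediction minus the label. The data, learning rate, epoch count, and seed are arbitrary choices for illustration.

```python
import math
import random


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(x, w, b):
    """y_hat = sigmoid(x^T w + b), the predicted probability of the positive class."""
    return sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)


def logistic_loss(X, y, w, b, eps=1e-12):
    """Mean binary cross-entropy; eps guards against ln(0)."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = min(max(predict_proba(x_i, w, b), eps), 1.0 - eps)
        total -= y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p)
    return total / len(X)


def train_sgd(X, y, lr=0.5, epochs=200, seed=0):
    """Update (w, b) one randomly ordered sample at a time."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    indices = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for i in indices:
            err = predict_proba(X[i], w, b) - y[i]  # d(loss_i)/d(x^T w + b)
            w = [w_j - lr * err * x_j for w_j, x_j in zip(w, X[i])]
            b -= lr * err
    return w, b


# Toy one-dimensional data: negative inputs are class 0, positive inputs class 1.
X = [[-2.0], [-1.0], [1.0], [2.0]]
y = [0, 0, 1, 1]
w_hat, b_hat = train_sgd(X, y)
```

After training, `predict_proba([1.5], w_hat, b_hat)` should be above 0.5 and `predict_proba([-1.5], w_hat, b_hat)` below it. Because this toy data is perfectly separable and there is no regularization, the weight keeps growing with more epochs, which is one motivation for the λR(w) term.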

        The example in the previous section was a binary classification problem, where the data fell into only two classes, negative and positive. This section introduces a multi-class classification model, the softmax classifier, which maps an input vector to a probability distribution. In the softmax classifier each class is assigned a score, and the scores are normalized by the softmax function to obtain the probability of each class. The softmax classifier can therefore be used to predict which class an input sample belongs to.

        Before introducing the softmax classifier, we first introduce the softmax activation function. Its input and output are both k-dimensional vectors. Let \mathbf{z}=[z_1,\cdots ,z_k]^T be any k-dimensional real vector, whose elements may be positive or negative. The softmax function is defined as:

softmax(\mathbf{z})=\frac{1}{\sum_{l=1}^{k}exp(z_l)}[exp(z_1),\cdots ,exp(z_k)]^T

The output of this function is a k-dimensional vector whose elements are all positive and sum to 1.
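A sketch of the softmax function in plain Python follows. Subtracting the largest element before exponentiating is a common trick to avoid overflow; it does not change the result, because the common factor cancels between numerator and denominator. This trick is an addition of these notes, not part of the definition above.

```python
import math


def softmax(z):
    """Map a real vector z to a probability vector of the same length."""
    m = max(z)                                # shift for numerical stability
    exps = [math.exp(z_l - m) for z_l in z]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
big = softmax([1000.0, 1000.0])   # would overflow without the shift
```

Larger scores receive larger probabilities, and adding the same constant to every score leaves the output unchanged.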

         The linear softmax classifier is "linear function + softmax activation function", which is defined as

\mathbf{\pi}=softmax(\mathbf{z}), where   \mathbf{z}=\mathbf{W}\mathbf{x}+\mathbf{b}

Here \mathbf{W}\in \mathbb{R}^{k\times d} and \mathbf{b}\in \mathbb{R}^k are the parameters of the classifier, d is the dimension of the input vector, and k is the number of classes.

        The example in the book uses n=60,000 images of handwritten digits, each 28*28 pixels. Each image is flattened into a vector of dimension d=28*28=784, denoted \mathbf{x}_1,\cdots ,\mathbf{x}_n\in \mathbb{R}^d. Each image has a label, an integer from 0 to 9 (10 classes in total), which is one-hot encoded: 0 becomes [1,0,0,0,0,0,0,0,0,0], 1 becomes [0,1,0,0,0,0,0,0,0,0], ..., and 9 becomes [0,0,0,0,0,0,0,0,0,1]. The resulting k=10-dimensional one-hot vectors are denoted \mathbf{y}_1,\cdots ,\mathbf{y}_n.

        For the i-th image \mathbf{x}_i, the classifier makes a prediction:

\mathbf{\pi}_i=softmax(\mathbf{W}\mathbf{x}_i+\mathbf{b})

Here \mathbf{\pi}_i is a k=10-dimensional vector that reflects the classification result. We want it to be as close as possible to the true label \mathbf{y}_i, so we define the loss function as the average cross-entropy:

L(\mathbf{W},\mathbf{b})=\frac{1}{n}\sum_{i=1}^{n}H(\mathbf{y}_i,\mathbf{\pi}_i)

We want to find the \mathbf{W} and \mathbf{b} that make the loss function as small as possible, that is, that make the classifier's predictions as accurate as possible:

\min_{\mathbf{W},\mathbf{b}} L(\mathbf{W},\mathbf{b})+\lambda R(\mathbf{W})

The parameters are then iteratively updated using the stochastic gradient descent algorithm.
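The whole softmax classifier fits in a short sketch. To keep it small and deterministic, it uses a made-up dataset with d = 2 features and k = 3 classes instead of the 784-dimensional digit images, full-batch gradient descent instead of the stochastic variant, and no regularization. It relies on the standard result that the derivative of one sample's cross-entropy with respect to z is π - y. All names and hyperparameters are assumptions.

```python
import math


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]


def one_hot(label, k):
    return [1.0 if j == label else 0.0 for j in range(k)]


def predict(W, b, x):
    """pi = softmax(W x + b), with W stored as k rows of length d."""
    z = [sum(w * x_j for w, x_j in zip(row, x)) + b_r for row, b_r in zip(W, b)]
    return softmax(z)


def mean_cross_entropy(W, b, X, labels):
    """Average of H(y_i, pi_i); only the true class contributes for one-hot y_i."""
    return -sum(math.log(predict(W, b, x)[label]) for x, label in zip(X, labels)) / len(X)


def gradient_step(W, b, X, labels, lr):
    """One full-batch gradient descent step using d(loss_i)/dz = pi_i - y_i."""
    k, d, n = len(b), len(X[0]), len(X)
    grad_W = [[0.0] * d for _ in range(k)]
    grad_b = [0.0] * k
    for x, label in zip(X, labels):
        pi, y = predict(W, b, x), one_hot(label, k)
        for r in range(k):
            delta = (pi[r] - y[r]) / n
            grad_b[r] += delta
            for j in range(d):
                grad_W[r][j] += delta * x[j]
    W = [[W[r][j] - lr * grad_W[r][j] for j in range(d)] for r in range(k)]
    b = [b[r] - lr * grad_b[r] for r in range(k)]
    return W, b


# Tiny made-up dataset: one sample per class.
X = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
labels = [0, 1, 2]
W = [[0.0, 0.0] for _ in range(3)]
b = [0.0, 0.0, 0.0]
loss_before = mean_cross_entropy(W, b, X, labels)
for _ in range(200):
    W, b = gradient_step(W, b, X, labels, lr=0.5)
loss_after = mean_cross_entropy(W, b, X, labels)
```

With all parameters at zero every class gets probability 1/3, so `loss_before` equals ln 3; after training, the largest entry of `predict(W, b, x)` should sit at the true label for each of the three samples.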

1.2 Neural Network

Fully connected neural network

        The fully connected layer, also known as a dense layer, is the most commonly used layer type in neural networks. It connects every neuron in its input to every neuron in its output.

        In a fully connected layer, the output of each neuron is a weighted sum of the outputs of all neurons in the previous layer plus a bias, passed through a nonlinear activation function. The weights and biases are the parameters learned during training. Because every input is connected to every output, fully connected layers often account for a large share of a model's parameters.

         We can treat the fully connected layer as a basic building block and stack several of them to build a fully connected neural network, also called a multilayer perceptron (MLP).
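The building-block idea can be sketched as a forward pass in plain Python: each layer computes activation(W x + b), and the network simply applies the layers in order. The choice of ReLU for the hidden layer, the layer sizes, and all parameter values are assumptions made for illustration.

```python
def dense(x, W, b, activation):
    """One fully connected layer: activation(W x + b), with W stored row by row."""
    return [activation(sum(w * x_j for w, x_j in zip(row, x)) + b_r)
            for row, b_r in zip(W, b)]


def relu(z):
    return max(0.0, z)


def identity(z):
    return z


def mlp_forward(x, layers):
    """Apply the layers in order; each layer is a (W, b, activation) triple."""
    for W, b, activation in layers:
        x = dense(x, W, b, activation)
    return x


# Made-up parameters: 2 inputs -> 3 hidden units (ReLU) -> 1 linear output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]], [0.0, 0.0, 0.0], relu),
    ([[1.0, 2.0, 3.0]], [0.5], identity),
]
output = mlp_forward([2.0, 1.0], layers)
```

For the input [2, 1] the hidden layer produces [1, 1.5, 0] (the third unit is clipped by ReLU), so the output is 1*1 + 2*1.5 + 3*0 + 0.5. The first layer alone has 3*2 + 3 = 9 parameters, which shows how quickly the count grows with layer width.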

Convolutional neural network

        The convolutional neural network (CNN) is a common deep learning architecture used for computer vision tasks such as image classification and object recognition. The book does not explain the principles of CNNs in detail and does not rely on them. Just remember: the input of a CNN is a matrix or third-order tensor; the CNN extracts features from this tensor and outputs a feature vector. Images are usually matrices (grayscale images) or third-order tensors (color images); a CNN can extract features from them, and one or more fully connected layers can then use those features for classification or regression.

1.3 Gradient Descent and Backpropagation 

Gradient descent

        Training both linear models and neural networks can be described as an unconstrained optimization problem:

\min_{\mathbf{w}^{(1)},\cdots ,\mathbf{w}^{(l)}} L(\mathbf{w}^{(1)},\cdots ,\mathbf{w}^{(l)})

        The most commonly used algorithms for this type of optimization problem are gradient descent and stochastic gradient descent. For a function of one variable we use the derivative, which is a scalar. For a function of several variables we use the gradient and partial derivatives: the gradient is a vector, each element of which is the partial derivative of the function with respect to one variable.

        In machine learning, our goal is to minimize the objective function, so we update the parameters in the direction opposite to the gradient. This is the gradient descent method, which gradually approaches a local minimum of the function. Given the objective function, we first compute its partial derivatives with respect to each parameter at the current values \mathbf{w}_{now}^{(1)}, \cdots ,\mathbf{w}_{now}^{(l)}; these form the gradient. We then multiply the gradient by a positive number \alpha, called the learning rate, and subtract the result from the current parameter values to obtain the new parameter values:

\mathbf{w}_{new}^{(i)}\leftarrow \mathbf{w}_{now}^{(i)}-\alpha \cdot \nabla _{\mathbf{w}^{(i)}}L(\mathbf{w}_{now}^{(1)}, \cdots ,\mathbf{w}_{now}^{(l)})

This process is repeated until a convergence condition is reached or a predefined number of iterations is reached. In this way, a local minimum of the objective function can be found.

Note that gradient descent updates the model parameters using the average gradient over all samples. It is not guaranteed to find the global minimum, because the objective may have multiple local minima. Stochastic gradient descent instead uses the gradient of a single randomly chosen sample for each update, and the randomness in its updates can help it escape local minima.

Backpropagation

Backpropagation is the algorithm used to compute gradients when training neural networks. It computes the gradient of the loss function with respect to each parameter; gradient descent then uses these gradients to update the parameters, so that the network's outputs move closer to the desired outputs.

Specifically, the algorithm first computes the output of the neural network by forward propagation and compares it with the true value to obtain the loss. It then uses the chain rule to compute the gradient of the loss with respect to each parameter, working backward from the output layer, and the parameters are updated by gradient descent. This process is repeated until a preset stopping condition is reached.
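The difference between the two methods can be seen on a toy objective, L(w) = 1/(2n) * sum_i (w - c_i)^2, whose minimizer is the mean of the c_i. The sketch below is only an illustration: the data, learning rates, step counts, and seed are arbitrary, and this convex objective has a single minimum, so it shows the noise in the stochastic updates rather than any escape from local minima.

```python
import random


def gradient_descent(data, w=0.0, lr=0.1, steps=200):
    """Each step uses the average gradient over all samples."""
    for _ in range(steps):
        grad = sum(w - c for c in data) / len(data)
        w -= lr * grad
    return w


def stochastic_gradient_descent(data, w=0.0, lr=0.01, steps=5000, seed=0):
    """Each step uses the gradient of one randomly chosen sample."""
    rng = random.Random(seed)
    for _ in range(steps):
        c = rng.choice(data)
        w -= lr * (w - c)  # gradient of 0.5 * (w - c)^2
    return w


data = [1.0, 2.0, 3.0, 6.0]   # mean is 3.0
w_gd = gradient_descent(data)
w_sgd = stochastic_gradient_descent(data)
```

`w_gd` settles essentially at the mean, while `w_sgd` keeps fluctuating around it because every step follows a different sample. A smaller learning rate shrinks those fluctuations at the cost of slower progress.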
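The forward pass, chain rule, and update steps described above can be sketched on a deliberately tiny network with one input, one tanh hidden unit, and one linear output, trained on a single sample with a squared loss. The architecture, loss, and numbers are assumptions chosen to keep every derivative visible. A central-difference check is included to confirm that the chain-rule gradients are correct.

```python
import math


def forward(params, x):
    """Tiny network: h = tanh(w1 * x + b1), y_hat = w2 * h + b2."""
    w1, b1, w2, b2 = params
    h = math.tanh(w1 * x + b1)
    return h, w2 * h + b2


def loss(params, x, y):
    _, y_hat = forward(params, x)
    return 0.5 * (y_hat - y) ** 2


def backprop(params, x, y):
    """Gradient of the loss w.r.t. (w1, b1, w2, b2) via the chain rule."""
    w1, b1, w2, b2 = params
    h, y_hat = forward(params, x)      # forward pass
    d_yhat = y_hat - y                 # dL/dy_hat
    d_w2, d_b2 = d_yhat * h, d_yhat    # output layer parameters
    d_h = d_yhat * w2                  # propagate back to the hidden unit
    d_z = d_h * (1.0 - h * h)          # tanh'(z) = 1 - tanh(z)^2
    d_w1, d_b1 = d_z * x, d_z          # hidden layer parameters
    return [d_w1, d_b1, d_w2, d_b2]


def numerical_gradient(params, x, y, eps=1e-6):
    """Central differences, used only to check backprop."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, x, y) - loss(down, x, y)) / (2 * eps))
    return grads


params = [0.5, -0.3, 1.2, 0.1]   # arbitrary starting values
x, y = 2.0, 1.0
analytic = backprop(params, x, y)
numeric = numerical_gradient(params, x, y)

# Gradient descent driven by the backpropagated gradients.
for _ in range(500):
    grads = backprop(params, x, y)
    params = [p - 0.1 * g for p, g in zip(params, grads)]
final_loss = loss(params, x, y)
```

`analytic` and `numeric` should agree closely, and `final_loss` should end up far below the starting loss. In a real network the same backward sweep runs layer by layer over vectors and matrices instead of scalars.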

Summary of this article

The first chapter of the book covers the basic machine learning concepts needed to get started with deep reinforcement learning, briefly introducing linear models, neural networks, gradient descent, and backpropagation. Readers with a good foundation in machine learning can skip this chapter. The content is fairly basic; for a deeper understanding, consult other books.

Origin blog.csdn.net/qq_42286607/article/details/130125662