[Strong recommendation] Study notes for Professor Li Hongyi's 2021 deep learning course (continuously updated)

Machine learning can be understood as the process of getting a machine to automatically find a function.


According to the type of function to be found, machine learning tasks can be classified as follows.

(Figure: a classification of machine-learning tasks.)

What AlphaGo does is also a classification problem: it takes the positions of the black and white stones on the current board as input, and the output is one of 19×19 classes (the position of the next move).

Suppose we know the daily view counts of Li Hongyi's YouTube channel for each day in the past three years and want to predict tomorrow's count. We can assume a linear model with two parameters, w and b: y = b + w·x₁, where the input x₁ is the previous day's count (say, that of 2.25) and y is the predicted count for the current day (say, 2.26).
(Figure: the linear model y = b + w·x₁.)

The loss function is a function of the model parameters; it evaluates how good the model and a given choice of parameters are. Here we can use the mean absolute error (MAE): substituting the first day's data into the model gives a prediction for the second day, and the absolute difference between this prediction and the true value is e₁; likewise, the second day's data gives a prediction for the third day with error e₂, and so on up to eₙ. Averaging these errors gives the MAE. Alternatives include the MSE (mean squared error) and the RMSE (root mean squared error).

(Figures: computing the loss from e₁ … eₙ, and the definitions of MAE and MSE.)
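As a minimal sketch of the idea (the view counts are made up, and the parameter values w = 0.97, b = 0.1, in thousands, are the ones quoted later in these notes), the loss can be computed like this in Python:

```python
import numpy as np

def predict(x_prev, w, b):
    """Linear model: tomorrow's views predicted from today's views."""
    return w * x_prev + b

def mae(y_true, y_pred):
    """Mean absolute error: average of |e_1|, ..., |e_n|."""
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    """Mean squared error."""
    return np.mean((y_true - y_pred) ** 2)

# hypothetical daily view counts, in thousands
views = np.array([4.8, 4.9, 7.5, 3.4, 9.8, 5.3, 3.5])
pred = predict(views[:-1], w=0.97, b=0.1)   # day t predicts day t+1
print("MAE:", mae(views[1:], pred))
```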

We can try different values of w and b, compute the loss for each pair, and draw a contour map of the result: the redder a region, the larger the loss; the bluer, the smaller. The best parameters appear to lie around w = 1, b = 250. For a more precise search we can use gradient descent. As shown in the figure below, consider the curve of the loss L as w varies (with b fixed): the size of each update step depends not only on the derivative at the current point but also on a learning rate that we set ourselves. Parameters such as w and b, which the machine learns by itself, are the model's parameters; quantities such as the learning rate, which we choose by hand, are called hyperparameters. When the weight reaches a point wᵗ where the gradient is 0, it stops updating, so gradient descent can easily get stuck in a local minimum. In practice, however, it rarely does, and local minima are not the real concern when training neural networks.

(Figure: gradient descent along the error curve of L with respect to w.)

The same method is used when there are two parameters to learn, as follows:

(Figures: gradient descent with the two parameters w and b.)

When the gradient is negative, the parameter should increase toward the point where the gradient is 0; when the gradient is positive, it should decrease. This is why the update rule carries a negative sign: w ← w − η·(∂L/∂w), where η is the learning rate.
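A minimal sketch of this update loop in Python (the course uses MAE as the loss; MSE is used here only because it is differentiable everywhere, and the data are made up):

```python
import numpy as np

def gradient_descent(x, y, lr=1e-4, steps=10_000):
    """Fit y ≈ w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        err = (w * x + b) - y            # prediction error on every sample
        grad_w = np.mean(2 * err * x)    # dL/dw
        grad_b = np.mean(2 * err)        # dL/db
        w -= lr * grad_w                 # the negative sign: step against the gradient
        b -= lr * grad_b
    return w, b

x = np.array([4.8, 4.9, 7.5, 3.4, 9.8, 5.3])   # hypothetical previous-day views
y = np.array([4.9, 7.5, 3.4, 9.8, 5.3, 3.5])   # hypothetical next-day views
print(gradient_descent(x, y))
```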

At this point the training of the model is complete. The linear model achieves a lowest loss of 0.48k on the training set. We then use it to predict the data from 2021.1.1 to 2021.2.14; the mean absolute difference between the predicted and actual values is 0.58k.
(Figures: the trained model, and its predictions from 1.1 to 2.14 against the actual values.)

The figure above compares the predicted and actual values from 1.1 to 2.14. Except for the first day, each day's prediction looks like the previous day's actual value shifted one day to the right. This is not hard to understand: each day's prediction is just the previous day's actual value multiplied by 0.97 plus 100, so the two curves can never differ by much. But we also notice that the real values are periodic, generally with a period of 7 days; a plausible explanation is that people go out to play on weekends. Once we discover this pattern, predicting each day from only the previous day is no longer appropriate; we should instead use the data of the previous seven days.

(Figure: the model that uses the previous 7 days, with its trained weights and losses.)

The second line of the figure lists the new model, where wⱼ is the weight on the value from j days ago. Judging from the weights obtained after training, the previous day's weight, 0.79, has the greatest influence on the next day's prediction. The final losses on the training and test sets are 0.38k and 0.49k respectively. If we go further and use the data of even more past days, the two losses barely drop. We are still using a linear model, and it seems this simple model's performance cannot be improved much further here.
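A sketch of how such a j-day model could be set up (the series here is random placeholder data, and ordinary least squares stands in for the course's gradient-descent training):

```python
import numpy as np

def make_windows(series, j=7):
    """Each row of X holds the previous j days; y is the day that follows."""
    X = np.stack([series[i:i + j] for i in range(len(series) - j)])
    return X, series[j:]

series = 5 + np.random.randn(400)            # placeholder daily view counts
X, y = make_windows(series, j=7)
A = np.hstack([X, np.ones((len(X), 1))])     # extra column for the bias b
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = coef[:-1], coef[-1]                   # y = b + sum_j w_j * x_j
```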

We then introduce piecewise linear functions. The red curve below, for example, can be obtained by summing several blue curves and adding a constant. For a general curve, we can pick many points along it and connect neighboring points with straight segments; this yields a more complex piecewise linear function, which can be built by the same kind of superposition, only using many more blue curves.
(Figure: a red piecewise linear curve obtained by summing blue curves plus a constant.)

Having seen that any curve can be approximated this way, we need to know what function the blue curve is. As shown below, it is derived from the sigmoid function; because its turns are sharper, it is called the hard sigmoid.
(Figure: the sigmoid function and the hard sigmoid.)

The influence of the parameters w, b, and c on the sigmoid curve is as follows:

(Figures: how w changes the slope of the sigmoid, b shifts it left and right, and c scales its height.)
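A small sketch of this building block and of the superposition from the earlier figures (the function names and the example parameter values are mine):

```python
import numpy as np

def soft_step(x, c, b, w):
    """c * sigmoid(b + w*x): w sets the slope, b shifts the step, c sets its height."""
    return c / (1 + np.exp(-(b + w * x)))

def curve(x, blocks, const):
    """A sum of sigmoid blocks plus a constant approximates the red curve."""
    return const + sum(soft_step(x, c, b, w) for c, b, w in blocks)

x = np.linspace(-5, 5, 200)
y = curve(x, blocks=[(1.0, 0.0, 2.0), (-0.5, 1.0, 3.0)], const=0.3)
```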

In this way, we extend the linear model to an arbitrary curve approximated by the superposition of several sigmoid functions plus a constant term, and we likewise extend the model that considers the previous j days to this final, more flexible form.

(Figure: the extended model built from sigmoid functions.)

What the figure above does is this: the model uses the red piecewise linear function in the figure below, and takes the view counts of the previous three days as features to predict the next day's count. Drawing this out gives the network topology shown above.

(Figure: the red piecewise linear function used by the model.)

We then represent the mapping from x to r as a matrix multiplication.

(Figure: r = b + Wx written as a matrix multiplication.)

The final network structure is as follows:

(Figure: the final network structure.)

Expressed in the vector and matrix notation of linear algebra, it is the formula below the figure: y = b + cᵀσ(b + Wx), where the inner b is a vector, the outer b is a scalar, and σ is the sigmoid applied element-wise.

(Figure: the model written in vector-matrix form.)
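That formula translates almost line for line into code; here is a sketch with made-up shapes (3 input days, 3 sigmoid units, random parameters):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(x, W, b_vec, c, b):
    """y = b + c^T · sigmoid(b_vec + W @ x)."""
    r = b_vec + W @ x              # one pre-activation per sigmoid unit
    return b + c @ sigmoid(r)

rng = np.random.default_rng(0)
x = np.array([4.8, 4.9, 7.5])      # views of the previous 3 days (made up)
W, b_vec = rng.normal(size=(3, 3)), rng.normal(size=3)
c, b = rng.normal(size=3), 0.0
y = forward(x, W, b_vec, c, b)
```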

With this slightly more complicated model there are many parameters to determine: the scalar b, the transpose of the vector c, the vector b, and the matrix W. We collect all of their elements into a single long vector and denote it θ, so that gradient descent can later be applied to all of them at once. The first figure below defines the loss function under this model; the second and third show the gradient-descent solution process, which iterates until the gradient is 0 or until we have run as many iterations as we care to.

(Figures: the loss L(θ) under the new model, and the gradient-descent updates of θ.)
Usually, because the dataset is large, we divide it into several batches and optimize the parameters batch by batch. Each parameter optimization on one batch is called an update (or iteration); one full pass over all the batches is called an epoch. For example, if a dataset contains 1000 samples and we divide it into 100 batches of 10 samples each, then each batch contributes one update, so one epoch contains 100 updates.
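A skeleton of this loop (the gradient step itself is elided):

```python
import numpy as np

n_samples, batch_size = 1000, 10
n_batches = n_samples // batch_size           # 100 batches
for epoch in range(3):                        # 3 epochs = 3 passes over the data
    order = np.random.permutation(n_samples)  # shuffle once per epoch
    for i in range(n_batches):                # 100 updates per epoch
        batch = order[i * batch_size:(i + 1) * batch_size]
        pass  # compute the gradient on this batch and update θ: one "update"
```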

So far, the hyperparameters we can set include the learning rate, the number of sigmoid functions (that is, the number of neurons), and the batch size. These are all quantities we tune ourselves.

In the curve-fitting process above, besides the sigmoid function we can also use the ReLU (Rectified Linear Unit) to fit the curve.

(Figures: the ReLU function, and fitting the curve with ReLUs.)

Note that in the figure above the summation over ReLU terms runs to 2i: two ReLU functions superimpose into one hard sigmoid, and the hard sigmoid is the basic unit of the sigmoid-based fitting.

These are the two most common activation functions; the model fits the target curve by superimposing the outputs of these activation functions.
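A quick sketch of why two ReLUs suffice, assuming a step that rises from 0 to height c between x₁ and x₂ (the parameterization here is my own):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def hard_sigmoid(x, x1, x2, c):
    """Difference of two ReLUs: 0 before x1, a linear ramp, then flat at height c.
    This is why the ReLU sum needs 2i terms where the sigmoid sum needs i."""
    slope = c / (x2 - x1)
    return relu(slope * (x - x1)) - relu(slope * (x - x2))

x = np.linspace(-3, 3, 7)
print(hard_sigmoid(x, x1=-1.0, x2=1.0, c=2.0))
```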

With ReLU as the activation function, the loss on the training set and test set for different numbers of neurons is as follows:

(Figure: training and test loss for different numbers of ReLU units.)

Further, we can achieve a lower loss by increasing the number of layers of the neural network.

(Figure: stacking layers to build a deeper network.)

The results are as follows. Each layer contains 100 neurons with ReLU activations, and the input features are the view counts of the previous 56 days (in the linear model, using 56 days gave the lowest loss).
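A PyTorch-style sketch of such a network (the choice of three hidden layers here is only illustrative; the figure below compares several depths):

```python
import torch.nn as nn

# input: views of the previous 56 days; each hidden layer: 100 ReLU neurons
model = nn.Sequential(
    nn.Linear(56, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 1),   # predicted views for the next day
)
```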

(Figure: training and test loss for different numbers of layers.)

The comparison between the real value and the predicted value in the test set is as follows.

(Figures: real vs. predicted values on the test set.)

When the number of layers is increased further, the performance on the training set keeps improving, but the results on the test set become worse. This is overfitting: the network has fallen into its own small world, fitting the training data instead of generalizing.


Source: blog.csdn.net/cyy0789/article/details/120444076