Python.NN (1) Introduction to Neural Networks
1
Using the perceptron for multi-class classification
When we introduced the perceptron in the earlier article Python · SVM (1) · Perceptron, we used the formula y = sign(w·x + b), which requires the label y of each sample to be either +1 or -1.
This is fine for binary classification, but it causes problems for multi-class tasks.
Although there are various ways to make a binary classifier handle multiple classes (one-vs-rest, one-vs-one, and so on), those methods are generally rather cumbersome.
To let the model adapt to multi-class tasks naturally, we usually encode the label y of each sample as a one-hot vector: for a K-class problem, y = (0, ..., 0, 1, 0, ..., 0)^T, where the single 1 sits at the index of the true class.
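As a concrete illustration, here is a small NumPy sketch of this encoding (the function name `to_one_hot` and the example labels are my own, not from the article):

```python
import numpy as np

def to_one_hot(labels, num_classes):
    """Convert integer labels of shape (N,) into a one-hot matrix of shape (N, K)."""
    one_hot = np.zeros((len(labels), num_classes))
    one_hot[np.arange(len(labels)), labels] = 1.0
    return one_hot

# e.g. three samples in a 4-class task
y = to_one_hot([0, 2, 3], 4)
# each row contains a single 1 at the index of the true class
```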
The perceptron then takes the following form (using a K-class task with N samples of n-dimensional features as an example; for brevity, the bias b is omitted):
(figure: structure of the multi-class perceptron; a hand-drawn sketch, please forgive how ugly it is)
The model's representation changes accordingly: instead of a weight vector w ∈ R^n we now have a weight matrix W ∈ R^(n×K), and the prediction for a sample x becomes y_hat = W^T x. Notice that the original output was a single number, whereas after the transformation the output is a K-dimensional vector.
Because of this, we can no longer simply use the previously defined perceptron loss; we need to define a new loss function.
Since our goal is to make the model's output vector as close as possible to the true one-hot label vector, and Euclidean distance is a natural measure of closeness, defining the loss through Euclidean distance is a natural choice.
Specifically, with the sample matrix X ∈ R^(N×n) and the label matrix Y ∈ R^(N×K), we can define the loss function as

L(W) = 1/2 * ||XW - Y||^2

(the squared Frobenius norm of the residual).
With the loss in hand, we can compute its derivative. It should be pointed out that although the formula below may look obvious, we are doing matrix differentiation, so the real reasoning behind it is not as trivial as it appears.
Interested readers can take a look at the article Matrix Derivation; here I only state the result (details are attached at the end of this article):

dL/dW = X^T (XW - Y)
Using it, we can write the corresponding gradient descent training algorithm:
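A minimal sketch of such a training loop in NumPy (my own implementation, not the article's original code; the toy dataset and hyper-parameters are made up for illustration):

```python
import numpy as np

def train_multiclass_perceptron(x, y, lr=0.1, epochs=200):
    """Gradient descent on L(W) = 1/2 * ||xW - y||^2.

    x: (N, n) sample matrix; y: (N, K) one-hot label matrix.
    """
    w = np.zeros((x.shape[1], y.shape[1]))
    for _ in range(epochs):
        grad = x.T @ (x @ w - y)      # dL/dW = x^T (xW - y)
        w -= lr / len(x) * grad       # average the gradient over the N samples
    return w

def predict(x, w):
    # the predicted class is the index of the largest output component
    return np.argmax(x @ w, axis=1)

# toy linearly separable data: class 1 iff the first coordinate is larger
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
labels = (x[:, 0] > x[:, 1]).astype(int)
w = train_multiclass_perceptron(x, np.eye(2)[labels])
accuracy = np.mean(predict(x, w) == labels)
```

On linearly separable data like this the model does well; the ~50% accuracy mentioned below arises on data whose class boundary is not linear.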
(figure: decision regions produced by the multi-class perceptron)
Its accuracy here is only about 50%.
Notice that the model's output contains essentially no samples of the category along the diagonal, even though that category happens to be the largest one. This is because extending the perceptron to multiple classes does not change its nature as a linear model: the decision boundaries are still linear, so it cannot fit the diagonal boundary.
2
From perceptron to neural network
So how can we turn the perceptron into a nonlinear model? Two ideas come to mind:
- The existing perceptron has only two layers (an input layer and an output layer); can we add more layers?
- Intuitively, the kernel method maps the sample space into a high-dimensional space via a kernel function that satisfies certain conditions; if we relax those conditions and use some generic well-behaved functions instead, would that still work?
Based on these two ideas, we can extend the perceptron structure drawn above into the following one:
Here, conventionally (assuming N samples in total):
- We call φ the "activation function"
- We call W1 ∈ R^(n×m) and W2 ∈ R^(m×K) the weight matrices, so the forward pass is y_hat = φ(X W1) W2
- We call the layers added in the middle "hidden layers"; for brevity, the discussion below assumes only one hidden layer is added. The multi-layer case will be discussed in the next article
- We call each circle in the figure a "neuron". Taking the picture above as an example:
  - the input layer has n neurons
  - the hidden layer has m neurons
  - the output layer has K neurons
The activation function φ is exactly the "generic well-behaved function" mentioned above. An overview of common activation functions is attached at the end of this article; for now, let us use the ReLU we already met in the SVM articles, that is

ReLU(x) = max(0, x)
This structure looks much more powerful, but taking derivatives becomes considerably messier (the loss function is still the Euclidean one; again only the results are given here, with details at the end of the article). Writing H = ReLU(X W1) and y_hat = H W2:

dL/dW2 = H^T (y_hat - Y)
dL/dW1 = X^T [((y_hat - Y) W2^T) * ReLU'(X W1)]

Here "*" denotes element-wise multiplication (or, a bit more formally, the Hadamard product), and ReLU' denotes the derivative of ReLU. Since ReLU(x) = max(0, x), its derivative is very simple: ReLU'(x) = 1 if x > 0 and 0 otherwise.
Using these two derivation formulas, the corresponding gradient descent algorithm is relatively easy to implement:
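The following NumPy sketch puts the two gradient formulas above into a training loop (my own code, not the article's; the XOR-style toy data and hyper-parameters are assumptions for illustration):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def train_nn(x, y, hidden=32, lr=0.1, epochs=5000, seed=0):
    """One-hidden-layer network with ReLU and squared-error loss.

    Forward:  h = ReLU(x W1),  y_hat = h W2
    Backward: dL/dW2 = h^T (y_hat - y)
              dL/dW1 = x^T [((y_hat - y) W2^T) * ReLU'(x W1)]
    """
    rng = np.random.default_rng(seed)
    w1 = rng.normal(scale=0.5, size=(x.shape[1], hidden))
    w2 = rng.normal(scale=0.5, size=(hidden, y.shape[1]))
    for _ in range(epochs):
        a = x @ w1                                   # hidden pre-activation
        h = relu(a)
        delta = (h @ w2 - y) / len(x)                # (y_hat - y), averaged
        grad_w2 = h.T @ delta
        grad_w1 = x.T @ ((delta @ w2.T) * (a > 0))   # ReLU'(a) = [a > 0]
        w2 -= lr * grad_w2
        w1 -= lr * grad_w1
    return w1, w2

# XOR-style data, which no linear model can classify correctly
x = np.array([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]])
labels = np.array([0, 1, 1, 0])
w1, w2 = train_nn(x, np.eye(2)[labels])
pred = np.argmax(relu(x @ w1) @ w2, axis=1)
```

Note that both gradients are computed before either weight matrix is updated, so the backward pass is consistent with the forward pass of the same step.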
(figure: decision regions produced by the neural network)
Although the result is still fairly poor (accuracy around 70%), the model is starting to take shape.
3
Use Softmax + Cross Entropy
Attentive readers may have noticed that I lowered the default learning rate of the model above, because a slightly larger learning rate makes the model blow up immediately.
This is because we left the last layer unchanged and used its raw output directly, so the final model's output can grow without bound, which is of course not what we want to see.
Considering that the label is a one-hot vector, it can also be viewed as a probability distribution. So can we turn the model's output into a probability vector as well?
In fact, the well-known Softmax does exactly this job. Specifically, given a vector v = (v_1, ..., v_K), we define

Softmax(v)_k = e^(v_k) / Σ_j e^(v_j)
It is not difficult to see that the result is a probability vector, and the definition is quite reasonable intuitively.
In practical implementations of Softmax there is a small trick for numerical stability (subtracting the maximum component before exponentiating); the details are attached at the end of the article, so I will set it aside for now.
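As a preview of that trick, a minimal NumPy sketch (my own code, not the article's): since e^(v - c) / Σ e^(v - c) = e^v / Σ e^v for any constant c, subtracting the row maximum changes nothing mathematically but prevents overflow.

```python
import numpy as np

def softmax(v):
    """Row-wise softmax with the usual numerical-stability trick:
    shift each row by its maximum before exponentiating."""
    shifted = v - v.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# large inputs like 1000 would overflow np.exp without the shift
p = softmax(np.array([[1000.0, 1000.0], [0.0, np.log(3.0)]]))
```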
So after applying Softmax, the output of our model becomes a probability vector.
It is true that Euclidean distance could still serve as the loss function at this point, but a generally better choice is Cross Entropy.
Specifically, for a true value y and a predicted value y_hat (both in [0, 1]), the cross entropy is

H(y, y_hat) = -[y ln(y_hat) + (1 - y) ln(1 - y_hat)]
Cross entropy has the following two properties:
- When the true value is 0 (y = 0), the cross entropy reduces to -ln(1 - y_hat): the closer the predicted value is to 0, the closer the cross entropy is to 0; conversely, as the prediction tends to 1, the cross entropy tends to infinity.
- When the true value is 1 (y = 1), the cross entropy reduces to -ln(y_hat): the closer the predicted value is to 1, the closer the cross entropy is to 0; conversely, as the prediction tends to 0, the cross entropy tends to infinity.
So taking cross entropy as the loss function is reasonable. When applying cross entropy in practice, there is again a small trick for numerical stability: add a small constant ε inside the log to avoid log 0, i.e. use ln(y_hat + ε).
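A sketch of this in NumPy (my own code; for one-hot labels and a probability-vector output, only the -Σ_k y_k ln(y_hat_k + ε) term matters, which is the form used here):

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross entropy between one-hot labels and predicted
    probabilities; eps guards against taking log(0)."""
    return -np.mean(np.sum(y_true * np.log(y_pred + eps), axis=1))

y_true = np.array([[1.0, 0.0]])
loss_good = cross_entropy(y_true, np.array([[0.99, 0.01]]))  # confident and right
loss_bad = cross_entropy(y_true, np.array([[0.01, 0.99]]))   # confident and wrong
loss_zero = cross_entropy(y_true, np.array([[0.0, 1.0]]))    # would be log(0) without eps
```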
With these two ingredients in place, we can again compute the derivative.
Although the derivation is more involved, the final result is, perhaps surprisingly, almost identical to the previous one; the difference is only a constant factor (see the end of the article for the derivation). With p = Softmax(H W2):

dL/dW2 = H^T (p - Y)
dL/dW1 = X^T [((p - Y) W2^T) * ReLU'(X W1)]
So the corresponding implementation is almost the same:
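A NumPy sketch combining the pieces (again my own code, with an assumed XOR-style toy dataset); compared with the earlier loop, the only change is that the output passes through Softmax and the output-layer error becomes (p - y):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(v):
    e = np.exp(v - v.max(axis=1, keepdims=True))  # stability shift
    return e / e.sum(axis=1, keepdims=True)

def train_nn_softmax(x, y, hidden=32, lr=1.0, epochs=3000, seed=0):
    """Same network as before, but with Softmax output and cross-entropy
    loss; the gradient at the output simplifies to (p - y), so the update
    rules barely change and a larger learning rate is tolerated."""
    rng = np.random.default_rng(seed)
    w1 = rng.normal(scale=0.5, size=(x.shape[1], hidden))
    w2 = rng.normal(scale=0.5, size=(hidden, y.shape[1]))
    for _ in range(epochs):
        a = x @ w1
        h = relu(a)
        p = softmax(h @ w2)
        delta = (p - y) / len(x)
        grad_w2 = h.T @ delta
        grad_w1 = x.T @ ((delta @ w2.T) * (a > 0))
        w2 -= lr * grad_w2
        w1 -= lr * grad_w1
    return w1, w2

x = np.array([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]])
labels = np.array([0, 1, 1, 0])
w1, w2 = train_nn_softmax(x, np.eye(2)[labels])
pred = np.argmax(relu(x @ w1) @ w2, axis=1)
```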
(figure: decision regions produced by the network with Softmax + Cross Entropy)
Although it is still not perfect, it has two advantages compared with the model that does not use Softmax + Cross Entropy:
- The learning rate can be set larger, which means the model is much less prone to blowing up (whatever that means)
- The training of the model is more stable, that is, the results of each training are similar.
- By contrast, the result shown for the previous model was actually carefully cherry-picked; in general its performance looks more like this (which tells us that a good result is often piled on top of countless terrible ones......):
(figure: a typical, worse run of the previous model)
4
Related mathematical theories
1) Common activation functions
A. Sigmoid: σ(x) = 1 / (1 + e^(-x))
B. Tanh: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
C. ReLU: ReLU(x) = max(0, x)
D. ELU: ELU(x) = x if x > 0, α(e^x - 1) otherwise
E. Softplus: Softplus(x) = ln(1 + e^x)
Recently an activation function called SELU was also published; the paper runs to 102 pages... Interested readers can refer to [1706.02515] Self-Normalizing Neural Networks.
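A minimal NumPy sketch of the common activation functions listed above (Sigmoid, Tanh, ReLU, ELU, Softplus), for reference:

```python
import numpy as np

def sigmoid(x):
    # squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # squashes any real input into (-1, 1)
    return np.tanh(x)

def relu(x):
    # zero for negative inputs, identity for positive ones
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    # like ReLU, but decays smoothly to -alpha for negative inputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def softplus(x):
    # a smooth approximation of ReLU; log1p improves accuracy near 0
    return np.log1p(np.exp(x))
```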
2) Computing derivatives in neural networks
For concision, we will use matrix-derivative techniques in the calculations below; if this looks too convoluted, it is recommended to consult the article mentioned earlier and derive element by element from the definition (in fact, I often cannot get my own head around it either......)
Going from simple to complex, let us first look at the derivation for the multi-class perceptron. As said before, the multi-class perceptron can be written as Y_hat = XW, where (assuming N samples) X ∈ R^(N×n) is the sample matrix, W ∈ R^(n×K) the weight matrix, and Y ∈ R^(N×K) the label matrix.
Many people, when computing dL/dW directly in this form, reach for the rules for differentiating a vector with respect to a matrix; that is not exactly wrong, but it is likely to over-complicate the problem.
In fact, the essence of the multi-class perceptron is the prediction y_hat = W^T x for a single sample x; the matrix form merely stacks N such samples. So the derivation reduces to differentiating a scalar with respect to a matrix: for one sample, L = 1/2 * ||W^T x - y||^2, and

dL/dW = x (W^T x - y)^T
This settles the single-sample case; how should it be extended to the sample matrix?
Although a rigorous mathematical treatment is more troublesome (it requires vectorizing the matrix [also known as "straightening"] or similar machinery), we can understand it intuitively as follows:
- When we go from one sample to many, the derivative with respect to the weight matrix is simply the sum of the single-sample derivatives.
So we only need matrix multiplication to carry out this summation. Concretely, we can intuitively write

dL/dW = Σ_i x_i (W^T x_i - y_i)^T = X^T (XW - Y)

and then, for the neural network (with H = ReLU(X W1) and Y_hat = H W2),

dL/dW2 = H^T (Y_hat - Y)

as well as

dL/dW1 = X^T [((Y_hat - Y) W2^T) * ReLU'(X W1)]
3)Softmax + Cross Entropy
Write v = H W2 for the input to the Softmax and p = Softmax(v) for the model's output, with loss L = -Σ_k y_k ln(p_k). Since dp_k/dv_j = p_k (δ_kj - p_j), we get

dL/dv_j = -Σ_k y_k (δ_kj - p_j) = p_j - y_j   (using Σ_k y_k = 1),

that is, dL/dv = p - y. Plugging this into the chain rule gives gradients of exactly the same form as before, just with (p - Y) in place of (Y_hat - Y).
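The claim that the gradient of cross entropy with respect to the Softmax input is simply p - y can be checked numerically; a sketch comparing the analytic result against central finite differences (the test vectors are arbitrary):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def loss(v, y):
    # cross entropy of the softmax output against one-hot labels y
    return -np.sum(y * np.log(softmax(v)))

v = np.array([0.3, -1.2, 2.0])
y = np.array([0.0, 1.0, 0.0])

# analytic gradient: dL/dv = p - y
analytic = softmax(v) - y

# central finite differences as an independent check
eps = 1e-6
numeric = np.zeros_like(v)
for i in range(len(v)):
    d = np.zeros_like(v)
    d[i] = eps
    numeric[i] = (loss(v + d, y) - loss(v - d, y)) / (2 * eps)
```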
In this article we mainly discussed how to apply the perceptron to multi-class tasks, and how two intuitive ideas (deepening the perceptron and applying activation functions) lead to a more powerful model: the neural network.
We also discussed how applying Softmax + Cross Entropy makes the model more stable.
However, the scope of our discussion is still limited to a single hidden layer and the ReLU activation function. In the next article, we will introduce the more general case and use visualizations to illustrate the power of neural networks.
Recommended reading:
Why should the data be normalized?
Logistic function and softmax function
Video explanation | Why the parameters of a neural network cannot all be initialized to 0