A Very Detailed Introduction to Neural Networks

Python.NN (1) Introduction to Neural Networks

1. Using the perceptron for multi-class classification

When we introduced the perceptron in Python · SVM (1) · Perceptron (the post "SVM (1) · The most comprehensive perceptron summary"), we used the formula $y = \operatorname{sign}(w \cdot x + b)$, and the label $y$ of each sample was required to be either $+1$ or $-1$.

This is perfectly fine for binary classification, but it causes problems for multi-class classification.

Although there are various ways to press a binary classifier into service for multi-class tasks, those methods are generally rather cumbersome.

To make our model handle multi-class tasks naturally, we usually take the label $y$ of each sample to be a one-hot vector, that is

$y = (0, \dots, 0, 1, 0, \dots, 0)^T$

where the single 1 sits in position $k$ if the sample belongs to class $k$.
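For readers who want something concrete, here is a minimal NumPy sketch of this one-hot conversion (the function name and the example labels are mine, purely for illustration):

```python
import numpy as np

def to_one_hot(labels, K):
    """Convert integer class labels of shape (N,) into one-hot vectors of shape (N, K)."""
    one_hot = np.zeros((len(labels), K))
    one_hot[np.arange(len(labels)), labels] = 1
    return one_hot

# e.g. 4 samples, 3 classes
print(to_one_hot(np.array([0, 2, 1, 2]), K=3))
```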
At this point our perceptron becomes the following (taking a K-class task with N samples of n-dimensional features as an example; for brevity the bias b is omitted):

[Figure: the multi-class perceptron, with n input neurons fully connected to K output neurons]

(Please forgive my crude mouse drawing.)

The model's representation accordingly changes to $\hat y = W^T x$, where $W \in \mathbb{R}^{n \times K}$. Notice that the original output was a single number, whereas after this change the output is a K-dimensional vector.

For this reason we can no longer simply reuse the loss function defined for the original perceptron; we need to define a new one.

Since our goal is to make the model's output vector $\hat y$ as close as possible to the true (one-hot) label vector $y$, and Euclidean distance is a natural measure of "closeness", it is quite natural to define the loss function in terms of Euclidean distance.

Specifically, we can define the loss function as $L(W) = \sum_{i=1}^{N} \|W^T x_i - y_i\|^2 = \|XW - Y\|^2$, where $X \in \mathbb{R}^{N \times n}$ stacks the samples row by row and $Y \in \mathbb{R}^{N \times K}$ stacks the corresponding one-hot labels.

With the loss in hand, we can compute its derivative. It should be pointed out that although the formula below looks obvious, we are doing matrix differentiation here, so the real logic behind it is not as trivial as it seems.

Interested readers can take a look at an article on matrix derivatives; here I simply state the result, with the details attached at the end of the article:

$\dfrac{\partial L}{\partial W} = 2\,X^T (XW - Y)$

Using it, we can write the corresponding gradient descent training algorithm:
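A minimal NumPy sketch of such a training loop, using the loss and gradient above (the function names and hyper-parameter values are illustrative only, not the author's original code):

```python
import numpy as np

def train_multiclass_perceptron(X, Y, lr=1e-3, epochs=1000):
    """Gradient descent for the multi-class perceptron Y_hat = X @ W.

    X: (N, n) sample matrix; Y: (N, K) one-hot label matrix.
    Loss: L(W) = ||X W - Y||^2, gradient: dL/dW = 2 X^T (X W - Y).
    """
    n, K = X.shape[1], Y.shape[1]
    W = np.random.randn(n, K) * 0.01
    for _ in range(epochs):
        Y_hat = X @ W
        W -= lr * (2 * X.T @ (Y_hat - Y))
    return W

def predict(X, W):
    # the predicted class is the index of the largest output component
    return np.argmax(X @ W, axis=1)
```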

The performance of the model is as follows:

[Figure: classification results of the multi-class perceptron]

The accuracy at this point is only about 50%.

It can be seen that the model's output contains essentially none of the class lying along the diagonal (which happens to be the largest class). This is because extending the perceptron to multiple classes does not change its nature as a linear model, so its decision boundaries are still linear and simply cannot fit the diagonal boundary.

2. From the perceptron to the neural network

So how can we turn the perceptron into a nonlinear model? Specifically:

  • The existing perceptron has only two layers (input layer and output layer). Can we add more layers?
  • Intuitively, the kernel method uses a kernel function satisfying certain conditions to map the sample space into a high-dimensional space; if we relax those conditions and use some reasonably well-behaved general functions instead, would that also work?

Based on these two ideas, we can turn the perceptron structure drawn above into the following structure:

[Figure: a network with an input layer of n neurons, a hidden layer of m neurons, and an output layer of K neurons]

Here we usually adopt the following terminology (assuming N samples in total):

  • we call $\phi$ the "activation function";
  • we call $W_1 \in \mathbb{R}^{n \times m}$ and $W_2 \in \mathbb{R}^{m \times K}$ the weight matrices;
  • we call the layers added in the middle "hidden layers"; for brevity, the discussion below assumes only one hidden layer is added, and the multi-layer case is left to the next article;
  • we call each circle in the figure a "neuron"; taking the figure above as an example:
  • the input layer has n neurons,
  • the hidden layer has m neurons,
  • the output layer has K neurons.

The activation function is precisely the "reasonably well-behaved general function" mentioned above; an introduction to the commonly used activation functions is attached at the end of the article. For now we take the activation function to be the ReLU we have already met in the SVM series, that is $\mathrm{ReLU}(x) = \max(0, x)$, so the model's output becomes $\hat Y = \mathrm{ReLU}(XW_1)\,W_2$.

This structure looks much more powerful, but computing its derivatives becomes considerably more troublesome (the loss function is still the Euclidean distance; again, only the result is given here and the details are attached at the end of the article):

$\dfrac{\partial L}{\partial W_2} = 2\,\mathrm{ReLU}(XW_1)^T (\hat Y - Y), \qquad \dfrac{\partial L}{\partial W_1} = 2\,X^T \Big[ (\hat Y - Y)\,W_2^T * \mathrm{ReLU}'(XW_1) \Big]$

Among them, "*" represents the element-wise operation of multiplication (or Hadamard product in a little more professional terms), and ReLU' represents the derivative of the ReLU function. Since Dry goods|Very detailed neural network entry explanationwe have when it is ReLU Dry goods|Very detailed neural network entry explanation, its derivation is also very simple:
Dry goods|Very detailed neural network entry explanation

Using these two gradient formulas, the corresponding gradient descent algorithm is fairly easy to implement:
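A rough NumPy sketch of the idea (one hidden layer, ReLU activation, Euclidean loss; names, initialisation and hyper-parameters are illustrative assumptions, not the author's original code):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def train_one_hidden_layer(X, Y, m=24, lr=1e-4, epochs=1000):
    """One-hidden-layer network Y_hat = ReLU(X W1) W2, trained with L = ||Y_hat - Y||^2."""
    n, K = X.shape[1], Y.shape[1]
    W1 = np.random.randn(n, m) * 0.1
    W2 = np.random.randn(m, K) * 0.1
    for _ in range(epochs):
        H = relu(X @ W1)              # hidden activations, shape (N, m)
        Y_hat = H @ W2                # outputs, shape (N, K)
        delta = 2 * (Y_hat - Y)       # dL/dY_hat
        grad_W2 = H.T @ delta
        grad_W1 = X.T @ ((delta @ W2.T) * (X @ W1 > 0))   # ReLU'(X W1) is 1 where X W1 > 0
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2
    return W1, W2
```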

The performance of the model is as follows:
[Figures: classification results of the network with one hidden layer]

Although the result is still fairly poor (accuracy around 70%), it is starting to take shape.

3. Using Softmax + Cross Entropy

Observant readers may have noticed that I turned the default learning rate of the model above down quite a bit, because a slightly larger learning rate makes the model blow up immediately.

This is because we did not change the last layer, but simply used the raw linear output $\hat Y = \mathrm{ReLU}(XW_1)\,W_2$ as-is. As a result, the final outputs of the model can easily grow without bound, which is of course not what we want to see.

Considering that the label is a one-hot vector, it can also be viewed, from another angle, as a probability distribution vector. So can we turn the model's output into a probability vector as well?

In fact, the well-known Softmax does exactly this job. Specifically, given a vector $v = (v_1, \dots, v_K)^T$, we have:

$\mathrm{softmax}(v) = \left( \dfrac{e^{v_1}}{\sum_{k=1}^{K} e^{v_k}}, \dots, \dfrac{e^{v_K}}{\sum_{k=1}^{K} e^{v_k}} \right)^T$

It is not difficult to see that this is a probability vector, and it is quite reasonable from an intuitive point of view.

When Softmax is applied in practice, there is a small trick to improve numerical stability. The details are attached at the end of the article, so I will set it aside for now.

So after applying Softmax, the output of our model becomes a probability vector.
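As a concrete illustration, here is my own NumPy rendering, including the usual max-subtraction stability trick (which I believe is the kind of trick alluded to above):

```python
import numpy as np

def softmax(U):
    """Row-wise softmax of an (N, K) matrix of scores."""
    U = U - U.max(axis=1, keepdims=True)   # shifting each row does not change the result,
    exp_U = np.exp(U)                      # but keeps the exponentials from overflowing
    return exp_U / exp_U.sum(axis=1, keepdims=True)
```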

It is true that Euclidean distance could still be used as the loss function here, but a generally better choice is Cross Entropy.

Specifically, the cross entropy between a true value $y$ and a predicted value $\hat y$ is:

$H(y, \hat y) = -\big[\, y \ln \hat y + (1 - y) \ln (1 - \hat y) \,\big]$

Cross entropy has the following two properties:

  • When the true value is 0 ($y = 0$), the cross entropy reduces to $-\ln(1 - \hat y)$, so the closer the predicted value is to 0, the closer the cross entropy is to 0; conversely, if the predicted value tends to 1, the cross entropy tends to infinity.
  • When the true value is 1 ($y = 1$), the cross entropy reduces to $-\ln \hat y$, so the closer the predicted value is to 1, the closer the cross entropy is to 0; conversely, if the predicted value tends to 0, the cross entropy tends to infinity.

So taking the cross entropy as the loss function is reasonable. When the cross entropy is applied in practice, there is again a small trick to improve numerical stability: put a small value inside the log to avoid taking the log of 0:

$H(y, \hat y) = -\big[\, y \ln(\hat y + \epsilon) + (1 - y) \ln(1 - \hat y + \epsilon) \,\big]$, where $\epsilon$ is a small constant (e.g. $10^{-12}$).

With these two ingredients added, we can move on to computing the derivatives.

Although the derivation is more involved, the surprising thing is that the final result is almost identical to the previous one, differing only by a constant factor (see the end of the article for the derivation):

$\dfrac{\partial L}{\partial W_2} = \mathrm{ReLU}(XW_1)^T (\hat Y - Y), \qquad \dfrac{\partial L}{\partial W_1} = X^T \Big[ (\hat Y - Y)\,W_2^T * \mathrm{ReLU}'(XW_1) \Big]$

where $\hat Y$ now denotes the Softmax output of the model.

So the corresponding implementation is almost the same:

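Sketching the change in the same illustrative style as before, only the output layer and the delta term differ (again, names, the epsilon value and hyper-parameters are my own assumptions, not the author's code):

```python
import numpy as np

def softmax(U):
    exp_U = np.exp(U - U.max(axis=1, keepdims=True))   # max-subtraction trick
    return exp_U / exp_U.sum(axis=1, keepdims=True)

def cross_entropy(Y, Y_hat, eps=1e-12):
    """Cross entropy between one-hot labels Y and predicted probabilities Y_hat,
    with a small eps inside the logs to avoid log(0)."""
    return -np.sum(Y * np.log(Y_hat + eps) + (1 - Y) * np.log(1 - Y_hat + eps))

def train_with_softmax_ce(X, Y, m=24, lr=1e-2, epochs=1000):
    """Same one-hidden-layer network, but with Softmax outputs and cross-entropy loss."""
    n, K = X.shape[1], Y.shape[1]
    W1 = np.random.randn(n, m) * 0.1
    W2 = np.random.randn(m, K) * 0.1
    for _ in range(epochs):
        H = np.maximum(0, X @ W1)       # ReLU hidden layer
        Y_hat = softmax(H @ W2)         # probabilities instead of raw scores
        delta = Y_hat - Y               # the factor of 2 from the Euclidean case is gone
        grad_W2 = H.T @ delta
        grad_W1 = X.T @ ((delta @ W2.T) * (X @ W1 > 0))
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2
        # cross_entropy(Y, Y_hat) can be logged here to monitor training
    return W1, W2
```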

The performance of the model is as follows:
[Figures: classification results with Softmax + Cross Entropy]

Although it is still not perfect, it has two advantages compared with the model that does not use Softmax + Cross Entropy:

  • The learning rate can be set noticeably larger, which means the model is much less prone to blowing up.
  • Training is more stable; that is, the results of different training runs are similar.
  • By contrast, the result shown earlier for the previous model was actually carefully cherry-picked; in general its performance looks more like the figure below (which tells us that a good result is often piled on top of countless lousy ones):

[Figure: a typical, unselected result of the model without Softmax + Cross Entropy]

4. Related mathematical theory

1) Common activation functions

A. Sigmoid:

$\sigma(x) = \dfrac{1}{1 + e^{-x}}$

[Plot of the Sigmoid function]

B. Tanh:

$\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$

[Plot of the Tanh function]

C. ReLU:

$\mathrm{ReLU}(x) = \max(0, x)$

[Plot of the ReLU function]

D. ELU:

$\mathrm{ELU}(x) = \begin{cases} x, & x \ge 0 \\ \alpha\,(e^{x} - 1), & x < 0 \end{cases}$

[Plot of the ELU function]

E. Softplus:

$\mathrm{Softplus}(x) = \ln(1 + e^{x})$

[Plot of the Softplus function]
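For reference, one possible NumPy rendering of these activations (my own sketch; the ELU parameter $\alpha$ defaults to 1 here purely for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)

def elu(x, alpha=1.0):
    # np.minimum(x, 0) keeps the exponential from overflowing for large positive x
    return np.where(x >= 0, x, alpha * (np.exp(np.minimum(x, 0)) - 1))

def softplus(x):
    # log(1 + e^x), written with logaddexp for numerical stability
    return np.logaddexp(0, x)
```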

Recently an activation function called SELU was also published; its paper runs to 102 pages... Interested readers can refer to [1706.02515] Self-Normalizing Neural Networks.

2) Computing the derivatives in the neural network

To keep the writing concise, we will use matrix-differentiation techniques for the calculations; but if that looks too convoluted, it is recommended to consult the matrix-derivative article mentioned earlier and carry out the differentiation element by element from the definition (in fact, I often cannot get my head around it either...).

Going from simple to complex, let us first look at the derivation for the multi-class perceptron. As stated before, the multi-class perceptron can be written as follows (assuming N samples):

$\hat Y = XW, \qquad L(W) = \|XW - Y\|^2$

where $X \in \mathbb{R}^{N \times n}$ is the sample matrix, $Y \in \mathbb{R}^{N \times K}$ the one-hot label matrix, and $W \in \mathbb{R}^{n \times K}$ the weight matrix.

Many people, when trying to compute $\frac{\partial L}{\partial W}$, reach directly for vector-by-matrix differentiation rules; this is not exactly wrong, but it tends to overcomplicate the problem.

In fact, the essence of the multi-class perceptron is to predict, for a single sample $x$, the vector $\hat y = W^T x$; the sample matrix $X$ is merely what we get after stacking many samples together. Therefore, the essence of the multi-class perceptron is really the single-sample problem:

$\hat y = W^T x, \qquad L(W) = \|W^T x - y\|^2$

So the derivation reduces to differentiating a scalar with respect to a matrix:

$dL = 2\,(W^T x - y)^T \, dW^T \, x = \operatorname{tr}\!\big( dW^T \cdot 2\,x\,(W^T x - y)^T \big)$

From this we can read off $\frac{\partial L}{\partial W} = 2\,x\,(W^T x - y)^T$. Having obtained this special case, how should it be extended to the sample-matrix case?

Although a rigorous mathematical treatment would be more troublesome (requiring vectorization of the matrix, also known as "straightening", or something similar), we can understand it intuitively like this:

  • when the samples go from one to many, the derivative with respect to the weight matrix should go from a single term to the sum of the per-sample terms,

so we only need a matrix multiplication to carry out this summation. Concretely, writing:

$X = (x_1, x_2, \dots, x_N)^T, \qquad Y = (y_1, y_2, \dots, y_N)^T$

then
$\dfrac{\partial L}{\partial W} = \sum_{i=1}^{N} 2\,x_i\,(W^T x_i - y_i)^T = 2\,X^T (XW - Y)$

as well as, for the network with one hidden layer from Section 2:

$\dfrac{\partial L}{\partial W_2} = 2\,\mathrm{ReLU}(XW_1)^T (\hat Y - Y), \qquad \dfrac{\partial L}{\partial W_1} = 2\,X^T \Big[ (\hat Y - Y)\,W_2^T * \mathrm{ReLU}'(XW_1) \Big]$
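Matrix-calculus results like these are easy to get subtly wrong, so a quick finite-difference check is a handy sanity test. A small sketch (sizes and seed are arbitrary) comparing the analytic gradient $2X^T(XW - Y)$ with a numerical estimate:

```python
import numpy as np

np.random.seed(0)
N, n, K = 5, 4, 3
X = np.random.randn(N, n)
Y = np.eye(K)[np.random.randint(0, K, size=N)]   # random one-hot labels
W = np.random.randn(n, K)

loss = lambda W: np.sum((X @ W - Y) ** 2)
analytic = 2 * X.T @ (X @ W - Y)

# numerical gradient by central differences
numeric = np.zeros_like(W)
eps = 1e-6
for i in range(n):
    for j in range(K):
        E = np.zeros_like(W)
        E[i, j] = eps
        numeric[i, j] = (loss(W + E) - loss(W - E)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # should be vanishingly small
```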

3) Softmax + Cross Entropy

For a single sample, write $u$ for the input to the Softmax (i.e. the last layer's linear output) and $\hat y = \mathrm{softmax}(u)$. The derivative of Softmax itself is

$\dfrac{\partial \hat y_i}{\partial u_j} = \hat y_i\,(\delta_{ij} - \hat y_j)$

Combining this with the cross-entropy loss via the chain rule, and using the fact that the one-hot label satisfies $\sum_k y_k = 1$, everything collapses to

$\dfrac{\partial L}{\partial u_j} = \sum_k \dfrac{\partial L}{\partial \hat y_k}\,\dfrac{\partial \hat y_k}{\partial u_j} = \hat y_j - y_j$

that is,

$\dfrac{\partial L}{\partial u} = \hat y - y$

so the gradients with respect to $W_1$ and $W_2$ take exactly the same form as in the Euclidean-distance case, only without the factor of 2.
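As with the earlier result, this can be checked numerically. A small sketch (again with arbitrary sizes, and using the $-\sum_k y_k \ln \hat y_k$ part of the cross entropy) comparing $\hat y - y$ with a finite-difference gradient with respect to the Softmax input:

```python
import numpy as np

np.random.seed(0)
K = 4
u = np.random.randn(K)                 # Softmax input (the last layer's linear output)
y = np.eye(K)[np.random.randint(K)]    # a random one-hot label

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

loss = lambda v: -np.sum(y * np.log(softmax(v)))

analytic = softmax(u) - y

numeric = np.zeros(K)
eps = 1e-6
for j in range(K):
    e_j = np.zeros(K)
    e_j[j] = eps
    numeric[j] = (loss(u + e_j) - loss(u - e_j)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # should be vanishingly small
```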

In this article we mainly discussed how to apply the perceptron to multi-class tasks, and how two intuitive ideas (deepening the perceptron and applying activation functions) lead to a more powerful model, the neural network.

In addition, we also discussed how to apply Softmax + Cross Entropy to make the model more stable.

However, our discussion has still been limited to a single hidden layer and the ReLU activation function. In the next article we will cover the more general case, and use a more visual way to illustrate the power of neural networks.

Recommended reading:

Why should the data be normalized?
Logistic function and softmax function
video explanation | Why can't all neural network parameters be initialized to all 0

