Deep network not working? Andrew Ng takes you through optimizing neural networks (1)

Notes on Andrew Ng's DeepLearning.ai course
[Andrew Ng Deeplearning.ai Notes 1] An intuitive explanation of logistic regression
[Andrew Ng deeplearning.ai Notes 2] A plain-language explanation of neural networks (Part 1)
[Andrew Ng deeplearning.ai Notes 2] A plain-language explanation of neural networks (Part 2)

To train a deep neural network efficiently, you need to optimize every stage of the pipeline and guard against the various problems that can arise along the way.

This article covers data splitting, model evaluation, preventing overfitting, normalizing the data set, weight initialization, gradient checking, and related techniques for deep neural networks.

1 Data splitting


To build a neural network model, you first need to split the whole data set into a training set (Training Set), a development set (Development Set), and a test set (Test Set).

The training set is used for training; by changing the values of the hyperparameters, you obtain several different models. The development set, also called the hold-out cross-validation set (Hold-out Cross Validation Set), is used to pick the best-performing of those models. That model is then applied to the test set to obtain an unbiased estimate of how well the algorithm performs. In practice the final test set is sometimes omitted, and the development set is treated as the "test set".

One point to watch: make sure the development set and the test set come from the same distribution; otherwise the final evaluation will be significantly skewed.
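As a rough illustration, the split can be done by shuffling indices and slicing. This is only a sketch: the row-wise arrays X and Y and the 98/1/1 ratio (reasonable for a very large data set) are illustrative assumptions, not values from the notes.

import numpy as np

def split_dataset(X, Y, train_frac=0.98, dev_frac=0.01, seed=0):
    # Shuffle example indices once, then slice into train / dev / test.
    m = X.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.permutation(m)
    n_train = int(m * train_frac)
    n_dev = int(m * dev_frac)
    train, dev, test = idx[:n_train], idx[n_train:n_train + n_dev], idx[n_train + n_dev:]
    return (X[train], Y[train]), (X[dev], Y[dev]), (X[test], Y[test])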

2 Model evaluation


(Figure: the same data set fitted with three models — underfitting on the left, an appropriate fit in the middle, overfitting on the right.)

In the left panel of the figure, a simple model such as a linear fit cannot classify the data well; the result has a large bias (Bias). This is called underfitting (Underfitting).

In the right panel, a complex model such as a deep neural network is used. When the model's complexity is too high, it tends to overfit, and the result has a large variance (Variance).

In the middle panel, a model of appropriate complexity classifies the data well.

The development set is usually used to diagnose whether the model suffers from high bias or high variance:

  • If the training-set error rate is small but the development-set error rate is large, the model is probably overfitting and has high variance;
  • If the training-set and development-set error rates are both large and roughly equal, the model is probably underfitting and has high bias;
  • If the training-set error rate is large and the development-set error rate is much larger still, the model is in bad shape, with both high variance and high bias.

Only when both error rates are small and close to each other is the model a good one, with low variance and low bias.

When the model has high bias, you can combat underfitting by adding hidden layers, adding hidden units, or training for longer. When it has high variance, you can combat overfitting by collecting more training samples or applying regularization.
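The diagnosis rules above can be summarized in a toy helper. This is only a sketch: the error values and the acceptable-error threshold are made-up examples, not figures from the course.

def diagnose(train_err, dev_err, acceptable_err=0.05):
    # Compare training and development error rates against an illustrative threshold.
    high_bias = train_err > acceptable_err
    high_variance = (dev_err - train_err) > acceptable_err
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias (underfitting)"
    if high_variance:
        return "high variance (overfitting)"
    return "low bias and low variance"

print(diagnose(train_err=0.01, dev_err=0.11))  # -> "high variance (overfitting)"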

3 L2 regularization to prevent overfitting


Add an L2 regularization term (also called the "L2 norm" penalty) to the cost function of logistic regression:

J(w, b) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\|w\|_2^2

where

\|w\|_2^2 = \sum_{j=1}^{n_x} w_j^2 = w^T w

L2 regularization is the most commonly used type of regularization; there is also an L1 regularization term:

\frac{\lambda}{2m}\|w\|_1 = \frac{\lambda}{2m}\sum_{j=1}^{n_x} |w_j|

With L1 regularization the resulting w contains many zeros, which makes the model sparse, so L2 regularization is generally used instead. The parameter λ is called the regularization parameter and is usually tuned on the development set.

Similarly, add a regularization term to the cost function of a neural network:

J(W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\sum_{l=1}^{L}\|W^{[l]}\|_F^2

where

\|W^{[l]}\|_F^2 = \sum_{i=1}^{n^{[l]}}\sum_{j=1}^{n^{[l-1]}} \left(w_{ij}^{[l]}\right)^2

This is called the Frobenius norm (Frobenius Norm), so this form of regularization in neural networks is also referred to as Frobenius-norm regularization.

After the regularization term is added, back propagation becomes:

dW^{[l]} = (\text{term from back propagation}) + \frac{\lambda}{m}W^{[l]}

and the parameter update becomes:

W^{[l]} := W^{[l]} - \alpha\, dW^{[l]} = \left(1 - \frac{\alpha\lambda}{m}\right)W^{[l]} - \alpha\,(\text{term from back propagation})

Because the factor (1 - αλ/m) < 1 shrinks the weights at every step, L2 regularization is also called weight decay.

The parameter λ adjusts the relative importance of the two terms in the cost function. A smaller λ favors minimizing the original cost function, while a larger λ favors keeping the weights small. When λ is large, the weights W^{[l]} are pushed towards 0, which is roughly equivalent to eliminating some of the hidden units in the deep network.

In addition, when the weights W^{[l]} are small, random variations in the input X have less influence on the model, so the network is less sensitive to local noise. This is why regularization reduces the model's variance.
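A minimal numpy sketch of the formulas above: the list `weights`, the cross-entropy cost, and the gradient `dW_from_backprop` are assumed to come from your own forward and backward passes and are not course code.

import numpy as np

def l2_regularized_cost(cross_entropy_cost, weights, lambd, m):
    # Add the Frobenius-norm penalty (lambda / 2m) * sum_l ||W[l]||_F^2 to the cost.
    penalty = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return cross_entropy_cost + penalty

def update_with_weight_decay(W, dW_from_backprop, lambd, m, alpha):
    # Gradient of the penalty is (lambda / m) * W, which shrinks W at every step.
    dW = dW_from_backprop + (lambd / m) * W
    return W - alpha * dW  # same as (1 - alpha*lambd/m) * W - alpha * dW_from_backprop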

4 Dropout regularization to prevent overfitting


Dropout (DropOut) regularization assigns each node in each layer a probability of being dropped, and then randomly eliminates some of the nodes during training. Training on the resulting smaller network helps reduce variance.

DropOut regularization is mostly used in the field of Computer Vision.

In Python, dropout is usually implemented as inverted dropout (Inverted DropOut):

For layer 3 of a neural network:

keep_prob = 0.8
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob   # Boolean mask: True with probability keep_prob
a3 = np.multiply(a3, d3)                                     # zero out the dropped activations
a3 /= keep_prob                                              # scale up so the expected value of a3 is unchanged
z4 = np.dot(w4, a3) + b4                                     # forward to the next layer as usual

Here d3 is a randomly generated Boolean matrix of the same shape as the layer-3 activations; when it multiplies a3, each entry acts as a 0 or a 1. keep_prob ≤ 1 can differ from layer to layer and determines how many nodes are dropped.

For example, when keep_prob is set to 0.8, about 20% of the entries of d3 will be 0, so multiplying a3 by d3 eliminates roughly 20% of the nodes in this layer. The division by keep_prob is needed because a3 is used in the next step to compute z4, and 20% of its values have just been zeroed out; scaling the remaining values back up keeps the expected value of z4 in the next layer unchanged. This correction step is what makes the technique "inverted" dropout: it guarantees that the expected value of a3 is not affected by dropping nodes, so at test time the network can be used without dropout and without any extra scaling.

Like L2 regularization, dropout effectively simplifies part of the network's structure and thus helps prevent overfitting. In addition, because any input node may be dropped, the network cannot rely too heavily on any single node (that is, any single feature); the weights get spread out across the inputs, which has the effect of shrinking the squared norm of the weights.

5 Data augmentation to prevent overfitting


(Figure: an image flipped, zoomed, and distorted to produce additional training samples.)

Data augmentation (Data Augmentation) applies simple transformations to existing data when additional training samples cannot be obtained. For example, a picture can be flipped, zoomed, or distorted to generate extra training samples.
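A minimal augmentation sketch in plain numpy: the stand-in image array, the 50% flip probability, and the crop size are illustrative assumptions, not values from the course; real projects usually rely on a library's augmentation ops.

import numpy as np

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))                 # stand-in for a real H x W x 3 training image

def augment(img, crop=56):
    out = img[:, ::-1, :] if rng.random() < 0.5 else img   # random horizontal flip
    top = rng.integers(0, img.shape[0] - crop + 1)          # random crop position
    left = rng.integers(0, img.shape[1] - crop + 1)
    return out[top:top + crop, left:left + crop, :]

extra_sample = augment(image)                   # one new "distorted" training sample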

6 Early stopping to prevent overfitting


(Figure: training-set and development-set cost curves plotted against the number of training iterations, with an arrow marking where the two curves start to diverge.)

Early stopping (Early Stopping) plots the cost curves of the training set and the development set on the same axes during gradient descent, and stops training as soon as the two curves begin to diverge noticeably (the point marked by the arrow in the figure).

At that stopping point the parameters w take values that are neither too large nor too small, which ideally avoids overfitting. The drawback is that early stopping tries to use a single mechanism both to minimize the cost function and to avoid overfitting, and as a result it cannot do either job particularly well.
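A schematic early-stopping loop: the callables train_one_epoch and dev_cost stand in for your own training step and development-set evaluation, and the patience value is only an example.

def train_with_early_stopping(params, train_one_epoch, dev_cost,
                              max_epochs=100, patience=5):
    # Stop when the development-set cost has not improved for `patience` epochs,
    # and keep the parameters from the best epoch seen so far.
    best_cost, best_params, bad_epochs = float("inf"), params, 0
    for epoch in range(max_epochs):
        params = train_one_epoch(params)
        cost = dev_cost(params)
        if cost < best_cost:
            best_cost, best_params, bad_epochs = cost, params, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_params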

7 Normalizing the data set


The process of normalizing the training and test sets is as follows. Given the training examples x^{(1)}, \dots, x^{(m)}, first subtract the mean from x:

\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}, \qquad x := x - \mu

then compute the variance element-wise (after the mean has been subtracted) and divide each feature by its standard deviation, so that every feature has unit variance:

\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x^{(i)}\right)^2, \qquad x := x / \sigma

When the data set is not normalized, the contours of the cost function are elongated and gradient descent oscillates and makes slow progress:

(Figure: elongated cost-function contours with a zig-zagging gradient-descent path on unnormalized data.)

After normalization, the contours become much more symmetric and gradient descent heads more directly towards the minimum:

(Figure: nearly circular cost-function contours with a direct gradient-descent path on normalized data.)
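A minimal sketch of this normalization, with μ and σ computed on the training set and reused for the test set so both go through the same transform; the stand-in arrays below assume one example per column, as in the course notation (n_x, m).

import numpy as np

X_train = np.random.randn(2, 1000) * 5 + 3       # stand-in data with non-zero mean and large variance
X_test = np.random.randn(2, 200) * 5 + 3

mu = np.mean(X_train, axis=1, keepdims=True)                              # mean of each feature
sigma = np.sqrt(np.mean((X_train - mu) ** 2, axis=1, keepdims=True))      # std of each feature

X_train = (X_train - mu) / sigma       # training set now has zero mean and unit variance
X_test = (X_test - mu) / sigma         # use the SAME mu and sigma for the test set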

8 Initializing the weights


When building a neural network earlier, we noted that the weights w cannot all be 0 and must be initialized to random values. In a deep network, however, if the initial weights are too large, the activations grow exponentially with depth and the gradients explode; if they are too small, the activations decay exponentially and the gradients vanish.

In Python, w is typically initialized with np.random.randn() from the numpy library, which samples from the standard normal ("Gaussian") distribution with mean 0 and unit variance.

As the number of inputs n to a layer increases, the variance of that layer's output also increases. Dividing the random weights by the square root of n rescales them so that the variance of a neuron's output stays close to 1, avoiding both exponential explosion and exponential decay. In other words, initialize the weights as:

w = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(1.0 / layers_dims[l-1])

This ensures that all neurons in the network initially have approximately the same output distribution.
When the activation function is the ReLU function, the weights are best initialized as:

w = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2.0 / layers_dims[l-1])

See the reference materials for the derivation of these results.
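A minimal sketch that applies the ReLU ("He") scaling above to every layer of a network; the layer sizes passed in at the end are only an example.

import numpy as np

def initialize_parameters(layers_dims, seed=0):
    np.random.seed(seed)
    params = {}
    for l in range(1, len(layers_dims)):
        # Scale each weight matrix by sqrt(2 / n_inputs) for ReLU layers.
        params["W" + str(l)] = (np.random.randn(layers_dims[l], layers_dims[l - 1])
                                * np.sqrt(2.0 / layers_dims[l - 1]))
        params["b" + str(l)] = np.zeros((layers_dims[l], 1))  # biases can start at zero
    return params

params = initialize_parameters([784, 64, 32, 1])   # example layer sizes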

9 Gradient checking


Gradient checking is based on the definition of the derivative:

f'(\theta) = \lim_{\varepsilon \to 0} \frac{f(\theta + \varepsilon) - f(\theta - \varepsilon)}{2\varepsilon}

which gives the gradient-checking formula:

\frac{f(\theta + \varepsilon) - f(\theta - \varepsilon)}{2\varepsilon}

The smaller ε is, the closer this result is to the true derivative, i.e. the gradient. This can be used to verify that back propagation computes the gradients correctly.

The procedure is: for each parameter θ_i of the cost function, add and subtract a small ε and compute the approximate gradient

d\theta_{\text{approx}}[i] = \frac{J(\theta_1, \dots, \theta_i + \varepsilon, \dots) - J(\theta_1, \dots, \theta_i - \varepsilon, \dots)}{2\varepsilon}

Then take the analytic gradient dθ obtained from back propagation and compute the relative Euclidean distance between the two:

\frac{\|d\theta_{\text{approx}} - d\theta\|_2}{\|d\theta_{\text{approx}}\|_2 + \|d\theta\|_2}

When this distance is on the order of ε, the gradients can be considered correct; otherwise, go back and check the code for bugs.

A few caveats: do not run gradient checking while training the model (use it only for debugging); if a regularization term has been added to the cost function, include that term in the check; and do not use gradient checking together with dropout.
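A minimal sketch of the check above for a cost function J(θ) with a flat parameter vector θ; the callables J and grad stand in for your own forward pass and back propagation and are assumptions, not course code.

import numpy as np

def gradient_check(J, grad, theta, epsilon=1e-7):
    dtheta = grad(theta)                              # analytic gradient from back propagation
    dtheta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += epsilon
        theta_minus[i] -= epsilon
        dtheta_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)
    diff = np.linalg.norm(dtheta_approx - dtheta) / (
        np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta))
    return diff                                       # roughly epsilon-sized when the gradients agree

# Example: J(theta) = sum(theta**2) has gradient 2*theta.
theta = np.array([1.0, -2.0, 3.0])
print(gradient_check(lambda t: np.sum(t ** 2), lambda t: 2 * t, theta))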

Note: The figures and materials in this article are compiled and translated from Andrew Ng's Deep Learning course series, and the copyright belongs to him. The translation and compilation are of limited quality; corrections are welcome.


Recommended reading:

  • Video | What should I do if I can't produce a paper? Why not try these methods
  • [Deep learning in practice] How to handle padding of variable-length RNN input sequences in PyTorch
  • [Machine learning basics] A detailed understanding of maximum a posteriori (MAP) estimation



Origin: blog.51cto.com/15009309/2554213