Deep Learning - Practical Aspects of Deep Learning

How should hyperparameters be tuned in practice?

1. The dataset can be divided into three parts: the training set (train), the development set (dev), and the test set (test)

Train on the train set, use the dev set to compare candidate models and pick the best one, and finally evaluate that model's performance on the test set

Dataset split: when the amount of data is small, a 3:1:1 (60/20/20) split is common; when the amount of data is large, the dev and test fractions can be much smaller (e.g. 98/1/1). A split sketch follows at the end of this point.

If the train and dev/test data come from different sources, then at least make sure that dev and test come from the same distribution
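As a rough illustration of these split ratios, here is a minimal numpy sketch (the function name, fractions, and toy data are illustrative, not from the notes):

```python
import numpy as np

def train_dev_test_split(X, y, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle the examples, then split them into train/dev/test by fraction."""
    m = X.shape[0]
    idx = np.random.default_rng(seed).permutation(m)
    n_dev, n_test = int(m * dev_frac), int(m * test_frac)
    dev, test, train = idx[:n_dev], idx[n_dev:n_dev + n_test], idx[n_dev + n_test:]
    return (X[train], y[train]), (X[dev], y[dev]), (X[test], y[test])

# Large dataset: tiny dev/test fractions (~98/1/1); for small datasets use e.g. 0.2/0.2 (3:1:1).
X, y = np.random.randn(100_000, 20), np.random.randint(0, 2, size=100_000)
(train_X, train_y), (dev_X, dev_y), (test_X, test_y) = train_dev_test_split(X, y)
print(train_X.shape, dev_X.shape, test_X.shape)   # (98000, 20) (1000, 20) (1000, 20)
```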

2. Bias and variance

High bias corresponds to underfitting; high variance corresponds to overfitting

Optimal error (Bayes error): in image classification, for example, if even the training images are blurry, the optimal achievable error may be fairly large

3. Basic recipe for diagnosing bias and variance and adjusting the model

Bias and variance tradeoff: judge bias from the gap between training error and optimal error, and variance from the gap between training and dev error (a rough diagnostic sketch follows)
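A minimal sketch of this diagnosis, assuming we have measured train and dev error rates (the function name, thresholds, and suggested fixes are illustrative, following the usual recipe):

```python
def diagnose(train_err, dev_err, bayes_err=0.0, tol=0.02):
    """Rough bias/variance diagnosis from error rates in [0, 1]."""
    avoidable_bias = train_err - bayes_err     # distance from the optimal (Bayes) error
    variance = dev_err - train_err             # generalization gap
    issues = []
    if avoidable_bias > tol:
        issues.append("high bias (underfitting): bigger network / train longer")
    if variance > tol:
        issues.append("high variance (overfitting): more data / regularization")
    return issues or ["looks fine"]

print(diagnose(train_err=0.01, dev_err=0.11))  # -> high variance
print(diagnose(train_err=0.15, dev_err=0.16))  # -> high bias
```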

4. Regularization (to solve high variance problems)

4.1 L2 Regularization

Add a regularization term to the cost function: the (squared) L2 norm of the weights

Frobenius norm: the square root of the sum of squares of all entries of a matrix; the regularization term uses its square, i.e. the plain sum of squared entries

L2 regularization is also known as weight decay because each gradient-descent update multiplies w by a factor slightly less than 1 (see the formulas below)
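For reference, the regularized cost and the resulting update look like this (λ is the regularization strength, α the learning rate, m the number of examples); the (1 − αλ/m) factor is the "decay" of the weights:

```latex
J(W, b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big)
        + \frac{\lambda}{2m}\sum_{l=1}^{L}\big\lVert W^{[l]}\big\rVert_F^2,
\qquad
\big\lVert W^{[l]}\big\rVert_F^2 = \sum_i\sum_j \big(W^{[l]}_{ij}\big)^2

W^{[l]} := W^{[l]} - \alpha\Big(\mathrm{d}W^{[l]}_{\text{backprop}} + \tfrac{\lambda}{m}W^{[l]}\Big)
         = \Big(1 - \tfrac{\alpha\lambda}{m}\Big)W^{[l]} - \alpha\,\mathrm{d}W^{[l]}_{\text{backprop}}
```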

Why L2 regularization reduces high variance:

Intuitively, if λ is set large, the weights W are pushed toward 0, the effect of many hidden units is attenuated, and the network behaves like a much simpler one

If λ becomes larger, W becomes smaller, so z = Wa + b stays in a small range around 0 (the near-linear region of the activation marked in red on the slide); then g(z) is roughly linear, every layer is roughly linear, and the network cannot represent complex decision boundaries
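A minimal numpy sketch of the gradient with the L2 term for one layer (the shapes and names are assumptions, following the usual W·A convention):

```python
import numpy as np

def dW_with_l2(dZ, A_prev, W, lambd, m):
    """Gradient of the L2-regularized cost w.r.t. one layer's weights W.

    dZ:     gradient w.r.t. this layer's pre-activation, shape (n_l, m)
    A_prev: previous layer's activations, shape (n_prev, m)
    """
    dW_backprop = (1.0 / m) * dZ @ A_prev.T   # usual backprop term
    return dW_backprop + (lambd / m) * W      # extra term from (lambda/2m) * ||W||_F^2

# Larger lambd -> stronger shrinking force on W -> z = W a + b stays near 0,
# where activations such as tanh are nearly linear.
```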

4.2 Dropout regularization (random deactivation)

How it works: for each layer, set a probability of keeping (or dropping) each node, randomly drop nodes accordingly, and train the resulting thinned network

Common implementation method: inverted dropout

From the slide: generate a random boolean mask d3 in which each entry is True with probability keep_prob (e.g. 0.8, the probability of keeping a node), multiply a3 by the mask, and finally divide a3 by keep_prob (this division is the "inverted" step: it keeps the expected value of a3 unchanged)
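A minimal numpy sketch of inverted dropout for one layer (a3, d3, and keep_prob follow the slide's notation; the shapes are illustrative):

```python
import numpy as np

keep_prob = 0.8                              # probability of keeping each node
a3 = np.random.randn(5, 10)                  # activations of layer 3 (toy shape)

d3 = np.random.rand(*a3.shape) < keep_prob   # boolean mask: True with prob. keep_prob
a3 = a3 * d3                                 # zero out the dropped nodes
a3 = a3 / keep_prob                          # inverted step: keeps E[a3] unchanged
```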

Setting keep_prob: for layers with relatively few nodes (where overfitting is less likely), keep_prob can be set closer to 1

Disadvantage: the cost function J is no longer well defined, because a different random set of nodes is dropped on each iteration, so its value is not guaranteed to decrease monotonically during training

The dropout method is widely used in computer vision (CV)

4.3 Data augmentation (increasing the dataset)

Since collecting more data directly is expensive, the amount of data can be increased by transforming the existing examples (e.g. flips, crops, small distortions); a sketch follows
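A minimal sketch of such transformations for image arrays, assuming (H, W, C) numpy images (the specific flip/crop choices are just examples):

```python
import numpy as np

def augment(image, rng):
    """Return a randomly flipped and slightly cropped copy of an (H, W, C) image."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]            # horizontal flip
    dy, dx = rng.integers(0, 5, size=2)      # small random crop offsets (0..4 pixels)
    h, w, _ = image.shape
    return image[dy:h - (4 - dy), dx:w - (4 - dx), :]

rng = np.random.default_rng(0)
print(augment(np.zeros((32, 32, 3)), rng).shape)   # (28, 28, 3)
```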

4.4 Early stopping

As the number of iterations increases, the error on the training set keeps decreasing, while the error on the dev set typically decreases first and then increases; stop training near the point where the dev error is lowest (sketch below)
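A minimal sketch of this stopping rule, assuming two hypothetical callbacks: fit_one_epoch() runs one epoch of training and returns the current parameters, and eval_dev_error(params) returns the dev-set error:

```python
def train_with_early_stopping(fit_one_epoch, eval_dev_error, max_epochs=100, patience=5):
    """Keep the parameters with the lowest dev error; stop after `patience` epochs without improvement."""
    best_err, best_params, bad_epochs = float("inf"), None, 0
    for _ in range(max_epochs):
        params = fit_one_epoch()
        err = eval_dev_error(params)
        if err < best_err:
            best_err, best_params, bad_epochs = err, params, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                        # dev error has stopped improving
    return best_params, best_err
```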

5. Normalize the input features (X)

Zero-mean the data and normalize the variance (train and test must use the same μ and σ, computed on the training set)
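A minimal numpy sketch of this, with toy data; the key point is that μ and σ come from the training set only:

```python
import numpy as np

# Toy data: features with very different scales, shape (m, n_features)
X_train = np.random.randn(1000, 3) * np.array([1.0, 100.0, 0.01])
X_test = np.random.randn(200, 3) * np.array([1.0, 100.0, 0.01])

mu = X_train.mean(axis=0)                 # per-feature mean, from the train set only
sigma = X_train.std(axis=0)               # per-feature std, from the train set only

X_train_norm = (X_train - mu) / sigma     # zero mean, unit variance
X_test_norm = (X_test - mu) / sigma       # reuse the SAME mu and sigma for test
```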

After normalization the features of X lie in similar ranges, which helps gradient descent: the cost surface is more symmetric, so a larger step size (learning rate) can be used and descent converges faster. Without normalization the contours of J are elongated, the gradient tends to oscillate across the narrow valley, and only small steps are safe, which is why progress becomes very slow later on.

Why is the graph of J so long and narrow when the inputs are not normalized? Because if one feature is large and another is small, then to keep J small the w corresponding to the large feature must be small and the w corresponding to the small feature must be large, so the sensible ranges of the two weights differ greatly.

Normalization matters most when the value ranges of the features are very different

6. Vanishing/exploding gradients

In a very deep network, if the weight matrices are slightly larger than the identity, activations and gradients grow exponentially with depth (explode); if slightly smaller than the identity, they shrink exponentially (vanish)

A way to partially mitigate exploding/vanishing gradients: when initializing each layer's W, scale it according to the number n of input units feeding into that layer (e.g. variance 2/n for ReLU, 1/n for tanh); a sketch follows
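A minimal sketch of such an initialization, where n_in is the number of units feeding into the layer (the 2/n_in variance for ReLU and 1/n_in for tanh are the commonly used choices):

```python
import numpy as np

def init_layer(n_out, n_in, activation="relu", seed=0):
    """Initialize one layer's parameters, scaling W by the layer's fan-in n_in."""
    rng = np.random.default_rng(seed)
    scale = np.sqrt(2.0 / n_in) if activation == "relu" else np.sqrt(1.0 / n_in)
    W = rng.standard_normal((n_out, n_in)) * scale
    b = np.zeros((n_out, 1))
    return W, b

W1, b1 = init_layer(128, 784)             # e.g. first layer of a 784-input network
print(W1.std())                           # close to sqrt(2/784) ~ 0.05
```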

7. Gradient checking with two-sided (centered) difference approximations

some notes
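A minimal sketch of such a check, comparing an analytic gradient against two-sided finite differences of the cost J at θ (the function names and the tiny test case are illustrative):

```python
import numpy as np

def grad_check(J, theta, grad, eps=1e-7):
    """Relative difference between `grad` and a centered-difference approximation of dJ/dtheta."""
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        approx[i] = (J(plus) - J(minus)) / (2 * eps)   # two-sided difference
    return np.linalg.norm(approx - grad) / (np.linalg.norm(approx) + np.linalg.norm(grad))

# Tiny test: J(theta) = sum(theta^2) has gradient 2*theta; the result should be far below 1e-7.
theta = np.array([1.0, -2.0, 3.0])
print(grad_check(lambda t: np.sum(t ** 2), theta, 2 * theta))
```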
