Deep Learning - Hyperparameter Tuning

1. Hyperparameters differ in importance (the course's red -> orange -> purple color coding, from most to least important); the learning rate is typically the most important one

2. How to tune hyperparameters

2.1 Do not pick hyperparameter values from a fixed grid; sample combinations at random, because hyperparameters differ in importance and a grid wastes most trials on repeated values of the important ones

2.2 Coarse to fine: start with a relatively large search range, then narrow it to a smaller region around the best-performing values and sample again (see the sketch below)
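A minimal sketch of random search over two hypothetical hyperparameters (the ranges here are made up for illustration):

```python
import random

def sample_combination():
    # With a 5 x 5 grid we would only ever try 5 distinct learning rates;
    # 25 random combinations give 25 distinct values of each hyperparameter.
    learning_rate = random.uniform(0.0001, 1.0)   # refined to a log scale in 3.2
    hidden_units = random.randint(50, 100)
    return learning_rate, hidden_units

candidates = [sample_combination() for _ in range(25)]
# Coarse to fine: evaluate these, then shrink the ranges around the best
# candidates and sample again.
```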

3. Choosing an appropriate scale for the hyperparameters

3.1 Uniform (linear-scale) sampling works for some hyperparameters, such as the number of units in a layer or the number of layers, but not for all of them

3.2 Sampling on a logarithmic scale: for example, choosing the learning rate

If you sample uniformly from [0.0001, 1], then about 90% of the samples fall in [0.1, 1] and only 10% in [0.0001, 0.1]. A more reasonable approach:

Convert [0.0001, 1] to the exponent range [-4, 0] (since 10^-4 = 0.0001), sample r uniformly in [-4, 0], and use 10^r, so that intervals such as [0.0001, 0.001] and [0.1, 1] receive equal probability

More generally, for a range [low, high], take logarithms to get the interval [a, b] = [log10(low), log10(high)], sample r uniformly in [a, b], and use 10^r
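A short sketch of this log-scale sampling for the learning rate, using the [0.0001, 1] range from the text:

```python
import numpy as np

# Sample the exponent uniformly in [a, b] = [-4, 0], then take 10**r.
# Each decade [0.0001, 0.001], ..., [0.1, 1] now gets the same probability (25%).
a, b = -4, 0
r = np.random.uniform(a, b)
learning_rate = 10 ** r
```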

3.3 Choosing β for exponentially weighted averages (e.g., the momentum parameter)

- Why is it bad to sample β uniformly over [0.9, 0.999]? Because the closer β is to 1, the more a small change in it affects the effective averaging window 1/(1-β). For example, moving β from 0.900 to 0.9005 barely changes the window (still about 10 samples), while moving from 0.999 to 0.9995 doubles it from about 1000 to about 2000 samples

- So turn the choice of β into a choice of 1-β: since β in [0.9, 0.999] means 1-β in [0.001, 0.1], sample 1-β on the log scale from 3.2 (a sketch follows below)
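A sketch of the same trick applied to β, assuming the [0.9, 0.999] range above:

```python
import numpy as np

# beta in [0.9, 0.999]  <=>  1 - beta in [0.001, 0.1],
# so sample 1 - beta on a log scale (exponent uniform in [-3, -1]).
r = np.random.uniform(-3, -1)
beta = 1 - 10 ** r
# The averaging window 1 / (1 - beta) is then spread evenly (on a log scale)
# between roughly 10 and 1000.
```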

4. Hyperparameter tuning in practice: pandas vs. caviar

Because conditions such as the data keep changing, hyperparameters need to be re-tuned regularly

The first approach: pandas (babysitting one model)

Focus on one model at a time, watching it and adjusting it frequently

Choose this when the system is complex or computing resources are limited

The second approach: caviar (training many models in parallel)

Train multiple models with different hyperparameter settings in parallel and keep the best one

Choose this when you have plenty of computing resources

5. Normalizing activations in a network

Normalize the inputs to the hidden layers (not just the input layer)

Implementation: the batch normalization algorithm. By default it normalizes the pre-activation values z rather than the activations a

After computing z_norm = (z - μ) / sqrt(σ² + ε), apply a second formula with two learnable parameters: z~ = γ · z_norm + β. The layer then uses z~ instead of z, and the mean and variance of z~ are controlled by γ and β
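A minimal NumPy sketch of these two formulas (γ and β are the two learnable parameters; ε is a small constant for numerical stability):

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-8):
    """z has shape (n_units, m) for a mini-batch of m examples."""
    mu = z.mean(axis=1, keepdims=True)        # per-unit mean over the batch
    var = z.var(axis=1, keepdims=True)        # per-unit variance over the batch
    z_norm = (z - mu) / np.sqrt(var + eps)    # zero mean, unit variance
    z_tilde = gamma * z_norm + beta           # mean/variance set by gamma, beta
    return z_tilde, mu, var
```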

6.2 Using batch norm in a neural network: deep learning frameworks (e.g., TensorFlow) implement this well, so only a single layer/function call is needed

6.3 Batch norm with mini-batches: because each z value has its mini-batch mean subtracted, whatever constant the bias b[l] adds is cancelled out and has no effect on the result, so the parameter b[l] can be removed (the shift parameter β[l] takes over its role)
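For illustration, a sketch using TensorFlow's built-in layer (assuming TensorFlow 2.x; use_bias=False reflects the point above that b[l] is redundant under batch norm):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, use_bias=False),   # b[l] removed: BN's beta replaces it
    tf.keras.layers.BatchNormalization(),        # normalizes z, then applies gamma and beta
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```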

6.4 

6.5 Why does batch norm work? It speeds up learning

- Normalizing hidden-layer values plays the same role as normalizing the inputs: it makes optimization easier and faster

- For any given layer, the activations coming from the earlier layers keep shifting as those layers' weights are updated; normalization limits that shift to a stable mean and variance, so each layer becomes somewhat independent of the others

An example of this kind of distribution change (covariate shift): train a cat classifier on black cats, then test it on colored cats

- Because the mean and variance are estimated on each mini-batch, batch norm adds some noise to the hidden values, which gives a slight regularization effect

6.6 Using batch norm at test time

During training, batch norm computes the mean and variance over a whole mini-batch, but at test time examples may arrive one at a time (or not in the same batches), so the statistics have to be estimated another way

Method: keep an exponentially weighted average of the mini-batch means and variances during training, and use those running estimates at test time (a sketch follows below)
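A sketch of this estimate, assuming running averages of μ and σ² are maintained during training (momentum 0.9 is an arbitrary choice here):

```python
import numpy as np

def update_running_stats(running_mu, running_var, mu, var, momentum=0.9):
    # Exponentially weighted averages of the mini-batch statistics,
    # updated after every training mini-batch.
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return running_mu, running_var

def batch_norm_inference(z, gamma, beta, running_mu, running_var, eps=1e-8):
    # At test time, use the running estimates instead of mini-batch statistics,
    # so a single example can be normalized on its own.
    z_norm = (z - running_mu) / np.sqrt(running_var + eps)
    return gamma * z_norm + beta
```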

7. Softmax regression

Softmax handles classification with many classes (logistic regression handles only the binary, two-class case)

Use the softmax activation function. Unlike the earlier activations (sigmoid, ReLU), whose input and output are single real numbers, softmax takes a vector and outputs a vector; the output is normalized so that the predicted class probabilities sum to 1
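A minimal NumPy version of softmax (subtracting the max changes nothing mathematically; it only avoids overflow):

```python
import numpy as np

def softmax(z):
    """z is the vector of pre-activations z[L] for one example."""
    t = np.exp(z - np.max(z))   # shift for numerical stability
    return t / t.sum()          # probabilities that sum to 1

a = softmax(np.array([5.0, 2.0, -1.0, 3.0]))
print(a, a.sum())               # approx [0.842 0.042 0.002 0.114], sums to 1.0
```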

With no hidden layers (softmax applied directly to the input), the decision boundary between any two classes is linear

7.2 Training a softmax network

Compared with a "hard max", which marks the most probable class as 1 and all others as 0, softmax keeps a full probability for every class

Loss function: cross-entropy, L(yhat, y) = -Σ_j y_j log(yhat_j), which for a one-hot y reduces to -log(yhat_c) for the true class c; the cost is the average loss over the training set

Why does backprop give dz[L] = yhat - y? Differentiating the cross-entropy loss through the softmax: the softmax Jacobian combined with the one-hot y simplifies dL/dz[L] to exactly yhat - y (a numerical check follows below)
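A small numerical check of this, under the setup above (one-hot y, cross-entropy loss, yhat = softmax(z)); the finite-difference gradient of the loss with respect to z should match yhat - y:

```python
import numpy as np

def softmax(z):
    t = np.exp(z - np.max(z))
    return t / t.sum()

def loss(z, y):
    return -np.sum(y * np.log(softmax(z)))    # cross-entropy

z = np.array([5.0, 2.0, -1.0, 3.0])
y = np.array([0.0, 1.0, 0.0, 0.0])            # one-hot true label

analytic = softmax(z) - y                      # the claimed dz = yhat - y
numeric = np.zeros_like(z)
eps = 1e-6
for i in range(len(z)):
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[i] += eps
    z_minus[i] -= eps
    numeric[i] = (loss(z_plus, y) - loss(z_minus, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))   # True
```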
