Coursera Andrew Ng (Wu Enda) Improving Deep Neural Networks Notes

Finally finished the exam, now to work!
website link

Week 1

1.1 Train / Dev / Test sets

This section introduces the train/dev/test split: the training set, the development set (cross-validation set), and the test set. To end up with good hyperparameters you need to go around the idea → code → experiment cycle many times.
In the era when data was scarce, the split ratio was 70/30 or 60/20/20; now, with large datasets, around 1% for dev and test is often enough:

If you have 1 million samples and only need 10,000 for the development set and 10,000 for the test set, then 10,000 is one percent of 1 million, so the ratio is 98/1/1. I have also seen applications with more than 1 million samples where the split becomes 99.5/0.25/0.25, or 0.4% for the dev set and 0.1% for the test set.

  1. The data is usually divided into a training set, a development set (cross-validation set), and a test set. When there are only two, naming varies, and the development set may be called a test set. If you don't need an unbiased estimate of performance, you can skip the test set, though keeping all three is usually better. Unbiased estimates are discussed later.
  2. In addition, the training set must match the samples the other two sets target. Otherwise it is like target practice: you train on target A but are tested on target B. Recognizing cats in photos crawled from the web while the dev/test sets contain cat photos uploaded by users is exactly this kind of mismatch. The three sets should all come from the same source — all web cat photos, or all app cat photos.

1.2 Bias / Variance

This section introduces bias and variance:
a trained algorithm should avoid both high bias and high variance. Two dimensions are convenient to visualize; in high dimensions you cannot just look at a plot, so you diagnose from the train/dev errors instead.
There are four combinations of bias and variance, summarized in the table on the slide. Bayes error (the ideal error) also comes up: if Bayes error is very high, neither humans nor any system can classify the data well, and other kinds of analysis are needed. The diagnosis here assumes that Bayes error is fairly low and that your training and dev sets come from the same distribution.
One last question: what does a curve with both high variance and high bias look like? The purple line in the figure below: the straight-line part is the high bias, and the curved part is the high variance, because it overfits those two points.

1.3 Basic Recipe for Machine Learning

This section describes the basic recipe against underfitting and overfitting; the slide pairs each problem with its remedies.
High bias: the model cannot even fit the training set well — try a bigger network or training longer.
High variance: the model fits the training set but not the dev set (development set / cross-validation set) — try more data or regularization.
The recipe walks through fixing high bias first and then high variance; regularization is discussed next.
Before deep learning and big data, reducing bias and reducing variance traded off against each other; now that is much less true. Regularization is a very useful way to reduce variance. It does introduce a small bias/variance trade-off — it may increase bias a little, but usually not much if your network is large enough.

1.4 Regularization

This section introduces regularization. If you attack overfitting (high variance) by getting more data, the data may be hard or expensive to obtain; that is when regularization is needed.
The cost J(w,b) with regularization added is
$J(w,b) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\lVert w\rVert_2^2$
In $\lVert w\rVert_2^2$ the superscript 2 means squared and the subscript 2 means the L2 norm. Because lambda conflicts with the Python keyword, the code uses lambd instead.
For a neural network, for somewhat mysterious linear-algebra reasons, the corresponding matrix norm $\sum_{i}\sum_{j}\big(w_{ij}^{[l]}\big)^2$ is called the Frobenius norm rather than the L2 norm, marked with a subscript F. $dw^{[l]}$ becomes the original $dw^{[l]}$ plus $\frac{\lambda}{m}w^{[l]}$, with everything else unchanged. Substituting into $w^{[l]} = w^{[l]} - \alpha\, dw^{[l]}$ gives the green formula on the slide: whatever $w^{[l]}$ is, each update subtracts a small fraction of it (making it a little smaller), which is why L2 regularization is also called weight decay.
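A minimal numpy sketch of both pieces — the helper name `l2_cost_term` and the parameter layout are my own illustration, not the course's code:

```python
import numpy as np

def l2_cost_term(parameters, lambd, m):
    """Frobenius-norm penalty: (lambd / (2*m)) * sum_l ||W^[l]||_F^2."""
    L = len(parameters) // 2  # parameters holds W1, b1, ..., WL, bL
    penalty = 0.0
    for l in range(1, L + 1):
        penalty += np.sum(np.square(parameters["W" + str(l)]))
    return (lambd / (2 * m)) * penalty

# Weight decay: the regularized gradient is dW + (lambd/m) * W, so the update
# W -= alpha * dW is equivalent to
#   W = (1 - alpha * lambd / m) * W - alpha * dW_unregularized
# i.e. W is first shrunk by a small fraction of itself.
```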

1.5 Why regularization reduces overfitting?

Why regularization reduces overfitting:
regularization adds a penalty term to J that discourages large weights. If λ is tuned up, the weights w shrink toward 0, so some units have little influence on the network — almost as if those inputs were dropped (upper-left sketch). The network is effectively pruned down, a bit like a deep version of logistic regression. Taken to the extreme this turns high variance into high bias, so somewhere in between there must be a λ that turns the high-variance fit into a "just right" one.
Another example uses the tanh activation: as λ increases, w decreases, z becomes small, and tanh stays in its nearly linear region — the layer behaves like a linear function. A stack of linear functions is still linear, so the network can no longer fit very complicated decision boundaries.
Finally, if you use gradient descent, one debugging step is to plot the cost J against the iteration number and check that it decreases monotonically. If you implement regularization, remember that J now has a new definition: if you still plot only the original first term, you may not see a monotonic decrease. So to debug gradient descent, make sure you plot the newly defined J that includes the regularization term.

1.6 Dropout Regularization

Besides L2 regularization, another very powerful regularization technique is dropout:
the general idea is to build the neural network and then, for each training example, randomly keep or eliminate nodes. (This reminded me a little of the randomly wired networks from Kaiming He's group that I read about a few days ago — maybe there's some inspiration to be had there.)
In practice, dropout is implemented as inverted dropout. The code idea:
first define a 3-layer network and a keep_prob, the probability of keeping a unit — here 0.8, i.e. an 80/20 split between keeping and dropping. Then d3 is a random mask marking which units survive, and multiplying it into a3 zeroes out the dropped units.
Here a3 /= keep_prob scales a3 back up, compensating for the roughly 20% that was removed so the expected value of a3 is unchanged.
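A sketch of inverted dropout for one layer (the toy shape of a3 is made up; the variable names follow the notes):

```python
import numpy as np

keep_prob = 0.8  # keep 80% of the units in this layer

# a3: activations of layer 3, shape (n3, m); toy values just for illustration
a3 = np.random.randn(5, 10)

d3 = np.random.rand(*a3.shape) < keep_prob  # boolean mask: True = keep
a3 = a3 * d3                  # zero out ~20% of the units
a3 = a3 / keep_prob           # scale up so E[a3] is unchanged (the "inverted" part)
```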
Note that at test time (that is, when predicting) you do not drop out any hidden units: you don't want the output to be random, and using dropout at test time would only add noise to the predictions. (In theory you could run many differently-dropped-out networks and average their predictions, but that is inefficient and gives almost the same result.) And thanks to the division by keep_prob on the previous slide, the expected value of the activations is unchanged, so no extra scaling is needed at test time.

1.7 Understanding Dropout

Randomly knocking neurons out of the network may seem crazy, but why does it work so well as a regularizer? The previous section's intuition: on every example you are effectively computing with a smaller network, and using a smaller network looks like regularization.

Look at the problem from the perspective of a single neuron: its task is to produce a meaningful output from its input units. With dropout those inputs get randomly discarded — sometimes these two go, sometimes another one — so the neuron circled in purple cannot depend on any single feature, since each of its inputs may be deactivated at any time. It doesn't want to put all its bets on one input, so it spreads the weight out, giving each input a relatively small weight. Spreading the weights shrinks their sum of squares, just as L2 regularization does: dropout, too, shrinks the weights and helps prevent overfitting. More precisely, dropout can be seen as an adaptive form of L2 regularization, where different weights are penalized differently depending on the size of the activations feeding them; the two have similar effects, but dropout adapts itself to the situation, which makes it more broadly applicable.

When applying dropout, set the keep probability per layer: lower it for layers you worry will overfit, and keep it higher for layers you are less worried about. The input layer usually gets a keep probability of 1.0 (0.9 is also acceptable); you would generally not halve the inputs.
Dropout has some disadvantages:

  1. In cross-validation (grid) search there are more hyperparameters to try, which takes more time to run. One alternative is to apply dropout only to some layers (sharing one keep probability) and not to others, so there is only a single extra hyperparameter.
  2. The cost function J becomes less well defined: because some neurons are randomly deactivated at every iteration, it is hard to check that the cost keeps getting smaller — J is effectively unclear or hard to compute — so you can't debug with the usual plot. What Ng does in this case: turn dropout off (set the keep probability to 1), run the code and make sure J decreases monotonically, then turn dropout back on, hoping no new bugs slipped in with it. Beyond that, you need methods other than this plotting trick to convince yourself gradient descent still works with dropout.

Also note that regularization is only needed when the model overfits. In computer vision the input vector dimension is huge and there is almost never enough data to cover every pixel value, so dropout is used almost by default; but not every field needs it. Dropout is a regularization technique whose purpose is to prevent overfitting, so unless the algorithm is actually overfitting, it isn't worth considering.
(See programming assignment 2 for the concrete implementation.)

1.8 Other regularization methods

Some other regularization methods:
the first is getting more data. If more data is unavailable, the limited data can be augmented with small transformations — flips, crops, slight rotations and distortions (the warped 4 on the slide is exaggerated for effect). An augmented image carries less new information than a truly new one, but it is a cheap and workable way to enlarge the dataset.
Another is early stopping: end training near the point where the dev error starts to rise, i.e. pick the model with the smallest dev-set error. The objection comes from orthogonalization — the principle of handling one task at a time, e.g. first make J(w,b) as small as possible, then separately keep the network from overfitting. Early stopping couples the two tasks, so they can no longer be solved independently.

The alternative to early stopping is L2 regularization: you can then train the network as long as possible, which keeps the hyperparameter search space more decomposed and therefore easier to search. The downside is that you may have to try many values of the regularization parameter λ, which makes the computation expensive.

The advantage of early stopping is that a single run of gradient descent already tries out small, medium, and large values of w, without trying lots of values of the hyperparameter λ as in L2 regularization.

If computing power is sufficient, L2 regularization is the better choice.

1.9 Normalizing inputs

Normalizing the inputs has two steps; the formulas are below. The middle graph shows X − μ (zero-centering) and the right one shows the variance-normalized X.
Why normalize: if the input features $x_1, x_2, x_3, \ldots$ live on very different scales, the cost surface is elongated, and gradient descent needs a small learning rate to creep to the optimum; after normalization the contours are round and longer steps work (right picture).
There is almost no harm in normalizing, so it is used whether or not it turns out to matter.
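A small sketch of the normalization, assuming `X` holds one example per column (the toy data is made up):

```python
import numpy as np

# X: shape (n_x, m), one column per training example; features on wild scales
X = np.random.randn(3, 100) * np.array([[1.0], [10.0], [100.0]])

mu = np.mean(X, axis=1, keepdims=True)           # per-feature mean
X = X - mu                                       # zero-center
sigma2 = np.mean(X ** 2, axis=1, keepdims=True)  # per-feature variance
X = X / np.sqrt(sigma2)                          # unit variance

# Important: normalize the test set with the SAME mu and sigma2
# computed on the training set.
```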

1.10 Vanishing / Exploding gradients

An introduction to vanishing and exploding gradients.
In the figure below, ŷ is the product of many weight matrices. You can see that the powers of 1.5 grow enormous — the activations and gradients explode; likewise, with 0.5 they vanish. Exploding gradients make training blow up, and vanishing gradients make gradient descent extremely slow: you crawl toward the optimum and cannot compensate by raising the learning rate, because the gradients themselves are tiny.

1.11 Weight Initialization for Deep Networks

The partial fix for exploding/vanishing gradients is to multiply the initial w by a scaling factor (the one in the purple box): for ReLU the numerator is 2 (variance $2/n^{[l-1]}$, He initialization), and for tanh the numerator is 1 (variance $1/n^{[l-1]}$, Xavier initialization); there are other variants too. There is also more advanced hyperparameter tuning around this that I don't understand very well — see the video for details.
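A sketch of the initialization rule (the function name `initialize_he` is mine; swap the 2 for a 1 to get the tanh/Xavier variant):

```python
import numpy as np

def initialize_he(layer_dims):
    """He initialization: Var(W) = 2 / n_prev, suited to ReLU layers."""
    np.random.seed(0)  # for reproducibility in this sketch
    parameters = {}
    for l in range(1, len(layer_dims)):
        parameters["W" + str(l)] = (np.random.randn(layer_dims[l], layer_dims[l - 1])
                                    * np.sqrt(2.0 / layer_dims[l - 1]))
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters
```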

1.12 Numerical approximation of gradients

How to check whether your gradient computation is correct:
verify it with the two-sided difference in the figure below, $f'(\theta) \approx \frac{f(\theta+\varepsilon) - f(\theta-\varepsilon)}{2\varepsilon}$, whose error is $O(\varepsilon^2)$ — much better than the one-sided difference's $O(\varepsilon)$.

1.13 Gradient checking

Pack the w's and b's into a vector θ, and the dw's and db's into dθ:
compute dθ_approx (with ε = $10^{-7}$) and the dθ from your own backprop, then evaluate the check formula $\frac{\lVert d\theta_{approx} - d\theta\rVert_2}{\lVert d\theta_{approx}\rVert_2 + \lVert d\theta\rVert_2}$. If it comes out around $10^{-7}$, the gradients are in good shape; if it is only around $10^{-3}$, there may be a problem, and you should keep fixing your code until the check is back around $10^{-7}$.
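A sketch of the whole check, assuming the cost `J` is available as a function of the flattened parameter vector (the function name and signature are my own):

```python
import numpy as np

def gradient_check(J, theta, dtheta, epsilon=1e-7):
    """Compare analytic gradient dtheta with a two-sided numerical estimate.
    J: cost as a function of the parameter vector theta, shape (n, 1)."""
    n = theta.shape[0]
    dtheta_approx = np.zeros_like(dtheta)
    for i in range(n):
        theta_plus = np.copy(theta);  theta_plus[i][0] += epsilon
        theta_minus = np.copy(theta); theta_minus[i][0] -= epsilon
        dtheta_approx[i][0] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)
    numerator = np.linalg.norm(dtheta_approx - dtheta)
    denominator = np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)
    return numerator / denominator  # ~1e-7 is good, ~1e-3 is suspicious
```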

1.14 Gradient Checking Implementation Notes

Things to pay attention to when implementing gradient checking:
Point 4: the check is unstable with dropout, because dropout randomly removes nodes and J is no longer well defined — turn dropout off while checking.
Point 5 is very, very rare but subtle:
it is not impossible for gradient descent to be correct only near initialization. At random initialization w and b are very close to 0, and backpropagation may be accurate there but lose accuracy as w and b grow. So, although Ng doesn't use it often, one thing you can try is to run gradient checking at random initialization, train the network for a while so w and b move away from their small initial values, and then run gradient checking again after several training iterations.

Quiz notes on dropout:
Don't use dropout (randomly eliminating nodes) during test time, and don't apply the 1/keep_prob scaling there either — dropout is only for training.

Only during training time, divide each dropout layer by keep_prob to keep the same expected value for the activations.

Programming assignment 1

Using different initialization methods under otherwise identical conditions gives very different train accuracy. For ReLU activations, He initialization is recommended (it is similar to Xavier initialization, though I forget which activation function Xavier targets). Initializing all the w's to 0 is not recommended, and random initial values should not be too large either.

Programming assignment 2

L2 regularization considerations, etc.
dropout considerations, etc.
dropout experience:

  1. Remember to apply dropout in both forward and backward propagation, and divide by keep_prob in the forward pass. Why: suppose keep_prob is 0.5, so half the units are on and half are off; dividing the surviving half by keep_prob (i.e. multiplying by 2) lets them stand in for the half that disappeared, keeping the expected output unchanged (each kept unit does the work of two). The same reasoning applies to other keep_prob values.

  2. keep_prob is the probability of keeping a unit: 0.8 means keeping 80% of the units (in expectation). That is why the forward-propagation mask in assignment 2 keeps a unit when D1 < keep_prob — I wrote > at first because I hadn't understood it.

  3. The forward pass divides by keep_prob, and the backward pass must divide by keep_prob too, because the two have to stay consistent:

    During forward propagation, you had divided A1 by keep_prob. In backpropagation, you'll therefore have to divide dA1 by keep_prob again (the calculus interpretation is that if $A^{[1]}$ is scaled by keep_prob, then its derivative $dA^{[1]}$ is also scaled by the same keep_prob).

    To sum up, note that regularization makes your weights smaller. From the second picture in 1.5 we know that as w shrinks, z shrinks, tanh becomes approximately linear, and a stack of linear layers is still linear — so the overfitting comes down.

Programming assignment 3

I was left fuzzy about gradient_check_n at the end, so let me sort it out.
Because θ is no longer a scalar, the parameters are stored in a dictionary called parameters; its vector form is called values. This θ is just the w's and b's reshaped and concatenated (it stores all the information in w and b), and the dictionary and the vector are converted back and forth by the xxx_to_xxx helper functions.
This is the dictionary form: print(parameters)
This is the vector form: print(parameters_values)
The code part: parameters_values turns parameters from a dictionary into a vector. Because there is both a +ε and a −ε evaluation, np.copy is needed to duplicate the values. The vector is one-dimensional (a single column), so thetaplus[i][0] is just a number — each θ entry is one element of some w or b. Step 2, θ+ε, is thetaplus[i][0] += ε, and J(θ+ε) is obtained by passing thetaplus through forward propagation. With print(vector_to_dictionary(thetaplus)) you can compare the dictionary form of thetaplus against the unperturbed parameters and see that only the first entry of W1 has changed, by exactly ε. That achieves the goal: form θ+ε and evaluate J(θ+ε). Repeating for every component, i has to run over all num_parameters entries so every number gets measured.
Finally, compare the resulting gradapprox with the grad that the network's backprop produced to find out whether there is a typo anywhere.


Week 2

2.1 Mini-batch gradient descent

Mini-batches: with m = 5,000,000 split into 5,000 batches, each batch holds 1,000 examples. Let t index the batches: $X^{\{1\}}$ is the first mini-batch, containing $x^{(1)}$ through $x^{(1000)}$.
The mini-batch algorithm:
one pass over the training set (m = 5,000,000) with mini-batch gradient descent performs 5,000 gradient-descent steps, one per mini-batch. Plain batch gradient descent performs only one gradient step per pass over the same training set.

Note the difference between an iteration and an epoch: one iteration processes one mini-batch t, and one epoch is a full pass over the data — here, 5,000 iterations. The amount of data per iteration differs between mini-batch and batch gradient descent, but an epoch is the same for both: it feeds all m examples.

In the end it is just splitting the training set — a mathematical game that makes better use of the data.
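A sketch of the splitting, loosely following the assignment's `random_mini_batches` idea (the exact signature here is my own):

```python
import numpy as np

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """Shuffle (X, Y) together, then slice into mini-batches.
    X: (n_x, m), Y: (1, m). The last batch may be smaller."""
    np.random.seed(seed)
    m = X.shape[1]
    permutation = np.random.permutation(m)
    X_shuffled, Y_shuffled = X[:, permutation], Y[:, permutation]
    mini_batches = []
    for k in range(0, m, mini_batch_size):
        mini_batches.append((X_shuffled[:, k:k + mini_batch_size],
                             Y_shuffled[:, k:k + mini_batch_size]))
    return mini_batches
```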

2.2 Understanding mini-batch gradient descent

How the cost curves differ with and without mini-batches: the batch curve decreases smoothly, while the mini-batch curve is noisy but trends downward.
Comparison of stochastic gradient descent, mini-batch, and batch gradient descent: the purple, green, and blue trajectories.
The other question is the mini-batch size: powers of 2 are recommended, generally 64–512 (1,024 is also fine); it takes several attempts to find a good one.

2.3 Exponentially weighted averages

Introducing Exponentially Weighted (Moving) Averages.
Compute the temperature over 365 days. v is a local/moving average of the temperature, updated as $v_t = \beta v_{t-1} + (1-\beta)\theta_t$:
Depending on β, $v_t$ is roughly the average of the last $1/(1-\beta)$ days. The smaller β, the fewer days are averaged — the noisy yellow curve. The larger β, the more days: β = 0.98 averages about 50 days (the green curve), which is smoother but lags because so many days are averaged; β = 0.9, about 10 days, is the red curve, which is about right.

2.4 Understanding exponentially weighted averages

To understand the exponentially weighted average, expand $v_{100}$: the weights fall off geometrically, and $0.9^{10} \approx 0.35 \approx 1/e$, which means most of the value of $v_{100}$ comes from the last 10 days. More generally, setting $\varepsilon = 1-\beta$, we have $(1-\varepsilon)^{1/\varepsilon} \approx \frac{1}{e}$, so β = 0.98 corresponds to averaging roughly $1/\varepsilon = 50$ days of data. This is only a rough rule of thumb to aid understanding, not a real derivation.
In code, only one variable is needed, so it saves memory — one line per update.
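The one-line update in context (the toy temperature data is made up for illustration):

```python
import numpy as np

def ewma(thetas, beta=0.9):
    """Exponentially weighted average: one variable, one line per update."""
    v = 0.0
    vs = []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta  # the whole algorithm
        vs.append(v)
    return np.array(vs)

daily_temps = 20 + 5 * np.sin(np.linspace(0, 6, 365)) + np.random.randn(365)
smoothed = ewma(daily_temps, beta=0.9)  # roughly a 10-day moving average
```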

2.5 Bias correction in exponentially weighted averages

Bias correction makes the average more accurate early on.
With β = 0.98 the curve should be the green line, but what you actually get at the start is the purple line.
The early estimates are poor because $v_0 = 0$: $v_1 = 0.98\,v_0 + 0.02\,\theta_1 = 0.02\,\theta_1$, and $v_2 = 0.0196\,\theta_1 + 0.02\,\theta_2$ — both far below the true level.
So we divide by a correction factor: $v_t / (1-\beta^t)$. When t is large the denominator is almost 1, so little correction is needed later. Bias correction helps you make better estimates early on, but in practice people usually don't bother — see the next section.
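The corrected estimate in code (self-contained, with the same made-up temperature data as above):

```python
import numpy as np

daily_temps = 20 + 5 * np.sin(np.linspace(0, 6, 365)) + np.random.randn(365)

v, beta = 0.0, 0.98
for t, theta in enumerate(daily_temps, start=1):
    v = beta * v + (1 - beta) * theta
    v_corrected = v / (1 - beta ** t)  # big correction at small t, ~none at large t
```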

2.6 Gradient descent with momentum

Momentum — gradient descent with momentum — reuses the formulas from 2.3 and 2.4, as follows.
Suppose we want to reach the red point in the picture. Without momentum we get the blue zig-zag path; we want the vertical oscillation to shrink and the horizontal stride to grow, producing the red path. With momentum, the average of the last few steps nearly cancels along the vertical axis (the ups and downs average to about 0), so the vertical velocity is small, while the horizontal derivatives all point the same way, so the horizontal average is large. After a few iterations the vertical swing shrinks and the horizontal stride grows — the red line.

Another explanation: like a ball rolling into a bowl, the derivative term $(1-\beta)\,db$ is the acceleration, and because β < 1 the $\beta v_{db}$ term acts like friction (note that the $v_{db}$ on the right-hand side is really the previous step's value, since the code reuses a single variable). So the ball does not accelerate forever. Plain gradient descent treats every step independently; here the ball builds momentum as it rolls downhill.
Exercise supplement:
β sets the balance between acceleration and friction, and we can read its size in terms of days averaged:
if β is larger, more days are averaged (section 2.3: roughly $1/(1-\beta)$ days), the past weighs more, the curve is smoother, and the response is slower — like the green line in 2.3.
If β is smaller, the curve oscillates more, but still less than plain gradient descent, because momentum always refers back to the previous average while plain gradient descent is independent step to step.

β is usually 0.9, which is very robust. The bias correction from the previous section is usually not used with momentum. There is an alternative way to write momentum (dropping the 1−β factor), but it is less clean.
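A sketch of one momentum step (the function name is mine; `v_dw` and `v_db` start as zero arrays of the same shapes as the gradients):

```python
def update_with_momentum(w, b, dw, db, v_dw, v_db, alpha=0.01, beta=0.9):
    """One momentum step: velocity is an EWMA of the gradients."""
    v_dw = beta * v_dw + (1 - beta) * dw
    v_db = beta * v_db + (1 - beta) * db
    w = w - alpha * v_dw
    b = b - alpha * v_db
    return w, b, v_dw, v_db
```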

2.7 RMSprop

Introducing RMSprop, which feels similar to momentum. Here the vertical axis is called b and the horizontal axis w purely for the convenience of the lecture — in reality they are just two directions of parameter space. Note that the square is elementwise: $(dw)^2$, and likewise for db.

In the example we want learning along w to be fast and the oscillation along b to shrink. We keep the running averages $S_{dw} = \beta_2 S_{dw} + (1-\beta_2)(dw)^2$ and $S_{db} = \beta_2 S_{db} + (1-\beta_2)(db)^2$, then update $w := w - \alpha\, dw/\sqrt{S_{dw}}$ and $b := b - \alpha\, db/\sqrt{S_{db}}$. Looking at the derivatives: the slope in the vertical (b) direction is much steeper than in the horizontal (w) direction, so db is large and dw is small; hence $(db)^2$ and $S_{db}$ are large while $(dw)^2$ and $S_{dw}$ are small. The vertical update gets divided by a large number, damping the oscillation, while the horizontal update gets divided by a small one, speeding it up — exactly the effect RMSprop is after: less vertical oscillation while the horizontal direction keeps going.

Roughly: the original update had no denominator. Dividing each direction by its own typical gradient size means that for w, which is less steep than b, dw is comparatively small, the denominator is small, and the step grows — w moves faster. Symmetrically, b is steeper, db is comparatively large, and putting it in the denominator weakens the oscillation.

Finally, so that the denominator is never 0, a very small ε is added, often $10^{-8}$.
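A sketch of one RMSprop step under the same conventions (function name mine; `s_dw`, `s_db` start at zeros):

```python
import numpy as np

def update_with_rmsprop(w, b, dw, db, s_dw, s_db,
                        alpha=0.01, beta2=0.999, epsilon=1e-8):
    """One RMSprop step: divide each direction by its typical gradient size."""
    s_dw = beta2 * s_dw + (1 - beta2) * dw ** 2   # elementwise square
    s_db = beta2 * s_db + (1 - beta2) * db ** 2
    w = w - alpha * dw / (np.sqrt(s_dw) + epsilon)
    b = b - alpha * db / (np.sqrt(s_db) + epsilon)
    return w, b, s_dw, s_db
```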

2.8 Adam optimization algorithm

Many optimization algorithms fail to generalize across different neural networks; Adam, RMSprop, and momentum, introduced here, all do.
The Adam optimization algorithm combines RMSprop and momentum:
momentum uses β₁ and RMSprop uses β₂;
when using Adam, the bias-corrected estimates are used,
and the ε goes into the denominator.
The hyperparameter α still needs regular tuning to find a good value; β₁ and β₂ are rarely adjusted — the suggested defaults are β₁ = 0.9, β₂ = 0.999, ε = $10^{-8}$. Finally, the meaning of the name:

Adam stands for Adaptive Moment Estimation. β₁ averages the derivatives — the first moment; β₂ computes the exponentially weighted average of the squares — the second moment. That is where the name Adam comes from.
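A sketch of one Adam step for a single parameter array (function name mine):

```python
import numpy as np

def adam_step(w, dw, v, s, t, alpha=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update. v, s start at zeros; t is the 1-based iteration count."""
    v = beta1 * v + (1 - beta1) * dw           # momentum (first moment)
    s = beta2 * s + (1 - beta2) * dw ** 2      # RMSprop (second moment)
    v_hat = v / (1 - beta1 ** t)               # bias correction
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + epsilon)
    return w, v, s
```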

2.9 Learning rate decay

The learning-rate decay formula:
$\alpha = \frac{1}{1 + \text{decay\_rate} \times \text{epoch\_num}}\,\alpha_0$
where decay_rate and $\alpha_0$ are two more hyperparameters to tune.
There are other decay schedules as well — exponential decay, $\frac{k}{\sqrt{t}}$ decay where t is the mini-batch number, a discrete staircase, and manual control of α.
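The formula and the alternatives in code (the names are mine):

```python
alpha0, decay_rate = 0.2, 1.0

def decayed_alpha(epoch_num):
    """alpha = alpha0 / (1 + decay_rate * epoch_num)"""
    return alpha0 / (1 + decay_rate * epoch_num)

# Other schedules mentioned in the lecture:
#   exponential:  alpha = 0.95 ** epoch_num * alpha0
#   square root:  alpha = k / (t ** 0.5) * alpha0   (t = mini-batch number)
#   a discrete staircase, or simply adjusting alpha by hand
```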

If you're now thinking "wow, there are so many hyperparameters, how do I choose among so many options?" — don't worry about it for now; next week covers choosing hyperparameters systematically. For Ng, learning-rate decay is usually near the bottom of the list of things to try: setting a fixed α and tuning it well has a huge impact on the results. Decay does help sometimes — it can genuinely speed up training — but it stays low on his priority list. Next week's material on hyperparameter optimization shows more systematic ways to arrange all the hyperparameters and search among them effectively. That is all for learning-rate decay.

2.10 The problem of local optima

There is no need to fear that the network will fall into a bad local optimum: in high dimensions, with many parameters and a larger network, a zero-gradient point is far more likely to be a saddle point. The fear of bad local optima comes from our familiarity with low-dimensional pictures, but the surfaces we actually face are high-dimensional. The real problem is the plateaus around saddle points, where progress is very slow — which is where Adam and friends help us move faster.

Programming assignment

Notes for momentum (see the assignment):
Note that Adam includes bias correction.
Finally, mini-batches are used to feed the data — see the code for the details.
Part 3 then compares the ways of updating the parameters (w and b): plain gradient descent (gd), momentum, and Adam.
Comparing their final accuracy, I didn't expect Adam to beat momentum by so much, while momentum ends up similar to plain gradient descent; Adam looks clearly better.

Momentum doesn't shine here because the dataset is small and the mini-batches are few, so its effect isn't obvious. Fed more epochs, all three models perform well, but Adam still converges much faster. In summary, Adam is the best.

Week 3

(the last week!!!)

3.1 Tuning process

How to choose the various hyperparameters:
Ng considers α the most important, followed by the ones in the yellow and purple boxes on the slide; the ones not boxed (β₁, β₂, ε) are basically never touched. Of course you can weigh them differently.
In early research practice, with, say, only 2 hyperparameters and 25 trials to spend, points were chosen on a grid — but then hyperparameter 1 really takes only 5 distinct values, and so does hyperparameter 2. If the 25 points are chosen at random instead, hyperparameter 1 and hyperparameter 2 each take 25 different values.
After evaluating the 25 points, you will find that the points in one region perform better; narrow the range and keep sampling randomly within that region — search from coarse to fine.

3.2 Using an appropriate scale to pick hyperparameters

Choose an appropriate scale for each hyperparameter, and constrain the random range accordingly.
If the range spans orders of magnitude, uniform sampling is skewed: for α ∈ [0.0001, 1], a uniform draw lands in [0.1, 1] with 90% probability, which biases the search. Instead, divide the axis evenly per decade — 0.0001, 0.001, 0.01, 0.1, 1 — and sample uniformly on that log scale so every decade gets equal probability.
The method: write the left end as $10^a$ and the right end as $10^b$, draw r uniformly from [a, b], and set $\alpha = 10^r$. Because the scale divides at every factor of 10, this is just the idea of logarithms at work.
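A sketch of log-scale sampling for α, plus the same trick applied to β (variable names mine):

```python
import numpy as np

# Sample alpha uniformly on a log scale over [1e-4, 1]:
r = -4 * np.random.rand()       # r ~ Uniform[-4, 0]
alpha = 10 ** r                 # each decade gets equal probability

# Same idea for beta in [0.9, 0.999]: sample 1 - beta on a log scale
r = np.random.uniform(-3, -1)   # 1 - beta in [0.001, 0.1]
beta = 1 - 10 ** r
```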

3.3 Hyperparameters tuning in practice: Pandas vs. Caviar

Hyperparameter tuning in practice: the panda approach vs. the caviar approach.
First, it is recommended to re-evaluate your hyperparameters every so often to see whether they need adjusting, because factors such as the data change over time.
There are two ways to tune hyperparameters: the panda approach and the caviar approach.
One is the left picture: babysit a single model, adjusting its hyperparameters day by day as it trains — like pandas, which have few cubs and look after one at a time (over a lifetime a panda can raise several cubs, so after two or three weeks you can start a new model and babysit that one instead).
The other is the right picture: on the same day, launch many models with different hyperparameter settings, let them all run for some days, and see which setting does best — like fish laying masses of eggs (caviar).

The computing resources you own determine which way to tune. If you have enough machines to test many models in parallel, the caviar approach is definitely better.

However, in application areas such as online advertising and computer vision there is massive data and a large number of models to train, and training many models in parallel is extremely difficult, so many research groups prefer the panda approach — a choice really dictated by the nature of the field.

3.4 Normalizing activations in a network

Batch Norm: batch normalization.
Just as normalizing the inputs helped logistic regression earlier, normalization can also be used inside a neural network. Some versions normalize a and some normalize z; in practice z is the usual choice.
The formulas are as follows. γ and β can be set from the formulas derived from z_norm ($\gamma = \sqrt{\sigma^2+\varepsilon}$ and $\beta = \mu$ recover the original z), or they can be learned freely. For example, you may not want mean 0 and variance 1: you may not want the values to sit forever in the linear part of the sigmoid, and would rather push them into the nonlinear part — that is when γ and β get adjusted.
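A sketch of the Batch Norm forward computation for one mini-batch (the function name is mine):

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize pre-activations Z (shape (n, m)) over the mini-batch,
    then rescale with learnable gamma, beta (shape (n, 1))."""
    mu = np.mean(Z, axis=1, keepdims=True)
    var = np.var(Z, axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    Z_tilde = gamma * Z_norm + beta   # gamma=sqrt(var+eps), beta=mu recovers Z
    return Z_tilde
```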

3.5 Fitting Batch Norm into a neural network

The concrete flow of Batch Norm inside a neural network:
note that the β of Batch Norm is not the β of Adam or momentum; the paper uses β, so keeping it makes the paper easier to read.
The subsequent parameter updates can still use Adam and the rest, and Batch Norm also combines with mini-batches.
Note that with Batch Norm, b can be eliminated (or permanently set to 0): the normalization forces the mean to 0 and the variance to 1 before the rescaling by γ and β, so the $b^{[l]}$ in z cancels out, and the offset is supplied by the β in $\tilde{z}$ instead.

because Batch Norm zeroes out the mean of these $z^{[l]}$ values in the layer, there's no point having this parameter $b^{[l]}$

Finally, note the dimensions: $\gamma^{[l]}$ and $\beta^{[l]}$ are $(n^{[l]}, 1)$.
The overall flow of a mini-batch + Batch Norm (+ Adam or another parameter-update optimization) neural network:

3.6 Why does Batch Norm work?

Why is BN useful? (Maybe read the paper — after all, I have it.)
I'm a bit too lazy to explain; see the video and the papers linked in the title.
This section links to the NetEase Cloud Classroom version of the course.

3.7 Batch Norm at test time

At test time there is no mini-batch to normalize over, so μ and σ² are estimated during training with an exponentially weighted average across the mini-batches, and those running estimates are used to compute z̃ when predicting. (There are other ways to obtain them; γ and β are simply the learned values.)

3.8 Softmax Regression

Softmax is used for multi-class classification.
The idea: exponentiate each logit and normalize by the total, so every class gets a share of the overall probability; the class with the highest share is the answer.
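A sketch of softmax, using the 4-class logits from the lecture's example:

```python
import numpy as np

def softmax(z):
    """z: (n_classes, m) logits -> per-column probabilities summing to 1."""
    t = np.exp(z - np.max(z, axis=0, keepdims=True))  # shift for numerical stability
    return t / np.sum(t, axis=0, keepdims=True)

z = np.array([[5.0], [2.0], [-1.0], [3.0]])  # the 4-class example from the lecture
print(softmax(z).ravel())   # highest probability -> predicted class
```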

3.9 Training a softmax classifier

The implementation details follow.
Note that with some frameworks you only need to supply the forward propagation and backprop is derived for you. For softmax with the cross-entropy loss, the output-layer gradient is simply $dz^{[L]} = \hat{y} - y$, as below.

3.10 Deep learning frameworks

Recommended deep learning frameworks, and what to look at when choosing one. One point: whether it is truly open — some open-source frameworks stop being open after a few years, or move features into proprietary cloud services.

One thing worth checking is how much you trust the framework to remain open source for a long time rather than being controlled by a single company: even if it is open source now, it may be closed off for various reasons in the future.

3.11 TensorFlow

This section explains how to use TensorFlow to get things done.

First, TensorFlow implements backpropagation for you: you only give it the forward cost formula and the learning_rate. When you write the cost, as long as one operand is a tf.Variable, the operators +, -, *, / and squaring are overloaded into tf.add() and friends, which builds the computation graph in the upper-right corner; TF computes the backpropagation derivatives from that graph. Hand cost and learning_rate to train — GradientDescentOptimizer here is plain gradient descent, and you can swap in Adam and the like — then call session.run(train) to take a step.

Then x = tf.placeholder(...) reserves a slot for x whose value can be passed in later. The coefficients array below defines a matrix, and feed_dict={x: coefficients} assigns the coefficients to x when session.run(train) is fed. That is exactly the usage for mini-batches: change the coefficients between runs, and the network's input x changes with them.

Finally, session = tf.Session() and the two lines after it set up and run the training. The with-statement version on the right is considered better: with makes cleanup easier in Python in case the inner block raises an error or exception (I don't know much about this; I haven't looked into what with does in Python).
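The example from the video, reconstructed as best I can (this assumes TensorFlow 1.x; tf.placeholder and tf.Session no longer exist in TF 2.x):

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x

coefficients = np.array([[1.], [-10.], [25.]])  # cost = w^2 - 10w + 25 = (w-5)^2

w = tf.Variable(0, dtype=tf.float32)
x = tf.placeholder(tf.float32, [3, 1])          # slot for the coefficients
cost = x[0][0] * w ** 2 + x[1][0] * w + x[2][0] # operators overloaded into graph ops
train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

init = tf.global_variables_initializer()
with tf.Session() as session:                   # the "with" form cleans up on errors
    session.run(init)
    for i in range(1000):
        session.run(train, feed_dict={x: coefficients})
    print(session.run(w))                       # approaches 5.0
```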
A detailed comparison from the homework:
usage for computing a sigmoid — note that feed_dict={x: z} passes z in to update the value of x.

Quiz

One answer worth noting:
No. You can run deep learning programming languages from any machine with a CPU or a GPU, either locally or on the cloud.

programming homework

The thing to remember with TensorFlow is that you need to create a session and run operations inside the session for them to take effect.
About sess.close(): the first run prints 6, but running the cell again raises an error; you then have to re-run the sess = tf.Session() line above, or simply delete the sess.close().


Origin blog.csdn.net/Only_Wolfy/article/details/94618604