Deep learning training and tuning tips, lessons from experience (repost)

People often ask: how can you improve the results of the deep learning model you are training? Every time, the honest answer comes down to three things: first, understand a lot; second, experiment a lot; third, remember what you learned and do not forget it. So I am writing this blog post to record some of my own experience as well as that of others.

Ilya Sutskever (a student of Hinton) on deep learning insights and practical advice:


Data acquisition: make sure the input/output dataset is high quality, that it is large and representative, and that the labels are relatively clean. Without such a dataset it is difficult to succeed.

Preprocessing: centering the data is very important, i.e. make the data zero-mean and give each dimension unit variance. Sometimes, when an input dimension varies over several orders of magnitude, it is best to apply log(1 + x) to that dimension. Basically, it is important to find a faithful encoding of each dimension that has a natural zero value; doing so makes learning work better. This is the case because the weights are updated as Δw_ij ∝ x_i · ∂L/∂y_j (where w connects layer x to layer y, and L is the loss function). If the values of x have a large mean (e.g. 100), then the weight updates will be very large and strongly correlated, which makes learning poor and slow. Keeping zero mean and small variance is a key success factor.
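As a minimal sketch of this kind of standardization (the array names are illustrative, not from the original post), per-dimension centering and scaling in NumPy looks like this:

```python
import numpy as np

# X_train is an illustrative (num_examples, num_features) array of raw inputs.
X_train = np.random.rand(1000, 20) * 100.0   # stand-in data with a large mean

# Optionally compress a dimension that spans several orders of magnitude
# before centering, e.g. a heavy-tailed feature in column k:
# X_train[:, k] = np.log1p(X_train[:, k])

mean = X_train.mean(axis=0)        # per-dimension mean
std = X_train.std(axis=0) + 1e-8   # per-dimension std; epsilon avoids division by zero
X_train = (X_train - mean) / std   # now zero mean and unit variance in every dimension

# Apply the same mean/std (computed on the training set) to validation/test data.
```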

Minibatches: processing only one training example at a time is very inefficient on today's hardware. If instead you use a minibatch of, say, 128 examples, efficiency improves dramatically because the throughput is far higher. In fact a batch size of 1 often works well, improving generalization and reducing overfitting, but that benefit is likely to be outweighed by the efficiency of larger batches. Do not use batches that are too large either, because they can be inefficient and encourage overfitting. So my advice is: choose a batch size that suits your hardware configuration, and aim for efficiency.
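For concreteness, here is a small sketch of iterating over minibatches of 128 examples with PyTorch's DataLoader; the tensors are placeholders rather than a real dataset:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Placeholder tensors standing in for a real dataset.
inputs = torch.randn(10_000, 20)
targets = torch.randint(0, 10, (10_000,))

loader = DataLoader(TensorDataset(inputs, targets), batch_size=128, shuffle=True)

for x_batch, y_batch in loader:
    # one forward/backward/update pass per minibatch of 128 goes here
    pass
```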

Gradient normalization: divide the gradient by the minibatch size. This is a good idea because then, if you double (or halve) the minibatch size, you do not need to change the learning rate (much, anyway).
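A sketch of this normalization in PyTorch, assuming a placeholder model and a summed (not averaged) loss; note that with a mean-reduced loss the division by the batch size already happens implicitly:

```python
import torch

model = torch.nn.Linear(20, 10)                        # placeholder model
loss_fn = torch.nn.CrossEntropyLoss(reduction="sum")   # summed, not averaged, loss
x, y = torch.randn(128, 20), torch.randint(0, 10, (128,))

loss_fn(model(x), y).backward()

# Divide each gradient by the minibatch size, so doubling or halving
# the batch does not force a change of learning rate.
batch_size = x.shape[0]
with torch.no_grad():
    for p in model.parameters():
        if p.grad is not None:
            p.grad /= batch_size
```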

Learning rate schedule: start with a normal-sized learning rate (LR) and shrink it towards the end of training.

A typical LR value is 0.1. Surprisingly, 0.1 is a good learning rate for a large number of neural network problems. In general, prefer a smaller learning rate over a larger one.
Use a validation set (a portion of the data held out from training) to decide when to lower the learning rate and when to stop training, for example when the error on the validation set starts to increase.
A practical learning rate schedule: if progress on the validation set stalls, divide the LR by 2 (or 5) and continue. Eventually the LR will become very small, and that is the time to stop training. This ensures that once validation performance starts to suffer, you do not keep fitting (or overfitting) the training data. Lowering the LR is important, and controlling it via the validation set is the right approach.
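A sketch of this schedule using PyTorch's ReduceLROnPlateau; the model, the patience value, and the validation helper are all illustrative placeholders:

```python
import torch

model = torch.nn.Linear(20, 10)                          # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # start at the typical LR of 0.1

# Halve the LR whenever the validation error stops improving.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3
)

def evaluate_on_validation_set():
    # stand-in for computing the error on a real held-out validation set
    return torch.rand(1).item()

for epoch in range(100):
    # ... one epoch of training would go here ...
    val_error = evaluate_on_validation_set()
    scheduler.step(val_error)                    # scheduler watches the validation error

    if optimizer.param_groups[0]["lr"] < 1e-5:   # LR has become very small: stop
        break
```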

But the most important thing is to pay attention to the learning rate. Some researchers (such as Alex Krizhevsky) monitor the ratio between the norm of the weight update and the norm of the weights. This ratio should be around 1e-3. If it is much smaller, learning will be very slow; if it is much larger, learning will be very unstable or even fail.
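A sketch of monitoring this ratio for a plain SGD update (the model and batch are placeholders); the update norm here is just the learning rate times the gradient norm:

```python
import torch

model = torch.nn.Linear(20, 10)    # placeholder model
lr = 0.1

x, y = torch.randn(128, 20), torch.randint(0, 10, (128,))
torch.nn.functional.cross_entropy(model(x), y).backward()

for name, p in model.named_parameters():
    if p.grad is None:
        continue
    ratio = (lr * p.grad).norm() / p.data.norm()
    # Around 1e-3 is healthy: much smaller means slow learning,
    # much larger means unstable learning.
    print(f"{name}: update/weight ratio = {ratio.item():.2e}")
```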

Weight initialization: pay attention to how the weights are randomly initialized at the start of training.

If you want to be lazy, try 0.02 * randn(num_params). Values in this range work well on many different problems. Of course, smaller (or larger) values are also worth a try.
If that does not work well (for example, with an unconventional and/or very deep neural network architecture), initialize each weight matrix with init_scale / sqrt(layer_width) * randn. In this case, init_scale should be set to 1 or 0.1, or a similar value.
For deep and recurrent networks, random initialization is extremely important. If it is not handled well, the network will look as though it is not learning anything at all. But we know that once the conditions are right, the neural network does learn.
An interesting story: for years, researchers believed that SGD could not train deep neural networks from random initialization. Every attempt ended in failure. Embarrassingly, they failed because they initialized with "small random weights"; although small values work very well in practice for shallow networks, performance on deep networks is poor. When the network is very deep, many weight matrices are multiplied together, and the result does not get amplified, so the signal shrinks.
For shallow networks, on the other hand, SGD can recover from this problem.

So paying attention to initialization is necessary. Try a variety of different initializations; the effort will be rewarded. If the network does not work at all (i.e. never gets off the ground), continuing to improve the random initialization is the right choice.
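A sketch of the two initializations described above in NumPy; interpreting layer_width as the input width (fan-in) of the matrix is my assumption, not something stated in the post:

```python
import numpy as np

def lazy_init(num_params):
    # The "lazy" option: 0.02 * randn(num_params).
    return 0.02 * np.random.randn(num_params)

def scaled_init(fan_in, fan_out, init_scale=1.0):
    # The per-matrix option: init_scale / sqrt(layer_width) * randn,
    # with init_scale around 0.1 or 1 (layer_width taken as fan_in here).
    return (init_scale / np.sqrt(fan_in)) * np.random.randn(fan_in, fan_out)

W1 = scaled_init(784, 512, init_scale=1.0)   # illustrative layer sizes
W2 = scaled_init(512, 10, init_scale=0.1)
```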

If you are training an RNN or LSTM, apply a hard constraint to the norm of the gradient (remember that the gradient has already been divided by the batch size). A constraint of around 15 or 5 has worked well in my personal experiments. Divide the gradient by the batch size, then check whether its norm exceeds 15 (or 5); if it does, scale it back down to 15 (or 5). This little trick plays a significant role in training RNNs and LSTMs. Without it, exploding gradients will cause learning to fail, and you will end up having to use a learning rate as tiny and useless as 1e-6.
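In PyTorch the same hard norm constraint is a one-liner via clip_grad_norm_; the LSTM, the stand-in loss, and the threshold of 15 below are illustrative:

```python
import torch

lstm = torch.nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
optimizer = torch.optim.SGD(lstm.parameters(), lr=0.1)

x = torch.randn(128, 50, 32)   # (batch, time, features), placeholder data
out, _ = lstm(x)
out.pow(2).mean().backward()   # stand-in loss (mean already divides by batch size)

# If the total gradient norm exceeds 15, rescale it down to 15.
torch.nn.utils.clip_grad_norm_(lstm.parameters(), max_norm=15.0)
optimizer.step()
```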

Check the gradient values: if you are not using Theano or Torch, you have probably implemented the gradients by hand. It is very easy to make mistakes when implementing gradients, so a numerical gradient check is essential. Doing so will give you confidence in your own code. Tuning hyperparameters (such as the learning rate and the initialization) is very valuable, so that effort should be spent where it counts, on code you know is correct.
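A sketch of a centered finite-difference check against a hand-derived gradient, using a simple least-squares loss where the analytic gradient is known:

```python
import numpy as np

def loss(w, x, y):
    return 0.5 * np.sum((x @ w - y) ** 2)   # 0.5 * ||x w - y||^2

def analytic_grad(w, x, y):
    return x.T @ (x @ w - y)                # hand-derived gradient of the loss above

def numerical_grad(f, w, eps=1e-5):
    g = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus.flat[i] += eps
        w_minus.flat[i] -= eps
        g.flat[i] = (f(w_plus) - f(w_minus)) / (2 * eps)   # centered difference
    return g

x, y, w = np.random.randn(20, 5), np.random.randn(20), np.random.randn(5)
ga = analytic_grad(w, x, y)
gn = numerical_grad(lambda w_: loss(w_, x, y), w)

# The relative error should be tiny (e.g. < 1e-6) if the analytic gradient is correct.
rel_err = np.linalg.norm(ga - gn) / (np.linalg.norm(ga) + np.linalg.norm(gn))
print("relative error:", rel_err)
```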

If you are using LSTMs and want to train them on problems with long-range time dependencies, you should initialize the forget gate bias of the LSTM to a larger value. By default, the forget gate is the sigmoid of its total input, and when the weights are small the forget gate sits around 0.5, which works for only part of the problems. This is a caveat to keep in mind when initializing LSTMs.
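As one concrete way to do this, here is a sketch for a PyTorch LSTM; it relies on PyTorch's documented packing of each bias vector in input/forget/cell/output order, and the value 1.0 is just a commonly used choice, not something from the post:

```python
import torch

hidden_size = 64
lstm = torch.nn.LSTM(input_size=32, hidden_size=hidden_size, batch_first=True)

# PyTorch packs each bias vector as [input, forget, cell, output] gates,
# each of length hidden_size, so the forget-gate slice is the second quarter.
with torch.no_grad():
    for name, param in lstm.named_parameters():
        if "bias" in name:
            param[hidden_size:2 * hidden_size].fill_(1.0)   # larger forget-gate bias
```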

Data augmentation: using algorithms to increase the number of training examples is a creative approach. If you have images, translate and rotate them; if you have audio, mix the clean audio with all types of noise. Data augmentation is an art (except when working with images), and it requires a certain amount of common sense.
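For the image case, a sketch with torchvision transforms (flips, small rotations and shifts applied on the fly); the dataset line is only a commented placeholder:

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),   # small translations
    transforms.ToTensor(),
])

# Placeholder usage: each epoch then sees a slightly different version of every image.
# train_set = torchvision.datasets.CIFAR10(root="data", train=True,
#                                          transform=augment, download=True)
```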

Dropout: dropout provides a simple way to improve performance. Remember to tune the dropout rate, and do not forget to turn dropout off at test time and to multiply the weights by the appropriate factor (namely 1 - dropout rate). Also, be sure to train the network a little longer. Unlike ordinary training, where validation error usually starts to increase late in training, a dropout network keeps working better and better over time, so patience is key.
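A sketch of dropout in a small PyTorch model; note that nn.Dropout is the inverted variant, so the 1 - dropout-rate rescaling is handled for you and the key step is switching between train() and eval():

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 512),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # the dropout rate to tune
    nn.Linear(512, 10),
)

x = torch.randn(128, 784)   # placeholder minibatch

model.train()               # dropout active during training
train_logits = model(x)

model.eval()                # dropout turned off at test time
with torch.no_grad():
    test_logits = model(x)  # no manual (1 - dropout rate) rescaling needed here
```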

Ensembling: train 10 neural networks and average their predictions. This approach is simple, but it gives a direct and significant performance improvement. Some people may wonder why averaging is so effective. It can be explained with an example: suppose two classifiers each have an accuracy of 70%; when they agree they are usually correct, and when they disagree, the one that remains confident is more often correct, so the averaged prediction lands closer to the right answer. This effect is stronger when the network is confident when its result is correct and unconfident when its result is wrong.
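A sketch of averaging the predicted probabilities of several networks; the ten models here are untrained placeholders, whereas in practice each would be trained separately (e.g. from different random seeds):

```python
import torch
import torch.nn as nn

def make_model():
    return nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))

models = [make_model() for _ in range(10)]   # in practice: 10 separately trained nets

x = torch.randn(128, 20)    # placeholder test batch

with torch.no_grad():
    probs = torch.stack([torch.softmax(m(x), dim=1) for m in models])  # (10, 128, 10)
    avg_probs = probs.mean(dim=0)        # averaged (ensemble) prediction
    prediction = avg_probs.argmax(dim=1)
```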

(The points below are a simplified summary of the above.)


1: Prepare the data: always make sure you have a large amount of high-quality data with clean labels; without data, learning is impossible.
2: Preprocessing: not much to add here, just make the data zero-mean with unit variance.
3: Minibatch: 128 is the recommended value; 1 gives the best results but is inefficient; do not use a value that is too large, because it very easily leads to overfitting.
4: Gradient normalization: after computing the gradient, divide it by the minibatch size. Not much more to explain.
5: The following points are mainly about the learning rate:
5.1: Generally start with an ordinary learning rate and then gradually decrease it.
5.2: 0.1 is a recommended value; for a lot of NN problems, lean toward smaller values in general.
5.3: A recommended learning rate schedule: if performance on the validation set stops improving, divide the learning rate by 2 or 5 and continue; the learning rate will keep getting smaller, and in the end you can stop training.
5.4: Many people set the learning rate by monitoring a ratio (the norm of each update divided by the norm of the current weights). This ratio should be around 1e-3; if it is smaller, learning will be very slow, and if it is larger, learning will be very unstable and will fail.
6: Use a validation set so you know when to start reducing the learning rate and when to stop training.
7: Some advice on weight initialization:
7.1: If you are lazy, just use 0.02 * randn(num_params) to initialize; of course, other values are also worth trying.
7.2: If the above does not work well, then initialize each weight matrix with init_scale / sqrt(layer_width) * randn, where init_scale can be set to 0.1 or 1.
7.3: The initialization parameters have a crucial influence on the results, so pay attention to them.
7.4: For deep networks, randomly initializing the weights and then training with plain SGD generally does not work well, because the initial weights are small. This is fine for a shallow network, but once the network is deep enough it breaks down, because the updates involve multiplying many small weights together, so the signal shrinks, somewhat like the vanishing gradient problem (this last sentence is my own addition).
8: If training an RNN or LSTM, make sure the gradient norm is constrained to 15 or 5 (with the gradient first normalized by the batch size); this is very important for RNNs and LSTMs.
9: Check the gradient numerically if you compute the gradient yourself.
10: If you use an LSTM to solve problems with long time dependencies, remember to initialize the forget gate bias to a larger value.
12: Find as many ways as possible to augment the training data; if you are using image data, you can, for example, flip the images to expand the training set a little.
13: Use dropout.
14: When evaluating the final result, run it a few more times and average the results.
Original address: https://blog.csdn.net/yanhx1204/article/details/79625382


Origin: www.cnblogs.com/ldh-up/p/11652553.html