I am often asked how to get better results when training deep learning models. Every time I answer, I feel out of my depth: first, my understanding is limited; second, I haven't run that many experiments; third, my memory is not what it should be. So I'm writing this blog post to record some of my own experience, along with advice from others.
Ilya Sutskever (a student of Hinton) has shared the following insights and practical advice on deep learning:
Data acquisition: make sure you have a high-quality input/output data set, one that is large enough, representative, and has relatively clean labels. Without such a data set, it is hard to succeed.
Preprocessing: centering the data is very important, that is, making the data zero-mean with unit variance in every dimension. Sometimes, when an input dimension varies over orders of magnitude, it is best to apply log(1 + x) to that dimension. Basically, the point is to find a faithful encoding for each dimension, one in which zero is a natural value. Doing so makes learning work much better. The reason lies in the weight update rule: Δw_ij ∝ x_i · ∂L/∂y_j (where w connects layer x to layer y, and L is the loss function). If the mean of x is large (e.g., 100), the weight updates will be very large and highly correlated, which makes learning slow and unreliable. Maintaining zero mean and small variance is a key success factor.
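The preprocessing described above can be sketched in a few lines of NumPy. The data here is randomly generated purely for illustration; a real pipeline would compute the mean and standard deviation on the training set and reuse them at test time:

```python
import numpy as np

# Hypothetical raw data: 5 samples, 3 features, values spread over a
# wide range (fabricated just to demonstrate the transformations).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 200.0, size=(5, 3))

# Dimensions that vary over orders of magnitude can be compressed first.
X_log = np.log1p(X)  # log(1 + x); keeps x = 0 mapped naturally to 0

# Center to zero mean and scale to unit variance, per dimension.
mean = X_log.mean(axis=0)
std = X_log.std(axis=0)
X_std = (X_log - mean) / std

print(np.allclose(X_std.mean(axis=0), 0.0))  # True
print(np.allclose(X_std.std(axis=0), 1.0))   # True
```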
Minibatches: processing only one training example at a time is very inefficient on today's computers. If you instead use minibatches of, say, 128 examples, efficiency improves dramatically because the resulting matrix operations are so much faster. In principle a batch size of 1 works well too, and can even improve performance and reduce overfitting; but the speedup from larger batches is likely to outweigh that. Do not use batches that are too large, however, because they tend to work less well and to overfit. So my advice is: choose the batch size that runs efficiently on your hardware.
Gradient normalization: divide the gradient by the minibatch size. This is a good idea because then, if you double (or halve) the batch size, you do not need to change the learning rate (at least, not by much).
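Here is a minimal sketch of that normalization, using fabricated per-example gradients. The point it demonstrates: after dividing by the batch size, the gradient's scale no longer depends on how many examples are in the batch, so the learning rate can stay fixed:

```python
import numpy as np

rng = np.random.default_rng(1)

# Per-example gradients for a hypothetical 10-dim parameter vector.
grads = rng.normal(size=(128, 10))  # batch of 128 examples

# Summing gradients grows with the batch size...
summed = grads.sum(axis=0)

# ...so divide by the batch size to normalize.
normalized = summed / grads.shape[0]

# Doubling the batch (here: duplicating every example) leaves the
# normalized gradient unchanged.
doubled = np.concatenate([grads, grads], axis=0)
normalized_2x = doubled.sum(axis=0) / doubled.shape[0]
print(np.allclose(normalized, normalized_2x))  # True
```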
Learning rate schedule: start from a normal-sized learning rate (LR) and shrink it toward the end of training.
A typical LR value is 0.1. Surprisingly, 0.1 is a good learning rate for a large number of neural network problems. When in doubt, prefer a smaller learning rate over a larger one.
Use a validation set, a subset of the data held out from training, to decide when to lower the learning rate and when to stop training (for example, when the validation error starts to increase).
A practical learning rate schedule: whenever progress on the validation set stalls, divide the LR by 2 (or 5) and continue. Eventually the LR will become very small, and that is the time to stop training. This ensures that you will not keep (over)fitting the training data once validation performance starts to suffer. Lowering the LR is important, and letting the validation set control it is the right approach.
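The schedule described above can be sketched as a short loop. The validation errors here are fabricated to show the mechanics: halve the LR whenever validation stops improving, and stop once the LR becomes tiny:

```python
# A minimal sketch of validation-controlled LR decay (values fabricated).
lr = 0.1
best = float("inf")
history = []
fake_val_errors = [0.9, 0.7, 0.6, 0.61, 0.55, 0.56, 0.56, 0.52, 0.53, 0.53]

for err in fake_val_errors:
    if err < best:
        best = err          # validation improved: keep the current LR
    else:
        lr /= 2.0           # validation stalled: divide LR by 2 (or 5)
    history.append(lr)
    if lr < 1e-4:
        break               # LR has become very small: time to stop

print(history)
```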
But the most important thing is to pay attention to the learning rate. A method some researchers use (such as Alex Krizhevsky) is to monitor the ratio between the norm of the update and the norm of the weights. That ratio should be about 10⁻³. If the value is much smaller, learning will be very slow; if it is much larger, learning will be unstable or fail outright.
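Computing that ratio is cheap enough to do every few steps. A sketch with made-up weights and gradients (the scales and thresholds here are illustrative, not prescriptive):

```python
import numpy as np

rng = np.random.default_rng(2)

W = rng.normal(scale=0.02, size=(256, 256))  # hypothetical weight matrix
grad = rng.normal(size=W.shape)              # hypothetical gradient
lr = 0.01

update = lr * grad
ratio = np.linalg.norm(update) / np.linalg.norm(W)

# Heuristic from the text: aim for a ratio around 1e-3.
print(f"update/weight norm ratio: {ratio:.1e}")
if ratio > 1e-2:
    print("ratio large: consider lowering the learning rate")
elif ratio < 1e-4:
    print("ratio small: consider raising the learning rate")
```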
Weight initialization: pay attention to the random initialization of the weights at the start of training.
If you want to be lazy, try 0.02 * randn(num_params). Values in this range work well on many different problems. Of course, smaller (or larger) values are also worth trying.
If that does not work well (for example, with an unconventional and/or very deep neural network architecture), initialize each weight matrix with init_scale / sqrt(layer_width) * randn. In this case, init_scale should be set to 1 or 0.1, or a similar value.
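Both recipes can be sketched directly. One assumption to flag: the text does not say whether layer_width means fan-in or fan-out, so the sketch takes it to be the fan-in (the first dimension of the matrix):

```python
import numpy as np

rng = np.random.default_rng(3)

def init_simple(shape):
    """The 'lazy' recipe from the text: 0.02 * randn."""
    return 0.02 * rng.standard_normal(shape)

def init_scaled(shape, init_scale=1.0):
    """Scale by 1/sqrt(layer_width); layer_width is taken here as the
    fan-in (an assumption -- the text only says sqrt(layer_width))."""
    fan_in = shape[0]
    return init_scale / np.sqrt(fan_in) * rng.standard_normal(shape)

W1 = init_simple((512, 512))
W2 = init_scaled((512, 512), init_scale=1.0)
print(W1.std(), W2.std())  # ~0.02 and ~1/sqrt(512)
```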
For deep and recurrent networks, random initialization is extremely important. If it is handled poorly, the network will look as if it is not learning anything at all. But we know that, once the conditions are right, the network will learn.
An interesting story: for years, researchers believed that SGD could not train deep neural networks from random initialization. Every attempt ended in failure. Embarrassingly, they did not succeed because they initialized with "small random weights"; while small values work very well in practice on shallow networks, they perform badly on deep ones. When the network is very deep, the signal passes through the product of many weight matrices, and with small weights that product shrinks the signal until almost nothing gets through.
With a shallow network, by contrast, SGD can cope with this problem.
So paying attention to initialization is necessary. Try a variety of different initializations; the effort will be rewarded. If the network does not work at all (i.e., it never gets off the ground), improving the random initialization is the right place to start.
If you are training an RNN or LSTM, use a hard constraint on the norm of the gradient (remember that the gradient has already been divided by the batch size). A constraint like 15 or 5 works well in my personal experiments. Take the gradient, divide it by the batch size, then check whether its norm exceeds 15 (or 5). If it does, scale it down to 15 (or 5). This little trick plays a significant role in training RNNs and LSTMs; without it, exploding gradients cause learning to fail, and you end up forced into a learning rate as tiny and useless as 1e-6.
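The clipping step above is a one-liner once you have the gradient norm. A minimal sketch, with a fabricated "exploding" gradient to show the rescaling in action:

```python
import numpy as np

def clip_gradient(grad, batch_size, max_norm=15.0):
    """Hard-constrain the gradient norm as described above: first divide
    by the batch size, then rescale if the norm exceeds max_norm."""
    g = grad / batch_size
    norm = np.linalg.norm(g)
    if norm > max_norm:
        g = g * (max_norm / norm)
    return g

rng = np.random.default_rng(4)
big_grad = rng.normal(scale=100.0, size=1000)  # an "exploding" gradient
clipped = clip_gradient(big_grad, batch_size=32, max_norm=15.0)
print(np.linalg.norm(clipped))  # rescaled down to 15
```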
Check your gradients numerically: if you are not using Theano or Torch, you will be implementing gradients by hand. It is very easy to make mistakes in a gradient implementation, so a numerical gradient check is essential. Doing so will give you confidence in your own code. Tuning the hyperparameters (such as the learning rate and initialization) is very valuable, so spend your effort where it pays off most.
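A standard way to do this check is central differences: perturb each parameter by a small eps in both directions and compare the resulting slope against the analytic gradient. A sketch on a toy squared loss for a linear model (the model and data are illustrative):

```python
import numpy as np

def loss(w, x, y):
    """Toy squared loss for a linear model (illustrative only)."""
    return 0.5 * np.sum((x @ w - y) ** 2)

def analytic_grad(w, x, y):
    return x.T @ (x @ w - y)

def numerical_grad(f, w, eps=1e-5):
    """Central differences: perturb each parameter and measure the slope."""
    g = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        g[i] = (f(w_plus) - f(w_minus)) / (2 * eps)
    return g

rng = np.random.default_rng(5)
x = rng.normal(size=(20, 4))
y = rng.normal(size=20)
w = rng.normal(size=4)

ga = analytic_grad(w, x, y)
gn = numerical_grad(lambda v: loss(v, x, y), w)
print(np.max(np.abs(ga - gn)))  # should be tiny if the gradient is right
```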
If you are using LSTMs and want them to learn problems with long-range dependencies, the biases of the forget gates should be initialized to large values. By default, the forget gate is a sigmoid of all its inputs, so with small weights the gate sits around 0.5, which is effective for only some problems: the network forgets too quickly. This is one caveat about LSTM initialization.
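Concretely, this only touches the bias vector. The gate layout below ([input, forget, cell, output] concatenated into one vector) is an assumption for illustration; real libraries differ in how they pack the four gates:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden = 64  # hypothetical LSTM hidden size

# Biases for the four gates, packed as [input, forget, cell, output]
# (this layout is an assumption; check your library's convention).
b = np.zeros(4 * hidden)
b[hidden:2 * hidden] = 5.0  # large forget-gate bias

# With bias 0 the forget gate starts near 0.5 and the cell state decays
# quickly; with bias 5 it starts near 1 and can carry information far.
print(sigmoid(0.0), sigmoid(5.0))  # 0.5 vs ~0.993
```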
Data augmentation: using algorithms to increase the number of training examples is a creative approach. For images, translate and rotate them; for audio, mix the clean signal with all kinds of noise. Augmenting data is an art (except in image processing), and it requires a certain amount of common sense.
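The simplest image augmentation is a horizontal flip, which alone doubles the training set. A sketch on a fake batch (real pipelines would load actual images and also rotate, crop, add noise, and so on):

```python
import numpy as np

rng = np.random.default_rng(6)

# A fake batch of four 8x8 grayscale "images" (fabricated for the demo).
images = rng.uniform(size=(4, 8, 8))

def augment(batch):
    """Return the batch plus horizontally flipped copies of every image."""
    flipped = batch[:, :, ::-1]  # flip along the width axis
    return np.concatenate([batch, flipped], axis=0)

augmented = augment(images)
print(augmented.shape)  # (8, 8, 8): twice as many training examples
```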
Dropout: dropout provides a simple way to improve performance. Remember to tune the dropout rate, and do not forget to turn dropout off at test time and multiply the weights by the keep probability (that is, 1 - dropout rate). Also, make sure to train the network a bit longer. Unlike ordinary training, the validation error will often get worse partway through training; the dropout network keeps working better and better over time, so patience is key.
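The train/test asymmetry described above looks like this in code: drop units randomly during training, and at test time keep everything but scale by the keep probability so expected activations match. (Modern libraries usually use "inverted" dropout, which scales at training time instead; this sketch follows the text's version.)

```python
import numpy as np

rng = np.random.default_rng(7)

def dropout_train(x, drop_rate):
    """Training: zero out each unit independently with prob. drop_rate."""
    mask = rng.uniform(size=x.shape) >= drop_rate
    return x * mask

def dropout_test(x, drop_rate):
    """Testing: dropout off; scale by (1 - drop_rate), as the text says,
    so expected activations match the training-time behavior."""
    return x * (1.0 - drop_rate)

x = np.ones(100000)
drop_rate = 0.5
train_out = dropout_train(x, drop_rate)
test_out = dropout_test(x, drop_rate)
print(train_out.mean(), test_out.mean())  # both ~0.5 in expectation
```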
Ensembling: train 10 neural networks, then average their predictions. The approach is simple, but it yields a direct and significant performance improvement. Some people may wonder why averaging is so effective. It can be explained with an example: suppose two classifiers have an error rate of 70%; where one of them is still correct, the averaged prediction will be pulled closer to the right answer. The effect is even more pronounced for well-calibrated networks, which are confident when they are right and unconfident when they are wrong.
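A quick simulation makes the averaging effect concrete. Here 10 independent classifiers are each right 70% of the time on a binary task (numbers fabricated for illustration; real ensemble members are correlated, so the gain is smaller in practice):

```python
import numpy as np

rng = np.random.default_rng(8)

# Simulate 10 independent classifiers, each correct with probability 0.7.
n_models, n_examples = 10, 20000
truth = rng.integers(0, 2, size=n_examples)
correct = rng.uniform(size=(n_models, n_examples)) < 0.7
preds = np.where(correct, truth, 1 - truth)  # each model's 0/1 guesses

single_acc = (preds[0] == truth).mean()               # ~0.70
avg = preds.mean(axis=0)                              # averaged prediction
ensemble_acc = ((avg > 0.5).astype(int) == truth).mean()
print(single_acc, ensemble_acc)  # the average is clearly more accurate
```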
(The points that follow are a condensed version of the above.)