A summary of practical deep learning experience

A note up front:

This article was originally published as how-to-start-a-deep-learning-project, and a Chinese translation is available on Heart of the Machine under the title "How to build a deep learning project from scratch? Here is a detailed tutorial".

Ignore both the Chinese and English titles: this is not a detailed beginner's tutorial, but a summary of lessons learned at each step of a deep learning application. It is very well written, and I recommend it here.

The article's specific application is using a generative adversarial network to colorize Japanese manga. Although the domain is narrow, much of the content generalizes. What follows is the distilled experience, starting from the beginning of the project:


Experience:

  • Finding bugs in deep learning models is very hard, so start simple and work your way up. For example, model optimizations (such as regularization) can wait until after the code is debugged.
  • Visualize predictions and model metrics frequently, and get a model up and running first so there is a baseline to fall back on. Better not to get stuck on one big model, trying to get every module right at once.
  • Project Research Phase: Conduct research on existing products to explore their weaknesses.
  • Standing on the shoulders of giants. Reading research papers can be painful, but very rewarding.
  • Pay attention to data quality : classes should be balanced and data plentiful; the data and labels should carry high-quality information, contain very few errors, and be relevant to your problem.
  • Small projects collect far fewer samples than academic datasets, so transfer learning can be applied where appropriate.
  • Avoid random improvements : start by analyzing your model's weaknesses rather than tweaking at random.
  • Building a deep learning model is not simply stacking network layers. Adding good constraints can make learning more efficient, or smarter. For example, an attention mechanism tells the network where to look, and in a variational autoencoder the latent factors are trained to follow a normal distribution.
  • Many pretrained models are available for solving deep learning problems. (I have seen this firsthand in NLP, and also in image processing and machine translation.)
  • Both L1 and L2 regularization are common, but L2 is more popular in deep learning. L1 regularization produces sparser parameters; L2 remains more popular, likely because its solutions are more stable.
  • Gradient descent : always monitor closely for vanishing or exploding gradients. Gradient problems have many possible causes that are hard to pin down; don't jump straight to tuning the learning rate or changing the model design too quickly.
  • Scale input features . We usually scale features to zero mean within a certain range, such as [-1, 1]. Improper feature scaling is one of the most common causes of exploding or vanishing gradients. Sometimes we compute the mean and variance from the training data to bring the data closer to a normal distribution; when scaling validation or test data, reuse the training data's mean and variance. (In fact, you can also check for distribution differences between the training samples and real-world samples.)
  • Batch normalization also helps with gradient problems, so it has gradually replaced dropout in many settings. Whether combining dropout with L2 regularization helps is domain specific; usually we can test dropout during tuning and collect empirical evidence of its benefit.
  • Activation function : in deep learning, ReLU is the most commonly used nonlinear activation. If the learning rate is too high, the activations of many nodes may get stuck at zero. If changing the learning rate doesn't help, try leaky ReLU or PReLU. In leaky ReLU, when x < 0 the output is not 0 but x times a small predefined downward slope (such as 0.01, or set by a hyperparameter). Parametric ReLU (PReLU) goes one step further: each node has its own trainable slope.
  • Make sure samples are sufficiently shuffled within each dataset and each batch of training samples.
  • Overfitting a model on a small amount of training data is the best way to debug deep learning code. If the loss does not drop within a few thousand iterations, debug the code further. Once you get past guesswork, you've reached your first milestone. Then make subsequent modifications to the model: add network layers and custom components; start training on the full training data; increase regularization to control overfitting, monitoring the accuracy gap between the training and validation sets.
  • Early problems come mainly from bugs, not from model design or fine-tuning. (Well said.)
  • Initializing all weights to 0 is a very common mistake; the deep network then learns nothing. Initialize weights from a Gaussian distribution instead.
  • Check and sanity-test the loss function . The model's loss must drop below the value for random guessing. For example, in a 10-class classification problem, the cross-entropy loss of random guessing is -ln(1/10) = ln(10) ≈ 2.30.
  • Avoid combining multiple loss functions on the data. The weight of each loss may be on a different order of magnitude and takes effort to tune. With only one loss function, we only need to care about the learning rate.
  • Data augmentation : collecting labeled data is expensive. For images, we can use augmentation methods such as rotation, random cropping, and shifting to modify existing data and generate more. Color distortions include hue, saturation, and exposure shifts. ( Augmentation is used less in NLP. )
  • Mini-batch size : common batch sizes are 8, 16, 32, or 64. If the batch size is too small, gradient descent is not smooth, the model learns slowly, and the loss may oscillate. If the batch size is too large, one training iteration (one round of updates) takes too long for diminishing returns. In our project we reduced the batch size because each training iteration took too long, while closely monitoring the learning rate and loss: if the loss oscillates wildly, the batch size has been cut too far. Batch size also interacts with hyperparameters such as the regularization factor, so once it is settled we usually lock it in.
  • The learning rate and the regularization factor are highly correlated and sometimes need to be tuned together. Don't fine-tune too early, or you risk wasting time: if the design changes, those efforts are in vain.
  • Dropout rates are usually between 20% and 50%. Let's start with 20%. Increase the value if the model is overfitting.
  • Grid search is computationally expensive; for smaller projects it is used only sparingly. We start by tuning coarse-grained parameters with fewer training iterations; in the later fine-tuning stage, we use longer runs and narrow each parameter down to about 3 candidate values (or fewer).
  • Kaggle is a great place to learn and discuss. After all, "what is learned on paper is always shallow; to truly understand, you must practice it yourself."
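
The L1-vs-L2 contrast above can be sketched in plain Python (a minimal illustration of the penalties and their gradients, not tied to any framework): the L2 gradient shrinks each weight in proportion to its size, while the L1 gradient applies a constant pull toward zero, which is why L1 yields sparser parameters.

```python
def l2_penalty(weights, lam):
    # L2: lam * sum(w^2); its gradient 2*lam*w shrinks weights proportionally
    return lam * sum(w * w for w in weights)

def l1_penalty(weights, lam):
    # L1: lam * sum(|w|); its gradient lam*sign(w) pulls weights toward exact 0
    return lam * sum(abs(w) for w in weights)

def l2_grad(w, lam):
    return 2 * lam * w

def l1_grad(w, lam):
    return lam * (1 if w > 0 else -1 if w < 0 else 0)
```

Near zero, the L2 pull fades away while the L1 pull stays constant, so small weights are driven all the way to zero under L1 but merely shrink under L2.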
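
One way to "closely monitor" gradients, as the gradient-descent bullet advises, is to log the global gradient norm each step and flag suspicious values. A minimal sketch, assuming the model's gradients are available as a flat list of floats (the thresholds here are illustrative, not from the article):

```python
import math

def global_grad_norm(grads):
    # L2 norm over all gradient values in the model
    return math.sqrt(sum(g * g for g in grads))

def check_gradients(grads, explode_thresh=100.0, vanish_thresh=1e-7):
    # Flag likely exploding or vanishing gradients by the global norm
    norm = global_grad_norm(grads)
    if norm > explode_thresh:
        return "exploding"
    if norm < vanish_thresh:
        return "vanishing"
    return "ok"
```

Logging this norm every step makes it obvious when a change to the model or data pipeline breaks training, before you start blaming the learning rate.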
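
The feature-scaling advice (fit the mean and variance on the training split, then reuse them everywhere) can be sketched as:

```python
import math

def fit_scaler(train):
    # Compute mean and standard deviation on the TRAINING split only
    mean = sum(train) / len(train)
    var = sum((x - mean) ** 2 for x in train) / len(train)
    std = math.sqrt(var)
    return mean, std if std > 0 else 1.0  # guard against constant features

def transform(data, mean, std):
    # Reuse the training statistics for train, validation, and test splits
    return [(x - mean) / std for x in data]
```

Fitting separate statistics on the validation or test set would leak information and make the splits inconsistent with each other.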
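
The ReLU variants mentioned in the activation-function bullet differ only in how they treat negative inputs; element-wise they look like this:

```python
def relu(x):
    # Standard ReLU: negative inputs are hard-zeroed
    return x if x > 0 else 0.0

def leaky_relu(x, slope=0.01):
    # Negative inputs get a small fixed slope instead of a hard zero,
    # so the node keeps a nonzero gradient and cannot die completely
    return x if x > 0 else slope * x

def prelu(x, slope):
    # Same shape as leaky ReLU, but `slope` is a trainable parameter per node
    return x if x > 0 else slope * x
```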
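
The "overfit a small dataset first" debugging trick can be illustrated with a toy model: if even a one-parameter linear model cannot drive the loss toward zero on four clean points, the training code is broken. (The model and data here are my own minimal example, not the article's GAN.)

```python
def overfit_check(xs, ys, lr=0.01, steps=2000):
    # Fit y = w*x by plain gradient descent on mean squared error.
    # On a tiny clean dataset the loss should collapse toward zero
    # if the training loop is implemented correctly.
    w = 0.0
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return w, loss
```

A loss that refuses to drop on data this easy points to a bug (wrong sign, broken data pipeline, dead gradient), not to a modeling problem.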
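
A minimal version of the Gaussian weight initialization the bullet recommends; the 1/sqrt(fan_in) scaling is a common convention I am adding here, not something the article specifies:

```python
import random

def init_weights(fan_in, fan_out, seed=None):
    # Draw each weight from a zero-mean Gaussian, scaled by 1/sqrt(fan_in)
    # so activations keep a similar magnitude from layer to layer
    rng = random.Random(seed)
    std = fan_in ** -0.5
    return [[rng.gauss(0.0, std) for _ in range(fan_out)]
            for _ in range(fan_in)]
```

Unlike the all-zeros initialization the bullet warns about, these weights break symmetry: every node starts with a different value, so they can learn different features.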
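
The loss sanity check from the bullets is worth hard-coding into your training script: for K balanced classes, a uniform random guess has cross-entropy -ln(1/K) = ln(K), so any trained model must beat that number.

```python
import math

def random_guess_ce(num_classes):
    # Cross-entropy of predicting 1/K for every class: -ln(1/K) = ln(K)
    return -math.log(1.0 / num_classes)
```

For the 10-class example in the text this gives ln(10) ≈ 2.30; if your loss plateaus at or above it, the model has learned nothing.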
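
The geometric augmentations mentioned (flips, crops, shifts) can be sketched on an image represented as a nested list of pixel values; real projects would use a library, but the operations themselves are simple:

```python
def hflip(img):
    # Horizontal flip: reverse each row of pixels
    return [row[::-1] for row in img]

def crop(img, top, left, h, w):
    # Take an h x w window starting at (top, left)
    return [row[left:left + w] for row in img[top:top + h]]
```

Each transformed copy is a new labeled sample for free, which is exactly why augmentation is so attractive when labeling is expensive.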
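
The dropout rates mentioned (20% to 50%) are the probability of zeroing each activation during training. A minimal inverted-dropout sketch (the rescaling by 1/(1-rate) is the standard trick so no change is needed at test time):

```python
import random

def dropout(activations, rate, training=True, seed=None):
    # Inverted dropout: zero each unit with probability `rate` during
    # training and rescale survivors by 1/(1-rate); identity at test time.
    if not training or rate == 0.0:
        return list(activations)
    rng = random.Random(seed)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```

Because the survivors are rescaled, the expected activation is unchanged, so the same network can be used for inference with dropout simply switched off.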

