This article will help you master 22 neural network training tips

The following article comes from Deep Blue AI, by Kuang Ji (reposted; it will be removed upon request in case of infringement)



Neural network training is a complex process in which many variables interact, which makes it hard for researchers to see how each of them affects the network. The tips collected in this article are meant to make training easier and faster. They are not mandatory steps, but heuristic suggestions that help you better understand the task at hand and choose the appropriate technique for it.

Choosing a good starting point for training is a very broad topic, ranging from data augmentation to the choice of hyperparameters. The specific techniques are listed below:


1.Overfit a single batch

Overfitting a single batch is mainly used to test the capacity of the network. First, feed in a single batch of data and make sure the labels associated with that batch are correct (if labels are required). Then train repeatedly on this one batch until the loss stabilizes. If the network cannot reach perfect accuracy (or a perfect score on whatever metric you use), first inspect the data; with correct data, this check tells us whether the model itself can fit the problem. It also keeps us from throwing an overly large or complex model at a simple problem. After all, the most suitable tool is the most effective one (no need to use a butcher's knife to kill a chicken).
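A minimal sketch of this check, assuming a compiled Keras model and a batched tf.data dataset (the names model and train_ds are placeholders):

import tensorflow as tf

def overfit_single_batch(model, train_ds, steps=200):
    # Take one batch and reuse it for every training step.
    single_batch = train_ds.take(1).cache().repeat()
    history = model.fit(single_batch, epochs=1, steps_per_epoch=steps)
    # With correct data and enough capacity, the final loss should approach zero.
    return history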


2.Run with a high number of epochs

In many cases, training a model for a larger number of epochs gives better results. If we can afford long training runs, one strategy is to grow the number of epochs gradually (e.g., from 100 to 500). Over time, as we accumulate experience training models, we can write down our own rules of thumb (call them epoch factors) and use them when training a new model: quickly set an initial number of epochs and then increase it by a fixed percentage.


3.Set seeds

To make model training reproducible, one approach is to seed every operation that involves random number generation. For example, with the TensorFlow framework we could use the following code snippet:

import os, random
import numpy as np
import tensorflow as tf

def set_seeds(seed: int):
    # Seed every source of randomness: Python hashing, the random module,
    # TensorFlow ops, and NumPy.
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    tf.random.set_seed(seed)
    np.random.seed(seed)

For the PyTorch framework:


import numpy as np
import torch

np.random.seed(0)
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)

# Make cuDNN deterministic (at some cost in speed).
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

The reason seeding works is that computers cannot produce truly random numbers; they output pseudo-random numbers generated according to fixed rules. By setting a seed we fix the starting state of those rules, so the same "random" sequence is reproduced on every run; this is exactly what the set_seeds function above does. For details, refer to the TensorFlow documentation.


4.Rebalance the dataset

A dataset is imbalanced when one or a few classes make up a large share of it while the remaining classes are represented by only a small share. If the different classes share essentially similar characteristics, we can consider strategies to address the imbalance, such as: up-sampling the minority class, down-sampling the majority class, collecting additional samples (if possible), or generating artificial samples with data augmentation. A small sketch follows.
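One hedged example, assuming integer labels in a NumPy array y_train (the model and data names are placeholders): class weights for Keras can be computed like this.

import numpy as np

def compute_class_weights(y_train):
    # Weight each class inversely proportional to its frequency,
    # so rare classes contribute more to the loss.
    classes, counts = np.unique(y_train, return_counts=True)
    total = counts.sum()
    return {int(c): total / (len(classes) * n) for c, n in zip(classes, counts)}

# Usage (hypothetical model and data):
# model.fit(x_train, y_train, class_weight=compute_class_weights(y_train))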


5.Use a neutral class

Consider the following situation: your dataset has two classes, class 1 and class 2 (i.e., not class 1), and the samples have all been labeled by experts (to guarantee label quality). If there is a sample whose class cannot be determined, it may end up labeled as neither class, or assigned to one of them with very low confidence. In such cases, introducing a third, neutral class is a good solution; here this extra class would be labeled "indeterminate". During training we can then exclude these indeterminate samples. Afterwards, the trained model can be used to re-label the ambiguously labeled samples.


6.Set the bias of the output layer

For imbalanced datasets, the network is bound to guess poorly in its first steps. Even though it will eventually learn the class distribution, this wastes training time. We can reduce that time by initializing the output bias sensibly at model design time. For a sigmoid output layer, the bias can be calculated with the following formula (assuming there are only two classes):
bias = log(pos / neg)
Once the model has been created, we can use this value to initialize the bias of the output layer.
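A minimal sketch for a binary classifier, assuming counts pos and neg of positive and negative training samples (the numbers below are hypothetical):

import numpy as np
import tensorflow as tf

pos, neg = 1000, 9000                      # hypothetical class counts
initial_bias = np.log(pos / neg)

output_layer = tf.keras.layers.Dense(
    1,
    activation='sigmoid',
    # Start the output bias at log(pos/neg) so initial predictions
    # match the class distribution instead of 0.5.
    bias_initializer=tf.keras.initializers.Constant(initial_bias),
)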


7. Tune the learning rate

If you want to tune some hyperparameters, the first one to focus on is the learning rate. Below is a training curve produced when the learning rate is set too high:

[Figure: training curve with a learning rate that is too high]
In contrast, if we use a different and smaller initial learning rate, we can get the following results:
[Figure: training curve with a smaller initial learning rate]
Clearly, the choice of learning rate has a major impact on both training time and final accuracy, so we will not treat it lightly. A strategy for systematically finding a good learning rate will be covered in detail in a follow-up article. For now, we give an empirical rule: start with an initial learning rate somewhere between 0.001 and 0.01.


8.Use fast data pipelines

For small projects a custom data generator is fine, but for larger projects it pays to replace it with a dedicated dataset mechanism. In TensorFlow, for example, we can use the tf.data API, which already provides most of what is needed: shuffling, batching, prefetching, and so on. Such a professional dataset mechanism, rather than a hand-rolled generator, works well in real projects.
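A minimal sketch of such a pipeline, assuming NumPy arrays x_train and y_train as placeholders:

import tensorflow as tf

def make_pipeline(x_train, y_train, batch_size=32):
    ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
    ds = ds.shuffle(buffer_size=len(x_train))      # shuffle the full dataset
    ds = ds.batch(batch_size)                      # form batches
    ds = ds.prefetch(tf.data.AUTOTUNE)             # overlap preprocessing and training
    return ds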


9.Use data augmentation

Data augmentation lets us train a more robust model by enlarging the dataset or by up-sampling minority classes, at the cost of longer training. Some commonly used image augmentation methods are listed below, with a small sketch after the list:
1. Flip
2. Rotation
3. Scale (zoom)
4. Crop
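As a hedged sketch, these augmentations can be expressed with Keras preprocessing layers (the exact arguments below are illustrative):

import tensorflow as tf

data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip('horizontal'),   # flip
    tf.keras.layers.RandomRotation(0.1),        # rotation (fraction of a full turn)
    tf.keras.layers.RandomZoom(0.2),            # scale / zoom
    tf.keras.layers.RandomCrop(224, 224),       # crop to a fixed size
])

# The block can be applied inside the model or on a tf.data pipeline:
# augmented = data_augmentation(images, training=True)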


10.Train an AutoEncoder on unlabeled data, use latent space representation as embedding

If the labeled dataset available for training is small, there are still strategies that let us make use of it. One of them is to train an AutoEncoder; the background is that unlabeled data is usually easy to collect. We then train the AutoEncoder with a reasonably sized latent space (e.g., 300 to 600 dimensions) until the reconstruction loss is acceptably small. To obtain embeddings for the actual data, we discard the decoder and use the retained encoder to generate the embeddings.
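A minimal sketch of this idea (the layer sizes and latent_dim are illustrative assumptions):

import tensorflow as tf

input_dim, latent_dim = 784, 300               # illustrative sizes

inputs = tf.keras.Input(shape=(input_dim,))
encoded = tf.keras.layers.Dense(latent_dim, activation='relu')(inputs)
decoded = tf.keras.layers.Dense(input_dim, activation='sigmoid')(encoded)

autoencoder = tf.keras.Model(inputs, decoded)
autoencoder.compile(optimizer='adam', loss='mse')
# autoencoder.fit(x_unlabeled, x_unlabeled, epochs=10)   # train on unlabeled data

# Keep only the encoder and use its outputs as embeddings.
encoder = tf.keras.Model(inputs, encoded)
# embeddings = encoder.predict(x_labeled)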


11.Utilize embeddings from other models

Unlike item 10, where we learn embeddings from our own data, we can also reuse embeddings learned by other models. For text tasks, downloading pre-trained embeddings is a common approach. For image tasks, we can take a model trained on a large dataset (e.g., ImageNet), pick a layer of the fully trained network, cut the model off at that layer, and use its outputs as embeddings.
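One possible sketch, using a Keras application pretrained on ImageNet (the choice of ResNet50 and average pooling is an assumption):

import tensorflow as tf

# Pretrained backbone without its classification head; the pooled
# output of the last convolutional block serves as the embedding.
backbone = tf.keras.applications.ResNet50(
    weights='imagenet', include_top=False, pooling='avg')

# images: a float tensor of shape (batch, 224, 224, 3)
# preprocessed = tf.keras.applications.resnet50.preprocess_input(images)
# embeddings = backbone.predict(preprocessed)   # shape (batch, 2048)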


12.Use embeddings to shrink data

Suppose our data samples carry a categorical feature. At first the feature may only take two values, so its one-hot encoding has just two indices. But once the feature grows to 1000 categories or more, a sparse one-hot encoding is no longer efficient, because the data can be represented in a much lower dimension; an embedding is then an effective choice. We can insert an embedding layer before the rest of the network, feed it the category indices (from 0 up to 1000 or more), and obtain a dimensionality-reduced representation that the network learns during training. See the sketch below.
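A minimal sketch (the vocabulary size and embedding dimension are illustrative):

import tensorflow as tf

num_categories, embedding_dim = 1000, 16       # illustrative sizes

category_input = tf.keras.Input(shape=(1,), dtype='int32')
# Maps each category index to a dense 16-dimensional vector
# instead of a sparse 1000-dimensional one-hot vector.
embedded = tf.keras.layers.Embedding(num_categories, embedding_dim)(category_input)
embedded = tf.keras.layers.Flatten()(embedded)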


13.Use checkpointing

Training a model for hours and then losing everything because of a crash is deeply frustrating. Since neither hardware nor software runs perfectly, saving checkpoints is an important habit. In the simplest usage we just save the model weights every k steps; in more advanced usage we also save the optimizer state, the current epoch, and any other key information. Then, after a failed run, we can find the last snapshot and restore all the necessary state from it. Checkpointing works especially well in combination with the custom training loops of item 14.
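A hedged example with the Keras callback API (the file path is a placeholder):

import tensorflow as tf

checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath='checkpoints/model_{epoch:02d}.weights.h5',  # hypothetical path
    save_weights_only=True,       # store only the weights
    save_best_only=True,          # keep the best model according to validation loss
    monitor='val_loss',
)

# model.fit(train_ds, validation_data=val_ds, epochs=50, callbacks=[checkpoint_cb])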


14.Write custom training loops

In most cases the default training pipeline, such as model.fit() in TensorFlow, is efficient enough. However, its flexibility is limited: small changes are easy to incorporate, but larger ones are hard to implement. That is why we recommend writing your own custom training loop. We will not go into detail here; a follow-up article in this series will show, through code examples, how to quickly implement and modify such loops and integrate your own ideas.
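A minimal sketch of a custom loop with tf.GradientTape (model, optimizer, loss_fn, and train_ds are placeholders):

import tensorflow as tf

@tf.function
def train_step(model, optimizer, loss_fn, x, y):
    with tf.GradientTape() as tape:
        predictions = model(x, training=True)
        loss = loss_fn(y, predictions)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# for epoch in range(num_epochs):
#     for x_batch, y_batch in train_ds:
#         loss = train_step(model, optimizer, loss_fn, x_batch, y_batch)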


15.Set hyperparameters appropriately

Modern GPUs are very good at matrix calculations, which is why they are widely used to train large neural network models. By choosing appropriate hyperparameters we can further improve efficiency; for Nvidia GPUs (currently the mainstream), we can refer to the following guidelines:

  1. Choose a batch size divisible by 4, or an even larger multiple of 2

  2. For dense layers, set both the input and output sizes to be divisible by 64

  3. For convolutional layers, set the input and output channel counts to be divisible by 4, or an even larger multiple of 2

  4. Pad RGB input images from 3 channels to 4 channels

  5. Use the BHWC (batch, height, width, channels) data layout

  6. For recurrent layers, set the batch size and hidden size to be divisible by 4; ideal values are 64, 128, and 256

The idea behind these recommendations is to lay the data out so that it aligns better with the hardware. Here are some reference documents:

Nvidia Documentation 1

Nvidia Documentation 2

Nvidia Documentation 3


16. Use EarlyStopping

When to stop training a model is a hard question. One phenomenon to be aware of is deep double descent: your metrics first improve steadily, then deteriorate, and after further training updates the score improves again, sometimes beyond the previous best. To avoid bouncing back and forth, we use a validation dataset, a separate dataset that measures performance on new, unseen data. If the validation score does not improve within a set number of "patience" epochs, training is stopped. The key is to choose a patience value large enough to let the model get past temporary plateaus; a commonly used patience is between 5 and 20 epochs.
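A hedged example with the Keras callback (the patience value is just the rule of thumb above):

import tensorflow as tf

early_stopping_cb = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',           # watch validation loss
    patience=10,                  # allow 10 epochs without improvement
    restore_best_weights=True,    # roll back to the best epoch
)

# model.fit(train_ds, validation_data=val_ds, epochs=500, callbacks=[early_stopping_cb])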


17.Use transfer-learning

The idea behind transfer learning is to take a model that has already been trained on a large dataset and apply it to our own task. Ideally, the network was trained on the same data type (images, text, audio) and on a task similar to ours (classification, translation, detection). There are two main related approaches:

1.Fine-tuning

Fine-tuning means taking the already trained model and updating its weights for our specific problem. Typically we freeze the first few layers, since they have learned to recognize basic features, and fine-tune the remaining layers on our dataset.

2. Feature extraction

In contrast to fine-tuning, feature extraction uses the trained network only to extract features: on top of the pre-trained model we add our own classifier and update only that part, while the base layers stay frozen. The reason is that the original top layers were trained for their original problem, which differs from ours. By training the custom part of the network from scratch, we make sure it focuses on our dataset while keeping the benefits of the large base model. A sketch follows.
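A minimal sketch of the feature-extraction variant (MobileNetV2, the input size, and num_classes are illustrative assumptions):

import tensorflow as tf

num_classes = 10                             # hypothetical number of target classes

base_model = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights='imagenet')
base_model.trainable = False                 # freeze the pretrained base

model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(num_classes, activation='softmax'),  # our own classifier
])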


18.Employ data-parallel training

If we want to train a model faster, we can run the training on multiple GPUs. Typically this is done in a data-parallel fashion: the network is replicated on the different devices, and each device receives a different split of the batch; the gradients are then averaged and applied to every replica. In TensorFlow we can choose among several distributed training strategies; the easiest option is MirroredStrategy, but there are many others that we will not cover here. If we write a custom training loop (as in item 14 above), we can follow the corresponding tutorials. In our experience, scaling from one GPU to two or three gives the best speedup, and for large datasets it is an effective way to shorten training.
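A hedged sketch with MirroredStrategy (the model built inside the scope is a placeholder):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print('Number of replicas:', strategy.num_replicas_in_sync)

with strategy.scope():
    # Build and compile the model inside the strategy scope so its
    # variables are mirrored across all visible GPUs.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')

# model.fit(train_ds, epochs=10)   # batches are split across replicas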


19.Use sigmoid activation for multi-label tasks

When a sample can have several labels at once, we use a sigmoid activation on the output layer. Unlike softmax, the sigmoid is applied to each output neuron independently, so several neurons can fire, each producing a value between 0 and 1, which is easy to interpret. This matters in tasks such as assigning a sample to multiple classes or detecting several different objects in one image.
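A minimal sketch of a multi-label output head (the input shape and num_labels are illustrative):

import tensorflow as tf

num_labels = 5                               # hypothetical number of labels

inputs = tf.keras.Input(shape=(128,))
x = tf.keras.layers.Dense(64, activation='relu')(inputs)
# One sigmoid per label: each output is an independent probability.
outputs = tf.keras.layers.Dense(num_labels, activation='sigmoid')(x)

model = tf.keras.Model(inputs, outputs)
# Binary cross-entropy treats every label as its own yes/no decision.
model.compile(optimizer='adam', loss='binary_crossentropy')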


20.One-hot encode categorical data

Categorical data has to be encoded as numbers because the network needs a numerical representation. For example, we cannot feed in the category "Golden Retriever" directly; we can only pass a number that represents it. A tempting option is to enumerate all possible values, e.g., "Golden Retriever" becomes 1 and "Orange Cat" becomes 2. But this implicitly imposes an ordering on the categories, which is rarely meaningful in practice; that is why we rely on one-hot encoding, which keeps the category variables independent.
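A tiny sketch with tf.one_hot (the index-to-category mapping is hypothetical):

import tensorflow as tf

# Indices 0 = 'Golden Retriever', 1 = 'Orange Cat', 2 = 'Parrot' (hypothetical mapping)
category_indices = tf.constant([0, 1, 2, 1])
one_hot = tf.one_hot(category_indices, depth=3)
# [[1,0,0], [0,1,0], [0,0,1], [0,1,0]] -- no ordering is implied between classes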


21.Rescale numerical inputs

A network is trained by updating its weights, and the optimizer is responsible for that. Optimizers usually work best when the input values lie roughly in [-1, 1]. Why is that? Picture a hilly landscape in which we look for the lowest point: the more hills there are around us, the longer it takes to find the minimum (and the easier it is to get stuck in a local one). But what if we could reshape the landscape so that we find a solution faster? That is what rescaling achieves: scaling the values to [-1, 1] makes the loss surface more spherical (rounder and more uniform), and models trained on data in this range converge faster. The reason is that the scale of a feature affects the magnitude of its gradient: larger features produce larger gradients, which lead to larger weight updates that need more steps to converge, and thus slower training. If you want to learn more about this, check the TensorFlow tutorial.
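A minimal sketch of standardizing the inputs, assuming a NumPy feature matrix x_train as a placeholder:

import numpy as np

def rescale(x_train):
    # Standardize each feature to zero mean and unit variance so all
    # features, and hence their gradients, live on a similar scale.
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0) + 1e-8
    return (x_train - mean) / std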


22.Use knowledge distillation

You have probably heard of BERT, a Transformer model with hundreds of millions of parameters that we may not be able to train on our own GPU. This is where knowledge distillation becomes useful: we train a second, smaller model to reproduce the output of the large one. The inputs are still the original dataset, but the labels now refer to the outputs of the large model, the so-called soft labels. The goal of this technique is to replicate the large model with the help of a small one. We will not go deeper into knowledge distillation and teacher-student models here; if you need further understanding, refer to the related tutorials.
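A hedged sketch of the core idea: the student is trained against the teacher's softened outputs (the temperature and the teacher/student models are placeholders):

import tensorflow as tf

def distillation_loss(teacher_logits, student_logits, temperature=3.0):
    # Soften both distributions with a temperature, then push the student
    # toward the teacher's soft labels via KL divergence.
    soft_targets = tf.nn.softmax(teacher_logits / temperature)
    soft_predictions = tf.nn.softmax(student_logits / temperature)
    kl = tf.keras.losses.KLDivergence()
    return kl(soft_targets, soft_predictions) * (temperature ** 2)

# Inside a training step (sketch):
# teacher_logits = teacher_model(x, training=False)
# student_logits = student_model(x, training=True)
# loss = distillation_loss(teacher_logits, student_logits)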


Reference
[1]. https://towardsdatascience.com/tips-and-tricks-for-neural-networks-63876e3aad1a?gi=2732cb2a6e99

[2]. https://nanonets.com/blog/data-augmentation-how-to-use-deep-learning-when-you-have-limited-data-part2/

[3]. https://www.tensorflow.org/tutorials/structured_data/imbalanced_data#optional_set_the_correct_initial_bias

[4]. https://docs.nvidia.com/deeplearning/performance/index.html

[5]. https://docs.nvidia.com/deeplearning/performance/dl-performance-fully-connected/index.html

[6]. https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html

[7]. https://openai.com/blog/deep-double-descent/

[8]. https://www.tensorflow.org/tutorials/distribute/multi_worker_with_ctl

[9]. https://www.tensorflow.org/tutorials/images/transfer_learning#rescale_pixel_values

[10]. https://medium.com/huggingface/distilbert-8cf3380435b5

Origin blog.csdn.net/weixin_43694096/article/details/125837438