Summary of data preprocessing and model training techniques in computer vision

The main problems in computer vision include image classification, object detection, and image segmentation. For image classification tasks, there are two ways to improve accuracy: one is modifying the model, and the other is applying various data processing and training techniques (tricks). These tricks are also very useful for tasks such as object detection and image segmentation, so they are worth summarizing. Based on a close reading of the relevant papers, this article summarizes the following tricks for image classification tasks:

  • Warmup

  • Linear scaling learning rate

  • Label-smoothing

  • Random image cropping and patching

  • Knowledge Distillation

  • Cutout

  • Random erasing

  • Cosine learning rate decay

  • Mixup training

  • AdaBound

  • AutoAugment

  • Other classic tricks

1. Warmup

The learning rate is one of the most important hyperparameters in neural network training, and many techniques revolve around it. Warmup is a learning rate warm-up method mentioned in the ResNet paper [1]. Since the weights of the model are randomly initialized at the beginning of training (setting them all to 0 is a pitfall, see [2] for the reason), choosing a large learning rate at this point may make the model unstable. Learning rate warmup therefore uses a small learning rate at the beginning of training, trains for some epochs or iterations, and then switches to the preset learning rate once the model has stabilized. In [1], when a 110-layer ResNet is trained on CIFAR-10, it is first trained with a learning rate of 0.01 until the training error falls below 80% (about 400 iterations), and then trained with a learning rate of 0.1.

The above method is constant warmup. In 2018, Facebook improved on it [3], because jumping directly from a small learning rate to a relatively large one can cause the training error to increase suddenly. The paper [3] proposes gradual warmup to solve this problem: start from the small initial learning rate and increase it a little at every iteration until it reaches the initially set larger learning rate.

from torch.optim.lr_scheduler import _LRScheduler
class GradualWarmupScheduler(_LRScheduler):
    """
    Args:
        optimizer (Optimizer): Wrapped optimizer.
        multiplier: target learning rate = base lr * multiplier
        total_epoch: target learning rate is reached at total_epoch, gradually
        after_scheduler: after total_epoch, switch to this scheduler (e.g. CosineAnnealingLR)
    """
    def __init__(self, optimizer, multiplier, total_epoch, after_scheduler=None):
        self.multiplier = multiplier
        if self.multiplier <= 1.:
            raise ValueError('multiplier should be greater than 1.')
        self.total_epoch = total_epoch
        self.after_scheduler = after_scheduler
        self.finished = False
        super().__init__(optimizer)
    def get_lr(self):
        if self.last_epoch > self.total_epoch:
            if self.after_scheduler:
                if not self.finished:
                    self.after_scheduler.base_lrs = [base_lr * self.multiplier for base_lr in self.base_lrs]
                    self.finished = True
                return self.after_scheduler.get_lr()
            return [base_lr * self.multiplier for base_lr in self.base_lrs]
        return [base_lr * ((self.multiplier - 1.) * self.last_epoch / self.total_epoch + 1.) for base_lr in self.base_lrs]
    def step(self, epoch=None):
        if self.finished and self.after_scheduler:
            return self.after_scheduler.step(epoch)
        else:
            return super(GradualWarmupScheduler, self).step(epoch)
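
A minimal usage sketch (model, train and validate are placeholders, and the multiplier, epoch counts and cosine schedule are only illustrative): warm up from the base learning rate 0.01 to 0.1 over the first 5 epochs, then hand over to cosine annealing.

import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
cosine = CosineAnnealingLR(optimizer, T_max=95)             # schedule used after warmup
scheduler = GradualWarmupScheduler(optimizer, multiplier=10,
                                   total_epoch=5, after_scheduler=cosine)
for epoch in range(100):
    train(...)        # placeholder training loop
    validate(...)     # placeholder validation
    scheduler.step()  # advances warmup first, then the cosine schedule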

2. Linear scaling learning rate

Linear scaling learning rate is a method proposed in the paper [3] for relatively large batch sizes.

In convex optimization problems, the convergence rate decreases as the batch size increases, and neural networks show similar empirical behavior. As the batch size grows, the same amount of data is processed faster and faster, but the number of epochs needed to reach the same accuracy increases. In other words, for the same number of epochs, a model trained with a large batch size reaches lower validation accuracy than one trained with a small batch size.

The gradual warmup mentioned above is one way to address this problem. Linear scaling of the learning rate is another effective method. During mini-batch SGD training, the gradient estimate is noisy because the data in each batch is randomly selected. Increasing the batch size does not change the expected gradient but reduces its variance. In other words, a large batch size reduces the noise in the gradient, so we can increase the learning rate to speed up convergence.

The specific rule is very simple. For example, in the original ResNet paper [1], the learning rate is 0.1 when the batch size is 256. When we change the batch size to a larger number b, the learning rate should be changed to 0.1 × b/256.
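
A one-line sketch of the rule (the batch size of 1024 below is only an example):

base_lr, base_batch_size = 0.1, 256            # ResNet baseline: lr 0.1 at batch size 256
batch_size = 1024                              # example larger batch size b
lr = base_lr * batch_size / base_batch_size    # 0.1 * 1024 / 256 = 0.4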

3. Label-smoothing

In classification problems, the last layer is usually a fully connected layer, and the targets are the one-hot encoding of the labels, i.e., the value of the true category is encoded as 1 and all others as 0. Combining this encoding with parameter updates that minimize the cross-entropy loss causes a problem: it encourages the model to produce very different output scores for different categories, or in other words, the model becomes overly confident in its own judgment. However, for a dataset labeled by multiple people, the labeling criteria may differ from person to person, and every annotator makes some mistakes. The model's overconfidence in the labels can therefore lead to overfitting.

Label-smoothing regularization (LSR) is one of the effective ways to deal with this problem. Its idea is to reduce our trust in the labels; for example, we can slightly lower the target value from 1 to 0.9 and slightly raise it from 0 to 0.1. Label smoothing was first proposed in Inception-v2 [4], which transforms the true probability into

q_i = 1 − ε + ε/K  if i = y,  and  q_i = ε/K  otherwise,

where ε is a small constant, K is the number of categories, y is the true label of the image, i indexes the i-th category, and q_i is the probability assigned to the i-th category. In short, LSR is a regularization method that adds noise to the label y, constraining the model and reducing the degree of overfitting.

import torch
import torch.nn as nn
class LSR(nn.Module):
    def __init__(self, e=0.1, reduction='mean'):
        super().__init__()
        self.log_softmax = nn.LogSoftmax(dim=1)
        self.e = e
        self.reduction = reduction
    def _one_hot(self, labels, classes, value=1):
        """
            Convert labels to one hot vectors
        Args:
            labels: torch tensor in format [label1, label2, label3, ...]
            classes: int, number of classes
            value: label value in one hot vector, default to 1
        Returns:
            return one hot format labels in shape [batchsize, classes]
        """
        one_hot = torch.zeros(labels.size(0), classes)
        #labels and value_added  size must match
        labels = labels.view(labels.size(0), -1)
        value_added = torch.Tensor(labels.size(0), 1).fill_(value)
        value_added = value_added.to(labels.device)
        one_hot = one_hot.to(labels.device)
        one_hot.scatter_add_(1, labels, value_added)
        return one_hot
    def _smooth_label(self, target, length, smooth_factor):
        """convert targets to one-hot format, and smooth them.
        Args:
            target: target in form with [label1, label2, ..., label_batchsize]
            length: length of one-hot format (number of classes)
            smooth_factor: smooth factor for label smoothing
        Returns:
            smoothed labels in one hot format
        """
        one_hot = self._one_hot(target, length, value=1 - smooth_factor)
        one_hot += smooth_factor / length
        return one_hot.to(target.device)
    def forward(self, x, target):
        """compute the label-smoothed cross-entropy loss
        Args:
            x: raw logits of shape [batchsize, classes]
            target: ground-truth class indices of shape [batchsize]
        """
        smoothed_target = self._smooth_label(target, x.size(1), self.e)
        x = self.log_softmax(x)
        loss = torch.sum(-x * smoothed_target, dim=1)
        if self.reduction == 'none':
            return loss
        elif self.reduction == 'sum':
            return torch.sum(loss)
        elif self.reduction == 'mean':
            return torch.mean(loss)
        else:
            raise ValueError('unrecognized reduction: {}'.format(self.reduction))
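
With the forward pass above, LSR can be used as a drop-in replacement for nn.CrossEntropyLoss; a short sketch with dummy tensors (the sizes are illustrative):

criterion = LSR(e=0.1)
logits = torch.randn(4, 10)              # batch of 4 samples, 10 classes (dummy data)
labels = torch.randint(0, 10, (4,))
loss = criterion(logits, labels)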

4. Random image cropping and patching

The Random Image Cropping And Patching (RICAP) [7] method randomly crops the central regions of four images, stitches them together into a single new image, and mixes the labels of the four images accordingly. RICAP achieves an error rate of 2.19% on CIFAR-10.

I_x and I_y are the width and height of the original image. w and h are called the boundary position; they determine the sizes of the four cropped patches. w and h are drawn from a beta distribution Beta(β, β), where β is a hyperparameter of RICAP. The final stitched image has the same size as the original image. A minimal sketch of the procedure is given below.
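
The sketch below is not the authors' official implementation; it assumes images is a (B, C, H, W) tensor and labels a (B,) tensor of class indices, and the function name ricap and the β value are illustrative. The four patch labels are mixed in the loss with weights proportional to patch area.

import numpy as np
import torch

def ricap(images, labels, beta=0.3):
    B, C, H, W = images.size()
    # boundary position (w, h), drawn via Beta(beta, beta)
    w = int(np.round(W * np.random.beta(beta, beta)))
    h = int(np.round(H * np.random.beta(beta, beta)))
    widths  = [w, W - w, w, W - w]
    heights = [h, h, H - h, H - h]
    patches, patch_labels, weights = [], [], []
    for k in range(4):
        index = torch.randperm(B)                         # shuffle the batch for this patch
        x = np.random.randint(0, W - widths[k] + 1)       # random crop position
        y = np.random.randint(0, H - heights[k] + 1)
        patches.append(images[index, :, y:y + heights[k], x:x + widths[k]])
        patch_labels.append(labels[index])
        weights.append(widths[k] * heights[k] / (W * H))  # label weight = area ratio
    top = torch.cat([patches[0], patches[1]], dim=3)      # stitch along the width
    bottom = torch.cat([patches[2], patches[3]], dim=3)
    patched = torch.cat([top, bottom], dim=2)             # stitch along the height
    return patched, patch_labels, weights

# in the training loop, the four labels are mixed by patch area:
# outputs = model(ricap_images)
# loss = sum(wk * criterion(outputs, yk) for wk, yk in zip(weights, patch_labels))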

5. Knowledge Distillation

 A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then average their predictions. But using an ensemble of all models to make predictions is cumbersome and may be too computationally intensive to deploy to a large number of users. The Knowledge Distillation [8] method is one of the effective methods to deal with this problem.

In knowledge distillation, we use a teacher model to help train the current model (the student model). The teacher model is a pre-trained model with higher accuracy, so the student model can improve its accuracy while keeping its own complexity unchanged. For example, ResNet-152 can be used as a teacher model to help train the student model ResNet-50. During training, we add a distillation loss to penalize the difference between the outputs of the student and teacher models.

Given an input, let p be the true probability distribution, and let z and r be the outputs of the last fully connected layer of the student model and the teacher model, respectively. Previously we would use the cross-entropy loss l(p, softmax(z)) to measure the difference between p and z. The distillation loss also uses cross-entropy, so the total loss function with knowledge distillation is

l(p, softmax(z)) + T^2 · l(softmax(r/T), softmax(z/T))

In the formula above, the first term is the original loss function, and the second term is the distillation loss added to penalize the difference between the outputs of the student and teacher models. Here T is a temperature hyperparameter that makes the softmax outputs smoother. Experiments show that using ResNet-152 as the teacher to train ResNet-50 improves the accuracy of the latter. A minimal sketch of this loss follows.
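
In the sketch below, the temperature value and tensor names are illustrative: student_logits plays the role of z and teacher_logits the role of r, and the teacher's output is detached since the teacher is not updated.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=5.0):
    # first term: the original cross-entropy loss l(p, softmax(z))
    hard_loss = F.cross_entropy(student_logits, labels)
    # second term: cross-entropy between the temperature-softened teacher and
    # student distributions, scaled by T^2 as in the formula above
    teacher_logits = teacher_logits.detach()
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    soft_loss = -(soft_targets * F.log_softmax(student_logits / T, dim=1)).sum(dim=1).mean()
    return hard_loss + T * T * soft_loss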

6. Cutout

Cutout [9] is a regularization method. The idea is to randomly mask out part of the image during training, which improves the robustness of the model. It is motivated by the object occlusion problem frequently encountered in computer vision tasks. Generating similarly occluded samples through cutout not only makes the model perform better when it encounters occlusion, but also encourages the model to take more of the surrounding context into account when making decisions.

import torch
import numpy as np
class Cutout(object):
    """Randomly mask out one or more patches from an image.
    Args:
        n_holes (int): Number of patches to cut out of each image.
        length (int): The length (in pixels) of each square patch.
    """
    def __init__(self, n_holes, length):
        self.n_holes = n_holes
        self.length = length
    def __call__(self, img):
        """
        Args:
            img (Tensor): Tensor image of size (C, H, W).
        Returns:
            Tensor: Image with n_holes of dimension length x length cut out of it.
        """
        h = img.size(1)
        w = img.size(2)
        mask = np.ones((h, w), np.float32)
        for n in range(self.n_holes):
            y = np.random.randint(h)
            x = np.random.randint(w)
            y1 = np.clip(y - self.length // 2, 0, h)
            y2 = np.clip(y + self.length // 2, 0, h)
            x1 = np.clip(x - self.length // 2, 0, w)
            x2 = np.clip(x + self.length // 2, 0, w)
            mask[y1: y2, x1: x2] = 0.
        mask = torch.from_numpy(mask)
        mask = mask.expand_as(img)
        img = img * mask
        return img
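
Since __call__ expects a (C, H, W) tensor, Cutout is typically placed after ToTensor in the transform pipeline; a short usage sketch (the hole count and patch length are illustrative):

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    Cutout(n_holes=1, length=16),   # cut one 16x16 hole out of each image
])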

7. Random erasing

Random erasing [6] is very similar to cutout. It is also a data augmentation method that simulates object occlusion. The difference is that cutout sets the pixels of a randomly selected rectangular region to 0, which is equivalent to cropping it out, while random erasing replaces the original pixels with random values or with the mean pixel value of the dataset. Moreover, the region removed by cutout has a fixed size, while the region replaced by random erasing has a random size and aspect ratio.

import math
import random
class RandomErasing(object):
    '''
    probability: The probability that the operation will be performed.
    sl: min erasing area
    sh: max erasing area
    r1: min aspect ratio
    mean: erasing value
    '''
    def __init__(self, probability = 0.5, sl = 0.02, sh = 0.4, r1 = 0.3, mean=[0.4914, 0.4822, 0.4465]):
        self.probability = probability
        self.mean = mean
        self.sl = sl
        self.sh = sh
        self.r1 = r1
    def __call__(self, img):
        if random.uniform(0, 1) > self.probability:
            return img
        for attempt in range(100):
            area = img.size()[1] * img.size()[2]
            target_area = random.uniform(self.sl, self.sh) * area
            aspect_ratio = random.uniform(self.r1, 1/self.r1)
            h = int(round(math.sqrt(target_area * aspect_ratio)))
            w = int(round(math.sqrt(target_area / aspect_ratio)))
            if w < img.size()[2] and h < img.size()[1]:
                x1 = random.randint(0, img.size()[1] - h)
                y1 = random.randint(0, img.size()[2] - w)
                if img.size()[0] == 3:
                    img[0, x1:x1+h, y1:y1+w] = self.mean[0]
                    img[1, x1:x1+h, y1:y1+w] = self.mean[1]
                    img[2, x1:x1+h, y1:y1+w] = self.mean[2]
                else:
                    img[0, x1:x1+h, y1:y1+w] = self.mean[0]
                return img
        return img

8. Cosine learning rate decay

During training after warmup, continuously decaying the learning rate is a good way to improve accuracy. Two common schedules are step decay and cosine decay: the former reduces the learning rate by a constant factor every few epochs as training progresses, while the latter lets the learning rate decrease smoothly along a cosine curve over the course of training.

For cosine decay, assume there are T batches in total (ignoring the warmup stage); then at the t-th batch, the learning rate \eta_t is

\eta_t = \frac{1}{2}\left(1 + \cos\frac{t\pi}{T}\right)\eta

where \eta is the initially set learning rate. This way of decreasing the learning rate is called cosine decay.

The paper [13] visualizes learning rate decay with warmup: in subfigure (a), which plots the learning rate against the epoch, cosine decay is visibly smoother than step decay; subfigure (b) plots accuracy against the epoch. The final accuracies of the two schedules do not differ much, but the training process with cosine decay is smoother.

There are more learning rate decay methods in PyTorch's torch.optim.lr_scheduler. As for which one is more effective, the answer may differ from problem to problem. For step decay, the usage is as follows:

# Assuming optimizer uses lr = 0.05 for all groups
# lr = 0.05     if epoch < 30
# lr = 0.005    if 30 <= epoch < 60
# lr = 0.0005   if 60 <= epoch < 90
from torch.optim.lr_scheduler import StepLR
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)
for epoch in range(100):
    train(...)
    validate(...)
    scheduler.step()
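
For cosine decay, the built-in CosineAnnealingLR can be used in the same way (reusing the optimizer assumed above; T_max here is simply the total number of epochs):

from torch.optim.lr_scheduler import CosineAnnealingLR
scheduler = CosineAnnealingLR(optimizer, T_max=100)
for epoch in range(100):
    train(...)
    validate(...)
    scheduler.step()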

9. Mixup training

Mixup [10] is a data augmentation method. Mixup training takes two images at a time and linearly combines them to obtain a new image, which is then used as a new training sample, as in the following formulas, where x denotes the image data and y the label:

\hat{x} = \lambda x_i + (1 - \lambda) x_j
\hat{y} = \lambda y_i + (1 - \lambda) y_j

where λ is randomly sampled from Beta(α, α) and lies in [0, 1]. During training, only (\hat{x}, \hat{y}) is used.

Mixup mainly encourages linear behavior between training samples and enhances the generalization ability of the network. However, mixup needs a longer training time to converge to its better result.

for (images, labels) in train_loader:
    # mixup_alpha, model, criterion, optimizer and the accuracy helper are assumed
    # to be defined elsewhere in the training script
    l = np.random.beta(mixup_alpha, mixup_alpha)          # lambda ~ Beta(alpha, alpha)
    index = torch.randperm(images.size(0))                # random pairing within the batch
    images_a, images_b = images, images[index]
    labels_a, labels_b = labels, labels[index]
    mixed_images = l * images_a + (1 - l) * images_b      # linear combination of two images
    outputs = model(mixed_images)
    # mix the two losses (and accuracies) with the same lambda
    loss = l * criterion(outputs, labels_a) + (1 - l) * criterion(outputs, labels_b)
    acc = l * accuracy(outputs, labels_a)[0] + (1 - l) * accuracy(outputs, labels_b)[0]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

10. AdaBound

AdaBound is proposed in the paper [5]. According to the authors, AdaBound makes training as fast as Adam and as good as SGD. In the experiments reported in the paper, AdaBound converges faster, the training process is smoother, and the final results are better.

 In addition, this method is not so sensitive to changes in hyperparameters compared to SGD, which means it is more robust. However, it is still necessary to adjust hyperparameters for different problems, but it may take less time.

Of course, AdaBound has not been universally tested, and it may only work well for certain problems.

How to use it: Install AdaBound

pip install adabound

Use AdaBound (in the same way as other PyTorch optimizers):

import adabound
optimizer = adabound.AdaBound(model.parameters(), lr=1e-3, final_lr=0.1)

11. AutoAugment

Data augmentation plays a very important role in image classification, but there are many augmentation methods, and applying all of them indiscriminately is not necessarily optimal. So how do you choose the best data augmentation methods? AutoAugment [11] is a method for searching for data augmentation strategies suited to the problem at hand. It defines a search space of augmentation policies and uses a search algorithm to select a policy suited to a specific dataset. Furthermore, policies learned on one dataset transfer well to other, similar datasets.
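
Recent versions of torchvision (0.11+) ship the policies learned in [11]; a minimal sketch, choosing the CIFAR-10 policy only as an example:

from torchvision import transforms
from torchvision.transforms import AutoAugment, AutoAugmentPolicy

# apply the learned CIFAR-10 policy on PIL images, then convert to tensors
train_transform = transforms.Compose([
    AutoAugment(policy=AutoAugmentPolicy.CIFAR10),
    transforms.ToTensor(),
])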

12. Other classic tricks

Commonly used regularization and data augmentation methods include the following (a short combined usage sketch follows the lists below):

  • Dropout

  • L1/L2 regularization

  • Batch Normalization

  • Early stopping

  • Random cropping

  • Mirroring

  • Rotation

  • Color shifting

  • PCA color augmentation

Other:

  • Xavier initialization [12]
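
As a small sketch of how several of the augmentation tricks above are commonly combined in PyTorch, together with Xavier initialization [12] (the crop size and normalization statistics are approximate CIFAR-10 values, used only as an example):

import torch.nn as nn
from torchvision import transforms

# random cropping, mirroring, rotation and color shifting in one pipeline
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.247, 0.243, 0.262)),
])

# Xavier initialization for convolutional and linear layers
def init_weights(m):
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)
# model.apply(init_weights)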

References

  • [1] Deep Residual Learning for Image Recognition (https://arxiv.org/pdf/1512.03385.pdf)

  • [2] http://cs231n.github.io/neural-networks-2/

  • [3] Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (https://arxiv.org/pdf/1706.02677v2.pdf)

  • [4] Rethinking the Inception Architecture for Computer Vision (https://arxiv.org/pdf/1512.00567v3.pdf)

  • [5] Adaptive Gradient Methods with Dynamic Bound of Learning Rate (https://www.luolc.com/publications/adabound/)

  • [6] Random Erasing Data Augmentation (https://arxiv.org/pdf/1708.04896v2.pdf)

  • [7] RICAP (https://arxiv.org/pdf/1811.09030.pdf)

  • [8] Distilling the Knowledge in a Neural Network (https://arxiv.org/pdf/1503.02531.pdf)

  • [9] Improved Regularization of Convolutional Neural Networks with Cutout (https://arxiv.org/pdf/1708.04552.pdf)

  • [10] mixup: Beyond Empirical Risk Minimization (https://arxiv.org/pdf/1710.09412.pdf)

  • [11] AutoAugment: Learning Augmentation Policies from Data (https://arxiv.org/pdf/1805.09501.pdf)

  • [12] Understanding the difficulty of training deep feedforward neural networks (http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)

  • [13] Bag of Tricks for Image Classification with Convolutional Neural Networks (https://arxiv.org/pdf/1812.01187.pdf)
