Common methods and techniques for training CV models

I recently took part in a CV competition and noticed that some participants shared the tricks they often use when training image recognition models, so I recorded and organized them for future reference. There are quite a few; they won't be useful every time, but keep them in mind, because they may come in handy on some task in the future!

Mainly from the following nine aspects:

  1. Image enhancement

  2. Better models

  3. Learning rate and scheduler

  4. Optimizer

  5. Regularization

  6. Label smoothing

  7. Knowledge distillation

  8. Pseudo-labeling

  9. Error analysis

1. Image Enhancement

Many augmentation methods are listed below, some of which you may never have seen before, and not every one of them is beneficial. You need to choose appropriate augmentations for your task and verify them experimentally.

color enhancement

Color Skew:

This enhancement randomly adjusts the hue, saturation, and brightness of the image by multiplying each channel by a randomly chosen factor. The factors are drawn from the range [0.6, 1.4] so that the resulting image is not too distorted.

import cv2
import numpy as np

def color_skew(image):
    # Expects an HSV image; each channel is scaled by a factor drawn from [0.6, 1.4]
    h, s, v = cv2.split(image)
    h = h * np.random.uniform(low=0.6, high=1.4)
    s = s * np.random.uniform(low=0.6, high=1.4)
    v = v * np.random.uniform(low=0.6, high=1.4)
    return cv2.merge((h, s, v))

RGB Norm:

This augmentation normalizes the RGB channels of an image by subtracting each channel's mean and dividing by its standard deviation. This standardizes the pixel values and can improve the model's performance.

def rgb_norm(image):
    r, g, b = cv2.split(image)
    r = (r - np.mean(r)) / np.std(r)
    g = (g - np.mean(g)) / np.std(g)
    b = (b - np.mean(b)) / np.std(b)
    return cv2.merge((r, g, b))

Black and White:

This enhancement converts the image to black and white by converting it to a grayscale color space.

def black_and_white(image):
    return cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)

Ben Graham: Greyscale + Gaussian Blur:

This enhancement converts the image to grayscale and applies a Gaussian blur to smooth any noise or detail in the image.

def ben_graham(image):
    image = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
    image = cv2.GaussianBlur(image, (5, 5), 0)
    return image

Hue, Saturation, Brightness:

This augmentation converts the image to the HLS color space, which separates the image into hue, lightness, and saturation channels.

def hsb(image):
    return cv2.cvtColor(image, cv2.COLOR_RGB2HLS)

LUV Color Space:

This enhancement converts the image to the LUV color space, which is designed to be perceptually consistent and allow for more accurate color comparisons.

def luv(image):
    return cv2.cvtColor(image, cv2.COLOR_RGB2LUV)

Alpha Channel:

This enhancement adds an alpha channel to the image, which can be used to represent transparency.

def alpha_channel(image):
    return cv2.cvtColor(image, cv2.COLOR_RGB2RGBA)

XYZ Color Space:

This enhancement converts the image to the XYZ color space, a device-independent color space that allows for more accurate color representation.

def xyz(image):
    return cv2.cvtColor(image, cv2.COLOR_RGB2XYZ)

Luma Chroma:

This enhancement converts the image to the YCrCb color space, which separates the image into luma (brightness) and chroma (color) channels.

def luma_chroma(image):
    return cv2.cvtColor(image, cv2.COLOR_RGB2YCrCb)

CIE Lab:

This enhancement converts the image to the CIE Lab color space, which is designed to be perceptually uniform, allowing for more accurate color comparisons.

def cie_lab(image):
    return cv2.cvtColor(image, cv2.COLOR_RGB2Lab)

YUV Color Space:

This enhancement converts the image to the YUV color space, which separates the image into luma (brightness) and chroma (color) channels.

def yuv(image):
    return cv2.cvtColor(image, cv2.COLOR_RGB2YUV)

Center Crop:

This transform crops out the central region of the image at the given size.

transforms.CenterCrop((100, 100))

Flippings:

This augmentation flips the image horizontally with a given probability. For example, with a probability of 0.5, there is a 50% chance that the image will be flipped horizontally.

def flippings(image):
    if np.random.uniform() < 0.5:
        image = cv2.flip(image, 1)
    return image

Random Crop:

This augmentation randomly crops a rectangular region from the image.

transforms.RandomCrop((100, 100))

Random Resized Crop:

This augmentation crops a rectangular region with an aspect ratio sampled from [3/4, 4/3] and an area between 8% and 100% of the original image, then resizes the crop to an img_size x img_size square. The crop is re-sampled for every image in every batch.

transforms.RandomResizedCrop((100, 100))

Color Jitter:

This enhancement randomly adjusts the brightness, contrast, saturation, and hue of the image.

transforms.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.5)

Random Affine:

This augmentation randomly applies an affine transformation to the image, including rotation, scaling, and shearing.

transforms.RandomAffine(degrees=45,translate=(0.1,0.1),scale=(0.5,2.0),shear=45)

Random Horizontal Flip:

Randomly flips the image horizontally with probability 0.5.

transforms.RandomHorizontalFlip()

Random Vertical Flip:

This augmentation randomly flips the image vertically with a probability of 0.5.

transforms.RandomVerticalFlip()

Random Perspective:

This augmentation randomly applies a perspective transformation to the image.

transforms.RandomPerspective()

Random Rotation:

This augmentation randomly rotates the image by a given range of degrees.

transforms.RandomRotation(degrees=45)

Random Invert:

This enhancement randomly inverts the colors of the image.

transforms.RandomInvert()

Random Posterize:

This augmentation randomly reduces the number of bits used to represent each pixel value, creating a color separation effect.

transforms.RandomPosterize(bits=4)

Random Solarize:

This enhancement randomly applies an exposure effect to the image, where pixels above a certain intensity threshold are inverted.

transforms.RandomSolarize(threshold=128)

Random Autocontrast:

This enhancement randomly adjusts the contrast of an image by stretching intensity values across the full available range.

transforms.RandomAutocontrast()

Random Equalize:

This augmentation randomly equalizes the histogram of the image, thus increasing the contrast.

transforms.RandomEqualize()
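
In practice, several of the transforms above are usually chained together with transforms.Compose. Below is a minimal sketch of a training pipeline; the specific transforms and parameters are only an illustrative assumption, not a recommendation.

import torchvision.transforms as transforms

# A possible training pipeline; pick transforms and parameters to suit your task
train_transforms = transforms.Compose([
    transforms.RandomResizedCrop((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])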

more advanced enhancements

In addition to the above basic enhancement methods, there are some more advanced enhancement methods.

Auto Augment:

Auto Augment is an augmentation method that uses reinforcement learning to search for the best augmentation strategy for a given dataset. It has been shown to improve the performance of image classification models.

from autoaugment import AutoAugment

auto_augment = AutoAugment()
image = auto_augment(image)

Fast Autoaugment:

Fast AutoAugment is a faster alternative to Auto Augment. Instead of the expensive reinforcement-learning search, it finds good augmentation policies with a much more efficient search strategy.

from fast_autoaugment import FastAutoAugment

fast_auto_augment = FastAutoAugment()
image = fast_auto_augment(image)

Augmix:

Augmix is an augmentation method that mixes several differently augmented versions of the same image into a single, more diverse and realistic image. It has been shown to improve the robustness and generalization of image classification models.

from augmix import AugMix

aug_mix = AugMix()
image = aug_mix(image)

Mixup/Cutout:

Mixup is an augmentation method that combines two images (and their labels) by linear interpolation. Cutout is an augmentation method that randomly removes rectangular regions from an image. Both have been shown to improve the robustness and generalization of image classification models.

"You take a picture of a cat and add some "transparent dog" on top of it. The amount of transparency is a hyperparam."

x = lambda * x1 + (1 - lambda) * x2

y = lambda * y1 + (1 - lambda) * y2
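
A minimal sketch of mixup and a simple cutout variant in PyTorch, assuming a batch of images x with shape (N, C, H, W) and one-hot (or soft) targets y; alpha and size are hyperparameters chosen here only for illustration.

import torch

def mixup_batch(x, y, alpha=0.4):
    # Mix the batch with a shuffled copy of itself; lambda is drawn from a Beta distribution
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mixed = lam * x + (1 - lam) * x[perm]
    y_mixed = lam * y + (1 - lam) * y[perm]
    return x_mixed, y_mixed

def cutout(x, size=16):
    # Zero out a random square patch in every image of the batch
    _, _, h, w = x.shape
    cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
    y1, y2 = max(0, cy - size // 2), min(h, cy + size // 2)
    x1, x2 = max(0, cx - size // 2), min(w, cx + size // 2)
    x = x.clone()
    x[:, :, y1:y2, x1:x2] = 0.0
    return x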

Test Time Augmentations(TTA)

Image augmentation is useful not only during training but also at test time, where people call it TTA: create several augmented versions of each test image, run prediction on all of them, and average the results. This increases the robustness of the predictions, but it also increases inference time. For TTA, stick to simple augmentations such as rescaling, cropping different regions, and flipping; overly aggressive augmentations are not suitable. A minimal sketch follows.
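
A minimal TTA sketch in PyTorch, assuming a trained classification model and a batch of test images; here only horizontal and vertical flips are averaged.

import torch

@torch.no_grad()
def tta_predict(model, images):
    # Average softmax predictions over the original image and two simple flips
    model.eval()
    views = [
        images,
        torch.flip(images, dims=[-1]),  # horizontal flip
        torch.flip(images, dims=[-2]),  # vertical flip
    ]
    preds = [torch.softmax(model(view), dim=1) for view in views]
    return torch.stack(preds).mean(dim=0)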

Personally, I feel that this approach should only be used in competitions~

2. Better Models

Although the following models are a few years old by now, their outstanding performance keeps them near the top of competition leaderboards. Better models have appeared in recent years, but many of them are not open source or are too large to use, so they are not yet widely adopted.

tf_efficientnet v1/v2 family
seresnext

And some ideas and models to try.

Swin Transformer
BEiT Transformer
ViT Transformers

Add more hidden layers behind the backbone

Adding more layers on top of the backbone can be beneficial because they can learn higher-level, task-specific features, but it can also complicate the fine-tuning of a large pretrained model and even hurt model performance.

Unfreeze layer by layer

A simple trick that can give you a small improvement is to unfreeze the layers of the pretrained backbone as training progresses: first add the new layers and freeze the backbone, then gradually unfreeze the backbone's parameters so it participates in training (a minimal sketch follows the PyTorch snippet below).

## Weight freezing
for param in model.parameters():
  param.requires_grad = False 

## Weight unfreezing
for param in model.parameters():
  param.requires_grad = True
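
A minimal layer-by-layer unfreezing sketch in PyTorch, assuming the model exposes its pretrained part as model.backbone and the newly added layers as model.head (hypothetical attribute names).

def set_backbone_trainable(model, epoch, unfreeze_epoch=3):
    # Keep the backbone frozen for the first few epochs, then let it train;
    # the newly added head is always trainable
    for param in model.backbone.parameters():
        param.requires_grad = epoch >= unfreeze_epoch
    for param in model.head.parameters():
        param.requires_grad = True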

Weight Freezing and Unfreezing in TensorFlow

## Weight freezing
layer.trainable = False
## Weight unfreezing
layer.trainable = True

3. Learning rate and scheduler

The learning rate and learning rate scheduler affect the training performance of the model. Changing the learning rate can have a big impact on performance and training convergence.

Learning rate schedulers

Recently, the One Cycle Cosine schedule has been shown to give better results on multiple tasks, and you can use it like this:

One Cycle Cosine scheduling in PyTorch

from torch.optim.lr_scheduler import CosineAnnealingLR

optimizer = torch.optim.Adam(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
num_training_steps = num_train_optimization_steps / args.gradient_accumulation_steps
scheduler = CosineAnnealingLR(optimizer, T_max=num_train_optimization_steps)

# Update the scheduler: step it exactly once per optimizer step, no more and no less,
# i.e. call scheduler.step() right after the gradients have been applied.
scheduler.step()

TensorFlow

## One Cycle Cosine scheduling in TensorFlow
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(learning_rate, decay_steps=num_training_steps)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)

Tips for Using the Learning Rate Scheduler

  • Learning rate tuning with "Triangular" or "One Cycle" schedules can provide subtle but significant improvements; these smarter schedules can also work around some batch size issues (see the OneCycleLR sketch after this list).

  • Spending time finding the best learning rate schedule for your task and model is worthwhile, because it strongly affects how the model converges.

  • Learning rate schedules can also be used to train models with smaller batch sizes or with multiple learning rates.

  • The learning rate matters a lot, so start with a low learning rate and check whether increasing it helps or hurts the model's performance.

  • Increasing the learning rate, using multiple learning rates, raising the batch size, adding gradient accumulation, or changing the schedule late in training can sometimes help the model converge better. This is an advanced technique: it can hurt performance if the values are too large, so remember to test it.

  • Loss scaling helps reduce loss variance and improve gradient flow when using gradient accumulation, multiple learning rates, or large batch sizes. If you are trying to fix a problem by increasing the batch size, also try increasing the learning rate, since that sometimes yields better performance.
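
As a concrete example of the "One Cycle" idea, torch.optim.lr_scheduler.OneCycleLR can be used. A minimal sketch, assuming model, train_loader, criterion, and num_epochs are already defined (all hypothetical names here):

import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-3,                        # peak learning rate of the cycle
    steps_per_epoch=len(train_loader),
    epochs=num_epochs,
)

for epoch in range(num_epochs):
    for images, targets in train_loader:
        loss = criterion(model(images), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()  # step once per optimizer step, as noted above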

4. Optimizer

A lot of people are using Adam or AdamW these days. If you want to get the best performance out of the Adam optimizer, there are a few things you need to know:

  • Finding the optimal weight decay value can be cumbersome and relies on a lot of experimentation (and luck).

  • Another important pair of hyperparameters is Adam's beta1 and beta2; the best values depend on your task and data. Many novel tasks benefit from a lower beta1 and a higher beta2, while well-established tasks often prefer the opposite. Again: experimentation will be your best friend.

  • In the world of the Adam optimizer, the first rule is not to underestimate the importance of the optimizer's epsilon value. The same principles of finding optimal weight decay hyperparameters apply here.

  • Don't overuse gradient norm clipping: it can help when your gradients explode, but it can also prevent convergence on some tasks.

  • Gradient accumulation can still provide some subtle benefits. I usually accumulate gradients for about 2 steps, but if your GPU is not running out of memory you can push it up to 8 steps. Gradient accumulation is also useful when using mixed precision (a minimal sketch follows this list).
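
A minimal gradient accumulation sketch, assuming model, train_loader, criterion, and optimizer are already defined (hypothetical names); here gradients are accumulated over 2 batches.

accum_steps = 2  # number of batches to accumulate before each optimizer step

optimizer.zero_grad()
for step, (images, targets) in enumerate(train_loader):
    # Scale the loss so the accumulated gradient matches one large batch
    loss = criterion(model(images), targets) / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()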

Also, if you spend enough time tuning SGD's momentum, you may get better results, but this again requires a lot of tuning.

There are several other notable optimizers:

  • AdamW: An extension of Adam that decouples weight decay from the gradient update and applies it directly to the weights, which usually generalizes better than Adam's built-in L2 regularization.

  • Adafactor: Designed for low memory usage and scalability; it is often used for training very large models.

  • Novograd: Basically another Adam-like optimizer, but with some nice properties. It is one of the optimizers that has been used to train BERT-large models.

  • Ranger: An interesting optimizer that has achieved good results in competitive solutions, but it is not very well known or widely supported.

  • Lamb: A layer-wise adaptive optimizer designed for large-batch training, best known for speeding up the pretraining of BERT.

  • Lookahead: A popular optimizer that you can use on top of other optimizers and it will give you some performance gains.

5. Regularization

Use dropout! Adding dropout between layers usually gives higher training stability and more reliable results; use it in the hidden layers of your head. Dropout can also improve performance slightly; try out different layer dropout settings for your particular tasks and models. A minimal sketch follows.
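
A minimal sketch of a classification head with dropout between its hidden layers; feat_dim and num_classes are placeholders for your backbone's feature size and your task.

import torch.nn as nn

# feat_dim / num_classes are placeholders, not values from this article
head = nn.Sequential(
    nn.Linear(feat_dim, 512),
    nn.ReLU(),
    nn.Dropout(p=0.3),  # dropout between the hidden layers
    nn.Linear(512, num_classes),
)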

Regularization: when your neural network is overfitting, regularization can greatly improve performance; for classical machine learning models, L1 or L2 regularization works fine.

Always test ideas with experiments: experiment, experiment, experiment, and try different models.

Multi Validations: you can make your model more robust to overfitting by validating on multiple splits. However, this comes at the cost of computation time.

6. Label smoothing

Paper link:

When Does Label Smoothing Help?:

https://arxiv.org/pdf/1906.02629.pdf

The core formula can be summarized in one line:

y_ls = (1 - epsilon) * y + epsilon / K

where y is the one-hot target, K is the number of classes, and epsilon is the smoothing factor. Label smoothing usually works well and shows up in many competition solutions. Taking the binary classification task as an example, sample code for label smoothing is given below, which can be used directly.

Tensorflow:

loss = tf.keras.losses.BinaryCrossentropy(label_smoothing=label_smoothing)

Pytorch:

import torch
import torch.nn.functional as F
from torch.nn.modules.loss import _WeightedLoss

class SmoothBCEwLogits(_WeightedLoss):
    def __init__(self, weight=None, reduction='mean', smoothing=0.0, pos_weight=None):
        super().__init__(weight=weight, reduction=reduction)
        self.smoothing = smoothing
        self.weight = weight
        self.reduction = reduction
        self.pos_weight = pos_weight

    @staticmethod
    def _smooth(targets, smoothing=0.0):
        # Move targets away from 0/1 towards 0.5 by the smoothing factor
        assert 0 <= smoothing < 1
        with torch.no_grad():
            targets = targets * (1.0 - smoothing) + 0.5 * smoothing
        return targets

    def forward(self, inputs, targets):
        targets = SmoothBCEwLogits._smooth(targets, self.smoothing)
        # Compute the per-element loss, then apply the requested reduction
        loss = F.binary_cross_entropy_with_logits(
            inputs, targets, self.weight, pos_weight=self.pos_weight, reduction='none')
        if self.reduction == 'sum':
            loss = loss.sum()
        elif self.reduction == 'mean':
            loss = loss.mean()
        return loss

7. Knowledge Distillation

Use a large teacher network to guide the learning of a small network.

Steps:

  • Train the teacher model: train a large model on your data.

  • Compute soft labels: use the trained teacher to compute soft labels, i.e. the teacher's softmax outputs "softened" with a temperature.

  • Train the student model: train a smaller student that uses the teacher's soft labels as an additional loss term, and balance the hard-label and soft-label losses by interpolation (see the sketch after this list).
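
A minimal distillation loss sketch in PyTorch, assuming student_logits, teacher_logits, and hard targets for one batch; T is the softening temperature and alpha balances the soft and hard losses (both hyperparameters chosen here only for illustration).

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.5):
    # Hard-label loss against the ground truth
    hard_loss = F.cross_entropy(student_logits, targets)
    # Soft-label loss against the temperature-softened teacher outputs
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # the T^2 factor keeps the gradient scale comparable
    # Interpolate between the two losses
    return alpha * soft_loss + (1 - alpha) * hard_loss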

8. Pseudo-labeling

Use the model to label unlabeled data (such as test data), and then use the new labeled data to retrain the model.

Steps:

  • Train the teacher model: train a model on the labeled data you have.

  • Compute pseudo-labels: use the trained model to compute soft labels for the unlabeled data.

  • Keep only the targets the model is "sure" about: use only the highest-confidence predictions as pseudo-labels to avoid introducing label noise (if you skip this step, pseudo-labeling may not work).

  • Train the student model: train a student model on the combination of the original labeled data and the new pseudo-labeled data (a minimal sketch of the labeling step follows this list).
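
A minimal pseudo-labeling sketch in PyTorch, assuming a trained teacher model, a loader over unlabeled images, and a confidence threshold (all names hypothetical).

import torch

@torch.no_grad()
def make_pseudo_labels(teacher, unlabeled_loader, threshold=0.9):
    # Keep only the predictions the teacher is confident about
    teacher.eval()
    kept_images, kept_labels = [], []
    for images in unlabeled_loader:
        probs = torch.softmax(teacher(images), dim=1)
        conf, labels = probs.max(dim=1)
        keep = conf > threshold
        kept_images.append(images[keep])
        kept_labels.append(labels[keep])
    return torch.cat(kept_images), torch.cat(kept_labels)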

9. Error analysis

During training, many people only tune parameters and never analyze the errors; in industry you will often hear people talk about analyzing bad cases. It is just as important, and can sometimes even give you extra ideas. One practice that can save you a lot of time is using your model to find harder or corrupted data samples. Images can be "harder" for your model for many reasons: small target objects, unusual colors, truncated targets, invalid annotations, and so on. Trying to figure out why can really help.

Mistakes are sometimes good news!

These are the samples that separate the top of the leaderboard from the rest of the field. If you're having trouble explaining what's going on with your model, look closely at the validation samples it struggles with.

Finding Your Model's Errors!

The easiest way to find the error is to sort the validation samples by the model's confidence score and see which samples have the lowest prediction confidence.

# Collect the validation samples the model got wrong (binary classification, threshold 0.5)
mistakes_idx = [img_idx for img_idx in range(len(train)) if int(pred[img_idx] > 0.5) != target[img_idx]]
mistakes_preds = pred[mistakes_idx]
# Sort the mistakes by predicted probability and take the 20 lowest-confidence ones
sorted_idx = np.argsort(mistakes_preds)[:20]
# Show the images at sorted_idx here...

Summary

That is everything I have collected. These tricks may not be useful every time, but please keep them in mind; they may play a role in some task in the future!

Origin blog.csdn.net/AbnerAI/article/details/129344965