Practical Tutorial | Interpreting the Championship Solutions of the ECCV 2022 Out-of-Distribution (OOD-CV) Challenge

Welcome to the official WeChat account of "CVHub"!

Introduction

As we all know, ever since the AlexNet era, deep learning has been churning out novel network models on the back of growing GPU compute and large-scale data. These models share one common trait: SOTA, hence the famous saying in the CV community, "Idea is cheap, show me the SOTA!" These days, if your paper is not SOTA, you are almost too embarrassed to submit it, which is as amusing as it is frustrating. As an underdog who has been lucky enough to submit to a few top conferences, I know how crucial the word SOTA is to whether a paper gets accepted, at least for ordinary people like me. If you skip the comparison, reviewer xxx will most likely greet you kindly with: "Why didn't you compare with method xxx? After all, it is the SOTA model in field xxx!", as if announcing to the world that they have seen far more SOTA models than you ever will.

But is a SOTA model really state of the art in practical applications? Anyone who has spent time in industry knows the answer all too well: the real-world performance of most so-called SOTA models is, frankly, rubbish. Of course, in academia it is often enough to design a network and method for one specific public dataset, pile on a bunch of tricks, tune the details, overfit the benchmark, and call it a day.

Back to today's topic. Deep learning models are usually developed and tested under the implicit closed-world assumption that training and test data are drawn from the same distribution, i.e., they are independent and identically distributed (i.i.d.). Ignoring out-of-distribution (OOD) images leads to poor performance under unseen or adverse viewing conditions, which are especially common in real-world scenarios. In other words, once the model you painstakingly tuned is deployed on a real-world task, its performance will very likely drop sharply, because the i.i.d. assumption simply no longer holds.

Take video surveillance as an example. Deployment in a city is usually rolled out district by district: we collect footage from all the bullet cameras in one district, then train, tune, and evaluate a model on it. When this version goes online, the results usually look genuinely impressive and the leadership is delighted. At this point your mood is probably something like this:

But life rarely cooperates. As Forrest Gump put it, "Life is like a box of chocolates, you never know what you're gonna get." Before long, the very same model starts running into a string of baffling problems:

  • Seriously? How can it misidentify this?

  • And how on earth is this not being detected?

At this point, you, sitting in front of the screen, probably look like this:

and your boss may look like this after listening to your explanation:

In the end, you have turned into the very person you used to despise:

This painful experience tells us how important it is to solve the OOD problem!

Out of Distribution

Out-of-distribution (OOD) generalization has become a central challenge for the safe and reliable deployment of machine learning models in the open world. The problem is by no means limited to security; industrial inspection, autonomous driving, and many other fields face the same question: can our model hold up on data from a new scene? A natural and simple remedy is to collect data from the new scene and retrain the model on it, but that is obviously not today's topic. Instead, this article introduces a challenge held at ECCV 2022 that targets typical computer vision tasks (multi-class classification, object detection, etc.) on OOD images, i.e., images that follow a different distribution from the training images. Let us first take a quick look at the typical OOD scenarios:

As illustrated, the six types of OOD cases are:

  • Shape: the shape and size of the target change significantly;
  • 3D Pose: the pose and orientation of the target change significantly;
  • Texture: the overall texture of the target changes noticeably;
  • Context: the context of the target changes drastically, e.g., a train driving through an undersea tunnel;
  • Weather: different weather and seasons affect the target's appearance, e.g., spring, summer, autumn, and winter, or rain, snow, dust, and haze;
  • Occlusion: varying degrees of occlusion also alter the semantics of the target to some extent.

With that context in place, the following sections walk through the championship solutions of the image classification and object detection tracks of this competition.

Image Classification Track

Paper: https://arxiv.org/pdf/2301.04795.pdf

Code: https://github.com/hikvision-research/OOD-CV (404 Not Found)

Track: https://codalab.lisn.upsaclay.fr/competitions/6781#learn_the_details

The OOD-CV challenge addresses out-of-distribution generalization. The team's core solution can be summarized in one sentence:

Noisy Label Learning Is A Strong Test-Time Domain Adaptation Optimizer

The overall pipeline is a two-stage structure, namely:

  1. A pre-training stage for domain generalization
  2. A test-time training stage for domain adaptation

Note that only the labeled source data is used in the pre-training stage, while only the unlabeled target data is used in the test-time training stage. Specifically, in the pre-training stage, the authors propose a simple yet effective Mask-Level Copy-Paste data augmentation strategy to strengthen out-of-distribution generalization against the six major challenges described above: shape, pose, context, texture, occlusion, and weather. In the test-time training stage, the pre-trained model assigns labels to the unlabeled target data, which are treated as noisy labels, and a Label-Periodically-Updated DivideMix method is proposed to learn from them. After further integrating TTA and model-ensemble strategies, the Hikvision team's solution ranked first on the image classification leaderboard of the OOD-CV challenge.

Motivation

Following existing works such as SSNLL for image classification and SFOD for object detection, this solution builds on the idea that noisy-label learning is a strong optimizer for test-time domain adaptation.

First, pre-training a strong baseline model that generalizes well to out-of-distribution data is a necessary prerequisite for test-time domain adaptation. An intuitive approach is to stack multiple strong data augmentation strategies on the source data to resist various domain shifts. To this end, in addition to traditional data augmentation, the authors develop a novel Mask-Level Copy-Paste data augmentation method. Specifically, given only image-level labels, they employ MCTformer, a state-of-the-art weakly supervised semantic segmentation method, to segment foreground objects on the ImageNet-1K and ROBIN training datasets. With these masks, three different schemes can be applied (a rough code sketch follows the list):

  1. For shape, pose, and texture shifts, apply affine transformations and color jittering to the foreground objects;
  2. For context shifts, paste task-relevant foreground objects onto task-irrelevant images;
  3. For occlusion shifts, paste task-irrelevant foreground objects onto task-relevant images.
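To make the idea concrete, here is a minimal NumPy sketch of mask-level copy-paste, written for this article rather than taken from the authors' code: the foreground mask is assumed to come from a WSSS model such as MCTformer, the affine step is reduced to flip-and-rescale for brevity, and all function names are illustrative.

```python
import numpy as np

def color_jitter(img, strength=0.2, rng=None):
    """Randomly scale each RGB channel to mimic texture/color shifts."""
    rng = rng or np.random.default_rng()
    factors = 1.0 + rng.uniform(-strength, strength, size=3)
    return np.clip(img.astype(np.float32) * factors, 0, 255).astype(np.uint8)

def mask_copy_paste(fg_img, fg_mask, bg_img, scale_range=(0.5, 1.5), rng=None):
    """Paste a segmented foreground object onto a background image.

    fg_img : HxWx3 uint8 image containing the object
    fg_mask: HxW   bool  foreground mask from a WSSS model (e.g. MCTformer)
    bg_img : HxWx3 uint8 image to paste onto (task-relevant or task-irrelevant)
    """
    rng = rng or np.random.default_rng()
    out = bg_img.copy()
    h, w = bg_img.shape[:2]

    # 1) jitter the object's appearance and randomly flip it
    obj = color_jitter(fg_img, rng=rng)
    if rng.random() < 0.5:
        obj, m = obj[:, ::-1], fg_mask[:, ::-1]
    else:
        m = fg_mask

    # 2) crop the object and rescale it with nearest-neighbour indexing
    ys, xs = np.where(m)
    if len(ys) == 0:
        return out
    crop = obj[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    cmask = m[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    s = rng.uniform(*scale_range)
    ch = max(1, int(crop.shape[0] * s))
    cw = max(1, int(crop.shape[1] * s))
    ri = (np.arange(ch) * crop.shape[0] / ch).astype(int)
    ci = (np.arange(cw) * crop.shape[1] / cw).astype(int)
    crop, cmask = crop[ri][:, ci], cmask[ri][:, ci]

    # 3) paste at a random location, clipped to the background size
    ch, cw = min(ch, h), min(cw, w)
    crop, cmask = crop[:ch, :cw], cmask[:ch, :cw]
    y0 = rng.integers(0, h - ch + 1)
    x0 = rng.integers(0, w - cw + 1)
    region = out[y0:y0 + ch, x0:x0 + cw]
    region[cmask] = crop[cmask]
    return out
```

Depending on whether fg_img and bg_img are task-relevant or task-irrelevant, the same routine covers all three schemes listed above.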

Second, once the model is pre-trained, it can be used to assign new pseudo labels to the target dataset, and these can be regarded as noisy labels. In that setting, existing noisy-label learning methods such as DivideMix are a natural fit for test-time adaptation. In this challenge, the authors therefore propose a Label-Periodically-Updated DivideMix, which corrects noisy labels in time while avoiding overfitting to them.

Finally, after integrating test-time augmentation (TTA) and model ensembling over various hyperparameter settings, the solution ranked first on the image classification leaderboard of the OOD-CV challenge.
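As a small illustration of what TTA plus model ensembling means in practice, here is a hedged PyTorch sketch (not the team's code): softmax outputs are averaged over several trained models and several augmented views of the same image; the transform list is just an example.

```python
import torch

@torch.no_grad()
def tta_ensemble_predict(models, image, tta_transforms):
    """Average softmax predictions over several models and several
    test-time augmented views of a single image.

    models         : list of trained nn.Module classifiers
    image          : 1xCxHxW tensor
    tta_transforms : callables mapping a batch tensor to an augmented batch
    """
    probs = []
    for model in models:
        model.eval()
        for t in tta_transforms:
            logits = model(t(image))
            probs.append(torch.softmax(logits, dim=1))
    return torch.stack(probs).mean(dim=0)   # 1 x num_classes

# example TTA views: identity and horizontal flip
tta_views = [
    lambda x: x,
    lambda x: torch.flip(x, dims=[3]),
]
```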

Method

Mask-Level Copy-Paste

Mask-Level Copy-Paste is proposed mainly to tackle several of the challenging OOD cases. Concretely, a weakly supervised semantic segmentation (WSSS) model, MCTformer, is trained using the image-level labels of the ImageNet-1K and ROBIN training datasets, and this model is then used to segment the foreground objects in each image. (As an aside, CVHub has previously covered a weakly supervised object detection method built on the YOLOv5 framework; interested readers can check the account's earlier articles. Who says mask-level foregrounds have to come from a segmentation model?)

Depending on whether the class label is relevant to the task in this challenge, the foreground objects fall into two groups:

  • Task-relevant foreground objects
  • Task-irrelevant foreground objects

Similarly, images from the ImageNet-1K and ROBIN training datasets can also be divided into task-relevant and task-irrelevant parts. With this split, the three schemes mentioned above can be used to alleviate different domain-shift problems. Finally, by stacking them with other data augmentation strategies, including AutoAug, CutMix, and rule-based weather simulation, we obtain pre-trained models with strong domain generalization ability.

Editor's note: the data augmentation strategies in this solution can also be applied to your day-to-day development tasks. For example, readers often ask how to simulate data under different weather conditions; a rough sketch is given below.
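The report does not spell out the exact weather-simulation recipe, so the following NumPy sketch is only one possible rule-based version: fog/haze is faked by blending toward a light-gray "atmosphere", and rain by darkening the scene slightly and drawing short bright streaks; all parameters are arbitrary.

```python
import numpy as np

def add_fog(img, intensity=0.5):
    """Blend the image toward a light-gray 'atmosphere' to mimic fog/haze."""
    fog_color = np.full_like(img, 230, dtype=np.uint8)
    out = (1 - intensity) * img.astype(np.float32) + intensity * fog_color
    return out.astype(np.uint8)

def add_rain(img, num_drops=400, drop_len=12, brightness=0.85, rng=None):
    """Darken the scene a little and overlay short bright vertical streaks."""
    rng = rng or np.random.default_rng()
    out = img.astype(np.float32) * brightness
    h, w = img.shape[:2]
    xs = rng.integers(0, w, num_drops)
    ys = rng.integers(0, h - drop_len, num_drops)   # assumes h > drop_len
    for x, y in zip(xs, ys):
        out[y:y + drop_len, x] = np.minimum(out[y:y + drop_len, x] + 60, 255)
    return out.astype(np.uint8)
```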

Label-Periodically-Updated DivideMix

Here the authors cast test-time domain adaptation as a noisy-label learning problem. Concretely, the pre-trained model introduced in the previous step is used to label the test-set data, and these labels can be regarded as noisy labels (after all, the model cannot be 100% accurate).

Then, after obtaining the noisy labels, DivideMix is modified into a label-periodically-updated version so that the noisy labels can be corrected in time without being overfit. In addition, unlike the original DivideMix, the popular strong/weak augmentation strategy is adopted in the MixMatch component: weak augmentation is used for pseudo-labeling and strong augmentation for model optimization. The overall pipeline is shown in the figure below, followed by a simplified code sketch:

Pipeline
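For readers who prefer code to diagrams, below is a heavily simplified PyTorch-style sketch of the test-time training loop described above. It keeps only two ingredients: the DivideMix-style clean/noisy split via a two-component GMM over per-sample losses, and the periodic refresh of the pseudo labels with the current model (weak augmentation for labeling, strong augmentation for optimization). The co-trained twin networks and the MixMatch branch of the full method are omitted, and the loaders are assumed to iterate the target set in a fixed order.

```python
import torch
import torch.nn.functional as F
from sklearn.mixture import GaussianMixture

@torch.no_grad()
def assign_pseudo_labels(model, loader_weak, device="cpu"):
    """Label the unlabeled target data with the current model (weak views)."""
    model.eval()
    labels = []
    for x, _ in loader_weak:
        labels.append(model(x.to(device)).argmax(dim=1).cpu())
    return torch.cat(labels)

@torch.no_grad()
def split_clean_noisy(model, loader_weak, pseudo_labels, device="cpu", thr=0.5):
    """DivideMix-style split: fit a 2-component GMM on per-sample losses and
    treat samples in the low-loss component as 'clean'."""
    model.eval()
    losses, i = [], 0
    for x, _ in loader_weak:
        logits = model(x.to(device))
        y = pseudo_labels[i:i + x.size(0)].to(device)
        losses.append(F.cross_entropy(logits, y, reduction="none").cpu())
        i += x.size(0)
    losses = torch.cat(losses).view(-1, 1).numpy()
    gmm = GaussianMixture(n_components=2).fit(losses)
    clean_comp = gmm.means_.argmin()                 # low-loss component
    p_clean = gmm.predict_proba(losses)[:, clean_comp]
    return torch.from_numpy(p_clean) > thr           # boolean mask

def test_time_training(model, loader_weak, loader_strong, optimizer,
                       epochs=10, refresh_every=2, device="cpu"):
    """Periodically refresh the noisy labels, split them with the GMM,
    and optimize the model on strongly augmented 'clean' samples."""
    pseudo = assign_pseudo_labels(model, loader_weak, device)
    for epoch in range(epochs):
        if epoch > 0 and epoch % refresh_every == 0:
            pseudo = assign_pseudo_labels(model, loader_weak, device)  # label update
        clean_mask = split_clean_noisy(model, loader_weak, pseudo, device)
        model.train()
        i = 0
        for x_strong, _ in loader_strong:            # same sample order as loader_weak
            y = pseudo[i:i + x_strong.size(0)].to(device)
            keep = clean_mask[i:i + x_strong.size(0)].to(device)
            i += x_strong.size(0)
            if keep.sum() == 0:
                continue
            logits = model(x_strong.to(device))
            loss = F.cross_entropy(logits[keep], y[keep])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```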

For the specific technical details and experimental settings, interested readers can refer to the original report; they are not covered here, and the same applies below.

Object Detection Track

Paper: https://arxiv.org/pdf/2301.04796.pdf

Code: https://github.com/hikvision-research/OOD-CV (404 Not Found)

Track: https://codalab.lisn.upsaclay.fr/competitions/6784#learn_the_details

To tackle the OOD problem in the object detection track, this solution proposes a simple yet effective Generalize-then-Adapt (G&A) framework, which consists of two parts:

  • A two-stage domain generalization part
  • A one-stage domain adaptation part

The domain generalization part consists of (i) a supervised model pre-training stage on the source data to warm up the model, and (ii) a weakly semi-supervised model pre-training stage that uses box-level-labeled source data together with image-level-labeled auxiliary data (ImageNet-1K) to further boost performance. The domain adaptation part is implemented as a source-free domain adaptation paradigm, in which the model is further optimized in a self-training manner using only the pre-trained model and unlabeled target data.

Motivation

To improve the robustness of the model in unknown target domains, the authors propose the simple yet effective Generalize-then-Adapt (G&A) framework to address the degradation of object detection under domain shift. The concrete plan is as follows:

Supervised Model Pre-training

As in the classification track, a strong baseline model is first trained to cope with domain shift. The baseline uses labeled source data together with various strong data augmentation strategies to simulate potential out-of-distribution data.

Weakly Semi-Supervised Model Pretraining

Previous work has shown that out-of-distribution generalization can be further enhanced with additional auxiliary training data. ImageNet-1K can therefore be treated as auxiliary training data that carries only image-level labels. In this way, the detector pre-trained in the first stage can be further optimized on labeled source data (the ROBIN training set with box-level labels) together with weakly labeled auxiliary data (ImageNet-1K with image-level labels), which is exactly the weakly semi-supervised object detection setting.
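The report does not give implementation details for this stage, so the sketch below shows only one common way to exploit image-level labels in a detector, as an assumption-laden illustration rather than the authors' recipe: the stage-1 detector generates pseudo boxes on ImageNet-1K images, and only confident boxes whose class agrees with the image-level label are kept; a torchvision-style detector interface is assumed.

```python
import torch

@torch.no_grad()
def pseudo_boxes_from_weak_labels(detector, image, image_label_set, score_thr=0.7):
    """Turn an image with only image-level labels (e.g. from ImageNet-1K)
    into pseudo box annotations for weakly semi-supervised training.

    detector        : stage-1 pre-trained detector; assumed to follow the
                      torchvision detection API (eval mode returns a list of
                      dicts with 'boxes', 'labels', 'scores')
    image           : CxHxW float tensor
    image_label_set : set of class ids known to be present in the image
    """
    detector.eval()
    pred = detector([image])[0]
    boxes = pred["boxes"].cpu()
    labels = pred["labels"].cpu()
    scores = pred["scores"].cpu()
    # keep only confident boxes whose class agrees with the image-level label
    keep = (scores > score_thr) & torch.tensor(
        [int(l) in image_label_set for l in labels], dtype=torch.bool)
    return boxes[keep], labels[keep]
```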

Source-Free Domain Adaptation

This stage corresponds to test-time training: the model is adapted to the target domain using only the source pre-trained object detector and unlabeled target data, without access to the source data. In this challenge, it is simply implemented as a Mean-Teacher-based self-training mechanism.
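Below is a minimal sketch of what a Mean-Teacher-based self-training step typically looks like, written as an illustration rather than the authors' implementation: the teacher is an EMA copy of the student, it pseudo-labels a weakly augmented target view, and the student is optimized on a strongly augmented view; loss_fn is a placeholder for the task-specific pseudo-label loss (pseudo boxes for a detector, pseudo class labels for a classifier).

```python
import copy
import torch

def build_teacher(student):
    """The teacher starts as a frozen copy of the source pre-trained student."""
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Slowly move the teacher weights toward the student weights."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(momentum).add_(s_p, alpha=1.0 - momentum)

def mean_teacher_step(student, teacher, weak_img, strong_img,
                      optimizer, loss_fn, score_thr=0.8):
    """One self-training step on a batch of unlabeled target images."""
    teacher.eval()
    with torch.no_grad():
        teacher_pred = teacher(weak_img)     # pseudo labels from the weak view
    student.train()
    student_pred = student(strong_img)       # student learns on the strong view
    loss = loss_fn(student_pred, teacher_pred, score_thr)  # task-specific loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)             # EMA keeps the teacher stable
    return loss.item()
```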

Finally, after integrating TTA and Model Ensemble, the overall scheme diagram is as follows:

Method

Compared with traditional domain adaptation methods that jointly train on source and target data, the G&A framework proposed here is more practical in real-world scenarios because it decouples the joint training paradigm, as shown in the following figure:

As the figure shows, the domain generalization stage uses only source data, while the test-time domain adaptation stage uses only target data. Because the G&A framework only transfers the pre-trained model and never exchanges source data, it avoids two problems:

  • Expensive data transmission
  • Data privacy leakage

In practice, the generalization step is usually performed on the server side, while the adaptation step is performed on the client side, allowing the model to evolve on its own.

In fact, these two steps can be viewed as upstream and downstream operations. However, existing work in the OOD community usually focuses on either the domain generalization step or the test-time domain adaptation step, without unifying the two. The authors therefore hope that the top solutions on this challenge's leaderboard will encourage the community to study how to integrate both steps to further combat model degradation under domain shift, which is well worth learning from.

Note: readers interested in OOD can also dig further into the related literature, for example the OOD-CV benchmark paper published at ECCV 2022.

Summary

At the beginning of this article we motivated the importance of the OOD problem with a vivid example, and then moved on to today's protagonist: out-of-distribution generalization in computer vision. We then walked through the championship solutions of the image classification and object detection tracks of the ECCV 2022 OOD-CV challenge, and we hope they give you some inspiration.

CVHub

If you are also interested in the full stack of artificial intelligence and computer vision, you are strongly encouraged to follow the informative and fun WeChat account "CVHub", which publishes high-quality, original, in-depth interpretations of cutting-edge papers and mature industrial solutions across many fields every day! You are also welcome to add the editor on WeChat (cv_huber, note "CSDN") to join the official CVHub academic & technical discussion group and chat about more interesting topics!
