Interpretation of OmniBenchmark, the large-scale representation learning benchmark released by SenseTime at ECCV 2022

In recent years, representation learning algorithms based on deep models have achieved excellent results in certain knowledge domains (such as human faces, animals, etc.). However, the visual categories covered by existing datasets are limited. A dataset whose coverage of visual categories is broad enough is therefore important for learning comprehensive visual representations that apply to many visual categories.

In response to this problem, researchers from SenseTime proposed OmniBenchmark at ECCV 2022. This new benchmark for representation learning includes 21 category domains (called realms in the paper), each corresponding to a sub-dataset, covering a total of 7,372 visual categories (called concepts in the paper) and 1,074,346 images. OmniBenchmark covers most visual category domains.

Compared with existing representation learning and pre-training benchmarks, OmniBenchmark has stronger diversity and higher complexity. Most importantly, OmniBenchmark covers most visual categories, and there is no overlap between these category concepts. These properties support learning and evaluating more robust visual representations, and at the same time make OmniBenchmark a more challenging benchmark that better reflects the generalization ability of a model.

Official website: https://zhangyuanhan-ai.github.io/OmniBenchmark/

Paper: https://arxiv.org/pdf/2207.07106.pdf

This article is organized into the following parts:

1. Preliminary introduction to OmniBenchmark

2. OmniBenchmark construction process

3. OmniBenchmark evaluation method and experimental results

1. Preliminary introduction to OmniBenchmark

Large pre-trained models have become a fundamental technique in computer vision. The generalization ability of a pre-trained model, that is, whether it can effectively help the learning of various downstream tasks, is a key indicator of the quality of the pre-trained model. In existing research, the benchmark datasets for evaluating the generalization ability of pre-trained models mainly focus on two downstream task scenarios: cross-domain scenarios (such as from natural domains to synthetic domains) and cross-task scenarios (such as from image classification to instance segmentation).

OmniBenchmark focuses on another scenario: cross-category scenarios (e.g. from pets to street views). In addition, existing benchmark datasets either cover only a small portion of semantic categories or have large overlaps between semantic categories. For example, ImageNet-1K [2] contains only a limited number of categories, mainly mammals, musical instruments, electronic equipment, and commodities, as shown in Figure 2(a). Meanwhile, the evaluation suite used by CLIP [3] consists of 24 sub-datasets, but the visual categories contained in these sub-datasets overlap heavily, as shown in Figure 2(b).

All in all, a benchmark with broad coverage of semantic categories and no semantic repetition is an ideal benchmark for pre-trained models, since it better reflects the generalization ability of the model.

Figure 2 Advantages of OmniBenchmark[1]

OmniBenchmark is proposed against the background of the above problems. It contains 21 sub-datasets, each corresponding to a category domain (realm); Figure 2 gives examples of several realms. In total, OmniBenchmark includes 7,372 categories and 1,074,346 images.

The OmniBenchmark dataset has two distinct advantages.

First, strong diversity and high complexity: OmniBenchmark covers more than twice as many category domains as ImageNet-1K, and each category domain contains more than nine times as many categories as in ImageNet-1K.

Second, it is concise and easy to use : OmniBenchmark focuses on image classification tasks, and judges the generalization performance of the pre-trained model by measuring the generalization ability of visual expressions between different categories of domains, because it is aimed at classification tasks, OmniBenchmark is easy to use, and there is no overlap between the categories of OmniBenchmark , which has strong simplicity.

2. OmniBenchmark construction process

Next, we introduce the construction process of OmniBenchmark, which helps us understand the dataset more deeply.

The construction process of OmniBenchmark is carefully designed and divided into four steps: new category insertion -> category selection (filtering) -> category domain selection (filtering) -> labeling and deduplication.

2.1 New category insertion

The categories in ImageNet-1K are obtained from WordNet [17]; however, WordNet alone contains too few categories to provide enough visual concepts. To obtain more visual categories, the authors connect the categories in WikiData [4] to WordNet, and the augmented WordNet contains about 210K categories.
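To make the idea concrete, below is a toy sketch of attaching external category labels to WordNet by simple lemma matching, using NLTK's WordNet interface. This is only an illustration of the general idea; the labels in the list are hypothetical examples, and the authors' actual linking procedure between WikiData and WordNet is more involved.

# A toy sketch: attach external (WikiData-style) category labels to WordNet by lemma matching.
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

external_labels = ["espresso machine", "quadcopter", "chemistry"]  # hypothetical examples

def attach_to_wordnet(label):
    # Return the WordNet noun synsets whose lemmas match this label (possibly none).
    return wn.synsets(label.replace(" ", "_"), pos="n")

for label in external_labels:
    print(label, "->", [s.name() for s in attach_to_wordnet(label)])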

2.2 Category selection (filtering)

After new category insertion, about 210K categories are obtained; however, not all of them correspond to useful visual concepts.

To obtain useful visual objects, a selection (filtering) process is performed as follows:

First, labelers are asked to discard categories that contain sensitive content (such as violence or pornography);

Second, non-visual categories are removed (such as non-physical concepts like chemistry and vitamins);

Third, semantically repeated categories are removed to ensure mutual exclusivity between categories; in practice, only the leaf nodes of WordNet are kept (see the sketch after this list);

Fourth, categories with too few crawled source images are removed. The process of category selection is shown in Fig. 3(b).
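As a rough illustration of the leaf-node rule in the third step, the sketch below uses NLTK's plain WordNet hierarchy as a stand-in for the augmented hierarchy described above and keeps only noun synsets that have no hyponyms; the real pipeline operates on the roughly 210K-category augmented hierarchy, not on WordNet alone.

from nltk.corpus import wordnet as wn

def is_leaf(synset):
    # A leaf concept has no hyponyms, i.e. no more specific concepts below it.
    return len(synset.hyponyms()) == 0

leaf_concepts = [s for s in wn.all_synsets(pos="n") if is_leaf(s)]
print(len(leaf_concepts), "leaf noun concepts kept as candidate categories")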

Figure 3 OmniBenchmark construction process and distribution[1]

2.3 Category domain selection (filtering)

In WordNet, categories are organized hierarchically, forming a tree-like structure. Based on this structure, the authors propose a method for selecting category domains (realms) from WordNet, where a realm is a collection of multiple visual categories.

The specific selection principles are as follows:

First, select category domains that cover more than 20 categories;

Second, do not select domains that are contained by other domains (that is, the tree corresponding to a category domain cannot be a subtree of another selected tree);

Third, try to avoid category domains that are not natural concepts or that contain personal information. The process of category domain selection is shown in Fig. 3(c). A minimal sketch of the first two rules follows below.
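The sketch below illustrates the first two rules on the same NLTK WordNet stand-in (the more-than-20-categories threshold and the "not a subtree of another realm" check); the third rule requires human judgment and is not shown. The candidate list is hand-picked for illustration.

from nltk.corpus import wordnet as wn

def leaf_count(synset):
    # Number of leaf concepts in the subtree rooted at this synset.
    subtree = set(synset.closure(lambda s: s.hyponyms()))
    return sum(1 for s in subtree if not s.hyponyms())

def select_realms(candidates, min_categories=20):
    realms = []
    for cand in sorted(candidates, key=leaf_count, reverse=True):
        if leaf_count(cand) <= min_categories:
            continue  # rule 1: a realm must cover more than 20 categories
        # rule 2: skip candidates whose tree is a subtree of an already selected realm
        if any(cand in set(r.closure(lambda s: s.hyponyms())) for r in realms):
            continue
        realms.append(cand)
    return realms

candidates = [wn.synset("bird.n.01"), wn.synset("device.n.01"), wn.synset("oscine.n.01")]
for realm in select_realms(candidates):
    print(realm.name(), leaf_count(realm))

Because oscine.n.01 (songbirds) lies inside the bird.n.01 subtree, rule 2 drops it once bird.n.01 has been selected.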

2.4 Labeling and deduplication

After obtaining all the category domains and their categories, the images are labeled and deduplicated.

The labeling method itself is relatively simple: each image is assigned a label. For each candidate label, 5 annotators judge whether the image content is consistent with the category, and the label is kept only when 3 or more of the 5 annotators agree.
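As a toy illustration of this 3-of-5 majority-vote rule (the function name and data layout are assumptions made for the example, not part of any released toolchain):

def keep_label(votes, required=3):
    # votes: one boolean per annotator, True meaning "the image matches the label"
    return sum(votes) >= required

print(keep_label([True, True, True, False, False]))   # True: 3 of 5 agree, label is kept
print(keep_label([True, True, False, False, False]))  # False: only 2 of 5 agree, label is dropped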

For deduplication, difference hashing is used to remove images that duplicate those in datasets such as Bamboo-CLS [5], ImageNet-22K, and PASCAL VOC [6].
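A minimal sketch of difference-hash deduplication using the third-party Pillow and ImageHash packages is shown below; the file paths and the distance threshold are assumptions made for illustration, since the article does not specify the exact implementation details.

from PIL import Image
import imagehash  # pip install ImageHash

def dhash(path):
    return imagehash.dhash(Image.open(path))

# hypothetical reference images drawn from the datasets to deduplicate against
reference_hashes = {dhash(p) for p in ["imagenet22k/img_001.jpg", "pascal_voc/img_002.jpg"]}

def is_duplicate(path, max_distance=4):
    h = dhash(path)
    # treat two images as duplicates if their hashes differ in only a few bits
    return any(h - ref <= max_distance for ref in reference_hashes)

kept = [p for p in ["candidate/img_a.jpg", "candidate/img_b.jpg"] if not is_duplicate(p)]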

The authors also give the data distribution of OmniBenchmark, as shown in Figure 3(a). The 815 categories of ImageNet-1K that fall into the 21 realms of OmniBenchmark basically reflect the data distribution of ImageNet-1K. At the same time, ImageNet-1K contains only a few concepts in each realm and can hardly cover a realm completely. In contrast, OmniBenchmark contains more than 21 times as many categories as ImageNet-1K and can cover the realms much more completely.

The complexity and diversity of OmniBenchmark allow it to represent the distribution of natural visual domains relatively thoroughly, making it a stronger benchmark for pre-training and representation learning. Below are some example images from 3 realms (Consumer Goods, Bird, and Device).

Figure 4 Sample images of Consumer Goods, Bird, and Device

Source: https://zhangyuanhan-ai.github.io/OmniBenchmark/samples/samples.html

3. OmniBenchmark evaluation method and experimental results

OmniBenchmark is mainly applied to image classification tasks in natural scenes, focusing on the generalization ability of visual representations across different visual domains. The authors give an evaluation protocol on OmniBenchmark, which follows [7] and is called linear probing in the paper.

Linear probing assumes that a good visual representation should be an all-round representation that can express various concept domains well without updating the weights of the feature extractor (backbone network). Therefore, during evaluation the weights of the feature extractor are frozen and fine-tuning is not allowed. A linear classifier is then learned on each realm sub-dataset for evaluation. The sizes of the training and validation sets of each realm are shown in Table 1, and the evaluation metric is top-1 accuracy.
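The PyTorch sketch below illustrates the linear-probing protocol described above: an ImageNet-pretrained ResNet-50 backbone is frozen, a linear classifier is trained on top of its features for one realm, and top-1 accuracy is reported on that realm's validation split. The data loaders, epoch count, and learning rate are placeholders; this is not the official evaluation code, which is available in the GitHub repository linked below.

import torch
import torch.nn as nn
import torchvision

# Frozen feature extractor: ImageNet-pretrained ResNet-50 without its classification head.
backbone = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = nn.Identity()
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

def linear_probe(train_loader, val_loader, num_classes, epochs=10, lr=0.1):
    # One linear classifier per realm, trained on top of the frozen 2048-d features.
    clf = nn.Linear(2048, num_classes)
    opt = torch.optim.SGD(clf.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():
                feats = backbone(images)  # backbone weights stay fixed
            loss = loss_fn(clf(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    # Top-1 accuracy on the realm's validation split.
    correct = total = 0
    with torch.no_grad():
        for images, labels in val_loader:
            preds = clf(backbone(images)).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total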

For more usage methods, please refer to OmniBenchmark's GitHub repository: https://github.com/ZhangYuanhan-AI/OmniBenchmark .

Table 1 The amount of data in the training set and validation set for each category domain

Source: https://zhangyuanhan-ai.github.io/OmniBenchmark/distribution/distribution_pure_statistics.html

In the experimental part of the paper, the authors report the results of existing, well-performing methods on OmniBenchmark. A total of 22 methods/models are evaluated, grouped into the following four categories:

3.1 Self-supervised approach

Various self-supervised methods, including MoCo v2 [8], SwAV [9], DINO [10], etc. All of these methods are based on the ResNet-50 [11] backbone network.

3.2 Different backbone networks

A variety of backbone networks, including the ResNet series, EfficientNet-B4 [12], and the transformer-based Swin-T [13].

3.3 Different regularization methods

The ResNet-50 model combined with various regularization techniques, such as data augmentation (e.g. CutMix [14]) or distillation (MEAL V2 [15]).

3.4 Larger amount of pre-training data

The basic ResNet-50 model, but pre-trained on larger amounts of data, including CLIP, Bamboo-CLS, IG-1B [16], etc.

All of these methods use a ResNet-50 backbone (except the different backbone networks group) and are pre-trained on ImageNet-1K (except the larger pre-training data group). Evaluation follows the linear-probing protocol: the backbone weights are frozen and a linear classifier is trained on each category domain for evaluation.

The evaluation results are shown in Figure 5 and Table 2. The metric is top-1 accuracy, reported relative to the ResNet-50 baseline model.

Figure 5 Experimental results of various models in OmniBenchmark[1]

Table 2 Experimental results of various models in OmniBenchmark [1]

Based on the experimental results, the author gives the following conclusions:

1. High similarity between the pre-training data and a particular domain helps improve the model's performance on that domain.

For example, SwAV-Places in Table 2 is pre-trained on the Places dataset, most of which consists of scenes and buildings. This method achieves the best results among all self-supervised methods on the Structure (building) and Region realms.

2. The strong data augmentation used by self-supervised methods may hurt classification on some fine-grained domains.

Self-supervised learning methods typically rely on strong data augmentation to learn visual representations of images. However, such augmentation can hurt performance on fine-grained domains; for example, most self-supervised methods perform poorly on the Bird realm.

3. Larger CNN networks may overfit to the domains covered by ImageNet-1K.

The authors experimented with several larger and deeper CNN networks. As shown in Table 2, most of these models achieve better results than ResNet-50. However, the improvement is larger on realms that ImageNet-1K covers (such as mammal, device, etc.) and smaller on realms that ImageNet-1K does not cover (such as aircraft, plant, etc.). The authors speculate that after pre-training on ImageNet-1K, these models overfit to ImageNet-1K.

4. Data augmentation methods are very sensitive to domain shift.

As Table 2 shows, the various data augmentation methods perform relatively poorly: they neither achieve a clear improvement on realms covered by ImageNet-1K, nor do they perform well on realms that ImageNet-1K does not cover. The authors point out that this may be because these methods overfit more severely on ImageNet-1K.

5. OmniBenchmark is a better benchmark dataset than ImageNet-1K.

In the Bamboo-CLS paper, DINO has a 5% advantage over Bamboo-CLS on ImageNet-1K. However, across 10 downstream tasks, Bamboo-CLS outperforms DINO on 9 of them. This shows that ImageNet-1K does not fully reflect the generalization ability of visual representations.

On OmniBenchmark, Bamboo-CLS has an advantage of nearly 7 points over DINO, which is consistent with the results on those 9 downstream tasks. This reflects that OmniBenchmark is a stronger benchmark than ImageNet-1K and better reflects the generalization ability of the visual representations a model has learned.

4. Summary

OmniBenchmark is a large-scale and diverse representation learning benchmark that supports representation learning, model pre-training, natural image classification and other tasks.

Compared with existing benchmarks such as ImageNet and the CLIP evaluation suite, OmniBenchmark has the advantages of strong diversity and ease of use. Most importantly, OmniBenchmark better reflects the generalization performance of the visual representations a model learns across multiple category domains.

References

[1] Zhang Y, Yin Z, Shao J, et al. Benchmarking omni-vision representation through the lens of visual realms[J]. arXiv preprint arXiv:2207.07106, 2022.

[2] Deng J, Dong W, Socher R, et al. ImageNet: A large-scale hierarchical image database[C]//2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009: 248-255.

[3] Radford A, Kim J W, Hallacy C, et al. Learning transferable visual models from natural language supervision[C]//International Conference on Machine Learning. PMLR, 2021: 8748-8763.

[4] Vrandečić D, Krötzsch M. Wikidata: A free collaborative knowledgebase[J]. Communications of the ACM, 2014, 57(10): 78-85.

[5] Zhang Y, Sun Q, Zhou Y, et al. Bamboo: Building Mega-Scale Vision Dataset Continually with Human-Machine Synergy[J]. arXiv preprint arXiv:2203.07845, 2022.

[6] Everingham M, Van Gool L, Williams C K I, et al. The pascal visual object classes (voc) challenge[J]. International journal of computer vision, 2010, 88(2): 303-338.

[7] Goyal P, Mahajan D, Gupta A, et al. Scaling and benchmarking self-supervised visual representation learning[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 6391-6400.

[8] Chen X, Fan H, Girshick R, et al. Improved baselines with momentum contrastive learning[J]. arXiv preprint arXiv:2003.04297, 2020.

[9] Caron M, Misra I, Mairal J, et al. Unsupervised learning of visual features by contrasting cluster assignments[J]. Advances in Neural Information Processing Systems, 2020, 33: 9912-9924.

[10] Caron M, Touvron H, Misra I, et al. Emerging properties in self-supervised vision transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 9650-9660.

[11] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 770-778.

[12] Tan M, Le Q. EfficientNet: Rethinking model scaling for convolutional neural networks[C]//International Conference on Machine Learning. PMLR, 2019: 6105-6114.

[13] Liu Z, Lin Y, Cao Y, et al. Swin Transformer: Hierarchical vision transformer using shifted windows[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 10012-10022.

[14] Yun S, Han D, Oh S J, et al. CutMix: Regularization strategy to train strong classifiers with localizable features[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 6023-6032.

[15] Shen Z, Savvides M. MEAL V2: Boosting vanilla ResNet-50 to 80%+ top-1 accuracy on ImageNet without tricks[J]. arXiv preprint arXiv:2009.08453, 2020.

[16] Yalniz I Z, Jégou H, Chen K, et al. Billion-scale semi-supervised learning for image classification[J]. arXiv preprint arXiv:1905.00546, 2019.

[17] Miller G A. WordNet: An electronic lexical database[M]. MIT Press, 1998.

Author丨Julio Zhao

- End -

That is all for this sharing. To obtain massive dataset resources, please visit the OpenDataLab official website; for more open-source tools and projects, please visit the OpenDataLab GitHub space. If there is anything else you would like to see, let our assistant know. More datasets are coming online, along with more comprehensive dataset interpretations, online Q&A, and an active community of peers. Welcome to add the WeChat account opendatalab_yunying to join the official OpenDataLab communication group.
