CVPR 2022 | Using domain adaptation, Peking University and ByteDance propose a new weakly supervised object localization framework

Author: Zhu Lei

Source: Heart of the Machine

By treating weakly supervised object localization as a domain adaptation task between the image-feature and pixel-feature domains, Peking University and ByteDance propose a new framework that significantly improves weakly supervised localization trained with only image-level labels.

As a fundamental problem in computer vision, object localization provides important location information for scene understanding, autonomous driving, intelligent diagnosis and treatment, and other applications. However, training object localization models relies on dense annotations such as bounding boxes or object masks. Producing these dense labels requires judging the class of every pixel in an image, which greatly increases the time and labor of the annotation process.

To reduce the annotation burden, weakly supervised object localization (WSOL) trains localization models using only image-level labels (such as image categories) as the supervision signal, removing the need for pixel-level annotations during training. Most WSOL methods follow the class activation map (CAM) pipeline: they train a classifier on image-level features and then apply it to pixel-level features to obtain localization results. However, image-level features usually retain abundant object information, so identifying only the most discriminative object features is enough to classify the image correctly. When the same classifier is applied to pixel-level features, which carry much less object information, the resulting localization map therefore often covers only part of the object rather than the whole object.
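The CAM pipeline described above can be sketched in a few lines: the image-level classifier's weights are applied directly to pixel-level (convolutional) features to score each spatial location. This is a minimal illustrative sketch of the standard CAM recipe, not the paper's code; the array shapes and function names are assumptions.

```python
import numpy as np

def cam(feature_map, fc_weights, class_idx):
    """Class activation map: apply the image-level classifier's weights
    to pixel-level features.
    feature_map: (C, H, W) conv features; fc_weights: (num_classes, C)."""
    # Weighted sum over channels gives a per-pixel class score map (H, W).
    heat = np.tensordot(fc_weights[class_idx], feature_map, axes=([0], [0]))
    # Normalize to [0, 1] for visualization/thresholding.
    heat -= heat.min()
    if heat.max() > 0:
        heat /= heat.max()
    return heat
```

Thresholding such a heatmap yields the localization box or mask; the partial-activation problem discussed above appears because only the most discriminative pixels score highly.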

To solve this problem, this paper treats the CAM-based weakly supervised localization process as a special domain adaptation task: the classifier trained on the source image-feature domain should retain good classification performance when applied to the target pixel-feature domain, so that it localizes objects better at test time. From this perspective, domain adaptation methods can be naturally transferred to the weakly supervised localization task, allowing a model trained only on image labels to localize target objects more accurately.


  • Paper: https://arxiv.org/abs/2203.01714

  • Code: https://github.com/zh460045050/DA-WSOL_CVPR2022

This work has been accepted to CVPR 2022, and the complete training code and models are open source. It was mainly developed through discussions between Zhu Lei of the Peking University Molecular Imaging/Medical Intelligence Laboratory and She Qi of ByteDance, under the guidance of Lu Yanye of the same laboratory.

Method

Figure 1 - Overall idea of the method

Weakly supervised object localization can be viewed as training the model e(∙) with full supervision on the image-feature domain (source domain S) using image-level labels (source-domain ground-truth labels Y^s), and then applying the model to the pixel-feature domain (target domain T) to obtain object localization heatmaps. Our method introduces domain adaptation into this process to narrow the gap between the feature distributions of the source domain S and the target domain T, thereby improving the classification performance of the model e(∙) on the target domain T. The loss function can thus be expressed as:

L = L_c(S, Y^s) + L_a(S, T)

where L_c is the source domain classification loss and L_a is the domain adaptation loss.
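As a concrete reading of this two-term objective, the sketch below combines a standard cross-entropy classification loss on the source (image-level) features with a generic adaptation term; the explicit `weight` hyperparameter and function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def cross_entropy(logits, label):
    # Source-domain classification loss L_c on image-level logits.
    z = logits - logits.max()                    # stabilize softmax
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[label] + 1e-12)

def total_loss(source_logits, source_label, adaptation_loss, weight=1.0):
    # L = L_c(S, Y^s) + weight * L_a(S, T); the weight is illustrative.
    return cross_entropy(source_logits, source_label) + weight * adaptation_loss
```

Any concrete choice of L_a (such as the DAL loss developed below) simply plugs into the `adaptation_loss` term.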

Since the source and target domains in weakly supervised localization are the image domain and the pixel domain respectively, the domain adaptation task we face has some unique properties: ① the numbers of target-domain and source-domain samples are unbalanced (the target domain has N times as many samples as the source domain, where N is the number of pixels per image); ② the target domain contains samples whose labels do not exist in the source domain (background pixels belong to no object category); ③ target-domain samples are related to source-domain samples (image features are obtained by aggregating pixel features). To better account for these three properties, we further propose a Domain Adaptive Localization (DAL) loss as L_a(S, T) to narrow the feature distributions of the image domain S and the pixel domain T.

Figure 2 - Division of the source and target domains in weakly supervised localization, and the role of each loss term

First, as shown in Figure 2-A, we further divide the target-domain samples T into three subsets: ① the "pseudo source domain sample set T^f", target-domain samples whose feature distribution is similar to the source domain; ② the "unknown class sample set T^u", target-domain samples whose categories do not exist in the source domain; ③ the "true target domain sample set T^t", the remaining samples. Based on these three subsets, our proposed domain adaptive localization loss can be expressed as:

L_a(S, T) = L_d(S ∪ T^f, T^t) + L_u(T^u)

As the formula shows, the domain adaptive localization loss treats the pseudo source domain samples T^f as a complement to the source-domain samples rather than as target-domain samples, which alleviates the sample-imbalance problem. Meanwhile, to reduce the interference of the unknown-class samples T^u on classification accuracy, a conventional adaptation loss L_d (such as maximum mean discrepancy, MMD) is used only to narrow the feature distributions between the enlarged source domain sample set S ∪ T^f and the true target domain sample set T^t. The samples T^u excluded from the adaptation process are instead used in a Universum regularizer L_u, which ensures that the class boundaries defined by the classifier also perceive the target domain.
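To make these terms concrete, here is a minimal numpy sketch of the DAL loss composition: an RBF-kernel MMD as the adaptation term L_d between the enlarged source set S ∪ T^f and the true target set T^t, plus a simple Universum-style penalty on unknown samples. The kernel choice, the magnitude penalty for L_u, and all names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    # Pairwise RBF kernel between rows of a and b.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(source, target, sigma=1.0):
    # Squared maximum mean discrepancy: the adaptation term L_d.
    return (rbf_kernel(source, source, sigma).mean()
            + rbf_kernel(target, target, sigma).mean()
            - 2 * rbf_kernel(source, target, sigma).mean())

def universum_reg(logits_u):
    # Illustrative Universum penalty L_u: push classifier scores on
    # unknown-class samples toward the decision boundary (zero logits).
    return np.abs(logits_u).mean()

def dal_loss(feats_s, feats_tf, feats_tt, logits_tu):
    # L_a(S, T) = L_d(S ∪ T^f, T^t) + L_u(T^u), as in the formula above.
    enlarged_source = np.concatenate([feats_s, feats_tf], axis=0)
    return mmd2(enlarged_source, feats_tt) + universum_reg(logits_tu)
```

Note how T^f enters only through the enlarged source set, while T^u enters only through the regularizer, matching the roles described above.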

Figure 2-B visualizes the expected effect of the source-domain classification loss and the domain adaptive localization loss: L_c ensures that source-domain samples of different categories are correctly separated, L_d narrows the gap between the source- and target-domain distributions, and L_u pulls the class boundaries closer to the target-domain samples with unknown labels.

Figure 3 - Overall workflow and the structure of the target sample assigner

Our proposed domain adaptive localization loss makes it easy to embed domain adaptation methods into existing weakly supervised localization methods and greatly improve their performance. As shown in Figure 3, embedding our method into an existing weakly supervised localization model only requires introducing a Target Sample Assigner to divide the target-domain samples into subsets. During training, the assigner uses a memory matrix M to update the anchors of the unknown class sample set T^u and the true target domain sample set T^t in real time, and runs a three-way K-means clustering with these two anchors and the source-domain features as cluster centers to determine the subset each target-domain sample belongs to. Based on this assignment, we obtain the domain adaptation loss L_d and the Universum regularizer L_u, and use them together with the source-domain classification loss L_c to supervise training. This narrows the source- and target-domain feature distributions and reduces the impact of unknown-class samples while preserving source-domain classification accuracy as much as possible. As a result, when the model is applied to the target domain (i.e., pixel features) for object localization, the quality of the resulting localization heatmap is significantly improved.
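The assigner step can be sketched as a plain three-way K-means whose centers start from the image-level (source) feature and two cached anchors. In the actual method the anchors come from the memory matrix M and are updated online, so the names and the update rule below are illustrative simplifications.

```python
import numpy as np

def assign_target_samples(pixel_feats, image_feat, anchor_u, anchor_t, n_iter=10):
    """Split pixel (target-domain) features into three subsets via 3-way
    K-means. Centers are seeded with the image-level feature (pseudo source
    T^f), the unknown-class anchor (T^u), and the true-target anchor (T^t).
    pixel_feats: (N, D); the anchors and image_feat: (D,)."""
    centers = np.stack([image_feat, anchor_u, anchor_t]).astype(float)
    for _ in range(n_iter):
        # Assign each pixel feature to its nearest center.
        d = ((pixel_feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # Recompute centers, keeping the old one if a cluster is empty.
        for k in range(3):
            if (labels == k).any():
                centers[k] = pixel_feats[labels == k].mean(0)
    return labels, centers  # labels: 0 = T^f, 1 = T^u, 2 = T^t
```

The resulting labels decide which pixels enlarge the source set for L_d and which feed the Universum regularizer L_u.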

Experiments

Figure 4 - Object localization heatmaps and final localization/segmentation results

We validate the effectiveness of our method on three weakly supervised object localization datasets: CUB-200, ImageNet, and OpenImages.

Qualitatively, because it enforces distribution consistency between the image- and pixel-feature domains, our method captures object regions more completely. Moreover, since the Universum regularization accounts for the influence of background pixels on the classifier, the localization heatmaps it generates adhere more closely to object edges and suppress responses on category-correlated background, such as the water surface around a duck.

The quantitative results likewise show that our method achieves strong localization performance on all three datasets, and in particular achieves the best localization performance for non-fine-grained localization (the ImageNet and OpenImages datasets). In terms of image classification, introducing domain adaptation costs some accuracy on the source domain, but this side effect can be resolved by adopting a multi-stage strategy and using an additional classification model (trained with L_c only) to produce the classification results.

In addition, our method generalizes well: it is compatible with multiple kinds of domain adaptation methods and a variety of weakly supervised object localization methods, improving their localization performance.



Origin blog.csdn.net/qq_29462849/article/details/123675926