[Computer Vision] CVPR 2023 New Paper | The latest improved method for anomaly detection: DeSTSeg

1. Introduction

DeSTSeg: Segmentation Guided Denoising Student-Teacher for Anomaly Detection

Paper address:

https://arxiv.org/pdf/2211.11317

2. Background

Industrial anomaly detection aims to discover abnormal regions of products and plays an important role in industrial quality inspection. In industrial scenarios, it is easy to obtain a large number of normal examples, but defective examples are scarce.

Most existing industrial anomaly detection methods are based on 2D images. However, in the quality inspection of industrial products, human inspectors rely on both 3D shape and color features to determine whether a product is defective, and 3D shape information is important and often necessary for that judgment.

The core idea of unsupervised anomaly detection is to find the difference between abnormal and normal representations. Current 2D industrial anomaly detection methods can be divided into two categories:

(1) Reconstruction-based methods. Image reconstruction tasks are widely used in anomaly detection to learn normal representations. Reconstruction-based methods are easy to implement for single-modal inputs (2D images or 3D point clouds), but for multimodal inputs it is difficult to define a reconstruction target.

(2) Methods based on pre-trained feature extractors. An intuitive way to utilize a pre-trained feature extractor is to map the extracted features onto a normal distribution and treat out-of-distribution features as anomalies.
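As an illustration of this second category (this is the general idea, not the paper's own method), a minimal NumPy sketch: fit a Gaussian to features extracted from normal samples, then score test features by their Mahalanobis distance to that fit. The feature dimensions and data here are toy placeholders.

```python
import numpy as np

def fit_gaussian(feats):
    """Fit a multivariate Gaussian to normal-image features.

    feats: (N, D) array of feature vectors from anomaly-free images.
    Returns the mean and the inverse covariance (regularized for stability).
    """
    mu = feats.mean(axis=0)
    cov = np.cov(feats, rowvar=False) + 1e-3 * np.eye(feats.shape[1])
    return mu, np.linalg.inv(cov)

def mahalanobis_score(x, mu, cov_inv):
    """Anomaly score: Mahalanobis distance of a feature vector to the fit."""
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

# Toy usage: features of normal samples cluster near the origin.
rng = np.random.default_rng(0)
normal_feats = rng.normal(0.0, 1.0, size=(500, 8))
mu, cov_inv = fit_gaussian(normal_feats)

in_dist = mahalanobis_score(np.zeros(8), mu, cov_inv)
out_dist = mahalanobis_score(np.full(8, 6.0), mu, cov_inv)
```

Features far from the normal distribution receive higher scores and are flagged as anomalous.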

2.1 Main contributions

  • A denoising student encoder-decoder is proposed, trained to explicitly generate feature representations different from the teacher's when given anomalous inputs.
  • A segmentation network is used to adaptively fuse multi-level feature similarities, replacing empirical inference rules.
  • Extensive experiments are conducted on benchmark datasets to demonstrate the effectiveness of our method on various tasks.

2.2 Network Introduction: DeSTSeg

Synthetic anomaly images are generated and used during training. In the first step (a), the student network, given synthetic anomalous inputs, is trained to generate feature representations similar to those the teacher network extracts from the corresponding clean images. In the second step (b), the element-wise products of the normalized outputs of the student and teacher networks are concatenated and used to train the segmentation network. The segmentation output is the predicted anomaly score map.


3. Method

  • The proposed DeSTSeg consists of three main components: a pre-trained teacher network, a denoising student network, and a segmentation network.
  • Synthetic anomalies are introduced into normal training images and the model is trained in two steps.
  • In the first step, simulated abnormal images are used as input to the student network, while original clean images are used as input to the teacher network. The weights of the teacher network are fixed, but the student network for denoising is trainable.
  • In the second step, the student model is also fixed. Both the student network and the teacher network take the synthesized abnormal images as input to optimize the parameters in the segmentation network to locate the abnormal regions.
  • For inference, pixel-level anomaly maps are generated in an end-to-end fashion, and corresponding image-level anomaly scores can be computed by post-processing.
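The two-step procedure above can be sketched in PyTorch. The layer sizes, loss functions, and top-k post-processing below are illustrative assumptions, not the paper's exact configuration (which uses multi-level features and its own loss formulation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-ins for the three components; the real models are a pre-trained
# backbone (teacher), an encoder-decoder (denoising student), and a
# segmentation head. Shapes and layer sizes here are illustrative only.
teacher = nn.Conv2d(3, 8, 3, padding=1)          # frozen, pre-trained
student = nn.Conv2d(3, 8, 3, padding=1)          # denoising student
seg_net = nn.Conv2d(8, 1, 3, padding=1)          # segmentation head
for p in teacher.parameters():
    p.requires_grad_(False)

clean = torch.rand(2, 3, 32, 32)                 # anomaly-free images
anomalous = torch.rand(2, 3, 32, 32)             # synthetic-anomaly images
mask = (torch.rand(2, 1, 32, 32) > 0.9).float()  # synthetic anomaly mask

# Step 1: train the student to reproduce the teacher's *clean* features
# while itself seeing the anomalous input (the "denoising" objective).
opt1 = torch.optim.Adam(student.parameters(), lr=1e-3)
t_feat = teacher(clean)              # teacher sees the clean image
s_feat = student(anomalous)          # student sees the anomalous image
loss1 = 1 - F.cosine_similarity(s_feat, t_feat, dim=1).mean()
opt1.zero_grad()
loss1.backward()
opt1.step()

# Step 2: freeze the student; both networks see the anomalous image, and the
# element-wise product of their normalized features trains the segmentation net.
for p in student.parameters():
    p.requires_grad_(False)
opt2 = torch.optim.Adam(seg_net.parameters(), lr=1e-3)
fused = F.normalize(student(anomalous), dim=1) * F.normalize(teacher(anomalous), dim=1)
pred = seg_net(fused)
loss2 = F.binary_cross_entropy_with_logits(pred, mask)
opt2.zero_grad()
loss2.backward()
opt2.step()

# Inference: the segmentation output is the pixel anomaly map; an image-level
# score can be post-processed from it (here, the mean of the top-100 pixels).
score_map = torch.sigmoid(seg_net(fused))
image_score = score_map.flatten(1).topk(k=100, dim=1).values.mean(dim=1)
```

Freezing the teacher in step 1 and the student in step 2 keeps each stage's optimization target fixed, which mirrors the two-step schedule described above.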

3.1 Synthetic Anomaly Generation

The training of our model relies on synthetic anomaly images generated with the algorithm proposed in DRAEM. Random two-dimensional Perlin noise is generated and binarized with a preset threshold to obtain an anomaly mask M. An anomalous image is generated by replacing the masked regions with a linear combination of the anomaly-free image and an arbitrary image from an external data source A, where the opacity factor β is randomly chosen from [0.15, 1]:

$$I_A = \bar{M} \odot I + (1-\beta)\,(M \odot I) + \beta\,(M \odot A)$$

where $\odot$ denotes element-wise multiplication, $\bar{M}$ is the complement of the anomaly mask $M$, $I$ is the anomaly-free image, and $A$ is the arbitrary external image. Anomaly generation is performed online during training. This algorithm brings three advantages.

First, compared to drawing rectangular anomaly masks, the anomaly masks generated by random Perlin noise are more irregular and similar to the actual anomaly shape. Second, the images used as anomalous content A can be chosen arbitrarily without careful selection. Third, introducing an opacity factor β can be regarded as data augmentation to effectively increase the diversity of the training set.
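The generation procedure can be sketched in NumPy. Note that real Perlin noise is richer than the simple bilinear noise used as a stand-in here, and the function names and threshold value are illustrative assumptions:

```python
import numpy as np

def pseudo_perlin(h, w, scale=8, rng=None):
    """Smooth 2-D noise as a simplified stand-in for Perlin noise: low-res
    random values bilinearly upsampled to (h, w). A real implementation
    would use proper Perlin/fractal noise for more natural mask shapes."""
    rng = rng if rng is not None else np.random.default_rng()
    low = rng.random((scale, scale))
    ys = np.linspace(0, scale - 1, h)
    xs = np.linspace(0, scale - 1, w)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, scale - 1), np.minimum(x0 + 1, scale - 1)
    fy, fx = (ys - y0)[:, None], (xs - x0)[None, :]
    top = low[np.ix_(y0, x0)] * (1 - fx) + low[np.ix_(y0, x1)] * fx
    bot = low[np.ix_(y1, x0)] * (1 - fx) + low[np.ix_(y1, x1)] * fx
    return top * (1 - fy) + bot * fy

def synthesize_anomaly(image, anomaly_src, threshold=0.6, rng=None):
    """Blend an arbitrary source image into `image` under a binarized
    noise mask, with random opacity beta drawn from [0.15, 1]."""
    rng = rng if rng is not None else np.random.default_rng()
    h, w = image.shape[:2]
    mask = (pseudo_perlin(h, w, rng=rng) > threshold).astype(image.dtype)
    mask = mask[..., None]                       # broadcast over channels
    beta = rng.uniform(0.15, 1.0)
    blended = (1 - beta) * image + beta * anomaly_src
    return (1 - mask) * image + mask * blended, mask, beta

rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))
src = rng.random((64, 64, 3))
aug, mask, beta = synthesize_anomaly(img, src, rng=rng)
```

Pixels outside the mask stay identical to the original image; inside the mask, the opacity β controls how strongly the external content replaces the normal appearance.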

3.2 Denoising Student-Teacher Network

In previous multi-level knowledge distillation methods, the input to the student network (a normal image) is the same as the input to the teacher network, and the student network shares the teacher's architecture. In contrast, the proposed denoising student and teacher networks take pairs of anomalous and normal images as input, and the denoising student has a different, encoder-decoder architecture. The next two paragraphs examine the motivation for this design.

First, the optimization objective encourages the student network to generate feature representations that differ from the teacher's on anomalous inputs. We give the student a more direct goal: to reconstruct the normal feature representation of anomalous regions, supervised by the teacher network. Since the teacher is pre-trained on a large dataset, it generates discriminative feature representations in both normal and abnormal regions; therefore, during inference, the denoising student produces feature representations different from the teacher's on anomalous regions. Second, considering the feature reconstruction task, the student should not simply replicate the teacher's architecture: the lower layers of a CNN capture local information such as texture and color, while the upper layers express global semantic information.

We adopt a full encoder-decoder as the architecture of the denoising student network. An alternative is to use the teacher network as the encoder and a reversed student network as the decoder; preliminary experiments show that the full encoder-decoder student performs better. One possible explanation is that the pre-trained teacher is usually trained on ImageNet for classification, so its last-layer encoded features lack sufficient information to reconstruct feature representations at all levels.
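The multi-level student-teacher discrepancy that feeds the segmentation stage can be sketched as follows. The averaging fusion at the end is a naive stand-in for the segmentation network (which DeSTSeg uses precisely to avoid such hand-crafted fusion), and all shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def multilevel_discrepancy(student_feats, teacher_feats, out_size):
    """Per-level discrepancy maps between student and teacher features:
    1 - cosine similarity at each spatial location, upsampled to a common
    size. In DeSTSeg these per-level maps are fed to a segmentation network
    instead of being fused by a fixed rule; the mean here is illustrative."""
    maps = []
    for s, t in zip(student_feats, teacher_feats):
        d = 1 - F.cosine_similarity(s, t, dim=1, eps=1e-6)   # (B, H, W)
        d = F.interpolate(d.unsqueeze(1), size=out_size,
                          mode="bilinear", align_corners=False)
        maps.append(d)
    stacked = torch.cat(maps, dim=1)         # (B, L, H_out, W_out)
    return stacked, stacked.mean(dim=1)      # per-level maps + naive fusion

# Toy multi-level features at three resolutions.
s_feats = [torch.rand(1, c, r, r) for c, r in [(16, 32), (32, 16), (64, 8)]]
t_feats = [torch.rand(1, c, r, r) for c, r in [(16, 32), (32, 16), (64, 8)]]
per_level, fused = multilevel_discrepancy(s_feats, t_feats, out_size=(64, 64))
```

Regions where the student and teacher features diverge (low cosine similarity) receive high discrepancy values, which is exactly the signal the segmentation network learns to fuse adaptively.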


3.3 Segmentation Network


4. Experimental results


Origin blog.csdn.net/wzk4869/article/details/131315169