PatchCore principle and code interpretation

Paper: Towards Total Recall in Industrial Anomaly Detection

Code: GitHub - amazon-science/patchcore-inspection

Existing problems

A commonly used approach to unsupervised defect detection is to directly use representations from a model pre-trained on ImageNet, without any adaptation to the target distribution, as in PaDiM. Because such methods are non-adaptive, their reliability at deeper, more abstract layers is limited: the high-level abstract features learned on ImageNet correlate only weakly with the abstract features needed in an industrial setting. In addition, since the feature representations that can be extracted and stored are limited, the nominal context available to such methods at test time is also limited.

Innovations of this paper

In response to the above problems, this paper proposes a new unsupervised defect detection algorithm, PatchCore, with the following characteristics:

  1. Maximize the nominal information available during the testing phase
  2. Reduce bias on ImageNet data
  3. Maintain high inference speed

Specifically, this includes:

  1. Use locally aggregated, mid-level feature patches
    a. Shallow features carry too little abstract semantic information, while deep features are too strongly biased towards ImageNet data; mid-level features strike a good balance between detail, abstract semantics, and ImageNet bias.
    b. Feature aggregation over local neighborhoods preserves sufficient spatial context
  2. Introduce greedy coreset subsampling
    to significantly reduce storage memory and improve inference speed

Method and implementation

Locally aware patch features

First, in order to retain enough detail without biasing the extracted representation too strongly towards ImageNet, the author uses mid-level feature representations: for a ResNet backbone, the feature maps of the second and third stages (layer2 and layer3).

The author builds locally aware patch features by aggregating each feature vector with its local neighborhood. Writing \(\phi_{i,j}(h,w)\) for the feature vector at position \((h,w)\) of the layer-\(j\) feature map of image \(x_{i}\), the neighborhood of size \(p\times p\) around \((h,w)\) is

\[ \mathcal{N}^{(h,w)}_{p} = \left\{ (a,b)\ \middle|\ a\in\left[h-\lfloor p/2\rfloor,\dots,h+\lfloor p/2\rfloor\right],\ b\in\left[w-\lfloor p/2\rfloor,\dots,w+\lfloor p/2\rfloor\right] \right\} \]

Here \(\mathcal{N}^{(h,w)}_{p}\) represents a patch of size \(p\times p\) centered at position \((h,w)\) on the feature map; the paper takes \(p=3\). The locally aware feature at position \((h,w)\) is then

\[ \phi_{i,j}\big(\mathcal{N}^{(h,w)}_{p}\big) = f_{agg}\big(\{\phi_{i,j}(a,b)\mid (a,b)\in\mathcal{N}^{(h,w)}_{p}\}\big) \]

where \(f_{agg}\) is the aggregation function over the neighborhood feature vectors; adaptive average pooling is used in this paper.

The code for extracting neighborhood feature vectors is as follows. First, layer2 and layer3 are extracted from the pre-trained model. After preprocessing, the input sent to the model has size 224x224; assuming batch_size=2, the input is (2, 3, 224, 224), and the outputs of layer2 and layer3 are (2, 512, 28, 28) and (2, 1024, 14, 14) respectively. The patchify function then extracts the features in each local neighborhood, implemented with torch.nn.Unfold (for the usage of this function, see torch.nn.functional.unfold usage interpretation_00000cj's blog-CSDN blog). The difference from a PaDiM-style setting, where stride=patchsize so that patches do not overlap (for a 28x28 feature map, patch_size=2, stride=2, padding=0 gives a 14x14 output), is that here patch_size=3, stride=1, padding=1, so the output stays 28x28 and neighboring patches overlap. After patchify, the neighborhood feature representations produced from layer2 and layer3 have shapes (2, 784, 512, 3, 3) and (2, 196, 1024, 3, 3) respectively, where 784=28x28 and 196=14x14.

features = [features[layer] for layer in self.layers_to_extract_from]
# {'layer2': torch.Size([2, 512, 28, 28])
#  'layer3': torch.Size([2, 1024, 14, 14])}

features = [
    self.patch_maker.patchify(x, return_spatial_info=True) for x in features
]

import torch


class PatchMaker:
    def __init__(self, patchsize, stride=None):
        self.patchsize = patchsize  # 3
        self.stride = stride  # 1

    def patchify(self, features, return_spatial_info=False):
        """Convert a tensor into a tensor of respective patches.
        Args:
            x: [torch.Tensor, bs x c x w x h]
        Returns:
            x: [torch.Tensor, bs * w//stride * h//stride, c, patchsize,
            patchsize]
        """
        padding = int((self.patchsize - 1) / 2)  # 1
        unfolder = torch.nn.Unfold(
            kernel_size=self.patchsize, stride=self.stride, padding=padding, dilation=1
        )
        unfolded_features = unfolder(features)  # (2,512,28,28)->(2,4608,784)
        number_of_total_patches = []
        for s in features.shape[-2:]:  # [28,28]
            n_patches = (
                s + 2 * padding - 1 * (self.patchsize - 1) - 1
            ) / self.stride + 1
            number_of_total_patches.append(int(n_patches))  # [28,28]
        unfolded_features = unfolded_features.reshape(
            *features.shape[:2], self.patchsize, self.patchsize, -1
        )  # (2,512,3,3,784)
        unfolded_features = unfolded_features.permute(0, 4, 1, 2, 3)  # (2,784,512,3,3)

        if return_spatial_info:  # True
            return unfolded_features, number_of_total_patches
        return unfolded_features
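A quick shape check of patchify on a layer2-sized tensor (a usage sketch with a random input, not taken from the repository):

import torch

pm = PatchMaker(patchsize=3, stride=1)
x = torch.randn(2, 512, 28, 28)                       # layer2-like feature map
patches, spatial = pm.patchify(x, return_spatial_info=True)
print(patches.shape, spatial)                         # torch.Size([2, 784, 512, 3, 3]) [28, 28]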

The output of layer3 is then bilinearly interpolated to match the spatial resolution of layer2, and the batch_size dimension and the spatial dimensions are merged together, giving the features below, where 1568=2x28x28.

features = [x.reshape(-1, *x.shape[-3:]) for x in features]  # [(1568,512,3,3),(1568,1024,3,3)]
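The bilinear interpolation itself happens before this reshape. Below is a simplified, shape-equivalent sketch of that step (hypothetical variable names; the repository performs the same interpolation with a slightly different shuffling of dimensions): layer3's 14x14 patch grid is restored to spatial form, upsampled to 28x28, and flattened again.

import torch
import torch.nn.functional as F

feats3 = torch.randn(2, 196, 1024, 3, 3)             # layer3 patches: 196 = 14x14
b, n, c, p, _ = feats3.shape
feats3 = feats3.reshape(b, 14, 14, c * p * p)        # restore the 14x14 spatial grid
feats3 = feats3.permute(0, 3, 1, 2)                  # (2, 9216, 14, 14)
feats3 = F.interpolate(feats3, size=(28, 28), mode="bilinear", align_corners=False)
feats3 = feats3.permute(0, 2, 3, 1).reshape(b, 28 * 28, c, p, p)  # (2, 784, 1024, 3, 3)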

Feature aggregation is then performed with adaptive average pooling, i.e. the \(f_{agg}\) described above, so that each position \((h,w)\) on the pre-trained model's feature maps is mapped to a single representation of a preset dimension \(d\); here \(d=1024\).

The code is as follows

features = self.forward_modules["preprocessing"](features)  # (1568,2,1024)

import torch
import torch.nn.functional as F


class MeanMapper(torch.nn.Module):
    def __init__(self, preprocessing_dim):
        super(MeanMapper, self).__init__()
        self.preprocessing_dim = preprocessing_dim

    def forward(self, features):
        features = features.reshape(len(features), 1, -1)  # (1568,512,3,3)->(1568,1,4608)
        return F.adaptive_avg_pool1d(features, self.preprocessing_dim).squeeze(1)  # (1568,1,4608)->(1568,1024)


class Preprocessing(torch.nn.Module):
    def __init__(self, input_dims, output_dim):
        super(Preprocessing, self).__init__()
        self.input_dims = input_dims  # [512,1024]
        self.output_dim = output_dim  # 1024

        self.preprocessing_modules = torch.nn.ModuleList()
        for input_dim in input_dims:
            module = MeanMapper(output_dim)
            self.preprocessing_modules.append(module)

    def forward(self, features):  # [(1568,512,3,3),(1568,1024,3,3)]
        _features = []
        for module, feature in zip(self.preprocessing_modules, features):
            _features.append(module(feature))  # [(1568,1024),(1568,1024)]
        return torch.stack(_features, dim=1)  # (1568,2,1024)
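A usage sketch with random tensors (shapes as in the walkthrough above, hypothetical variable names):

import torch

prep = Preprocessing(input_dims=[512, 1024], output_dim=1024)
feats = [torch.randn(1568, 512, 3, 3), torch.randn(1568, 1024, 3, 3)]  # aggregated layer2/layer3 patches
out = prep(feats)
print(out.shape)  # torch.Size([1568, 2, 1024])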

In this way, the aggregated features from layer2 and layer3, [(1568, 512, 3, 3), (1568, 1024, 3, 3)], are preprocessed: each is adaptively average-pooled to 1024 dimensions and the two results are stacked together, giving output features of shape (1568, 2, 1024).

Further aggregation then yields an output of shape (1568, 1024).

features = self.forward_modules["preadapt_aggregator"](features)  # (1568,1024)

import torch
import torch.nn.functional as F


class Aggregator(torch.nn.Module):
    def __init__(self, target_dim):
        super(Aggregator, self).__init__()
        self.target_dim = target_dim  # 1024

    def forward(self, features):  # (1568,2,1024)
        """Returns reshaped and average pooled features."""
        # batchsize x number_of_layers x input_dim -> batchsize x target_dim
        features = features.reshape(len(features), 1, -1)  # (1568,1,2048)
        features = F.adaptive_avg_pool1d(features, self.target_dim)  # (1568,1,1024)
        return features.reshape(len(features), -1)  # (1568,1024)
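A matching usage sketch for the aggregator (random stand-in tensor, continuing the shape walkthrough):

import torch

agg = Aggregator(target_dim=1024)
out = agg(torch.randn(1568, 2, 1024))  # stacked layer2/layer3 representations
print(out.shape)  # torch.Size([1568, 1024])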

Coreset-reduced patch-feature memory bank

In the code above, batch_size=2 and the output for one batch is (1568, 1024), where 1568=2x784 and 784=28x28. The bottle category of the MVTec dataset has 209 training images in total, so the memory bank \(\mathcal{M}\) has shape (163856, 1024), where 163856=28x28x209. As the training set \(\mathcal{X}_{N}\) grows, \(\mathcal{M}\) grows with it, and so do the inference time and the storage cost, so \(\mathcal{M}\) usually needs to be reduced while preserving the nominal features encoded in it.

Random subsampling would lose useful information in \(\mathcal{M}\), so this paper reduces \(\mathcal{M}\) with coreset subsampling. Coreset selection aims to find a subset \(\mathcal{S}\subset \mathcal{A}\) such that solutions computed over \(\mathcal{A}\) can be approximated quickly using only \(\mathcal{S}\). What makes a good coreset depends on the problem; since PatchCore relies on nearest neighbor computations, the paper uses minimax facility location coreset selection to find the subset \(\mathcal{M}_{C}\). To reduce the selection time, a random linear projection \(\psi :\mathbb{R}^{d}\to\mathbb{R}^{d^{*}}\), \(d^{*}<d\), is applied to each element \(m\in\mathcal{M}\). The coreset is then built greedily, as follows.
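Each new coreset element is the candidate whose projected feature is farthest from its nearest already-selected element,

\[ m_{i} \leftarrow \underset{m\in\mathcal{M}\setminus\mathcal{M}_{C}}{\arg\max}\ \ \underset{n\in\mathcal{M}_{C}}{\min}\ \left\|\psi(m)-\psi(n)\right\|_{2} \]

and \(\mathcal{M}_{C}\) is grown one element at a time until the target size is reached.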

The implementation code is as follows, where percentage=0.1 means the memory bank is subsampled to one tenth of its original number of entries, and \(d^{*}=128\). To reduce memory usage, ApproximateGreedyCoresetSampler randomly selects 10 of the 163856 features as initial anchor points, so the distance matrix that has to be computed shrinks from 163856x163856 to 163856x10.

features = self.featuresampler.run(features)  # (16385, 1024) 

The coreset sampler called here is ApproximateGreedyCoresetSampler, which inherits from GreedyCoresetSampler and overrides the _compute_greedy_coreset_indices method. For ease of presentation, the run and _compute_batchwise_differences methods of GreedyCoresetSampler are copied into ApproximateGreedyCoresetSampler below.

from typing import Union

import numpy as np
import torch
import tqdm


class ApproximateGreedyCoresetSampler(GreedyCoresetSampler):
    def __init__(
        self,
        percentage: float,  # 0.1
        device: torch.device,  # cuda:0
        number_of_starting_points: int = 10,  # 10
        dimension_to_project_features_to: int = 128,  # 128
    ):
        """Approximate Greedy Coreset sampling base class."""
        self.number_of_starting_points = number_of_starting_points
        super().__init__(percentage, device, dimension_to_project_features_to)
    
    def run(
        self, features: Union[torch.Tensor, np.ndarray]
    ) -> Union[torch.Tensor, np.ndarray]:
        """Subsamples features using Greedy Coreset.

        Args:
            features: [N x D]
        """
        if self.percentage == 1:
            return features
        self._store_type(features)
        if isinstance(features, np.ndarray):
            features = torch.from_numpy(features)
        reduced_features = self._reduce_features(features)  # (163856, 1024) -> (163856, 128)
        sample_indices = self._compute_greedy_coreset_indices(reduced_features)  # (16385,)
        features = features[sample_indices]  # (16385, 1024)
        return self._restore_type(features)

    @staticmethod
    def _compute_batchwise_differences(
        matrix_a: torch.Tensor, matrix_b: torch.Tensor  # (163856, 128),(10,128)
    ) -> torch.Tensor:
        """Computes batchwise Euclidean distances using PyTorch."""
        # (163856,1,128).bmm(163856,128,1)->(163856,1,1)
        a_times_a = matrix_a.unsqueeze(1).bmm(matrix_a.unsqueeze(2)).reshape(-1, 1)  # (163856,1)
        # (10,1,128).bmm(10,128,1)->(10,1,1)
        b_times_b = matrix_b.unsqueeze(1).bmm(matrix_b.unsqueeze(2)).reshape(1, -1)  # (1,10)
        a_times_b = matrix_a.mm(matrix_b.T)  # (163856,10)

        return (-2 * a_times_b + a_times_a + b_times_b).clamp(0, None).sqrt()  # (163856,10)

    def _compute_greedy_coreset_indices(self, features: torch.Tensor) -> np.ndarray:
        """Runs approximate iterative greedy coreset selection.

        This greedy coreset implementation does not require computation of the
        full N x N distance matrix and thus requires a lot less memory, however
        at the cost of increased sampling times.

        Args:
            features: [NxD] input feature bank to sample.
        """
        number_of_starting_points = np.clip(
            self.number_of_starting_points, None, len(features)
        )  # 10
        start_points = np.random.choice(
            len(features), number_of_starting_points, replace=False  # 163856
        ).tolist()  # [61587, 130619, 91549, 30689, 32225, 130105, 25966, 96545, 31837, 4447]

        approximate_distance_matrix = self._compute_batchwise_differences(
            features, features[start_points]  # (163856,128),(10,128)
        )  # (163856,10)
        approximate_coreset_anchor_distances = torch.mean(
            approximate_distance_matrix, axis=-1
        ).reshape(-1, 1)  # torch.Size([163856]) -> torch.Size([163856, 1])
        coreset_indices = []
        num_coreset_samples = int(len(features) * self.percentage)  # 16385

        with torch.no_grad():
            for _ in tqdm.tqdm(range(num_coreset_samples), desc="Subsampling..."):
                select_idx = torch.argmax(approximate_coreset_anchor_distances).item()
                coreset_indices.append(select_idx)
                coreset_select_distance = self._compute_batchwise_differences(
                    features, features[select_idx : select_idx + 1]  # noqa: E203
                )  # (163856,128),(1,128)->(163856,1)
                approximate_coreset_anchor_distances = torch.cat(
                    [approximate_coreset_anchor_distances, coreset_select_distance],
                    dim=-1,
                )  # (163856,2)
                approximate_coreset_anchor_distances = torch.min(
                    approximate_coreset_anchor_distances, dim=1
                ).values.reshape(-1, 1)  # (163856)->(163856,1)

        return np.array(coreset_indices)  # (16385,)
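run() also calls _store_type, _restore_type and _reduce_features from the GreedyCoresetSampler base class, which are not shown above. _reduce_features applies the random linear projection \(\psi\) mentioned earlier; a minimal sketch of such a projection (assumed behaviour for illustration, not necessarily the repository's exact code):

import torch

def reduce_features(features: torch.Tensor, target_dim: int = 128) -> torch.Tensor:
    """Randomly project D-dimensional features down to target_dim dimensions."""
    if features.shape[1] <= target_dim:
        return features
    mapper = torch.nn.Linear(features.shape[1], target_dim, bias=False)  # random, untrained projection
    with torch.no_grad():
        return mapper(features)  # e.g. (163856, 1024) -> (163856, 128)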

Anomaly Detection with PatchCore

I do not fully understand this part of the original paper yet. The nearest neighbor search and distance computation in the official implementation directly call the third-party library faiss, and I am not very familiar with how faiss works internally; formula (7) also does not seem to be used in the implementation, so I will add this part once I understand it better.

Code

The memory bank \(\mathcal{M}\) obtained by coreset selection over the entire training set has shape (16385, 1024). It is then added to a faiss search index; the core code is the following two lines

search_index = faiss.IndexFlatL2(features.shape[-1])
search_index.add(features)
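For reference, faiss.IndexFlatL2 is an exact (brute-force) index over L2 distance: it stores the added vectors as-is, expects float32 numpy arrays, and search(x, k) returns the squared L2 distances and indices of the k nearest stored vectors. A self-contained toy example (random data, stand-in shapes):

import numpy as np
import faiss

memory_bank = np.random.rand(16385, 1024).astype(np.float32)  # stand-in for the coreset M_C
search_index = faiss.IndexFlatL2(memory_bank.shape[-1])
search_index.add(memory_bank)

queries = np.random.rand(1568, 1024).astype(np.float32)       # stand-in for test patch features
distances, indices = search_index.search(queries, 1)          # squared L2 distance and index of the 1-NN
print(distances.shape, indices.shape)                         # (1568, 1) (1568, 1)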

At test time, again assuming batch_size=2, the extracted aggregated features have shape (1568, 1024), where 1568=2x28x28. For each patch feature, the nearest neighbor in the memory bank \(\mathcal{M}\) and its distance are retrieved, and the distances are averaged over the retrieved neighbors (here only one) to obtain the patch anomaly scores; the code is as follows

query_distances, query_nns = search_index.search(query_features, 1)  # query_distances: (1568,1), query_nns: (1568,1)
anomaly_scores = np.mean(query_distances, axis=-1)  # (1568,)

The patch anomaly scores are reshaped to (2, 28, 28), and taking the maximum over the spatial dimensions gives the image-level anomaly score, shape (2,). Bilinear upsampling followed by Gaussian filtering then produces a (2, 224, 224) output mask, i.e. a per-pixel anomaly score for the whole image, which is used to segment anomalous regions.
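A sketch of this post-processing with stand-in data (the Gaussian sigma is an assumed value here; check the repository's RescaleSegmentor for the exact setting):

import numpy as np
import torch
import torch.nn.functional as F
from scipy import ndimage

anomaly_scores = np.random.rand(1568).astype(np.float32)   # stand-in for the patch scores above
patch_scores = anomaly_scores.reshape(2, 28, 28)           # 2 images, 28x28 patch grid
image_scores = patch_scores.reshape(2, -1).max(axis=-1)    # image-level scores, shape (2,)

maps = torch.from_numpy(patch_scores).unsqueeze(1)         # (2, 1, 28, 28)
maps = F.interpolate(maps, size=(224, 224), mode="bilinear", align_corners=False)
masks = np.stack([ndimage.gaussian_filter(m, sigma=4) for m in maps.squeeze(1).numpy()])
# masks: (2, 224, 224) per-pixel anomaly scores used to segment defect regions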

Experimental results


Origin blog.csdn.net/ooooocj/article/details/127834029