Deep learning paper: PaDiM: a Patch Distribution Modeling Framework for Anomaly Detection and Localization
PaDiM: a Patch Distribution Modeling Framework for Anomaly Detection and Localization
PDF: https://arxiv.org/pdf/2011.08785.pdf
PyTorch code: https:// github.com/shanglianlm0525/CvPytorch
PyTorch code: https://github.com/shanglianlm0525/PyTorch-Networks
1 Overview
PaDiM simultaneously detects and localizes anomalies in images during single-class learning. PaDiM utilizes a pre-trained Convolutional Neural Network (CNN) for patch embedding and a multivariate Gaussian distribution to obtain a probabilistic representation of the normal class. At the same time, the correlation between different semantic levels of CNN is used to better locate anomalies.
2 Patch Distribution Modeling(PaDiM)
PaDiM is an algorithm based on image patches. It relies on a pre-trained CNN feature extractor. Decompose the image into patches and extract embeddings from each patch using different feature extraction layers. The activation vectors of different levels are concatenated to obtain embedding vectors containing information of different semantic levels and resolutions. This helps to encode fine-grained and global context. However, since the generated embedding vectors may carry redundant information, random selection is used to reduce the dimensionality. A multivariate Gaussian distribution is generated for each patch embedding throughout the training batch. Therefore, for each patch of the training image set, we have a different multivariate Gaussian distribution. These Gaussian distributions are represented as a Gaussian parameter matrix. During inference, each patch location of the test image is scored using the Mahalanobis distance. It uses the inverse of the covariance matrix computed for the patches during training. The Mahalanobis distance matrix forms an anomaly map, with higher scores indicating anomalous regions.
3 PyTorch code
# !/usr/bin/env python
# -- coding: utf-8 --
# @Time : 2023/6/2 15:07
# @Author : liumin
# @File : run_PaDiM.py
import random
from random import sample
import argparse
import numpy as np
import os
import pickle
from tqdm import tqdm
from collections import OrderedDict
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import precision_recall_curve
from sklearn.covariance import LedoitWolf
from scipy.spatial.distance import mahalanobis
from scipy.ndimage import gaussian_filter
from skimage import morphology
from skimage.segmentation import mark_boundaries
import matplotlib.pyplot as plt
import matplotlib
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision.models import wide_resnet50_2, resnet18
import os
# import tarfile
from PIL import Image
from tqdm import tqdm
# import urllib.request
import torch
from torch.utils.data import Dataset
from torchvision import transforms as T
random.seed(1024)
torch.manual_seed(1024)
torch.cuda.manual_seed_all(1024)
# URL = 'ftp://guest:[email protected]/mvtec_anomaly_detection/mvtec_anomaly_detection.tar.xz'
CLASS_NAMES = ['bottle', 'cable', 'capsule', 'carpet', 'grid',
'hazelnut', 'leather', 'metal_nut', 'pill', 'screw',
'tile', 'toothbrush', 'transistor', 'wood', 'zipper']
class MVTecDataset(Dataset):
def __init__(self, dataset_path='/home/liumin/data/mvtec_ad/', class_name='bottle', is_train=True,
resize=256, cropsize=224):
assert class_name in CLASS_NAMES, 'class_name: {}, should be in {}'.format(class_name, CLASS_NAMES)
self.dataset_path = dataset_path
self.class_name = class_name
self.is_train = is_train
self.resize = resize
self.cropsize = cropsize
# self.mvtec_folder_path = os.path.join(root_path, 'mvtec_anomaly_detection')
# download dataset if not exist
# self.download()
# load dataset
self.x, self.y, self.mask = self.load_dataset_folder()
# set transforms
self.transform_x = T.Compose([T.Resize(resize, Image.ANTIALIAS),
T.CenterCrop(cropsize),
T.ToTensor(),
T.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])])
self.transform_mask = T.Compose([T.Resize(resize, Image.NEAREST),
T.CenterCrop(cropsize),
T.ToTensor()])
def __getitem__(self, idx):
x, y, mask = self.x[idx], self.y[idx], self.mask[idx]
x = Image.open(x).convert('RGB')
x = self.transform_x(x)
if y == 0:
mask = torch.zeros([1, self.cropsize, self.cropsize])
else:
mask = Image.open(mask)
mask = self.transform_mask(mask)
return x, y, mask
def __len__(self):
return len(self.x)
def load_dataset_folder(self):
phase = 'train' if self.is_train else 'test'
x, y, mask = [], [], []
img_dir = os.path.join(self.dataset_path, self.class_name, phase)
gt_dir = os.path.join(self.dataset_path, self.class_name, 'ground_truth')
img_types = sorted(os.listdir(img_dir))
for img_type in img_types:
# load images
img_type_dir = os.path.join(img_dir, img_type)
if not os.path.isdir(img_type_dir):
continue
img_fpath_list = sorted([os.path.join(img_type_dir, f)
for f in os.listdir(img_type_dir)
if f.endswith('.png')])
x.extend(img_fpath_list)
# load gt labels
if img_type == 'good':
y.extend([0] * len(img_fpath_list))
mask.extend([None] * len(img_fpath_list))
else:
y.extend([1] * len(img_fpath_list))
gt_type_dir = os.path.join(gt_dir, img_type)
img_fname_list = [os.path.splitext(os.path.basename(f))[0] for f in img_fpath_list]
gt_fpath_list = [os.path.join(gt_type_dir, img_fname + '_mask.png')
for img_fname in img_fname_list]
mask.extend(gt_fpath_list)
assert len(x) == len(y), 'number of x and y should be same'
return list(x), list(y), list(mask)
def denormalization(x):
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])
x = (((x.transpose(1, 2, 0) * std) + mean) * 255.).astype(np.uint8)
return x
def embedding_concat(x, y):
B, C1, H1, W1 = x.size()
_, C2, H2, W2 = y.size()
s = int(H1 / H2)
x = F.unfold(x, kernel_size=s, dilation=1, stride=s)
x = x.view(B, C1, -1, H2, W2)
z = torch.zeros(B, C1 + C2, x.size(2), H2, W2)
for i in range(x.size(2)):
z[:, :, i, :, :] = torch.cat((x[:, :, i, :, :], y), 1)
z = z.view(B, -1, H2 * W2)
z = F.fold(z, kernel_size=s, output_size=(H1, W1), stride=s)
return z
def plot_fig(test_img, scores, gts, threshold, save_dir, class_name):
num = len(scores)
vmax = scores.max() * 255.
vmin = scores.min() * 255.
for i in range(num):
img = test_img[i]
img = denormalization(img)
gt = gts[i].transpose(1, 2, 0).squeeze()
heat_map = scores[i] * 255
mask = scores[i]
mask[mask > threshold] = 1
mask[mask <= threshold] = 0
kernel = morphology.disk(4)
mask = morphology.opening(mask, kernel)
mask *= 255
vis_img = mark_boundaries(img, mask, color=(1, 0, 0), mode='thick')
fig_img, ax_img = plt.subplots(1, 5, figsize=(12, 3))
fig_img.subplots_adjust(right=0.9)
norm = matplotlib.colors.Normalize(vmin=vmin, vmax=vmax)
for ax_i in ax_img:
ax_i.axes.xaxis.set_visible(False)
ax_i.axes.yaxis.set_visible(False)
ax_img[0].imshow(img)
ax_img[0].title.set_text('Image')
ax_img[1].imshow(gt, cmap='gray')
ax_img[1].title.set_text('GroundTruth')
ax = ax_img[2].imshow(heat_map, cmap='jet', norm=norm)
ax_img[2].imshow(img, cmap='gray', interpolation='none')
ax_img[2].imshow(heat_map, cmap='jet', alpha=0.5, interpolation='none')
ax_img[2].title.set_text('Predicted heat map')
ax_img[3].imshow(mask, cmap='gray')
ax_img[3].title.set_text('Predicted mask')
ax_img[4].imshow(vis_img)
ax_img[4].title.set_text('Segmentation result')
left = 0.92
bottom = 0.15
width = 0.015
height = 1 - 2 * bottom
rect = [left, bottom, width, height]
cbar_ax = fig_img.add_axes(rect)
cb = plt.colorbar(ax, shrink=0.6, cax=cbar_ax, fraction=0.046)
cb.ax.tick_params(labelsize=8)
font = {
'family': 'serif',
'color': 'black',
'weight': 'normal',
'size': 8,
}
cb.set_label('Anomaly Score', fontdict=font)
fig_img.savefig(os.path.join(save_dir, class_name + '_{}'.format(i)), dpi=100)
plt.close()
def load_model(args, device):
# load model
if args.arch == 'resnet18':
model = resnet18(pretrained=True, progress=False)
t_d = 448
d = 100
elif args.arch == 'wide_resnet50_2':
model = wide_resnet50_2(pretrained=True, progress=False)
t_d = 1792
d = 550
model.to(device)
model.eval()
return model, t_d, d
def train(args, model, device, idx):
# set model's intermediate outputs
outputs = []
def hook(module, input, output):
outputs.append(output)
model.layer1[-1].register_forward_hook(hook)
model.layer2[-1].register_forward_hook(hook)
model.layer3[-1].register_forward_hook(hook)
os.makedirs(os.path.join(args.save_path, 'temp_%s' % args.arch), exist_ok=True)
for class_name in CLASS_NAMES:
train_dataset = MVTecDataset(args.data_path, class_name=class_name, is_train=True)
train_dataloader = DataLoader(train_dataset, batch_size=32, pin_memory=True)
train_outputs = OrderedDict([('layer1', []), ('layer2', []), ('layer3', [])])
# extract train set features
train_feature_filepath = os.path.join(args.save_path, 'temp_%s' % args.arch, 'train_%s.pkl' % class_name)
for (x, _, _) in tqdm(train_dataloader, '| feature extraction | train | %s |' % class_name):
# model prediction
with torch.no_grad():
_ = model(x.to(device))
# get intermediate layer outputs
for k, v in zip(train_outputs.keys(), outputs):
train_outputs[k].append(v.cpu().detach())
# initialize hook outputs
outputs = []
for k, v in train_outputs.items():
train_outputs[k] = torch.cat(v, 0)
# Embedding concat
embedding_vectors = train_outputs['layer1']
for layer_name in ['layer2', 'layer3']:
embedding_vectors = embedding_concat(embedding_vectors, train_outputs[layer_name])
# randomly select d dimension
embedding_vectors = torch.index_select(embedding_vectors, 1, idx)
# calculate multivariate Gaussian distribution
B, C, H, W = embedding_vectors.size()
embedding_vectors = embedding_vectors.view(B, C, H * W)
mean = torch.mean(embedding_vectors, dim=0).numpy()
cov = torch.zeros(C, C, H * W).numpy()
I = np.identity(C)
for i in range(H * W):
# cov[:, :, i] = LedoitWolf().fit(embedding_vectors[:, :, i].numpy()).covariance_
cov[:, :, i] = np.cov(embedding_vectors[:, :, i].numpy(), rowvar=False) + 0.01 * I
# save learned distribution
train_outputs = [mean, cov]
with open(train_feature_filepath, 'wb') as f:
pickle.dump(train_outputs, f)
def val(args, model, device, idx):
# set model's intermediate outputs
outputs = []
def hook(module, input, output):
outputs.append(output)
model.layer1[-1].register_forward_hook(hook)
model.layer2[-1].register_forward_hook(hook)
model.layer3[-1].register_forward_hook(hook)
total_roc_auc = []
total_pixel_roc_auc = []
fig, ax = plt.subplots(1, 2, figsize=(20, 10))
fig_img_rocauc = ax[0]
fig_pixel_rocauc = ax[1]
for class_name in CLASS_NAMES:
test_dataset = MVTecDataset(args.data_path, class_name=class_name, is_train=False)
test_dataloader = DataLoader(test_dataset, batch_size=32, pin_memory=True)
test_outputs = OrderedDict([('layer1', []), ('layer2', []), ('layer3', [])])
train_feature_filepath = os.path.join(args.save_path, 'temp_%s' % args.arch, 'train_%s.pkl' % class_name)
print('load train set feature from: %s' % train_feature_filepath)
with open(train_feature_filepath, 'rb') as f:
train_outputs = pickle.load(f)
gt_list = []
gt_mask_list = []
test_imgs = []
# extract test set features
for (x, y, mask) in tqdm(test_dataloader, '| feature extraction | test | %s |' % class_name):
test_imgs.extend(x.cpu().detach().numpy())
gt_list.extend(y.cpu().detach().numpy())
gt_mask_list.extend(mask.cpu().detach().numpy())
# model prediction
with torch.no_grad():
_ = model(x.to(device))
# get intermediate layer outputs
for k, v in zip(test_outputs.keys(), outputs):
test_outputs[k].append(v.cpu().detach())
# initialize hook outputs
outputs = []
for k, v in test_outputs.items():
test_outputs[k] = torch.cat(v, 0)
# Embedding concat
embedding_vectors = test_outputs['layer1']
for layer_name in ['layer2', 'layer3']:
embedding_vectors = embedding_concat(embedding_vectors, test_outputs[layer_name])
# randomly select d dimension
embedding_vectors = torch.index_select(embedding_vectors, 1, idx)
# calculate distance matrix
B, C, H, W = embedding_vectors.size()
embedding_vectors = embedding_vectors.view(B, C, H * W).numpy()
dist_list = []
for i in range(H * W):
mean = train_outputs[0][:, i]
conv_inv = np.linalg.inv(train_outputs[1][:, :, i])
dist = [mahalanobis(sample[:, i], mean, conv_inv) for sample in embedding_vectors]
dist_list.append(dist)
dist_list = np.array(dist_list).transpose(1, 0).reshape(B, H, W)
# upsample
dist_list = torch.tensor(dist_list)
score_map = F.interpolate(dist_list.unsqueeze(1), size=x.size(2), mode='bilinear',
align_corners=False).squeeze().numpy()
# apply gaussian smoothing on the score map
for i in range(score_map.shape[0]):
score_map[i] = gaussian_filter(score_map[i], sigma=4)
# Normalization
max_score = score_map.max()
min_score = score_map.min()
scores = (score_map - min_score) / (max_score - min_score)
# calculate image-level ROC AUC score
img_scores = scores.reshape(scores.shape[0], -1).max(axis=1)
gt_list = np.asarray(gt_list)
fpr, tpr, _ = roc_curve(gt_list, img_scores)
img_roc_auc = roc_auc_score(gt_list, img_scores)
total_roc_auc.append(img_roc_auc)
print('image ROCAUC: %.3f' % (img_roc_auc))
fig_img_rocauc.plot(fpr, tpr, label='%s img_ROCAUC: %.3f' % (class_name, img_roc_auc))
# get optimal threshold
gt_mask = np.asarray(gt_mask_list)
precision, recall, thresholds = precision_recall_curve(gt_mask.flatten(), scores.flatten())
a = 2 * precision * recall
b = precision + recall
f1 = np.divide(a, b, out=np.zeros_like(a), where=b != 0)
threshold = thresholds[np.argmax(f1)]
# calculate per-pixel level ROCAUC
fpr, tpr, _ = roc_curve(gt_mask.flatten(), scores.flatten())
per_pixel_rocauc = roc_auc_score(gt_mask.flatten(), scores.flatten())
total_pixel_roc_auc.append(per_pixel_rocauc)
print('pixel ROCAUC: %.3f' % (per_pixel_rocauc))
fig_pixel_rocauc.plot(fpr, tpr, label='%s ROCAUC: %.3f' % (class_name, per_pixel_rocauc))
save_dir = args.save_path + '/' + f'pictures_{
args.arch}'
os.makedirs(save_dir, exist_ok=True)
plot_fig(test_imgs, scores, gt_mask_list, threshold, save_dir, class_name)
print('Average ROCAUC: %.3f' % np.mean(total_roc_auc))
fig_img_rocauc.title.set_text('Average image ROCAUC: %.3f' % np.mean(total_roc_auc))
fig_img_rocauc.legend(loc="lower right")
print('Average pixel ROCUAC: %.3f' % np.mean(total_pixel_roc_auc))
fig_pixel_rocauc.title.set_text('Average pixel ROCAUC: %.3f' % np.mean(total_pixel_roc_auc))
fig_pixel_rocauc.legend(loc="lower right")
fig.tight_layout()
fig.savefig(os.path.join(args.save_path, 'roc_curve.png'), dpi=100)
def parse_args():
parser = argparse.ArgumentParser('PaDiM')
parser.add_argument('--data_path', type=str, default='/home/liumin/data/mvtec_ad')
parser.add_argument('--save_path', type=str, default='./result')
parser.add_argument('--arch', type=str, choices=['resnet18', 'wide_resnet50_2'], default='wide_resnet50_2')
return parser.parse_args()
if __name__ == '__main__':
args = parse_args()
# device setup
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model, t_d, d = load_model(args, device)
idx = torch.tensor(sample(range(0, t_d), d))
train(args, model, device, idx)
val(args, model, device, idx)
Deep learning paper: WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation
WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation
PDF: https://arxiv.org/pdf/2303.14814.pdf
PyTorch code: https:// github.com/shanglianlm0525/CvPytorch
PyTorch code: https://github.com/shanglianlm0525/PyTorch-Networks
1 Overview
2
深度学习论文: A Zero-/Few-Shot Anomaly Classification and Segmentation Method for CVPR 2023 VAND Workshop Challenge Tracks 1&2: 1st Place on Zero-shot AD and 4th Place on Few-shot AD
A Zero-/Few-Shot Anomaly Classification and Segmentation Method for CVPR 2023 VAND Workshop Challenge Tracks 1&2: 1st Place on Zero-shot AD and 4th Place on Few-shot AD
PDF: https://arxiv.org/pdf/2305.17382.pdf
PyTorch代码: https://github.com/shanglianlm0525/CvPytorch
PyTorch代码: https://github.com/shanglianlm0525/PyTorch-Networks
1 Overview
To address the wide diversity of product types in industrial visual inspection, we construct a single model that can quickly adapt to numerous categories and requires little or no normal reference images, providing a more efficient solution for industrial visual inspection. A solution to zero/few-shot tracking for the 2023 VAND challenge is proposed.
1) In the zero-shot task, the proposed solution adds an additional linear layer on top of the CLIP model to map image features to a joint embedding space, which enables it to compare with text features and generate anomaly maps.
2) When a reference image is available (few-shot), the proposed solution utilizes multiple memory banks to store reference image features and compares them with the query image at test time.
In this challenge, our method achieved first place in zero-shot tracking and performed well on segmentation, improving the F1 score by 0.0489 over the second-placed competitor. In few-shot tracking, we achieved fourth place in overall ranking and first place in classification F1 score.
Core points:
- Prompt integration using state and template to make text prompts.
- To localize abnormal regions, an additional linear layer is introduced to map the image features extracted from the CLIP image encoder to the linear space where the text features reside.
- The similarity between the mapped image features and text features is compared to obtain corresponding anomaly maps.
- In few-shot, the extra linear layers of the zero-shot stage are retained and their weights are maintained. In addition, an image encoder is used in the test phase to extract features of reference images and save them into memory banks for comparison with features of test images.
- In order to make full use of shallow and deep features, the features of different stages of the image encoder are utilized simultaneously.
2 Methodology
Overall, we adopt the overall framework of CLIP for zero-shot classification and use a combination of state and template collections to build our text prompts. To localize abnormal regions in images, we introduce additional linear layers to map image features extracted from the CLIP image encoder into the linear space where text features reside. We then perform similarity comparisons on the mapped image features and text features to obtain corresponding anomaly maps. For the few-shot case, we keep the extra linear layers of the zero-shot stage and keep their weights. Furthermore, we use an image encoder to extract the features of a reference image and save them to a memory bank, where they are compared with those of a test image during the testing phase. It should be noted that in order to fully utilize shallow and deep features, we use features from different stages in both zero-shot and few-shot settings.
2-1 Zero-shot AD
Anomaly Classification
is based on the WinCLIP anomaly classification framework. We propose a text prompt integration strategy, which significantly improves the anomaly classification accuracy of Baseline without using complex multi-scale window strategies. Specifically, the integration strategy includes two parts, template-level and state-level:
1) The state-level text prompt uses general text to describe normal or abnormal targets (such as flawless, damaged), without using "chip around edge and corner";
2) template-level text prompt, the proposed scheme screened 85 templates for ImageNet in CLIP, and removed "a photo of the weird [obj.]" etc. Not applicable Templates for anomaly detection tasks.
These two text cues will be extracted as final text features by CLIP's text encoder: F t ∈ R 2 × C F_{t} \in R^{2 \times C}Ft∈R2 x C. _
The corresponding image features are:F c ∈ R 1 × C F_{c} \in R^{1 \times C}Fc∈R1 x C. _
An integrated implementation of state-level and template-level, using the CLIP text encoder to extract text features, and averaging the normal and abnormal features separately. Finally, compare the average values of the normal and abnormal features with the image features, and get the abnormal category probability after softmax as the classification score
s = softmax ( F c F t T ) s = softmax(F_{c}F_{t}^ {T})s=softmax(FcFtT)
finally selectssThe second dimension of s as a result of the anomaly detection classification problem.
Anomaly Segmentation
compares the image-level anomaly classification method to anomaly segmentation. A natural idea is to measure the similarity between the different levels of features extracted by Backbone and the text features. However, the CLIP model is designed based on a classification scheme, i.e., apart from the abstract image features used for classification, no other image features are mapped to a unified image/text space. Therefore, we propose a simple but effective solution to this problem: use an additional linear layer to map different levels of image features into the image/text joint embedding space, that is, linear layer to map patch_tokens, and then based on each patch_token and The text features are used for similarity calculations to obtain anomaly maps. , see the blue Zero-shot Anomaly Map process in the figure above. Specifically, the features of different levels are jointly embedded in the feature space transformation through a linear layer, and the transformed features are compared with the text features to obtain abnormal maps of different levels. Finally, simply sum the anomaly graphs of different levels to obtain the final result.
The training of the Linear Layer (the parameters of the CLIP part are frozen) uses focal loss and dice loss.
2-2 Few-shot AD
Anomaly Classification
For the few-shot setting, the anomaly prediction of the image comes from two parts. The first part is the same as the zero-shot setup. The second part follows the general approach used in many AD methods, considering the maximum value of the anomaly map. The proposed scheme adds these two parts as the final anomaly score.
The Anomaly Segmentation
few-shot segmentation task uses a memory bank, as shown in the yellow background in Figure 1.
To put it bluntly, it is to query the sample and the supporting samples in the memory bank to do cosine similarity, then get the anomaly map through reshape, and finally add it to the anomaly map obtained by zero-shot to get the final segmentation prediction.
In addition, in the few-shot task, the linear layer mentioned above in fine-tune is not used, but the weights trained in the zero-shot task are directly used.
3 Experiments
In short, zero-shot and few-shot are similar in simpler images, but few-shot will improve when facing difficult tasks.