Application of Transformers in visual inspection

In recent years, Transformers have shone in the field of computer vision, and the works bringing Transformers into object detection and instance segmentation are too numerous to list. In 2020, DETR (DEtection TRansformer) introduced a new object detection paradigm that inspired many subsequent works.

Introduction to DETR

DETR structure

The structure of DETR is shown in the figure above. The image first passes through a CNN backbone to produce a feature map; after positional information is added, the feature map is fed to the transformer encoder. A set of trainable object queries then performs cross-attention with the encoder output in the transformer decoder, and the outputs are passed through FFNs to directly produce bounding boxes and class scores, with no post-processing such as NMS, achieving true end-to-end detection.
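
As a quick illustration of this end-to-end behavior, the official DETR repo exposes pretrained models through torch.hub; the snippet below is a minimal, hedged usage sketch (it assumes network access to the facebookresearch/detr repository, and the image tensor is a stand-in for a properly normalized input):

import torch

model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
model.eval()

img = torch.randn(1, 3, 800, 1066)          # stand-in for a normalized input image
with torch.no_grad():
    out = model(img)
# out['pred_logits']: (1, 100, 92) class scores for 100 object queries (91 COCO classes + "no object")
# out['pred_boxes']:  (1, 100, 4)  normalized (cx, cy, w, h) boxes -- no NMS is applied or needed
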
The transformer structure used in DETR

The transformer structure used in DETR is shown in the figure above. It consists of two parts, an encoder and a decoder: the encoder is mainly self-attention and an FFN repeated N times, and the decoder is self-attention, cross-attention, and an FFN repeated M times (both N and M are 6 in the paper). The "add & norm" blocks in the figure are the residual connection and the LayerNorm layer. The transformer decouples the spatial-level and channel-level computation that a CNN entangles (attention handles the spatial mixing, the FFN handles the channel mixing). Some people believe that this decoupled architecture, rather than the attention module itself, is the real reason for the transformer's success (which may also be the motivation behind Google's MLP-Mixer work).
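
A tiny sketch of this spatial/channel decoupling (illustrative shapes only, not DETR's actual modules):

import torch
import torch.nn as nn

x = torch.randn(1, 50, 256)                       # (batch, tokens = H*W, channels)
attn = nn.MultiheadAttention(256, 8, batch_first=True)
ffn = nn.Sequential(nn.Linear(256, 2048), nn.ReLU(), nn.Linear(2048, 256))

y, _ = attn(x, x, x)   # each output token is a weighted mix over all tokens: spatial mixing
z = ffn(y)             # applied independently to every token: channel mixing
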
Compared with traditional object detection algorithms, DETR abandons feature-map-based box regression and instead treats object detection directly as a set prediction problem, decoupling the model's predictions from a strict binding to positions on the feature map.
Feature-map-based box regression: the panels correspond to the ideas of YOLO, R-CNN, CenterNet, and CornerNet respectively

Each DETR output is matched one-to-one with the ground truth by bipartite matching

During training, each DETR output is matched to the ground truth with the Hungarian algorithm, using classification and box regression costs. This associates each object query with at most one ground-truth object: matched outputs are treated as positive samples and unmatched outputs as negatives. In this way, the problem of multiple predictions corresponding to one ground-truth object in traditional detectors is avoided, which is what allows NMS to be removed. In the paper, setting the number of object queries to 100 or 300 is enough for good results, far fewer than the tens of thousands of outputs of single-stage detectors such as YOLO and CenterNet, and fewer than the thousands of outputs of two-stage detectors such as Faster R-CNN. Since the direct coupling between feature maps and ground truth is avoided, DETR can also be easily extended to panoptic segmentation, producing a unified output for things and stuff.
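
A minimal sketch of this bipartite matching step, assuming a simplified cost of classification probability plus L1 box distance (the actual DETR matcher also includes a generalized IoU term; the function name here is mine):

import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    # pred_logits: (num_queries, num_classes), pred_boxes: (num_queries, 4) normalized cxcywh
    # gt_labels: (num_gt,), gt_boxes: (num_gt, 4)
    prob = pred_logits.softmax(-1)
    cost_class = -prob[:, gt_labels]                       # (Q, G): higher class prob -> lower cost
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)     # (Q, G): L1 distance between boxes
    cost = cost_bbox + cost_class
    query_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return query_idx, gt_idx   # matched queries become positives, all others negatives
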
Therefore, the main advantages of DETR in industrial applications are easy to see:
1. In complex, crowded scenes, DETR can avoid the low-quality boxes caused by targets whose features are too close together. Two-stage algorithms such as Faster R-CNN can in theory also avoid such low-quality boxes, but DETR achieves the same effect as a single-stage algorithm, which is a real breakthrough.
2. DETR removes post-processing such as NMS, which simplifies the model's deployment code.
3. The attention module lets DETR model the relationship between any two points in the image, giving it a natural advantage over pure CNN models for slender objects (rails, trains, and so on).
4. It can be used directly for panoptic segmentation.
Of course, DETR also has drawbacks, such as very slow training (150 to 300 epochs) and poor small-object detection. Many follow-up papers focus on these two points and have done a lot of meaningful work.

Some points noticed while reading the DETR source code:
DETR does not use an FPN; it only uses the feature map of the last backbone stage, which may be why it is weak at small-object detection. Why not use multi-scale features? Not because they would be useless, but because the transformer's attention produces an attention-weight tensor of shape (B, N, HW, HW); a larger input feature map makes this matrix grow quadratically, and it no longer fits in GPU memory.
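
A back-of-the-envelope check of that quadratic growth, reading N as the number of attention heads (batch size, head count, and resolutions below are only illustrative):

B, heads = 2, 8
for H, W in [(25, 34), (50, 67), (100, 134)]:   # roughly strides 32 / 16 / 8 of an 800x1066 image
    hw = H * W
    gb = B * heads * hw * hw * 4 / 1e9          # fp32 attention weights of shape (B, heads, HW, HW)
    print(f'{H}x{W}: attention weights ~ {gb:.2f} GB')
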
The backbone and the transformer do not use the same learning rate: the backbone learning rate is 0.1 times that of the transformer. The paper does not explain why (an extra hyperparameter introduced without justification). My own view is that the transformer with bipartite matching is harder to train and more unstable (the matching between outputs and ground truth keeps changing), whereas DETR's backbone is pre-trained, so this is essentially transfer learning and the backbone learning rate needs to be kept relatively small.
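
In practice the two learning rates are just two optimizer parameter groups; a sketch along the lines of the official repo's setup (the toy model here is a stand-in for any DETR-style module whose backbone parameters contain "backbone" in their names):

import torch
import torch.nn as nn

model = nn.ModuleDict({"backbone": nn.Linear(8, 8), "transformer": nn.Linear(8, 8)})  # toy stand-in

param_dicts = [
    {"params": [p for n, p in model.named_parameters()
                if "backbone" not in n and p.requires_grad]},   # transformer & prediction heads
    {"params": [p for n, p in model.named_parameters()
                if "backbone" in n and p.requires_grad],
     "lr": 1e-5},                                               # backbone: 0.1x the base lr below
]
optimizer = torch.optim.AdamW(param_dicts, lr=1e-4, weight_decay=1e-4)
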
The gt bboxes that the DETR implementation in mmdetection regresses are normalized, and the width and height used for normalization are those of the resized input image, not the size after padding. If the image sizes fed in during training and inference differ significantly, or different preprocessing is used, the regressed boxes can be severely off at inference time.
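
For instance (illustrative numbers; the point is only which size is used as the divisor):

import torch

resized_h, resized_w = 750, 1000     # image size after resizing, before padding
padded_h, padded_w = 768, 1024       # size after padding to the batch shape -- NOT used below
gt_bboxes = torch.tensor([[100., 200., 300., 400.]])            # xyxy pixels on the resized image

factor = gt_bboxes.new_tensor([resized_w, resized_h, resized_w, resized_h])
gt_bboxes_norm = gt_bboxes / factor  # normalization uses the resized size, not the padded size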

# the query content (tgt) starts as zeros; only the learnable query_pos embedding carries information
tgt = torch.zeros_like(query_embed)
......
# memory is the encoder output; pos_embed / query_pos are the positional embeddings of key / query
hs = self.decoder(query=tgt, key=memory, memory_key_padding_mask=mask,
                  pos=pos_embed, query_pos=query_embed)

Of the object queries fed to the transformer decoder, only the query positional embedding (query_pos) is trainable; the query content (tgt in the code above) is re-initialized to zero every time, which makes the output of the decoder's first self-attention layer zero as well. I personally think object queries essentially act as trainable anchors. Object queries do not, and should not, carry prior information about the content of a specific image, so the query_pos that carries information such as location and size is trainable, while the query content is initialized to zero and only acquires genuinely useful information after cross-attention with the image features.

Some follow-up papers inspired by DETR

Object Detection

Deformable DETR

To address DETR's slow training and poor small-object detection, Deformable DETR proposes a deformable transformer structure. Deformable DETR argues that DETR trains slowly because the transformer's attention has to model dense global relationships over the feature map, so the model takes a long time to learn the truly meaningful sparse locations. Deformable DETR therefore abandons the idea of having every query interact with every key; instead, each query adaptively samples K points, whose features serve as the values weighted by predicted attention weights (when only 1 point is sampled, the deformable transformer degenerates into a 1×1 deformable convolution; when all points are sampled, it becomes an ordinary transformer).
Principle of Deformable DETR

As shown in the figure, Deformable DETR exploits feature maps at different levels: the sampled points of the deformable transformer are not restricted to a single layer but may come from feature maps of other levels. This is how Deformable DETR improves accuracy on small objects. In addition, Deformable DETR no longer regresses the gt bbox directly; it regresses an offset relative to a reference point to speed up convergence of the boxes:

# reference points are predicted from the query embeddings, in normalized [0, 1] coordinates
reference_points = self.reference_points(query_embed).sigmoid()
......
# the bbox head predicts an offset, added to the reference point in inverse-sigmoid (logit) space
tmp = self.bbox_embed[lid](output)
new_reference_points = tmp
new_reference_points[..., :2] = tmp[..., :2] + inverse_sigmoid(reference_points)
new_reference_points = new_reference_points.sigmoid()
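
For reference, the inverse_sigmoid used above is (roughly) the standard utility from the Deformable DETR codebase:

import torch

def inverse_sigmoid(x, eps=1e-5):
    x = x.clamp(min=0, max=1)
    return torch.log(x.clamp(min=eps) / (1 - x).clamp(min=eps))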

Deformable DETR solves DETR's weak small-object detection and overly long training schedule (it uses only 50 epochs, roughly two thirds fewer than DETR), but the deformable transformer, as a new operator, brings a rather big problem for deployment: if you want to use it in production, you may have to hand-write a TensorRT plugin for it.
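
To make the sampling idea more concrete, here is a minimal single-scale, single-head sketch of deformable attention; the names, offset scaling, and shapes are mine, and the real operator is multi-scale and multi-head (which is exactly why deployment needs a custom kernel):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    """Single-scale, single-head sketch: each query samples K points instead of attending to all keys."""
    def __init__(self, dim=256, n_points=4):
        super().__init__()
        self.n_points = n_points
        self.offset_proj = nn.Linear(dim, n_points * 2)   # (dx, dy) per sampled point
        self.weight_proj = nn.Linear(dim, n_points)       # one attention weight per sampled point
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, query, ref_points, feat):
        # query: (B, Nq, C); ref_points: (B, Nq, 2) in [0, 1]; feat: (B, C, H, W)
        B, Nq, C = query.shape
        offsets = self.offset_proj(query).view(B, Nq, self.n_points, 2)
        weights = self.weight_proj(query).softmax(-1)                          # (B, Nq, K)
        value = self.value_proj(feat.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)  # (B, C, H, W)

        locs = ref_points.unsqueeze(2) + 0.1 * offsets       # sample around the reference point
        grid = 2 * locs - 1                                   # grid_sample expects [-1, 1] coords
        sampled = F.grid_sample(value, grid, align_corners=False)              # (B, C, Nq, K)
        out = (sampled * weights.unsqueeze(1)).sum(-1).transpose(1, 2)         # (B, Nq, C)
        return self.out_proj(out)

attn = SimpleDeformableAttention()
out = attn(torch.randn(2, 100, 256), torch.rand(2, 100, 2), torch.randn(2, 256, 32, 32))

With n_points=1 and zero offsets this reduces to picking a single location per query, i.e. the 1×1 DCN degenerate case mentioned above.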

Conditional DETR

Regarding DETR's long training schedule, Conditional DETR observes that in the decoder the query has to match both the content embedding and the spatial embedding in the key simultaneously, so DETR relies heavily on high-quality content embeddings to localize the extremity regions of an object, and exactly these regions are the key to localizing and recognizing the object (in other words, the query has to take part in both the spatial localization and the classification of an instance). Conditional DETR therefore decouples the cross-attention in the original DETR decoder and proposes a conditional spatial embedding: the content embedding is only responsible for finding regions that match the object's appearance, without also having to match the spatial embedding, while the conditional spatial embedding explicitly localizes the extremity regions of the object and narrows the search range, as shown in the figure below.
Decoupling of content embedding and spatial embedding
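
The decoupling in the figure boils down to concatenating content and spatial parts instead of adding them, so the attention score splits cleanly into two terms; a small illustrative check with random embeddings (shapes are mine):

import torch

N, M, d = 100, 1024, 256
c_q, p_q = torch.randn(N, d), torch.randn(N, d)   # query content / spatial embeddings
c_k, p_k = torch.randn(M, d), torch.randn(M, d)   # key content / spatial embeddings

# DETR-style addition entangles content and spatial parts in four cross terms
attn_add = (c_q + p_q) @ (c_k + p_k).t()

# Conditional-DETR-style concatenation: content only attends to content, spatial to spatial
attn_cat = torch.cat([c_q, p_q], -1) @ torch.cat([c_k, p_k], -1).t()
assert torch.allclose(attn_cat, c_q @ c_k.t() + p_q @ p_k.t(), atol=1e-3)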

Similar to Deformable DETR, Conditional DETR no longer regresses the gt bbox directly but regresses an offset for it. With 50 epochs of training, Conditional DETR reaches results similar to DETR trained for 500 epochs, improving training speed. Since it also does not use a multi-scale FPN structure, Conditional DETR and DETR are on par for small-object detection.

Sparse R-CNN

Sparse R-CNN borrows DETR's set prediction idea to rework Faster R-CNN in a remarkable way. Sparse R-CNN removes Faster R-CNN's RPN and replaces it with learnable proposal boxes (very similar to the object queries in DETR), as shown in the figure below:
Sparse R-CNN structure

The input image goes through the backbone CNN and FPN to produce feature maps at different levels. N proposal boxes crop their regions of interest from these maps with RoI Align, and the cropped features are sent into a dynamic convolution structure where each one interacts, in a cross-attention-like way, with its corresponding one of N trainable proposal features, yielding N object categories and boxes. Sparse R-CNN then applies the Cascade R-CNN cascading idea, feeding the output categories and boxes back into the dynamic convolution for refinement, and finally obtains N object categories and boxes.
Dynamic convolution structure

Compared with DETR, Sparse R-CNN effectively splits the object queries into two parts: proposal boxes, used only to locate the instance, and proposal features, used only as a content prior for the instance. To some extent this decouples the object queries, which may be why Sparse R-CNN is easier to train. In addition, in DETR's decoder cross-attention the queries attend to every patch of the feature map, whereas in Sparse R-CNN the interaction is one-to-one (which saves a lot of GPU memory). Sparse R-CNN does not explicitly use a transformer, but proposal boxes and proposal features, dynamic convolution, and cascaded training all use information in a transformer-like way.
Sparse R-CNN trains much faster than DETR and reaches good accuracy, but it brings back the annoying RoI Align, so it has no particular advantage in deployment.
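
A rough sketch of the dynamic convolution interaction described above (a simplified, single-stage version; the class name and layer sizes are mine, not the exact Sparse R-CNN head):

import torch
import torch.nn as nn

class DynamicConvSketch(nn.Module):
    """Each proposal feature generates the weights of two 1x1 convs applied to its own RoI feature."""
    def __init__(self, dim=256, hidden=64):
        super().__init__()
        self.dim, self.hidden = dim, hidden
        self.param_gen = nn.Linear(dim, dim * hidden * 2)   # parameters for both dynamic layers
        self.norm1, self.norm2 = nn.LayerNorm(hidden), nn.LayerNorm(dim)
        self.act = nn.ReLU(inplace=True)

    def forward(self, roi_feat, proposal_feat):
        # roi_feat: (N, S*S, dim) RoI-aligned features, one per proposal box
        # proposal_feat: (N, dim) learnable proposal features, one per proposal box
        params = self.param_gen(proposal_feat)
        w1 = params[:, :self.dim * self.hidden].view(-1, self.dim, self.hidden)
        w2 = params[:, self.dim * self.hidden:].view(-1, self.hidden, self.dim)
        x = self.act(self.norm1(torch.bmm(roi_feat, w1)))   # (N, S*S, hidden): one-to-one interaction
        x = self.act(self.norm2(torch.bmm(x, w2)))          # (N, S*S, dim)
        return x.flatten(1)                                  # fed to the cls / reg heads downstream

head = DynamicConvSketch()
out = head(torch.randn(100, 49, 256), torch.randn(100, 256))   # (100, 49*256)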

Instance Segmentation / Panoptic Segmentation

SOLQ

The name SOLQ sounds a lot like SOLO, the single-stage instance segmentation algorithm from 2020, but in principle the algorithm is closer to a combination of Deformable DETR and DCT-Mask.
SOLQ principle

As shown in the figure above, SOLQ adds a DCT (discrete cosine transform) branch to the decoder of Deformable DETR (much like the mask head branch of Mask R-CNN) to predict the mask of each instance. Using DCT gives instance mask contours stronger robustness than pixel-by-pixel prediction.
DCT brings better robustness on instance contours
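
A hedged sketch of the DCT mask representation that such a branch regresses: resize the binary mask, take a 2-D DCT, and keep only low-frequency coefficients as the regression target. The block truncation below is a simplification (DCT-Mask orders coefficients zig-zag), and the sizes are illustrative:

import numpy as np
from scipy.fft import dctn, idctn

def encode_mask_dct(mask, n_coeffs=300):
    # mask: (R, R) binary array, already resized to a fixed resolution (e.g. 128x128)
    coeffs = dctn(mask.astype(np.float32), norm='ortho')
    k = int(np.ceil(np.sqrt(n_coeffs)))
    return coeffs[:k, :k].flatten()[:n_coeffs]          # low-frequency part -> regression target

def decode_mask_dct(vector, resolution=128, n_coeffs=300):
    k = int(np.ceil(np.sqrt(n_coeffs)))
    block = np.zeros(k * k, dtype=np.float32)
    block[:n_coeffs] = vector
    coeffs = np.zeros((resolution, resolution), dtype=np.float32)
    coeffs[:k, :k] = block.reshape(k, k)
    return idctn(coeffs, norm='ortho') > 0.5            # back to a binary mask

mask = np.zeros((128, 128)); mask[32:96, 40:88] = 1
recon = decode_mask_dct(encode_mask_dct(mask))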

Maskformer

Compared with SOLQ, Maskformer is more like a combination of DETR and SOLO. Maskformer uses queries to obtain the category and a mask kernel for each instance, and the mask kernel is convolved with the feature map to obtain each instance's mask (very much in the spirit of SOLO), as shown in the figure below:
Maskformer structure

Maskformer unifies semantic segmentation, instance segmentation, and panoptic segmentation, and the structure is easy to reproduce. But like DETR, Maskformer is hard to train; the paper trains for 300 epochs.
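
The "mask kernel convolved with the feature map" step that Maskformer (and K-Net below) rely on is just a per-query dot product with the per-pixel features; a minimal sketch with illustrative shapes:

import torch

B, N, C, H, W = 2, 100, 256, 64, 64
mask_kernels = torch.randn(B, N, C)     # one C-dim kernel per query / instance
pixel_feats = torch.randn(B, C, H, W)   # per-pixel feature (or embedding) map

# a 1x1 "convolution" of each kernel over the feature map is an einsum over the channel dim
masks = torch.einsum('bnc,bchw->bnhw', mask_kernels, pixel_feats).sigmoid()
print(masks.shape)   # torch.Size([2, 100, 64, 64])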

K-Net

Similar to Sparse R-CNN, K-Net does not use a transformer directly but borrows DETR's set prediction idea. K-Net passes a set of learnable object features (similar to object queries) through an FFN to obtain the mask kernel of each instance; the mask kernels are convolved with the feature map to get each instance's mask (again very much in the spirit of SOLO), and the resulting masks are then used to update the object features through the kernel update head. After several rounds of updates, the category and mask of each instance are obtained (the whole structure is really a transformer with the encoder removed), as shown in the figure below:
K-Net structure

The structure of the kernel update head is shown in the figure below. The predicted instance masks are matrix-multiplied with the feature map, and the result goes through Adaptive Kernel Update together with the object features (which to me looks very much like a transformer's cross-attention, but more memory-friendly); the object features then go through Kernel Interaction (essentially a transformer's self-attention). In the paper, only 3 kernel update heads and 100 object features are needed to reach state-of-the-art results on the reported benchmarks.
Kernel update head

Compared with Maskformer, K-Net performs its cross-attention-like interaction on a larger feature map yet uses less memory, and it trains faster (50 epochs).
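
The first step of the kernel update head described above, pooling the feature map under each predicted mask into one "group feature" per instance, can be sketched as follows (the real head fuses group features and kernels with gating and linear layers; the plain addition here is only a placeholder):

import torch

B, N, C, H, W = 2, 100, 256, 64, 64
masks = torch.randn(B, N, H, W).sigmoid()   # soft instance masks from the previous stage
feats = torch.randn(B, C, H, W)             # feature map
kernels = torch.randn(B, N, C)              # current per-instance kernels / object features

group_feats = torch.einsum('bnhw,bchw->bnc', masks, feats)   # mask-weighted pooling
updated_kernels = kernels + group_feats                       # placeholder for the gated fusion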


Origin: blog.csdn.net/blanokvaffy/article/details/121588773