2020 AI Competition Prize-Winning Solution Review Series (1): Object Detection Tricks from the Hualu Cup Illegal Advertisement Detection Competition

Preface

I took part in a series of competitions last year and got decent results: several domestic top-10 finishes and a Kaggle silver medal. My dream is to one day become a Kaggle GM and a first-class data scientist. One regret is that after those competitions I never sat down to analyze and learn from the excellent ideas and tricks that appeared in them. I don't want to miss the chance to absorb some of the top teams' luck (欧气), so I plan to write a series recording what we can learn from the prize-winning solutions. After all, only what you digest becomes your own.

I only record the tricks and models that are genuinely new. These competitions are all from 2020, so many of the methods are quite recent: the models include YOLOv5, EfficientDet, DoubleU-Net, HRNet and Vision Transformer, and the tricks include adaptive anchors, Noisy Student, auto augmentation, and so on. Interested yet? I hope you take something away from reading this.
(A newbie soaking up the luck)
The first post covers the first competition of 2020: an object detection competition for illegal advertisements. The illegal advertisements fall into three categories: multi-signs (one store with multiple signboards), window-covering ads, and non-traffic signs. The task is to locate and classify them with object detection.

First, the competition link. If you did not take part, you can read the problem statement first and then watch the solution-sharing session.

Competition portal

Our team finished in the top 10 with an ensemble of YOLOv5 and EfficientDet. Limited by our experience at the time, the tricks we used were fairly standard; I will mention the points from our own solution worth highlighting at the end. Now let's look at what the top teams have to teach us.

6th place (1 point of luck absorbed)

1. Object detection => multi-head

The detection task can be split into classification + detection: a binary classification model judges whether an image is pure background, and the detection model localizes the foreground boxes on top of that classification result. (Writing this, it struck me that this is essentially forcing any model into a two-stage pipeline, much like the idea behind Faster R-CNN; these ideas all connect.) In practice you can either use a single model with multiple heads or train a separate classification model and detection model; which works better has to be decided by experiment. In this competition the 6th-place team chose the multi-head route.

Where to put the extra head: the contestant used an EfficientDet model and attached a binary classification head after the BiFPN layers, finishing sixth. The competition metric is 0.2*ACC + 0.8*mAP, where ACC only measures foreground/background accuracy, so under this metric I think the binary head directly strengthens that component and should give a positive effect. I often see this multi-head idea in Kaggle competitions; it is worth studying whether it is better to train each head independently or to train the whole network end to end. The alternating training scheme from the Faster R-CNN paper could be a reference here (I have not verified this in detail).
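To make the multi-head idea concrete, here is a minimal PyTorch sketch (not the contestant's code): a shared backbone + BiFPN feeds the usual detection head plus an extra binary head that predicts whether the image contains any foreground at all. The module names and the choice of pooling the finest feature level are my own assumptions.

```python
import torch
import torch.nn as nn

class MultiHeadDetector(nn.Module):
    """Minimal multi-head sketch: a shared neck feeds both the detection head
    and an extra binary head that predicts foreground vs. pure background."""

    def __init__(self, backbone_neck, det_head, feat_channels=256):
        super().__init__()
        self.backbone_neck = backbone_neck   # e.g. EfficientNet backbone + BiFPN
        self.det_head = det_head             # the usual box/class detection head
        # hypothetical binary head: global-pool one BiFPN level -> 1 logit
        self.cls_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(feat_channels, 1),
        )

    def forward(self, images):
        feats = self.backbone_neck(images)        # list of multi-scale feature maps
        det_out = self.det_head(feats)
        has_object_logit = self.cls_head(feats[0])  # assume feats[0] is the finest level
        return det_out, has_object_logit

def multi_head_loss(det_loss, has_object_logit, has_object_target, w_cls=1.0):
    # total loss = detection loss + weighted binary "is there any ad" loss
    bce = nn.functional.binary_cross_entropy_with_logits(
        has_object_logit.squeeze(-1), has_object_target.float())
    return det_loss + w_cls * bce
```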

The expert comment for this team: for window-covering ads, whether an ad is illegal can flip depending on where it sits relative to the window, so localization quality matters a lot; how to localize better is worth further thought.

5th place (3 points of luck absorbed)

1. Architecture analysis

This team's architecture is as follows:

| backbone | neck | RPN | head |
| --- | --- | --- | --- |
| ResNet-101 + deformable conv | FPN | RPN | Cascade R-CNN |

What is worth learning in this architecture is the use of deformable convolution in the last two stages of the backbone, which I see as playing a similar role to dilated convolution. Dilated convolution enlarges the receptive field, but when using it you need an HDC-style (hybrid dilated convolution) design to avoid the gridding/checkerboard effect, for example stacking dilation rates such as 1, 2, 5 as in the ResNet variants. Deformable convolution instead adapts the receptive field on its own, avoiding the hard-to-control search for the best dilation schedule. FPN lets the model predict at multiple scales and in particular improves small-object recognition. Cascade R-CNN has performed well in many competitions and is one of the must-try baselines.
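As a rough illustration of dropping deformable convolution into the last backbone stages, here is a sketch built on torchvision's DeformConv2d; the offset-predicting conv and its zero initialization follow the usual DCN recipe, but this is not the contestant's implementation.

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    """Sketch: swap a 3x3 conv in the last backbone stages for a deformable conv;
    the offsets are predicted by a plain conv, as in the DCN paper."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        # 2 offsets (dx, dy) per position of the 3x3 kernel -> 18 channels
        self.offset_conv = nn.Conv2d(in_ch, 2 * 3 * 3, kernel_size=3, padding=1)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=3, padding=1)
        nn.init.zeros_(self.offset_conv.weight)   # start close to a regular conv
        nn.init.zeros_(self.offset_conv.bias)

    def forward(self, x):
        offset = self.offset_conv(x)
        return self.deform_conv(x, offset)

# usage idea: replace the 3x3 convs of ResNet's conv4/conv5 stages with DeformableBlock
```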

2. Imbalance between classes

For the class imbalance, this team weighted the loss. They did not explain it in detail, but said it gave a large improvement. I also found in the competition that if class imbalance is not handled, the AP of the under-represented classes ends up very low. Our team took another approach, which I will mention at the end. The weights can be set by watching the loss or by computing the class ratios in the GT; they are a tunable hyperparameter.
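A minimal sketch of loss weighting by inverse class frequency; the GT counts below are made-up values for illustration, and the final weights are a hyperparameter to tune.

```python
import torch
import torch.nn as nn

# hypothetical GT box counts per class, e.g. [multi-signs, window-covering, non-traffic]
gt_counts = torch.tensor([5200.0, 900.0, 400.0])
class_weights = gt_counts.sum() / (len(gt_counts) * gt_counts)  # inverse frequency
class_weights = class_weights / class_weights.mean()            # normalise around 1

# plug into the classification branch of the detector's loss
cls_criterion = nn.CrossEntropyLoss(weight=class_weights)
```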

3. Extraction of global information

The contestants did not give much detail here, but the idea is this: a two-stage model first obtains ROIs through the RPN and then performs detection on the ROIs, so some global information gets ignored. In this competition the influence of the background is clearly large: window ads obviously depend on the position of the window, and multi-sign ads influence each other.

So the team extracted a global feature map (reportedly with dilated convolutions) and superimposed it on the ROI features, strengthening the ROI's access to global information.
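Since the team gave no details, the following is only my guess at what "superimposing a global feature map on the ROI" could look like: a small dilated-conv branch summarizes the whole feature map, and its pooled vector is concatenated onto every ROI feature. All module and parameter names here are hypothetical.

```python
import torch
import torch.nn as nn

class GlobalContextROIHead(nn.Module):
    """Speculative sketch: add a pooled global-context vector to each ROI feature."""

    def __init__(self, roi_channels=256, ctx_channels=256):
        super().__init__()
        self.context = nn.Sequential(
            nn.Conv2d(roi_channels, ctx_channels, 3, padding=2, dilation=2),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, feature_map, roi_feats, roi_batch_idx):
        # feature_map: (N, C, H, W); roi_feats: (num_rois, C, 7, 7)
        # roi_batch_idx: (num_rois,) index of the image each ROI came from
        ctx = self.context(feature_map)                     # (N, C, 1, 1)
        ctx = ctx[roi_batch_idx]                            # (num_rois, C, 1, 1)
        ctx = ctx.expand(-1, -1, roi_feats.size(2), roi_feats.size(3))
        # the box head's input channels would grow accordingly
        return torch.cat([roi_feats, ctx], dim=1)
```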

4. Other ideas

  1. Through EDA the team found the box aspect ratios were quite varied, so they increased the number of base anchors.
  2. OHEM: the idea is simple. After computing the loss, sort the samples in the batch and keep feeding the hardest (highest-loss) ones back into training, so the model "sees" more difficult samples and gets better at recognizing them (a minimal sketch follows).
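A minimal OHEM sketch, assuming the loss is computed per sample with reduction='none':

```python
import torch

def ohem_loss(per_sample_losses, keep_ratio=0.7):
    """Online Hard Example Mining: keep only the hardest fraction of the batch
    and back-propagate their mean loss."""
    num_keep = max(1, int(per_sample_losses.numel() * keep_ratio))
    hard_losses, _ = torch.topk(per_sample_losses, num_keep)
    return hard_losses.mean()

# usage: losses = criterion(logits, targets)   # criterion built with reduction='none'
#        loss = ohem_loss(losses, keep_ratio=0.7); loss.backward()
```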

Expert comment: the experts kept asking how to compensate for the gap between the training-set and test-set distributions, but the team had not considered it. I do not understand this point well either; if any reader does, please leave a comment.
The experts also recommended distinguishing non-traffic signs from the other advertisement types more explicitly.

4th place (4 points of luck absorbed)

1. Imbalance between classes

For the class imbalance, this team used online class re-sampling, tackling the imbalance from the sampling side.
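One common way to implement online class re-sampling in PyTorch is a WeightedRandomSampler. The sketch below assumes one label per image, which is a simplification for a detection dataset (an image can contain several classes) and not necessarily what the team did.

```python
import torch
from collections import Counter
from torch.utils.data import WeightedRandomSampler, DataLoader

def make_balanced_loader(dataset, image_labels, batch_size=8):
    """Sample rarer classes more often so each epoch sees a roughly balanced mix."""
    counts = Counter(image_labels)
    weights = torch.tensor([1.0 / counts[lbl] for lbl in image_labels],
                           dtype=torch.double)
    sampler = WeightedRandomSampler(weights,
                                    num_samples=len(image_labels),
                                    replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```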

2. ATSS adaptive anchor assignment

The reasoning behind this trick: the team first looked at the GT statistics and found many different anchor aspect ratios, so the first instinct is to add more anchors. But they felt that approach has several drawbacks:

  1. It further increases the proportion of negative samples and worsens the positive/negative imbalance.
  2. It increases model complexity.
  3. Clustering anchors with traditional k-means easily risks overfitting. So the team replaced anchor clustering with the ATSS algorithm.
    ATSS comes from a study that compares RetinaNet (a representative anchor-based detector) and FCOS (a representative center-point anchor-free detector), aiming to close the accuracy gap between anchor-based and anchor-free methods. Through ablation experiments the authors gradually eliminate the differences between the two and conclude that how positive and negative samples are defined is the main source of the gap.

ATSS first picks, on each pyramid level, the anchors whose centers are closest to the GT box center; it then "automatically" derives a threshold as the mean plus the standard deviation of these candidates' IoUs with the GT, and candidates above the threshold (with centers inside the GT) become positives. The mean and variance adapt to the data: when the average IoU of a batch of candidate anchors is high, it means they generally overlap the GT well, so the threshold naturally rises to keep only the better positives. ATSS has a single parameter k, the number of candidate anchors per level, and experiments show it is quite robust (so the method is effectively hyperparameter-free); the best value in the paper is 9.
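To make the threshold rule concrete, here is a much-simplified single-GT sketch of the ATSS assignment (the real algorithm picks the top-k candidates per pyramid level and also checks that the anchor center lies inside the GT box):

```python
import torch
from torchvision.ops import box_iou

def atss_assign(anchors, anchor_centers, gt_box, gt_center, k=9):
    """Simplified ATSS for one GT: take the k centre-closest anchors,
    set the IoU threshold to mean + std of their IoUs, and mark anchors
    above it as positive."""
    dists = (anchor_centers - gt_center).pow(2).sum(dim=1).sqrt()   # (A,)
    candidate_idx = dists.topk(k, largest=False).indices            # (k,)
    ious = box_iou(anchors[candidate_idx], gt_box.unsqueeze(0)).squeeze(1)
    thr = ious.mean() + ious.std()        # the "automatic" adaptive threshold
    positive = candidate_idx[ious >= thr]
    return positive
```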
Below are the paper and a good analysis:

1. ATSS paper address
2. Zhihu analysis recommendation
3. ATSS source code implementation

I recommend reading the paper together with the Chinese analysis on Zhihu.

There is a lot to discuss here: the differences, connections and commonalities between two-stage and one-stage, anchor-based and anchor-free. Writing this, it occurred to me: if ATSS were combined with a multi-head design, ATSS could close the anchor-based/anchor-free gap while the multi-head borrows the strengths of two-stage models. Could that push accuracy further?

3. The "keep only half" multi-signs strategy

This is a strategy the team came up with by looking at the data carefully, and I really admire this kind of insight. Multi-signs is the "one store, multiple signboards" category of illegal advertising: these signboards appear in pairs or larger groups, and the paired signboards share some similarity in content, but that similarity is hard to exploit directly. As a result, other types of illegal ads, or perfectly legal shop signs, are often misclassified as multi-signs.
On this basis the team tried a very creative training trick: fill one of the paired signboards with a solid color. This forces the model to learn the content-level associations that define multi-signs, and reduces the chance of normal shop signs or unrelated billboards being recognized as multi-signs.
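A hedged sketch of how such an augmentation might be written (my own reconstruction, not the team's code); note that the corresponding GT annotation would also need to be updated, which is omitted here.

```python
import random
import numpy as np

def mask_one_of_pair(image, multi_sign_boxes, fill=(114, 114, 114), p=0.5):
    """With probability p, fill one of the paired multi-sign boxes with a solid
    colour so the model cannot rely on seeing both signs at once.
    `multi_sign_boxes` are [x1, y1, x2, y2] boxes of the multi-sign class;
    the matching GT label would also need to be dropped/adjusted (not shown)."""
    if len(multi_sign_boxes) >= 2 and random.random() < p:
        x1, y1, x2, y2 = map(int, random.choice(multi_sign_boxes))
        image[y1:y2, x1:x2] = np.array(fill, dtype=image.dtype)
    return image
```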

4. SAC: switchable atrous convolution

[Figure: SAC module diagram]

This module comes from DetectoRS. As I understand it, it extracts multi-scale features through atrous (dilated) convolutions with different dilation rates and then fuses them, a bit like the parallel branches in GoogLeNet, with an average-pooling-based attention acting as the "switch" between branches. Paper and a Zhihu analysis first:

Paper address
Zhihu analysis recommendation
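For intuition, here is a rough SAC-style sketch: two parallel 3x3 convs with different dilation rates, blended by an average-pooling-based switch. The real DetectoRS module shares weights between the branches and adds global context before and after the conv, which I omit here.

```python
import torch.nn as nn

class SwitchableAtrousConv(nn.Module):
    """Rough SAC-style sketch (after DetectoRS), not the paper's exact formulation."""

    def __init__(self, channels):
        super().__init__()
        self.conv_r1 = nn.Conv2d(channels, channels, 3, padding=1, dilation=1)
        self.conv_r3 = nn.Conv2d(channels, channels, 3, padding=3, dilation=3)
        self.switch = nn.Sequential(          # average-pooling based soft switch
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        s = self.switch(x)                    # (N, 1, 1, 1) switch weight
        return s * self.conv_r1(x) + (1 - s) * self.conv_r3(x)
```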

5. Other ideas

  1. Normal images (pure background, no ads) were added to training to further improve the judgment of whether an image contains illegal advertising at all; this reportedly helped the team climb quite a bit.
  2. For the multi-scale nature of the GT boxes, a stacked BiFPN structure was used to strengthen feature fusion across scales.

Expert comment: was any further work done on the mismatch between the training-set and test-set distributions? The team said they had not considered it.

3rd place (2 points of luck absorbed)

1. Data augmentation

This team used a lot of data augmentation, including:

  1. GridMask
    GridMask is a data augmentation method designed after analyzing the problems of Cutout, which easily deletes either an entire object or nothing at all. The authors argue that avoiding both excessive deletion and excessive retention of continuous regions is the core problem of this family of methods: delete too much and the original information is destroyed, producing noisy data; delete too little and the augmentation does nothing for robustness.

GridMask builds a (0,1) grid mask at the same resolution as the original image and multiplies it with the image, achieving a regularization effect similar to Cutout.
With GridMask the authors report an improvement over AutoAugment (about 1.4 points) on ImageNet.

Below is an earlier Zhihu analysis of the GridMask algorithm, worth a look.

Zhihu analysis recommendation

This team went one step further in the spirit of GridMask: they measured the width and height of the smallest GT box in this competition and constrained the mask so it could never cover an entire GT.
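A simplified GridMask sketch with the contestant's tweak folded in: the dropped-square side is capped below the smallest GT side (min_gt_side), so a single grid cell cannot wipe out a GT box. The real GridMask also randomizes the grid offset and rotation, which I leave out.

```python
import numpy as np

def grid_mask(image, d=64, ratio=0.4, min_gt_side=None):
    """Drop square patches on a regular grid of period d.
    `min_gt_side` caps the dropped-square side below the smallest GT box side."""
    h, w = image.shape[:2]
    drop = int(d * ratio)
    if min_gt_side is not None:
        drop = min(drop, max(1, min_gt_side - 1))
    mask = np.ones((h, w), dtype=image.dtype)
    for y in range(0, h, d):
        for x in range(0, w, d):
            mask[y:y + drop, x:x + drop] = 0
    return image * mask[..., None]
```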

  2. Auto augment
    The team used the auto-augment policy that Google searched on the COCO dataset, which involves reinforcement learning and can help alleviate the distribution shift between the training and test sets.
    This topic is deep and I do not yet understand it well, so I will not comment much here; the domain experts, however, valued the benefit brought by reinforcement learning highly.

2. Other ideas

  1. Cascade R-CNN as the baseline, with a ResNeSt-101 backbone.
  2. Deformable convolution to adaptively extract the receptive field; many teams adopted this.
  3. Because the classes are hard to tell apart, attention mechanisms (SE, CBAM, etc.) were used to strengthen fine-grained feature extraction. Notably, the team did not add attention inside the backbone but after the ROI, where it refines the ROI features.
  4. Given the similarity between the paired targets of multi-signs, a Feature Similarity Check module was introduced. The team did not explain it in detail; presumably it computes the cosine distance between target features, keeps the highest-confidence detection, compares the remaining detections against it, and removes those below a threshold (a rough sketch follows the list). (Personally I think this is hard to operate in practice, and the judges raised the same issue: the multiple signboards of one store do not necessarily have highly similar content.)
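Below is my guess at what such a Feature Similarity Check could look like; `feats` are per-detection ROI feature vectors and the threshold is hypothetical.

```python
import torch
import torch.nn.functional as F

def similarity_check(feats, scores, sim_thr=0.5):
    """Speculative sketch: take the highest-scoring multi-sign detection as the
    reference, compare the others' features to it with cosine similarity, and
    keep only those above the threshold (plus the reference itself)."""
    ref_idx = scores.argmax()
    ref = feats[ref_idx]
    sims = F.cosine_similarity(feats, ref.unsqueeze(0), dim=1)
    keep = (sims >= sim_thr) | (torch.arange(len(scores)) == ref_idx)
    return keep   # boolean mask over the detections
```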

2nd place (2 points of luck absorbed)

1. Imbalance between classes

It turns out that different teams adopted different solutions to the class-imbalance problem: loss weighting and online re-sampling were mentioned above. The 2nd-place solution is to do offline data augmentation only for the under-represented image classes, expanding their samples. At the end I will also mention the data-balancing scheme our own team came up with.

2. Other ideas

  1. The team found that some targets in the dataset overlap, so mixup was introduced to alleviate the overlap problem. (Which sounds like the usual excuse for using mixup, er, I mean, original motivation, whatever.)
  2. Res2Net as the backbone: it has performed well in many competitions and is worth testing, as is SENet.
  3. FCOS architecture + GCNet attention + deformable convolution.
  4. WBF (Weighted Boxes Fusion), a box-fusion alternative to NMS, was used (a usage sketch follows the list).
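A WBF usage sketch, assuming the open-source `ensemble-boxes` package (pip install ensemble-boxes); boxes must be normalized to [0, 1] and the numbers below are toy values.

```python
from ensemble_boxes import weighted_boxes_fusion

# toy predictions from two models for one image
boxes_list  = [[[0.10, 0.10, 0.40, 0.40]], [[0.12, 0.11, 0.42, 0.39]]]
scores_list = [[0.9], [0.7]]
labels_list = [[1], [1]]

# fuse overlapping boxes instead of suppressing them as NMS would
boxes, scores, labels = weighted_boxes_fusion(
    boxes_list, scores_list, labels_list,
    weights=[1, 1], iou_thr=0.55, skip_box_thr=0.01)
```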

The experts commented on the possibility of attaching an OCR network and recommended a paper on text recognition under deformation; the paper is below. My understanding of OCR is shallow, so I will not analyze it blindly:

Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes

1st place (3 points of luck absorbed)

I wrote this post while re-watching and translating the live defense video and going back over the papers and materials. By this point I have been writing for six or seven hours and typed about 7k words. Tiring as it is, let's keep our spirits up and focus on what the champion solution can teach us:

1. Difficulty analysis

This team did a very good job of analyzing the difficulties of the competition; this way of thinking is worth learning.

First, the team pointed out one of the key difficulties: current object detection networks judge the category mostly from the features inside the box and are little affected by the background. General classification tasks are likewise robust to background changes: a cat does not become a dog just because it moves from a table to the grass. But the deadly point of this competition is that both whether an ad is illegal and which class it belongs to (multi-signs, window-covering, non-traffic signs) depend heavily on the background.

For example, window-covering ads depend on the windows in the background; multi-signs are defined by their relative positions, a relationship between boxes; non-traffic signs mostly appear along roads; and so on. None of these can be judged from the in-box features alone.

I think this line of thinking is very reasonable.

To address this, the team's approach was to enlarge each ROI by 1.5x after extraction, so that more background information is included to assist the judgment.
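A minimal sketch of the 1.5x ROI enlargement (my own reconstruction; in a real two-stage detector this would be applied to the proposal boxes before ROI Align):

```python
import torch

def expand_rois(rois, scale=1.5, image_w=None, image_h=None):
    """Enlarge each [x1, y1, x2, y2] ROI around its centre by `scale`,
    clipping to the image if sizes are given."""
    x1, y1, x2, y2 = rois.unbind(dim=1)
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * scale, (y2 - y1) * scale
    out = torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=1)
    if image_w is not None and image_h is not None:
        out[:, [0, 2]] = out[:, [0, 2]].clamp(0, image_w)
        out[:, [1, 3]] = out[:, [1, 3]].clamp(0, image_h)
    return out
```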

2. Imbalance between classes

Here I will just note that this team handled the class imbalance with focal loss, a very common choice.
But if you add it up, the top six alone used three different ideas for handling class imbalance; this kind of replay is quite enjoyable.
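For reference, a plain binary focal loss sketch (the standard formulation, not the team's exact code):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss over per-anchor, one-vs-all targets in {0, 1}:
    easy examples are down-weighted so rare classes and hard anchors dominate."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```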

3. Other ideas

  1. Multi-head vs. a separate classification + detection dual-model setup for splitting the classification and detection tasks; the team finally chose multi-head.
  2. ATSS + ResNet-50 + FPN.
  3. SEPC (scale-equalizing pyramid convolution).
  4. GIoU loss (a minimal sketch follows the list).
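And a minimal GIoU loss sketch for [x1, y1, x2, y2] boxes:

```python
import torch

def giou_loss(pred, target, eps=1e-7):
    """GIoU loss: IoU minus the fraction of the enclosing box not covered by
    the union; the loss is 1 - GIoU. `pred` and `target` are paired (N, 4)."""
    ix1, iy1 = torch.max(pred[:, 0], target[:, 0]), torch.max(pred[:, 1], target[:, 1])
    ix2, iy2 = torch.min(pred[:, 2], target[:, 2]), torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / (union + eps)
    ex1, ey1 = torch.min(pred[:, 0], target[:, 0]), torch.min(pred[:, 1], target[:, 1])
    ex2, ey2 = torch.max(pred[:, 2], target[:, 2]), torch.max(pred[:, 3], target[:, 3])
    enclose = (ex2 - ex1) * (ey2 - ey1)
    giou = iou - (enclose - union) / (enclose + eps)
    return (1 - giou).mean()
```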

Finally, here is the champion's own summary on CSDN:

Champion summary portal

Here is a chart of the champion team's score progression:

[Figure: the champion team's score progression]

There are still a few points I have not expanded on, such as auto augment, SEPC and SAC. I have written seven or eight thousand words today and really cannot write any more; I will cover them in detail later.

Origin: blog.csdn.net/weixin_36714575/article/details/113971782