2020 AI Competition Award-Winning Solutions Review Series (2): Tricks from a Remote Sensing Semantic Segmentation Competition (2020 Huawei Cloud Artificial Intelligence Competition)

Preface

This is a semantic segmentation competition I took part in last year, finishing in the top 3% (13/377). It was my first remote sensing image semantic segmentation competition, and I used it as an opportunity to start learning semantic segmentation. More than the final ranking, I care about what can be learned along the way. After last time's review of the illegal-advertisement object detection competition, I fully realized how valuable these post-mortems are, so I couldn't wait to review the remote sensing segmentation competition as well.

If you haven't read the review of the illegal-advertisement object detection competition, it is also worth a look: many of the models and tricks in the 2020 competitions are recent and touch on quite a few CVPR 2020 papers. Here is the link:

Illegal Advertising Target Detection Contest Portal

Back to this article. The task of this remote sensing segmentation competition is to segment all kinds of roads from high-resolution remote sensing satellite imagery. Some of these roads are broad avenues and some are narrow winding paths, so the images are full of fine detail. In the end our team's model scored 0.833 against the champion's 0.841, so the gap is not huge. I believe the main difficulty is balancing the model between the large, obvious targets and the very small paths. Looking back, I feel my thinking in the final sprint stage was flawed. I had three ideas at the time, all based on first training one large model (excellent on large roads) and then forcing it to learn details through fine-tuning: first, use Focal loss/OHEM during fine-tuning to force the model to learn details; second, use a small learning rate and batch size to force the model to attend to the details of each image; third, fine-tune with multi-scale inputs.

Facts proved that none of these three fine-tuning ideas worked very well, which is why I regretfully missed the top 10 by a small margin.

If I were to do it again, I would consider more of the following:

  1. FPN/BiFPN.
  2. Try more things at the remaining high-resolution stages, such as deformable convolution.
  3. Distill a small model from several large models, with some of the large models configured to attend to details and others to the overall structure.
  4. A trick based on OCRNet and the "Large Kernel Matters" paper that I found; I will mention it at the end.

Another difficulty is that this competition put tight constraints on the model's inference time and memory usage. I didn't measure FLOPs specifically, but ensembling models with large parameter counts was essentially impossible. Later it was announced that using TTA would be penalized, so we didn't use TTA (too honest). This did make me realize that model lightweighting deserves more attention, and that inference time and FLOPs have real practical value.

Okay, enough of my rambling; let's see how much luck we can absorb from the top-ranked teams!

Before we start, if you did not take part in this competition, please have a look at the contest problem first:

2020 Huawei Artificial Intelligence Contest Portal


Eighth place (absorbing one point of luck)

1. LinkNet (loss: dice + bce, weighted 1:1)

This competitor used LinkNet34 (ResNet34 backbone). I had paid little attention to LinkNet before, but several top teams used it this time, which suggests the network works quite well. With a stronger backbone such as SENet or Res2Net, this competitor could probably improve further.
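For reference, here is a minimal PyTorch sketch of a 1:1 dice + BCE combination for binary road masks. The smoothing constant and the commented segmentation_models_pytorch way of building LinkNet34 are my own assumptions, not the competitor's actual code.

```python
import torch
import torch.nn as nn

# One convenient way to get a LinkNet34 (assumption, not necessarily what the team did):
# import segmentation_models_pytorch as smp
# model = smp.Linknet(encoder_name="resnet34", encoder_weights="imagenet", classes=1)

class DiceBCELoss(nn.Module):
    """1:1 sum of soft Dice loss and BCE for binary masks (the model outputs raw logits)."""
    def __init__(self, smooth=1.0):
        super().__init__()
        self.bce = nn.BCEWithLogitsLoss()
        self.smooth = smooth

    def forward(self, logits, target):
        prob = torch.sigmoid(logits)
        inter = (prob * target).sum(dim=(1, 2, 3))
        union = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
        dice = (2.0 * inter + self.smooth) / (union + self.smooth)
        return self.bce(logits, target) + (1.0 - dice).mean()
```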

When competing, it pays to try more backbones to be safe. This time we only ran two branches, DeepLabv3+ and UNet, and spent most of our energy on heavy model modification and EDA. In later Kaggle competitions, for example, vision-transformer-based models dominated classification, and YOLOv5 dominated the Kaggle wheat detection competition (although it was later banned). So for different datasets the best baseline can be completely different. Reflecting on this, I think we should keep 4-5 reliable baselines ready in order to avoid losing at the starting line. For segmentation competitions, UNet / DeepLabv3+ / HRNet+OCR / LinkNet / SENet+FPN should obviously all be on that list.

Speaking of this, I recall chatting with a Kaggler a few days ago. He said that on their team some people focus on model design, some on complex augmentation, some on training, and so on. That division of labor is clearly more reasonable. Our team worked end-to-end, each person responsible for one whole route. In the early stage this is fine for learning more, but later on manpower should be allocated more deliberately. This has nothing to do with technology; it is just a reflection of ours.

LinkNet's architecture is very simple, simpler than UNet. Its main focus is on FLOPs and inference speed, aiming at real-time semantic segmentation. It has a variant, D-LinkNet, that is worth studying, especially since that paper targets remote sensing road segmentation. The paper link:

D-LinkNet: LinkNet with Pretrained Encoder and Dilated Convolution for High Resolution Satellite Imagery Road Extraction

A recommended D-LinkNet explainer on Zhihu

D-LinkNet is a network designed specifically for remote sensing road segmentation. The paper highlights three main difficulties of this type of problem:

  1. Remote sensing images have high resolution, and some roads span large distances, so a large receptive field is needed to cover the whole image.
  2. Some roads are very long and narrow, so precise localization information must be preserved to recover such small roads.
  3. Roads have natural connectivity.
    Recall that in this competition we also focused on these issues: we used dilated/deformable convolution to adapt the receptive field, and tried dilation, erosion and hole-removal post-processing to improve connectivity (a minimal sketch follows this list), but on problem 2, localizing small roads, we did not do well.

The dilated convolutions in D-LinkNet sit in the middle of the model, stacked in series to enlarge the receptive field while preserving resolution. Dilated convolution performs well whether the branches are connected in series or in parallel.

The architecture of D-LinkNet is shown in the figure below:
[Figure: D-LinkNet architecture]
D-LinkNet's dilated-convolution stack does not follow the HDC design from the TuSimple paper; it simply stacks dilation rates 1, 2, 4, 8, 16, which may cause a gridding (checkerboard) effect and hurt segmentation accuracy.

HDC can refer to the following analysis:

On HDC (Hybrid Dilated Convolution)

Also, look carefully at this figure: the middle part is very similar to UNet's bridge structure, which offers some guidance on how to design a UNet bridge. In several competitions I found that the UNet bridge structure also has a large impact on the final result. Here it is a stack of dilated convolutions plus skip connections (a minimal sketch follows below). Note that the original input is 1024, which is still fairly high resolution, so the network uses 32x downsampling. If the input is 512 or smaller, the positional information lost at 32x downsampling is much harder to recover.
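A minimal sketch of such a bridge, in the spirit of D-LinkNet's center block: serial dilated convolutions whose outputs are summed together with the input (additive skip connections). The channel count and dilation rates here are illustrative, not the exact paper configuration.

```python
import torch
import torch.nn as nn

class DilatedBridge(nn.Module):
    """Serial dilated convolutions with additive skip connections, D-LinkNet-style.
    Channels and dilation rates are illustrative choices."""
    def __init__(self, channels=512, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        ])

    def forward(self, x):
        feats = [x]                          # keep the input and every intermediate output
        for block in self.blocks:
            feats.append(block(feats[-1]))   # each block sees the previous block's output
        return torch.stack(feats, dim=0).sum(dim=0)  # sum all branches (skip connections)
```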

This team's final submission used 8-fold TTA, which also illustrates how lightweight the LinkNet family is.
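An 8-fold TTA usually means the dihedral group: four rotations combined with an optional horizontal flip. Here is a minimal sketch assuming a binary-segmentation model that outputs per-pixel logits on square tiles; whether this matches the team's exact transforms is an assumption.

```python
import torch

@torch.no_grad()
def tta_predict(model, image):
    """8-fold TTA: 4 rotations x optional horizontal flip, averaged in probability space.
    image: tensor of shape (B, C, H, W) with H == W (rotations assume square tiles)."""
    probs = []
    for flip in (False, True):
        x = torch.flip(image, dims=[3]) if flip else image
        for k in range(4):                       # 0, 90, 180, 270 degree rotations
            xk = torch.rot90(x, k, dims=[2, 3])
            p = torch.sigmoid(model(xk))
            p = torch.rot90(p, -k, dims=[2, 3])  # undo rotation
            if flip:
                p = torch.flip(p, dims=[3])      # undo flip
            probs.append(p)
    return torch.stack(probs).mean(dim=0)
```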

Seventh place (absorbing one point of luck)

[Figure: seventh-place architecture diagram]
Their architecture diagram comes first: a fusion of four models, which in itself shows that DeepLabv3+ is relatively lightweight. The part worth learning is their use of OCRNet, which is a new technique.

1. OCRNet

The core idea of OCRNet is to explicitly strengthen the contribution of pixels belonging to the same object class when building contextual information. It is an idea derived from the very definition of semantic segmentation.

Microsoft Research Asia identified the three biggest problems of current semantic segmentation models:

  1. Loss of localization information caused by repeated downsampling.
  2. The receptive field of pixel-level features is insufficient, while objects appear at multiple scales.
  3. Boundary errors: boundary features have relatively weak representational power, and many segmentation errors occur at boundaries.

and accordingly proposed:

  1. HRnet: used to maintain high resolution
  2. OCRnet: used to enhance pixel contextual semantic information
  3. SegFix: Used to solve the problem of boundary errors.

The figure below shows the architecture of OCRNet:
[Figure: OCRNet architecture]
For HRNet/OCRNet, I plan to write a separate series of source-code analyses, so I won't go into detail here.
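Still, to give a feel for the idea before that series, here is a heavily simplified sketch of OCR-style pixel-region attention: compute soft object regions from a coarse head, pool one representation per region, and let every pixel attend to those region representations. The real OCRNet adds query/key/value projections, scaling and an auxiliary loss that this toy module omits; all names and shapes here are my assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniOCR(nn.Module):
    """Toy object-contextual representation: soft regions -> region features -> pixel-region attention."""
    def __init__(self, channels, num_classes):
        super().__init__()
        self.coarse_head = nn.Conv2d(channels, num_classes, 1)   # coarse soft object regions
        self.out = nn.Conv2d(channels * 2, channels, 1)          # fuse pixel + context features

    def forward(self, feats):                                    # feats: (B, C, H, W)
        b, c, h, w = feats.shape
        coarse = self.coarse_head(feats)                         # (B, K, H, W)
        regions = F.softmax(coarse.flatten(2), dim=2)            # spatial softmax per class: (B, K, HW)
        pixels = feats.flatten(2)                                # (B, C, HW)
        # Region representation: weighted sum of pixel features per class -> (B, K, C)
        region_feats = torch.bmm(regions, pixels.transpose(1, 2))
        # Pixel-region attention: similarity of each pixel to each region representation
        attn = F.softmax(torch.bmm(pixels.transpose(1, 2), region_feats.transpose(1, 2)), dim=2)  # (B, HW, K)
        context = torch.bmm(attn, region_feats).transpose(1, 2).reshape(b, c, h, w)
        # `coarse` can also supervise an auxiliary loss, as in the original paper.
        return self.out(torch.cat([feats, context], dim=1)), coarse
```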

Sixth place (absorbing one point of luck)

1. Model selection

They used HRNet+OCR. As for why this model works well, here is the team's own reasoning:

Low-resolution features contain rich semantic information, but because of downsampling they lose part of the positional information. Conversely, high-resolution features carry relatively less semantic information but retain more positional information, localize more accurately, and therefore help detect small targets. Since most roads are long and narrow, the model must localize precisely, which means road detection depends on high-resolution features. Most existing networks downsample high-resolution features into low-resolution ones and then recover high resolution from them, but the lost positional information cannot be fully restored. HRNet maintains high-resolution features through the entire network, which benefits road detection. In addition, HRNet's branches produce features at different resolutions, these features exchange information with each other, and the final output is a high-resolution feature containing multi-scale information. So we chose HRNet as our backbone.

We also believe road segmentation does not depend on very high-level semantics, so a very deep network is unnecessary; with limited training data, a large network with a huge parameter count risks overfitting. We therefore chose the small model in the HRNet series, HRNet18. The segmentation head is the OCR attention module, which uses object context to enhance the feature representation. The whole network is small, and both training and inference are fast. In experiments, HRNet18 beat HRNet32, HRNet48 and DeepLabv3+ (ResNeSt50) in both speed and accuracy.

What can be learned here is their way of choosing a model by reasoning from the problem itself (whether you believe the story or not, haha...).
Again, for the HRNet/OCRNet part I plan to write a separate source-code analysis series, so I won't expand here.

2. Loss design

What I learned here is also the thinking behind the loss design:

The problem with cross-entropy loss is that it cannot handle the imbalance between foreground and background. Toward the end of the competition, most models could predict the main roads well, but small roads and hard samples remained problematic. To address this, the team up-weighted the loss of pixels near the edges, then used OHEM to filter out easy samples; the final loss was lovasz + bce. Lovász is a differentiable surrogate for IoU, so it can optimize IoU directly and has worked well in several competitions.

[Figure: illustration of the edge-weighted loss]
Referring to the figure above: the team extracts the mask edges, increases the CE loss on the edge pixels, and then sets a threshold on the per-pixel loss; pixels below the threshold are treated as easy samples and do not participate in backpropagation. The final loss is bce + lovasz softmax.
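A minimal sketch of that recipe, assuming binary masks: build a per-pixel weight map that up-weights a thin band around the mask boundary, compute weighted BCE, drop "easy" pixels whose loss falls below a threshold (OHEM-style), and add a Lovász term from a public reference implementation. Edge width, weights and threshold are illustrative guesses, not the team's values.

```python
import cv2
import numpy as np
import torch
import torch.nn.functional as F

def edge_weight_map(mask, edge_width=3, edge_weight=3.0):
    """Per-pixel weights: `edge_weight` on a thin band around mask boundaries, 1 elsewhere.
    mask: (H, W) uint8 in {0, 1}. Width and weight are illustrative."""
    kernel = np.ones((edge_width, edge_width), np.uint8)
    edge = cv2.dilate(mask, kernel) - cv2.erode(mask, kernel)   # boundary band
    weights = np.ones_like(mask, dtype=np.float32)
    weights[edge > 0] = edge_weight
    return weights

def weighted_ohem_bce(logits, target, weights, loss_thresh=0.3):
    """Edge-weighted BCE where easy pixels (loss below threshold) are ignored (OHEM-style).
    weights: torch tensor broadcastable to logits, e.g.
    torch.from_numpy(edge_weight_map(mask_np))[None, None]."""
    per_pixel = F.binary_cross_entropy_with_logits(logits, target, reduction="none") * weights
    hard = per_pixel > loss_thresh
    if hard.sum() == 0:               # fall back if everything is "easy"
        return per_pixel.mean()
    return per_pixel[hard].mean()

# Total loss (sketch): weighted_ohem_bce(...) + lovasz_hinge(logits, target),
# with lovasz_hinge taken from the Lovász authors' public reference implementation.
```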

3. Exponential moving average (EMA)

I haven't used this myself, so I'll leave it here for now and fill it in properly later.
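As a placeholder in the meantime, here is a minimal sketch of weight EMA: keep a shadow copy of the model whose parameters are updated as decay·shadow + (1−decay)·weight after every optimizer step, and use the shadow model for validation/submission. The decay value is illustrative.

```python
import copy
import torch

class ModelEMA:
    """Keeps an exponential moving average of model weights; evaluate with `ema.module`."""
    def __init__(self, model, decay=0.999):
        self.module = copy.deepcopy(model).eval()
        self.decay = decay
        for p in self.module.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        msd = model.state_dict()
        for name, ema_param in self.module.state_dict().items():
            if ema_param.dtype.is_floating_point:
                ema_param.mul_(self.decay).add_(msd[name], alpha=1.0 - self.decay)
            else:
                ema_param.copy_(msd[name])   # copy integer buffers (e.g. BN counters) directly

# After each optimizer step: ema.update(model); validate / submit with ema.module.
```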

Fifth place (taking a puff of Huazi)

1. Some small points

  1. RandomGridShuffle: cut the image into four blocks and shuffle their order, a not-so-common augmentation (see the sketch after this list).
  2. Train with BCE loss + cosine annealing; after finding the best model, fine-tune with Lovász hinge loss + StepLR (also the approach recommended by the Lovász authors), using a linear warm-up for the fine-tuning.
  3. FPN / UNet + EfficientNet-B5
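A minimal sketch of points 1 and 2, using albumentations for RandomGridShuffle and plain PyTorch schedulers; all hyperparameters (probabilities, learning rates, epochs) are illustrative, and the Lovász hinge itself is assumed to come from the authors' reference implementation.

```python
import albumentations as A
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CosineAnnealingLR, StepLR

# Augmentation: cut each image into a 2x2 grid and shuffle the tiles.
train_aug = A.Compose([
    A.RandomGridShuffle(grid=(2, 2), p=0.5),
    A.HorizontalFlip(p=0.5),
])
# applied as: augmented = train_aug(image=image, mask=mask)

model = nn.Conv2d(3, 1, 3, padding=1)  # stand-in for the real segmentation network

# Stage 1: BCE + cosine annealing (epochs/lr are illustrative).
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
sched1 = CosineAnnealingLR(opt, T_max=50)

# Stage 2: fine-tune the best stage-1 weights with Lovasz hinge + StepLR,
# optionally warming the lr up linearly for the first few hundred iterations.
opt_ft = torch.optim.Adam(model.parameters(), lr=1e-4)
sched2 = StepLR(opt_ft, step_size=10, gamma=0.5)
```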

Fourth place (absorbing two points of luck)

First of all, this competitor open-sourced their code; a round of applause~~

Source portal

1. Unbalanced foreground and background

The competitor calculated that foreground and background pixels account for roughly 12.2% and 87.8%, an extreme class imbalance, so they added a weight (3:1) to the loss to reduce its impact.
This shows that although loss weighting is simple, it is not necessarily weak; this "dumb" method is always worth testing when facing an imbalance problem.
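The simplest way to realize such a 3:1 weighting in PyTorch is BCE's pos_weight; whether the team implemented it exactly this way is an assumption.

```python
import torch
import torch.nn as nn

# Roughly 3x weight on the (rarer) foreground road pixels. The write-up does not specify the
# exact mechanism; pos_weight is just one straightforward way to do it.
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(3.0))
# loss = criterion(logits, target)   # logits, target: (B, 1, H, W), target values in {0, 1}
```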

2. Ignore edge prediction

This feels like a clever trick, but no detailed explanation was given; I'll look at the source code later and fill this in.
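My guess, to be checked against their source code, is the common trick of sliding-window inference with overlapping tiles, where a border of each tile's prediction is discarded because it lacks context. The sketch below is that guess only, not the team's confirmed method; tile size and margin are illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def predict_ignore_tile_edges(model, image, tile=512, margin=64):
    """Sliding-window inference keeping only the central part of each tile prediction,
    discarding a `margin`-pixel border that lacks context. image: (1, C, H, W) tensor;
    the model is assumed to output single-channel logits."""
    _, _, h, w = image.shape
    core = tile - 2 * margin
    # Pad so (H, W) becomes a multiple of `core`, plus a reflected margin on each side.
    pad_h = (core - h % core) % core
    pad_w = (core - w % core) % core
    x = F.pad(image, (margin, margin + pad_w, margin, margin + pad_h), mode="reflect")
    out = torch.zeros(1, 1, h + pad_h, w + pad_w)
    for top in range(0, h + pad_h, core):
        for left in range(0, w + pad_w, core):
            patch = x[:, :, top:top + tile, left:left + tile]
            prob = torch.sigmoid(model(patch))
            # keep only the central core of each tile
            out[:, :, top:top + core, left:left + core] = prob[:, :, margin:-margin, margin:-margin]
    return out[:, :, :h, :w]
```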

Third place (taking a big drag)

1. Some small points

  1. UNet + EfficientNet-B5
  2. lovasz + focal loss
  3. RAdam + Lookahead optimizer + cosine annealing
  4. Multi-scale training, adjusting the image size between batches.
  5. Dilated ("expansion") prediction, presumably predicting on enlarged/overlapping tiles so that tile borders get enough context.

These are all fairly common techniques, which shows that using them well is already enough to reach the top three.
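As one example, the optimizer in point 3 can be assembled with the torch_optimizer package (Lookahead wrapping RAdam, a Ranger-style setup) plus cosine annealing; all hyperparameters are illustrative and the team's exact implementation is unknown.

```python
import torch
import torch.nn as nn
import torch_optimizer as topt              # pip install torch-optimizer
from torch.optim.lr_scheduler import CosineAnnealingLR

model = nn.Conv2d(3, 1, 3, padding=1)       # stand-in for the real UNet + EfficientNet-B5

# Lookahead wrapping RAdam; k/alpha/lr are illustrative values.
base = topt.RAdam(model.parameters(), lr=1e-3, weight_decay=1e-4)
optimizer = topt.Lookahead(base, k=5, alpha=0.5)
scheduler = CosineAnnealingLR(optimizer, T_max=50)

# Training loop (sketch): optimizer.step() each iteration, scheduler.step() once per epoch.
```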

First place (absorbing one point of luck)

First of all, they open-sourced their code; a round of applause. The portal is below:

Game source portal

1. Scheme design

The encoder uses a ResNeXt200 network pre-trained on ImageNet. On top of the encoder-decoder (ED) architecture, they propose a channel-attention-enhanced adaptive feature fusion method and design a gradient-based edge constraint module. While enhancing spatial details and semantic features, this also strengthens the feature response at road edges and enables accurate extraction of multi-scale roads. The architecture is as follows:
[Figure: first-place architecture]
Features of this architecture:

  1. The input has four channels; the extra channel is gradient information computed with the Sobel operator, which helps extract edges more accurately (see the sketch after this list).
  2. The backbone is a very deep ResNeXt200, which can extract deeper semantic information; this suggests that with gradient information as input, a deeper network can be used while still recovering good positional information.
  3. The structure in the middle is a bit like D-LinkNet's, except that the serial dilated convolutions are replaced by an ASPP structure.
  4. CAF is added after upsampling and fusion, similar to a channel-wise attention mechanism. I have also found in several competitions that upsampling + fusion is a good place to add attention, which can be understood as using channel attention to rebalance the weight of each channel after channel concatenation.
  5. The ECM branch is an FPN-like structure that compensates for positional information.
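A minimal sketch of point 1: append the Sobel gradient magnitude as a fourth input channel (the network's first convolution then needs in_channels=4). Kernel size and normalization are illustrative assumptions, not necessarily what the winning team did.

```python
import cv2
import numpy as np

def add_sobel_channel(image_bgr):
    """Append Sobel gradient magnitude as a 4th channel to a BGR uint8 image -> float32 (H, W, 4)."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    grad = cv2.magnitude(gx, gy)
    grad = grad / (grad.max() + 1e-6)                 # normalize gradient to [0, 1]
    rgbish = image_bgr.astype(np.float32) / 255.0     # scale image channels to [0, 1]
    return np.dstack([rgbish, grad.astype(np.float32)])
```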

After the competition, with the players' consent, I reproduced the network and tried it in another Kaggle competition. Unfortunately it did not perform well on pathology segmentation; this design is probably better suited to remote sensing road segmentation, which relies heavily on edge accuracy.

Summary

After writing up the top-ten solutions and reviewing them, I found that, contrary to what I had imagined, there was no endless stream of tricks like in the earlier object detection competition. Each team showed its own strengths, but overall there were no particularly striking highlights.

On this leaderboard, HRNet+OCRNet occupies half the territory, which shows its generalization really is good. I plan to do a source-code analysis of these two frameworks later, perhaps adding SegFix as well.

The biggest gain from this review is not these tricks, but how to adjust one's thinking during a competition. Everything has to come back to the data: explore with EDA, find the problems, and then prescribe the right remedy, instead of blindly hacking away until the end.

For example, if small targets are recognized poorly: consider multi-scale training, FPN, higher resolution...
If edges are recognized poorly: consider adding an edge loss and feeding in gradient information.
If a particular class is recognized poorly: is it a data-balance problem? Should offline augmentation be done for that class alone? Or does the class not need deep semantic features, so resolution should be maintained instead?
If the data is imbalanced: consider data augmentation, loss weighting, online resampling, focal loss, or the approaches mentioned in the earlier object detection article.
If the overall result is poor: consider whether class-level semantic information is weak and whether OCRNet would help, or whether more feature fusion is needed. And so on.


Just imagine the possibilities; I'm going to rest first...

[End of this article]


Source: blog.csdn.net/weixin_36714575/article/details/114011609