Image Segmentation Review: Semantic Segmentation

My research direction is image segmentation, and I plan to publish a paper on panoptic segmentation. Semantic segmentation and instance segmentation are prerequisites for panoptic segmentation, so I first list the excellent semantic segmentation papers from top conferences that I have read recently, to make review and consolidation easier and to get a macro-level grasp of the semantic segmentation field.

Table of contents

1. Summary of the paper

1.1 Classic segmentation algorithm

1.1.1 FCN

1.1.2 U-Net

 1.1.3 SegNet

1.1.4 Deeplab series

 1.1.5 GCN (Global Convolutional Network)

1.1.6 DANet

1.1.7 Swin Transformer

1.2 Real-time segmentation algorithm

1.2.1 ENet

1.2.2 BiSeNet

​1.2.3 DFANet

1.3 RGB-D segmentation

1.3.1 RedNet

 1.3.2 RDFNet

 1.4 Extended reading

2. Common data sets

2.1 PASCAL Visual Object Classes(VOC)

2.2 PASCAL Context

2.3 Microsoft Common Objects in Context (MS COCO)

2.4 Cityscapes

2.5 ADE20K/MIT Scene Parsing(SceneParse150)

2.6 SiftFlow

2.7 Stanford background

2.8 Berkeley Segmentation Dataset(BSD)

2.9 Youtube-Objects

2.10 KITTI

3. Evaluation indicators

3.1 Confusion Matrix

3.2 Pixel Accuracy (PA)

3.3 Category Average Pixel Accuracy (MPA)

 3.4 Intersection over Union (IoU)

3.5 Mean Intersection over Union (MIoU)

3.6 Frequency Weighted Intersection over Union (FWIoU)

4. Loss function

4.1 Log loss

4.2 Dice loss

4.3 IOU loss 

4.4  Lovasz-Softmax loss

4.5 Focal loss 

5. Activation functions

5.1 sigmoid activation function

5.2 tanh activation function

5.3 ReLU activation function

5.4 Leaky ReLU activation function

5.5 Parametric ReLU activation function


1. Summary of the paper

Image segmentation is the process of dividing an image into multiple sub-regions such that the pixels within each sub-region share similar characteristics. It is a fundamental problem in computer vision, widely used in medical image analysis, object tracking, autonomous driving and other fields. Semantic segmentation is a special form of image segmentation that assigns every pixel in an image to one of a set of predefined semantic categories, independent of object instances; it can therefore be viewed as a pixel-level generalization of image classification rather than object detection or instance segmentation. Semantic segmentation underlies many computer vision tasks, such as autonomous driving and intelligent video surveillance, because it helps computers understand the semantic meaning of different regions of an image and thus make more accurate judgments and decisions. Currently, convolutional neural networks (CNNs) and task-specific structures and algorithms developed in deep learning have enabled significant progress in semantic segmentation across many domains.

1.1 Classic segmentation algorithm

1.1.1 FCN

Paper: Fully Convolutional Networks for Semantic Segmentation

FCN is a pioneering work in semantic segmentation, proposed by Jonathan Long, Evan Shelhamer and Trevor Darrell in 2015. Its end-to-end training paved the way for subsequent semantic segmentation algorithms. FCN achieved the best segmentation results of its time on PASCAL VOC (a 20% relative improvement over previous methods on the 2012 data, reaching a mean IoU of 62.2%), NYUDv2 and SIFT Flow, and inference for a typical image takes only about one third of a second.

The main contributions of the algorithm:

  • Replacing the traditional fully connected layers with convolutional layers lets FCN accept an input image of any size and produce spatially aligned, pixel-level predictions. Another important contribution is the introduction of upsampling layers to restore the output to the input resolution: the downsampling path extracts increasingly high-level features at reduced resolution, and the upsampling path restores the original size through deconvolution, fusing high-resolution, low-level features with low-resolution, high-level features to obtain pixel-level predictions.
  • Another important feature of FCN is the introduction of skip connections, which combine the outputs of earlier convolutional layers with the inputs of the upsampling layers, retaining richer information and improving prediction accuracy. Skip connections also allow FCN to generate predictions from features at different levels, giving better segmentation at both the detail and the semantic level.

Network Model Diagram

  • FCN-32s: the output of pool5 is directly upsampled to the original image size for prediction. As can be seen from Figure 4, the result is very coarse (mIoU 59.4)
  • FCN-16s: the output of pool5 is upsampled by 2x, added to the pool4 features, and then upsampled to the original image size; the result is finer than 32s (mIoU 62.4)
  • FCN-8s: the sum of pool5 (upsampled 2x) and pool4 is upsampled by another 2x and added to pool3 to obtain the FCN-8s features (mIoU 62.7)

In traditional classification CNNs, pooling is used to enlarge the field of view while reducing the resolution of the feature maps. This is very effective for classification, since the model only cares about the overall category of the image and not about spatial locations; that is why convolutional layers are routinely followed by pooling layers, ensuring that more abstract, salient features can be extracted. On the other hand, pooling and strided convolution are harmful for semantic segmentation because they discard spatial information. Different semantic segmentation models use different mechanisms in their encoders and decoders, but all aim to recover the information lost when the encoder reduces resolution.

Another key point of semantic segmentation architectures is to either apply deconvolution to the feature maps, upsampling the low-resolution segmentation map to the input resolution, or use atrous convolution to partially avoid the resolution drop in the encoder, although atrous convolutions are computationally expensive even on modern GPUs.
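
To make the two options concrete, here is a minimal PyTorch sketch (the channel counts and feature-map size are illustrative assumptions, not the exact FCN configuration) of a transposed convolution that doubles the resolution and an atrous convolution that enlarges the receptive field without changing the resolution:

import torch
import torch.nn as nn

x = torch.randn(1, 256, 32, 32)             # a low-resolution feature map (illustrative size)

# Transposed convolution: learnable upsampling, 32x32 -> 64x64
deconv = nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1)
print(deconv(x).shape)                      # torch.Size([1, 128, 64, 64])

# Atrous (dilated) convolution: larger receptive field, resolution unchanged
atrous = nn.Conv2d(256, 256, kernel_size=3, dilation=2, padding=2)
print(atrous(x).shape)                      # torch.Size([1, 256, 32, 32])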

1.1.2 U-Net

Paper: U-Net: Convolutional Networks for Biomedical Image Segmentation

The U-Net architecture consists of a contracting path that captures contextual information and a symmetric expanding path that supports precise localization, so that the network can be trained end-to-end with very few images. U-Net is mainly used for medical image segmentation, and it achieved better results than previous methods in the ISBI neuronal structure segmentation challenge.

Training a deep network normally requires a large amount of data. A central difficulty of semantic segmentation is how to combine high-resolution features with low-resolution features, which can also be seen as combining low-level location information with high-level semantic information; their joint use is a research problem well worth studying:

  • High-resolution features can guarantee positional information
  • Low-resolution features can guarantee semantic information

On the basis of FCN, this paper has made some modifications and optimizations, which can ensure more accurate segmentation results based on the use of a small amount of data.

The main core of FCN:

  • After the regular convolutional layers, upsampling layers are added to enlarge the feature map, and high-resolution features from earlier layers are used during the enlargement, bringing in better positional information.

The main improvements of U-Net are:

  • In the upsampling part, a large number of feature channels are kept, allowing more context information to be propagated to the higher-resolution layers
  • Data augmentation is used so that the network learns more robust features
  • A weighted loss is used

U-Net frame structure:

As shown in Figure 1, U-Net and FCN are similar in structure, and are also divided into encoder and decoder modules. There are only convolution and pooling layers in the network structure, including two paths:

  • A contracting path (encoder) that captures contextual information
  • A symmetric expanding path (decoder) that enables precise localization

 

Features:

  • U-Net's downsampling and upsampling stages use the same number of convolutions, and skip connections pass the features of each downsampling layer directly to the corresponding upsampling layer, which makes U-Net's pixel localization more accurate (a minimal sketch of this skip connection follows the list)
  • U-Net is also more efficient to train than FCN: U-Net needs to be trained only once, whereas FCN-8s requires three training stages
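
A minimal sketch of the skip connection idea, upsample the decoder feature, concatenate the encoder feature, then convolve (the channel sizes are illustrative assumptions, not the exact paper configuration):

import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # two 3x3 convolutions, as used throughout U-Net-style encoders and decoders
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

enc_feat = torch.randn(1, 64, 128, 128)   # high-resolution encoder feature (skip connection)
dec_feat = torch.randn(1, 128, 64, 64)    # low-resolution decoder feature

up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
x = up(dec_feat)                          # 64x64 -> 128x128
x = torch.cat([enc_feat, x], dim=1)       # concatenate the skip features along the channel axis
x = double_conv(128, 64)(x)
print(x.shape)                            # torch.Size([1, 64, 128, 128])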

1.1.3 SegNet

Paper: SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation

The novelty of SegNet lies in the way the decoder upsamples its lower resolution input feature maps.

  • SegNet uses "unpooling" in the decoder to upsample feature maps, keeping high-frequency details intact in the segmentation.
  • The encoder discards the fully connected layers (it is fully convolutional, like FCN), so it is a lightweight network with fewer parameters.

Network structure:

  • Encoder: a VGG-16 network (with the final classification layers removed), 13 convolutional layers in total, each convolution being Conv+BN+ReLU; downsampling is done by 2x2 max pooling with stride 2. Max pooling normally loses positional information, and this progressive loss of boundary detail is very unfavourable for semantic segmentation, where boundaries matter a lot, so it is important to retain as much boundary information as possible in the encoder. Considering memory and efficiency, some methods propose storing only the positions of the maxima, and this is also the approach adopted here.
  • Decoder: each encoder has a corresponding decoder, so the decoder also has 13 layers; upsampling is performed using the memorized max-pooling indices. The decoding structure is shown in Figure 3.
  • Classification layer: the output of the final decoder layer is fed to a multi-class soft-max classifier that produces a class prediction for each pixel.

This approach removes the need for learned upsampling. The upsampled feature maps are sparse, so trainable convolution kernels are then applied to produce dense feature maps. Comparing SegNet with semantic segmentation networks such as FCN reveals the trade-off between memory and accuracy involved in achieving good segmentation performance.
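
The memorized max-pooling indices can be reproduced in PyTorch with return_indices and max_unpool2d; a minimal sketch of the idea (not the full SegNet):

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 64, 32, 32)

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
y, indices = pool(x)                                        # encoder: remember where the maxima were

up = F.max_unpool2d(y, indices, kernel_size=2, stride=2)    # decoder: sparse, non-learned upsampling
dense = nn.Conv2d(64, 64, 3, padding=1)(up)                 # a trainable convolution densifies the map
print(up.shape, dense.shape)                                # both torch.Size([1, 64, 32, 32])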

1.1.4 Deeplab series

Deeplab v1

Paper: Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs

DeepLab v1 is the earliest DeepLab model, originally proposed in 2014. It mainly uses atrous convolution (also known as dilated convolution) to enlarge the receptive field so that larger context can be processed. In addition, DeepLab v1 converts the fully connected layers into dilated convolutions, which reduces the number of parameters and helps avoid overfitting. A fully connected CRF is used as post-processing to localize segmentation boundaries more accurately.

DeepLab v2

Paper: DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

DeepLab v2 is an improved version of DeepLab v1, originally proposed in 2016. It mainly adds multi-scale information fusion through atrous spatial pyramid pooling (ASPP) to further improve segmentation accuracy.

 

ASPP framework: 4 parallel convolutions with different dilation rates 
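
A minimal sketch of the ASPP idea, parallel 3x3 atrous convolutions with different dilation rates whose outputs are fused; the rates follow the common (6, 12, 18, 24) setting, and fusing by concatenation plus a 1x1 convolution is a simplification (DeepLab v2 sums the per-branch score maps):

import torch
import torch.nn as nn

class SimpleASPP(nn.Module):
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18, 24)):
        super().__init__()
        # one 3x3 atrous convolution per dilation rate; padding=rate keeps the resolution
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        return self.project(torch.cat(feats, dim=1))

aspp = SimpleASPP(512, 256)
print(aspp(torch.randn(1, 512, 33, 33)).shape)   # torch.Size([1, 256, 33, 33])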

Deeplab v3

Paper: Rethinking Atrous Convolution for Semantic Image Segmentation

DeepLab v3, originally proposed in 2017, is one of the most important versions in the DeepLab series. It uses cascaded atrous convolutions (with a multi-grid strategy) to better handle objects of different scales, and it further improves the ASPP module to strengthen feature representation.

Network structure:

DeepLab v3 contains two implementation structures: a cascaded model and an ASPP (pyramid pooling) model.

The two models are shown in the two figures below.

  • Block1, 2, 3 and 4 in the cascaded model are the layer structures of the ResNet backbone (v3 uses ResNet-50 or ResNet-101). In Block4, the stride of the 3x3 convolution and of the 1x1 convolution on the shortcut branch is changed from 2 to 1, so no further downsampling is performed, and the 3x3 convolution is replaced with an atrous convolution. The subsequent Block5, 6 and 7 are copies of Block4. (The rate in the figure is not the actual dilation rate; the actual dilation rate = rate x Multi-grid parameter.)

  • Object sizes vary greatly; the usual remedies are multi-scale fusion, encoder-decoder structures, extra modules that capture long-range dependencies, and spatial pyramid pooling (several different rates to capture features at different scales)
  • Batch normalization is added in ASPP to speed up training and improve results
  • A parallel global average pooling branch is added to capture global information

ASPP uses different dilation rates to capture information at different scales, but as the dilation rate grows the number of valid filter weights shrinks: more and more weights fall outside the feature map and contribute nothing. In the extreme case, such a 3x3 convolution behaves almost like a 1x1 convolution. Therefore the authors add global average pooling on the last feature map to capture global information, feed these "image-level" features through a 1x1 convolution + BN, and then upsample to the required size.

Deeplabv3+

Paper: Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

Deeplab v3+ introduces a new encoder-decoder structure, "ASPP with decoder". The ASPP part is similar to the ASPP module in Deeplab v3, but adds several convolutions with different dilation rates, and a 1x1 convolutional layer is added after the last layer for dimensionality reduction. The decoder upsamples the low-resolution ASPP output, combines it with high-resolution, low-level features from the encoder backbone, and finally produces the prediction at the original image resolution.

Compared with Deeplab v3, the improvements of Deeplab v3+ mainly include the following points:

  • ASPP with decoder structure: Deeplab v3+ introduces the "ASPP with decoder" structure, which improves the network's ability to recognize objects of different scales and reduces errors in the final prediction, especially along object boundaries.

  • Xception-based backbone: Deeplab v3+ uses a modified Xception backbone; compared with ResNet, Xception has fewer parameters and higher computational efficiency (a sketch of the atrous separable convolution it is built from is given after this list).

  • Multi-scale training and testing: Deeplab v3+ uses multi-scale methods in both training and testing stages, which can improve the robustness and prediction accuracy of the network. 
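
As referenced in the list above, a minimal sketch of an atrous separable convolution, a depthwise atrous convolution followed by a pointwise 1x1 convolution (channel sizes and dilation rate are illustrative assumptions):

import torch
import torch.nn as nn

class AtrousSeparableConv(nn.Module):
    # depthwise 3x3 atrous convolution followed by a pointwise 1x1 convolution
    def __init__(self, in_ch, out_ch, dilation=2):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=dilation,
                                   dilation=dilation, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

m = AtrousSeparableConv(256, 256)
print(m(torch.randn(1, 256, 64, 64)).shape)   # torch.Size([1, 256, 64, 64])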

 Network model:

1.1.5 GCN (Global Convolutional Network)

Paper: Large Kernel Matters - Improve Semantic Segmentation by Global Convolutional Network

GCN (Global Convolutional Network) is a deep convolutional neural network for image semantic segmentation. Compared with traditional convolutional networks, GCN extends the convolution operation from a local receptive field toward global image information, so that global features are better preserved and the segmentation of some small objects improves.

Semantic segmentation requires not only segmenting the image but also classifying the segmented regions. In segmentation architectures, fully connected layers cannot be used, and this work finds that convolutions with large kernels can serve instead. Another reason for adopting large kernels is that although deep networks such as ResNet have a large theoretical receptive field, related research has found that the network tends to use information from a much smaller area (the notion of the effective receptive field). Large kernels are, however, computationally expensive and have many parameters, so a kxk convolution is approximated by the combination of two parallel branches, 1xk + kx1 and kx1 + 1xk. This module is called the Global Convolutional Network (GCN).
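
A minimal sketch of this decomposition (channel sizes and the kernel size k=7 are illustrative assumptions):

import torch
import torch.nn as nn

class GCNBlock(nn.Module):
    def __init__(self, in_ch, out_ch, k=7):
        super().__init__()
        p = k // 2
        # branch A: 1xk followed by kx1
        self.a = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, (1, k), padding=(0, p)),
            nn.Conv2d(out_ch, out_ch, (k, 1), padding=(p, 0)))
        # branch B: kx1 followed by 1xk
        self.b = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, (k, 1), padding=(p, 0)),
            nn.Conv2d(out_ch, out_ch, (1, k), padding=(0, p)))

    def forward(self, x):
        return self.a(x) + self.b(x)   # the sum of the two branches approximates a kxk convolution

print(GCNBlock(512, 21)(torch.randn(1, 512, 32, 32)).shape)   # torch.Size([1, 21, 32, 32])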

Now for the structure: ResNet (without atrous convolution) forms the encoder, while GCN modules and deconvolution layers form the decoder. The structure also uses a simple residual module called Boundary Refinement (BR).
  

1.1.6 DANet

Paper: Dual Attention Network for Scene Segmentation

The paper proposes a new scene segmentation model, the Dual Attention Network (DANet), which aims to capture global context information and the interrelationships between objects effectively to improve scene segmentation. The main contributions are as follows:

  • Dual attention mechanism: DANet introduces a dual attention mechanism, a position attention module and a channel attention module, to capture global dependencies in the spatial and channel dimensions.

  • Attention-based feature enhancement: The DAN model enhances the feature representation through the attention mechanism, making the model pay more attention to important information. Through the attention mechanism, DAN can adaptively adjust the feature weights to improve the accuracy of scene segmentation.

  • Enhanced transmission of spatial information: The DAN model introduces an attention mechanism in the convolution process, allowing the model to focus on key areas and better capture global and local context information. This further strengthens the transfer of spatial information and improves the performance of scene segmentation. 

In order to capture both spatial and channel correlations, DANet uses a parallel design, applying non-local-style modules along both the spatial and the channel dimension.

  • Position attention: spatial attention; for each pixel it computes its relationship to every other pixel, and the closer the relationship, the larger the weight (a rough sketch follows this list)
  • Channel attention: category-like attention; it computes a weight for each channel, and the more important the channel, the larger the weight
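
A rough sketch in the spirit of the position attention branch, written in the non-local style mentioned above (the channel-reduction factor and the learnable residual weighting are assumptions):

import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.query = nn.Conv2d(in_ch, in_ch // 8, 1)
        self.key = nn.Conv2d(in_ch, in_ch // 8, 1)
        self.value = nn.Conv2d(in_ch, in_ch, 1)
        self.gamma = nn.Parameter(torch.zeros(1))   # learnable residual weight

    def forward(self, x):
        B, C, H, W = x.shape
        q = self.query(x).view(B, -1, H * W).permute(0, 2, 1)   # B x HW x C'
        k = self.key(x).view(B, -1, H * W)                      # B x C' x HW
        attn = torch.softmax(torch.bmm(q, k), dim=-1)           # B x HW x HW pixel-to-pixel weights
        v = self.value(x).view(B, -1, H * W)                    # B x C x HW
        out = torch.bmm(v, attn.permute(0, 2, 1)).view(B, C, H, W)
        return self.gamma * out + x                             # residual connection

print(PositionAttention(64)(torch.randn(1, 64, 16, 16)).shape)  # torch.Size([1, 64, 16, 16])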

  

1.1.7 Swin Transformer

Paper: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

"Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" is a paper jointly published by the Chinese University of Hong Kong, Tsinghua University and Horizon Robotics Inc. on CVPR 2021. This paper proposes a new Vision Transformer architecture, namely Swin Transformer, which solves two main problems in the existing Vision Transformer architecture by adopting a mechanism called shifted window: 1. High computation of high-resolution images Cost; 2. Computational cost for long sequences.

Swin Transformer reduces computational cost by dividing the input image into multiple small non-overlapping windows and computing self-attention within each window. Unlike a fixed partition, the shifted window mechanism alternates the window partition between consecutive layers, so that neighbouring windows exchange information. This lets the model exploit local spatial information efficiently while still building cross-window connections, which improves accuracy.
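
A minimal sketch of the window partition and the cyclic shift behind shifted windows (shapes and window size are illustrative; the real implementation additionally masks attention across the rolled boundary):

import torch

def window_partition(x, ws):
    # x: [B, H, W, C] -> windows: [num_windows * B, ws, ws, C]
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws, ws, C)

x = torch.randn(1, 8, 8, 96)                 # a feature map in [B, H, W, C] layout
regular = window_partition(x, ws=4)          # four non-overlapping 4x4 windows
shifted = window_partition(torch.roll(x, shifts=(-2, -2), dims=(1, 2)), ws=4)
print(regular.shape, shifted.shape)          # torch.Size([4, 4, 4, 96]) for both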

Swin Transformer is evaluated on image classification, COCO object detection and instance segmentation, and ADE20K semantic segmentation, and compared with existing Transformer architectures. The experimental results show that Swin Transformer achieves better performance at comparable computational cost. In addition, the hierarchical design scales to different model sizes and image resolutions.

  

1.2 Real-time segmentation algorithm

1.2.1 ENet

Paper: ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation

The ability to perform pixel-level semantic segmentation in real time is critical for mobile applications. The drawback of recent DCNNs is that they require a large number of floating-point operations and have long run times, which hinders the usability of semantic segmentation. ENet is one of the earliest works on real-time semantic segmentation and achieves a good trade-off between accuracy and computation time.

Design ideas for some lightweight semantic segmentation models

  • Reduce the number of upsampling and downsampling (downsampling to 1/32 in FCN, and only 1/4 in ENet), because frequent downsampling will lead to information loss.
  • Use a lightweight Decoder (this article uses the segnet decoder) instead of the symmetrical structure of UNet.
  • Experiments found that removing the ReLUs after the first few convolutions works better; after testing, the reason turned out to be that the network is not deep enough, so PReLU is used instead of ReLU.
  • Use decomposed convolution (5*5 -> 1*5+5*1, the amount of calculation after decomposition is roughly equivalent to 3*3)
  • Use dilated convolutions to increase the receptive field, and the effect is very good.
  • Regularization, namely dropout.
  • Downsampling is necessary but loses information. To reduce the loss, the downsampling block performs a convolution and a max pooling in parallel and concatenates the two results so that more information is retained (see the sketch after this list).
  • The network structure is as shown in the figure below

  • Add bn and prelu after conv (maybe ordinary convolution, dilated convolution, transposed convolution).
  • The following table is the overall structure. You can see that the downsampling is only downsampled 2 times.
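
As referenced in the list above, a minimal sketch of the initial downsampling block, a strided convolution in parallel with max pooling, concatenated; the 13 + 3 = 16 channel split follows a common configuration and is treated here as an assumption:

import torch
import torch.nn as nn

class InitialBlock(nn.Module):
    # downsample by 2 while keeping more information: conv and max pooling in parallel, then concat
    def __init__(self, in_ch=3, out_ch=16):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch - in_ch, 3, stride=2, padding=1)
        self.pool = nn.MaxPool2d(2, stride=2)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.PReLU()

    def forward(self, x):
        x = torch.cat([self.conv(x), self.pool(x)], dim=1)
        return self.act(self.bn(x))

print(InitialBlock()(torch.randn(1, 3, 512, 512)).shape)   # torch.Size([1, 16, 256, 256])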

1.2.2 BiSeNet

Paper: BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation

This paper summarizes the previous real-time semantic segmentation algorithms and finds that there are currently three acceleration methods:

  • Limit the input size by cropping or resizing to reduce computational complexity. Although simple and effective, the loss of spatial detail degrades the prediction, especially around boundaries, reducing both the metrics and the visual quality;
  • Speed up processing by reducing the number of network channels, especially in the early stages of the backbone, but this weakens the spatial information.
  • Discard the final stage of the model (as ENet does) in pursuit of an extremely compact framework. The disadvantage is obvious: since the downsampling of the final stage is discarded, the receptive field of the model is not large enough to cover large objects, which hurts its discriminative ability.

These speed-up methods will lose a lot of Spatial Details or sacrifice Spatial Capacity, resulting in a significant drop in accuracy. In order to make up for the loss of spatial information, some algorithms use U-shape to restore spatial information. However, U-shape will slow down, and a lot of lost information cannot be recovered simply by fusing shallow features.

To sum up, a real-time semantic segmentation algorithm needs to preserve spatial information while accelerating. The paper proposes a new Bilateral Segmentation Network, BiSeNet. A Spatial Path with a small stride is designed to preserve spatial location information and produce high-resolution feature maps, while a Context Path with a fast downsampling rate is designed to obtain a sufficient receptive field. A Feature Fusion Module is introduced on top of these two paths to fuse their feature maps and strike a balance between speed and accuracy.

Specifically, the Spatial Path uses more channels and a shallower network to retain rich spatial information and generate high-resolution features, while the Context Path uses fewer channels and a deeper network with fast downsampling to obtain sufficient context. A Feature Fusion Module (FFM) is designed to fuse the outputs of the two paths.
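
A rough sketch in the spirit of the Feature Fusion Module: concatenate the two paths, apply conv-bn-relu, then re-weight the channels with an attention vector computed from globally pooled features (the exact layer settings are assumptions):

import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        # channel attention computed from globally pooled features
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, out_ch, 1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 1), nn.Sigmoid())

    def forward(self, spatial_feat, context_feat):
        x = self.fuse(torch.cat([spatial_feat, context_feat], dim=1))
        return x + x * self.attn(x)           # re-weight the channels and add back

ffm = FeatureFusion(256 + 128, 256)
out = ffm(torch.randn(1, 256, 64, 64), torch.randn(1, 128, 64, 64))
print(out.shape)                              # torch.Size([1, 256, 64, 64])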

1.2.3 DFANet

Paper: DFANet: Deep Feature Aggregation for Real-Time Semantic Segmentation

DFANet is a semantic segmentation network for road-scene understanding released by Megvii in April 2019. The backbone is a lightweight Xception network, and feature information is then aggregated by cascading at the sub-network level and at the sub-stage level.

The starting point of the design is to make better and fuller use of the feature maps produced during downsampling. The paper uses two strategies to aggregate information from different feature maps: the first reuses backbone features of different sizes to fuse semantic-level and spatial-level information (the sub-network aggregation); the second combines features from corresponding stages along the processing path of the network to improve the feature representation (the sub-stage aggregation).

network structure

As can be seen from Figure 1, the focus of the sub-network aggregation is to upsample the high-level feature map of the previous backbone and feed it into the next backbone to refine the prediction; the sub-network can be seen as a coarse-to-fine process of pixel prediction. The sub-stage aggregation combines features of the corresponding stages in the "coarse" and "fine" passes, which transfers receptive fields and high-dimensional structural details by combining feature maps of the same resolution. The fc attention module in the network follows the design idea of SENet and improves performance by re-weighting feature maps channel by channel. To better extract semantic and category information, the fully connected layer in the fc attention module is pre-trained on ImageNet (in classification tasks the network usually ends with global pooling + FC to obtain a class-probability vector), and the FC is followed by a 1 x 1 convolution to match the channel dimension of each backbone's output feature map.

The paper points out the shortcomings of the multi-branch input network in the lightweight model: 1. The multi-branch input network lacks the combination of high-level features between different branches; 2. There is also a lack of information interaction between different branches ; 3. For high-resolution input images, the network's inference speed is limited.

1.3 RGB-D segmentation

1.3.1 RedNet

Paper: RedNet: Residual Encoder-Decoder Network for Indoor RGB-D Semantic Segmentation

This is a paper from 2018. I read it because, on the one hand, its overall architecture can serve as a fairly general basic framework for RGB-D segmentation, and on the other hand because it restores object boundaries well, as mentioned earlier. By 2018 ResNet had already been proposed, and RedNet uses the residual block as the building block in both the encoder and the decoder. The authors also propose multi-scale deep supervision, which is now used in many papers. RedNet reaches a score of 47.8% on SUN RGB-D.
FCN-style segmentation networks fall into two types: encoder-decoder structures and dilated-convolution structures. The encoder-decoder structure enlarges the receptive field through successive convolutions, but the feature map gradually shrinks and the information lost during downsampling is hard to recover in the decoder; dilated convolution enlarges the receptive field by increasing the dilation rate while keeping the feature map size unchanged, which leads to a very large amount of computation.
RedNet uses ResNet-34 as the backbone for both the RGB and the depth branch; the two branches are fused at every layer, and the spatial information of the encoder is passed to the decoder through skip connections.
model structure

The point to note is that trans is the upsampling operation of the model decoder:

There are also four Agent modules on the skip connections in the framework; they are ordinary 1x1 convolutions whose purpose is to reduce the number of encoder channels so that the encoder features can be fused with the decoder features. They are only used with the ResNet-50 backbone; ResNet-34 does not need them because its encoder has no channel expansion.
In the encoder, each residual layer downsamples in its first block and is followed by several convolutions, with a residual connection alongside the downsampling operation, exactly as in the original ResNet. In the decoder ("trans") blocks, the initial convolutions keep the size unchanged and the last convolution upsamples: the feature map is doubled in size and the channels are halved, and the skip connection alongside it is treated in the same way.
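
A rough sketch of the layer-wise fusion idea described above, assuming the fusion and the skip connections are element-wise additions (the shapes are illustrative):

import torch

# features from the RGB and depth ResNet-34 branches at the same stage (illustrative shapes)
rgb_feat = torch.randn(1, 128, 60, 80)
depth_feat = torch.randn(1, 128, 60, 80)

fused = rgb_feat + depth_feat        # element-wise fusion, fed to the next RGB encoder stage

# later, in the decoder, the fused encoder feature reaches the decoder through a skip connection
decoder_feat = torch.randn(1, 128, 60, 80)
decoder_out = decoder_feat + fused
print(decoder_out.shape)             # torch.Size([1, 128, 60, 80])
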
Specific configuration:

1.3.2 RDFNet

Paper: RDFNet: RGB-D Multi-level Residual Feature Fusion for Indoor Semantic Segmentation

In indoor semantic segmentation using RGB-D data, it has been shown that incorporating depth features into RGB features helps improve segmentation accuracy. However, previous studies did not fully exploit the potential of multimodal feature fusion, simply concatenating RGB and depth features or averaging RGB and depth score maps. This paper proposes a new network that extends the core idea of residual learning to RGB-D semantic segmentation. The network can effectively capture multi-level RGB-D CNN features by including a multi-modal feature fusion module and a multi-level feature refinement module. Two RGB-D datasets, NYUDv2 and SUN RGB-D, are used.

The RefineNet network was a milestone for semantic segmentation of RGB-D images. RefineNet uses residual learning with skip connections, which makes it easy to backpropagate gradients during training. The multi-level features in RefineNet are connected by short-range and long-range residual connections, so they can be trained efficiently and merged into high-resolution feature maps. Inspired by this work, this paper proposes a novel RGB-D fusion network (RDFNet) that extends the core idea of residual learning to RGB-D semantic segmentation. The network structure is shown in Figure 1:

The detailed components of MMFNet are shown in Figure 5. The feature fusion block contains the same components as RefineNet, but with different inputs and slightly different operations. Given the RGB and depth ResNet features, MMFNet first reduces the dimension of each feature with a convolution to enable efficient training and limit parameter growth. Each feature is then passed through two RCUs and a convolution, as in RefineNet. The purpose of the RCUs in MMFNet differs somewhat from RefineNet: here they perform the non-linear transformations needed specifically for modality fusion, complementarily combining the features of the two modalities so that they improve each other, whereas in RefineNet the lower-level, higher-resolution features mainly refine the coarse higher-level features. The additional convolutions that follow are crucial for adaptively fusing features of different modalities and appropriately rescaling the feature values for summation. Since color features usually have better discriminative power than depth features for semantic segmentation, the summation fusion in the block mainly learns complementary, residual depth features that help the RGB features distinguish confusing patterns. The importance of each modality's features can be controlled by the learnable parameters of the convolution after the RCUs.

 1.4 Extended reading

The papers above are classics of semantic segmentation published at top conferences and are suitable for beginners in image segmentation. Reading them and reproducing the source code lays a solid foundation. Once you have built up some knowledge and experience, you can move on to the more popular, novel or challenging directions of the past two years, which requires reading a lot of literature. Here are some ways to obtain it:

  • Google Scholar: https://scholar.google.com/ . Google Scholar supports keyword search and many filters, and provides PDF downloads and citation generation. If it is not directly accessible in your region, there are many mirrors, such as https://scholar.glgoo.org/ .
  • Baidu Academic: http://xueshu.baidu.com/ . Its functionality is similar to Google Scholar; it is friendly to Chinese users and offers thorough citation analysis and a document-request feature, but its coverage of the literature is somewhat smaller.
  • dblp: computer science bibliography
  • CVF Open Access
  • The latest in Machine Learning | Papers With Code
  • SCI-Hub.  https://sci-hub.tw/ . A powerful tool for downloading papers, supporting keyword search, DOI search, etc. Most importantly, many papers that cannot be downloaded through the two search engines above can be obtained here for free.
  • arXiv. Some articles cannot be downloaded due to copyright issues, but their preprints can. Of course, arXiv is only a platform for submitting preprints, and the papers on it have not gone through peer review, so their quality varies; do not treat everything on it as authoritative.
  • Official websites of ACM, IEEE, Springer, Elsevier, etc.

2. Common data sets

2.1 PASCAL Visual Object Classes(VOC)

The VOC dataset is one of the mainstream computer vision datasets. It can be used for classification, segmentation, object detection, action classification and person layout. For segmentation it contains 21 classes (20 object classes plus background), grouped into vehicles, household objects, animals and person: airplane, bicycle, boat, bus, car, motorcycle, train, bottle, chair, dining table, potted plant, sofa, TV/monitor, bird, cat, cow, dog, horse, sheep and person. The whole dataset is divided into a training set and a validation set.

https://pan.baidu.com/s/1TdoXJP99RPspJrmJnSjlYg#list/path=%2F Extraction code: jz27

2.2 PASCAL Context

PASCAL Context is an extension of VOC 2010 with pixel-level labels for all training images. It contains more than 400 categories; because many categories are rare, a subset of the 59 most frequent categories is commonly used for training.

For specific operations, see: https://blog.csdn.net/qq_28869927/article/details/93379892

2.3 Microsoft Common Objects in Context (MS COCO)

MS COCO is another large-scale object detection, segmentation and text localization dataset. The dataset contains many categories, as well as a large number of labels.

Link: https://blog.csdn.net/qq_41185868/article/details/82939959

2.4 Cityscapes

Cityscapes is another large-scale dataset that focuses on semantic understanding of urban street scenes. It contains stereo video sequences of street scenes from 50 cities, with 5k frames of high-quality pixel-level annotations and a further set of 20k weakly annotated frames. It includes semantic, dense pixel annotations for 30 classes grouped into 8 categories: flat surfaces, humans, vehicles, constructions, objects, nature, sky and void.

Link: https://blog.csdn.net/zz2230633069/article/details/84591532

2.5 ADE20K/MIT Scene Parsing(SceneParse150)

ADE20K/MIT Scene Parsing (SceneParse150) provides a standard training and evaluation platform for scene segmentation algorithms. The data comes from ADE20K, which contains more than 25,000 images.

Link: http://groups.csail.mit.edu/vision/datasets/ADE20K/

2.6 SiftFlow

SiftFlow contains 2688 images annotated with LabelMe. There are 33 semantic categories in total, and the 256x256 images cover 8 different outdoor scenes.

Link: http://people.csail.mit.edu/celiu/SIFTflow/

2.7 Stanford background

This dataset contains 715 images selected from existing public datasets, containing label categories: sky, tree, road, grass, water, building, mountain, and foreground object.

Link: http://dags.stanford.edu/projects/scenedataset.html

2.8 Berkeley Segmentation Dataset(BSD)

The BSD dataset consists of color and grayscale images, 300 in total (later extended to 500). It is split into two parts: 200 training images and 100 test images.

Link: https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/segbench/

2.9 Youtube-Objects

This dataset comes from videos on YouTube. The images are 480x360 pixels, and there are 10167 images in total.

Link: https://data.vision.ee.ethz.ch/cvl/youtube-objects/

2.10 KITTI

KITTI is mainly used for robotics and autonomous driving and contains many video sequences collected with vehicle-mounted sensors. It was not originally intended for semantic segmentation, but thanks to community effort many annotated images are now available and can be used for semantic segmentation.

Link: https://blog.csdn.net/Solomon1558/article/details/70173223

3. Evaluation indicators

3.1 Confusion Matrix

The confusion matrix is also called the error matrix. It is a matrix drawn with the counts of the classes predicted by the model along one axis and the counts of the ground-truth labels along the other.

As shown in the figure above, the matrix summarizes the classification results of a model. The diagonal holds the cases where the prediction agrees with the label, so the accuracy can be computed as the sum of the diagonal divided by the total number of test samples. The larger the values on the diagonal, the better the model's predictions for those classes; everything off the diagonal is a prediction error, so smaller values there are better.

The evaluation metrics in image segmentation are usually variants of pixel accuracy and IoU. For convenience, assume there are k+1 classes (L0 to Lk, including an empty class or background), and let p_ij denote the number of pixels that belong to class i but are predicted as class j. Thus p_ii counts the correctly predicted pixels, while p_ij and p_ji are false positives and false negatives, respectively. The figures below illustrate TP/TN/FP/FN on a segmentation map.

In the figure above, the ground truth is on the left and the prediction (the predicted mask) is on the right. The prediction is divided into four parts: the large white hatched area is the true negative (TN, pixels correctly predicted as background), the red outline marks the false negatives (FN, pixels predicted as background that are actually not background), the blue hatched area is the false positive (FP, pixels assigned to a label that do not actually belong to it), and the fluorescent yellow block in the middle is the true positive (TP, pixels assigned to a label in agreement with the ground truth).

In the above figure, GT is on the left and Prediction is on the right, which is the predicted mask image. On the left is the intersection of the predicted and ground truth masks, and on the right is the union of the predicted and ground truth masks.

3.2 Pixel Accuracy (PA)

Pixel accuracy is the ratio of the number of correctly predicted pixels to the total number of pixels; it corresponds to the accuracy defined above.

In terms of the confusion matrix it can be read off directly: sum of the diagonal elements / sum of all elements of the matrix.

3.3 Category Average Pixel Accuracy (MPA)

Mean pixel accuracy (MPA) computes, for each class, the proportion of its pixels that are correctly classified, and then averages these proportions over all classes.

 3.4  Intersection over Union (IoU)

IoU is the ratio of the intersection to the union of the model's prediction and the ground truth for a given class. For object detection it is computed between the predicted box and the ground-truth box; for image segmentation it is computed between the predicted mask and the ground-truth mask.

 

Calculation formula: Take the calculation of the IoU of the two-category positive example (category 1) as an example.

The intersection is TP, and the union is the sum of TP, FP, and FN, then the calculation formula of IoU is as follows.

IoU = TP / (TP + FP + FN)

3.5 Mean Intersection over Union (MIoU)

Mean IoU (mIoU): the IoU of a single class is computed as the intersection of the predicted region and the ground-truth region divided by their union; the same computation is repeated for every other class, and the per-class IoUs are then averaged.

In other words, it is the average over all classes of the intersection-over-union between the model's prediction and the ground truth.

 

3.6 Frequency Weighted Intersection over Union (FWIoU)

Frequency-weighted intersection-over-union ratio FWIoU is an improvement of MIoU, which sets weights for each class according to its frequency of occurrence.

Among all the metrics mentioned above, MIoU is the most commonly used metric due to its simplicity and representativeness, and most researchers use it to report their results.
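
All of the metrics above can be computed directly from the confusion matrix; a minimal NumPy sketch (the 3-class matrix at the end is made-up example data):

import numpy as np

def segmentation_metrics(hist):
    # hist: (k, k) confusion matrix, hist[i, j] = pixels of true class i predicted as class j
    tp = np.diag(hist)
    pa = tp.sum() / hist.sum()                              # pixel accuracy (PA)
    mpa = np.nanmean(tp / hist.sum(axis=1))                 # mean pixel accuracy (MPA)
    iou = tp / (hist.sum(axis=1) + hist.sum(axis=0) - tp)   # per-class IoU = TP / (TP + FP + FN)
    miou = np.nanmean(iou)                                  # mean IoU (MIoU)
    freq = hist.sum(axis=1) / hist.sum()
    fwiou = (freq * np.nan_to_num(iou)).sum()               # frequency weighted IoU (FWIoU)
    return pa, mpa, miou, fwiou

hist = np.array([[50., 2., 1.],
                 [3., 40., 2.],
                 [0., 1., 30.]])
print(segmentation_metrics(hist))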

4. Loss function

4.1 Log loss

Cross entropy. The binary cross-entropy is L = -[y·log(p) + (1-y)·log(1-p)], where p is the predicted probability and y ∈ {0, 1} is the label.

pytorch code implementation:

import torch
import torch.nn as nn

# Binary cross entropy: the input must first be passed through a sigmoid
input = torch.randn(4, 1, 8, 8)             # raw logits
target = torch.empty(4, 1, 8, 8).random_(2) # binary labels as floats
loss = nn.BCELoss()(torch.sigmoid(input), target)

# Multi-class cross entropy: no Softmax layer is needed before this loss
input = torch.randn(4, 21, 8, 8)            # per-class scores
target = torch.randint(0, 21, (4, 8, 8))    # integer class labels
loss = nn.CrossEntropyLoss()(input, target)

4.2 Dice loss

Dice loss was proposed for the problem of a very small foreground proportion. The Dice coefficient comes from binary classification and essentially measures the overlap of two samples: DSC = 2|X ∩ Y| / (|X| + |Y|).

 Dice Loss = 1 - DSC, pytorch code implementation:

import torch
import torch.nn as nn
 
class DiceLoss(nn.Module):
	def __init__(self):
		super(DiceLoss, self).__init__()
 
	def	forward(self, input, target):
		N = target.size(0)
		smooth = 1
 
		input_flat = input.view(N, -1)
		target_flat = target.view(N, -1)
 
		intersection = input_flat * target_flat
 
		loss = 2 * (intersection.sum(1) + smooth) / (input_flat.sum(1) + target_flat.sum(1) + smooth)
		loss = 1 - loss.sum() / N
 
		return loss
 
class MulticlassDiceLoss(nn.Module):
	"""
	requires one hot encoded target. Applies DiceLoss on each class iteratively.
	requires input.shape[0:1] and target.shape[0:1] to be (N, C) where N is
	  batch size and C is number of classes
	"""
	def __init__(self):
		super(MulticlassDiceLoss, self).__init__()
 
	def forward(self, input, target, weights=None):
 
		C = target.shape[1]
 
		# if weights is None:
		# 	weights = torch.ones(C) #uniform weights for all classes
 
		dice = DiceLoss()
		totalLoss = 0
 
		for i in range(C):
			diceLoss = dice(input[:,i], target[:,i])
			if weights is not None:
				diceLoss *= weights[i]
			totalLoss += diceLoss
 
		return totalLoss

4.3 IOU loss 

IoU loss is somewhat similar to Dice loss; the IoU is IoU = |A ∩ B| / |A ∪ B|.

Soft_IOU_loss pytorch code implementation:

# For the multi-class case; the binary case is simpler.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftIoULoss(nn.Module):
    def __init__(self, n_classes):
        super(SoftIoULoss, self).__init__()
        self.n_classes = n_classes

    @staticmethod
    def to_one_hot(tensor, n_classes):
        n, h, w = tensor.size()
        one_hot = torch.zeros(n, n_classes, h, w).scatter_(1, tensor.view(n, 1, h, w), 1)
        return one_hot

    def forward(self, input, target):
        # logit => N x Classes x H x W
        # target => N x H x W

        N = len(input)

        pred = F.softmax(input, dim=1)
        target_onehot = self.to_one_hot(target, self.n_classes)

        # Numerator Product
        inter = pred * target_onehot
        # Sum over all pixels N x C x H x W => N x C
        inter = inter.view(N, self.n_classes, -1).sum(2)

        # Denominator
        union = pred + target_onehot - (pred * target_onehot)
        # Sum over all pixels N x C x H x W => N x C
        union = union.view(N, self.n_classes, -1).sum(2)

        loss = inter / (union + 1e-16)

        # Return average loss over classes and batch
        return -loss.mean()

4.4  Lovasz-Softmax loss

Lovasz-Softmax loss is a loss designed for direct IoU optimization, proposed at CVPR 2018, and it works remarkably well in competitions. The mathematical derivation is beyond the scope of this post; interested readers can consult the paper. Although it is hard to derive, it is easy to use: in short, it is the Lovász extension applied to the Jaccard (IoU) loss, and it performs better.
In addition, the author notes in a GitHub issue that, because Lovasz-Softmax optimizes an image-level mIoU, training with a small batch size can hurt the commonly reported dataset-level mIoU. The loss is best suited to fine-tuning, and combining it with other losses using appropriate weights gives better results.
The author provides both a binary and a multi-class version of the loss. Personally, I see the computation as three steps:

  • Compute the per-pixel errors: a hinge on the logits in the binary case, and the absolute difference between the predicted probability and the ground truth in the multi-class case;
  • According to the sorting of errors, sort the labels, and then calculate the Jaccard grad (the lovasz_grad function in the code);
  • Combine errors and Jaccard grad to get the desired loss.

pytorch code implementation (excerpted from the author's GitHub):

import torch
from torch.autograd import Variable
import torch.nn.functional as F
import numpy as np

def mean(values, empty=0):
    # simplified stand-in for the `mean` helper in the author's repository,
    # required by the per-image variants of the losses below
    values = list(values)
    if len(values) == 0:
        return empty
    return sum(values) / len(values)

def lovasz_grad(gt_sorted):
    """
    Computes gradient of the Lovasz extension w.r.t sorted errors
    See Alg. 1 in paper
    """
    p = len(gt_sorted)
    gts = gt_sorted.sum()
    intersection = gts - gt_sorted.float().cumsum(0)
    union = gts + (1 - gt_sorted).float().cumsum(0)
    jaccard = 1. - intersection / union
    if p > 1: # cover 1-pixel case
        jaccard[1:p] = jaccard[1:p] - jaccard[0:-1]
    return jaccard
# --------------------------- BINARY LOSSES ---------------------------
def lovasz_hinge(logits, labels, per_image=True, ignore=None):
    """
    Binary Lovasz hinge loss
      logits: [B, H, W] Variable, logits at each pixel (between -\infty and +\infty)
      labels: [B, H, W] Tensor, binary ground truth masks (0 or 1)
      per_image: compute the loss per image instead of per batch
      ignore: void class id
    """
    if per_image:
        loss = mean(lovasz_hinge_flat(*flatten_binary_scores(log.unsqueeze(0), lab.unsqueeze(0), ignore))
                          for log, lab in zip(logits, labels))
    else:
        loss = lovasz_hinge_flat(*flatten_binary_scores(logits, labels, ignore))
    return loss
    
def lovasz_hinge_flat(logits, labels):
    """
    Binary Lovasz hinge loss
      logits: [P] Variable, logits at each prediction (between -\infty and +\infty)
      labels: [P] Tensor, binary ground truth labels (0 or 1)
      ignore: label to ignore
    """
    if len(labels) == 0:
        # only void pixels, the gradients should be 0
        return logits.sum() * 0.
    signs = 2. * labels.float() - 1.
    errors = (1. - logits * Variable(signs))
    errors_sorted, perm = torch.sort(errors, dim=0, descending=True)
    perm = perm.data
    gt_sorted = labels[perm]
    grad = lovasz_grad(gt_sorted)
    loss = torch.dot(F.relu(errors_sorted), Variable(grad))
    return loss
    
def flatten_binary_scores(scores, labels, ignore=None):
    """
    Flattens predictions in the batch (binary case)
    Remove labels equal to 'ignore'
    """
    scores = scores.view(-1)
    labels = labels.view(-1)
    if ignore is None:
        return scores, labels
    valid = (labels != ignore)
    vscores = scores[valid]
    vlabels = labels[valid]
    return vscores, vlabels

# --------------------------- MULTICLASS LOSSES ---------------------------
def lovasz_softmax(probas, labels, classes='present', per_image=False, ignore=None):
    """
    Multi-class Lovasz-Softmax loss
      probas: [B, C, H, W] Variable, class probabilities at each prediction (between 0 and 1).
              Interpreted as binary (sigmoid) output with outputs of size [B, H, W].
      labels: [B, H, W] Tensor, ground truth labels (between 0 and C - 1)
      classes: 'all' for all, 'present' for classes present in labels, or a list of classes to average.
      per_image: compute the loss per image instead of per batch
      ignore: void class labels
    """
    if per_image:
        loss = mean(lovasz_softmax_flat(*flatten_probas(prob.unsqueeze(0), lab.unsqueeze(0), ignore), classes=classes)
                          for prob, lab in zip(probas, labels))
    else:
        loss = lovasz_softmax_flat(*flatten_probas(probas, labels, ignore), classes=classes)
    return loss


def lovasz_softmax_flat(probas, labels, classes='present'):
    """
    Multi-class Lovasz-Softmax loss
      probas: [P, C] Variable, class probabilities at each prediction (between 0 and 1)
      labels: [P] Tensor, ground truth labels (between 0 and C - 1)
      classes: 'all' for all, 'present' for classes present in labels, or a list of classes to average.
    """
    if probas.numel() == 0:
        # only void pixels, the gradients should be 0
        return probas * 0.
    C = probas.size(1)
    losses = []
    class_to_sum = list(range(C)) if classes in ['all', 'present'] else classes
    for c in class_to_sum:
        fg = (labels == c).float() # foreground for class c
        if classes == 'present' and fg.sum() == 0:
            continue
        if C == 1:
            if len(classes) > 1:
                raise ValueError('Sigmoid output possible only with 1 class')
            class_pred = probas[:, 0]
        else:
            class_pred = probas[:, c]
        errors = (Variable(fg) - class_pred).abs()
        errors_sorted, perm = torch.sort(errors, 0, descending=True)
        perm = perm.data
        fg_sorted = fg[perm]
        losses.append(torch.dot(errors_sorted, Variable(lovasz_grad(fg_sorted))))
    return mean(losses)

def flatten_probas(probas, labels, ignore=None):
    """
    Flattens predictions in the batch
    """
    if probas.dim() == 3:
        # assumes output of a sigmoid layer
        B, H, W = probas.size()
        probas = probas.view(B, 1, H, W)
    B, C, H, W = probas.size()
    probas = probas.permute(0, 2, 3, 1).contiguous().view(-1, C)  # B * H * W, C = P, C
    labels = labels.view(-1)
    if ignore is None:
        return probas, labels
    valid = (labels != ignore)
    vprobas = probas[valid.nonzero().squeeze()]
    vlabels = labels[valid]
    return vprobas, vlabels

4.5 Focal loss 

Focal loss was proposed by Kaiming He and his collaborators to deal with the imbalance of training samples. The formula is FL(pt) = -α·(1-pt)^γ·log(pt), where pt is the predicted probability of the true class.

 

Focal loss can be seen as a variant of cross entropy, with the two parameters α and γ designed for the following two problems:

  • Positive and negative samples are unbalanced, such as too many negative samples;
  • There are a large number of simple and easy-to-classify samples.

For the first problem, the natural remedy is to weight the per-class losses: if positive samples are few, the weight of the positive-sample loss is increased, which is the role of α in the focal loss. For the second problem, the parameter γ is introduced. From the formula, when the predicted probability pt of a sample is large, i.e. the sample is easy to classify, (1-pt)^γ becomes small, so the loss of easy samples is scaled down and the model focuses more on optimizing the loss of hard samples.
pytorch code implementation:

import torch
import torch.nn as nn
# --------------------------- BINARY LOSSES ---------------------------
class BinaryFocalLoss(nn.Module):  # renamed from FocalLoss so it does not clash with the multi-class version below
    def __init__(self, alpha=0.25, gamma=2, weight=None, ignore_index=255):
        super(FocalLoss, self).__init__()
        self.alpha = alpha
        self.gamma = gamma
        self.weight = weight
        self.ignore_index = ignore_index
        self.bce_fn = nn.BCEWithLogitsLoss(weight=self.weight)

    def forward(self, preds, labels):
        if self.ignore_index is not None:
            mask = labels != self.ignore_index
            labels = labels[mask]
            preds = preds[mask]

        logpt = -self.bce_fn(preds, labels)
        pt = torch.exp(logpt)
        loss = -((1 - pt) ** self.gamma) * self.alpha * logpt
        return loss
# --------------------------- MULTICLASS LOSSES ---------------------------
class FocalLoss(nn.Module):
    def __init__(self, alpha=0.5, gamma=2, weight=None, ignore_index=255):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma
        self.weight = weight
        self.ignore_index = ignore_index
        self.ce_fn = nn.CrossEntropyLoss(weight=self.weight, ignore_index=self.ignore_index)

    def forward(self, preds, labels):
        logpt = -self.ce_fn(preds, labels)
        pt = torch.exp(logpt)
        loss = -((1 - pt) ** self.gamma) * self.alpha * logpt
        return loss

5. Activation functions

The activation function is used to add nonlinear factors, because the expressive power of the linear model is not enough. The introduction of nonlinear activation functions can make the expressive ability of deep neural networks more powerful.

 5.1 sigmoid activation function

Sigmoid is also called the logistic activation function. It squashes real values into the range 0 to 1 and can be used in the output layer to predict probabilities: large negative numbers map towards 0 and large positive numbers map towards 1. The mathematical formula is σ(x) = 1 / (1 + e^(-x)).

 The figure below shows the graph of the sigmoid function and its derivatives:


Generally speaking, when training neural networks that require smooth, repeatedly differentiable functions or that handle binary classification, the sigmoid activation is commonly used, because it smoothly maps the real line to [0, 1] and its output can be interpreted as the probability of the positive class (probabilities range from 0 to 1).
In addition, the sigmoid function is monotonically increasing and continuously differentiable, with a very simple derivative. For multi-class problems, however, sigmoid falls short.
Also, the output of sigmoid is always greater than 0, so it is not zero-mean; this is called the offset phenomenon, and it causes neurons in the next layer to receive signals with a non-zero mean as input.

Advantages:
(1) The output mapping of the Sigmoid function is between (0,1), monotonously continuous, the output range is limited, the optimization is stable, and it can be used as the output layer.
(2) Derivation is easy.
Disadvantages:
(1) Due to its soft saturation, it is easy to produce gradient disappearance, which leads to problems in training.
(2) Its output is not centered on 0.

5.2 tanh activation function

Tanh is also a very common activation function. It is actually a variant of the sigmoid function. The tanh function is defined by the following formula:

tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Derivative:

tanh'(x) = 1 - tanh²(x)

 


 

The Tanh activation function is also called the hyperbolic tangent. It solves the non-zero-centered output problem of the sigmoid function. Like sigmoid, tanh takes a real-valued input, but it squashes it into the interval -1 to 1; because the interval is symmetric around zero, its output is zero-centered. You can think of tanh as two sigmoid functions put together. In practice tanh is usually preferred over sigmoid: negative inputs map to negative values, inputs near zero map close to zero, and positive inputs map to positive values. Its only drawback: tanh also suffers from the vanishing-gradient problem, so it also "kills" gradients when it saturates.

In order to solve the problem of gradient disappearance, let's discuss another nonlinear activation function - rectified linear unit (ReLU), which is significantly better than the previous two functions and is now the most widely used function.

5.3 ReLU activation function

Mathematical formula:

f(x) = max(0, x)

Function graph and its derivative graph:


When the input x<0, the output is 0, and when x>0, the output is x. This activation function makes the network converge more quickly. It doesn't saturate, i.e. it resists the vanishing gradient problem, at least in the positive region (x > 0), so the neuron doesn't backpropagate all zeros in at least half of the region. ReLU is computationally efficient due to the simple thresholding used. But ReLU neurons also have some disadvantages:

  • Not centered on zero: Similar to the Sigmoid activation function, the output of the ReLU function is not centered on zero.
  • During the forward pass, if x < 0, the neuron remains inactive and "kills" the gradient during the backward pass. This way the weights cannot be updated and the network cannot learn. When x = 0, the gradient at that point is undefined, but this is resolved in the implementation by taking the gradient to the left or right.

To solve the vanishing gradient problem in the ReLU activation function, when x < 0, we use Leaky ReLU - this function tries to fix the dead ReLU problem. Let's take a closer look at Leaky ReLU.

5.4 Leaky ReLU activation function

Mathematical formula:

f(x) = max(0.1x, x), i.e. f(x) = x for x ≥ 0 and f(x) = 0.1x for x < 0

 Function image:


The idea of Leaky ReLU is: when x < 0, the function has a small positive slope (0.1 here) instead of being flat. This alleviates the dying-ReLU problem to some extent, although results with it are not always consistent. It retains all the other characteristics of ReLU, such as computational efficiency, fast convergence, and no saturation in the positive region.

Leaky ReLU can be extended further: instead of multiplying x by a fixed constant in the negative region, x is multiplied by a learnable parameter, which often works better than Leaky ReLU. This extension is Parametric ReLU.

5.5 Parametric ReLU activation function

f(x) = max(αx, x)

where α is a parameter of the negative region. Because it can be learned through backpropagation, each neuron can choose the best slope for the negative region and can effectively become a ReLU or a Leaky ReLU.
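
A short PyTorch sketch of the five activations side by side (the Leaky ReLU slope of 0.1 matches the text above; the PReLU initialization is the library default):

import torch
import torch.nn as nn

x = torch.linspace(-3, 3, 7)

print(torch.sigmoid(x))          # squashes to (0, 1)
print(torch.tanh(x))             # squashes to (-1, 1), zero-centered
print(torch.relu(x))             # max(0, x)
print(nn.LeakyReLU(0.1)(x))      # small fixed slope of 0.1 for x < 0
print(nn.PReLU()(x))             # the negative-region slope is a learnable parameter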

In conclusion, it is better to use ReLU, but you can experiment with Leaky ReLU or Parametric ReLU to see if they are better for your problem.
 

 

 


Origin blog.csdn.net/m0_70140421/article/details/129131684