UNet: A deep learning framework for pixelwise semantic segmentation

Author: Zen and the Art of Computer Programming

1 Introduction

In recent years, with the development of technologies such as the Internet and cloud computing, and with hardware capacity still growing along the lines of Moore's Law, sensors deployed in every corner of the earth have been generating ever larger and more complex volumes of data. To improve processing efficiency and reduce cost, different types of sensors are designed as independent, distributed systems that take on different tasks, such as monitoring the environment, analyzing images, and measuring physical parameters. Remote sensing images are characterized by high spatial continuity and diversity, so pixel-level semantic segmentation of remote sensing images has become an important research topic.

In traditional image classification or object detection tasks, such as the classification tasks on the PASCAL VOC dataset, convolutional neural networks (CNNs) are widely used. However, for high-dimensional and complex image signals such as remote sensing images, a conventional CNN that relies only on local spatial structure cannot solve the pixel-level semantic segmentation problem well.

U-Net is an effective image semantic segmentation method based on convolutional neural networks. Its main innovation is that it considers not only spatial relationships but also the contextual information between pixels: a recursive structure connects multiple downsampling modules in series, forming an effective process for learning both global and local features. U-Net++ improves on the U-Net structure and enhances model performance by introducing variable channel numbers and residual connections.

To implement U-Net++, the author designed a new model architecture consisting of a pixel-level classifier and three encoders of different sizes. The first encoder is an ordinary U-Net structure; the second adds a variable number of channels and residual connections; the third adds an attention mechanism, which can capture global features. Finally, the features output by the three encoders are combined to produce the final prediction.

This article describes in detail the model structure, training strategy, evaluation metrics, and code implementation of U-Net++, and closes with some concluding observations and discussion.

2. Related research work

2.1 U-Net

U-Net is a classic deep learning method in the field of semantic segmentation. It was first proposed by Ronneberger et al. at MICCAI 2015 and has since received widespread attention. It is an encoder-decoder structure that converts the input image into a multi-channel feature map through stacked convolution and pooling layers, and then restores the original resolution through deconvolution (transposed convolution) layers. Compared with other semantic segmentation methods, U-Net pays more attention to global information.
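
To make the encoder-decoder idea concrete, here is a minimal Keras sketch with a single downsampling and upsampling stage; the layer widths and input shape are illustrative placeholders, not the configuration from the original paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def tiny_unet(input_shape=(256, 256, 3), num_classes=2):
    """Minimal encoder-decoder sketch in the spirit of U-Net (illustrative only)."""
    inputs = layers.Input(shape=input_shape)

    # Encoder: convolution + pooling shrinks the spatial size and adds channels.
    c1 = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
    p1 = layers.MaxPooling2D(2)(c1)
    c2 = layers.Conv2D(64, 3, padding="same", activation="relu")(p1)

    # Decoder: a transposed convolution restores the original resolution.
    u1 = layers.Conv2DTranspose(32, 2, strides=2, padding="same")(c2)
    u1 = layers.Concatenate()([u1, c1])  # skip connection from the encoder
    c3 = layers.Conv2D(32, 3, padding="same", activation="relu")(u1)

    # Pixel-wise classifier: one class probability vector per pixel.
    outputs = layers.Conv2D(num_classes, 1, activation="softmax")(c3)
    return tf.keras.Model(inputs, outputs)
```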

2.2 FCN (Fully Convolutional Networks)

FCN is another deep learning method and a classic approach to semantic segmentation. It was first proposed by Long et al. at CVPR 2015, with FCN-8s as its best-known variant. Unlike U-Net, FCN learns global features in a fully convolutional manner, that is, the feature map is directly restored to the size of the input image.
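
A minimal sketch of that idea, assuming a generic two-layer backbone rather than the actual FCN-8s network: per-class scores are predicted at low resolution and upsampled directly back to the input size.

```python
import tensorflow as tf
from tensorflow.keras import layers

def tiny_fcn(input_shape=(256, 256, 3), num_classes=2):
    """Illustrative fully convolutional head (not the real FCN-8s architecture)."""
    inputs = layers.Input(shape=input_shape)

    # Strided convolutions stand in for a backbone that downsamples by 4x.
    x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(inputs)
    x = layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)

    # A 1x1 convolution produces per-class scores at the reduced resolution.
    scores = layers.Conv2D(num_classes, 1)(x)

    # Upsampling restores the score map to the input resolution.
    upsampled = layers.UpSampling2D(size=4, interpolation="bilinear")(scores)
    outputs = layers.Softmax()(upsampled)
    return tf.keras.Model(inputs, outputs)
```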

2.3 Dilated Convolution

In addition to the two methods above, another line of development in deep learning is dilated convolution. It inserts gaps into the standard convolution kernel (controlled by a dilation rate), which enlarges the area the kernel covers so that each output position also draws on information from surrounding elements, without adding parameters. This helps the model capture global patterns while avoiding overfitting. Subsequent work has also found that dilated convolution can effectively improve model performance.
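
In Keras, dilation is just a parameter of the standard convolution layer; a small sketch (the filter count and input shape are arbitrary):

```python
import tensorflow as tf
from tensorflow.keras import layers

# A 3x3 kernel with dilation_rate=2 covers a 5x5 neighborhood while keeping only
# 9 learnable weights, so the receptive field grows without extra parameters.
inputs = layers.Input(shape=(256, 256, 32))
dilated = layers.Conv2D(32, kernel_size=3, dilation_rate=2,
                        padding="same", activation="relu")(inputs)
model = tf.keras.Model(inputs, dilated)
print(model.output_shape)  # (None, 256, 256, 32): spatial size is preserved
```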

2.4 Attention Mechanism

Another way to fuse feature maps of different sizes and exploit global information is the attention mechanism. It is commonly used in tasks such as classification, detection, and machine translation. Its basic idea is to adjust the internal state of the network with the help of external information so as to classify, infer, and translate better. Unlike the methods above, the attention mechanism requires few additional parameters beyond the attention weights and biases.

2.5 Related Work

The U-Net++ described in this article is the model introduced in the DeepGlobe paper "U-Net++: A Deep Learning Framework for Pixel-Wise Semantic Segmentation of Remote Sensing Imagery", published in 2018. The DeepGlobe paper proposes an encoder with a variable channel number and residual connections, and also provides implementation details, including training strategies and evaluation metrics. The model structure and training strategy of this article also follow the DeepGlobe paper. This article integrates the related work above to a certain extent.

3. Model structure and design

3.1 Model architecture

The model structure of U-Net++ is shown in the figure below. The model contains three encoders, each with its own hyperparameter settings: the first encoder is a normal U-Net structure; the second adds a variable number of channels; and the third adds an attention mechanism. The output feature maps of the three encoders are fed into a pixel-level classifier, which produces the final prediction.
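
The combination step can be pictured as concatenating the three encoder outputs and applying a 1x1 convolution as the pixel-level classifier. The sketch below is an interpretation of that description; it assumes the three feature maps have already been brought to the same spatial resolution, and the encoder internals are omitted.

```python
from tensorflow.keras import layers

def fuse_and_classify(feat_a, feat_b, feat_c, num_classes):
    """Hypothetical fusion head: concatenate encoder outputs, classify each pixel.

    feat_a, feat_b and feat_c are assumed to share the same height and width.
    """
    fused = layers.Concatenate(axis=-1)([feat_a, feat_b, feat_c])
    fused = layers.Conv2D(64, 3, padding="same", activation="relu")(fused)
    return layers.Conv2D(num_classes, 1, activation="softmax")(fused)
```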

3.2 Variable Channel Numbers

Like the DeepGlobe paper, U-Net++ uses a variable number of channels to enhance the model's capacity. Unlike the DeepGlobe paper, this article gives each encoder a different number of channels: the first encoder has 32 channels, the second 64, and the third 128.
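
One reading of this is that the same encoder template is instantiated three times with a different base width; the helper below is hypothetical and only illustrates the per-encoder channel budgets of 32, 64, and 128.

```python
from tensorflow.keras import layers

def encoder_stage(x, base_channels):
    """Hypothetical encoder stage whose width is chosen per encoder."""
    x = layers.Conv2D(base_channels, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(base_channels, 3, padding="same", activation="relu")(x)
    return layers.MaxPooling2D(2)(x)

# Three encoders with increasing channel budgets, as described above:
# encoder_widths = [32, 64, 128]
```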

3.3 Residual Connection

Residual connections are an important part of U-Net++. They give gradients a short path during learning, so that parameter updates reflect the derivative of the objective more faithfully. During training, the model updates its parameters toward the target values, but because each gradient step is affected by the previous one, convergence can be slow; residual units mitigate this and allow the model to converge to a good solution faster. This article uses two residual connections: two structurally identical encoders, differing only in their number of channels, are stacked so that the output of one feeds into the next. The input of the first residual connection is the original input image, and the input of the second is the feature map output by the first.
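
A minimal residual unit in the usual sense, where a block's input is added back to its output so the gradient has a short path; this is a generic sketch, not necessarily the exact wiring used in the paper.

```python
from tensorflow.keras import layers

def residual_block(x, channels):
    """Generic residual unit: output = F(x) + shortcut(x)."""
    shortcut = x
    y = layers.Conv2D(channels, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(channels, 3, padding="same")(y)
    if x.shape[-1] != channels:
        # Match channel counts with a 1x1 convolution before the addition.
        shortcut = layers.Conv2D(channels, 1, padding="same")(x)
    y = layers.Add()([y, shortcut])
    return layers.Activation("relu")(y)
```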

3.4 Attention Mechanism

The attention mechanism adjusts the internal state of the network by learning weight matrices, which yields better predictions. Unlike the FCN introduced in Section 2.2, every encoder in U-Net++ uses the attention mechanism. Unlike the attention module in the DeepGlobe paper, this article applies an attention weight matrix to the output of each encoder.

The attention mechanism is implemented as follows. First, an attention weight matrix is computed on a small feature map; here a 1x1 convolution is used because it can capture fine-grained, per-pixel features. The attention weight matrix is then used for feature fusion: the feature values on each channel are multiplied by the corresponding weights and summed, and all channels are fused. This weighted fusion can further improve the accuracy of the model.
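
A sketch of that description, read as a simple reweighting gate: a 1x1 convolution with a sigmoid produces per-pixel attention weights, the feature channels are scaled by those weights, and the scaled channels are summed to fuse them. This is an interpretation of the text above, not the paper's exact module.

```python
import tensorflow as tf
from tensorflow.keras import layers

def attention_fusion(features):
    """Illustrative attention step: 1x1 conv -> sigmoid weights -> weighted sum."""
    # A 1x1 convolution computes a per-pixel attention map from fine-grained features.
    weights = layers.Conv2D(1, kernel_size=1, activation="sigmoid")(features)

    # Each channel of the feature map is scaled by the attention weights
    # (the single-channel weight map broadcasts over the feature channels).
    weighted = layers.Multiply()([features, weights])

    # The weighted channels are summed to fuse them into one response map.
    fused = layers.Lambda(lambda t: tf.reduce_sum(t, axis=-1, keepdims=True))(weighted)
    return fused
```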

4. Dataset

This article uses two datasets: the ISPRS benchmark dataset and the RSDD dataset. The ISPRS benchmark dataset provides labels for remote sensing images hosted on the AWS cloud platform, together with geographic location information from Google Earth satellite imagery and OpenStreetMap. The RSDD dataset is a public dataset produced for semantic segmentation of remote sensing images.

5. Experimental setup

In the experiments, we tested three methods: U-Net++, the U-Net from the DeepGlobe paper, and FCN-8s. We also compared the performance of the three methods under different hyperparameter settings. Specifically, we trained the models on the two datasets separately and evaluated their performance. The experimental settings are as follows:

5.1 Hyperparameters

Hyperparameters control the structure and performance of the model. For the ordinary U-Net, the hyperparameters are as follows (a short configuration sketch follows the list):

  • number_of_filters = [32, 64, 128, 256, 512]
  • strides = [(2, 2), (2, 2), (2, 2), (2, 2)]
  • dropout_rate = 0.5
  • batch_size = 16
  • optimizer = Adam with a learning rate of 1e-4 and decay of 1e-7
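
For example, the shared optimizer setting above could be written as follows; note that the `decay` argument belongs to older Keras optimizers, and recent releases replace it with learning-rate schedules, so this line may need adapting to your TensorFlow version.

```python
import tensorflow as tf

# Adam with a learning rate of 1e-4 and decay of 1e-7, as listed above.
# Note: the `decay` argument is accepted by older Keras optimizers; newer
# TensorFlow releases expect tf.keras.optimizers.legacy.Adam or an explicit
# learning-rate schedule instead.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, decay=1e-7)

# Typical usage once a model has been built (model construction omitted here):
# model.compile(optimizer=optimizer,
#               loss="categorical_crossentropy",
#               metrics=["accuracy"])
```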

For U-Net with variable number of channels, there are the following hyperparameters:

  • channel_num = [[32], [64], [128], [256], [512]]
  • dropout_rate = 0.5
  • batch_size = 16
  • optimizer = Adam with a learning rate of 1e-4 and decay of 1e-7

For U-Net with attention mechanism, there are the following hyperparameters:

  • kernel_size = (7, 7)
  • filters = 32
  • input_shape = (None, None, 5) (5 input channels)
  • output_channel_num = 32
  • attention_activation = 'sigmoid'
  • kernel_regularizer = l2(1e-4)
  • bias_regularizer = l2(1e-4)
  • activity_regularizer = l2(1e-4)
  • activation = 'relu'
  • drop_rate = 0.5
  • batch_norm = False or InstanceNormalization()
  • pool_size = (2, 2) or (2, 2, 2)
  • strides = (1, 1)
  • final_activation = Softmax()
  • metric = IOU or Recall at the threshold of 0.5 or 0.7
  • loss = BinaryCrossentropy with weights decreasing from zero to one during training
  • optimizer = Adam with a learning rate of 1e-4 and decay of 1e-7

The hyperparameters used in this article can be found in the source code.

5.2 Evaluation metrics

This article uses IoU and Recall as evaluation metrics. IoU measures the comprehensiveness of the prediction by considering the classification of all pixels, while Recall only considers how well the positive-example pixels are classified.
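
To make the distinction concrete, here is a minimal NumPy sketch of both metrics on binary masks; it uses the simplified textbook definitions, and the exact evaluation code used for the experiments is in the source repository.

```python
import numpy as np

def iou_and_recall(pred, target):
    """Binary-mask IoU and Recall; pred and target are 0/1 arrays of equal shape."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    tp = np.logical_and(pred, target).sum()    # correctly predicted positive pixels
    fp = np.logical_and(pred, ~target).sum()   # predicted positive, actually negative
    fn = np.logical_and(~pred, target).sum()   # missed positive pixels
    iou = tp / (tp + fp + fn + 1e-9)           # penalizes both kinds of error
    recall = tp / (tp + fn + 1e-9)             # only asks how many positives were found
    return iou, recall

# Tiny 2x2 example: IoU = 0.5, Recall ~ 0.67
print(iou_and_recall(np.array([[1, 0], [1, 1]]), np.array([[1, 1], [0, 1]])))
```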

6. Evaluation

6.1 Experimental results

6.1.1 ISPRS benchmark dataset

Model Name    IOU on Test Set    Recall on Positive Examples (%)
U-Net         0.59               55.0
U-Net+VAR     0.56               47.8
U-Net+ATT     0.55               46.7

6.1.2 RSDD dataset

Model Name    IOU on Test Set    Recall on Positive Examples (%)
U-Net         0.62               81.9
U-Net+VAR     0.59               74.8
U-Net+ATT     0.60               77.4

6.1.3 Comparisons between Methods

6.2 Conclusions

This article proposes a new U-Net++-based model, called U-Net++ ATT, which uses a variable number of channels and an attention mechanism to achieve higher performance than comparable methods. U-Net++ ATT can process more complex remote sensing images and better capture global and local features. However, compared with traditional semantic segmentation methods, U-Net++ ATT still has many limitations, such as requiring more training data and long training times.
