Image segmentation based on deep learning

Abstract
Remote sensing image segmentation is the process of classifying high-resolution images obtained by remote sensing technology at the pixel level in order to extract the different objects or features they contain. This process is of great significance for remote sensing applications because it can extract ground objects and surface features that actually exist on the ground, such as rivers, roads, buildings, vegetation and water bodies. Image segmentation can provide practical information for ground cover classification, land use and cover change analysis, urban planning, agricultural resource monitoring, environmental protection and other fields.
This paper first explains the principles of remote sensing image segmentation and deep learning, builds remote sensing image segmentation models with the FCN-32s and FCN-8s fully convolutional networks, and trains and tests them on the ISPRS Vaihingen dataset. After testing, the three evaluation indicators of the FCN-32s model, mean F1, mIOU and OA, are 80.45%, 70.81% and 82.32%, respectively; those of the FCN-8s model are 79.24%, 70.35% and 83.02%, respectively. This paper then builds a remote sensing image segmentation model with the U-Net network; after testing, the mean F1, mIOU and OA of the U-Net model are 82.10%, 72.35% and 84.56%, respectively.
Keywords: remote sensing image segmentation; U-Net; fully convolutional network


Contents
1 Introduction
1.1 Background and Research Significance
1.2 Research Status at Home and Abroad
1.2.1 Clustering-Based Methods
1.2.2 Segmentation-Based Methods
1.2.3 Deep-Learning-Based Methods
2 Fundamentals of Remote Sensing Image Segmentation and Deep Learning
2.1 Remote Sensing Image Analysis Based on Deep Learning
2.2 Commonly Used Datasets for Remote Sensing Image Segmentation
2.2.1 The SIRI-WHU Dataset
2.2.2 The WHU-RS19 Dataset
2.2.3 The GID Dataset
2.2.4 The ISPRS Vaihingen Dataset
2.3 Data Preprocessing
2.4 Fundamentals of Convolutional Neural Networks
2.4.1 Convolutional Layers
2.4.2 Activation Functions
2.4.3 Pooling Layers
3 Remote Sensing Image Segmentation Based on Fully Convolutional Networks
3.1 Overview of Fully Convolutional Networks
3.2 Fully Convolutional Network Structure
3.3 Loss Function
3.4 FCN Network Structure
3.5 Fully Convolutional Network Model Construction and Training
4 Remote Sensing Image Segmentation Based on U-Net
4.1 Overview of U-Net
4.2 U-Net Network Structure
4.3 U-Net Training and Structural Analysis
5 Summary

Introduction
Background of the topic and research significance
Remote sensing image segmentation is the process of classifying high-resolution images obtained by remote sensing technology at the pixel level and extracting the different objects or features they contain. This process is of great significance for remote sensing applications because it can extract ground objects and surface features that actually exist on the ground, such as rivers, roads, buildings, vegetation and water bodies. Image segmentation can provide practical information for ground cover classification, land use and cover change analysis, urban planning, agricultural resource monitoring, environmental protection and other fields.
In recent years, convolutional neural networks (CNNs) have made important progress in the field of image segmentation. Among them, U-Net is a commonly used deep learning model: a fully convolutional network (FCN) architecture built on CNNs that is widely used in medical image segmentation, remote sensing image segmentation and other fields. By introducing skip connections, U-Net combines encoder feature maps with the upsampled feature maps of the decoder, which enables it to capture both high-level features and low-level detail, improving the accuracy and robustness of remote sensing image segmentation.
In terms of remote sensing image segmentation, U-Net has been applied to the classification of various types of ground objects, such as buildings, water bodies, vegetation, roads, etc. In addition, the image segmentation method based on U-Net has many advantages, such as automatically learning features, good segmentation effect and generalization ability, etc.
Therefore, the remote sensing image segmentation method based on U-Net has broad application prospects in land use, urban planning, environmental protection, weather forecasting and other fields. In remote sensing image segmentation in particular, characteristics such as high resolution, large data volume and multispectral bands mean that traditional methods based on hand-crafted rules and features are no longer adequate, while deep learning algorithms are well suited to the task. The remote sensing image segmentation method based on U-Net is therefore of great significance to research and practice in the field of remote sensing applications.
Research status at home and abroad
Clustering-based method
Clustering-based methods are among the most commonly used methods in remote sensing image segmentation. The basic idea is to cluster pixels into different categories based on similarity and then assign pixels of the same category to the same region. The K-means algorithm is one of the most commonly used clustering methods: it divides pixels into K categories, with each pixel belonging to the category whose center is closest to it. In addition, the Fuzzy C-Means (FCM) algorithm and the Possibilistic C-Means (PCM) algorithm are also used in remote sensing image segmentation. These methods are similar to K-means but are more robust to noise and outliers.
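As a concrete illustration, the sketch below clusters pixels by spectral similarity with K-means; the use of scikit-learn and the choice of six clusters are assumptions for demonstration, not details taken from this thesis.

import numpy as np
from sklearn.cluster import KMeans

def kmeans_segment(image: np.ndarray, k: int = 6) -> np.ndarray:
    """Cluster an (H, W, C) image into k spectral classes; returns an (H, W) label map."""
    h, w, c = image.shape
    pixels = image.reshape(-1, c).astype(np.float32)  # one row of band values per pixel
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(pixels)
    return labels.reshape(h, w)  # pixels sharing a cluster form one category

Because the clustering uses only spectral values, spatially separate objects of the same material fall into the same category, which is consistent with the sensitivity to noise and outliers discussed above.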
Researchers at home and abroad are exploring clustering-based methods, but there are also different focuses and directions. Foreign researchers pay more attention to the improvement of clustering algorithms, such as improving the convergence of the K-means algorithm, parameter selection of the Fuzzy C-Means algorithm, etc. Domestic researchers, however, pay more attention to the application of algorithms, such as application in high-resolution remote sensing image segmentation, remote sensing image segmentation based on wavelet transform, etc.
Segmentation-based method
Segmentation-based methods divide the image into several regions and then further process each region. Among them, the region growing method is a commonly used region-based approach: starting from a set of seed pixels, adjacent pixels are added to the region until no more can be added. In addition, edge-based methods are also widely used in remote sensing image segmentation, such as edge detection based on the Canny algorithm and image segmentation based on edge features. A minimal region-growing sketch is given below.
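The sketch assumes a single-band (grayscale) image; the seed point and intensity threshold are illustrative parameters.

from collections import deque
import numpy as np

def region_grow(gray: np.ndarray, seed: tuple, thresh: float = 10.0) -> np.ndarray:
    """Grow a region from `seed`, absorbing 4-connected neighbors close to the region mean."""
    h, w = gray.shape
    mask = np.zeros((h, w), dtype=bool)
    mask[seed] = True
    queue = deque([seed])
    total, count = float(gray[seed]), 1  # running sum and size for the region mean
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and not mask[ny, nx] \
                    and abs(float(gray[ny, nx]) - total / count) <= thresh:
                mask[ny, nx] = True
                total += float(gray[ny, nx])
                count += 1
                queue.append((ny, nx))
    return mask  # True marks the grown region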
Domestic and foreign researchers are exploring segmentation-based methods, but with different research directions. Foreign researchers pay more attention to theory, such as the selection of segmentation criteria and the optimization of segmentation algorithms, while domestic researchers focus more on practical applications, such as remote sensing image classification based on region growing and remote sensing image analysis based on multi-scale segmentation.
Deep-learning-based methods
In recent years, deep learning, and convolutional neural networks (CNNs) in particular, has been widely applied to remote sensing image segmentation. CNN-based remote sensing image segmentation mainly follows two approaches: the fully convolutional network (FCN) and the convolutional neural network with cascaded segmenters (CNN-Cascade). FCN was one of the first methods to apply deep learning to remote sensing image segmentation; its main idea is to replace fully connected layers with convolutional layers so that input images of any size can be segmented. CNN-Cascade is a newer segmentation method that achieves high-precision results by cascading multiple convolutional neural networks. Domestic and foreign researchers are studying deep-learning-based methods in depth, but again with different emphases: foreign researchers concentrate on improving the models themselves, such as refining network structures and introducing attention mechanisms, while domestic researchers concentrate on applications, such as building recognition from multi-source remote sensing images and crop classification based on deep learning.
Fundamentals of remote sensing image segmentation and deep learning
Remote sensing image analysis based on deep learning
With the vigorous development of deep learning, the field of remote sensing image processing and analysis has also developed rapidly. The three basic tasks in computer vision are classification, detection and segmentation, among which target detection has long been a popular research area. With the rapid development of modern information technology, ground object detection in remote sensing imagery has become particularly important; it plays a key role in drones, security, military navigation and aerial reconnaissance, as shown in Figure 2-1. Many detection algorithms exist for natural scene images, including single-stage detectors such as SSD and YOLO, two-stage detectors such as Faster R-CNN and Mask R-CNN, and other strong detectors such as FCOS and CenterNet. In practical remote sensing applications, however, an algorithm needs not only high recognition accuracy but also acceptable model size and speed.

Figure 2-1 Target detection in remote sensing images
To address these problems, researchers have proposed specialized algorithms. Yang et al. designed an algorithm combining residual networks and super-vector coding to detect aircraft targets efficiently. To improve the accuracy of aircraft localization in remote sensing images, Xu et al. applied feature fusion to a fully convolutional neural network. Liu et al. added CBL operations to the YOLOv3 network to extract features more effectively, enabling aerial vehicle detection. These algorithms were proposed to solve problems in specific domains and have high practical value.
Semantic segmentation of remote sensing images is more challenging than that of ordinary images, mainly for two reasons. First, remote sensing images usually have higher resolution, which increases the difficulty of segmentation. Second, land cover types are numerous and often similar in appearance. In recent years, with the development of deep learning, semantic segmentation of remote sensing images based on deep learning has become a research hotspot, and several effective methods have been proposed. For example, the CxtHGNet network uses stacked hourglass modules and intermediate supervision to extract rich multi-scale features, while the HSN network replaces conventional convolutional layers with combined inception modules to obtain multi-scale information. Methods that connect feature maps carrying global context are also widely used, achieving remarkable results through the fusion of local and global information.
Commonly used data sets for remote sensing image segmentation
SIRI-WHU data set
The SIRI-WHU dataset was designed by the RS-IDEA Group at Wuhan University. It contains 2,400 remote sensing images, each 200×200 pixels, covering urban areas in China. The dataset spans 12 scene categories with 200 images per category, including cities, forests, fields and others. The images come from Google Earth and can be used for research in remote sensing image classification, target detection, semantic segmentation and other fields. Example images are shown in Figure 2-2.

Figure 2-2 Example pictures of the SIRI-WHU data set
WHU-RS19 data set
The WHU-RS19 dataset contains a total of 1,005 remote sensing images collected from Google satellite imagery, each 600×600 pixels. The dataset covers locations around the world and contains 19 scene categories, making it diverse and challenging. Figure 2-3 shows some scenes from the WHU-RS19 dataset.

Figure 2-3 Example images of WHU-RS19 data set
GID data set
The Gaofen Image Dataset (GID) is composed of remote sensing images collected by the Gaofen-2 satellite, covering land areas in more than 60 cities in China. GID is commonly used for large-scale land use and land cover classification. The dataset contains 150 images, each 6908×7300 pixels, covering a land area of more than 50,000 square kilometers. Figure 2-4 shows some images from this dataset.

Figure 2-4 Schematic diagram of the GID data set
ISPRS Vaihingen data set
The remote sensing dataset used in this work is the Vaihingen dataset created by the International Society for Photogrammetry and Remote Sensing (ISPRS). It was captured by aircraft over the German town of Vaihingen and is dominated by vegetation and trees, with relatively few buildings. The Vaihingen dataset is a widely used large-scale dataset containing 33 images of 2494×2064 pixels; each image is annotated with six categories (impervious surfaces, buildings, low vegetation, trees, cars and clutter). Each image contains three bands, corresponding to near-infrared (IR), red (R) and green (G). This work uses the designated 16 images for training and the remaining 17 for testing. Some images from the ISPRS Vaihingen dataset are shown in Figure 2-5.

Figure 2-5 Some images of the ISPRS Vaihingen data set
Data preprocessing
The remote sensing imagery used in this study comes from ISPRS; the experiments are conducted mainly on the Vaihingen dataset. The dataset contains high-resolution remote sensing images with an average size of approximately 2500×2500 pixels. Owing to hardware limitations, the original images need to be cropped, and data augmentation is applied during preprocessing to improve the generalization of the network and the training results.
To make training feasible and use GPU memory sensibly, the original high-resolution remote sensing images were cropped. Considering the consistency and completeness of the data distribution, a sliding-window cropping scheme was chosen instead of random cropping. The cropped images are 512×512 in RGB format with a 40% overlap between adjacent crops, and the dataset is split into training and test sets at a 7:3 ratio. Figure 2-6 shows a cropped image and its corresponding labels, and a sketch of the cropping scheme follows the figure.

Figure 2-6 Schematic diagram of the cropped Vaihingen data set
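The sketch below implements the sliding-window cropping just described: 512×512 tiles with 40% overlap, i.e. a stride of about 307 pixels. The exact rounding of the stride and the handling of image borders are simplifying assumptions.

import numpy as np

def sliding_window_crop(image: np.ndarray, tile: int = 512, overlap: float = 0.4):
    """Cut an (H, W, C) image into overlapping tile x tile patches."""
    stride = int(tile * (1.0 - overlap))  # 512 * 0.6 = 307 pixels
    h, w = image.shape[:2]
    patches = []
    for y in range(0, max(h - tile, 0) + 1, stride):
        for x in range(0, max(w - tile, 0) + 1, stride):
            patches.append(image[y:y + tile, x:x + tile])
    return patches  # border remainders smaller than one tile are skipped here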
Among classic image filtering algorithms, Gaussian filtering (Gaussian blur) is an effective way to remove Gaussian noise. The algorithm computes a new value for each pixel from a Gaussian-weighted combination of its neighbors, so each output pixel is influenced by the pixels around it. In short, the algorithm smooths the image and blurs fine detail.
Histogram equalization improves contrast by adjusting the histogram of an image. When the pixel distribution is relatively uniform, histogram equalization can enhance local contrast without affecting overall contrast, spreading brightness more evenly across the histogram. In this paper, histogram equalization is applied to the cropped images, together with rotation, translation and similar operations for data augmentation, as sketched below.
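The sketch uses OpenCV; applying equalization per channel and the specific kernel size and rotation angle are implementation choices assumed for illustration.

import cv2
import numpy as np

def preprocess(img: np.ndarray) -> np.ndarray:
    """Gaussian filtering, per-channel histogram equalization, and a rotation."""
    img = cv2.GaussianBlur(img, (5, 5), 0)  # smooth away Gaussian noise
    img = cv2.merge([cv2.equalizeHist(c) for c in cv2.split(img)])  # contrast
    h, w = img.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), 90, 1.0)  # 90-degree rotation
    return cv2.warpAffine(img, m, (w, h))  # augmentation by rotation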
Basics of Convolutional Neural Networks
Convolutional Neural Network (CNN) is a deep learning model that is widely used in image, video and other data classification and recognition tasks. CNN was originally inspired by the research of Hubel and Wiesel. They found that there are simple cells and complex cells in the human visual system, which can respond to visual stimuli of different directions and sizes respectively. CNN draws on the principles of this biological vision system, extracts the features of the input data through convolution and pooling operations, and then uses a fully connected layer for classification. The convolutional layer is the core part of CNN. It uses a set of convolution kernels to perform convolution operations on the input data and extract local features. The activation function is used after the convolutional layer to increase the nonlinear capability of the network. The fully connected layer is used to classify the features extracted by the convolutional layer. The general structure of CNN includes an input layer, multiple convolutional layers, a pooling layer, a fully connected layer and an output layer. The famous handwritten digit recognition neural network LeNet is shown in Figure 2-7.

Figure 2-7 Handwritten digit recognition network LeNet
Convolutional layer
The convolutional layer is a key component of the convolutional neural network and is mainly used to extract features from images or other data. Mathematically, it is based on the convolution operation, a linear operation that computes a weighted sum of two functions at each overlapping position. In a convolutional layer, the weights of the convolution kernel are used to compute weighted sums over the input data. Convolution effectively extracts local features, and because the kernel parameters are shared, the number of network parameters is greatly reduced. As shown in Figure 2-8, the operation starts from the upper-left corner of the input: the kernel computes a weighted sum with the corresponding input values, moves one pixel to the right, and repeats at the new position. Sliding the kernel over the whole input produces the output feature map.

Figure 2-8 Schematic diagram of the convolution calculation process
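The calculation in Figure 2-8 can be written in a few lines of numpy; the sketch below uses stride 1 and no padding and, like most deep learning frameworks, actually computes cross-correlation (the kernel is not flipped).

import numpy as np

def conv2d(x: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Slide kernel k over input x, taking a weighted sum at each position."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)  # weighted sum
    return out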
Activation function
Why does a convolutional neural network need nonlinear activation functions? Mainly because purely linear models are inadequate for many complex tasks. In terms of their computations, convolutional layers and pooling layers are essentially linear operations, which would greatly limit what the network can express. Adding nonlinear activation functions greatly improves the network's ability to handle nonlinear tasks in real scenarios. The following paragraphs introduce several activation functions commonly used in deep learning.
The Sigmoid function maps the input signal to a value between 0 and 1. It was widely used in the early days of deep learning but, owing to problems such as vanishing gradients and a non-zero-mean output, it was gradually replaced by the ReLU function. The Sigmoid activation function is plotted in Figure 2-9.

Figure 2-9 Sigmoid activation function
The ReLU function is currently one of the most popular activation functions in deep learning. When the input is greater than 0, the output equals the input; when the input is less than or equal to 0, the output is 0. ReLU is simple to compute and works well in practice, and it is widely used in deep learning. The ReLU activation function is plotted in Figure 2-10.

Figure 2-10 ReLU activation function
The Tanh function is similar to the Sigmoid function, except that its output ranges from -1 to 1. Compared with Sigmoid, the output of Tanh is zero-centered, which can accelerate network convergence. Tanh is widely used in deep learning, especially in recurrent neural networks (RNNs). It is plotted in Figure 2-11.

Figure 2-11 Tanh activation function
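For reference, the three activation functions discussed above can be written directly in numpy:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # output in (0, 1), not zero-centered

def relu(x):
    return np.maximum(0.0, x)  # identity for positive input, zero otherwise

def tanh(x):
    return np.tanh(x)  # zero-centered output in (-1, 1)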
Pooling layer
The pooling layer reduces the size of the input feature map in a deep convolutional neural network. Typically, a pooling operation halves the size of the feature map using a 2×2 pooling kernel with a stride of 2. The two most common pooling operations are max pooling and average pooling: max pooling takes the maximum pixel value within the pooling window as the output, while average pooling takes the mean. A less common variant, min pooling, takes the minimum value. Figure 2-12 shows how a 2×2, stride-2 pooling operation turns a 4×4 input feature map into a 2×2 output; the values at corresponding positions differ according to the pooling method. The pooling layer has the following characteristics: (1) it provides a degree of translation and scale invariance; (2) by reducing the resolution of the feature map it reduces the number of network parameters, which lowers the risk of overfitting and improves generalization; (3) it reduces computational cost; and (4) it passes on the salient information in the image, a role that max pooling highlights in particular.

Figure 2-12 Pooling process in different ways
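A numpy sketch of the 2×2, stride-2 pooling in Figure 2-12 is given below; trimming odd-sized borders is an implementation shortcut.

import numpy as np

def pool2d(x: np.ndarray, mode: str = "max") -> np.ndarray:
    """2x2, stride-2 pooling: each output pixel summarizes one 2x2 block."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2  # trim odd edges
    blocks = x[:h, :w].reshape(h // 2, 2, w // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))  # keep the strongest response
    return blocks.mean(axis=(1, 3))  # average pooling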

Remote sensing image segmentation based on fully convolutional network
Overview of fully convolutional network
Fully convolutional neural networks have developed rapidly in recent years and have achieved excellent performance on multiple computer vision tasks. Such networks use operations like convolution and pooling to extract image features, which is an important reason for their success. Convolution has the properties of local connectivity and weight sharing and can extract image features effectively. Repeated pooling reduces the resolution of the image, which both cuts the amount of computation and lets the convolution operations extract higher-level features.
In many excellent classification networks, fully connected layers are used. However, the input of this method must be fixed, which limits the usage scenarios of the network. In the fully convolutional network for semantic segmentation, the fully connected layer is no longer used, but only the convolutional layer and the pooling layer are retained. This approach does not limit the resolution of the input image and can aid in dense pixel-by-pixel classification tasks.
In 1992, Matan et al. proposed a convolutional neural network that can handle one-dimensional signals of any size, but the network cannot handle two-dimensional signals. Wolf et al. designed a network capable of processing two-dimensional images in 1994 and successfully applied it to the postal address location task. This network only includes convolutional and pooling layers, so it can be considered a fully convolutional network. Although this network cannot complete end-to-end training, it lays a solid foundation for the development of fully convolutional networks. Currently, fully convolutional networks have been widely used in various tasks of image processing, including target detection and semantic segmentation.
Fully convolutional network structure
Convolution and pooling are the two most common operations in fully convolutional neural networks: they extract image features effectively and reduce image resolution to make feature extraction more efficient. Unlike traditional classification networks, fully convolutional networks contain no fully connected layers, retaining only convolutional and pooling layers, so they accept input images of any size and can perform pixel-level classification. The early networks of Matan et al. and Wolf et al. described above laid the foundation for this line of work.
In 2015, the emergence of the FCN network, an important variant of the fully convolutional network, marked a new stage in image semantic segmentation. The overall network structure is shown in Figure 3-1.

Figure 3-1 FCN network structure
The FCN network performs the pixel-wise segmentation task end to end. It replaces all fully connected layers of a deep convolutional network with convolutional layers, so an input image passes only through convolutional and pooling layers from input to output, producing a prediction map of the same size as the input; each pixel of the output represents the predicted category of the corresponding input pixel. The FCN network has two stages, encoding and decoding. In the encoding stage, the image resolution is reduced by successive downsampling to obtain high-level features; in the decoding stage, upsampling methods such as deconvolution and bilinear interpolation restore the image size. FCN also uses a skip structure that fuses predictions with semantic information from different stages of the network, improving both the semantic and the spatial accuracy of the output. The skip structure of FCN is shown in Figure 3-2. Common backbone networks for FCN include AlexNet, VGG-16 and GoogLeNet.

Figure 3-2 FCN skip structure
Deconvolution is also often called transposed convolution. In a fully convolutional encoder-decoder network, downsampling operations reduce the image resolution; Chapter 2 introduced pooling as a common downsampling method. In the decoding stage, successive upsampling operations are needed to restore the image size, and transposed convolution serves this purpose. Although the name contains the word "transposed", this does not simply mean transposing the convolution matrix and convolving with it; rather, a transposed convolution constructs the connection pattern opposite to that of an ordinary convolution. The size-recovery process of transposed convolution is shown in Figure 3-3. The parameters of a transposed convolution are learnable, so no interpolation scheme needs to be specified in advance. However, transposed convolution also has clear drawbacks: it can produce checkerboard artifacts, and the extra learnable parameters lengthen training and can even encourage overfitting.

Figure 3-3 Schematic diagram of transposed convolution
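In PyTorch, the training framework used in this thesis, a learnable 2× upsampling step can be sketched as follows; the channel counts are illustrative.

import torch
import torch.nn as nn

up = nn.ConvTranspose2d(in_channels=64, out_channels=32,
                        kernel_size=2, stride=2)  # learnable 2x upsampling
x = torch.randn(1, 64, 16, 16)  # a 16x16 feature map with 64 channels
print(up(x).shape)  # torch.Size([1, 32, 32, 32]): resolution doubled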
Loss function
The image processing in this design uses the cross-entropy loss function. The multi-class cross-entropy loss is shown in Equation 3-1:
$L = \frac{1}{N}\sum_{i} L_i = -\frac{1}{N}\sum_{i}\sum_{c=1}^{M} y_{ic}\,\log(p_{ic})$ (3-1)
where N is the number of samples, M is the number of categories, y_ic is an indicator (0 or 1) that equals 1 if the true category of sample i is c and 0 otherwise, and p_ic is the predicted probability that sample i belongs to category c.
Semantic segmentation classifies pixels: every pixel is assigned to a category, and the cross-entropy can be computed over all pixels of the image.
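In PyTorch, Equation 3-1 averaged over all pixels corresponds to nn.CrossEntropyLoss applied to an (N, C, H, W) score map and an (N, H, W) integer label map; the shapes below are illustrative.

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()  # combines log-softmax with Equation 3-1
scores = torch.randn(2, 6, 512, 512)  # N=2 images, C=6 classes, per-pixel scores
labels = torch.randint(0, 6, (2, 512, 512))  # true class index for each pixel
loss = criterion(scores, labels)  # mean cross-entropy over all pixels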
FCN network structure
The network structure of FCN-32s is shown in Figure 3-4 and consists of two parts. The first part is the feature extraction network, composed of multiple convolutional and pooling layers that extract features from the input image; in FCN-32s it uses the first 13 convolutional layers of VGG16 as the backbone, and the fully connected layers are replaced by convolutional layers whose output channels equal the number of categories, so that more spatial information is retained. The second part is the upsampling network, which uses transposed convolution to upsample the feature map. (Skip connection layers that fuse the outputs of the pool4 and pool3 layers with the upsampled features are what distinguish FCN-16s and FCN-8s; FCN-32s itself uses none.) The output of FCN-32s is a feature map of the same size as the input image, with as many channels as there are categories; during training the output is optimized with the cross-entropy loss to improve classification accuracy. Put simply, FCN-32s uses no cross-layer connections: after the image has been downsampled 32 times, a single 64×64 transposed convolution with stride 32 enlarges the feature map back to the original image size, with 21 output channels in the original formulation (the 21st class being the background).

Figure 3-4 Network structure diagram of FCN-32s
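A minimal FCN-32s sketch following the description above is given below; the 4096-channel 1×1 convolutions follow the common FCN recipe, and, together with the torchvision backbone, they are assumptions rather than the exact configuration used in this thesis.

import torch
import torch.nn as nn
from torchvision.models import vgg16

class FCN32s(nn.Module):
    def __init__(self, num_classes: int = 6):
        super().__init__()
        self.features = vgg16(weights=None).features  # VGG16 conv layers, 32x downsampling
        self.head = nn.Sequential(
            nn.Conv2d(512, 4096, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(4096, num_classes, kernel_size=1),  # per-class score map
        )
        self.up32 = nn.ConvTranspose2d(num_classes, num_classes,
                                       kernel_size=64, stride=32, padding=16)

    def forward(self, x):
        return self.up32(self.head(self.features(x)))  # one direct 32x upsampling

# FCN32s()(torch.randn(1, 3, 512, 512)).shape -> torch.Size([1, 6, 512, 512])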
The feature extraction network of FCN-8s likewise uses the first 13 convolutional layers of VGG16, followed by transposed convolution layers. FCN-8s adds two skip connection layers: the first fuses the output of the pool4 layer with the 2× upsampled deepest predictions, and the second fuses the output of the pool3 layer with the 2× upsampled result of the first fusion.
The upsampling network of FCN-8s then uses transposed convolutional layers to upsample the fused feature map to the same size as the input image. The final output layer is a 1×1 convolutional layer with as many channels as there are categories.
The network structure of FCN-8s is shown in Figure 3-5.

Figure 3-5 FCN-8s network structure diagram
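The skip fusion of FCN-8s can be sketched as follows; `score32`, `score_pool4` and `score_pool3` are assumed per-class score maps computed from the deepest features and the pool4 and pool3 outputs, and bilinear interpolation stands in here for the learned transposed convolutions of the actual network.

import torch.nn.functional as F

def fcn8s_fuse(score32, score_pool4, score_pool3):
    """Fuse coarse scores with pool4 and pool3 scores, then upsample 8x."""
    x = F.interpolate(score32, scale_factor=2, mode="bilinear", align_corners=False)
    x = x + score_pool4  # first skip fusion (1/16 resolution)
    x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
    x = x + score_pool3  # second skip fusion (1/8 resolution)
    return F.interpolate(x, scale_factor=8, mode="bilinear", align_corners=False)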
Fully convolutional network model construction and training
The hardware and software environment for the experiments in this section is given in Table 3-1, and the training hyperparameters in Table 3-2. The loss function of the experimental model is the cross-entropy loss, and the optimizer is stochastic gradient descent. The model is trained on the Vaihingen dataset for 100 epochs.
Table 3-1 Experimental hardware and software environment configuration
GPU model    Training framework    Operating system
GTX 1050     PyTorch 1.8           Ubuntu 16.04

Table 3-2 Training-related hyperparameter settings
Dataset      Batch size    Learning rate    Decay rate
Vaihingen    8             0.003            5e-4

This experiment compares two fully convolutional models with different skip structures, FCN-32s and FCN-8s, trained and tested as described above; the results are shown in Table 3-3.

Table 3-3 Comparative experiments between FCN-32s and FCN-8s
Model      Mean F1    mIOU     OA
FCN-32s    80.45      70.81    82.32
FCN-8s     79.24      70.35    83.02

The results show that in the semantic segmentation task, adding skip connection layers can improve model accuracy: compared with FCN-32s and FCN-16s, FCN-8s adds two skip connection layers, and its 8× upsampling restores image detail better, giving it the highest overall accuracy (OA of 83.02%), although FCN-32s retains slightly higher mean F1 and mIOU. In practical applications, the fully convolutional structure can therefore be chosen according to the task, and techniques such as upsampling can be used to further improve accuracy. A sketch of the training setup follows.
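In the sketch below, `model` and `train_loader` are placeholders, and interpreting the 5e-4 decay rate of Table 3-2 as SGD weight decay is an assumption.

import torch
import torch.nn as nn

def train(model, train_loader, device="cuda"):
    model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.003, weight_decay=5e-4)
    criterion = nn.CrossEntropyLoss()  # pixel-wise cross-entropy (Equation 3-1)
    for epoch in range(100):  # 100 training rounds on Vaihingen
        for images, labels in train_loader:  # batch size 8 set in the loader
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()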




Remote sensing image segmentation based on U-Net network
Overview of U-Net network
U-Net is a convolutional neural network used for image segmentation; its name comes from the U shape of its architecture. U-Net was proposed in 2015 by Olaf Ronneberger and colleagues at the University of Freiburg. Compared with a plain fully convolutional network, U-Net adds skip connections to the architecture, which makes it perform even better on segmentation tasks.
U-net performs well in image segmentation tasks, especially for medical image segmentation tasks. Its network has a compact structure, high accuracy, and relatively fast training speed, so it is widely used in segmentation tasks of lungs, liver, heart and other organs. At the same time, U-net is also used for image segmentation tasks in other fields, such as natural image segmentation, road segmentation, etc.
U-Net network structure
The U-Net architecture has two parts: downsampling and upsampling. In the downsampling stage, U-Net progressively shrinks the input image through a series of convolution and pooling operations while extracting features. In the upsampling stage, it progressively enlarges the feature map through a series of deconvolution and convolution operations and concatenates it with the feature map of the corresponding layer from the downsampling stage, restoring resolution and improving segmentation accuracy. The skip connections pass the downsampling-stage feature maps directly to the upsampling stage, letting the network use finer-grained feature information to better distinguish foreground from background. The architecture is shown in Figure 4-1, and a minimal sketch follows the figure.

Figure 4-1 U-Net network structure diagram
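The two-level U-Net sketch below simplifies the channel widths and depth (assumptions for brevity), while keeping the structure described above: a downsampling path, an upsampling path, and skip connections that concatenate encoder features with decoder features.

import torch
import torch.nn as nn

def double_conv(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True),
    )

class MiniUNet(nn.Module):
    def __init__(self, num_classes=6):
        super().__init__()
        self.enc1, self.enc2 = double_conv(3, 64), double_conv(64, 128)
        self.bottom = double_conv(128, 256)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = double_conv(256, 128)  # 128 skip + 128 upsampled channels
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = double_conv(128, 64)   # 64 skip + 64 upsampled channels
        self.out = nn.Conv2d(64, num_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)                  # full resolution
        e2 = self.enc2(self.pool(e1))      # 1/2 resolution
        b = self.bottom(self.pool(e2))     # 1/4 resolution
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.out(d1)                # per-class score map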

U-Net training and structural analysis
The hardware and software environment for the experiments in this section is given in Table 4-1, and the training hyperparameters in Table 4-2. The loss function of the experimental model is the cross-entropy loss, and the optimizer is stochastic gradient descent. The model is trained on the Vaihingen dataset for 100 epochs.
Table 4-1 Experimental hardware and software environment configuration
GPU model    Training framework    Operating system
GTX 1050     PyTorch 1.8           Ubuntu 16.04

Table 4-2 Training-related hyperparameter settings
Dataset      Batch size    Learning rate    Decay rate
Vaihingen    8             0.003            5e-4

After testing, the results of U-Net are shown in Table 4-3, and its prediction results are shown in Figure 4-2.

Table 4-3 U-Net test results
Model    Mean F1    mIOU     OA
U-Net    82.10      72.35    84.56

Figure 4-2 Prediction results of U-Net
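For reference, the three evaluation indicators reported in this thesis can be computed from a confusion matrix as sketched below.

import numpy as np

def evaluate(pred: np.ndarray, gt: np.ndarray, num_classes: int = 6):
    """Return (mean F1, mIOU, OA) from predicted and ground-truth label maps."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (gt.ravel(), pred.ravel()), 1)  # rows: truth, cols: prediction
    tp = np.diag(cm).astype(np.float64)
    oa = tp.sum() / cm.sum()  # overall accuracy
    iou = tp / (cm.sum(0) + cm.sum(1) - tp + 1e-10)  # per-class IoU
    f1 = 2 * tp / (cm.sum(0) + cm.sum(1) + 1e-10)  # per-class F1
    return f1.mean(), iou.mean(), oa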

Summary
This thesis designed remote sensing image segmentation experiments based on the U-Net network and compared the results with FCN-32s and FCN-8s. The experiments show that U-Net achieves the best results on the remote sensing image segmentation task. The thesis also surveyed commonly used image segmentation datasets and the basics of deep learning.
First, this article introduces the definition and application scenarios of the image segmentation task. Image segmentation refers to the process of dividing an image into several sub-regions, usually by segmenting objects or regions in the image. Image segmentation is widely used in computer vision, medical imaging, remote sensing images and other fields.
Secondly, this article introduces commonly used image segmentation data sets, including SIRI-WHU, WHU-RS19, GID and ISPRS Vaihingen data sets. These datasets contain images of various scenes and are of great significance for the evaluation and comparison of algorithms.
Next, this article introduces the application of deep learning in image segmentation tasks. The emergence of deep learning has greatly improved the image segmentation task. In particular, convolutional neural networks (CNN) have shown good results in image segmentation tasks. This article introduces two CNN-based image segmentation models: FCN and U-Net.
Then, this article introduces the experimental results of FCN-32s and FCN-8s. Experimental results show that FCN-8s performs better than FCN-32s in remote sensing image segmentation tasks. However, compared with U-Net, the effect of FCN still has room for improvement.
Finally, this thesis presented the structure and experimental results of the U-Net network. U-Net is a CNN-based image segmentation network characterized by a symmetric encoder-decoder structure with skip connections that pass encoder features directly to the decoder. The experiments show that U-Net gives the best results on the remote sensing image segmentation task and generalizes well.

Abstract
Remote sensing image segmentation is a process that uses high-resolution images obtained by remote sensing technology to classify at the pixel level and extract different objects or features in the image. This process is of great significance for remote sensing applications because it can extract ground objects and surface features, such as rivers, roads, buildings, vegetation, water bodies, etc., and these features actually exist on the ground. Image segmentation can provide practical information for ground cover classification, land use cover change analysis, urban planning, agricultural resource monitoring, environmental protection and other fields.
This paper first explains the principles of remote sensing image segmentation and deep learning technology, and uses the FCN-32s and FCN-8s networks in the fully convolutional network to build a remote sensing image segmentation model, and uses ISPRS Vaihingen data Set training and testing. After testing, the three evaluation indicators mean F1, mIOU and OA of the FCN-32s model are 80.45%, 70.81% and 82.32% respectively. The three evaluation indicators mean F1, mIOU and OA of the FCN-8s model are respectively. 79.24%, 70.35% and 83.02%. Then this paper used the U-Net network to build a remote sensing image segmentation model. After testing the three evaluation indicators of the U-Net model, mean F1, mIOU and OA were 82.10%, 72.35% and 84.56% respectively.
Keywords remote sensing image segmentation; U-Net; fully convolutional network

Abstract
Remote sensing image segmentation is the process of extracting different objects or different features from an image by pixel-level classification using high-resolution images acquired by remote sensing technology. This process is important for remote sensing applications because it can extract features and surface features such as rivers, roads, buildings, vegetation, water bodies, etc. and these features are physically present on the ground. Image segmentation can provide practical information for ground cover classification, land use cover change analysis, urban planning, agricultural resource monitoring, environmental protection and other fields.
In this thesis, firstly, the principles of remote sensing image segmentation and deep learning techniques are explained, and remote sensing image segmentation models are constructed with FCN-32s and FCN-8s networks in full convolutional networks, and trained and tested with ISPRS Vaihingen dataset. After testing, the three evaluation indexes of FCN-8s model mean F1, mIOU and OA are 79.24%, 70.35% and 83.02%, respectively. The three evaluation indexes mean F1, mIOU and OA of the U-Net model were 82.10%, 72.35% and 84.56%, respectively.
Keywords:Remote Sensing Image Segmentation; U-Net; Full Convolutional Network

目录
1 绪论 1
1.1 选题背景及研究意义 1
1.2 国内外现状现状 1
1.2.1 基于聚类的方法 1
1.2.2 基于分割的方法 2
1.2.3 基于深度学习的方法 2
2 遥感图像分割和深度学习基础 3
2.1 基于深度学习的遥感图像分析 3
2.2 遥感图像分割常用的数据集 4
2.2.1 SIRI-WHU 数据集 4
2.2.2 WHU-RS19 数据集图 4
2.2.3 GID 数据集 5
2.2.4 ISPRS Vaihingen 数据集 5
2.3 数据预处理 6
2.4 卷积神经网络基础 7
2.4.1 卷积层 8
2.4.2 激活函数 8
2.4.3 池化层 10
3 基于全卷积网络的遥感图像分割 11
3.1 全卷积网络概述 11
3.2 全卷积网络结构 11
3.3 损失函数 13
3.4 FCN网络结构 13
3.5 全卷积网络模型构建和训练 15
4 基于U-Net网络的遥感图像分割 16
4.1 U-Net网络概述 16
4.2 U-Net网络结构 16
4.3 U-Net训练和结构分析 17
5 总结 18

Introduction
Background of the topic and research significance
Remote sensing image segmentation is the use of high-resolution images obtained by remote sensing technology to classify at the pixel level. The process of extracting different objects or features in the image. This process is of great significance for remote sensing applications because it can extract ground objects and surface features, such as rivers, roads, buildings, vegetation, water bodies, etc., and these features actually exist on the ground. Image segmentation can provide practical information for ground cover classification, land use cover change analysis, urban planning, agricultural resource monitoring, environmental protection and other fields.
In recent years, convolutional neural networks (CNN) have made important progress in the field of image segmentation. Among them, U-Net is a commonly used deep learning model. It is a fully convolutional neural network (FCN) architecture based on convolutional neural network (CNN) and is widely used in medical image segmentation, remote sensing image segmentation, etc. field. U-Net connects the feature vector of the convolutional network with the upsampled feature map by introducing skip connections, which enables U-Net to capture high-level features and low-level detailed information, improving the accuracy of remote sensing image segmentation. performance and robustness.
In terms of remote sensing image segmentation, U-Net has been applied to the classification of various types of ground objects, such as buildings, water bodies, vegetation, roads, etc. In addition, the image segmentation method based on U-Net has many advantages, such as automatically learning features, good segmentation effect and generalization ability, etc.
Therefore, the remote sensing image segmentation method based on U-Net has broad application prospects in the fields of land use, urban planning, environmental protection, weather forecasting and other fields. Especially in remote sensing image segmentation, due to the characteristics of remote sensing images such as high resolution, large data volume, and multispectral, traditional methods based on artificial rules and manual features are no longer competent, and deep learning algorithms have better capabilities in this regard. Application prospects. Therefore, the remote sensing image segmentation method based on U-Net is of great significance to the research and practice in the field of remote sensing applications.
Current status at home and abroad
Clustering-based method
Clustering-based method is the most commonly used method in remote sensing image segmentation one. The basic idea is to cluster pixels into different categories based on similarity, and then divide pixels of the same category into the same area. K-means algorithm is one of the most commonly used clustering methods. Its idea is to divide pixels into K categories, and each pixel belongs to the category closest to it. In addition, the Fuzzy C-Means (FCM) algorithm and the Possibilistic C-Means (PCM) algorithm are also used in remote sensing image segmentation. These methods are similar to the K-means algorithm, but are more robust to noise and outliers.
Researchers at home and abroad are exploring clustering-based methods, but there are also different focuses and directions. Foreign researchers pay more attention to the improvement of clustering algorithms, such as improving the convergence of the K-means algorithm, parameter selection of the Fuzzy C-Means algorithm, etc. Domestic researchers, however, pay more attention to the application of algorithms, such as application in high-resolution remote sensing image segmentation, remote sensing image segmentation based on wavelet transform, etc.
Segmentation-based method
Segmentation-based method refers to dividing the image into several regions and then further processing each region. Among them, the region growing method is a commonly used region-based segmentation method. Its basic idea is to start from a set of seed pixels and add adjacent pixels to the region until no more additions can be made. In addition, edge-based methods are also widely used in remote sensing image segmentation, such as edge detection based on the Canny algorithm and image segmentation based on edge features.
Domestic and foreign researchers are exploring segmentation-based methods, but there are also different research directions. Foreign researchers pay more attention to the theoretical research of segmentation-based methods, such as the selection of segmentation criteria and the optimization of segmentation algorithms. Domestic researchers, however, pay more attention to the practical application of algorithms, such as remote sensing image classification based on the region growing method, remote sensing image analysis based on multi-scale segmentation, etc. Methods based on deep learning In recent years, deep learning has been widely used in remote sensing image segmentation, especially convolutional neural networks (CNN). The application of CNN in remote sensing image segmentation mainly includes two methods: fully convolutional neural network (FCN) and convolutional neural network plus cascade segmenter (CNN-Cascade). FCN is one of the first methods to apply deep learning methods to remote sensing image segmentation. Its main idea is to replace the fully connected layer with a convolutional layer to achieve the output of any size input image. segmentation. CNN-Cascade is a novel segmentation method that achieves high-precision segmentation results by cascading multiple convolutional neural networks. Domestic and foreign researchers are conducting in-depth research on methods based on deep learning, but there are also different focuses and directions. Foreign researchers pay more attention to the improvement of deep learning models, such as improving the structure of convolutional neural networks and introducing attention mechanisms. Domestic researchers are paying more attention to the application of deep learning methods in remote sensing image segmentation, such as building recognition based on multi-source remote sensing images, crop classification based on deep learning, etc. Basics of remote sensing image segmentation and deep learning Remote sensing image analysis based on deep learning With the vigorous development of deep learning, remote sensing The field of image processing and analysis has also developed rapidly. At present, remote sensing image analysis based on deep learning is mainly in the following aspects: The three basic tasks in computer vision are classification, detection and segmentation, among which target detection has always been a popular research field. With the rapid development of modern information technology, ground object detection in remote sensing image processing has become particularly important. This application plays an important role in the fields of drones, security protection, military navigation and aerial reconnaissance, as shown in Figure 2-1. There are many detection algorithms for natural scene images, including single-stage target detection algorithms such as SSD and YOLO, and two-stage target detection algorithms such as Faster R-CNN and Mask R-CNN. In addition, there are excellent target detection algorithms such as FCOS and CenterNet. However, when considering the actual scenarios of remote sensing image application, not only the algorithm needs to have high recognition accuracy, but also parameters such as size and speed need to be considered.








Figure 2-1 Target detection in remote sensing images
In order to solve these problems, some researchers have proposed some special algorithms. Yang et al. designed an algorithm that combines residual networks and supervector coding to efficiently detect aircraft targets. In order to improve the accuracy of aircraft positioning in remote sensing images, Xu et al. applied feature fusion technology to a fully convolutional neural network. Liu et al. added CBL operations to the YOLOv3 network, which can complete feature extraction more effectively, thereby achieving aerial car detection. These algorithms are proposed to solve problems in special fields and have high practical value.
The task of semantic segmentation of remote sensing images is more challenging than that of ordinary images, mainly because of the following two aspects. First, remote sensing images usually have higher resolution, which further increases the difficulty of semantic segmentation. Secondly, due to the large variety and similar appearance of land species, the task of semantic segmentation of remote sensing images is more challenging. In recent years, with the development of deep learning technology, semantic segmentation of remote sensing images based on deep learning has become a research hotspot in this field. At present, some efficient methods have been proposed and achieved good results. For example, the CxtHGNet network uses stacked hourglass modules and intermediate supervision to extract rich multi-scale features; the HSN network uses combined inception modules to replace conventional convolutional layers to obtain multi-scale information. In addition, the method of connecting feature mapping of global context information is also widely used, and remarkable results have been achieved through the fusion of local and global information.
Commonly used data sets for remote sensing image segmentation
SIRI-WHU data set
The SIRI-WHU data set was developed by Wuhan University RS-IDEA Designed by the Group, it contains 2,400 remote sensing images, each image is 200x200 in size, covering urban areas in China. The data set covers 12 different scene categories, each category contains 200 images, and the scene categories include cities, forests, fields, etc. These image resources come from Google Earth and can be used for research in remote sensing image classification, target detection, semantic segmentation and other fields. An example image is shown in Figure 2-2.

Figure 2-2 Example pictures of the SIRI-WHU data set
WHU-RS19 data set pictures
The WHU-RS19 data set contains a total of 1005 pictures The remote sensing images collected by Google satellite imagery have a size of 600x600 pixels per image. The dataset covers all parts of the world and contains 19 different scene categories, which is diverse and challenging. Figure 2-3 shows some scenes of the WHU-RS19 data set.

Figure 2-3 Example images of WHU-RS19 data set
GID data set
Gaofen Image Dataset (GID for short) It is a data set composed of remote sensing images collected by the Gaofen-2 satellite, covering land areas in more than 60 different cities in my country. GID datasets are commonly used for large-scale land use and land cover classification tasks. The dataset contains 150 images, each with a resolution of 6908x7300 pixels, covering a land area of ​​more than 50,000 square kilometers. Figure 2-4 shows some images of this data set.

Figure 2-4 Schematic diagram of the GID data set
ISPRS Vaihingen data set
The remote sensing data set used in the research of this article is the International Photogrammetry and Remote Sensing Vaihingen dataset created by ISPRS. This data set was captured by drones in the German city of Vaihingen. It is mainly composed of plants and trees, with fewer buildings. The Vaihingen dataset is a widely used large-scale dataset that contains a total of 33 images of size 2494x2064 pixels, each image contains six categories (impervious surfaces, buildings, low vegetation, trees, cars, and debris). Each image contains three bands, corresponding to the near-infrared (IR), red® and green (G) bands. This article uses the specified 16 images for training, and the remaining 17 for testing. Some images of the ISPRS Vaihingen data set are shown in Figure 2-5.

Figure 2-5 Some images of the ISPRS Vaihingen data set
Data preprocessing
The remote sensing image data set used in this study comes from ISPRS, mainly The experiment was conducted on the Vaihingen data set. This dataset contains high-resolution remote sensing images, with an average image size of approximately 2500x2500 pixels. Due to hardware device limitations during the experiment, the images in the original data set need to be cropped. In order to enhance the generalization of the network and improve the training effect, the data enhancement method is used to preprocess the images.
In order to achieve deep training and rationally utilize the video memory, we cropped the original high-resolution remote sensing images. When cropping, considering the consistency and completeness of the data distribution, we chose the sliding window cropping method instead of random cropping. The cropped image size is 512x512 in RGB format. The repetition rate between cropped images is 40%. This design uses a 7:3 ratio to divide the data set into a training set and a test set. Figure 2-6 shows the cropped image and corresponding labels.

Figure 2-6 Schematic diagram of the cropped Vaihingen data set
In classic image processing algorithms, there are many image filtering algorithms, among which Gaussian filtering (Gaussian Blur) is one Effective method to remove Gaussian noise. This algorithm calculates a new pixel value for each pixel based on a Gaussian function, so the new value of each pixel is affected by the values ​​of its surrounding pixels. Simply put, this algorithm smoothes the image and blurs image details.
Histogram equalization is a method of improving contrast by adjusting the histogram of an image. When the distribution of pixels in the image is relatively uniform, using histogram equalization can enhance the local contrast of the image without affecting the overall contrast, so that the brightness can be more evenly distributed on the histogram. In this paper, we perform histogram equalization on the cropped image, and also perform operations such as rotation and translation to enhance data.
Basics of Convolutional Neural Networks
Convolutional Neural Network (CNN) is a deep learning model that is widely used in image, video and other data classification and recognition tasks. CNN was originally inspired by the research of Hubel and Wiesel. They found that there are simple cells and complex cells in the human visual system, which can respond to visual stimuli of different directions and sizes respectively. CNN draws on the principles of this biological vision system, extracts the features of the input data through convolution and pooling operations, and then uses a fully connected layer for classification. The convolutional layer is the core part of CNN. It uses a set of convolution kernels to perform convolution operations on the input data and extract local features. The activation function is used after the convolutional layer to increase the nonlinear capability of the network. The fully connected layer is used to classify the features extracted by the convolutional layer. The general structure of CNN includes an input layer, multiple convolutional layers, a pooling layer, a fully connected layer and an output layer. The famous handwritten digit recognition neural network LeNet is shown in Figure 2-7.

Figure 2-7 Handwritten digit recognition network LeNet
Convolutional layer
The convolutional layer is an important part of the convolutional neural network and is mainly used for extracting features from images or other types of data. Mathematically, it is based on the convolution operation, a linear operation that computes a weighted sum of the input values at each overlapping position between the kernel and the input. In a convolutional layer, the weights of the convolution kernel determine this weighted sum over the input data. The convolution operation can effectively extract local features, and because the kernel's parameters are shared across positions, the number of parameters in the network is greatly reduced. As shown in Figure 2-8, the convolution operation starts from the upper-left corner of the input: the kernel computes a weighted sum with the corresponding data, then moves one pixel to the right and repeats the operation at the new position. Repeating this over the whole input yields the output feature map.

Figure 2-8 Schematic diagram of the convolution calculation process
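The following small PyTorch sketch reproduces this sliding weighted sum numerically; the input values and the 2x2 kernel are chosen arbitrarily for illustration.

    import torch
    import torch.nn.functional as F

    x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)  # 4x4 input, 1 channel
    k = torch.tensor([[[[1., 0.], [0., -1.]]]])                    # one 2x2 kernel
    y = F.conv2d(x, k, stride=1)  # the kernel slides one pixel at a time
    print(y.shape)                # torch.Size([1, 1, 3, 3])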
Activation function
Why do we need to add a nonlinear activation function to the convolutional neural network? Mainly because in many complex task scenarios, a purely linear model is simply not enough. Judging from their calculation methods, the convolution layer and the pooling layer are essentially linear operations, which greatly limits what the network can express. Adding nonlinear activation functions to convolutional neural networks greatly improves their ability to handle nonlinear tasks in real scenarios. Next, this article introduces several activation functions commonly used in deep learning.
The Sigmoid function converts the input signal into a value between 0 and 1 that can be read as a probability. It was widely used in the early days of deep learning, but due to problems such as vanishing gradients and non-zero-mean outputs, it was gradually replaced by the ReLU function. The Sigmoid activation function is plotted in Figure 2-9.

Figure 2-9 Sigmoid activation function
The ReLU function is currently one of the most popular activation functions in deep learning. When the input is greater than 0, the output equals the input; when the input is less than or equal to 0, the output is 0. The ReLU function is cheap to compute and works well in practice, so it is widely used in deep learning. The ReLU activation function is plotted in Figure 2-10.

Figure 2-10 ReLU activation function
The Tanh function is similar to the Sigmoid function, except that its output range is -1 to 1. Compared with the Sigmoid function, the output of the Tanh function is zero-centered, which can accelerate the convergence of the network. The Tanh function is widely used in deep learning, especially in recurrent neural networks (RNN). It is plotted in Figure 2-11.

Figure 2-11 Tanh activation function
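The three activations can be compared directly with PyTorch's built-in implementations; the sample inputs below are arbitrary.

    import torch

    x = torch.tensor([-2.0, 0.0, 2.0])
    print(torch.sigmoid(x))  # tensor([0.1192, 0.5000, 0.8808]) -> range (0, 1)
    print(torch.relu(x))     # tensor([0., 0., 2.])             -> zero for x <= 0
    print(torch.tanh(x))     # tensor([-0.9640, 0., 0.9640])    -> zero-centered, range (-1, 1)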
Pooling layer
The pooling layer reduces the size of the input feature map in a deep convolutional neural network. Typically, a pooling operation halves the size of the feature map, using a 2x2 pooling kernel with a stride of 2. The two most common pooling operations are max pooling and average pooling: max pooling outputs the maximum pixel value within the pooling window, while average pooling outputs the average of the pixels. There is also a less common variant, min pooling, which outputs the minimum value. Figure 2-12 shows how a 2x2, stride-2 pooling operation turns a 4x4 input feature map into a 2x2 output feature map; depending on the pooling method, the pixel values at corresponding positions of the output differ. The pooling layer has the following characteristics:
1. It provides a degree of translation and scale invariance.
2. By reducing the resolution of the feature map, it reduces the number of network parameters, which lowers the risk of overfitting to a certain extent and improves the generalization ability of the network.
3. It reduces computational complexity.
4. It passes on the most salient features of the image; max pooling in particular highlights this role.

Figure 2-12 Pooling process in different ways
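A PyTorch sketch of the 2x2, stride-2 setting in Figure 2-12, using an arbitrary 4x4 input:

    import torch
    import torch.nn.functional as F

    x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)
    print(F.max_pool2d(x, kernel_size=2, stride=2))  # window maxima: [[5, 7], [13, 15]]
    print(F.avg_pool2d(x, kernel_size=2, stride=2))  # window means: [[2.5, 4.5], [10.5, 12.5]]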

Remote sensing image segmentation based on fully convolutional network
Overview of fully convolutional network
Fully convolutional neural networks have developed rapidly in recent years and have achieved excellent performance in multiple computer vision tasks. Such networks use operations like convolution and pooling to extract image features, which is an important reason for their remarkable results. Convolution has the characteristics of local connectivity and weight sharing and can effectively extract image features. Repeated pooling operations reduce the resolution of the image: on the one hand this reduces the amount of computation, and on the other hand it lets subsequent convolutions extract higher-level features.
Many excellent classification networks use fully connected layers, but fully connected layers require a fixed input size, which limits the network's usage scenarios. Fully convolutional networks for semantic segmentation no longer use fully connected layers, retaining only convolutional and pooling layers. This design places no limit on the resolution of the input image and supports dense pixel-by-pixel classification.
In 1992, Matan et al. proposed a convolutional neural network that can handle one-dimensional signals of any size, but it could not handle two-dimensional signals. In 1994, Wolf et al. designed a network capable of processing two-dimensional images and successfully applied it to a postal address localization task. That network contained only convolutional and pooling layers, so it can be considered a fully convolutional network. Although it could not be trained end to end, it laid a solid foundation for the development of fully convolutional networks. Today, fully convolutional networks are widely used in image processing tasks such as object detection and semantic segmentation.
Fully convolutional network structure
As summarized above, convolution and pooling are the two most common operations in fully convolutional neural networks: they effectively extract image features and reduce image resolution to make feature extraction more efficient. Unlike traditional classification networks, fully convolutional networks contain no fully connected layers and retain only convolutional and pooling layers, so they can be applied to input images of different sizes and can perform pixel-level classification tasks. The early networks of Matan et al. (1992) and Wolf et al. (1994) described above laid the foundation for this line of development.
In 2015, the emergence of the FCN network, an important variant of the fully convolutional network, marked a new stage in image semantic segmentation. The overall network structure is shown in Figure 3-1.

Figure 3-1 FCN network structure
The FCN network performs the pixel-by-pixel segmentation task end to end. It replaces all fully connected layers of a deep convolutional network with convolutional layers, so that from input to output the image passes only through convolution (and pooling) layers and never through a fully connected layer, producing a prediction map of the same size as the input image. Each pixel of the output represents the predicted category of the corresponding input pixel, achieving pixel-by-pixel prediction. The FCN network contains two stages, encoding and decoding. In the encoding stage, the resolution of the image is reduced by repeated downsampling to obtain high-level features; in the decoding stage, upsampling methods such as deconvolution and bilinear interpolation restore the image size. FCN also uses a skip structure that fuses the prediction map with semantic information from different stages of the network to improve the semantic and spatial accuracy of the output. The skip structure of FCN is shown in Figure 3-2. Commonly used backbone networks for FCN include AlexNet, VGG-16 and GoogLeNet.

Figure 3-2 FCN skip structure
Deconvolution is often called transposed convolution (Transposed Convolution). In fully convolutional networks, the encoder usually applies downsampling operations to reduce image resolution; Chapter 2 introduced pooling as a common means of downsampling. In the decoding stage, successive upsampling operations are needed to restore the image size, and deconvolution can perform this upsampling. Although the word "transposed" appears in the name, this should not be read as simply transposing the convolution matrix and convolving with it; rather, the transposed convolution constructs the opposite connection pattern to an ordinary convolution. The image size recovery process of transposed convolution is shown in Figure 3-3. The parameters of a transposed convolution are learnable, so no interpolation scheme needs to be fixed in advance. However, transposed convolution also has obvious disadvantages: first, it can cause a "checkerboard effect"; second, the extra learnable parameters make training slower and more prone to overfitting.

Figure 3-3 Schematic diagram of transposed convolution
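The following PyTorch sketch shows a learnable 2x upsampling with a transposed convolution; the channel counts and input size are arbitrary.

    import torch
    import torch.nn as nn

    up = nn.ConvTranspose2d(in_channels=64, out_channels=64, kernel_size=2, stride=2)
    feat = torch.randn(1, 64, 16, 16)
    print(up(feat).shape)  # torch.Size([1, 64, 32, 32]) -- resolution doubled
    # Kernel sizes not divisible by the stride are a common source of the
    # checkerboard artifacts mentioned above.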
Loss function
This design uses the cross-entropy loss function; the multi-class cross-entropy loss function is shown in Equation 3-1:
L = \frac{1}{N}\sum_{i} L_i = -\frac{1}{N}\sum_{i}\sum_{c=1}^{M} y_{ic}\,\log(p_{ic})    (3-1)
where M is the number of categories; y_{ic} is an indicator (0 or 1) that equals 1 if the true category of sample i is c and 0 otherwise; and p_{ic} is the predicted probability that sample i belongs to category c.
Semantic segmentation classifies pixels: each pixel is assigned to one of the categories, and the cross-entropy of the whole image is obtained by averaging this loss over all pixels.
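In PyTorch this per-pixel loss is available directly; the shapes below match the 512x512 tiles and six Vaihingen classes used in this thesis (the batch size of 8 is the one from Table 3-2).

    import torch
    import torch.nn as nn

    logits = torch.randn(8, 6, 512, 512)          # N=8 tiles, M=6 classes, per-pixel scores
    labels = torch.randint(0, 6, (8, 512, 512))   # ground-truth class index for each pixel
    loss = nn.CrossEntropyLoss()(logits, labels)  # averages -log p_ic over all pixels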
FCN network structure
The network structure of FCN-32s is shown in Figure 3-4 and is divided into two parts. The first part is the feature extraction network, which consists of multiple convolutional and pooling layers that extract features from the input image. In FCN-32s, the feature extraction network uses the first 13 convolutional layers of the VGG16 network as the base network; to retain more information, the fully connected layers are then replaced by convolutional layers, ending in a convolutional layer whose number of output channels equals the number of categories. The second part is the upsampling network. FCN-32s does not use cross-layer (skip) connections: after the image has been downsampled 32 times, a single 64x64 transposed convolution with stride 32 directly enlarges the feature map back to the original image size, with one output channel per category (on PASCAL VOC this is 21 classes, the 21st being the background). The output of FCN-32s is therefore a prediction map of the same size as the input image whose channel count equals the number of categories; during training it is optimized with the cross-entropy loss to improve classification accuracy. The skip connections to the pool4 and pool3 layers described below are what distinguish FCN-16s and FCN-8s from this simple and direct design.

Figure 3-4 Network structure diagram of FCN-32s
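A minimal FCN-32s sketch under the assumptions above (VGG16 backbone from torchvision, six Vaihingen classes); this is an illustration, not the exact thesis implementation.

    import torch.nn as nn
    from torchvision.models import vgg16

    class FCN32s(nn.Module):
        def __init__(self, num_classes=6):
            super().__init__()
            self.backbone = vgg16().features  # 13 conv layers + pooling, downsamples 32x
            self.score = nn.Conv2d(512, num_classes, kernel_size=1)
            # A 64x64 transposed convolution with stride 32 restores the input size:
            # for a 512x512 input, (16 - 1) * 32 + 64 - 2 * 16 = 512.
            self.up32 = nn.ConvTranspose2d(num_classes, num_classes,
                                           kernel_size=64, stride=32,
                                           padding=16, bias=False)

        def forward(self, x):
            return self.up32(self.score(self.backbone(x)))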
The feature extraction network of FCN-8s likewise uses the first 13 convolutional layers of the VGG16 network, followed by convolutional score layers. On top of this, FCN-8s adds skip connections: the coarse prediction is upsampled 2x with a transposed convolution and fused with the class scores computed from the pool4 output; the result is upsampled 2x again and fused with the scores from the pool3 output. The fused map is finally upsampled 8x to the size of the input image. The class scores at each stage are produced by 1x1 convolutional layers whose number of channels equals the number of categories.
The network structure of FCN-8s is shown in Figure 3-5.

Figure 3-5 FCN-8S network structure diagram
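A sketch of the FCN-8s fusion logic (feature extraction omitted; for brevity, bilinear interpolation stands in for the learnable transposed convolutions):

    import torch.nn.functional as F

    def fcn8s_head(score32, pool4_score, pool3_score):
        # score32: coarse class scores at 1/32 resolution;
        # pool4_score / pool3_score: 1x1-conv class scores at 1/16 and 1/8 resolution.
        x = F.interpolate(score32, scale_factor=2, mode="bilinear", align_corners=False)
        x = x + pool4_score
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        x = x + pool3_score
        return F.interpolate(x, scale_factor=8, mode="bilinear", align_corners=False)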
Fully convolutional network model construction and training
The hardware and software environment for this section's experiments is listed in Table 3-1, and the training-related hyperparameter settings are shown in Table 3-2. The loss function of the experimental model in this section is the cross-entropy loss, and the optimizer is stochastic gradient descent. The model is trained on the Vaihingen data set for 100 epochs.
Table 3-1 Experimental hardware and software environment configuration
GPU model    Training framework    Operating system
GTX 1050     PyTorch 1.8           Ubuntu 16.04

Table 3-2 Training-related hyperparameter settings
Data set     Batch size    Learning rate    Decay rate
Vaihingen    8             0.003            5e-4

This experiment compared two fully convolutional models with different skip structures, FCN-32s and FCN-8s, trained and tested under the configuration above. The results are shown in Table 3-3.

Table 3-3 Comparative experiments between FCN-32s and FCN-8s
Model      Mean F1 (%)    mIOU (%)    OA (%)
FCN-32s    80.45          70.81       82.32
FCN-8s     79.24          70.35       83.02

Experimental results show that the FCN-8s structure with 8x upsampling achieves the highest overall accuracy of the two: its three evaluation indicators mean F1, mIOU and OA are 79.24%, 70.35% and 83.02% respectively. This suggests that in the semantic segmentation task, adding skip connection layers can improve model accuracy: compared with FCN-32s and FCN-16s, FCN-8s adds skip connections that fuse finer feature maps, and its 8x upsampling restores image details better, further improving segmentation quality. In practical applications, the fully convolutional model structure can therefore be chosen according to task needs, and techniques such as upsampling can be used to further improve accuracy.
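For reference, the three reported metrics can be computed from a per-class confusion matrix as in the sketch below (a generic formulation, not the thesis' exact evaluation script):

    import numpy as np

    def metrics(cm):  # cm[i, j]: number of pixels of class i predicted as class j
        tp = np.diag(cm).astype(float)
        fp = cm.sum(axis=0) - tp
        fn = cm.sum(axis=1) - tp
        f1 = 2 * tp / (2 * tp + fp + fn)   # per-class F1
        iou = tp / (tp + fp + fn)          # per-class IoU
        oa = tp.sum() / cm.sum()           # overall accuracy
        return f1.mean(), iou.mean(), oa   # mean F1, mIOU, OA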




Remote sensing image segmentation based on U-Net network
Overview of U-Net network
U-Net is a convolutional neural network for image segmentation; its name comes from the U shape of its network structure. U-Net was proposed in 2015 by Olaf Ronneberger and others at the University of Freiburg. Compared with earlier fully convolutional networks, U-Net adds symmetric skip connections between encoder and decoder, which makes it perform even better in segmentation tasks.
U-Net performs well in image segmentation, especially in medical image segmentation. Its network structure is compact, its accuracy is high, and its training speed is relatively fast, so it is widely used in segmenting organs such as the lungs, liver and heart. U-Net is also used for image segmentation in other fields, such as natural image segmentation and road segmentation.
U-Net network structure
The network structure of U-Net is divided into two parts, downsampling and upsampling. In the downsampling stage, U-Net progressively shrinks the input image through a series of convolution and pooling operations while extracting features. In the upsampling stage, U-Net progressively enlarges the feature map through a series of deconvolution and convolution operations, concatenating it at each level with the feature map of the corresponding layer from the downsampling stage, thereby restoring the resolution and improving segmentation accuracy. The skip connections pass the feature maps of the downsampling stage directly to the upsampling stage, letting the network use finer feature information to better distinguish foreground from background. The network structure is shown in Figure 4-1.

Figure 4-1 U-Net network structure diagram
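One decoder step of this structure can be sketched as follows (the channel scheme, where in_ch is twice out_ch and the skip tensor carries out_ch channels, is the conventional one and an assumption, not the thesis' exact configuration):

    import torch
    import torch.nn as nn

    class UpBlock(nn.Module):
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
            self.conv = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

        def forward(self, x, skip):
            x = self.up(x)                   # double the spatial resolution
            x = torch.cat([x, skip], dim=1)  # skip connection: concatenate encoder features
            return self.conv(x)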

U-Net training and structural analysis
The hardware and software environment for this section's experiments is listed in Table 4-1, and the training-related hyperparameter settings are shown in Table 4-2. The loss function of the experimental model in this section is the cross-entropy loss, and the optimizer is stochastic gradient descent. The model is trained on the Vaihingen data set for 100 epochs.
Table 4-1 Experimental hardware and software environment configuration
GPU model    Training framework    Operating system
GTX 1050     PyTorch 1.8           Ubuntu 16.04

Table 4-2 Training-related hyperparameter settings
Data set     Batch size    Learning rate    Decay rate
Vaihingen    8             0.003            5e-4

After testing, the results of the U-Net model are shown in Table 4-3, and its prediction results are shown in Figure 4-2.

Table 4-3 U-Net test experiment
Model    Mean F1 (%)    mIOU (%)    OA (%)
U-Net    82.10          72.35       84.56
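A minimal training loop matching the stated setup (cross-entropy loss, SGD with learning rate 0.003, batch size 8, 100 epochs) might look as follows; `model` and `train_loader` are assumed to be defined elsewhere, and the "decay rate" of Table 4-2 is interpreted here as the optimizer's weight decay.

    import torch
    import torch.nn as nn

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.003, weight_decay=5e-4)
    for epoch in range(100):
        for images, labels in train_loader:  # 512x512 tiles with per-pixel labels
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()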



Figure 4-2 Prediction results of U-Net

Summary
This paper designed remote sensing image segmentation experiments based on the U-Net network and compared the results with FCN-32s and FCN-8s. Experimental results show that the U-Net network performs best on the remote sensing image segmentation task. This paper also introduced commonly used image segmentation data sets and the basics of deep learning.
First, this paper introduced the definition and application scenarios of the image segmentation task. Image segmentation is the process of dividing an image into several sub-regions, usually by delineating the objects or regions it contains. It is widely used in computer vision, medical imaging, remote sensing and other fields.
Secondly, this article introduces commonly used image segmentation data sets, including SIRI-WHU, WHU-RS19, GID and ISPRS Vaihingen data sets. These datasets contain images of various scenes and are of great significance for the evaluation and comparison of algorithms.
Next, this paper introduced the application of deep learning to image segmentation. The emergence of deep learning has greatly improved performance on image segmentation tasks; convolutional neural networks (CNN) in particular have shown good results. This paper introduced two CNN-based image segmentation models, FCN and U-Net.
Then, this paper presented the experimental results of FCN-32s and FCN-8s. The results show that FCN-8s performs better than FCN-32s on the remote sensing image segmentation task, but compared with U-Net, FCN still has room for improvement.
Finally, this paper introduced the structure and experimental results of the U-Net network. U-Net is a CNN-based image segmentation network characterized by a symmetric encoder-decoder structure with skip connections that pass encoder feature maps directly to the decoder. Experimental results show that the U-Net network achieves the best results on the remote sensing image segmentation task and generalizes well.
