Object segmentation technology: overview of semantic segmentation

Preface

The blogger is currently a senior artificial intelligence engineer. He has published several SCI papers and won awards in a number of international competitions. He is familiar with the principles of a wide range of models, the modeling workflow for each of them, and the analysis methods for many kinds of problems. The aim is to enable readers with no prior background to quickly put various models and their code to use. Every article includes a hands-on project and runnable code. Everyone is welcome to subscribe to the column: Learn quickly in one article - Deep learning project practice

Object segmentation technology: overview of semantic segmentation

Object segmentation is an important task in the field of computer vision, which aims to accurately segment specific targets or objects from images or videos. Unlike object detection, which focuses on object locations and bounding boxes, object segmentation requires accurately identifying and labeling each pixel of the object to achieve pixel-level understanding of the object.

Definition

We can break object segmentation down into two technical components: semantic segmentation and instance segmentation. Comparing image classification, object detection, and image segmentation:

  • Image classification aims to determine the category to which the image belongs.
  • Target detection is based on image classification and further determines where the target in the image is located, usually in the form of a bounding box.
  • Image segmentation is a step beyond object detection. Object detection only needs to draw a bounding box around each target, while semantic segmentation must further decide which pixels in the image belong to which target. However, semantic segmentation does not distinguish between different instances of the same category: if target objects overlap, semantic segmentation labels them as a single common pixel region, whereas instance segmentation must tell them apart, as the small example below illustrates.
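
As a toy illustration (a sketch assuming NumPy, with made-up label values): a semantic label map gives two overlapping cats the same class id, while an instance label map gives each cat its own id.

import numpy as np

# Semantic labels: 0 = background, 1 = "cat" -- both cats share the same id
semantic = np.array([[0, 1, 1, 1, 0],
                     [0, 1, 1, 1, 0],
                     [0, 1, 1, 1, 0]])

# Instance labels: 0 = background, 1 = first cat, 2 = second cat
instance = np.array([[0, 1, 1, 2, 0],
                     [0, 1, 2, 2, 0],
                     [0, 1, 2, 2, 0]])

print(np.unique(semantic))   # [0 1]   -> one merged "cat" region
print(np.unique(instance))   # [0 1 2] -> two separate cat instances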

Now let's look at the overall definition of image segmentation. In computer vision, image segmentation (object segmentation) refers to the process of subdividing a digital image into multiple sub-regions (sets of pixels), where the features within the same sub-region are similar to one another while the features of different sub-regions differ noticeably.

The goal of image segmentation is to classify each pixel in the image. The application fields are very wide: autonomous driving, medical imaging, image beautification, 3D reconstruction, etc.:

[Figure: typical application scenarios of image segmentation]

Principle

Simply put, our goal is to take an RGB color image (height×width×3) or a grayscale image (height×width×1) as input and output a segmentation map that contains a category label for every pixel (height×width×1), as shown below:

[Figure: an RGB input image and its corresponding pixel-level segmentation map]

We can easily distinguish the target objects with the naked eye, but for a computer to perform segmentation, the image can only be handed over as raw data and then processed. From basic image processing we know that a picture is stored as a three-dimensional RGB array, and different pixel blocks differ to some degree. In principle these pixel differences can be used to tell target objects apart, but when the color difference between pixels is small, distinguishing objects this way becomes difficult. The prediction target can instead be one-hot encoded, which creates one output channel for each possible class: in other words, the different values in the semantic label matrix mentioned above are separated into their own channels:

[Figure: one output channel per class after one-hot encoding]

When these per-class predictions are collapsed back onto a single channel, the result shows which region of the image each class occupies:

[Figure: per-class channels collapsed into a single class-label map]
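
A minimal sketch of this encode/collapse step (assuming NumPy; the label values are made up): one-hot encode an integer label map into one channel per class, then collapse it back with an argmax over the class axis.

import numpy as np

num_classes = 3
labels = np.array([[0, 0, 1],
                   [0, 2, 2],
                   [1, 1, 2]])          # (H, W) integer class ids

# One-hot encoding: one binary channel per class, shape (H, W, num_classes)
one_hot = np.eye(num_classes)[labels]

# Superimposing back onto a single channel: argmax over the class axis
recovered = one_hot.argmax(axis=-1)
assert (recovered == labels).all()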

If you are interested in the segmentation images above, take a look at the PASCAL VOC dataset. PASCAL VOC (Visual Object Classes) is a dataset widely used for computer vision tasks such as object detection, image classification, and semantic segmentation. It was released through the annual PASCAL VOC challenges held from 2005 to 2012 (the dataset is hosted by the University of Oxford) and defines a standardized image annotation and evaluation protocol:

  • JPEGImages stores the image files.
  • ImageSets/Segmentation records which images are used for the segmentation task.
  • SegmentationClass contains the annotation masks for semantic segmentation.
  • SegmentationObject contains the annotation masks for instance segmentation.
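
As a small sketch of how these folders fit together (assuming the VOC2012 archive has been extracted locally; voc_root and the image id below are illustrative values, not fixed ones), one image and its semantic mask can be read like this:

import numpy as np
from PIL import Image

voc_root = "VOCdevkit/VOC2012"      # hypothetical extraction path
image_id = "2007_000032"            # any id listed under ImageSets/Segmentation

image = Image.open(f"{voc_root}/JPEGImages/{image_id}.jpg")
mask = Image.open(f"{voc_root}/SegmentationClass/{image_id}.png")  # palette PNG

labels = np.array(mask)   # each pixel stores a class index (0 = background, 255 = ignore border)
print(image.size, labels.shape, np.unique(labels))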


Implementation techniques

With the basic concepts above in place, we can look at the techniques used to implement object segmentation, again divided into two parts: semantic segmentation and instance segmentation.

Semantic Segmentation

Definition: Semantic segmentation aims to assign each pixel in an image to its corresponding semantic category, without distinguishing between different instances. For example, in an image containing cars, pedestrians, and roads, the goal of semantic segmentation is to label each pixel as car, pedestrian, or road. This was described in more detail above; if anything is unclear, look back at the earlier figures.

Algorithm: Some common semantic segmentation algorithms include:

  • Fully Convolutional Networks (FCN)
  • U-Net
  • DeepLab
  • SegNet

These algorithms are introduced one by one below and will each be implemented in later articles. To classify a single pixel, the traditional CNN-based approach to semantic segmentation feeds an image patch around that pixel into the CNN for training and prediction. Such methods usually have three major drawbacks (a small sketch after the list illustrates the first two):

  1. High storage overhead. For example, if the image patch used for each pixel is 15×15, the window has to slide continuously across the image and the CNN is run on every window position, so the storage required grows with the number and size of the sliding windows.
  2. Low computational efficiency. Patches of adjacent pixels overlap almost entirely, yet the convolutions are computed patch by patch, so a large part of the computation is repeated.
  3. Limited classification performance. The patch size bounds the receptive field; a patch is usually much smaller than the whole image, so only local features can be extracted.
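
A rough sketch of the patch-based approach described above (plain NumPy; classify_patch stands in for a trained CNN classifier and is not a real API). It makes the redundancy obvious: one forward pass per pixel, over windows that overlap almost completely.

import numpy as np

def patchwise_segmentation(image, classify_patch, patch=15):
    # Label every pixel by classifying the patch x patch window centred on it.
    h, w = image.shape[:2]
    pad = patch // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    labels = np.zeros((h, w), dtype=np.int64)
    for y in range(h):
        for x in range(w):
            window = padded[y:y + patch, x:x + patch]
            labels[y, x] = classify_patch(window)   # one CNN call per pixel
    return labels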

Subsequent algorithms were continuously modified and optimized on CNN to form the current semantic segmentation algorithm ecosystem.

Fully Convolutional Networks (FCN)

Fully Convolutional Networks (FCN) are a deep learning architecture for image segmentation proposed by Jonathan Long, Evan Shelhamer, and Trevor Darrell in 2015. Compared with the traditional convolutional neural network (CNN) architecture, FCN's innovation is to replace the fully connected layers with convolutional layers, allowing the network to accept input images of any size and output a segmentation result of the corresponding size.

Usually, after the traditional CNN structure extracts image features through a convolutional layer, these features are mapped into a fixed-length feature vector through several fully connected layers. This structure is suitable for image-level classification and regression tasks. For example, in classic CNN models such as AlexNet, especially in image classification tasks (such as ImageNet classification), the ultimate goal is to obtain a numerical description of the entire input image, usually a probability distribution vector.

Taking AlexNet as an example, its structure includes convolutional layers, pooling layers and fully connected layers. In the convolution and pooling layers, local features of the image are extracted and gradually reduce the spatial dimension. In the fully connected layer, these features are compressed into a fixed-length vector, and the probability distribution of classification is finally output. Such a design enables the model to understand the entire image at a high level and output global information about the entire input image, which is suitable for image classification. However, for some tasks that require more fine-grained information, such as the location of the target or pixel-level segmentation, such a structure may not be flexible enough.
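
To make the "fixed-length description of the whole image" point concrete, here is a small sketch (AlexNet itself is not bundled with Keras, so VGG16 stands in; weights=None just skips the pretrained download):

import tensorflow as tf

# A classification CNN ends in fully connected layers and squeezes the whole
# image into one fixed-length probability vector -- no spatial information left.
classifier = tf.keras.applications.VGG16(weights=None, include_top=True, classes=1000)
print(classifier.output_shape)   # (None, 1000)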

In contrast, structures such as fully convolutional networks (FCN) replace the fully connected layers with convolutional layers, allowing the model to process input images of variable size and output segmentation results of the corresponding size. This structure is better suited to pixel-level tasks such as semantic segmentation or instance segmentation. Rather than describing this abstractly, let's look at the pictures.

Take the traditional VGG convolutional network as an example:

[Figure: VGG network, convolution and pooling layers followed by fully connected layers]

After the convolution and pooling stages, the 7×7×512 feature map is finally compressed by the fully connected layers into a one-dimensional vector of fixed length 4096. Now let's look at the FCN network structure:

[Figure: FCN network structure]

The FCN network structure has two main parts: the fully convolutional part and the deconvolution part. There are no fully connected layers, so no fixed-length feature vector is required. The fully convolutional part is a classic CNN backbone (such as VGG or ResNet) used to extract features. Semantic segmentation is then achieved mainly through deconvolution layers: a deconvolution layer (Deconvolutional Layer) performs upsampling, mapping the low-resolution feature maps back to the resolution of the input image so that pixel-level segmentation results can be produced. Note, however, that the term "deconvolution" here simply refers to an operation that goes in the opposite direction of ordinary convolution, not a true mathematical inverse.
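
A compact FCN-32s-style sketch (assuming TensorFlow/Keras; this is a simplified illustration, not the exact model from the paper): a VGG16 backbone with its fully connected layers dropped, a 1×1 convolution that produces one score map per class, and a single stride-32 transposed convolution that upsamples the scores back to the input resolution.

import tensorflow as tf
from tensorflow.keras import layers, Model

num_classes = 21   # e.g. PASCAL VOC: 20 object classes + background

# Fully convolutional part: VGG16 without its fully connected layers
backbone = tf.keras.applications.VGG16(include_top=False, weights=None,
                                       input_shape=(224, 224, 3))

# 1x1 convolution: per-class score maps at 1/32 of the input resolution (7x7 here)
scores = layers.Conv2D(num_classes, 1)(backbone.output)

# Deconvolution part: one stride-32 transposed convolution back to 224x224
upsampled = layers.Conv2DTranspose(num_classes, kernel_size=64, strides=32,
                                   padding="same", activation="softmax")(scores)

fcn_32s = Model(backbone.input, upsampled)
print(fcn_32s.output_shape)   # (None, 224, 224, 21)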

Upsampling:

The convolution and pooling operations in the convolutional stage shrink the feature maps. To obtain a dense, pixel-wise prediction at the original image size, the resulting feature maps must be upsampled. Upsampling can be done with bilinear interpolation, and bilinear interpolation is easily implemented as a transposed convolution with a fixed kernel; transposed convolution is what is meant by deconvolution here. In the FCN paper, the authors do not fix this kernel but make it a learnable parameter. The transposed convolution operation is illustrated below: blue is the input of the deconvolution layer, green is its output, and zeros are inserted around and between the input elements before the convolution:
[Figure: transposed convolution of a 2×2 input, zero-padded, with a 3×3 kernel producing a 4×4 output]

As the figure shows, the 2×2 input is first expanded by two rings of zero padding, and convolving it with the 3×3 kernel then produces a 4×4 output, larger than the original. A convolution whose output is larger than its input is called a full convolution, and full convolution is exactly how deconvolution works: it enlarges the original image and increases its resolution, which is why deconvolution is also called "upsampling". It is also worth noting that convolution followed by deconvolution is not a simple transform-and-restore process: convolving an image and then deconvolving it with the same kernel does not recover the original image, because deconvolution merely enlarges it.

The goal of the deconvolution layer is to restore the abstract semantic features to something closer to the original resolution of the input image through upsampling, which helps preserve local detail and improves segmentation accuracy. In TensorFlow, this is usually implemented with the Conv2DTranspose layer. It looks like an ordinary convolutional layer, but it performs a transposed convolution, also known as a fractionally strided convolution or deconvolution, which upsamples by inserting zero elements between the inputs:

from tensorflow.keras.layers import Input, Conv2DTranspose

# Assume the input feature map has shape (4, 4, 256)
input_feature_map = Input(shape=(4, 4, 256))

# Deconvolution (transposed convolution) layer
upsampled_feature_map = Conv2DTranspose(128, (3, 3), strides=(2, 2), padding='same')(input_feature_map)

strides=(2, 2) means the stride is 2 in both the height and width directions, which performs the upsampling: with padding='same', the 4×4×256 input above becomes an 8×8×128 feature map.
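
Following the point above about bilinear interpolation, here is a sketch (assuming TensorFlow/Keras; bilinear_filter and bilinear_kernel are helper names introduced here, not library functions) that initialises a Conv2DTranspose kernel to bilinear-interpolation weights while leaving it trainable, in the spirit of the FCN paper:

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Conv2DTranspose, Input

def bilinear_filter(size):
    # Classic 2-D bilinear interpolation filter of shape (size, size)
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    og = np.ogrid[:size, :size]
    return (1 - abs(og[0] - center) / factor) * (1 - abs(og[1] - center) / factor)

def bilinear_kernel(shape, dtype=None):
    # Conv2DTranspose kernels have shape (height, width, out_channels, in_channels);
    # each channel is upsampled independently with the same bilinear filter.
    k, _, out_c, in_c = (int(d) for d in shape)
    weights = np.zeros((k, k, out_c, in_c), dtype="float32")
    filt = bilinear_filter(k)
    for i in range(min(in_c, out_c)):
        weights[:, :, i, i] = filt
    return tf.constant(weights, dtype=dtype)

# 2x upsampling, initialised to bilinear interpolation but still learnable
inputs = Input(shape=(None, None, 21))
upsampled = Conv2DTranspose(21, kernel_size=4, strides=2, padding="same",
                            kernel_initializer=bilinear_kernel)(inputs)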

U-Net

U-Net is a deep learning architecture for image segmentation tasks proposed by Ronneberger et al. in 2015; it takes its name from its U-shaped network structure. It has been very successful in fields such as medical image segmentation and is especially well suited to small-sample and imbalanced-data settings. Medical staff want to know not only the category present in an image but also where the various tissues are located, and U-Net provides this pixel-level localization: the network classifies every pixel in the image, and the final output is an image segmented according to the pixel categories.

The U-Net network structure is divided into two main parts, an encoder and a decoder, joined in the middle so that the whole network forms a U shape; skip connections copy encoder features across to the decoder. This structure preserves resolution information and helps the network capture both local and global features of the image.

[Figure: U-Net architecture]

The arrows in the picture above represent the following conversion operations:

  1. Blue arrow: convolve the image with a 3×3 kernel and pass the result through a ReLU activation to produce the feature channels;
  2. Gray arrow: crop and copy the corresponding feature map from the downsampling path on the left (the skip connection);
  3. Red arrow: downsample the image with max pooling using a 2×2 pooling kernel;
  4. Green arrow: deconvolution (transposed convolution) that upsamples the image with a 2×2 kernel;
  5. Cyan arrow: convolve the image with a 1×1 kernel.

The specific architecture and operations are shown in the figure above. The U-Net network has four levels in total, and the image is downsampled 4 times and upsampled 4 times. The next chapter, on building a semantic segmentation network, will describe the operations of each level in detail with more concrete code.
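
As a compact sketch of these building blocks (assuming TensorFlow/Keras; padding="same" is used here so no cropping is needed, which slightly simplifies the valid-padding design of the original paper), the arrows above map onto code roughly like this; the full network is left for the next chapter:

from tensorflow.keras import layers

def double_conv(x, filters):
    # Blue arrows: two 3x3 convolutions, each followed by ReLU
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

def down_block(x, filters):
    # Red arrow: 2x2 max pooling halves the spatial resolution
    skip = double_conv(x, filters)
    return layers.MaxPooling2D(2)(skip), skip

def up_block(x, skip, filters):
    # Green arrow: 2x2 transposed convolution doubles the resolution;
    # gray arrow: the matching encoder feature map is concatenated (copy, no crop here)
    x = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(x)
    x = layers.Concatenate()([x, skip])
    return double_conv(x, filters)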

DeepLab

DeepLab is a deep learning architecture developed by Google Research for image segmentation tasks. Its goal is to achieve high-quality semantic segmentation. A series of versions of DeepLab continue to introduce new technologies and improvements, the most important of which are DeepLabV3 and DeepLabV3+.

DeepLabV1:
  1. Atrous convolution: DeepLabV1 introduces atrous convolution, also known as dilated convolution, which enlarges the receptive field to better capture contextual information.
  2. Multi-scale processing: convolution kernels are applied with several different dilation rates to process information at different scales.
DeepLabV2:
  1. Conditional Random Field (CRF): a fully connected CRF is introduced to refine the segmentation results and sharpen object boundaries.
DeepLabV3:
  1. Improved atrous convolution: atrous convolution modules with different dilation rates are combined into an Atrous Spatial Pyramid Pooling (ASPP) structure; the ASPP module applies several dilation rates in parallel to capture multi-scale contextual information.
  2. ASPP: used to handle features at different scales effectively and improve segmentation performance (see the sketch after this list).
  3. Global average pooling: a global average pooling branch is used to incorporate image-level (global) context.
DeepLabV3+:
  1. Encoder-decoder structure: an encoder-decoder structure is introduced, using depthwise separable convolutions for more efficient feature extraction.
  2. Xception as the backbone: the Xception model is used as the encoder, improving feature extraction.
  3. Decoder upsampling: bilinear interpolation and convolutions upsample the encoder output back toward the original resolution.
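
A simplified sketch of the ASPP idea (assuming TensorFlow/Keras; batch normalisation and the image-level pooling branch are omitted, and the dilation rates are just commonly quoted values): several atrous convolutions with different dilation rates run in parallel and their outputs are fused.

from tensorflow.keras import layers

def aspp(x, filters=256, rates=(6, 12, 18)):
    # Parallel branches: one 1x1 convolution plus several 3x3 atrous convolutions
    branches = [layers.Conv2D(filters, 1, padding="same", activation="relu")(x)]
    for rate in rates:
        branches.append(layers.Conv2D(filters, 3, padding="same",
                                      dilation_rate=rate, activation="relu")(x))
    # Concatenate the multi-scale context and fuse it with a 1x1 convolution
    merged = layers.Concatenate()(branches)
    return layers.Conv2D(filters, 1, padding="same", activation="relu")(merged)
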
SegNet

SegNet is a deep learning architecture for image segmentation tasks, proposed in 2015 by a research team at the University of Cambridge. SegNet mainly focuses on semantic segmentation, that is, segmenting images into different semantic regions. Its design is inspired by the application of deep learning in the field of autonomous driving, such as road segmentation.

SegNet consists of two parts, the encoder (Encoder) and the decoder (Decoder), and its structure is somewhat similar to the autoencoder.

Encoder:

  • The encoder consists of convolutional and pooling layers to extract high-level features of the input image. These features are downsampled in the encoder, reducing the spatial resolution.

Decoder:

  • The decoder is the opposite of the encoder and consists of upsampling layers and deconvolutional layers. The task of the decoder is to restore the low-resolution feature map produced by the encoder to the resolution of the original input image.

SegNet uses max pooling in the encoder stage, but unlike standard max pooling it records the index of the maximum value within each pooling window. These pooling indices are passed to the decoder, which uses them for non-linear upsampling during the upsampling stage (a rough sketch follows below). SegNet is mainly used for image segmentation tasks, especially road-scene segmentation in autonomous driving: it assigns every pixel in an image to a semantic category, enabling accurate identification of roads, vehicles, pedestrians, and so on in autonomous driving systems.
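
A rough sketch of this pooling-indices mechanism (assuming TensorFlow; pool_with_indices and unpool_with_indices are illustrative helpers, and a real SegNet applies them stage by stage inside the encoder and decoder):

import tensorflow as tf

def pool_with_indices(x):
    # Encoder side: 2x2 max pooling that also records where each maximum came from
    return tf.nn.max_pool_with_argmax(
        x, ksize=2, strides=2, padding="SAME", include_batch_in_index=True)

def unpool_with_indices(pooled, indices, output_shape):
    # Decoder side: scatter each pooled value back to its recorded position;
    # every other position stays zero (sparse, parameter-free upsampling)
    flat_values = tf.reshape(pooled, [-1])
    flat_indices = tf.reshape(tf.cast(indices, tf.int32), [-1, 1])
    total = tf.reduce_prod(output_shape)
    flat_output = tf.scatter_nd(flat_indices, flat_values, tf.expand_dims(total, 0))
    return tf.reshape(flat_output, output_shape)

# Example: pool a 4x4 feature map down to 2x2, then unpool it back to 4x4
x = tf.random.normal([1, 4, 4, 3])
pooled, indices = pool_with_indices(x)
restored = unpool_with_indices(pooled, indices, tf.shape(x))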

Origin blog.csdn.net/master_hunter/article/details/134524610