Semantic Segmentation of Images

Overview of approaches:

Most semantic segmentation work in the pre-deep-learning era performed image segmentation based on low-level visual cues of the image pixels themselves. Because such methods have no training stage, their computational cost is usually low, but on harder segmentation tasks (without artificial auxiliary information) their results are unsatisfactory.

After computer vision entered the deep learning era, semantic segmentation also entered a new stage of development. A series of CNN-based, "trainable" semantic segmentation methods, represented by the fully convolutional network (FCN), have been proposed one after another, repeatedly pushing up the accuracy of image semantic segmentation. The following introduces three representative lines of work in the deep learning era.

2.2.1 Fully Convolutional Neural Network [3]

The fully convolutional network (FCN) can be regarded as the pioneering work of deep learning on the image semantic segmentation task. It comes from Trevor Darrell's group at UC Berkeley, was published at CVPR 2015, the top conference in computer vision, and received a best paper honorable mention.

The idea of FCN is intuitive: perform pixel-level, end-to-end semantic segmentation directly, implemented on top of a mainstream deep convolutional neural network (CNN). As the name "fully convolutional network" suggests, in FCN the traditional fully connected layers fc6 and fc7 are implemented as convolutional layers, and the final fc8 layer is replaced by a 21-channel 1×1 convolutional layer serving as the final output of the network. There are 21 channels because the PASCAL VOC data contains 21 categories (20 object categories plus one "background" category). The figure below shows the network structure of FCN. If the original image is H×W×3, then after several stacked convolution and pooling operations we obtain a response tensor (activation tensor) of size h×w×c, where c is the number of channels. Note that, due to the downsampling effect of the pooling layers, the height and width of this response tensor are much smaller than those of the original image, which makes direct pixel-level training problematic.
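The "convolutionalization" of the fully connected layers rests on a simple identity: a fully connected layer applied to a flattened feature map computes exactly the same numbers as a convolution whose kernel covers the whole map. A minimal NumPy sketch (with toy sizes standing in for VGG's 7×7×512 features and 4096 fc outputs, which would be too large to run casually):

```python
import numpy as np

# Toy sizes; a VGG-style backbone would have H, W, C, OUT = 7, 7, 512, 4096.
H, W, C, OUT = 7, 7, 8, 16

rng = np.random.default_rng(0)
feat = rng.standard_normal((H, W, C))
fc_weight = rng.standard_normal((OUT, H * W * C))

# Fully connected view: flatten the feature map, then matrix-multiply.
fc_out = fc_weight @ feat.reshape(-1)

# Convolutional view: reshape the same weights into OUT kernels of size
# H x W x C and apply them at the (single valid position of the) map.
conv_kernels = fc_weight.reshape(OUT, H, W, C)
conv_out = np.einsum('ohwc,hwc->o', conv_kernels, feat)

# Both views produce identical numbers: the fc layer *is* a convolution
# whose kernel spans the entire feature map.
assert np.allclose(fc_out, conv_out)
print(conv_out.shape)  # (16,)
```

Viewed as a convolution, the same layer can then slide over larger inputs, producing a spatial grid of class scores instead of a single vector.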


To address the problem caused by downsampling, FCN uses bilinear interpolation to upsample the height and width of the response tensor back to the size of the original image. In addition, to better predict the details in the image, FCN also takes the shallower responses in the network into account. Specifically, the responses of pool4 and pool3 are taken as the outputs of the FCN-16s and FCN-8s models respectively, and are combined with the original FCN-32s output to make the final semantic segmentation prediction (as shown in the figure below).


image
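The skip fusion can be sketched in a few lines. This is a toy NumPy sketch with assumed shapes; it uses nearest-neighbour upsampling via `np.kron` to stay dependency-free, whereas FCN itself uses (learnable) bilinear upsampling:

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES, SIZE = 21, 64   # PASCAL VOC: 21 classes; toy 64x64 input

# Per-layer class scores at each stride (random values standing in for the
# 1x1-convolution outputs attached to pool5/pool4/pool3).
score32 = rng.standard_normal((SIZE // 32, SIZE // 32, NUM_CLASSES))  # FCN-32s head
score16 = rng.standard_normal((SIZE // 16, SIZE // 16, NUM_CLASSES))  # from pool4
score8  = rng.standard_normal((SIZE // 8,  SIZE // 8,  NUM_CLASSES))  # from pool3

def up2(x):
    # 2x nearest-neighbour upsampling (FCN uses bilinear interpolation).
    return np.kron(x, np.ones((2, 2, 1)))

fused16 = up2(score32) + score16               # FCN-16s fusion
fused8  = up2(fused16) + score8                # FCN-8s fusion
full    = np.kron(fused8, np.ones((8, 8, 1)))  # x8 back to input resolution

labels = full.argmax(axis=-1)                  # per-pixel class prediction
print(labels.shape)  # (64, 64)
```

The shallower the fused layer, the smaller the final upsampling factor, which is exactly why FCN-8s recovers finer boundaries than FCN-32s.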

The figure below shows the semantic segmentation results when different layers serve as the output. One can clearly see that, because the pooling layers induce different downsampling factors, the fineness of the segmentation differs. For example, FCN-32s, being the output of FCN's last convolution-and-pooling stage, has the largest downsampling factor, and its segmentation result is the coarsest; FCN-8s, with a smaller downsampling factor, obtains finer segmentation results.

image

2.2.2 Dilated Convolutions [4]

One shortcoming of FCN is that, because of the pooling layers, the size (height and width) of the response tensor keeps shrinking, yet FCN is designed to output a map of the same size as the input, so FCN performs upsampling. But upsampling cannot losslessly recover all the information that was discarded.
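A tiny NumPy example makes the information loss concrete: once a 2×2 max-pool collapses a patch to one value, upsampling can only spread that single value back out.

```python
import numpy as np

x = np.array([[1., 2.],
              [3., 4.]])

# 2x2 max pooling collapses the patch to a single value ...
pooled = x.max()                    # 4.0

# ... and upsampling can only replicate that one value:
restored = np.full((2, 2), pooled)

print(restored)
# The distinct original values 1, 2, 3 are gone for good.
assert not np.array_equal(restored, x)
```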

Dilated convolution offers a good solution: since the downsampling of pooling causes information loss, remove the pooling layers. However, removing a pooling layer shrinks the receptive field of every subsequent layer, which reduces the prediction accuracy of the whole model. The main contribution of dilated convolution is showing how to remove the pooling downsampling without shrinking the network's receptive field.

Take a 3×3 convolution kernel as an example. A traditional kernel multiplies point-wise with a "contiguous" 3×3 patch of the input tensor and sums the products (Figure a below; the red dots are the input "pixels" covered by the kernel, and the green region is the receptive field in the original input). A dilated convolution kernel instead convolves with a 3×3 patch sampled from the input tensor at a fixed pixel interval. As shown in Figure b, after removing one pooling layer, the traditional convolutional layer that followed it must be replaced by a dilated convolution layer with dilation = 2: the kernel now takes every other "pixel" of the input tensor as its input patch, and one can see that the receptive field with respect to the original input has been dilated. By the same reasoning, if another pooling layer is removed, the following convolutional layer must be replaced by a dilated convolution layer with dilation = 4, as shown in Figure c. In this way, even with the pooling layers removed, the network's receptive field is preserved, and hence so is the accuracy of image semantic segmentation.

image
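The mechanics can be sketched directly: a dilated 3×3 kernel samples the input every `dilation` pixels, so its footprint grows to dilation·(k−1)+1 with no extra parameters. A minimal single-channel NumPy implementation (a sketch, not the paper's code):

```python
import numpy as np

def dilated_conv2d(img, kernel, dilation=1):
    """'Valid' 2D convolution of a single-channel image with a dilated kernel.
    The kernel samples the input every `dilation` pixels, so its footprint
    is dilation*(k-1)+1 pixels wide without adding any parameters."""
    k = kernel.shape[0]
    span = dilation * (k - 1) + 1                 # effective receptive field
    h, w = img.shape
    out = np.zeros((h - span + 1, w - span + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Strided slice picks every `dilation`-th pixel of the patch.
            patch = img[i:i + span:dilation, j:j + span:dilation]
            out[i, j] = (patch * kernel).sum()
    return out

img = np.arange(49, dtype=float).reshape(7, 7)
k = np.ones((3, 3))

print(dilated_conv2d(img, k, dilation=1).shape)  # (5, 5): footprint 3
print(dilated_conv2d(img, k, dilation=2).shape)  # (3, 3): footprint 5
print(dilated_conv2d(img, k, dilation=3).shape)  # (1, 1): footprint 7
```

With dilation 3, nine weights already "see" the entire 7×7 input, which is the receptive-field growth the figures illustrate.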

As can be seen from the semantic segmentation results below, dilated convolution greatly improves both the recognition of semantic categories and the fineness of segmentation details.

image

2.2.3 Post-processing operations represented by conditional random fields [5]

At present, many deep-learning-based image semantic segmentation works use a conditional random field (CRF) as the final post-processing step to refine the semantic prediction results.

Generally speaking, the CRF treats the category of each pixel in the image as a random variable x_i, and then models the relationship between every pair of variables, forming a complete graph (as shown in the figure below).

image

In the fully connected CRF model, the corresponding energy function is:

E(x) = Σ_i ψ_u(x_i) + Σ_{i<j} ψ_p(x_i, x_j)

Here the unary term represents the semantic category assigned to pixel x_i; this category can come from the predictions of FCN or any other semantic segmentation model. The second, pairwise (binary) term takes the semantic relationships between pixels into account. For example, the probability that "sky" and "bird" pixels are adjacent in physical space should be higher than the probability that "sky" and "fish" pixels are adjacent. Finally, by minimizing the CRF energy function, the semantic predictions of FCN are refined into the final segmentation result. It is worth mentioning that there has been work [5] that embeds the CRF step, originally separate from deep model training, into the neural network itself, i.e., integrating FCN+CRF into one end-to-end system. The benefit is that the CRF energy on the final prediction can directly guide the training of the FCN's parameters, yielding better image semantic segmentation results.

image
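To make the energy function concrete, here is a deliberately simplified NumPy sketch: unary costs per pixel plus a Potts-style pairwise penalty on a 4-connected grid. Real dense CRFs use a fully connected graph with appearance and position kernels; this toy version only illustrates how smoother labelings get lower energy.

```python
import numpy as np

def crf_energy(labels, unary, pairwise_weight=1.0):
    """Energy of a labeling on a 4-connected pixel grid.
    unary[i, j, c] is the cost of assigning class c to pixel (i, j), e.g.
    negative log-probabilities from an FCN. The pairwise term is a simple
    Potts penalty charging `pairwise_weight` for each pair of neighbouring
    pixels that disagree (a toy stand-in for the dense-CRF kernels)."""
    h, w = labels.shape
    e = unary[np.arange(h)[:, None], np.arange(w)[None, :], labels].sum()
    e += pairwise_weight * (labels[1:, :] != labels[:-1, :]).sum()  # vertical pairs
    e += pairwise_weight * (labels[:, 1:] != labels[:, :-1]).sum()  # horizontal pairs
    return e

# 3x3 image, 2 classes; the unary term mildly prefers class 0 everywhere.
unary = np.zeros((3, 3, 2))
unary[..., 1] = 0.5
smooth = np.zeros((3, 3), dtype=int)     # all pixels labeled class 0
noisy = smooth.copy()
noisy[1, 1] = 1                          # one dissenting pixel

print(crf_energy(smooth, unary))  # 0.0
print(crf_energy(noisy, unary))   # 0.5 unary + 4 disagreeing edges = 4.5
```

Minimizing this energy drives isolated, inconsistent labels toward agreement with their neighbours, which is exactly the smoothing effect CRF post-processing has on FCN outputs.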

3 Outlook

Although deep-learning-based image semantic segmentation achieves far better results than traditional methods, it demands a heavy annotation burden: not only massive image data, but also accurate pixel-level label information (semantic labels) for those images. Therefore, more and more researchers are turning their attention to image semantic segmentation under weakly supervised conditions. In this setting, images need only image-level annotations (e.g., "has people", "has cars", "no television"), without expensive pixel-level information, while aiming for segmentation accuracy comparable to existing methods.

In addition, instance-level image semantic segmentation is also a popular problem. It requires not only segmenting objects of different semantic classes, but also separating different individuals of the same class (for example, the pixels of the nine chairs in the picture below need to be marked with different colors).

image
