Autopilot scene segmentation system based on improved Deeplabv3plus (source code & tutorial)

1. Research Background

With the rapid development of artificial intelligence, autonomous driving is moving ever closer to everyday life. An autonomous driving system first relies on various on-board sensors to collect environmental data around the vehicle, then applies perception algorithms to turn that data into computer-interpretable environmental information, which in turn guides planning and decision making. However, sensors such as lidar are still expensive, which hinders the large-scale adoption of autonomous vehicles. Cameras, by contrast, are cheap and capture a large amount of information about the surrounding environment, so research on camera-based perception algorithms for autonomous driving is of great significance.
Image semantic segmentation is one of the most important technologies in autonomous driving perception. From its results, the vehicle's drivable area and the obstacles ahead can be determined. With the rise of deep learning and convolutional neural networks in recent years, many deep-learning-based semantic segmentation algorithms have emerged that can deliver end-to-end segmentation outputs. However, several problems remain before such algorithms can be used in a practical autonomous driving system:
(1) Many algorithms cannot run in real time, which fails to meet the safety requirements of autonomous driving;
(2) Autonomous driving scenes are complex and varied, so the training samples are unbalanced, and many algorithms perform poorly on classes with few samples.
Therefore, many problems in semantic segmentation of urban autonomous driving scenes remain to be solved. The focus of this work is a real-time image semantic segmentation algorithm for autonomous driving scenes.

2. Picture demonstration

2.png

3.png

4.png

3. Video demonstration

Automatic driving scene segmentation system based on improved Deeplabv3plus (source code & tutorial) - Bilibili

4. Introduction to Deeplabv3plus

DeepLabV3plus is a semantic segmentation model that introduces a new encoder-decoder structure: DeepLabv3 serves as the encoder module, paired with a simple yet effective decoder module. The model controls the resolution of the extracted encoder features through atrous (dilated) convolution, trading off accuracy against running time. In addition, the network adapts the Xception model for the segmentation task and applies depthwise separable convolution in both the ASPP module and the decoder module, yielding a faster and stronger encoder-decoder network. Its network structure is shown below:
image.png
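
To make this concrete, the short snippet below (TensorFlow 1.x with tf.contrib.slim, the same toolkit as the code in Section 6) shows that applying the same 3x3 convolution with a larger atrous rate enlarges the receptive field while leaving the spatial resolution of the output unchanged; the placeholder shape is arbitrary and only for illustration.

import tensorflow as tf

slim = tf.contrib.slim

# The same 3x3 convolution at two atrous rates: both outputs keep the
# 64x64 resolution of the input, but the rate-6 branch sees a much wider
# context. This is how the encoder trades accuracy against running time.
x = tf.placeholder(tf.float32, [None, 64, 64, 256])
dense_branch = slim.conv2d(x, 256, [3, 3], rate=1, scope='rate1')
atrous_branch = slim.conv2d(x, 256, [3, 3], rate=6, scope='rate6')
print(dense_branch.shape, atrous_branch.shape)  # both (?, 64, 64, 256)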

5. Improvement direction

Replace the backbone

The backbone in the DeepLabV3+ paper is Xception; the project I downloaded uses ResNetV2-50 and ResNetV2-101 instead.

The full model saved as a frozen PB file is more than 100 MB, and inference on a CPU takes more than 1 second.

To speed up the network, the backbone is replaced with MobileNetV2 (a rough sketch of this swap follows).
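
As an illustration of the idea only (not the project's actual code), the sketch below pulls a high-level and a low-level feature map out of the Keras MobileNetV2 implementation so they can feed the ASPP module and the decoder respectively; the layer names are assumptions and should be verified against base.summary().

import tensorflow as tf

def mobilenetv2_backbone(input_shape=(512, 512, 3)):
  """Sketch: MobileNetV2 as a DeepLabV3+ encoder.

  Returns a high-level feature map (about 1/16 of the input size) for the
  ASPP module and a low-level feature map (about 1/4) for the decoder.
  NOTE: the layer names below are assumptions -- confirm with base.summary().
  """
  base = tf.keras.applications.MobileNetV2(
      input_shape=input_shape, include_top=False, weights=None)
  high_level = base.get_layer('block_13_expand_relu').output  # ~1/16 size
  low_level = base.get_layer('block_3_expand_relu').output    # ~1/4 size
  return tf.keras.Model(inputs=base.input,
                        outputs=[high_level, low_level])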

Replace ordinary convolutions with depthwise separable convolutions

The parameter counts of the ASPP and decoder parts are also very large, so all ordinary convolutions there are replaced with depthwise separable convolutions.

At the same time, the number of channels in the ASPP and decoder parts is reduced to some extent (see the sketch below).
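
A minimal sketch of this change, assuming an ASPP branch with a 256-channel input (function names are illustrative): replacing slim.conv2d with slim.separable_conv2d and halving the output channels shrinks the branch from roughly 3*3*256*256 ≈ 590K weights to 3*3*256 + 256*128 ≈ 35K.

import tensorflow as tf

slim = tf.contrib.slim

def aspp_branch_standard(features, rate):
  # Original form: plain 3x3 atrous convolution with 256 output channels.
  # Weights for a 256-channel input: 3*3*256*256, roughly 590K.
  return slim.conv2d(features, 256, [3, 3], rate=rate,
                     scope='rate{}'.format(rate))

def aspp_branch_lightweight(features, rate):
  # Modified form: depthwise separable atrous convolution with the output
  # channels cut to 128. Weights: 3*3*256 + 256*128, roughly 35K.
  return slim.separable_conv2d(features, 128, [3, 3], depth_multiplier=1,
                               rate=rate,
                               scope='sep_rate{}'.format(rate))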

Add fusion of low-level features

When analyzing the ID card components, the detail segmentation was found to be poor. To improve the details, the 1/2-size feature map is fused with the decoder feature, which ultimately achieved good results (a sketch of this fusion follows).
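
A minimal sketch of this fusion step in the same tf.contrib.slim style as the code in Section 6; tensor names and channel counts are illustrative assumptions, not the project's exact wiring.

import tensorflow as tf

slim = tf.contrib.slim

def fuse_half_scale_feature(decoder_feat, half_scale_feat):
  """Merge the 1/2-resolution backbone feature with the decoder feature
  to recover fine details. Channel counts are illustrative."""
  # Project the shallow feature to a few channels so it does not dominate.
  low = slim.conv2d(half_scale_feat, 48, [1, 1], scope='low_proj')
  # Upsample the decoder output to the 1/2-scale resolution and concatenate.
  up = tf.image.resize_bilinear(decoder_feat, tf.shape(low)[1:3])
  fused = tf.concat([up, low], axis=3)
  # Refine the fused map with a depthwise separable convolution.
  return slim.separable_conv2d(fused, 128, [3, 3], depth_multiplier=1,
                               scope='fuse_refine')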

Improved network structure

image.png

6. Code implementation

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf

from src.deeplabv3.nets.config import *
from src.deeplabv3.nets import resnet_utils
from src.deeplabv3.nets.resnet_v1 import bottleneck, resnet_arg_scope

slim = tf.contrib.slim

@slim.add_arg_scope
def bottleneck_hdc(inputs,
               depth,
               depth_bottleneck,
               stride,
               rate=1,
               multi_grid=(1,2,4),
               outputs_collections=None,
               scope=None,
               use_bounded_activations=False):
  """Hybrid Dilated Convolution Bottleneck.
  Multi_Grid = (1,2,4)
  See Understanding Convolution for Semantic Segmentation.
  When putting together two consecutive ResNet blocks that use this unit, one
  should use stride = 2 in the last unit of the first block.
  Args:
    inputs: A tensor of size [batch, height, width, channels].
    depth: The depth of the ResNet unit output.
    depth_bottleneck: The depth of the bottleneck layers.
    stride: The ResNet unit's stride. Determines the amount of downsampling of
      the units output compared to its input.
    rate: An integer, rate for atrous convolution.
    multi_grid: the multi-grid structure.
    outputs_collections: Collection to add the ResNet unit output.
    scope: Optional variable_scope.
    use_bounded_activations: Whether or not to use bounded activations. Bounded
      activations better lend themselves to quantized inference.
  Returns:
    The ResNet unit's output.
  """
  with tf.variable_scope(scope, 'bottleneck_v1', [inputs]) as sc:
    depth_in = slim.utils.last_dimension(inputs.get_shape(), min_rank=4)
    if depth == depth_in:
      shortcut = resnet_utils.subsample(inputs, stride, 'shortcut')
    else:
      shortcut = slim.conv2d(
          inputs,
          depth, [1, 1],
          stride=stride,
          activation_fn=tf.nn.relu6 if use_bounded_activations else None,
          scope='shortcut')

    residual = slim.conv2d(inputs, depth_bottleneck, [1, 1], stride=1, 
      rate=rate*multi_grid[0], scope='conv1')
    residual = resnet_utils.conv2d_same(residual, depth_bottleneck, 3, stride,
      rate=rate*multi_grid[1], scope='conv2')
    residual = slim.conv2d(residual, depth, [1, 1], stride=1, 
      rate=rate*multi_grid[2], activation_fn=None, scope='conv3')

    if use_bounded_activations:
      # Use clip_by_value to simulate bandpass activation.
      residual = tf.clip_by_value(residual, -6.0, 6.0)
      output = tf.nn.relu6(shortcut + residual)
    else:
      output = tf.nn.relu(shortcut + residual)

    return slim.utils.collect_named_outputs(outputs_collections,
                                            sc.name,
                                            output)

def deeplabv3(inputs,
              num_classes,
              depth=50,
              aspp=True,
              reuse=None,
              is_training=True):
  """DeepLabV3
  Args:
    inputs: A tensor of size [batch, height, width, channels].
    depth: The number of layers of the ResNet.
    aspp: Whether to use ASPP module, if True, will use 4 blocks with 
      multi_grid=(1,2,4), if False, will use 7 blocks with multi_grid=(1,2,1).
    reuse: Whether or not the network and its variables should be reused. To be
      able to reuse 'scope' must be given.
  Returns:
    net: A rank-4 tensor of size [batch, height_out, width_out, channels_out].
    end_points: A dictionary from components of the network to the 
      corresponding activation.
  """
  if aspp:
    multi_grid = (1,2,4)
  else:
    multi_grid = (1,2,1)
  scope ='resnet{}'.format(depth)
  with tf.variable_scope(scope, [inputs], reuse=reuse) as sc:
    end_points_collection = sc.name + '_end_points'
    with slim.arg_scope(resnet_arg_scope(weight_decay=args.weight_decay, 
      batch_norm_decay=args.bn_weight_decay)):
      with slim.arg_scope([slim.conv2d, bottleneck, bottleneck_hdc],
                          outputs_collections=end_points_collection):
        with slim.arg_scope([slim.batch_norm], is_training=is_training):
          net = inputs
          net = resnet_utils.conv2d_same(net, 64, 7, stride=2, scope='conv1')
          net = slim.max_pool2d(net, [3, 3], stride=2, scope='pool1')

          with tf.variable_scope('block1', [net]) as sc:
            base_depth = 64
            for i in range(2):
              with tf.variable_scope('unit_%d' % (i + 1), values=[net]):
                net = bottleneck(net, depth=base_depth * 4, 
                  depth_bottleneck=base_depth, stride=1)
            with tf.variable_scope('unit_3', values=[net]):
              net = bottleneck(net, depth=base_depth * 4, 
                depth_bottleneck=base_depth, stride=2)
            net = slim.utils.collect_named_outputs(end_points_collection, 
              sc.name, net)

          with tf.variable_scope('block2', [net]) as sc:
            base_depth = 128
            for i in range(3):
              with tf.variable_scope('unit_%d' % (i + 1), values=[net]):
                net = bottleneck(net, depth=base_depth * 4, 
                  depth_bottleneck=base_depth, stride=1)
            with tf.variable_scope('unit_4', values=[net]):
              net = bottleneck(net, depth=base_depth * 4, 
                depth_bottleneck=base_depth, stride=2)
            net = slim.utils.collect_named_outputs(end_points_collection, 
              sc.name, net)

          with tf.variable_scope('block3', [net]) as sc:
            base_depth = 256

            num_units = 6
            if depth == 101:
              num_units = 23
            elif depth == 152:
              num_units = 36

            for i in range(num_units):
              with tf.variable_scope('unit_%d' % (i + 1), values=[net]):
                net = bottleneck(net, depth=base_depth * 4, 
                  depth_bottleneck=base_depth, stride=1)
            net = slim.utils.collect_named_outputs(end_points_collection, 
              sc.name, net)

          with tf.variable_scope('block4', [net]) as sc:
            base_depth = 512

            for i in range(3):
              with tf.variable_scope('unit_%d' % (i + 1), values=[net]):
                net = bottleneck_hdc(net, depth=base_depth * 4, 
                  depth_bottleneck=base_depth, stride=1, rate=2, 
                  multi_grid=multi_grid)
            net = slim.utils.collect_named_outputs(end_points_collection, 
              sc.name, net)

          if aspp:
            with tf.variable_scope('aspp', [net]) as sc:
              aspp_list = []
              branch_1 = slim.conv2d(net, 256, [1,1], stride=1, 
                scope='1x1conv')
              branch_1 = slim.utils.collect_named_outputs(
                end_points_collection, sc.name, branch_1)
              aspp_list.append(branch_1)

              for i in range(3):
                branch_2 = slim.conv2d(net, 256, [3,3], stride=1, rate=6*(i+1), scope='rate{}'.format(6*(i+1)))
                branch_2 = slim.utils.collect_named_outputs(end_points_collection, sc.name, branch_2)
                aspp_list.append(branch_2)

              aspp = tf.add_n(aspp_list)
              aspp = slim.utils.collect_named_outputs(end_points_collection, sc.name, aspp)
              net = aspp

            with tf.variable_scope('img_pool', [net]) as sc:
              """Image Pooling
              See ParseNet: Looking Wider to See Better
              """
              pooled = tf.reduce_mean(net, [1, 2], name='avg_pool', 
                keep_dims=True)
              pooled = slim.utils.collect_named_outputs(end_points_collection, 
                sc.name, pooled)

              pooled = slim.conv2d(pooled, 256, [1,1], stride=1, scope='1x1conv')
              pooled = slim.utils.collect_named_outputs(end_points_collection, 
                sc.name, pooled)

              pooled = tf.image.resize_bilinear(pooled, tf.shape(net)[1:3])
              pooled = slim.utils.collect_named_outputs(end_points_collection, 
                sc.name, pooled)

            with tf.variable_scope('fusion', [aspp, pooled]) as sc:
              net = tf.concat([aspp, pooled], 3)
              net = slim.utils.collect_named_outputs(end_points_collection, 
                sc.name, net)

              net = slim.conv2d(net, 256, [1,1], stride=1, scope='1x1conv')
              net = slim.utils.collect_named_outputs(end_points_collection, 
                sc.name, net)
          else:
            with tf.variable_scope('block5', [net]) as sc:
              base_depth = 512

              for i in range(3):
                with tf.variable_scope('unit_%d' % (i + 1), values=[net]):
                  net = bottleneck_hdc(net, depth=base_depth * 4, 
                    depth_bottleneck=base_depth, stride=1, rate=4)
              net = slim.utils.collect_named_outputs(end_points_collection, 
                sc.name, net)

            with tf.variable_scope('block6', [net]) as sc:
              base_depth = 512

              for i in range(3):
                with tf.variable_scope('unit_%d' % (i + 1), values=[net]):
                  net = bottleneck_hdc(net, depth=base_depth * 4, 
                    depth_bottleneck=base_depth, stride=1, rate=8)
              net = slim.utils.collect_named_outputs(end_points_collection, 
                sc.name, net)

            with tf.variable_scope('block7', [net]) as sc:
              base_depth = 512

              for i in range(3):
                with tf.variable_scope('unit_%d' % (i + 1), values=[net]):
                  net = bottleneck_hdc(net, depth=base_depth * 4, 
                    depth_bottleneck=base_depth, stride=1, rate=16)
              net = slim.utils.collect_named_outputs(end_points_collection, 
                sc.name, net)

          net = slim.conv2d(net, num_classes, [1,1], stride=1, 
            activation_fn=None, normalizer_fn=None, scope='logits')
          net = slim.utils.collect_named_outputs(end_points_collection, 
            sc.name, net)

          end_points = slim.utils.convert_collection_to_dict(
              end_points_collection)

          return net, end_points

if __name__ == "__main__":
  x = tf.placeholder(tf.float32, [None, 512, 512, 3])

  net, end_points = deeplabv3(x, 21)
  for i in end_points:
    print(i, end_points[i])

7. System integration

The complete source code, environment deployment video tutorial, dataset, and custom UI interface are shown in the figure below.
1.png
Refer to the blog "Complete Source Code & Environment Deployment Video Tutorial & Dataset & Custom UI Interface"

