Understanding the SCNN network - traffic scene understanding

Paper name: Spatial As Deep: Spatial CNN for Traffic Scene Understanding
Paper download address: https://arxiv.org/abs/1712.06080
GitHub code address: https://github.com/XingangPan/SCNN



0 Preface

The Spatial Convolutional Neural Network (SCNN) was proposed in the paper "Spatial As Deep: Spatial CNN for Traffic Scene Understanding", published at AAAI 2018 (preprint released in 2017) by the Chinese University of Hong Kong and SenseTime. Spatial CNN is mainly used for perception tasks in autonomous driving, in particular lane line detection. The code implemented for the paper is mainly modified from ResNet, so a basic understanding of ResNet is helpful. SCNN achieves explicit and efficient spatial information propagation among neurons in the same CNN layer. It works well when objects have strong shape priors, such as the long, thin, continuous shape of lane lines. SCNN is also highly robust: it achieves high lane line recognition accuracy in normal, crowded, night, no-lane-line, shadow, arrow, glare, curve, and crossroad scenes.

Figure 3 of the original paper shows the difference between a traditional CNN and SCNN: (a) is the traditional model, (b) is the SCNN model. The paper also compares the detection accuracy of various lane line detection models across these scenarios.


1 Problem Driven

Autonomous driving has received a lot of attention in both academia and industry in recent years. Among its most challenging components are computer vision tasks for traffic scene understanding, such as lane detection and semantic segmentation. 1. Lane detection helps guide the vehicle and can be used in driver assistance systems (Urmson et al. 2008). 2. Semantic segmentation provides more detailed locations of surrounding objects such as vehicles or pedestrians.

Problem 1: In practical applications, these tasks can be very challenging in many harsh scenarios, including bad weather, dimness, glare, and so on. Existing CNN networks cannot handle these extreme scenarios well, and their detection accuracy is low for occluded semantic objects that have strong shape priors but weak appearance coherence in images, such as covered lane lines or poles. These are typical cases of strong shape prior but weak appearance coherence: we have strong prior assumptions about the object's shape, but its color, texture, and so on are not very consistent. The human eye and brain can easily discern such objects in an image based on prior knowledge and contextual information, but existing algorithms have not yet reached industry-accepted standards.
Problem 2: Another challenge in traffic scene understanding is that lane line detection performs poorly when the road goes uphill or downhill.

Solution:
To solve problem 1, this paper proposes a spatial CNN (Spatial CNN, SCNN), which extends the deep convolutional neural network to a rich spatial level. In a layer-by-layer CNN, a convolutional layer receives the input of the previous layer, applies a convolution operation and non-linear activation, and sends the result to the next layer; this process is done sequentially. Similarly, SCNN regards the rows or columns of a feature map as layers, and sequentially performs convolution, non-linear activation, and sum operations to form a deep neural network. In this way, information can be transmitted between neurons in the same layer. This is especially useful for structured objects such as lanes, poles, or occluded trucks, since spatial information can be reinforced by propagation between the slices. Even when objects are discontinuous or cluttered, SCNN can preserve the smoothness and continuity of lane lines and poles well.

2 Related Work

For lane detection, most existing algorithms are based on hand-crafted low-level features (Aly 2008; Son et al. 2015; Jung, Youn, and Sull 2016), which limits their ability to cope with harsh conditions. Only Huval et al. (2015) made the first attempt to adopt deep learning in lane detection, but without a large and general dataset. In semantic segmentation, CNN-based methods have become mainstream and achieved great success (Long, Shelhamer, and Darrell 2015; Chen et al. 2017).

There are some other attempts to exploit spatial information in neural networks. Visin et al. (2015) and Bell et al. (2016) use recurrent neural networks to pass information along each row or column, so in one RNN layer each pixel location can only receive information from the same row or column. Liang et al. (2016a; 2016b) proposed variants of LSTM to exploit contextual information in semantic object parsing, but such models are computationally expensive. Researchers have also attempted to combine CNNs with graphical models such as MRF or CRF, where message passing is realized through convolutions with large kernels (Liu et al. 2015; Tompson et al. 2014; Chu et al. 2016). Compared with the above methods, SCNN has the following three advantages:
(1) Compared with traditional dense MRF/CRF, SCNN uses sequential message passing, which greatly improves computational efficiency (addressing the traditional graphical models).
(2) Messages are passed in residual form, which makes SCNN easy to train (addressing the LSTM-variant models).
(3) SCNN is flexible and can be applied to any level of a deep neural network.

Aside: the point of a related-work review is to understand how previous work framed and solved the problem, identify the problems their research could not solve, and thereby lead to this paper's solution.

3 Contributions of this paper (innovations)

3.1 Lane Detection Dataset

In this paper, a large-scale challenging dataset for traffic lane detection is presented. Existing datasets are either too small or cover only a single scene type with light traffic, which easily leads to model underfitting. Moreover, none of them annotates lane markings that are occluded or invisible because of abrasion, even though such lane markings can be inferred by humans and are of high value in real applications.

Figure 2(a) shows some examples, which include urban, rural, and highway scenes. As one of the largest and most congested cities in the world, Beijing provides many challenging traffic scenarios for lane detection.

For each frame, the traffic lanes are manually annotated with cubic splines. As mentioned earlier, in many cases lane markings are occluded by vehicles or otherwise invisible. In real applications it is important that lane detection algorithms can estimate lane positions from context even in these frequently occurring challenging scenarios, so for these cases the lanes are still annotated according to the context, as shown in Fig. 2(a)(2)(4). The algorithm should also be able to distinguish barriers on the road, like the one in Fig. 2(a)(1); thus the lanes on the other side of a barrier are not annotated. The paper focuses on the detection of four lane markings, which receive the most attention in real applications; other lane markings are not annotated.
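The spline-based annotation described above can be illustrated with a small example: a handful of sparse clicked points per lane are fitted with a smooth curve and then sampled densely. The points below are hypothetical, and a single cubic polynomial fit stands in for the cubic splines actually used in the dataset:

```python
import numpy as np

# Hypothetical sparse annotation points for one lane, in image
# coordinates (x across, y down), ordered from the bottom of the
# image upward.
ys = np.array([590.0, 500.0, 410.0, 320.0, 230.0])
xs = np.array([640.0, 610.0, 590.0, 580.0, 575.0])

# Fit a cubic x = f(y) through the annotated points, then sample it
# densely to obtain a smooth, continuous lane curve.
coeffs = np.polyfit(ys, xs, deg=3)
dense_y = np.linspace(ys.min(), ys.max(), 50)
dense_x = np.polyval(coeffs, dense_y)
```

A real annotation tool would use piecewise cubic splines rather than one global polynomial, but the idea is the same: a few control points determine a dense, smooth lane curve.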

3.2 SCNN

To more efficiently learn the spatial relationships and the smooth, continuous priors of lane markings and other structured objects in driving scenes, the paper proposes Spatial CNN. Note that "spatial" here does not mean the same as in "spatial convolution"; it denotes propagating spatial information through a specially designed CNN structure.

As shown in the 'SCNN_D' module of Fig. 3(b), consider an SCNN applied to a 3-D tensor of size C × H × W, where C, H, and W denote the numbers of channels, rows, and columns, respectively. The tensor is split into H slices, and the first slice is sent into a convolution layer with C kernels of size C × w, where w is the kernel width. In a traditional CNN the output of a convolution layer is fed into the next layer, whereas here the output is added to the next slice to produce a new slice. The new slice is then sent to the next convolution layer, and this process continues until the last slice is updated.

Specifically, assume we have a 3-D kernel tensor K with element K_{i,j,k} denoting the weight between an element in channel i of the last slice and an element in channel j of the current slice, with an offset of k columns between the two elements. Also denote the element of the input 3-D tensor X as X_{i,j,k}, where i, j, and k index the channel, row, and column, respectively. The forward computation of SCNN is then:

X'_{i,j,k} = X_{i,j,k},                                              j = 1
X'_{i,j,k} = X_{i,j,k} + f( Σ_{m,n} X'_{m,j-1,k+n-1} × K_{m,i,n} ),  j = 2, 3, …, H

where f is a nonlinear activation function such as ReLU, and X' denotes an element that has been updated. Note that the convolution kernel weights are shared across all slices, so SCNN is a kind of recurrent neural network. Also note that SCNN has directions: in Fig. 3(b), the four 'SCNN' modules with suffixes 'D', 'U', 'R', and 'L' denote downward, upward, rightward, and leftward SCNN, respectively.
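The downward recurrence above can be sketched in a few lines of NumPy. This is an illustration of the forward computation only, not the authors' released implementation; the function and variable names are my own:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def scnn_downward(X, K):
    """Downward spatial message passing (SCNN_D), minimal NumPy sketch.

    X: input tensor of shape (C, H, W).
    K: kernel tensor of shape (C, C, w); K[m, i, n] is the weight
       between channel m of the previous (already updated) row and
       channel i of the current row, at column offset n.
    Returns the updated tensor X' of the same shape.
    """
    C, H, W = X.shape
    _, _, w = K.shape
    pad = w // 2
    Xp = X.copy()
    for j in range(1, H):  # the first row is left unchanged
        # zero-pad the previous updated row along the column axis
        prev = np.pad(Xp[:, j - 1, :], ((0, 0), (pad, pad)))
        msg = np.zeros((C, W))
        for k in range(W):
            # sum over input channels m and kernel offsets n
            msg[:, k] = np.einsum('mn,min->i', prev[:, k:k + w], K)
        # residual message passing: add the activated message to the input
        Xp[:, j, :] = X[:, j, :] + relu(msg)
    return Xp
```

The 'U', 'R', and 'L' modules are the same recurrence run in the other three directions, which in practice amounts to flipping or transposing the tensor before and after the pass. Note how the same K is reused for every row, which is why the paper calls SCNN a kind of recurrent neural network.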


Summary

The above is a summary of the SCNN paper. A follow-up post will cover the GitHub code implementation and an analysis of the model and training code.

Origin blog.csdn.net/qq_45973897/article/details/129612453