Vision and lidar fusion 3D detection (1) AVOD

1 Overview

AVOD (Aggregate View Object Detection) is similar to MV3D: it is a 3D object detection algorithm that fuses lidar point clouds with camera RGB images. The difference is that MV3D combines the camera RGB image with the point cloud's bird's-eye-view (BEV) and front-view projections, while AVOD fuses only the camera RGB image and the point cloud BEV map.

In terms of network structure, AVOD is a two-stage detector, which naturally brings Faster R-CNN to mind, since that is also a two-stage design. Two-stage detection is comparatively slow, so it is only suitable for scenarios where the required frame rate is modest but detection accuracy matters.

The following is the network structure diagram of AVOD:

2. Network structure

The network first performs feature extraction and channel reduction on the inputs, fuses them, and obtains regions that contain foreground objects (a preliminary regression). The candidate regions are then projected onto the bird's-eye view and the RGB image to obtain the regions to crop; these are cropped, resized to a uniform size, and fused again, after which the network outputs the categories and 3D detection boxes of the objects in the scene.

2.1 Laser point cloud data preprocessing

Compared with MV3D, AVOD simplifies the lidar point cloud processing. The intensity map is removed, and the height map is built by slicing the point cloud into M layers: with z in the range (0, 2.5) m and a slice interval of 0.5 m, 5 layers are obtained, and within each grid cell of each layer the highest point is kept.

The density map is computed per grid cell from the number of points N falling in the cell as min(1.0, log(N+1)/log(16)).
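To make this encoding concrete, below is a minimal NumPy sketch of such a BEV preprocessing step. The grid extents, resolution, and function name are illustrative assumptions; only the 0–2.5 m range split into 5 height slices and the min(1.0, log(N+1)/log(16)) density channel follow the description above.

```python
import numpy as np

def make_bev_maps(points, x_range=(0.0, 70.0), y_range=(-40.0, 40.0),
                  z_range=(0.0, 2.5), resolution=0.1, num_slices=5):
    """Encode a lidar point cloud (N, 3) as BEV maps: `num_slices`
    max-height channels plus one point-density channel."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]

    # Keep only points inside the cropped volume.
    keep = ((x >= x_range[0]) & (x < x_range[1]) &
            (y >= y_range[0]) & (y < y_range[1]) &
            (z >= z_range[0]) & (z < z_range[1]))
    x, y, z = x[keep], y[keep], z[keep]

    rows = ((x - x_range[0]) / resolution).astype(int)
    cols = ((y - y_range[0]) / resolution).astype(int)
    slice_height = (z_range[1] - z_range[0]) / num_slices
    layers = np.clip(((z - z_range[0]) / slice_height).astype(int), 0, num_slices - 1)

    n_rows = int(round((x_range[1] - x_range[0]) / resolution))
    n_cols = int(round((y_range[1] - y_range[0]) / resolution))

    # Height map: maximum point height per (row, col, slice) voxel.
    height_maps = np.zeros((n_rows, n_cols, num_slices), dtype=np.float32)
    np.maximum.at(height_maps, (rows, cols, layers), z)

    # Density map: per-cell point count mapped through min(1, log(N+1)/log(16)).
    counts = np.zeros((n_rows, n_cols), dtype=np.float32)
    np.add.at(counts, (rows, cols), 1.0)
    density = np.minimum(1.0, np.log(counts + 1.0) / np.log(16.0))

    return np.concatenate([height_maps, density[..., None]], axis=-1)
```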

2.2 Feature extraction

The Feature Extractor at the front of the network processes the input data and produces feature maps. Compared with the feature extractor in MV3D (a modified VGG-16), AVOD's feature extractor uses an FPN-style encoder-decoder to extract features from the lidar point cloud BEV and the RGB image. The resulting full-resolution feature maps combine low-level and high-level information, giving the network multi-scale detection ability and an advantage over MV3D on small objects.
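A rough PyTorch sketch of such an encoder-decoder (FPN-style) extractor is given below. The depths, layer sizes, and class name are assumptions for illustration rather than the exact AVOD architecture; the point is the pattern of downsampling, lateral 1×1 connections, and upsampling back to a full-resolution feature map.

```python
import torch.nn as nn
import torch.nn.functional as F

class FPNExtractor(nn.Module):
    """Encoder-decoder feature extractor returning a full-resolution map."""
    def __init__(self, in_channels, base=32):
        super().__init__()
        self.enc1 = self._block(in_channels, base)      # full resolution
        self.enc2 = self._block(base, base * 2)         # 1/2 resolution
        self.enc3 = self._block(base * 2, base * 4)     # 1/4 resolution
        self.lat1 = nn.Conv2d(base, base * 2, 1)        # lateral 1x1 convs
        self.lat2 = nn.Conv2d(base * 2, base * 2, 1)
        self.dec2 = nn.Conv2d(base * 4, base * 2, 3, padding=1)
        self.dec1 = nn.Conv2d(base * 2, base * 2, 3, padding=1)

    @staticmethod
    def _block(cin, cout):
        return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        c1 = self.enc1(x)                               # bottom-up pathway
        c2 = self.enc2(F.max_pool2d(c1, 2))
        c3 = self.enc3(F.max_pool2d(c2, 2))
        # Top-down pathway: upsample and merge with lateral (skip) features.
        p2 = self.lat2(c2) + F.interpolate(self.dec2(c3), scale_factor=2,
                                           mode="bilinear", align_corners=False)
        p1 = self.lat1(c1) + F.interpolate(self.dec1(p2), scale_factor=2,
                                           mode="bilinear", align_corners=False)
        return p1                                       # full-resolution feature map
```

AVOD runs one such extractor on the RGB image and another on the BEV maps, giving the two full-resolution feature maps used in the later fusion steps.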

2.3 Reduce data volume

After feature extraction in each branch, the number of channels is reduced with a 1×1 convolution. Quoting from the original paper:

In some scenarios, the region proposal network is required to save feature crops for 100K anchors in GPU memory. Attempting to extract feature crops directly from high dimensional feature maps imposes a large memory overhead per input view. As an example, extracting 7 × 7 feature crops for 100K anchors from a 256-dimensional feature map requires around 5 gigabytes of memory assuming 32-bit floating point representation. Furthermore, processing such high-dimensional feature crops with the RPN greatly increases its computational requirements.
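The 5-gigabyte figure quoted above can be checked with a quick back-of-the-envelope calculation:

```python
num_anchors = 100_000        # anchors whose feature crops must be kept in memory
crop_cells = 7 * 7           # 7 x 7 spatial crop
channels = 256               # feature-map depth before the 1x1 reduction
bytes_per_value = 4          # 32-bit floating point

total_bytes = num_anchors * crop_cells * channels * bytes_per_value
print(total_bytes / 1e9)     # ~5.0 GB
```

Reducing the channel count with the 1×1 convolution shrinks this footprint proportionally.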

2.4 Two-stage detection process

The two-stage nature of AVOD shows up in two ways: detection boxes are produced in two rounds, and the data are fused in two steps, once at the data level and once at the feature level. The basic process is as follows:

1. AVOD processes the RGB image and the point cloud BEV in two separate branches, each using the FPN-style feature extractor (FPN itself is not elaborated here), producing two full-resolution feature maps: the image feature map and the BEV feature map (from the height and density maps).

2. To fuse the two feature maps they must first be brought to a consistent size: each branch is passed through a 1×1 convolution for further feature extraction and channel reduction, the corresponding regions are resized to the same size, and then fusion is performed. This is the first fusion; compared with the second one it is closer to data-level fusion.

3. The fused features are fed into an RPN similar to the one in Faster R-CNN. After fully connected layers, a first round of classification and 3D bounding box regression, and NMS, a set of region proposals (candidate boxes) is generated.

4. The generated proposals are then combined with the two feature maps from step 1: the corresponding regions are cropped, resized, and fused a second time (region-proposal-level fusion).

5. The fused features go through fully connected layers for the second classification and 3D bounding box regression, and NMS post-processing produces the final detection boxes.
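The crop-resize-fuse operation that both fusion steps rely on can be sketched as follows. This sketch uses torchvision's roi_align as a stand-in for crop-and-resize, together with the equal-weight element-wise mean fusion described above; the tensor shapes, ROI values, and function name are made-up examples.

```python
import torch
from torchvision.ops import roi_align

def fuse_views(bev_feat, img_feat, bev_rois, img_rois, out_size=3):
    """Crop each anchor's projection from both views, resize the crops to a
    common size, and fuse them by element-wise mean."""
    # roi_align expects boxes as a Tensor[K, 5]: (batch_index, x1, y1, x2, y2).
    bev_crops = roi_align(bev_feat, bev_rois, output_size=(out_size, out_size))
    img_crops = roi_align(img_feat, img_rois, output_size=(out_size, out_size))
    return 0.5 * (bev_crops + img_crops)        # element-wise mean fusion

# Toy usage with random feature maps and a single anchor per view.
bev_feat = torch.randn(1, 32, 700, 800)         # BEV feature map (channels reduced to 32)
img_feat = torch.randn(1, 32, 90, 300)          # image feature map
bev_rois = torch.tensor([[0.0, 10.0, 20.0, 40.0, 60.0]])
img_rois = torch.tensor([[0.0, 50.0, 15.0, 120.0, 70.0]])
fused = fuse_views(bev_feat, img_feat, bev_rois, img_rois)   # shape (1, 32, 3, 3)
```

The fused crops are what the fully connected layers in steps 3 and 5 operate on.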

2.5 Detection box encoding

In AVOD, the algorithm uses two representations of the BEV data: a height map and a density map.

Height map: the BEV region is divided into a grid of a fixed cell size, e.g. M×N cells. In each cell, the vertical range of 0–2.5 m is divided into K slices along the height direction, so the BEV point cloud space is partitioned into M×N×K voxels. The maximum point height H_max is found in each voxel, and these M×N×K values of H_max together form the height map.
Density map: based on the same grid, the point density of each cell is computed from the number of points N in the cell as min(1.0, log(N+1)/log(16)); these density values together form the density map.
An innovative vector representation of the 3D bounding box:

The usual practice is: a 2D detection box is encoded either by its 4 corners, 4×2 (x, y) = 8 values, or by center point + size, 2 (x, y) + 1 (length) + 1 (width) = 4 values. For a 3D detection box, the 8 corners give 8×3 (x, y, z) = 24 values.

AVOD innovatively encodes the box as the four corners of the bottom face plus the heights of the lower and upper faces above the ground plane, i.e. 4×2 (x, y) + 2 heights = 10 values. See the figure below: the left side shows the 24-value corner representation used by MV3D, and the right side the 10-value representation used by AVOD.
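As a small illustration, the 10-value encoding can be expanded back to the usual eight 3D corners; the sketch below assumes the ground plane is simply z = 0 and uses a made-up function name.

```python
import numpy as np

def box_10_to_corners(corners_xy, h_bottom, h_top):
    """corners_xy: (4, 2) ground-plane corners of the box footprint;
    h_bottom / h_top: heights of the lower and upper faces above the ground."""
    corners_xy = np.asarray(corners_xy, dtype=np.float32)
    bottom = np.hstack([corners_xy, np.full((4, 1), h_bottom, dtype=np.float32)])
    top = np.hstack([corners_xy, np.full((4, 1), h_top, dtype=np.float32)])
    return np.vstack([bottom, top])   # (8, 3): the 24-value corner representation

# Example: a 4 m x 2 m footprint, extending from 0 m to 1.6 m above the ground.
corners = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
print(box_10_to_corners(corners, 0.0, 1.6).shape)   # (8, 3)
```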

2.6 Orientation Estimation

In MV3D, the orientation of an object is roughly inferred from the long side of its detected box, but this method cannot distinguish orientations that differ by 180°, and it is not feasible for pedestrian detection.

AVOD therefore addresses this problem by regressing the orientation as the pair (cos θ, sin θ), with θ constrained to (−π, π]. Orientations that differ by 180° then no longer collapse onto the same value; each has its own distinct representation.
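Recovering the angle from the regressed pair is then a single arctan2, which naturally yields a value in (−π, π]. A minimal sketch:

```python
import numpy as np

def decode_orientation(cos_t, sin_t):
    """Recover theta in (-pi, pi] from a regressed (cos, sin) pair.
    Normalising first guards against the pair drifting off the unit circle."""
    norm = np.hypot(cos_t, sin_t)
    return np.arctan2(sin_t / norm, cos_t / norm)

# Two orientations that differ by roughly 180 degrees map to distinct values.
print(decode_orientation(np.cos(3.0), np.sin(3.0)))       # ~ 3.00 rad
print(decode_orientation(np.cos(-0.14), np.sin(-0.14)))   # ~ -0.14 rad
```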
                

Origin: blog.csdn.net/scott198510/article/details/130950663