PlaneRCNN: 3D Plane Detection and Reconstruction from a Single Image


Motivation

Understand the network that detects planes in a picture, and understand the paper
PlaneRCNN: 3D Plane Detection and Reconstruction from a Single Image, as a basis for subsequent work.

Paper reading sequence

3) PlaneRCNN: 3D Plane Detection and Reconstruction from a Single Image

Target

The RGB image from a monocular camera -> the planes in the RGB image
[Figure: example of plane detection and reconstruction from a single RGB image]

Recent work extracts semantically planar regions from images, such as a refrigerator, a wall, etc.
From earlier papers we know that a neural network can extract depth information from a picture, and plane information can be extracted from a similar angle. Previous methods obtain planes by having the network directly regress the three parameters $(a, b, c)$ of the three-dimensional plane $ax + by + cz = d$. One drawback of this approach is that a maximum number of planes must be fixed in advance. The effect of this paper is that the number of planes is no longer limited. **It uses a variant of Mask R-CNN to perform instance detection directly on the picture to obtain plane regions.** On this basis the segmentation is refined, and normal prediction and depth prediction are performed for each plane. These two pieces of information are fused to obtain the three-dimensional planes.
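For reference, this is the standard plane parameterization behind that idea (textbook geometry, not something specific to this paper):

$$\Pi = \{\, \mathbf{p} \in \mathbb{R}^3 \;:\; \mathbf{n}^\top \mathbf{p} = d \,\}, \qquad \mathbf{n} = (a, b, c)^\top, \quad \|\mathbf{n}\| = 1$$

A network head that directly regresses such parameter triples has a fixed output size, so it can only ever emit a fixed maximum number of planes; detecting planes as instances removes that cap.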

Why can 3D plane information be inferred from RGB images?

The main change in this paper is the use of a region-detection framework. Region detection can already understand local areas of a scene, so the detected regions are used directly as plane regions, giving masks for multiple planes. The method is a variant of Mask R-CNN, using instance-level region detection, so it is not limited by a preset number of detections. Because region detection is generally coarse (essentially at the box level), a segmentation step refines the region boundaries. In addition, the network predicts the plane normals and the depth of the single input image, both of which are three-dimensional quantities; this is the refine step, in which they are fused with the originally detected plane regions to obtain the three-dimensional planes. The contribution lies in being the first to use region detection to detect an arbitrary number of planes.
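A minimal sketch of what this fusion could look like, assuming a detected mask, a predicted unit normal for that plane, and a per-pixel depth map are available. The function names and the median-based offset fit are my own illustration, not the paper's exact procedure:

```python
import numpy as np

def unproject(depth, K):
    """Back-project a depth map (H, W) to camera-space points (H, W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.stack([x, y, depth], axis=-1)

def fuse_plane(mask, normal, depth, K):
    """Fuse one instance mask, its predicted unit normal, and the depth map
    into plane parameters (n, d) with n . p = d for points p on the plane."""
    points = unproject(depth, K)[mask > 0.5]   # (N, 3) 3D points inside the mask
    d = float(np.median(points @ normal))      # robust estimate of the offset
    return normal, d
```

Unprojecting only the masked pixels and fitting the offset against the predicted normal yields the full plane equation $\mathbf{n} \cdot \mathbf{p} = d$ for that instance.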

The overall structure of the network
[Figure: overall PlaneRCNN architecture]
The main structure is the plane detection network: the bbox and mask heads use the region-detection algorithm, and the normal head uses a mature region-detection technique that directly regresses the normal, with the representation of the normal vector borrowing the anchor idea from Mask R-CNN. The anchors are represented as follows:
[Figure: anchor normals]
This is how the plane normal is computed. The approach is to select an anchor normal and then directly regress a residual 3D vector relative to it; that is the normal representation. There are 7 direction clusters: using a statistical approach, the normal vectors of the planes in 1,000 images are clustered, yielding 7 anchor normal vectors. Doing so improves accuracy. I don't yet understand why the accuracy improves; I need to read the paper Mask R-CNN, a very important region-detection paper, to find out.
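A small sketch of the anchor-normal idea as described above: cluster ground-truth plane normals offline into 7 anchors, then decode a predicted anchor class plus a regressed residual into a unit normal. The function names and shapes are my assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_anchor_normals(gt_normals, k=7):
    """Cluster ground-truth plane normals (N, 3) into k anchor normals."""
    km = KMeans(n_clusters=k, n_init=10).fit(gt_normals)
    anchors = km.cluster_centers_
    return anchors / np.linalg.norm(anchors, axis=1, keepdims=True)

def decode_normal(anchors, anchor_class, residual):
    """Anchor classification + regressed 3D residual -> unit plane normal."""
    n = anchors[anchor_class] + residual
    return n / np.linalg.norm(n)
```

Classifying among a few representative directions and regressing only a small residual is presumably an easier learning problem than regressing a free 3D vector, which would explain the accuracy gain.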

The variant mentioned is an improvement to the network called the segmentation refinement network: the mask extracted for each plane is not simply convolved and then concatenated. Instead, every plane mask is convolved into a feature, and each plane's feature is concatenated with the mean of all the other planes' features (excluding itself). Each plane thus carries the characteristics of the other plane masks, which aggregates non-local features and works better. The structure can be seen below.
[Figure: segmentation refinement network]
The mask of each plane is convolved.
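A toy PyTorch sketch of this "mean of the others" aggregation; the module name and channel layout are my own, and the paper's actual refinement network has more machinery around it:

```python
import torch
import torch.nn as nn

class NonLocalMaskAggregation(nn.Module):
    """For each plane's mask feature, concatenate the mean of all OTHER
    planes' features, so every mask sees non-local context."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feats):                       # feats: (num_planes, C, H, W)
        feats = self.conv(feats)
        n = feats.shape[0]
        total = feats.sum(dim=0, keepdim=True)      # (1, C, H, W)
        others = (total - feats) / max(n - 1, 1)    # mean excluding self
        return torch.cat([feats, others], dim=1)    # (num_planes, 2C, H, W)
```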

The last module is actually quite simple: the warping-loss module, which is supervised by other viewpoints. A neighboring view is warped into the current view, and the two are then directly subtracted.

Design of the loss function

First, there is the direct subtraction against the ground truth, as shown below:
[Equation: supervised loss against the ground truth]

There is also a subtraction against the adjacent view, as follows:
[Equation: warping loss against the adjacent view]
where $D^n$ denotes the depth map of the adjacent view, and $(u^w, v^w)$ are the result of warping, meaning that $(u^w, v^w)$ and $(u^n, v^n)$ correspond to the same 3D point. In this way the depth corresponding to the current view, $D^w$, is obtained, and the two depths are subtracted.
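A hand-written sketch of a depth-consistency warping loss in this spirit. The tensor layout, the pose convention `T_cur_to_n` (current camera to neighbor camera), and the use of `grid_sample` are my assumptions; the paper's exact formulation may differ:

```python
import torch
import torch.nn.functional as F

def warping_loss(depth, depth_n, K, T_cur_to_n):
    """Warp the current view's predicted depth into the neighboring view and
    penalize disagreement with the neighbor's depth map.
    depth, depth_n: (1, 1, H, W) depth maps; K: (3, 3) intrinsics;
    T_cur_to_n: (4, 4) relative pose from current to neighbor camera."""
    _, _, h, w = depth.shape
    v, u = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                          torch.arange(w, dtype=torch.float32), indexing="ij")
    # Back-project every pixel of the current view to a 3D point.
    z = depth[0, 0]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    p = torch.stack([x, y, z, torch.ones_like(z)]).reshape(4, -1)
    # Transform the points into the neighbor's camera and project them.
    q = (T_cur_to_n @ p)[:3]
    u_w = K[0, 0] * q[0] / q[2] + K[0, 2]
    v_w = K[1, 1] * q[1] / q[2] + K[1, 2]
    # Bilinearly sample the neighbor's depth at the warped coordinates.
    grid = torch.stack([2 * u_w / (w - 1) - 1,
                        2 * v_w / (h - 1) - 1], dim=-1).reshape(1, h, w, 2)
    d_w = F.grid_sample(depth_n, grid, align_corners=True)
    # The transformed point's z should match the depth the neighbor sees there.
    return (d_w - q[2].reshape(1, 1, h, w)).abs().mean()
```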

Impressions
This paper is mostly an integration of multiple modules, with the main improvements in how the modules are combined. Several deficiencies of previous papers are fixed:
1) The number of planes that can be represented is no longer limited.
2) It can detect small planes (in its refine part, information from multiple masks is fused).
3) It uses pictures from neighboring views.
Defects:
1) It integrates various networks into a system, and the improvements to each individual network are relatively simple. The neighboring views are only used as a supervision loss.

Each plane is detected through region detection, without adding corner-point detection or line-segment detection as constraints. Perhaps this kind of fusion is not friendly enough to curved surfaces.

The fusion of features does not use a CRF-style formulation.
The use of pictures from other viewpoints is relatively simple, fused only through a simple loss term, and the fusion of the individual features is also too simple.


Origin blog.csdn.net/weixin_43851636/article/details/112546145