Tong Xin: Understanding and Modeling of Indoor 3D Scenes

Outdoor scenes (unmanned driving): Prof. Zhang Guofeng's group at Zhejiang University has research on using RGB-D cameras for outdoor scenes.
Indoor scenes: sweeping robots and home robots that need to model indoor scenes, e.g. for remodeling or decoration.

Three classes of methods:
1. Geometry-based methods: recover the geometric and color information of the scene, without involving semantic information.
2. Primitive-based methods: man-made objects follow regular structures; use this structural information to help model or analyze the scene.
3. Semantic-based methods: extract semantic information from the acquired 3D data, and in turn apply that semantic information to the reconstruction of or reasoning about 3D scenes.

Related work for Method 1:
(1) Algorithms:
- KinectFusion (a TSDF fusion sketch follows this list)

  • Improvements on KinectFusion:
    • hierarchical data structure and streaming [Chen 2013]
    • octree [Steinbrücker 2013]
    • voxel hashing [Nießner 2013]
  • Currently the best algorithm:
    • BundleFusion
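
The per-frame operation shared by KinectFusion and its variants is integrating a depth image into a truncated signed distance (TSDF) volume. Below is a minimal numpy sketch of that fusion step; the function name, the simple pinhole model, and the flat voxel arrays are illustrative assumptions, not any of the cited implementations:

```python
import numpy as np

def fuse_depth_into_tsdf(tsdf, weight, voxel_centers, depth, K, cam_pose, trunc=0.05):
    """Integrate one depth frame into a TSDF volume (KinectFusion-style update).

    tsdf, weight  : flat arrays, one value per voxel
    voxel_centers : (N, 3) voxel centers in world coordinates
    depth         : (H, W) depth image in meters
    K             : (3, 3) pinhole intrinsics
    cam_pose      : (4, 4) camera-to-world transform
    """
    H, W = depth.shape
    # Transform voxel centers into the camera frame.
    world_to_cam = np.linalg.inv(cam_pose)
    pts_cam = (world_to_cam[:3, :3] @ voxel_centers.T + world_to_cam[:3, 3:4]).T
    z = pts_cam[:, 2]
    safe_z = np.where(z > 1e-6, z, 1e-6)
    # Project into the image plane.
    u = np.round(K[0, 0] * pts_cam[:, 0] / safe_z + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * pts_cam[:, 1] / safe_z + K[1, 2]).astype(int)
    valid = (z > 1e-6) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    d = np.zeros_like(z)
    d[valid] = depth[v[valid], u[valid]]
    valid &= d > 0
    # Signed distance along the ray, truncated and normalized to [-1, 1].
    sdf = np.clip(d - z, -trunc, trunc) / trunc
    # Only update voxels in front of, or just behind, the observed surface.
    upd = valid & (d - z >= -trunc)
    # Running weighted average of TSDF values.
    tsdf[upd] = (tsdf[upd] * weight[upd] + sdf[upd]) / (weight[upd] + 1.0)
    weight[upd] += 1.0
    return tsdf, weight
```

Voxel hashing and octrees change how `tsdf` and `weight` are stored (only near-surface blocks are allocated), not this update rule.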

(2) Color texture quality. If you paste the texture onto the geometry and find that the color quality is poor, two fixes exist:
- Color map optimization [Zhou 2014] (see the sketch after this block)
  Optimize the pose of the RGB camera (the RGB camera cannot be fully synchronized with the depth camera while shooting, so its pose needs to be re-optimized). Because the camera model is not perfect, each image also needs to be deformed.
- Patch-based optimization [Bi 2017]
  In many cases some geometric information is completely lost during scanning, so the texture cannot be pasted back later. Instead of using the captured images directly, a new image is synthesized from the original images via patch synthesis.
Disadvantage: both are post-processing steps and cannot run in real time.
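
As a rough illustration of what color map optimization minimizes, here is a numpy sketch of a photo-consistency cost over camera poses; the function names and the grayscale, nearest-pixel sampling are simplifying assumptions, and the per-image deformation from [Zhou 2014] is omitted:

```python
import numpy as np

def project(K, world_to_cam, pts):
    """Project world points into an image; returns pixel coordinates and camera-space depth."""
    p = (world_to_cam[:3, :3] @ pts.T + world_to_cam[:3, 3:4]).T
    z = p[:, 2]
    uv = np.stack([K[0, 0] * p[:, 0] / np.maximum(z, 1e-6) + K[0, 2],
                   K[1, 1] * p[:, 1] / np.maximum(z, 1e-6) + K[1, 2]], axis=1)
    return uv, z

def photo_consistency_cost(vertices, target_colors, images, intrinsics, poses):
    """Sum of squared differences between each vertex's target color and the
    grayscale value sampled in every image that sees it (nearest-pixel lookup)."""
    cost = 0.0
    for img, K, pose in zip(images, intrinsics, poses):
        uv, z = project(K, pose, vertices)
        uv = np.round(uv).astype(int)
        H, W = img.shape
        vis = (z > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
        sampled = img[uv[vis, 1], uv[vis, 0]]
        cost += np.sum((sampled - target_colors[vis]) ** 2)
    return cost
```

A nonlinear least-squares solver would then refine the poses (and, in the paper, per-image deformation grids) to drive this cost down, which is why the method is an offline post-process rather than real time.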

Related work for Method 2:
Use the relationships between planes to reconstruct the scene, and even use the planes to infer the occluded parts.
Class 1: heuristic-based approaches. Extract planes with a Hough transform or similar processing, then filter them with hand-tuned thresholds (a RANSAC sketch follows the two references below).
- Online structure analysis for real-time indoor scene reconstruction [Zhang 2015] (real time)
- Towards Commodity 3D Scanning for Content Creation [Huang 2017] (non-real-time)
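
As one concrete example of such a heuristic, here is a small RANSAC plane fit (a stand-in for the Hough-transform or region-growing steps the cited systems actually use; all names and thresholds are illustrative):

```python
import numpy as np

def ransac_plane(points, n_iters=200, inlier_thresh=0.02, seed=None):
    """Fit a single dominant plane to a point cloud with RANSAC.

    Returns (normal, d, inlier_mask) for the plane n.x + d = 0."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    best_plane = (np.array([0.0, 0.0, 1.0]), 0.0)
    for _ in range(n_iters):
        # Sample 3 points and compute the plane through them.
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-8:
            continue                      # degenerate (collinear) sample
        n = n / norm
        d = -np.dot(n, p0)
        # Count points within the distance threshold.
        inliers = np.abs(points @ n + d) < inlier_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers, best_plane = inliers, (n, d)
    return best_plane[0], best_plane[1], best_inliers
```

Extracting the dominant plane, removing its inliers, and repeating yields the piecewise-planar decomposition these methods build on; the hand-tuned thresholds (inlier distance, iteration count, minimum inlier number) are exactly what makes the heuristic approaches brittle.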
CNN-based methods to extract planes:
- PlaneNet (CVPR 2018) directly extracts plane information from a single RGB image and segments the scene into piecewise-planar regions. The upper branch estimates the parameters of all planes in the scene; the lower branch produces the segmentation of each region and tells which estimated plane each region belongs to.
- PlaneMatch learns an encoder (descriptor) for plane regions from given RGB-D frames. Suppose a segmentation is done first, yielding piecewise-planar regions, and each region is mapped to a descriptor. If two regions belong to the same plane, their descriptors should be close, so a descriptor is trained with this objective. Later, the trained descriptor can be used to quickly match plane regions across frames and prune false correspondences, improving efficiency (a descriptor-matching sketch follows below).
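
To make the descriptor idea concrete, here is a toy sketch of the "same plane, close descriptors" training signal and the matching it enables; the real PlaneMatch network and loss differ, so everything below is an illustrative assumption:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that pulls descriptors of the same plane together and pushes
    descriptors of different planes apart by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def match_planes(descriptors_a, descriptors_b, max_dist=0.5):
    """Greedy nearest-neighbour matching between two sets of plane-region descriptors."""
    matches = []
    for i, da in enumerate(descriptors_a):
        dists = np.linalg.norm(descriptors_b - da, axis=1)
        j = int(np.argmin(dists))
        if dists[j] < max_dist:
            matches.append((i, j))
    return matches
```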

Related work for Method 3:
Classify all objects in the 3D scene and give each object a semantic label.
Subdivisions:
- object detection (given a point cloud or a single depth image, draw boxes around the objects)
- scene segmentation (divide the scene into individual objects)
- reasoning/completion (only part of the scene is observed; the goal is to complete it, which can be regarded as part of reconstruction)

  1. One type of work: single-view based segmentation/detection (segmentation/detection from a single RGB-D image)
    1.1 Image-based approaches, depth as a 2D image channel (treat depth as a separate channel and use traditional 2D convolutions or traditional hand-crafted features)
    - Manually crafted features [Silberman 2012, Gupta 2013]
    - 2D CNN based methods [Gupta 2015, Deng 2017]
    Example: Amodal detection [Deng 2017]
    Purpose: object detection.
    Implementation: RGB goes through one CNN; depth is treated as a separate channel and goes through another CNN. The two streams are merged at the end, depth provides an initial box, and refinement yields the final object position.
    1.2 Volumetric based approaches, depth as TSDF (convert depth into a volumetric representation, place it in a 3D scene, and then process it with 3D methods, such as 3D convolutions or a 3D CRF)
    - Manually crafted features [Ren 2016]
    - 3D CNN based method [Song2016, Graham2018]
    Example: Submanifold CNN for 3D segmentation [Graham 2018]
    Purpose: 3D segmentation.
    Implementation: first turn the scene into a volumetric representation; because objects only occupy surfaces in local regions, the 3D CNN is evaluated only on these surface voxels (see the sketch below). The idea is very close to the octree approaches ([Steinbrücker 2013], O-CNN).
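
A naive dictionary-based sketch of the submanifold convolution idea: features live only on occupied surface voxels, and outputs are computed only at those same sites, so cost scales with the surface rather than the full volume (the real [Graham 2018] implementation uses efficient hash tables and GPU kernels, so the names below are assumptions):

```python
import numpy as np

def submanifold_conv3d(coords, feats, kernel):
    """Sparse 3x3x3 convolution evaluated only at the occupied (surface) voxels.

    coords : (N, 3) integer voxel coordinates of occupied sites
    feats  : (N, C_in) features at those sites
    kernel : (3, 3, 3, C_in, C_out) weights
    Outputs are produced only at the same N sites, so the set of active voxels
    never grows (the 'submanifold' property)."""
    coords = np.asarray(coords, dtype=int).tolist()
    feats = np.asarray(feats, dtype=float)
    index = {tuple(c): i for i, c in enumerate(coords)}      # voxel -> row lookup
    out = np.zeros((len(coords), kernel.shape[-1]))
    offsets = [(dx, dy, dz) for dx in (-1, 0, 1) for dy in (-1, 0, 1) for dz in (-1, 0, 1)]
    for i, (x, y, z) in enumerate(coords):
        for dx, dy, dz in offsets:
            j = index.get((x + dx, y + dy, z + dz))
            if j is not None:                                # only occupied neighbours contribute
                out[i] += feats[j] @ kernel[dx + 1, dy + 1, dz + 1]
    return out
```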

  2. Another type of work: single-view based 3D scene completion (mostly done in the vision community; not many people in graphics have worked on it).
    Given a single RGB-D view, objects and the scene will be partially occluded; the goal is to recover all of the occluded parts.
    - Heuristic solution [Zheng 2013]: infer the occluded parts from scene understanding, geometry, and physical constraints
    - Random forest [Firman 2016]
    - 3D CNN [Song 2017, Guo 2018]. Song 2017: given a 3D scene, turn it into a volumetric representation; given a depth image, guess all the invisible voxels. To do so you must first determine where there are objects and where there are none, and you must know what each object is. Two conclusions of this work:
    (1) Semantic information is very helpful for guessing where objects are.
    (2) To guess what is behind, a lot of contextual information about the scene is needed; if the receptive field is not large enough, it cannot be guessed from local information alone.
    - Work from Tong Xin's own group: the earlier pipeline converts the given depth image into a TSDF and then runs 3D volumetric convolutions to get the final result, which leads to a large amount of computation and is very slow.
    Improvement: first run 2D convolutions on the depth image to extract features, project these features into the 3D volume, and then perform scan completion (see the sketch below).
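
A rough numpy sketch of the "run 2D convolutions first, then project the features into the volume" idea described above; the function name, the camera/volume alignment, and the per-voxel averaging are assumptions, not the group's actual pipeline:

```python
import numpy as np

def project_features_to_volume(feat2d, depth, K, vol_origin, voxel_size, vol_dims):
    """Scatter per-pixel 2D features into a 3D feature volume by back-projecting
    each pixel with its depth (camera frame assumed to coincide with the volume frame).

    feat2d : (H, W, C) features from a 2D CNN run on the depth/RGB image
    depth  : (H, W) depth in meters
    Returns a (X, Y, Z, C) volume with features averaged per voxel."""
    H, W, C = feat2d.shape
    vol = np.zeros((*vol_dims, C))
    count = np.zeros(vol_dims)
    v, u = np.nonzero(depth > 0)
    z = depth[v, u]
    # Back-project pixels to 3D points in the camera frame.
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    idx = np.floor((np.stack([x, y, z], 1) - vol_origin) / voxel_size).astype(int)
    ok = np.all((idx >= 0) & (idx < np.array(vol_dims)), axis=1)
    # Accumulate features per voxel, then average.
    for (i, j, k), f in zip(idx[ok], feat2d[v[ok], u[ok]]):
        vol[i, j, k] += f
        count[i, j, k] += 1
    vol[count > 0] /= count[count > 0][:, None]
    return vol
```

The expensive 2D CNN runs once per image at image resolution, and only the cheap scatter touches the volume, which is where the speed-up over full 3D volumetric convolutions comes from.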

  3. Another type of work: 3D scene based segmentation/completion
    Traditional method: assume a model database; given a scan of a 3D scene, try to fit each object, then retrieve a model from the database and replace the scanned object with it.
    - Model retrieval and replacement [Kim 2012, Nan 2012, Shao 2012, Chen 2014]
    Recent deep learning methods:
    - Point based approaches [Qi 2016, Qi 2017]: based on point representations
    - Volumetric based approaches [Dai 2017, Dai 2018a]: based on volumetric representations
    - ScanComplete [Dai 2018]: multi-resolution and sliding window. The scene is first built at three resolutions; at each resolution a 3D CNN is applied with a sliding window to do local segmentation and completion, the output is passed to the next level together with that level's input, and the features are refined until the final result is obtained (see the sketch below).
    - Multiple view based approach [Dai 2018b]
    First run 2D CNNs, then project the features into the volume and combine them with a 3D CNN to perform scene segmentation.
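
The multi-resolution sliding-window scheme of ScanComplete can be sketched as follows; `predict_fn` is a placeholder for the per-level 3D CNN, and the two-channel input, non-overlapping windows, and nearest-neighbor upsampling are simplifying assumptions:

```python
import numpy as np

def coarse_to_fine_complete(volume, predict_fn, window=32, levels=3):
    """ScanComplete-style sketch: a per-window predictor runs at several resolutions,
    coarsest first, and each level's (upsampled) output is fed to the next level as a
    second input channel.  Assumes volume dimensions divisible by window * 2**(levels-1);
    `predict_fn` must map a (w, w, w, 2) crop to a (w, w, w) prediction."""
    prev = None
    for level in reversed(range(levels)):              # coarsest level first
        scale = 2 ** level
        vol = volume[::scale, ::scale, ::scale]
        extra = prev if prev is not None else np.zeros_like(vol)
        inp = np.stack([vol, extra], axis=-1)
        out = np.zeros_like(vol)
        for x in range(0, vol.shape[0], window):       # sliding (non-overlapping) windows
            for y in range(0, vol.shape[1], window):
                for z in range(0, vol.shape[2], window):
                    crop = inp[x:x + window, y:y + window, z:z + window]
                    out[x:x + window, y:y + window, z:z + window] = predict_fn(crop)
        # Upsample this level's prediction to the next (finer) resolution.
        prev = out.repeat(2, 0).repeat(2, 1).repeat(2, 2) if level > 0 else out
    return prev
```

Working per window keeps memory bounded even for large scenes, while feeding the coarse prediction downward gives each finer level the large receptive field that completion needs.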

Challenges:
Automatic high-quality 3D scene data acquisition and segmentation
- high-quality 3D geometry and color textures
- complete (without holes caused by occlusions)
- accurate labels and object segmentation
Scalable, real-time 3D scene understanding
- object detection, segmentation, and prediction
Efficient scene representation for analysis/understanding: what representation should a 3D scene use? 2D makes it easy to incorporate color and other information, while 3D does not; each has its own advantages and disadvantages.
- a 2D view can incorporate color but cannot handle occlusion
- a 3D volume has difficulty using color but can handle the full scene

Trends:
Fusing images and 3D information: images carry a lot of information, so how do we combine scenes and images? Annotating images is much easier than annotating 3D scenes.
Fusing scene reconstruction and understanding: if semantic information can be extracted well enough, how can it be used to help scene reconstruction? Use semantic information for 3D scene reconstruction/prediction.
More information about the scene: currently only geometry and color are captured; reflectance, physics, and dynamics are not captured (dynamics, reflection, physics, lighting conditions).
More CNNs and deep learning
Future Directions:
From static to dynamic
functions, dynamics, ...
From reconstruction/understanding to generation (generation has not been done yet and is very important)
scene layout and details
From single task to multi-task fusion
planning/navigation + reconstruction + understanding
the work of Kevin Xu and Ligang Liu
More surveys:
Kang Chen. 3D indoor scene modeling from RGB-D data: a survey. Computational Visual Media.
Muzammal Naseer. Indoor scene understanding in 2.5/3D: a survey.
Some public datasets:
(figure: list of public datasets)

Q&A:
Q: Geometry reconstruction can be done in real time, but the real-time performance of semantic labeling is much worse. Feeding in a completed scene directly cannot run in real time, but if only a part of it is updated each time, is real time possible?
A: But if a mistake is made, when and how should it be corrected? Will there be a chance later to fix what was labeled wrongly earlier, and will what is correct now stay correct?
Q: For images an RNN can be used, because no fusion is needed; but in 3D, a part of the point cloud is fed in incrementally each time, and local or global optimization may change earlier data, which may force re-processing.
A: The work of Zhou Kun and BundleFusion addresses exactly such problems: how to update the results so that the changes are correct, and how to propagate the corrected results back.


Origin blog.csdn.net/weixin_44934373/article/details/127981011