Monocular Indoor 3D Scene Reconstruction

Total3DUnderstanding: Joint Layout, Object Pose and Mesh Reconstruction for Indoor Scenes from a Single Image

  1. Project homepage: https://yinyunie.github.io/Total3D/
  2. Paper link: https://arxiv.org/pdf/2002.12212.pdf
  3. Open source code: https://github.com/yinyunie/Total3DUnderstanding
  4. Year and institutions: 2020; Bournemouth University, The Chinese University of Hong Kong (Shenzhen), Shenzhen Research Institute of Big Data, Xiamen University

Abstract

insert image description here

Core points of the abstract:

  1. Definition of indoor semantic reconstruction: Simultaneous scene understanding and object reconstruction.
  2. Contribution of this paper: based on a single image, an end-to-end joint optimization of indoor layout reconstruction, 3D object box estimation, and object mesh reconstruction is proposed.
  3. Datasets: SUN RGB-D, Pix3D.

Paper contributions

  1. For the first time, joint optimization of indoor layout estimation, 3D object box detection, and object mesh reconstruction from a single image of a complex indoor scene is proposed. Experiments show that the multi-task optimization strategy lets the sub-tasks complement each other, so that each task reaches state-of-the-art performance.
  2. For the object mesh reconstruction task, a new density-based mesh topology modifier is proposed, which progressively prunes mesh edges according to the local density of the ground-truth point set. This handles the reconstruction of indoor objects against complex backgrounds well.
  3. An attention mechanism and the interrelationships between objects are fully exploited. In indoor 3D object detection, the pose of an object has latent, multifaceted relations with its surrounding objects. The proposed strategy extracts these latent features, which helps determine the position and pose of each object and improves 3D detection.

Overall framework

insert image description here

The entire network structure consists of the following parts:

  1. 2D object detection: the input is a single scene image, the network is Faster R-CNN, and the output is a set of 2D detection boxes.
  2. Layout Estimation Network (LEN): the input is the full scene image; the outputs are the camera pose and the bounding-box parameters of the scene layout.
  3. 3D Object Detection Network (ODN): the inputs are (1) the image patches inside the 2D detection boxes and (2) geometry features computed from the 2D boxes. The output is an estimate of the 3D bounding-box parameters of each object.
  4. Mesh Generation Network (MGN): the inputs are (1) the image patch of a 2D detection box, (2) the one-hot encoding of the detected object category, and (3) a template sphere. The output is the object mesh.
  5. Fusion of results: the outputs of all modules are embedded together and optimized jointly to obtain the reconstruction of the whole scene. The mesh reconstructed by MGN is scaled into the bounding box estimated by ODN, and then transformed into the world coordinate system using the camera pose estimated by LEN (a small sketch of this step follows below).
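
To make the fusion step concrete, here is a minimal NumPy sketch of how a reconstructed mesh is placed into the scene. It is an illustration under my own assumptions (canonical MGN vertices normalized to [-1, 1]^3, heading defined about the vertical axis), not the authors' code:

```python
import numpy as np

def rot_y(theta):
    """Rotation about the vertical (y) axis by the heading angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[  c, 0.0,   s],
                     [0.0, 1.0, 0.0],
                     [ -s, 0.0,   c]])

def place_mesh_in_scene(verts, center, size, heading):
    """Fit a canonical MGN mesh into the world-space box predicted by ODN/LEN.

    verts   : (V, 3) mesh vertices, assumed normalized to [-1, 1]^3
    center  : (3,)   world-space box center C (obtained via the camera pose, see Eq. (1) below)
    size    : (3,)   box size s
    heading : float  heading angle theta about the vertical axis
    """
    scaled = verts * (np.asarray(size) / 2.0)               # scale the unit mesh to the box extents
    return scaled @ rot_y(heading).T + np.asarray(center)   # orient by the heading, translate to C
```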

Core modules

  1. Camera and World Coordinate System Setting
    insert image description here
    As shown in the figure above, the world coordinate system shares its origin with the camera coordinate system, and its y-axis is perpendicular to the ground. The x-axis of the world system is aligned with the forward direction of the camera (by rotating about the y-axis), so the camera's yaw angle can be removed. The pose of the camera relative to the world coordinate system is then defined by the pitch $\beta$ and roll $\gamma$:
    $$R(\beta,\gamma)=\begin{bmatrix} \cos\beta & -\sin\beta\cos\gamma & \sin\beta\sin\gamma \\ \sin\beta & \cos\beta\cos\gamma & -\cos\beta\sin\gamma \\ 0 & \sin\gamma & \cos\gamma \end{bmatrix}$$

  2. Parametric expression of the labels
    In the world coordinate system, a 3D box is defined by its 3D center $C \in \mathbb{R}^3$, size $s \in \mathbb{R}^3$, and heading angle $\theta \in [-\pi,\pi]$. For indoor objects, the center $C$ is represented by its 2D projection $c \in \mathbb{R}^2$ on the image plane together with its distance $d \in \mathbb{R}$ to the camera center. Given the camera intrinsic matrix $K \in \mathbb{R}^{3\times 3}$, $C$ can be recovered as
    $$C = R^{-1}(\beta,\gamma) \cdot d \cdot \frac{K^{-1}[c,1]^T}{\lVert K^{-1}[c,1]^T \rVert_2} \tag{1}$$
    The 2D projection center $c$ is further written as $c = c^b + \delta$, where $c^b$ is the center of the 2D bounding box and $\delta \in \mathbb{R}^2$ is an offset learned by the network. The mapping from a 2D detection $I$ to a 3D bounding box is therefore
    $$F(I \mid \delta, d, \beta, \gamma, s, \theta) \in \mathbb{R}^{3\times 8}$$
    As mentioned in the overview, the box parameters estimated by the ODN network are $(\delta, d, s, \theta)$, while the LEN network estimates the camera pose and the layout bounding box, i.e. $R(\beta,\gamma)$ and $(C, s^l, \theta^l)$ (a NumPy sketch of this back-projection appears at the end of this section).
    The diagram of the parametric expression is shown below:
    insert image description here

  3. Object Detection Network (ODN)
    insert image description here
    The figure above shows the entire structure of the ODN network. The basic process is as follows:

    Input: 2D detection results and geometry features
    (1) Use ResNet-34 to extract the appearance feature of the image inside each detection box.
    (2) Encode the detection boxes and their relative positions together as geometry features, which serve as one of the inputs.
    (3) Use an object relation module to compute the relational feature between each detected object and all other objects.
    (4) The relational features are weighted and summed according to the similarity of appearance and geometry features between objects (Attention Sum).
    (5) Add the relational feature to the object's own feature element-wise.
    (6) Use a two-layer MLP to regress the parameters of each box, $(\delta, d, s, \theta)$.
    For indoor reconstruction, the object relation module has a clear physical meaning: objects that are close to each other, or that have similar appearance, tend to have stronger relations.

    The specific structure is shown in the table below:
    insert image description here

  4. Layout Estimation Network (LEN)
    The input of the LEN network is the single scene image, and the outputs are the camera pose $R(\beta,\gamma)$ and the layout 3D box $(C, s^l, \theta^l)$. The network structure is the same as ODN, but without the relational feature. The specific network structure is as follows:
    insert image description here

  5. Mesh Generation Network for Indoor Objects (MGN)
    insert image description here
    The figure above shows the overall structure of the MGN network. The specific processing logic is as follows:

    (1) Network inputs: (a) the image inside the 2D detection box; (b) the one-hot category code of the detected object; (c) a template sphere, a unit sphere with 2562 vertices.
    (2) Use ResNet-18 to extract the appearance feature of the image.
    (3) The one-hot encoding of the object category provides a shape prior, which helps the network fit the 3D shape of the object faster.
    (4) Concatenate the category code with the appearance feature and feed them, together with the template sphere, into the decoder network AtlasNet to obtain a coarse shape of the object.
    (5) Edge Classifier: same structure as the shape decoder (AtlasNet), with the last layer replaced by a fully connected layer. The input is the image feature together with the deformed mesh, and the output $f(\cdot)$ is used to remove redundant mesh edges.
    (6) Boundary Refinement: refine the boundary of the mesh and output the final mesh.

    The specific network structure is as follows:
    insert image description here
    Density vs. Distance: during mesh reconstruction, the topology of the mesh (which edges are kept) has to be modified. The previous method, TMN, modifies the mesh topology using a fixed distance threshold. The authors argue that topology modification should instead depend on local geometric properties, and therefore propose an adaptive modification based on the local density of the ground truth. Let $p_i \in \mathbb{R}^3$ be a point on the reconstructed mesh and $q_i \in \mathbb{R}^3$ its neighbor point on the ground truth (see the MGN network structure diagram). A binary classifier $f(\cdot)$ then predicts whether $p_i$ should be kept:
    $$f(p_i)=\begin{cases} \text{False} & \lVert p_i-q_i \rVert_2 > D(q_i) \\ \text{True} & \text{otherwise} \end{cases} \tag{2}$$
    where $D(q_i)=\max_{q_n \in N(q_i)} \min_{q_m \in N(q_i),\, m \neq n} \lVert q_m-q_n \rVert_2$, $N(q_i)$ denotes the set of neighborhood points of $q_i$ on the ground truth, and $D(q_i)$ measures the local density.
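
The keep/prune rule of Eq. (2) can be prototyped in a few lines. The following NumPy sketch uses a k-nearest-neighbor set as $N(q_i)$, which is my own assumption; the paper's exact neighborhood definition and implementation may differ:

```python
import numpy as np

def edge_keep_labels(pred_pts, gt_pts, k=8):
    """Binary labels in the spirit of Eq. (2): True if a predicted point is kept,
    False if the edges around it should be pruned.
    pred_pts : (P, 3) points on the reconstructed mesh (the p_i)
    gt_pts   : (G, 3) points on the ground-truth surface (candidates for q_i)
    k        : size of the neighborhood N(q_i); chosen here only for illustration
    """
    labels = np.zeros(len(pred_pts), dtype=bool)
    for i, p in enumerate(pred_pts):
        d2gt = np.linalg.norm(gt_pts - p, axis=1)
        qi = int(np.argmin(d2gt))                        # nearest ground-truth point q_i
        # k nearest neighbors of q_i on the ground truth (excluding q_i itself)
        order = np.argsort(np.linalg.norm(gt_pts - gt_pts[qi], axis=1))
        nb = gt_pts[order[1:k + 1]]
        # local density D(q_i): largest nearest-neighbor spacing within N(q_i)
        pair = np.linalg.norm(nb[:, None, :] - nb[None, :, :], axis=-1)
        np.fill_diagonal(pair, np.inf)
        D_qi = pair.min(axis=1).max()
        labels[i] = d2gt[qi] <= D_qi                     # Eq. (2)
    return labels
```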

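Likewise, the camera rotation of item 1 and the back-projection of Eq. (1) in item 2 can be written as a short NumPy sketch. It mirrors the formulas as reconstructed above and is not the authors' implementation:

```python
import numpy as np

def camera_rotation(beta, gamma):
    """R(beta, gamma) as reconstructed above: pitch beta, roll gamma."""
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    return np.array([[cb, -sb * cg,  sb * sg],
                     [sb,  cb * cg, -cb * sg],
                     [0.0,      sg,       cg]])

def box_center_from_projection(c2d, delta, d, K, beta, gamma):
    """Eq. (1): recover the world-space box center C from the 2D box center c^b,
    the learned offset delta, the distance d and the camera intrinsics K."""
    c = np.asarray(c2d, dtype=float) + np.asarray(delta, dtype=float)   # c = c^b + delta
    ray = np.linalg.inv(K) @ np.array([c[0], c[1], 1.0])
    ray /= np.linalg.norm(ray)                                          # unit ray through the pixel
    return np.linalg.inv(camera_rotation(beta, gamma)) @ (d * ray)      # back to world coordinates
```
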
Loss functions

  1. Individual losses
    ODN and LEN: a classification-plus-regression loss $L^{cls,reg}=L^{cls}+\lambda_r L^{reg}$ is used for the box parameters estimated by ODN and LEN, $(\theta, \theta^l, \beta, \gamma, d, s, s^l)$; $C$ and $\delta$ use an L2 loss.
    MGN: the Chamfer loss $L_c$, the edge loss $L_e$, the boundary loss $L_b$, and the cross-entropy loss $L_{ce}$ of the edge classifier.
  2. Joint loss
    $$L=\sum_{x\in\{\delta,d,s,\theta\}} \lambda_x L_x + \sum_{y\in\{\beta,\gamma,C,s^l,\theta^l\}} \lambda_y L_y + \sum_{z\in\{c,e,b,ce\}} \lambda_z L_z + \lambda_{co} L_{co} + \lambda_g L_g \tag{4}$$
    Here the first three sums are the individual losses of ODN, LEN, and MGN. The last two terms are joint terms: $L_{co}$ enforces consistency between the world-coordinate boxes predicted by LEN and ODN and the ground truth, while $L_g$ constrains the reconstructed meshes to align with the scene point cloud. Since real indoor scans are usually sparse and partially occluded, $L_g$ is not a full Chamfer distance but a one-sided distance from the partial scan to the predicted mesh:
    $$L_g=\frac{1}{N}\sum_{i=1}^N\frac{1}{|S_i|}\sum_{q\in S_i}\min_{p\in M_i}\lVert p-q \rVert^2_2 \tag{3}$$
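
The one-sided alignment term $L_g$ of Eq. (3) is easy to prototype. This is a NumPy sketch operating on plain point arrays; the actual training code works on differentiable tensors inside the network:

```python
import numpy as np

def global_alignment_loss(scene_pts, mesh_pts):
    """Eq. (3): for each object i, average the squared distance from its scan
    points S_i to the nearest point on the predicted mesh M_i, then average over objects.
    scene_pts : list of N arrays, scene_pts[i] has shape (|S_i|, 3)
    mesh_pts  : list of N arrays, mesh_pts[i] holds points sampled on mesh M_i
    """
    total = 0.0
    for S, M in zip(scene_pts, mesh_pts):
        d2 = ((S[:, None, :] - M[None, :, :]) ** 2).sum(axis=-1)   # |S_i| x |M_i| squared distances
        total += d2.min(axis=1).mean()                             # min over mesh, mean over scan
    return total / len(scene_pts)
```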

Experiment settings

  1. Datasets
    SUN RGB-D: 10,335 indoor images, annotated with 3D layouts, 2D/3D bounding boxes, and sparse point clouds.
    Pix3D: 395 furniture models in 9 categories, aligned with 10,069 images; used to train mesh reconstruction.
  2. Evaluation metrics
    Layout: IoU.
    Camera pose: mean absolute error (MAE).
    Object detection: average precision (AP) over all classes.
    Single-object mesh generation: Chamfer distance, which measures the similarity of two point sets in space (a small sketch of this metric appears after this list).
    Scene mesh: $L_g$ from Eq. (3).
  3. Training strategy
    First, LEN and ODN are trained independently on SUN RGB-D and MGN is trained on Pix3D. Then Pix3D and SUN RGB-D are used together to supervise mesh generation, and all networks are trained jointly.
  4. Inference
    On a single 2080Ti, one scene takes about 1.2 s of inference time.
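
For reference, the Chamfer distance used for single-object evaluation can be sketched as follows (conventions such as squared distances and averaging may differ from the official evaluation script):

```python
import numpy as np

def chamfer_distance(P, Q):
    """Symmetric Chamfer distance between point sets P (|P|, 3) and Q (|Q|, 3)."""
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(axis=-1)   # pairwise squared distances
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()       # P -> Q plus Q -> P terms
```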

Experimental results

  1. Quantitative results for 3D object detection
    insert image description here
  2. Mesh reconstruction results
    insert image description here


Holistic 3D Scene Understanding from a Single Image with Implicit Representation

Abstract

insert image description here

Core points of the abstract:

  1. Purpose of the paper: 3D scene understanding from a single image, predicting object shapes, object poses, and the scene layout.
  2. Challenges: recovering 3D information from a single image is a well-known ill-posed problem, and indoor objects suffer from severe occlusion, contact, and clutter.
  3. Solution: (1) use the latest deep implicit representation; (2) propose an image-based local implicit network to improve the estimation of object shape; (3) propose a new scene graph convolutional network (SGCN) that refines the estimates of 3D object poses and the scene layout; (4) propose a new physical violation loss to handle intersecting objects.

Preliminaries

Graph Convolutional Network (GCN)

Intuitive definition: a convolution operation is performed on a graph structure to extract features, which are then used for node classification, graph classification, or link (edge) prediction.
Basic structure of a graph: nodes and edges; edges can be directed or undirected.
insert image description here
Basic principle: assume a batch of graph data with N nodes, each carrying D features. The node features form an N×D matrix X, and the relations between nodes form an N×N adjacency matrix A. X and A are the inputs of the GCN model.
Simple illustration: as shown in the figure below, the input passes through several hidden layers (graph convolutional layers, similar to fully connected layers) to the output layer, where the node features have been updated; the activation function can be ReLU, etc. A GCN takes a graph as input, and after several layers the feature of each node changes from X to Z; however, no matter how many layers are stacked, the connection relations between nodes, i.e. the adjacency matrix, are shared across all layers.
insert image description here
Propagation rule of a GCN:
$$H^{(l+1)}=\sigma\big(\tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}H^{(l)}W^{(l)}\big)$$
The feature matrix X in the figure below corresponds to H in the formula. Here $\tilde{A}=A+I$, where $I$ is the identity matrix; this adds self-connections to the adjacency matrix of the undirected graph G, so that during aggregation a node gathers information from its neighbors as well as from itself. $\tilde{D}$ is the degree matrix of $\tilde{A}$, $H^{(l)}$ is the feature matrix of layer $l$, and $W^{(l)}$ is the layer's weight matrix.
insert image description here
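
The propagation rule above can be written in a few lines of NumPy. This is only a didactic sketch of a single GCN layer, not tied to any particular library:

```python
import numpy as np

def gcn_layer(H, A, W):
    """One propagation step: H' = ReLU(D~^{-1/2} (A + I) D~^{-1/2} H W).
    H : (N, D) node features, A : (N, N) adjacency matrix, W : (D, D') layer weights."""
    A_hat = A + np.eye(A.shape[0])                         # add self-connections (A~ = A + I)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1))) # D~^{-1/2}
    return np.maximum(d_inv_sqrt @ A_hat @ d_inv_sqrt @ H @ W, 0.0)
```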

Paper contributions

The contributions of the paper are reflected in four aspects:

  1. Based on a single image, a two-stage 3D scene understanding system is designed. With the help of a deep implicit representation, object shapes, object poses, and the scene layout are predicted simultaneously, and the latter two are optimized jointly.
  2. A local implicit shape embedding network is proposed, which extracts accurate latent shape information of objects.
  3. A GCN-based scene context understanding network is proposed to refine the initial estimates.
  4. A new loss function, the physical violation loss, is designed to solve the problem of predicted objects intersecting each other.

Network structure

insert image description here

The figure above shows the overall network structure of this paper, which consists of an Initial Estimation Stage and a Refinement Stage:

  1. 2D Detector: input a 2D image and use Faster R-CNN for 2D object detection.
  2. Object Detection Network (ODN): input the image inside each 2D detection box and the geometry features generated from the 2D boxes, and estimate the initial 3D bounding-box parameters of each object.
  3. Local Implicit Embedding Network (LIEN): input the image inside the 2D detection box and the one-hot code of the detected category, extract implicit local shape information, and obtain the latent code of the object.
  4. Layout Estimation Network (LEN): input the scene image, and estimate the 3D layout bounding-box parameters and the camera pose.
  5. Scene Graph Convolutional Network (SGCN): a graph convolutional network that refines the initial estimates based on the context information in the scene.
  6. LDIF: the decoding module, which decodes the latent features from LIEN into concrete 3D geometry.

Scene Graph Convolutional Network (SGCN)

This module is one of the paper's innovations. With the help of a graph convolutional network (GCN), the features from the initial stage are fused and corrected to obtain a better scene reconstruction. The figure below shows the specific structure of SGCN. The inputs are the features of each module from the initial stage, which are converted into different types of node features (Layout Feature, Relation Feature, Object Feature in the figure below); separate MLPs then convert these features into representations of the same length (Layout Representation, Relation Representation, Object Representation, each of length 512).
insert image description here

The construction process of the Scene Graph Convolutional Network (SGCN) is as follows:

  1. The entire scene is constructed as a graph $G$. The graph has three types of nodes: object nodes, a scene layout node, and relation nodes between them. The initial graph has undirected edges, which ensures that information flows in both directions between objects and the layout. A relation node is then added between every object/layout pair.
  2. For different types of nodes, the input features are carefully designed according to their sources. For each node, the features are flattened and concatenated into a vector, which is then transformed by an MLP into a vector of the same length and embedded as the node's representation.
  3. Layout node: the features include the image features extracted by LEN, the parameterized output of the layout bounding box, the camera pose, and the camera intrinsics (as camera prior knowledge), all concatenated.
  4. Object node: the features include the image features extracted by ODN, the analytic code from LIEN, and the one-hot encoding of the category predicted by the 2D detector (introducing semantic information).
  5. Relation node: (1) the features of a relation node between two objects (object-object) include the geometry feature generated from the 2D boxes and the 3D box information of the objects; (2) the features of an object-layout relation node are initialized to fixed values and then learned by the SGCN to obtain a reasonable relation representation.
  6. Assuming the graph contains N objects and 1 layout, the object/layout nodes and the relation nodes can be written as two matrices $Z^o \in \mathbb{R}^{d\times(N+1)}$ and $Z^r \in \mathbb{R}^{d\times(N+1)^2}$. Independent message-passing weights are defined for each source-destination type: if the source node type is $a$ and the destination node type is $b$, the linear transformation and adjacency matrix are denoted $W^{ab}$ and $\alpha^{ab}$, respectively.
  7. Node update strategy: let the source object (or layout) be $s$, the destination object (or layout) be $d$, and the relation be $r$. The representations of object and layout nodes are updated as in formula (1) of the figure below, and relation nodes as in formula (2):
    insert image description here
  8. Message passing ×4: after four rounds of message passing, independent MLPs decode each object node representation into the bounding-box parameters $(\delta, d, s, \theta)$, and the layout node into $(C, s^l, \theta^l)$ and the camera pose $R(\beta,\gamma)$.
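
The exact update equations (1) and (2) are given in the figure above. As a purely schematic illustration of the typed message passing described in items 6-8, the sketch below updates object/layout and relation representations with independent per-type weights; all weight names, the adjacency shapes, and the ReLU choice are my assumptions, not the paper's equations:

```python
import numpy as np

def message_passing_step(Z_o, Z_r, A_or, A_ro, W):
    """One schematic SGCN-style update (illustrative only).
    Z_o  : (d, N+1)         object + layout node representations
    Z_r  : (d, (N+1)^2)     relation node representations
    A_or : (N+1, (N+1)^2)   adjacency from relation nodes to object/layout nodes
    A_ro : ((N+1)^2, N+1)   adjacency from object/layout nodes to relation nodes
    W    : dict of (d, d) weight matrices per source-destination type, e.g. W['oo'], W['ro']
    """
    # objects/layout aggregate messages from their relation nodes
    Z_o_new = np.maximum(W['oo'] @ Z_o + W['ro'] @ Z_r @ A_or.T, 0.0)
    # relation nodes aggregate messages from the objects/layout they connect
    Z_r_new = np.maximum(W['rr'] @ Z_r + W['or'] @ Z_o @ A_ro.T, 0.0)
    return Z_o_new, Z_r_new
```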

Local Implicit Embedding Network

  1. LIEN network structure
    As shown in the figure below, the inputs of the network are the 2D detection result and the one-hot encoding of the detected category (providing a shape prior), and the output is a $32\times42$ feature matrix.
    insert image description here
  2. Latent code and analytic code
    According to the network structure, LIEN produces a latent encoding of the object's shape information. Which features should then be fed into the graph nodes so that SGCN can effectively extract the context of the scene objects? The paper proposes to use local implicit representation features. As shown above, LIEN outputs a 32×42 feature matrix, where 32 corresponds to 32 3D elements and the 42 dimensions consist of 10 Gaussian function parameters (the analytic code) plus a 32-dimensional latent variable (the latent code). The Gaussian parameters describe the scaling constant, center point, radii, and Euler angles of each Gaussian function, so they clearly contain structural information about the 3D geometry. Using the analytic code as one of the object-node features therefore also provides SGCN with structural information about the local object.
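
A small sketch of how the 32×42 LIEN output splits into analytic and latent parts; the ordering of the fields inside the analytic code is my assumption, for illustration only:

```python
import numpy as np

def split_lien_output(feat):
    """Split the (32, 42) LIEN output into its two parts (illustrative sketch).
    feat : (32, 42) -- 32 shape elements, each with 10 analytic Gaussian parameters
           followed by a 32-dim latent code.
    """
    analytic = feat[:, :10]   # per element: scale constant (1), center (3), radii (3), Euler angles (3)
    latent = feat[:, 10:]     # 32-dim latent shape code consumed by the LDIF decoder
    return analytic, latent
```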

Loss functions

  1. Loss functions of individual modules
    LIEN and LDIF are first trained together with the loss
    $$L_p=\lambda_{ns}L_{ns}+\lambda_{us}L_{us} \tag{3}$$
    where $L_{ns}$ and $L_{us}$ are the losses on sample points near the object surface and on uniformly sampled points, respectively.
    LEN loss:
    $$L_{LEN}=\sum_{y\in\{\beta,\gamma,C,s^l,\theta^l\}}\lambda_y L_y \tag{4}$$
    ODN loss:
    $$L_{ODN}=\sum_{x\in\{\delta,d,s,\theta\}}\lambda_x L_x \tag{5}$$
  2. Joint loss
    $$L_j=L_{LEN}+L_{ODN}+\lambda_{co}L_{co}+\lambda_{phy}L_{phy} \tag{6}$$
    where $L_{phy}$ is a new loss proposed to solve the problem of predicted objects intersecting each other:
    $$L_{phy}=\frac{1}{N}\sum_{i=1}^N\frac{1}{|S_i|}\sum_{x \in S_i}\big\lVert \mathrm{relu}\big(0.5-\mathrm{sig}(\alpha\, LDIF_i(x))\big)\big\rVert \tag{7}$$
    Here $LDIF_i(x)$ maps a sample point $x$ to the LDIF value of object $i$, and the sigmoid converts it to a predicted inside/outside value. The ReLU ensures that only points in the intersection region are penalized. The penalized intersection region is illustrated in the figure below:
    insert image description here
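
A hedged NumPy sketch of the physical violation loss of Eq. (7); the value of the sharpness constant alpha and the inside/outside sign convention are assumptions, not taken from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def physical_violation_loss(ldif_values, alpha=100.0):
    """Eq. (7) as a sketch. ldif_values[i] is an array of LDIF_i(x) evaluated at the
    sample points x in S_i (points where object i would intersect its neighbors);
    alpha is a placeholder sharpness constant. With the sign convention assumed here,
    points classified as inside give sigmoid(alpha * LDIF) < 0.5 and receive a positive
    penalty; all other points are zeroed out by the ReLU."""
    total = 0.0
    for vals in ldif_values:                                  # (|S_i|,) values for object i
        penalty = np.maximum(0.5 - sigmoid(alpha * vals), 0.0)
        total += penalty.mean()
    return total / len(ldif_values)
```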

Experiment settings

  1. Datasets
    Pix3D and SUN RGB-D; see the earlier analysis of Total3D for details.
  2. Evaluation metrics
    Same as for Total3D above.
  3. Experimental details
    The output of Total3D's 2D object detector is used directly, and LEN and ODN have the same structure as in Total3D. LIEN is trained on Pix3D together with LDIF, and SGCN is trained on SUN RGB-D. The training procedure again trains each module independently first and then trains them jointly. When SGCN is trained independently, the $L_{phy}$ term is not used.

Experimental evaluation

  1. Quantitative results on SUN RGB-D
    insert image description here
  2. Mesh reconstruction results
    insert image description here

Source: blog.csdn.net/kxh123456/article/details/129155588