Texture Mapping for 3D Reconstruction with RGB-D Sensor (CVPR 2018): paper translation with reading notes

Paper link

Abstract

Obtaining realistic texture details for 3D models is important in 3D reconstruction. However, the presence of geometric errors caused by noisy RGB-D sensor data always prevents the color image from being accurately aligned to the reconstructed 3D model. In this paper, we propose a global-to-local correction strategy to obtain more desirable texture mapping results. Our algorithm first adaptively selects an optimal image for each face of a 3D model, which can effectively remove blurring and ghosting artifacts produced by the blending of multiple images. We then employ a non-rigid global-to-local correction step to reduce seams between textures. This can effectively compensate for texture and geometric misalignment caused by camera pose drift and geometric errors. We evaluate the proposed algorithm on a range of complex scenes and demonstrate its effective performance in generating seamless, high-fidelity textures for 3D models.

1. Introduction

With the advent of RGB-D sensors, 3D reconstruction has made significant progress in recent years. While both small-scale objects and large-scale scenes can be modeled with impressive geometric detail [10, 17, 26, 27, 28], the fidelity of the textures recovered for 3D models is still not satisfactory [7, 13, 22].

Why is texture mapping lagging behind 3D modeling? There are four reasons: 1) Due to the noise in depth data, the reconstructed 3D model is always accompanied by geometric errors and distortions. 2) In camera trajectory estimation, pose residuals gradually accumulate and cause camera drift. 3) The timestamps of captured depth frames and color frames are not fully synchronized. 4) RGB-D sensors usually have low resolution, and the color images are also easily affected by lighting and motion conditions. All of these challenges lead to misalignment between the geometric model and the corresponding images, and thus to suboptimal mapping results.

Although projection mapping methods [21, 23] can reduce the blurring and ghosting artifacts caused by multi-image blending, texture bleeding is unavoidable on the boundaries between different views due to geometric registration errors and camera trajectory drift. Zhou and Koltun [30] proposed an optimization framework using local image warping that compensates for geometric misalignment. However, this method needs to subdivide the mesh model, which greatly increases the amount of data and limits its range of application. In addition, the weighted-average strategy commonly adopted in multi-image blending is sensitive to lighting changes and to the motion blur caused by fast camera movement.

To overcome these challenges, this paper proposes a novel texture mapping method that can perform global-to-local non-rigid correction optimization. First, we select the best image for each face to avoid artifacts in multi-image blending. In the global optimization step, we use a joint color-consistency and geometry-consistency optimization to correct the camera pose of each texture patch from different views. Then, in a local optimization step, we find an additional transformation for the border vertices of the patch to improve texture coordinate drift caused by geometric errors. Finally, we employ a texture atlas to map the texture onto the desired 3D model.

We validate the effectiveness of the proposed method on a series of complex scenes and show high-fidelity textures. Compared with the method of [30], our method is much faster and requires far fewer triangles. Texture blur artifacts are also largely eliminated. Compared with [23], our method can effectively reduce seam inconsistency between face boundaries and can tolerate geometric misalignment.

2. Related Work

Texture mapping is an important step in obtaining realistic 3D models [11, 16, 26, 29]. In this section, we review some related methods for improving texture mapping.

Blending-Based Approaches: A common approach to texture mapping is to blend multiple images into textures using different weighted-averaging strategies [3, 8, 20]. Current RGB-D reconstruction systems [25, 19, 7] mainly rely on truncated signed distance function (TSDF) representations. This means that, in addition to the TSDF volume grid, they also need an additional color volume grid in which the color of each vertex is computed as a running weighted average over multiple images. However, this approach is sensitive to noise: blurring and ghosting easily occur if the recovered camera pose or 3D geometry is slightly inaccurate. In addition, the model subdivision process and the variation of the model's projected size across viewpoints also affect its performance.

Projection-Based Approaches: Another mechanism is projective texture mapping, which associates each face or vertex with an appropriate image. [21] used a pairwise Markov random field to select the optimal image for each face. Inspired by this work, Allene et al. [2] and Gal et al. [14] introduced additional metrics to refine the data term and the smoothing term and select more appropriate views. However, these methods face the challenging problem of how to alleviate the visual seams between adjacent face textures. To overcome this problem, they had to add post-processing, exploiting multi-band blending [6] and Poisson editing [15], respectively. [23] proposed a global color adjustment algorithm to reduce the visual breaks caused by view projections. Although these methods can greatly reduce the blurring and ghosting artifacts caused by multi-image blending, texture bleeding is unavoidable on the boundaries between different views due to geometric registration errors and camera trajectory drift.

Warp-Based Approaches: Unlike the above methods, warp-based methods are more resistant to the misalignment problems caused by geometric errors and camera drift. Eisemann et al. [12] introduced a local texture warping method that estimates the optical flow between projected texture images. Aganj et al. [1] apply different deformations to different images in order to fit the recovered mesh; the displacement field is computed by approximately matching feature points in different views with thin-plate splines. Furthermore, Zhou and Koltun [30] designed a texture mapping framework in which both camera pose and geometric errors are corrected by local image warping. However, this method needs to subdivide the mesh model, which greatly increases the amount of data and limits its range of application. In addition, these methods still suffer from blurring artifacts since a weighted-average blending strategy is used. Recently, Bi et al. [4] used patch-based synthesis to generate new target texture images for each face to compensate for camera drift and reconstruction errors, but scenes containing dynamic shadows remain a challenge for this method.

3. Overview

The purpose of this work is to map texture images onto 3D models acquired by commodity depth cameras. The input is an RGB-D sequence or real-time video containing depth frames and corresponding color frames, and the output is a 3D model accompanied by high-fidelity textures. To achieve this goal and overcome the aforementioned challenges, we propose a global-to-local optimization strategy that consists of four main steps. Figure 1 shows an overview of the proposed method.

Figure 1: Overview of the method proposed in this paper. (a) Input images for texture mapping. (b) The best texture image selected for each face; numbers in different colors indicate the selected image index. (c) Result using only global optimization. (d) Result of the global-to-local optimization.

Input: The input to our algorithm is an RGB-D sequence or real-time video acquired by a Kinect v1. For more detailed color information, an additional HD camera can also be mounted on top of the Kinect to obtain high-resolution texture images. However, for a fair comparison, we still take the low-resolution color images of the Kinect v1 as experimental input.

Preprocess: Reconstruct the mesh model from the input depth sequence as the initial model M0 for texture mapping, and extract a subset of frames from the original color sequence as texture candidates. To improve quality and reduce computational complexity, unlike [30], we utilize [28] instead of KinectFusion [18, 22] to reconstruct the 3D model, and we select texture candidate images by weighting image sharpness, jitter, blur, and viewpoint overlap. This step produces an initial model M0 and a set of camera poses {T_i^0} corresponding to the selected color image subsequence {Ci} and depth image subsequence {Di}.

Optimization: To construct high-fidelity textures, our method combines the advantages of [23] and [30]. We select the best texture image for each face of the model to avoid the blurring caused by multi-image blending. By treating each candidate image as a label, we formulate the selection problem as a multi-label Markov random field whose terms account for the angle between the viewing direction and the face normal, the projected area, and the distance from the model face to the camera plane. However, since neither T^0 nor M0 is absolutely accurate, adjacent faces with different labels usually cannot be stitched perfectly. To solve this problem, we adopt a global-to-local optimization strategy. In the global optimization, we adjust the camera pose of each texture patch according to the color consistency and geometric consistency between related patches. In the local optimization stage, we introduce an additional transformation to refine the texture coordinates on the boundaries between different patches and produce seamlessly tiled textures.

Texture Atlases: Finally, we utilize texture atlases to map the desired textures onto the 3D model. At the optimized camera pose, each face is projected onto its associated texture image to obtain the projected area. Each projected area is used to build a texture atlas while recording the vertex coordinates of each triangular face. Then, we transform them into atlas space. In this way, the texture of each vertex can be directly retrieved in the atlas through texture coordinates, and the final texture model can be generated.

4. Texture Mapping Method

In this section, we explain each step in more detail. Let M0 denote the reconstructed mesh model for texture mapping, and let {vi} and {fi} be the vertex set and face set of M0 respectively, where each face is a triangle of the mesh. T is a 4 × 4 transformation matrix that transforms a vertex vi of M0 from world coordinates to local camera coordinates, defined as:
$$T = \begin{bmatrix} R & t \\ \mathbf{0}^{T} & 1 \end{bmatrix} \tag{1}$$
where R is a 3 × 3 rotation matrix and t is a 3 × 1 translation vector.
We also denote the perspective projection of a 3D vertex v = [x, y, z]^T onto the 2D image plane by Π. The pixel coordinates u = (u, v) of vertex v on the image plane can then be calculated by:
$$\mathbf{u} = \Pi(v) = \left( f_x \frac{x}{z} + c_x,\; f_y \frac{y}{z} + c_y \right)^{T} \tag{2}$$
where K is the camera intrinsic matrix, fx, fy are the focal lengths, and cx, cy are the coordinates of the camera center in the pinhole camera model. Furthermore, we use D to denote a depth image, C to denote a color image, and I to denote the intensity (grayscale) image of the corresponding color image.
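To make the notation above concrete, here is a minimal sketch (not taken from the paper) of the projection Π implemented with an extrinsic matrix T and an intrinsic matrix K; the numeric values below are placeholders, not calibrated Kinect parameters.

```python
import numpy as np

def project_vertex(v_world, T, K):
    """Map a 3D world-space vertex to pixel coordinates u = (u, v).

    T is the 4x4 world-to-camera extrinsic matrix [R t; 0 1] of Equation 1,
    K is the 3x3 intrinsic matrix with focal lengths fx, fy and
    principal point (cx, cy)."""
    v_h = np.append(v_world, 1.0)      # homogeneous coordinates
    x, y, z, _ = T @ v_h               # local camera coordinates
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    return np.array([fx * x / z + cx, fy * y / z + cy])

# Example with placeholder (uncalibrated) intrinsics and an identity pose.
K = np.array([[525.0, 0.0, 319.5],
              [0.0, 525.0, 239.5],
              [0.0, 0.0, 1.0]])
T = np.eye(4)
print(project_vertex(np.array([0.1, 0.2, 1.5]), T, K))
```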

4.1. Model Reconstruction

The input to our pipeline is a stream of depth images and an accompanying sequence of RGB color frames. In our system, we use a Microsoft Kinect v1 to capture these data. Since the input frames of the Kinect v1 have low resolution and are susceptible to motion blur and jitter, we select a subset of high-confidence frames for scene modeling and texture mapping.
Our method utilizes the Sparse Sequence Fusion (SSF) method [28] instead of KinectFusion [18, 22] to reconstruct the initial 3D model and extract color frames with high confidence. This method takes into account jitter, blur, and other factors that contribute to scan noise, and it reconstructs the mesh model M0 from the sparse depth image sequence {Di}. The basic objective function of [28] is defined as follows:
[Equation 3: the basic SSF objective Essf(i), combining the terms Esel(i), Ejit(i), Edif(i), and Evel(i) described below; equation image not reproduced in this post.]
where Esel(i) is a switch term that controls the selection of the depth image Di; it is set to 1 if the current image is considered valid to integrate, and 0 otherwise. Ejit(i) measures the effect of jitter by computing instantaneous viewpoint changes between selected images. The continuity term Edif(i) ensures sufficient scene overlap between two selected support images by computing camera pose changes, and Evel(i) evaluates the camera motion velocity. In addition to these terms, in order to obtain sharp images, we introduce a term describing the quality of each color frame. Equation 4 shows our objective function for frame extraction:
$$E(i) = E_{ssf}(i) + \lambda_{cla}\, E_{cla}(i) \tag{4}$$

where Essf is the SSF term and λ_cla is a balance parameter. We use λ_cla = 10 in our experiments; the other parameters are set according to [28]. The sharpness term Ecla is defined as:
[Equation 5: the sharpness term Ecla(i), based on the blur value θ of the color image Ci and gated by Esel(i); equation image not reproduced in this post.]
The blur value θ is calculated by [9]. Equation 5 states that once a depth image Di is added to the support subset, the sharpness of its corresponding color image Ci must be calculated; otherwise, it is directly ignored according to the value of Esel(i). The iteration continues until all captured images have been processed. This produces a sparse sequence of color images {Ci} with associated camera poses {T_i^0} that can be used as texture candidates.
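The exact forms of Equations 3-5 follow [28] and the paper; the snippet below is only a hedged illustration of how a blur-based sharpness score could gate frame selection alongside an SSF cost. The helper blur_metric is a hypothetical stand-in for the blur value θ of [9], the ssf_cost is assumed to be computed elsewhere, and only λ_cla = 10 is taken from the paper.

```python
import numpy as np

LAMBDA_CLA = 10.0  # balance parameter reported in the paper

def blur_metric(color_image):
    """Hypothetical stand-in for the blur value θ of [9]: a crude
    inverse measure of Laplacian-response variance (higher = blurrier)."""
    gray = color_image.mean(axis=2)
    lap = np.abs(np.gradient(np.gradient(gray, axis=0), axis=0)) + \
          np.abs(np.gradient(np.gradient(gray, axis=1), axis=1))
    return 1.0 / (lap.var() + 1e-6)

def frame_cost(ssf_cost, color_image, selected):
    """Total cost E(i) = E_ssf(i) + λ_cla * E_cla(i) for one candidate frame.
    The sharpness term is evaluated only if E_sel(i) marks the depth frame
    as selected; otherwise it is ignored, as described for Equation 5."""
    e_cla = blur_metric(color_image) if selected else 0.0
    return ssf_cost + LAMBDA_CLA * e_cla
```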

4.2. Texture Image Selection

Many texture mapping methods [3, 8, 20] project the mesh onto multiple image planes and then employ a weighted-average blending strategy to synthesize model textures from the pixels [11, 16]. They implicitly assume that the estimated geometric surface and camera poses are sufficiently accurate, but in practice this assumption is easily violated. Therefore, instead of synthesizing textures directly from multiple images, we select an optimal texture image for each face of the model M0. By treating each candidate image as a label, we formulate this selection problem as a pairwise Markov random field (MRF) following [2]:
$$E(C) = \sum_{f \in \{f_i\}} E_d(f, C_f) + \sum_{(f, f') \in \varepsilon} E_s(f, f', C_f, C_{f'}) \tag{6}$$

The data term Ed projects each model face onto each candidate image Ci and measures the area of the projected region; it accounts for viewpoint proximity, viewing angle, image resolution, and visibility constraints, and is defined in Equation 7 (equation image not reproduced in this post). The smoothing term Es, given in Equation 8 (equation image not reproduced in this post), is an integral computed along the edge e that measures the color difference across it, where e is the common edge between adjacent faces assigned to different texture images (Ci, Cj), and ε is the entire edge set of the model M0. The MRF energy E(C) of Equation 6 is minimized with graph cuts and alpha expansion [5].
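As an illustration (not the paper's implementation), the function below evaluates the pairwise MRF energy of Equation 6 for a candidate labeling, assuming per-face data costs and a per-edge color-difference callback are precomputed; in practice the energy is minimized with graph cuts and alpha expansion [5] using an off-the-shelf solver rather than evaluated exhaustively like this.

```python
def labeling_energy(labels, data_cost, edges, smooth_cost):
    """Evaluate the pairwise MRF energy E(C) of Equation 6 for one labeling.

    labels      : list, chosen image index per face
    data_cost   : 2D array-like, data_cost[f][c] for face f and image c
                  (e.g. derived from the projected area, Equation 7)
    edges       : list of (face_a, face_b) adjacent face pairs (the edge set ε)
    smooth_cost : callable(face_a, face_b, label_a, label_b) returning the
                  color difference integrated along the shared edge (Equation 8)
    """
    e_data = sum(data_cost[f][labels[f]] for f in range(len(labels)))
    e_smooth = sum(smooth_cost(a, b, labels[a], labels[b])
                   for a, b in edges if labels[a] != labels[b])
    return e_data + e_smooth
```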

4.3. Global Optimization

The above steps associate each face with a texture image Ci. However, due to geometric errors and camera drift, directly applying texture stitching or color-adjustment post-processing [21, 14] cannot make the textures on adjacent faces visually consistent. This is the main challenge of projection texture mapping methods. To eliminate visual seams, we borrow the idea of off-mesh correction to stitch the textures between adjacent faces.

Through the extrinsic matrix T^0 and the intrinsic matrix K, the model faces can easily be projected into their associated images to obtain texture colors. However, the matrices {T_i^0} are always noisy, so the texture colors obtained through these transformations may also be inaccurate. Therefore, in this section, we optimize {T_i^0} to ensure that all faces taken from different texture images can be closely aligned.

We first cluster the faces of the model according to their texture images {Ci}: if two adjacent faces correspond to the same texture image, we put them into the same labeled cluster. After traversing all faces, a collection of clusters is obtained, as shown in Figure 2. For clarity, we call all faces in the same cluster a chart. To improve robustness, if the number of faces Fi in chart i is less than a threshold FN, the chart is merged into its closest neighbor j, where closeness is measured by three factors: 1) the viewpoint angle between the texture images of charts i and j should be minimal; 2) the number of faces of chart j satisfies Fj > FN; 3) the projections of all vertices of chart i onto the texture image of chart j should still remain within the image bounds. In subsequent experiments, we empirically set FN = 50. Based on this clustering, we build an undirected connected graph G over the charts; if two charts are adjacent to each other, there is an edge gij ∈ G linking them.

Figure 2: Clustering model faces based on their texture images.
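A minimal flood-fill sketch of this clustering step is shown below; the data-structure names are assumptions, and the merging of small charts (Fi < FN) into their closest neighbors and the construction of the graph G are omitted.

```python
from collections import deque

F_N = 50  # minimum chart size used in the paper

def cluster_charts(face_labels, face_adjacency):
    """Group adjacent faces that share the same texture image into charts.

    face_labels    : dict face_id -> selected image index
    face_adjacency : dict face_id -> iterable of neighboring face ids
    Returns a list of charts, each a set of face ids.
    """
    visited, charts = set(), []
    for seed in face_labels:
        if seed in visited:
            continue
        chart, queue = set(), deque([seed])
        visited.add(seed)
        while queue:
            f = queue.popleft()
            chart.add(f)
            for g in face_adjacency[f]:
                if g not in visited and face_labels[g] == face_labels[seed]:
                    visited.add(g)
                    queue.append(g)
        charts.append(chart)
    return charts
```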
The textures of the faces inside a chart come from the same image, so they line up well. This means that, in order to generate natural textures for the model, we only need to adjust the textures between different charts. For ideal texture mapping, we expect that a chart's border texture can be fully recovered from the textures of its adjacent charts. Based on this observation, we align adjacent charts whenever possible to minimize the inconsistency between each chart's associated texture and the textures projected from its neighbors. However, considering only color consistency may lead to misalignment in textureless areas. Therefore, we also consider geometric consistency, which appears as the regularization term in Equation 9. We construct the objective function below by measuring the color consistency and geometric consistency of each chart.

[Equation 9: the global objective, with a color-consistency term between each chart and its neighbors and a geometry-consistency regularization term; equation image not reproduced in this post.]

Here vk ranges over the entire vertex set of chart i, and N is its size. N_chart denotes the number of charts on the model M0. The function φ(x) returns the Z component of the vector x. Gi denotes the neighborhood of chart i. The first term makes the texture of chart i consistent with the projected texture of its neighbor chart j. The second term ensures that, as T changes, the optimized camera pose not only keeps the textures consistent but also keeps the reconstructed model consistent with the depth images acquired by the RGB-D camera, and it prevents the camera pose T from drifting away from its initial value T^0 when the color constraints are insufficient. By minimizing Equation 9, we compute a rectification transformation matrix for each chart, which brings adjacent charts closer to each other and reduces visual seams.
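Since the image of Equation 9 is not reproduced above, the following sketch only illustrates, under simplifying assumptions, how color-consistency and geometry-consistency residuals for one chart i and one neighboring chart j might be stacked for a Gauss-Newton solver; project, sample, and the weight lam are assumed helpers and parameters, not the paper's exact formulation.

```python
import numpy as np

def project(v, T, K):
    """World-space vertex -> (pixel coordinates, camera-space depth z)."""
    x, y, z, _ = T @ np.append(v, 1.0)
    return np.array([K[0, 0] * x / z + K[0, 2],
                     K[1, 1] * y / z + K[1, 2]]), z

def sample(img, uv):
    """Nearest-neighbour lookup; a real implementation would interpolate."""
    u, v = int(round(uv[0])), int(round(uv[1]))
    h, w = img.shape[:2]
    return img[min(max(v, 0), h - 1), min(max(u, 0), w - 1)]

def chart_residuals(T_i, T_j, vertices, I_i, I_j, D_i, K, lam=0.1):
    """Residuals in the spirit of Equation 9 for chart i and one neighbor j:
    a color-consistency term and a depth (geometry) term per vertex.
    I_i, I_j are intensity images, D_i the depth image of chart i's view.
    The exact weighting/normalization of the paper is not reproduced here."""
    res = []
    for v in vertices:
        uv_i, z_i = project(v, T_i, K)
        uv_j, _ = project(v, T_j, K)
        res.append(float(sample(I_i, uv_i)) - float(sample(I_j, uv_j)))
        res.append(lam * (z_i - float(sample(D_i, uv_i))))
    return np.array(res)
```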

4.4. Local Optimization

Although the global optimization is able to stitch most texture regions, for some regions with large geometric errors (as shown by the red boxes in Fig. 1(c)), the textures still cannot be precisely aligned. Global optimization can only correct camera drift for each chart. If the reconstructed geometry were accurate enough, all textures would be nicely stitched after global optimization. Unfortunately, the pervasiveness of geometric errors makes global optimization insufficient for high-fidelity texture mapping. Therefore, we introduce a further adjustment on each face of the model so that local textures can also be well aligned.

Because all faces of a chart correspond to the same texture image, there is no need to optimize all vertices of the chart individually. Furthermore, since each chart has been approximately aligned in the global optimization step, only a small fraction of vertices needs to be corrected to compensate for the texture misalignment caused by geometric errors. Instead of editing the mesh model, we modify the projected coordinates of the border vertices of each chart. As shown in Fig. 3(b), in order to align the texture at vertex v, we can shift v's projected coordinates in image A to align with v's texture in image B. As long as the border vertices are optimized, the textures of the entire chart will be well aligned.

Figure 3: (a) Projected areas of two adjacent charts on their respective texture images. (b) Correcting the texture coordinates of vertex v in chart A to align with v's texture coordinates in chart B's texture image.
However, moving the projected coordinates of vertices directly is an ill-posed problem. To address this, we find the optimal movement of the texture coordinates of each border vertex so that it aligns with the textures of the adjacent charts. To do this, instead of computing the movement vector directly, we compute an additional transformation matrix for each vertex v on the chart boundary. The additional transformation ensures that the chart on which vertex v resides is sufficiently aligned with the charts connected to v. We then use this matrix to obtain the best projected coordinates of v as its texture coordinates. This texture coordinate correction aligns the local textures at each boundary vertex v. We design the following objective function to compute the correction matrix for v:

[Equation 10: the local objective for each boundary vertex, with a data term that aligns its projected textures across adjacent charts and a regularization term that keeps the additional transformation close to the globally optimized result; equation image not reproduced in this post.]
where j denotes a boundary vertex of chart i, k denotes the adjacent chart of i sharing vertex j, and v denotes a vertex of chart i. Tij is the additional transformation matrix used to correct the texture coordinates of vertex j, so that the projected texture of vertex j in chart i is consistent with its projected texture on the texture image of the adjacent chart k. Ti and Tk are the transformation matrices optimized for the two charts (chart A and chart B in Figure 3) in the global optimization step. I denotes the identity matrix. The first term is the data term, which aligns the textures of vertices on the chart boundaries as closely as possible. The second term is a regularization term, which ensures that the additional matrix does not deviate from the result of the global optimization.
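Analogously, a hedged sketch of the per-boundary-vertex residuals suggested by Equation 10 might look as follows; the parametrization of the additional transformation and the regularization weight are assumptions, and the project/sample helpers from the previous sketch are reused.

```python
import numpy as np

def boundary_residuals(T_extra, T_i, T_k, v, I_i, I_k, K, lam=1.0):
    """Residuals in the spirit of Equation 10 for one boundary vertex v shared
    by chart i and its neighbor chart k. T_extra is the additional correction
    applied on top of the globally optimized pose T_i; the regularizer keeps
    it close to the identity. Weights and parametrization are simplifying
    assumptions, not the paper's exact formulation."""
    uv_i, _ = project(v, T_extra @ T_i, K)   # corrected projection, chart i's image
    uv_k, _ = project(v, T_k, K)             # projection into neighbor k's image
    color_term = float(sample(I_i, uv_i)) - float(sample(I_k, uv_k))
    reg_term = lam * (T_extra - np.eye(4)).ravel()
    return np.concatenate(([color_term], reg_term))
```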

We use Gauss-Newton iteration to solve Equations 9 and 10. From the global optimization we obtain the camera transformation Ti of each chart. For each vertex on a chart boundary, we then obtain an additional transformation to correct its projected texture coordinates, which aligns its texture with the textures of the adjacent faces at that vertex. We repeat this process until all boundary texture coordinates have been processed.
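For reference, a generic Gauss-Newton loop with a numeric Jacobian, which could be applied to stacked residual vectors such as those sketched above, might look like this (a sketch, not the authors' solver):

```python
import numpy as np

def gauss_newton(residual_fn, x0, iters=10, eps=1e-6):
    """Generic Gauss-Newton loop with a forward-difference Jacobian, as could
    be applied to stacked residuals like those sketched above. The paper's
    actual parametrization and solver details may differ."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        r = residual_fn(x)
        J = np.empty((r.size, x.size))
        for k in range(x.size):
            dx = np.zeros_like(x)
            dx[k] = eps
            J[:, k] = (residual_fn(x + dx) - r) / eps   # numeric Jacobian column
        step, *_ = np.linalg.lstsq(J, -r, rcond=None)   # solve normal equations
        x = x + step
    return x
```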

Each whole chart can be projected onto its texture image by Ti to obtain texture coordinates. For the vertices on chart boundaries, we further apply the transformation Tij for non-rigid correction to obtain the corrected texture coordinates. We save the texture coordinates and build a texture atlas. Finally, using the texture atlas technique, we generate seamlessly textured models.
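A minimal sketch of turning the optimized poses into per-face texture coordinates is given below; atlas packing and the per-boundary-vertex corrections Tij are omitted, the project helper from the global-optimization sketch is reused, and the data-structure names are assumptions.

```python
import numpy as np

def face_texture_coords(faces, vertices, chart_of_face, chart_pose, K, img_size):
    """Project each vertex of every face into its chart's texture image with
    the optimized pose T_i and normalize to [0, 1] texture coordinates.

    faces         : dict face_id -> (ia, ib, ic) vertex indices
    vertices      : dict or array of 3D vertex positions
    chart_of_face : dict face_id -> chart id
    chart_pose    : dict chart id -> optimized 4x4 pose
    img_size      : (width, height) of the texture images
    """
    w, h = img_size
    uv = {}
    for f_id, (ia, ib, ic) in faces.items():
        T = chart_pose[chart_of_face[f_id]]
        uv[f_id] = [project(vertices[i], T, K)[0] / np.array([w, h])
                    for i in (ia, ib, ic)]
    return uv
```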

5. Results

(Omitted in the original post.)

6. Conclusion

In this paper, we propose a non-rigid texture mapping method to reconstruct 3D models via RGB-D sensors. The input to our method is an RGB-D video sequence, and the output is a 3D reconstructed model with high-quality textures. We introduce a global optimization step to adjust texture locations, and design local optimizations to further refine texture boundaries. Experiments show that our method can produce high-fidelity textured models even in challenging scenes. In the future, we hope to import visual saliency information [24] into our framework for more detailed texture restoration.

7. References

(Omitted in the original post.)


Source: blog.csdn.net/qq_44324007/article/details/127122302