[Paper Notes] Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models


Summary

We present Text2Room†, a method for generating room-scale textured 3D meshes from a given text prompt as input. To this end, we leverage pre-trained 2D text-to-image models to synthesize a sequence of images from different poses. In order to lift these outputs into a consistent 3D scene representation, we combine monocular depth estimation with a text-conditioned inpainting model. The core idea of our approach is a tailored viewpoint selection such that the content of each image can be fused into a seamless, textured 3D mesh. More specifically, we propose a continuous alignment strategy that iteratively fuses scene frames with the existing geometry to create a seamless mesh. Unlike existing works that focus on generating single objects [56, 41] or zoom-out trajectories [18] from text, our method generates complete 3D scenes with multiple objects and explicit 3D geometry. We evaluate our approach using qualitative and quantitative metrics, demonstrating it as the first method to generate room-scale 3D geometry with compelling textures from only text as input.

method

  • A two-stage viewpoint selection scheme is proposed: first the layout of the scene is generated, then the remaining holes are filled. The mesh is updated through iterative scene generation, calling the 2D text-to-image model from a different viewpoint each time. To keep the geometry of each newly generated image consistent with the existing geometry, depth alignment is used.

Iterative 3D scene generation

  • The scene geometry is represented as a mesh
    $\mathcal{M} = (\mathcal{V}, \mathcal{C}, \mathcal{S})$
  • where $\mathcal{M}$ is the mesh, $\mathcal{V}$ is the set of vertices, $\mathcal{C}$ holds the vertex colors, and $\mathcal{S}$ is the set of faces; the whole mesh is built iteratively.
  • The input to the algorithm is a set of text prompts $\{P_t\}_{t=1}^T$ and the camera poses corresponding to these prompts
    • These poses are predefined, not precisely estimated poses as in NeRF
  • In each iteration, the existing mesh is first rasterized under the specified pose to render an RGB image $I_t$, a depth map $d_t$, and a mask $m_t$; the not-yet-generated regions are masked out
    $I_t, d_t, m_t = r(\mathcal{M}_t, E_t)$
  • The rendered image and mask are then used to inpaint the missing RGB regions, conditioned on the text prompt
    $\hat{I}_t = \mathcal{F}_{t2i}(I_t, m_t, P_t)$
  • The depth map is then completed as well, aligned against the already known depth
    $\hat{d}_t = \operatorname{align}(\mathcal{F}_d, I_t, d_t, m_t)$
  • With the newly generated RGB-D content, a new part of the scene is obtained and fused with the existing scene, yielding an incrementally updated mesh
    $\mathcal{M}_{t+1} = \operatorname{fuse}(\mathcal{M}_t, \hat{I}_t, \hat{d}_t, m_t, E_t)$
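
As a concrete view of one iteration, here is a minimal Python-style sketch of the loop; `rasterize`, `inpaint_rgb`, `complete_and_align_depth`, and `fuse_mesh` are hypothetical stand-ins for $r$, $\mathcal{F}_{t2i}$, the depth alignment step, and $\operatorname{fuse}$, not the authors' code.

```python
# Minimal sketch of the iterative scene-generation loop (hypothetical helper names).

def generate_scene(mesh, prompts, poses,
                   rasterize, inpaint_rgb, complete_and_align_depth, fuse_mesh):
    """Iteratively grow a textured mesh from text prompts and preset camera poses."""
    for P_t, E_t in zip(prompts, poses):
        # Render the current mesh from the new pose; m_t marks not-yet-generated pixels.
        I_t, d_t, m_t = rasterize(mesh, E_t)

        # Text-conditioned inpainting fills the masked RGB regions.
        I_hat = inpaint_rgb(I_t, m_t, P_t)

        # Depth completion plus scale/shift alignment against the rendered depth.
        d_hat = complete_and_align_depth(I_hat, d_t, m_t)

        # Back-project the newly generated pixels and fuse them into the mesh.
        mesh = fuse_mesh(mesh, I_hat, d_hat, m_t, E_t)
    return mesh
```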

depth alignment


  • Directly estimating the depth of a newly generated image leads to misalignment between the new depth and the depth of the already existing regions, producing discontinuities in the 3D geometry.

  • To this end, a two-stage depth alignment is used: first, a depth inpainting network completes the depth, conditioned on the partial depth of the existing regions and the generated RGB image; then the least-squares method estimates a scale and a shift.

  • The scale and shift are estimated by least squares, minimizing the difference between the linearly transformed predicted disparity and the rendered (ground-truth) disparity
    $\min_{\gamma, \beta}\left\|m \odot\left(\frac{\gamma}{\hat{d}_p}+\beta-\frac{1}{d}\right)\right\|^2$

  • The final depth $\hat{d}$ is a linear transformation of the predicted depth
    $\hat{d}=\gamma \cdot \hat{d}_p+\beta$

  • Finally, a $5 \times 5$ Gaussian filter is applied at the mask edges for smoothing
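
A minimal NumPy sketch of the scale/shift estimation, following the two equations above; the mask semantics (which pixels count as valid) and the $5 \times 5$ Gaussian smoothing at mask borders are simplified assumptions, not the authors' implementation.

```python
import numpy as np

def align_depth(d_pred, d_rendered, valid):
    """Least-squares scale/shift alignment of predicted depth to rendered depth,
    with the objective expressed in disparity (inverse-depth) space as in the note.

    d_pred     : HxW predicted depth from the completion network
    d_rendered : HxW depth rendered from the existing mesh
    valid      : HxW boolean mask of pixels where the rendered depth is reliable
    """
    x = 1.0 / d_pred[valid]       # predicted disparity
    y = 1.0 / d_rendered[valid]   # "ground-truth" disparity from the existing mesh
    A = np.stack([x, np.ones_like(x)], axis=1)
    # Solve min_{gamma, beta} || A @ [gamma, beta] - y ||^2
    (gamma, beta), *_ = np.linalg.lstsq(A, y, rcond=None)
    # Apply the linear transform to the predicted depth, as written in the note.
    # (The paper additionally smooths the depth at the mask borders with a 5x5
    #  Gaussian filter, omitted here.)
    return gamma * d_pred + beta
```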

mesh fusion

  • During iterative generation, new RGB and depth are produced at every step, and the newly generated parts must be fused with the existing scene.
  • First, the pixels are back-projected from image space into a 3D point cloud; only the region inside the mask (the newly generated part) is projected
    $\mathcal{P}_t=\left\{E_t^{-1} K^{-1}\left(u, v, \hat{d}_t(u, v)\right)^T\right\}_{u=0, v=0}^{W, H}$
  • where $E_t$ and $K$ are the camera extrinsics and intrinsics, respectively.
  • The 3D points then need to be connected into triangular faces. The simplest scheme forms two triangles from the four neighboring points on the pixel grid (the four corners of a pixel-sized square).
  • However, inaccurate depth can cause some of these triangles to stretch far beyond the true geometry. The authors use a two-stage filtering to remove such stretched faces.
  • First, the edge lengths of each triangle are checked; if any edge exceeds a threshold, the face is removed.
  • Then the angle between the triangle face and the camera ray is checked: the closer the ray is to being parallel to the surface, the smaller the projected area and the smaller the dot product between the face normal and the view direction, so such faces are filtered out
    $\mathcal{S}=\left\{\left(i_0, i_1, i_2\right) \mid n^T v>\delta_{sn}\right\}$
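
A condensed sketch of this fusion step, assuming $E$ is a world-to-camera extrinsic and $K$ a pinhole intrinsic matrix; the function name and thresholds are made up, and stitching the new patch to the existing mesh border is omitted, so this is not the authors' implementation.

```python
import numpy as np

def fuse_new_content(rgb, depth, mask, K, E, edge_thresh=0.1, dot_thresh=0.1):
    """Back-project the masked (newly generated) pixels, triangulate them on the
    pixel grid, and drop stretched faces via the two filters described above."""
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]

    # Back-projection: X_world = E^-1 (K^-1 [u, v, 1]^T * depth)
    rays = np.linalg.inv(K) @ np.stack([u, v, np.ones_like(u)]).reshape(3, -1).astype(float)
    cam = rays * depth.reshape(1, -1)
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])
    points = (np.linalg.inv(E) @ cam_h)[:3].T.reshape(H, W, 3)

    cam_center = (np.linalg.inv(E) @ np.array([0.0, 0.0, 0.0, 1.0]))[:3]

    faces = []
    for y in range(H - 1):
        for x in range(W - 1):
            quad = [(y, x), (y, x + 1), (y + 1, x), (y + 1, x + 1)]
            if not all(mask[p] for p in quad):
                continue  # only triangulate the newly generated region
            for tri in ((quad[0], quad[2], quad[1]), (quad[1], quad[2], quad[3])):
                p = np.array([points[i, j] for i, j in tri])
                # Filter 1: remove faces with any edge longer than the threshold.
                if np.linalg.norm(p - np.roll(p, 1, axis=0), axis=1).max() > edge_thresh:
                    continue
                # Filter 2: remove faces seen at a grazing angle (n^T v <= delta_sn).
                n = np.cross(p[1] - p[0], p[2] - p[0])
                view = cam_center - p.mean(axis=0)
                n_dot_v = abs(n @ view) / (np.linalg.norm(n) * np.linalg.norm(view) + 1e-9)
                if n_dot_v <= dot_thresh:
                    continue
                faces.append([i * W + j for i, j in tri])

    return points.reshape(-1, 3), rgb.reshape(-1, 3), np.array(faces)
```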

Two-stage viewpoint selection

  • If camera poses are chosen too arbitrarily, scene generation suffers from stretching and holes. The authors propose a two-stage viewpoint selection that picks each next camera pose from favorable positions and gradually fills in the empty regions.
  • In the generation stage, the algorithm creates the main body of the scene, including the overall layout and the furniture. The authors use predefined trajectories, i.e., a series of keyframes, each consisting of a prompt and a pose; views from different directions gradually cover the whole room.
    • The authors found that results are best when each new pose has only a small overlap with the previous views: new content is generated while still staying partially connected to the existing scene.
    • The next chunk of the trajectory is generated along a circular motion roughly centered at the origin (see the sketch after this list).
    • To create ceilings and floors, where no furniture is required, only the prompt needs to change, e.g., to 'floor'.
  • The completion stage deals with the holes left over from the generation stage: the camera is pointed at a hole and the missing content is generated.
    • The scene is discretized into 3D voxels, multiple poses are randomly sampled inside each voxel, and poses that lie too close to existing geometry are discarded; then, per cell, a pose that sees a large hole area is selected, and completion is performed from the selected poses.
  • Finally, Poisson surface reconstruction is applied to fill the remaining holes, making the mesh more complete and smooth.
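
As an illustration of the generation-stage trajectory, here is a minimal sketch that builds look-at camera extrinsics along a circle around the origin; the radius, height, number of keyframes, and viewing direction are made-up values, not the paper's settings.

```python
import numpy as np

def look_at(eye, target, up=np.array([0.0, 1.0, 0.0])):
    """Build a 4x4 world-to-camera extrinsic looking from `eye` toward `target`."""
    fwd = target - eye
    fwd = fwd / np.linalg.norm(fwd)
    right = np.cross(fwd, up)
    right = right / np.linalg.norm(right)
    down = np.cross(fwd, right)        # completes an orthonormal, right-handed frame
    R = np.stack([right, down, fwd])   # rows = camera axes expressed in world coords
    E = np.eye(4)
    E[:3, :3] = R
    E[:3, 3] = -R @ eye
    return E

def circular_keyframes(n=20, radius=1.5, height=1.2):
    """Poses on a circle roughly centered at the origin, looking outward (assumed)."""
    poses = []
    for k in range(n):
        a = 2 * np.pi * k / n
        eye = np.array([radius * np.cos(a), height, radius * np.sin(a)])
        target = np.array([2 * radius * np.cos(a), height, 2 * radius * np.sin(a)])
        poses.append(look_at(eye, target))
    return poses
```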

experiment

  • Comparison method

    • PureCLIPNeRF
    • Outpainting
    • Text2Light
    • Blockade
  • Evaluation index

    • CLIP score (CS): measures the semantic similarity between a rendered image and the text prompt; higher is better (a small sketch of this metric follows the results below)
    • $\mathrm{CLIP}\text{-}\mathrm{S}(\mathbf{c}, \mathbf{v})=w \cdot \max (\cos (\mathbf{c}, \mathbf{v}), 0)$
    • Inception score (IS): evaluates the realism of the generated images, capturing both diversity and recognizability; higher is better
    • Perceptual quality (PQ): generation quality rated in a user study
    • 3D Structure Completeness (3DS): structural completeness rated in a user study
  • Qualitative results

  • Quantitative results

    • For each scene, 60 images are rendered from novel viewpoints to compute the metrics.
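
For reference, a minimal sketch of the CLIP score from the formula above, using the Hugging Face CLIP API; the checkpoint and the weight $w = 2.5$ are assumptions (the paper's exact evaluation setup may differ).

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Commonly used public CLIP checkpoint (assumption; the paper may use another backbone).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str, w: float = 2.5) -> float:
    """CLIP-S(c, v) = w * max(cos(c, v), 0) for one rendered image and its prompt."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_feat = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_feat = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    cos = torch.nn.functional.cosine_similarity(img_feat, txt_feat).item()
    return w * max(cos, 0.0)
```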

Summary and limitations

  • The method generates a textured 3D mesh from text alone: a text-to-image model produces images, which are iteratively lifted into a 3D scene via depth alignment and fusion strategies; the core is appropriate viewpoint selection.
  • Limitations
    • This method will fail in some specific scenarios.
    • The threshold-based filtering cannot remove all stretched faces.
    • Some holes cannot be fully filled, so the Poisson reconstruction stage makes those regions overly smooth.
    • Materials and lighting are not decoupled.


Origin blog.csdn.net/u011459717/article/details/130933028