GET3D paper notes (2): loss function and metrics

Metrics

To measure the ability of the unconditional generation model, the generated shape set $S_g$ must be compared against the reference shape set $S_r$, and the model's performance measured on several metrics.

Geometry metrics

reference:

  • https://blog.csdn.net/weixin_43882112/article/details/121073634

Primary metrics

COV

COV is a metric for evaluating the quality of 3D shape synthesis: it measures whether the generated shape set covers the reference shape set. Specifically, COV is the proportion of shapes in the reference set that are matched by at least one shape in the generated set. If every shape in the reference set has at least one match in the generated set, the COV value is 1; otherwise, some reference shapes have no match and the COV value is less than 1. COV can be combined with other metrics such as Chamfer Distance (CD) and Light Field Distance (LFD) to comprehensively evaluate the quality of 3D shape synthesis.

Q: How to calculate COV?
A: The formula for COV is as follows:

$$\operatorname{COV}\left(S_{g}, S_{r}\right)=\frac{\left|\left\{\operatorname{argmin}_{X \in S_{r}} D(X, Y) \mid Y \in S_{g}\right\}\right|}{\left|S_{r}\right|}$$

When calculating COV, each shape in the generated set must be matched against the reference set. Following the formula, the matching proceeds as follows:

  1. For each generated shape $Y$, calculate its distance $D(X, Y)$ to every shape $X$ in the reference set.
  2. Match $Y$ to the reference shape with the smallest distance (the $\operatorname{argmin}$ in the formula).
  3. After all generated shapes have been matched, count the distinct reference shapes that were matched at least once and divide by $|S_r|$.

A thresholded variant allows the strictness of shape matching to be controlled: a match is counted only when the nearest distance falls below a predefined threshold. In general, a smaller threshold improves matching accuracy but may lower the COV value; conversely, a larger threshold raises the COV value but may reduce matching accuracy.

Q: Does the COV metric require a threshold to be specified?
A: Not in the formula above, which uses pure nearest-neighbor ($\operatorname{argmin}$) matching. In the thresholded variant, the threshold directly affects both the matching accuracy and the COV value, so an appropriate value must be chosen for the specific application and requirements: a smaller threshold when higher matching accuracy is needed, a larger one when higher matching robustness is needed.
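As an illustration, here is a minimal NumPy sketch of the $\operatorname{argmin}$-based COV formula, assuming a precomputed pairwise distance matrix `D` (for example from $d_{CD}$ or $d_{LFD}$); the function name and layout are our own, not from the paper's code.

```python
import numpy as np

def coverage(D: np.ndarray) -> float:
    """COV(S_g, S_r) from a distance matrix D of shape (|S_r|, |S_g|),
    where D[i, j] is the distance between reference shape i and
    generated shape j (e.g. d_CD or d_LFD)."""
    # For each generated shape (column), find its nearest reference shape.
    nearest_ref = np.argmin(D, axis=0)
    # COV = fraction of reference shapes matched at least once.
    return np.unique(nearest_ref).size / D.shape[0]

# Example: 4 reference shapes, 6 generated shapes.
rng = np.random.default_rng(0)
print(coverage(rng.random((4, 6))))
```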

MMD

MMD (Minimum Matching Distance) is a measure for comparing the similarity of two shape sets. Specifically, MMD matches each reference shape with its most similar generated shape and averages the resulting distances.

When calculating MMD, each shape in the reference set is first compared against all shapes in the generated set, and the distances between them are calculated. Then, for each reference shape, the generated shape with the smallest distance to it is selected as its match. Ultimately, the MMD value is the average of the distances between all reference shapes and their matched generated shapes.

The MMD metric can be combined with other metrics such as Chamfer Distance (CD) and Light Field Distance (LFD) to comprehensively evaluate the quality of 3D shape synthesis.

Computing the MMD yields the distance between the generated shapes and the reference shapes, which evaluates their similarity. The MMD metric helps developers compare different 3D shape synthesis algorithms and optimize them for better synthesis quality.

Q: What is the formula for MMD?
A: The MMD formula is as follows:

$$\operatorname{MMD}\left(S_{g}, S_{r}\right)=\frac{1}{\left|S_{r}\right|} \sum_{X \in S_{r}} \min _{Y \in S_{g}} D(X, Y)$$

$S_g$ is the generated set and $S_r$ is the reference set. For each shape $X$ in the reference set, the term $\min_{Y \in S_g} D(X, Y)$ matches $X$ against all shapes $Y$ in $S_g$ and selects the most similar one. $D(X, Y)$ denotes the distance between $X$ and $Y$, which can be $d_{CD}$ or $d_{LFD}$.

When calculating MMD, each reference shape $X$ is first matched to the most similar generated shape $Y$; the distances over all matched pairs $(X, Y)$ are then averaged to give the MMD value.
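Under the same assumption of a precomputed distance matrix `D` with reference shapes as rows and generated shapes as columns, the MMD formula reduces to a short sketch:

```python
import numpy as np

def minimum_matching_distance(D: np.ndarray) -> float:
    """MMD(S_g, S_r) from a distance matrix D of shape (|S_r|, |S_g|):
    for each reference shape, take the distance to its closest
    generated shape, then average over the reference set."""
    return float(np.min(D, axis=1).mean())
```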

Secondary metrics

CD

$X \in S_g$ denotes a generated shape and $Y \in S_r$ a shape from the reference set. To calculate $d_{CD}$, the paper first samples $N = 2048$ points $X_{p} \in \mathbb{R}^{N \times 3}$ and $Y_{p} \in \mathbb{R}^{N \times 3}$ from shapes $X$ and $Y$ respectively.

Then calculate:

$$d_{\mathrm{CD}}\left(X_{p}, Y_{p}\right)=\sum_{\mathbf{x} \in X_{p}} \min _{\mathbf{y} \in Y_{p}}\|\mathbf{x}-\mathbf{y}\|_{2}^{2}+\sum_{\mathbf{y} \in Y_{p}} \min _{\mathbf{x} \in X_{p}}\|\mathbf{x}-\mathbf{y}\|_{2}^{2}$$

In this way, by converting the models into point clouds, the distance between the two models can be compared.
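A minimal NumPy sketch of this formula, assuming `Xp` and `Yp` are `(N, 3)` arrays of points sampled from the two shapes (the paper uses N = 2048); a brute-force pairwise computation, which is fine at this scale:

```python
import numpy as np

def chamfer_distance(Xp: np.ndarray, Yp: np.ndarray) -> float:
    """d_CD between two point clouds Xp (N, 3) and Yp (M, 3), per the
    formula above: sum of squared nearest-neighbor distances in both
    directions."""
    # Pairwise squared Euclidean distances via ||x||^2 + ||y||^2 - 2 x.y
    sq_dist = (np.sum(Xp**2, axis=1)[:, None]
               + np.sum(Yp**2, axis=1)[None, :]
               - 2.0 * Xp @ Yp.T)
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against numerical noise
    return float(sq_dist.min(axis=1).sum() + sq_dist.min(axis=0).sum())
```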

LFD

The complete definition of LFD comes from the paper: Distance measurement based on light field geometry and ray tracing, Optics Express.

Light Field Distance (LFD) is a method for computing the distance between 3D shapes. It captures both the geometric and the texture information of a shape, and is more complex than Chamfer Distance.

Q: What are the calculation formula and procedure of LFD?
A: The calculation formula of LFD is as follows:

$$d_{\mathrm{LFD}}\left(X_{p}, Y_{p}\right)=\frac{1}{|V|} \sum_{v \in V}\left(1-\exp \left(-\frac{1}{\left|I_{v}\right|} \sum_{i \in I_{v}} d\left(X_{i}, Y_{i}\right)\right)\right)$$

where $X_p$ and $Y_p$ are the point clouds of the two shapes, $V$ is the set of views used to render the shapes, $I_v$ is the set of pixels of the image rendered under view $v$, and $d(X_i, Y_i)$ is the Euclidean distance computed between shape points $X_i$ and $Y_i$.

The calculation process of LFD is as follows:

  1. For the given shapes $X_p$ and $Y_p$, render each of them from every view in $V$ as an image.
  2. For each view $v \in V$, encode the image $I_v$ as a feature vector. Specifically, LFD encodes images using Zernike moments and Fourier descriptors, feature vectors that capture the texture and geometric information of an image.
  3. For each pair of shape points $X_i$ and $Y_i$, calculate the distance $d(X_i, Y_i)$ between their feature vectors across all views.
  4. For each view $v \in V$, calculate the distances $d(I_{v,i}, I_{v,j})$ between all pixels in the rendered image, where $i, j \in I_v$.
  5. For each view $v \in V$, sum these distances and compute the average distance under that view, $\frac{1}{|I_v|}\sum_{i \in I_v} d(X_i, Y_i)$.
  6. Average the per-view values over all views, applying the exponential transformation, to obtain the LFD value.

The smaller the LFD value, the smaller the distance between the two shapes and the higher their similarity. LFD captures the geometry and texture information of shapes more accurately, so it is widely used in 3D shape synthesis, shape retrieval, and shape comparison.

LFD can therefore be regarded as two steps. The two shapes to be compared are observed from multiple views:

  1. For each view, calculate the distances between the encoded feature vectors of the pixels and take their average.
  2. Then average the exponentially transformed per-view distances over the different views.
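As a sketch of the aggregation step only: assuming the per-view feature distances $d(X_i, Y_i)$ have already been computed (e.g. from Zernike-moment and Fourier-descriptor encodings of the rendered images), the formula above combines them as follows. Names and structure here are illustrative, not from an official LFD implementation.

```python
import numpy as np

def lfd_from_view_distances(view_dists: list) -> float:
    """Combine per-view feature distances into the LFD value.

    view_dists[v] is a 1-D array of the distances d(X_i, Y_i) between
    the encoded features of the two shapes under view v (assumed to be
    precomputed from the rendered images)."""
    per_view = [1.0 - np.exp(-np.mean(d)) for d in view_dists]
    return float(np.mean(per_view))
```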

Q: What is the function of $1 - \exp(\cdot)$?
A: In the LFD formula, $1 - \exp(-x)$ rescales the average distance $x$ under each view. As the average distance grows, $\exp(-x)$ decays toward 0, so $1 - \exp(-x)$ approaches 1. This maps the average distance from $[0, \infty)$ to $[0, 1)$, so the LFD value lies between 0 and 1, which makes results easy to compare and analyze across different datasets and experiments.

At the same time, the exponential transformation makes LFD more sensitive to small distance changes, which better distinguishes the difference between two shapes. For example, when the distance is small, the value of $1 - \exp(-x)$ changes quickly, which helps LFD separate similar shapes more precisely.
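A quick numeric check of this mapping, showing fast change near 0 and saturation toward 1:

```python
import numpy as np

x = np.array([0.0, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0])
print(np.round(1 - np.exp(-x), 3))
# [0.    0.095 0.393 0.632 0.865 0.993 1.   ]
```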

Texture metrics

reference

To evaluate the quality of the generated textures, we employ the Fréchet Inception Distance (FID) metric, commonly used to evaluate the quality of 2D image synthesis. Specifically, for each class, we randomly sample camera positions from a predefined camera distribution and render 50k views (one view per shape) of the generated shapes; all images in the test set serve as the real set. We then encode these images with a pre-trained Inception v3 model, taking the output of the last pooling layer as the final encoding.

The formula is as follows:

$$\operatorname{FID}\left(S_{g}, S_{r}\right)=\left\|\boldsymbol{\mu}_{g}-\boldsymbol{\mu}_{r}\right\|_{2}^{2}+\operatorname{Tr}\left[\boldsymbol{\Sigma}_{g}+\boldsymbol{\Sigma}_{r}-2\left(\boldsymbol{\Sigma}_{g} \boldsymbol{\Sigma}_{r}\right)^{1 / 2}\right]$$

where $S_g$ denotes the generated image set, $S_r$ the set of real images, $\boldsymbol{\mu}_g$ and $\boldsymbol{\mu}_r$ the mean vectors of the generated and real image encodings, and $\boldsymbol{\Sigma}_g$ and $\boldsymbol{\Sigma}_r$ their respective covariance matrices. $\operatorname{Tr}$ is the trace operation.

Specifically, the calculation process of FID is as follows:

  1. Use the pre-trained Inception network to encode the images in $S_g$ and $S_r$, obtaining their feature vectors.
  2. Calculate $\boldsymbol{\mu}_g$ and $\boldsymbol{\Sigma}_g$, the mean vector and covariance matrix of the generated image encodings.
  3. Calculate $\boldsymbol{\mu}_r$ and $\boldsymbol{\Sigma}_r$, the mean vector and covariance matrix of the real image encodings.
  4. Calculate $\left\|\boldsymbol{\mu}_g-\boldsymbol{\mu}_r\right\|_{2}^{2}$, the squared Euclidean distance between the two mean vectors.
  5. Calculate $\boldsymbol{\Sigma}_g+\boldsymbol{\Sigma}_r-2\left(\boldsymbol{\Sigma}_g \boldsymbol{\Sigma}_r\right)^{1/2}$, the distance between the two covariance matrices.
  6. Take the trace of this matrix distance and add the mean term to obtain the final FID value.
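A minimal NumPy/SciPy sketch of steps 2 through 6, assuming the Inception encodings (step 1) are already available as `(n_images, d)` arrays; production implementations handle the matrix square root's numerical quirks more carefully:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(enc_g: np.ndarray, enc_r: np.ndarray) -> float:
    """FID from encodings of shape (n_images, d)."""
    mu_g, mu_r = enc_g.mean(axis=0), enc_r.mean(axis=0)      # steps 2-3
    sigma_g = np.cov(enc_g, rowvar=False)
    sigma_r = np.cov(enc_r, rowvar=False)
    mean_term = np.sum((mu_g - mu_r) ** 2)                   # step 4
    covmean = sqrtm(sigma_g @ sigma_r)                       # step 5
    if np.iscomplexobj(covmean):
        covmean = covmean.real                               # drop numerical noise
    return float(mean_term + np.trace(sigma_g + sigma_r - 2.0 * covmean))  # step 6
```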

Q: What is a covariance matrix?
A: As mentioned in the reference.

Q: How to understand the fifth step?
A: The fifth step takes the square root of the product of the two covariance matrices; it can be understood as follows.

This formula can be interpreted as calculating the distance between the statistical distributions of the generated and real samples. First, $\boldsymbol{\Sigma}_g$ and $\boldsymbol{\Sigma}_r$ are the covariance matrices of the generated image encodings and the real image encodings; the sum of these matrices represents the overall variance of the two distributions. Then $\left(\boldsymbol{\Sigma}_g \boldsymbol{\Sigma}_r\right)^{1/2}$ is the square root of the product of the two covariance matrices, which is called their product covariance matrix.

Subtracting $2\left(\boldsymbol{\Sigma}_g \boldsymbol{\Sigma}_r\right)^{1/2}$ from $\boldsymbol{\Sigma}_g + \boldsymbol{\Sigma}_r$ can be understood as computing the difference between the two distributions while taking their variance and covariance into account; the result represents the distance between the two distributions. The smaller the distance, the more similar the distributions and the better the generated images. The trace operation adds up all the eigenvalues of this matrix, which is one way to measure the magnitude of a matrix and can be used to measure the distance between two covariance matrices. Therefore, this formula measures the gap between generated and real images more accurately.

Q: Is the covariance matrix calculated from a set of variables? Then how are $\boldsymbol{\Sigma}_g$ and $\boldsymbol{\Sigma}_r$ calculated?
A: Yes, a covariance matrix is calculated from a set of variables. In FID, $\boldsymbol{\Sigma}_g$ and $\boldsymbol{\Sigma}_r$ are the covariance matrices of the generated image encodings and the real image encodings, respectively.

Each encoding vector can be regarded as a collection of multiple variables, because each dimension of the encoding is a variable. The generated image encodings and the real image encodings can therefore each be regarded as a set of samples of these variables, and for each set a covariance matrix can be calculated to reflect the relationships between the variables. Hence $\boldsymbol{\Sigma}_g$ and $\boldsymbol{\Sigma}_r$ are the covariance matrices computed from the sets of generated and real image encodings, respectively.

Q: Is the covariance matrix obtained by operating on the encoding vectors $\mathbf{v}_i$ in the set? If each vector has dimension $\mathbb{R}^d$, what is the dimension of the covariance matrix?
A: The covariance matrix $\boldsymbol{\Sigma}_g$ has dimension $d \times d$.
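A quick NumPy illustration of this: stacking n encoding vectors of dimension d and computing the covariance over the d variables yields a d × d matrix (the sizes here are made up):

```python
import numpy as np

n, d = 1000, 2048                     # e.g. 2048-dim Inception features
enc = np.random.randn(n, d)           # one encoding vector per image
sigma = np.cov(enc, rowvar=False)     # columns are the variables
print(sigma.shape)                    # (2048, 2048), i.e. d x d
```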

Q: How should FID be understood?
A: The Fréchet distance (also known as the Wasserstein-2 distance) can be used to calculate the distance between two distributions. (FID is the Fréchet distance between two distributions.)

The difference between two Gaussian distributions (synthetic and real images) is measured by the Frechet distance (also known as the Wasserstein-2 distance).
—— "GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium" (https://arxiv.org/abs/1706.08500), 2017.

The activations output by the Inception v3 model are used to summarize each image, and the resulting score is called the "Fréchet Inception Distance".

From Frechet Distance to FID

reference

The univariate case:

Mathematically, the Fréchet distance is used to calculate the distance between two "multivariate" normal distributions. For a "univariate" normal distribution, the Fréchet distance is equivalent to the Wasserstein-2 distance, also known as Earth Mover's Distance (EMD), whose formula is:

$$\mathrm{d}(X, Y)=\left(\mu_{X}-\mu_{Y}\right)^{2}+\left(\sigma_{X}-\sigma_{Y}\right)^{2}$$
We can understand it as the square of the difference of the means plus the square of the difference of the standard deviations.
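For example, a quick check with two univariate Gaussians (values chosen arbitrarily):

```python
# Frechet distance between N(0, 1^2) and N(1, 2^2):
mu_x, sigma_x = 0.0, 1.0
mu_y, sigma_y = 1.0, 2.0
d = (mu_x - mu_y) ** 2 + (sigma_x - sigma_y) ** 2
print(d)  # (0-1)^2 + (1-2)^2 = 2.0
```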

Then, when expanding to FID, the left half is obviously the square of the mean difference; how should we understand the right half?
$$\operatorname{FID}\left(S_{g}, S_{r}\right)=\left\|\boldsymbol{\mu}_{g}-\boldsymbol{\mu}_{r}\right\|_{2}^{2}+\operatorname{Tr}\left[\boldsymbol{\Sigma}_{g}+\boldsymbol{\Sigma}_{r}-2\left(\boldsymbol{\Sigma}_{g} \boldsymbol{\Sigma}_{r}\right)^{1 / 2}\right]$$

Look at the right half: it resembles the form $(a^{1/2}-b^{1/2})^2 = a + b - 2(ab)^{1/2}$, and each diagonal element of $a^{1/2}$ is somewhat like a standard deviation.

  1. First, the diagonal of a covariance matrix holds the variances (after taking the square root, they can be treated as standard deviations).
  2. Therefore, in the result of the formula inside $\operatorname{Tr}$, each diagonal element is roughly the square of a difference of standard deviations.
  3. And $\operatorname{Tr}$ sums the diagonal elements, so the trace adds up the squares of the differences of the standard deviations.

The form in point 3 is similar to the univariate FD formula, but not identical, because the square root of a matrix is not the element-wise square root of its entries. How to understand it further? We leave that aside for now.
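One case where the analogy is exact: when $\boldsymbol{\Sigma}_g$ and $\boldsymbol{\Sigma}_r$ are both diagonal, the matrix square root is element-wise on the diagonal, and the trace term reduces to exactly a sum of squared standard-deviation differences. A small sketch to verify (matrices chosen arbitrarily):

```python
import numpy as np
from scipy.linalg import sqrtm

sig_g = np.diag([1.0, 4.0, 9.0])      # variances -> stds 1, 2, 3
sig_r = np.diag([4.0, 9.0, 16.0])     # variances -> stds 2, 3, 4
trace_term = np.trace(sig_g + sig_r - 2.0 * sqrtm(sig_g @ sig_r).real)
std_term = np.sum((np.sqrt(np.diag(sig_g)) - np.sqrt(np.diag(sig_r))) ** 2)
print(trace_term, std_term)           # both 3.0
```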

FID-3D and FID-Ori

FID-3D and FID-Ori are two variants of the FID metric used to assess the difference between generated and real images; they differ in how the 2D images are produced. Their calculation is as follows:

For FID-Ori, 2D images are obtained directly with the neural volume rendering used by 3D-aware image synthesis methods.
For FID-3D, for baseline methods that do not output textured meshes, Marching Cubes is used to extract the geometry of their underlying neural fields. Then the intersection of each pixel ray with the extracted mesh is found, and the 3D position of the intersection is used to query the RGB value from the network. This way, the rendered image represents the underlying 3D shape more faithfully, taking into account the quality of both geometry and texture. Note that FID-3D and FID-Ori are identical for methods that directly generate textured 3D meshes, such as GET3D.
Because FID-3D uses three-dimensional geometric information, it measures the quality of the generated 3D shapes more comprehensively, whereas FID-Ori considers only the quality of the 2D images and therefore favors baseline methods that directly generate good 2D images.


Origin: blog.csdn.net/duoyasong5907/article/details/129119127