VQ-VAE

This paper uses the structure of VQ-VAE to learn an encoding of pose features, and then uses the trained decoder and codebook to guide the learning of the pose estimation model, turning pose estimation into a classification problem: predicting which features in the codebook to use.

The following is a fairly general structure for a pose estimation model; most current algorithms do not stray far from it:

I wonder whether you have ever thought about this question: is the pose estimation model we train an Encoder, a Decoder, or an Encoder-Decoder structure?

For simplicity, the discussion here uses top-down algorithms. It is not hard to see that at least two perspectives both make sense:

  1. The entire model (Backbone, Neck, Head) is an Encoder that encodes the image into location information. That location information can take the form of concrete coordinates, a Heatmap, or discrete bins.

  2. The Backbone, which by common convention is the feature extractor, is an Encoder that maps the image into some feature space, and the Head then decodes those features into position information that humans can understand, so the whole model is an Encoder-Decoder structure.

The biggest difference between these two perspectives is that in one the encoding space is defined by, and understandable to, humans, while in the other it is learned by the model itself and is hard for humans to interpret or explain. At this stage our knowledge cannot prove or disprove which of these internal working principles the model actually follows, or whether it follows neither. Still, clearly defining parts of the model and putting more constraints on the feature space it learns is of great help when optimizing algorithms and training models.

In this article, let's study the CVPR 2023 paper "Human Pose as Compositional Tokens" together. The paper uses the structure of VQ-VAE to learn an encoding of pose features, and uses the trained decoder and codebook to guide the learning of the pose estimation model, turning pose estimation into a classification problem of predicting features in the codebook.

Paper link: https://arxiv.org/pdf/2303.11638.pdf

Open source address: https://github.com/Gengzigang/PCT/tree/main

Before getting into the core idea of the PCT paper, it is worth laying some groundwork with a brief introduction to VQ-VAE.

In fact, in recent years more and more pose estimation research has made use of generative models. For example, RLE, which I covered in [Paper Notes and Thoughts: Human Pose Regression with Residual Log-likelihood Estimation (ICCV 2021 Oral)](https://zhuanlan.zhihu.com/p/395521994), uses Normalizing Flows (RealNVP) to model the true distribution of the data and improve the accuracy of regression; another example is the diffusion models used by DiffPose and the several later works sharing that name; I have also read work that uses GANs or neural style transfer for data augmentation, to increase the amount of data or improve image resolution. This time, PCT brings in VQ-VAE.

I have to say that work built on generative models is more interesting to read, but the catch is that it comes with a lot of mathematical formulas and is relatively hard to follow. Even now some readers tell me they cannot make sense of RLE, and I keep thinking about writing another article that explains RLE in plain terms, avoiding the math as much as possible (a hole I am digging for myself).

AE

Before talking about VQ-VAE, we can introduce the most basic AutoEncoder (AE).

Building on the idea of AE, follow-up work such as DAE (Denoising AutoEncoder) soon extended it by artificially adding noise to the input image, so that the model learns to capture the key features better.
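As a reference point, here is a minimal AE sketch in PyTorch. The layer sizes and names are arbitrary and only for illustration; for a DAE you would feed a noisy input but still reconstruct the clean one.

import torch
import torch.nn as nn

class TinyAE(nn.Module):
    """A toy autoencoder: compress the input into a low-dimensional code, then reconstruct it."""
    def __init__(self, in_dim=34, code_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 64), nn.ReLU(), nn.Linear(64, in_dim))

    def forward(self, x):
        z = self.encoder(x)      # latent code
        return self.decoder(z)   # reconstruction

x = torch.rand(16, 34)                           # e.g. 17x2 pose coordinates, flattened
loss = nn.functional.mse_loss(TinyAE()(x), x)    # train by minimizing reconstruction error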

VAE

Let me add one more point here: the choice of a Gaussian distribution is actually deliberate. Its advantage is that the features are concentrated around 0, so the feature space can be dense, and when we do generation, a randomly sampled value is very likely to be meaningful.
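A minimal sketch of this sampling step (the reparameterization trick), assuming the Encoder outputs a mean and log-variance per latent dimension:

import torch

def sample_latent(mu, logvar):
    # z = mu + sigma * eps, with eps ~ N(0, I); keeps the sampling step differentiable
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

# At generation time we can sample z directly from N(0, I); because the latent codes
# are pulled toward this dense Gaussian prior, a random z is very likely to be meaningful.
z = torch.randn(1, 32)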

VQ-VAE

Interestingly, although VQ-VAE has "VAE" in its name, it is actually closer to an AE: the codebook of a trained VQ-VAE is fixed, so you cannot simply draw random samples and feed them to the Decoder for generation as with a VAE. Instead, the features in the codebook must be selected in some way and combined before the Decoder can use them.
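For intuition, here is a minimal sketch of the quantization step at the heart of VQ-VAE, assuming a codebook of V entries with dimension N (sizes are illustrative):

import torch

def quantize(z_e, codebook):
    # z_e: (M, N) encoder outputs; codebook: (V, N) learned embeddings
    dist = torch.cdist(z_e, codebook)   # (M, V) pairwise distances
    idx = dist.argmin(dim=1)            # index of the nearest codebook entry per feature
    return codebook[idx], idx           # quantized features, combined and fed to the Decoder

codebook = torch.randn(1024, 64)        # V=1024, N=64
z_q, idx = quantize(torch.randn(34, 64), codebook)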

I will not expand on the details here; interested readers can study them on their own.

Pose as Compositional Tokens (PCT)

With the basics in place, let's move on to Human Pose as Compositional Tokens. As the name suggests, this paper represents the human pose as a composition of Tokens. What are these "Tokens"? Connecting this with VQ-VAE, they are naturally features in the codebook.

So, loosely speaking, PCT can be seen as using the Backbone to predict M features, replacing each with its nearest feature in the codebook, and then handing them to the Head to predict the coordinates.

Naturally, the next question is: where do the features in the codebook come from? Let me show the structure diagram of AE again:

As an upgraded version of AE, VQ-VAE slots in here seamlessly. We only need to train a VQ-VAE network on pose coordinates to obtain a codebook, and the features in this codebook can be thought of as various poses. As long as the codebook is large enough, we can reconstruct the original pose by combining these features.

Since the input is coordinates rather than images, the paper chose MLP-Mixer for the Encoder, which can be loosely understood as a big MLP that does not overfit easily. Other choices would also work; just adjust according to the difficulty of your task.

The Decoder is likewise a stack of MLP-Mixer blocks, which uses the M features to reconstruct the 17x2 pose coordinates.
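For readers unfamiliar with MLP-Mixer, here is a rough sketch of a single block, which alternately mixes information across the token dimension and the channel dimension. Sizes and layer details are illustrative, not the paper's exact configuration.

import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    def __init__(self, num_tokens, dim, hidden=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(nn.Linear(num_tokens, hidden), nn.GELU(), nn.Linear(hidden, num_tokens))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):  # x: (batch, num_tokens, dim)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)  # mix across tokens
        x = x + self.channel_mlp(self.norm2(x))                                # mix across channels
        return x

x = torch.randn(2, 17, 64)               # e.g. 17 keypoints embedded into 64-dim features
y = MixerBlock(num_tokens=17, dim=64)(x)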

This network has nothing to do with images, so in principle the data augmentation available during training is much richer than when images are involved. Readers who have worked on Pose Lifting algorithms will know what I mean.

The loss function during training is as follows:
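In the usual VQ-VAE form (the exact weights and notation in the paper may differ slightly):

$$\mathcal{L} = \big\lVert \hat{G} - G \big\rVert_1 + \beta \,\big\lVert \mathbf{z}_e(G) - \mathrm{sg}[\mathbf{e}] \big\rVert_2^2$$

where $G$ is the ground-truth pose, $\hat{G}$ the Decoder's reconstruction, $\mathbf{z}_e(G)$ the Encoder output, $\mathbf{e}$ the matched codebook feature, $\mathrm{sg}[\cdot]$ the stop-gradient operator, and $\beta$ a weighting factor.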

As you can see, an L1 loss supervises the reconstructed coordinate values, and the second term is the commitment loss from VQ-VAE, whose purpose is to push the Encoder output as close as possible to the codebook features and to keep the Encoder's predictions from constantly jumping between different codebook entries.

Class Head

After training this pose version of VQ-VAE, what we really need are the codebook and the Decoder; the original Encoder can be thrown away. The next step is to train a new Encoder whose job is to encode the input image into the feature space the codebook lives in, after which we can happily look up the table and reconstruct.

Training a Backbone from scratch for this is actually quite troublesome, so the authors chose a simpler route: freeze a Backbone already trained with heatmap-based methods (MMPose has plenty of ready-made ones) and attach a lightweight Class Head after it to do the feature conversion.

Specifically, the feature map output by the Backbone usually has a high channel dimension, while the feature space learned by VQ-VAE only needs M features, so the channels are first reduced to M with a 1x1 convolution; the two-dimensional feature map is then flattened into one dimension and mapped to the target feature dimension with a fully connected layer. The approach is quite close to that of SimCC.
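A rough shape walkthrough of this conversion, with purely illustrative sizes (the real numbers depend on the backbone and config):

import torch
import torch.nn as nn

C, H, W, M, D = 768, 8, 6, 34, 64               # channels, spatial size, token count, token dim
feat = torch.randn(1, C, H, W)                  # backbone output feature map
feat = nn.Conv2d(C, M, kernel_size=1)(feat)     # 1x1 conv: reduce channels to M -> (1, M, H, W)
feat = feat.flatten(1)                          # straighten into one dimension -> (1, M*H*W)
feat = nn.Linear(M * H * W, M * D)(feat)        # fully connected layer to the target feature dimension
tokens = feat.reshape(1, M, D)                  # M token features of dimension D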

Finally, a classification layer classifies the M features (shape MxN). Assuming the codebook contains V features, the result is an MxV matrix of logits; taking softmax along the V dimension, just like an ordinary classification problem, gives the confidence of each feature for every codebook entry.

For the replacement step, the straightforward way would be to take the argmax of the MxV logits to get codebook indices and replace directly, but then gradients could not flow back during training. The authors therefore soften this step: the softmax result is multiplied by the codebook's feature matrix:

That is, a matrix product of (MxV) x (VxN) = (MxN), so the gradient can be backpropagated and the network trained end to end.
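A compact sketch of this soft replacement, with illustrative sizes:

import torch

M, N, V = 34, 64, 1024                          # token count, feature dim, codebook size
logits = torch.randn(M, V)                      # per-token classification logits
codebook = torch.randn(V, N)

weights = logits.softmax(dim=-1)                # (M, V) soft assignment over codebook entries
soft_feat = weights @ codebook                  # (M, V) x (V, N) -> (M, N), differentiable "lookup"
hard_feat = codebook[logits.argmax(dim=-1)]     # what a hard argmax replacement would give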

The calculation above is easier to follow alongside the author's open-source code:

# Head part
def forward(self, x, extra_x, joints=None, train=True):
    """Forward function."""
    
    if self.stage_pct == "classifier":
        batch_size = x[-1].shape[0]
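        # x[-1] is the backbone feature map; the conv layers below perform the channel
        # reduction (the 1x1-conv step described above)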
        cls_feat = self.conv_head[0](self.conv_trans(x[-1]))
        cls_feat = cls_feat.flatten(2).transpose(2,1).flatten(1)
        cls_feat = self.mixer_trans(cls_feat)
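        # reshape the flat features into M token features: (batch_size, token_num, feature_dim)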
        cls_feat = cls_feat.reshape(batch_size, self.token_num, -1)

        for mixer_layer in self.mixer_head:
            cls_feat = mixer_layer(cls_feat)
            
        cls_feat = self.mixer_norm_layer(cls_feat)
        cls_logits = self.cls_pred_layer(cls_feat)
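        # cls_logits: per-token classification logits over the codebook entries;
        # the top-1 logit of each token is kept as its confidence score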
        encoding_scores = cls_logits.topk(1, dim=2)[0]
        cls_logits = cls_logits.flatten(0,1)
        cls_logits_softmax = cls_logits.clone().softmax(1)
    else:
        ## code unrelated to the class head omitted ##

    ## code unrelated to the class head omitted ##
    
    output_joints, cls_label, e_latent_loss = \
        self.tokenizer(joints, joints_feat, cls_logits_softmax, train=train)
    
    if train:
        return cls_logits, output_joints, cls_label, e_latent_loss
    else:
        return output_joints, encoding_scores

# Tokenizer
def forward(self, joints, joints_feature, cls_logits, train=True):
    """Forward function. """
    if train or self.stage_pct == "tokenizer":
        ## code unrelated to the class head omitted ##
    else:
        bs = cls_logits.shape[0] // self.token_num
        encoding_indices = None
    
    if self.stage_pct == "classifier":
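        # soft quantization: softmax weights (bs*token_num, V) x codebook (V, N) -> (bs*token_num, N)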
        part_token_feat = torch.matmul(cls_logits, self.codebook)
    else:
        part_token_feat = torch.matmul(encodings, self.codebook)

    if train and self.stage_pct == "tokenizer":
        ## code unrelated to the class head omitted ##
    else:
        e_latent_loss = None
    
    # Decoder of Tokenizer, Recover the joints.
    part_token_feat = part_token_feat.view(bs, -1, self.token_dim)
    
    part_token_feat = part_token_feat.transpose(2,1)
    part_token_feat = self.decoder_token_mlp(part_token_feat).transpose(2,1)
    decode_feat = self.decoder_start(part_token_feat)

    for num_layer in self.decoder:
        decode_feat = num_layer(decode_feat)

    decode_feat = self.decoder_layer_norm(decode_feat)
    recoverd_joints = self.recover_embed(decode_feat)
    
    return recoverd_joints, encoding_indices, e_latent_loss

What I want to add is that, compared to simply softening and multiplying, there is an even slicker trick available here (sketched below with illustrative variable names):

soft_feat = cls_logits_softmax @ codebook                  # soft combination (differentiable)
hard_feat = codebook[cls_logits_softmax.argmax(dim=1)]     # hard codebook lookup (non-differentiable)
quantize = soft_feat + (hard_feat - soft_feat).detach()
# Forward pass: quantize takes the values of hard_feat, as usual.
# Backward pass: the detach()-ed term has zero gradient, so quantize receives the same
# gradient as soft_feat, i.e. the gradient is "copied" back onto the soft result.

This way, the features predicted by the Encoder are numerically replaced by the codebook features in the forward pass, while the gradient still flows through the predicted results. The benefit is that it removes the gap between the softened product and the real codebook entries: the soft product is not a true replacement, and the features still differ slightly in value. If the Decoder is strong enough this may not matter, but eliminating the gap certainly cannot hurt. If you have time, it would be worth verifying whether this helps model performance.

Experimental results


The results are good, but I noticed that the experiments in this paper use very strong Backbones such as Swin. I wonder whether the frozen-Backbone approach demands relatively high feature quality, so that a weak Backbone might not be good enough. As for freezing the Backbone... well, I smell a tight compute budget.

Another big selling point of this paper is its robustness to occlusion. Since the VQ-VAE is trained directly on poses, the learned codebook features can in theory be combined arbitrarily to produce all kinds of poses, so interference from the quality of the image itself is greatly reduced, as the performance on datasets such as OCHuman and CrowdPose shows.

The codebook features learned by VQ-VAE have very clear meanings, so they are easy to verify visually: if you adjust the combination of features fed to the Decoder, you can see the predicted pose change, and the change is local:

Epilogue

This article has mainly introduced a method that brings VQ-VAE into the pose estimation task to constrain the pose feature space. Incidentally, the official code for the paper is built on MMPose, and everyone is welcome to give MMPose a try~
