siMLPe: Human Motion Prediction

Paper: Back to MLP: A Simple Baseline for Human Motion Prediction
Code: https://github.com/dulucas/simlpe
Venue: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023
Affiliation: Grenoble, France

Summary

  • This paper addresses the problem of human motion prediction, i.e., predicting future body poses from a historically observed sequence.
  • However, state-of-the-art methods that provide good results rely on deep learning architectures of arbitrary complexity, such as RNNs, Transformers, or GCNs, often requiring multiple training stages and more than 2 million parameters.
  • In this paper, we show that, combined with a series of standard practices such as applying the discrete cosine transform (DCT), predicting residual joint displacements, and optimizing velocities as an auxiliary loss, a lightweight multilayer perceptron (MLP)-based network with only 140,000 parameters can exceed state-of-the-art performance.
  • Validation on the Human3.6M, AMASS, and 3DPW datasets shows that our method (siMLPe) consistently outperforms all other methods.
  • We hope that our simple approach can provide the community with a strong baseline and allow for a rethinking of the human motion prediction problem.

1 Introduction

  • Given a sequence of 3D human poses, the goal of the human motion prediction task is to predict the future poses of the sequence.
  • Predicting future human movements is at the core of many applications, including accident prevention in autonomous driving, tracking people, or human-machine interaction.
  • Due to the spatiotemporal nature of human motion, a common trend in the literature is to design models that can fuse spatiotemporal information.
  • Traditional methods mainly rely on hidden Markov models or Gaussian process latent variable models.
  • However, while these methods perform well under simple and periodic motion patterns, they fail significantly under complex motions.
  • In recent years, with the success of deep learning, various methods capable of processing sequence data have been developed based on different types of neural networks.
  • For example, some works use RNNs to model human motion, some are based on GCNs, and some are based on Transformers to fuse spatiotemporal information of motion sequences across human joints and time.
  • However, the architectures of these new methods are often not simple, and some of them require additional priors, which makes their networks difficult to analyze and modify.
  • Therefore, a question naturally arises: "Can we solve the human motion prediction problem with a simple network?"
  • To answer this question, we first tried the simple solution of repeating the last input pose and using it as the output prediction. As shown in Figure 1, this naive solution already achieves reasonable results, meaning that the last input pose is "close" to the future poses (Repeating Last-Frame).
    [Figure 1: MPJPE at 1000 ms on Human3.6M versus number of network parameters for different methods]
  • Inspired by this, we further train only a single fully connected layer to predict the residual between future poses and the last input pose, and obtain even better performance. This shows the potential of networks built from layers as simple as a fully connected one for human motion prediction (One-FC); a minimal sketch of both baselines follows this list.
  • Based on the above observations, we return to multi-layer perceptrons (MLPs) and build a simple yet effective network, called siMLPe, with only three components: fully connected layers, layer normalization, and transpose operations. The network architecture is shown in Figure 2.
    [Figure 2: siMLPe network architecture]
  • Notably, we find that even commonly used activation layers such as ReLU are not required, making our network a completely linear model except for layer normalization.
  • Despite its simplicity, siMLPe achieves strong performance when properly combined with three simple practices: applying the discrete cosine transform (DCT), predicting residual joint displacements, and optimizing velocity as an auxiliary loss.
  • siMLPe produces state-of-the-art (SOTA) performance on several standard datasets, including Human3.6M, AMASS, and 3DPW.
  • At the same time, siMLPe is lightweight and requires 20 to 60 times fewer parameters than previous state-of-the-art methods.
  • A comparison of siMLPe and previous methods can be found in Figure 1, which shows the mean per joint position error (MPJPE) at 1000 ms versus network complexity for different networks on Human3.6M. siMLPe achieves the best performance with high efficiency.
  • In summary, our contributions are as follows:
    (1) We show that human motion prediction can be modeled in a simple way without explicitly fusing spatial and temporal information. As an extreme example, a single fully connected layer can already achieve reasonable performance.
    (2) We propose siMLPe, a simple yet effective network for human motion prediction with only three components: fully connected layers, layer normalization, and transpose operations. It performs well on multiple benchmarks (Human3.6M, AMASS, and 3DPW), achieving state-of-the-art performance with far fewer parameters than existing methods.
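To make the two baselines above concrete, here is a minimal PyTorch sketch. The paper does not provide code for them, so the tensor layout (batch, frames, coordinates) and the names `repeat_last_frame` and `OneFC` are our own assumptions.

```python
import torch
import torch.nn as nn

def repeat_last_frame(x: torch.Tensor, n_future: int) -> torch.Tensor:
    """Zero-parameter baseline: repeat the last observed pose N times.
    x: observed poses, shape (B, T, C) with C = 3*K joint coordinates."""
    return x[:, -1:, :].expand(-1, n_future, -1)

class OneFC(nn.Module):
    """One fully connected layer predicting residuals w.r.t. the last pose."""
    def __init__(self, t_in: int, n_future: int):
        super().__init__()
        self.fc = nn.Linear(t_in, n_future)  # mixes frames, per coordinate

    def forward(self, x):  # x: (B, T, C)
        # Predict a residual for each future frame, then add the last pose.
        residual = self.fc(x.transpose(1, 2)).transpose(1, 2)  # (B, N, C)
        return x[:, -1:, :] + residual
```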

2. Related Work

  • Human motion prediction is a sequence-to-sequence task that takes past observed motion as input to predict future motion sequences.
  • Traditional motion prediction methods rely on nonlinear probabilistic models such as Markov models, Gaussian process dynamical models, and restricted Boltzmann machines.
  • These methods have proven effective at predicting simple movements, but ultimately struggle to predict complex and long-term movements.
  • With the advent of the deep learning era, human motion prediction has achieved great success using deep networks, including recurrent neural networks (RNNs), graph convolutional networks (GCNs), and Transformers, which are the main focus of this section.

2.1 Human motion prediction based on RNNs

  • Due to the inherently sequential structure of human motion, many works have applied recurrent models to 3D human motion prediction.
  • However, this type of method suffers from several inherent limitations of RNNs.
  • First, RNNs, as sequential models, are difficult to parallelize during training and inference.
  • Second, memory constraints prevent RNNs from exploiting information from more distant frames.
  • Some studies alleviate these problems by using RNN variants, sliding windows, convolutional models, or adversarial training, but their networks remain complex and have a large number of parameters.

2.2 Human motion prediction based on GCN

  • In order to better encode the spatial connectivity of human joints, recent work usually constructs human poses as graphs and uses graph convolutional networks (GCNs) for human motion prediction.

2.3 Human motion prediction based on attention

  • With the development of Transformers, some works have tried to use the attention mechanism for this task.

2.4 Summary

  • In summary, with the development of human motion prediction in recent years, RNN-, GCN-, and Transformer-based architectures have been well explored, and results have improved significantly.
  • Although these methods provide good results, their architectures become increasingly complex and difficult to train.
  • In this paper, we stick to a simple architecture and propose an MLP-based network.
  • We hope that our simple method can serve as a baseline for the community to rethink the problem of human motion prediction.

3. siMLPe

  • In this section, we formulate the problem and present the DCT transform in Section 3.1, the network architecture in Section 3.2, and the losses used for training in Section 3.3.
  • Given a sequence of past 3D human poses, our goal is to predict a sequence of future poses.
  • We denote the observed 3D human poses as x_{1:T} ∈ R^{T×C}, consisting of T consecutive poses, where the pose at frame t, x_t ∈ R^C, is a C-dimensional vector.
  • In this work, as in previous work, x_t contains the 3D joint coordinates at frame t, so C = 3 × K, where K is the number of joints.
  • Our task is to predict the N future motion frames x_{T+1:T+N} ∈ R^{N×C}.

3.1 Discrete Cosine Transform (DCT)

  • We adopt the DCT transform to encode temporal information.
  • More precisely, given an input motion sequence of T frames, the DCT matrix D ∈ R^{T×T} can be calculated as:

    $$D_{i,j} = \sqrt{\frac{2}{T}} \frac{1}{\sqrt{1+\delta_{i,1}}} \cos\left(\frac{\pi}{2T}(2j-1)(i-1)\right)$$

    where δ_{i,j} denotes the Kronecker delta:

    $$\delta_{i,j} = \begin{cases} 1, & i = j \\ 0, & i \neq j \end{cases}$$
  • The input after the discrete cosine transform is D(x_{1:T}) = D x_{1:T}.
  • We apply the inverse discrete cosine transform (IDCT), whose matrix is denoted D⁻¹, the inverse of D, to convert the output of the network back to the original pose representation.
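As a sanity check on the formulas above, the following sketch (ours, not the official code) builds the DCT matrix D and applies the DCT/IDCT pair; `get_dct_matrix` and the example sizes are illustrative.

```python
import numpy as np

def get_dct_matrix(T: int):
    """Build the T x T orthonormal DCT matrix D defined above."""
    i, j = np.meshgrid(np.arange(1, T + 1), np.arange(1, T + 1), indexing="ij")
    delta_i1 = (i == 1).astype(float)          # Kronecker delta δ_{i,1}
    D = np.sqrt(2.0 / T) / np.sqrt(1.0 + delta_i1) * np.cos(
        np.pi / (2.0 * T) * (2 * j - 1) * (i - 1)
    )
    return D, D.T                              # D is orthogonal: D^{-1} = D^T

D, D_inv = get_dct_matrix(50)                  # e.g. T = 50 input frames
x = np.random.randn(50, 66)                    # x_{1:T} with C = 3 x 22 joints
x_dct = D @ x                                  # D(x_{1:T}) = D x_{1:T}
assert np.allclose(D_inv @ x_dct, x)           # IDCT recovers the input
```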

3.2 Network architecture

  • Figure 2 shows the architecture of our network, which contains only three components: fully connected layers, transpose operations, and layer normalization.
  • For all fully connected layers, the input dimension equals the output dimension.
  • Formally, given an input sequence of 3D human poses x_{1:T} ∈ R^{T×C}, our network predicts the future pose sequence x_{T+1:T+N} ∈ R^{N×C}:

    $$x_{T+1:T+N} = \mathcal{D}^{-1}(\mathcal{F}(\mathcal{D}(x_{1:T})))$$

    where F denotes our network, and D and D⁻¹ denote the DCT and IDCT transforms.
  • After the DCT transform, we apply a fully connected layer that operates only on the spatial dimension of the transformed motion sequence D(x_{1:T}) ∈ R^{T×C}:

    $$z^0 = \mathcal{D}(x_{1:T}) W_0 + b_0$$

    where z⁰ ∈ R^{T×C} is the output of the fully connected layer, and W₀ ∈ R^{C×C} and b₀ ∈ R^C are its learnable parameters.
  • In practice, this is equivalent to transposing the features, applying a fully connected layer, and transposing the output features back, as shown in Figure 2.
  • Then, a series of m blocks is introduced, operating only on the temporal dimension, i.e., merging information only across frames.
  • Each block consists of a fully connected layer followed by layer normalization, expressed as:

    $$z^i = \mathrm{LN}(W_i z^{i-1} + b_i), \quad i \in [1, \dots, m]$$

    where z^i ∈ R^{T×C} is the output of the i-th MLP block, LN denotes the layer normalization operation, and W_i ∈ R^{T×T} and b_i ∈ R^T are the learnable parameters of the fully connected layer in the i-th MLP block.
  • Finally, similar to the first fully connected layer, we add another fully connected layer after the MLP blocks, operating only on the spatial dimension of the features, and then apply the IDCT transform to obtain the prediction:

    $$x_{T+1:T+N} = \mathcal{D}^{-1}(z^m W_{m+1} + b_{m+1})$$

    where W_{m+1} ∈ R^{C×C} and b_{m+1} ∈ R^C are the learnable parameters of the last fully connected layer.
  • Note that the lengths T and N need not be equal. When T > N, we take only the first N predicted frames; when T < N, we can pad the input sequence to length N by repeating the last frame.
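Putting Section 3.2 together, below is a minimal PyTorch sketch of siMLPe. The class names, the choice of applying LayerNorm over the spatial dimension, and the hyper-parameters (T = 50 input frames, C = 66, m = 48 blocks) are our reading of the description above, not the official implementation.

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """One MLP block: z^i = LN(W_i z^{i-1} + b_i), an FC layer across the T frames."""
    def __init__(self, t: int, c: int):
        super().__init__()
        self.fc = nn.Linear(t, t)    # W_i in R^{T x T}, b_i in R^T
        self.ln = nn.LayerNorm(c)    # LayerNorm over the spatial dim (our choice)

    def forward(self, z):            # z: (B, T, C)
        z = self.fc(z.transpose(1, 2)).transpose(1, 2)  # mix info across frames
        return self.ln(z)

class SiMLPe(nn.Module):
    """Spatial FC -> m temporal blocks -> spatial FC; DCT/IDCT applied outside."""
    def __init__(self, t: int = 50, c: int = 66, m: int = 48):
        super().__init__()
        self.fc_in = nn.Linear(c, c)   # W_0 in R^{C x C}, spatial only
        self.blocks = nn.Sequential(*[TemporalBlock(t, c) for _ in range(m)])
        self.fc_out = nn.Linear(c, c)  # W_{m+1} in R^{C x C}, spatial only

    def forward(self, x_dct):          # x_dct = D(x_{1:T}), shape (B, T, C)
        z = self.fc_in(x_dct)          # z^0
        z = self.blocks(z)             # z^1 ... z^m
        return self.fc_out(z)          # still in DCT space; apply IDCT afterwards
```

With these sizes the sketch has roughly 138K parameters, consistent with the ~140K reported. A full forward pass wraps it as IDCT(model(DCT(x))) and adds back the last input pose, realizing the residual prediction described in Section 3.3.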

3.3 Losses

  • As mentioned in Section 1, as shown in Figure 1, the last input pose is "close" to the future pose.
  • Based on this observation, instead of predicting absolute 3D poses from scratch, we have the network predict the residual between the future pose x_{T+t} and the last input pose x_T. This simplifies learning and improves performance.

Objective function

  • Our objective function L consists of two terms, L_re and L_v:

    $$\mathcal{L} = \mathcal{L}_{re} + \mathcal{L}_{v}$$
  • The goal of L_re is to minimize the L2 norm between the predicted motion x̂_{T+1:T+N} and the ground-truth motion x_{T+1:T+N}:

    $$\mathcal{L}_{re} = \lVert \hat{x}_{T+1:T+N} - x_{T+1:T+N} \rVert_2$$
  • The purpose of L_v is to minimize the L2 norm between the predicted velocity v̂_{T+1:T+N} and the ground-truth velocity v_{T+1:T+N}:

    $$\mathcal{L}_{v} = \lVert \hat{v}_{T+1:T+N} - v_{T+1:T+N} \rVert_2$$

    where v_{T+1:T+N} ∈ R^{N×C}, and v_t denotes the velocity at frame t, computed as the temporal difference v_t = x_{t+1} − x_t.
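A minimal sketch of this objective, assuming poses of shape (batch, N, C); the exact norm reduction and the boundary handling of the first velocity frame are our assumptions, as the text above only specifies the two L2 terms.

```python
import torch

def simlpe_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """pred, target: predicted / ground-truth poses x_{T+1:T+N}, shape (B, N, C).
    Returns L = L_re + L_v."""
    # L_re: per-frame L2 norm between predicted and ground-truth poses, averaged.
    l_re = torch.norm(pred - target, dim=-1).mean()
    # L_v: same on velocities v_t = x_{t+1} - x_t (finite differences).
    v_pred = pred[:, 1:] - pred[:, :-1]
    v_gt = target[:, 1:] - target[:, :-1]
    l_v = torch.norm(v_pred - v_gt, dim=-1).mean()
    return l_re + l_v
```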

4. Experiment

4.1 Dataset

  • Human3.6M:
    Human3.6M contains 7 actors performing 15 actions, with 32 joints annotated for each pose.
    We follow the testing protocol and use S5 as the test set, S11 as the validation set, and the others as the training set.
    Previous work has used different test sampling strategies, including 8 samples per action, 256 samples per action, or all samples in the test set.
    Since 8 samples are too few and taking all test samples cannot balance actions with different sequence lengths, we take 256 samples per action for testing and evaluate on 22 joints.
  • AMASS:
    AMASS is a collection of multiple motion capture datasets using a unified SMPL parameterization.
    We use AMASS-bmlrub as the test set and split the rest of the AMASS dataset into training and validation sets.
    We evaluate on 18 joints.
  • 3DPW:
    3DPW is a dataset containing both indoor and outdoor scenes.
    A pose is represented by 26 joints; we evaluate on 18 joints using the model trained on AMASS, in order to test generalization.

4.2 Evaluation indicators

  • This article uses the Mean Per Joint Position Error (MPJPE) on 3D joint coordinates as the evaluation metric; it is the most widely used metric for evaluating 3D pose error.
  • This metric computes the average L2 norm over joints between the predicted and ground-truth positions.
  • As in previous work, we ignore the global rotation and translation of poses and maintain a sampling rate of 25 FPS for all datasets.
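For reference, MPJPE as described above reduces to a few lines; the array layout (frames, joints, 3) and the function name are our own.

```python
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean Per Joint Position Error.
    pred, gt: (N, K, 3) predicted / ground-truth 3D joint coordinates."""
    # Per-joint L2 error, averaged over all joints and frames.
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```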
