CVPR 2022 Oral | PoseConv3D Open Source: A New Paradigm for Action Recognition Based on Human Pose

Author丨Kenny@zhihu (authorized)

Source丨https://zhuanlan.zhihu.com/p/493374779

Edit丨Gokushi Platform


Paper: https://arxiv.org/abs/2104.13586

Code: https://github.com/kennymckormick/pyskl

Introduction

(Figure: the PoseC3D framework)

Our previous work, PoseC3D [1], was accepted as an oral presentation at this year's CVPR. It is the first work to use a 3D-CNN for keypoint-sequence-based video understanding, and it achieves strong recognition performance; we hope it offers some inspiration for related work. In an earlier article (https://zhuanlan.zhihu.com/p/395588459), we described the method in detail, including its advantages over GCN-based approaches and its practical performance. Since then, I have carried out further research on skeleton-based action recognition. This article focuses on the following topics:

  • What is skeleton-based action recognition, and what is its significance?

  • Problems with existing skeleton-based action recognition solutions, and the advantages and disadvantages of PoseC3D.

  • Our latest open-source implementation, PYSKL [2], a codebase that supports both PoseC3D and GCN methods, along with a set of good practices.

Skeleton action recognition: definition and significance

Skeleton Action Recognition

Skeleton-based action recognition refers to video understanding based solely on temporal sequences of keypoints. As a concrete example: given a 300-frame video containing one person, represented with 17 2D keypoints (as defined by COCO), the input has shape 300 x 17 x 2. The keypoint sequence usually refers to human-body keypoints (elbows, wrists, knees, and so on), but the same approach clearly extends to other scenarios, such as recognizing expressions from facial keypoints or gestures from hand keypoints.
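To make this input format concrete, here is a minimal NumPy sketch (purely illustrative; the array names and random dummy data are ours, not part of any dataset):

```python
import numpy as np

T, V, C = 300, 17, 2  # frames, COCO keypoints, (x, y) coordinates

# Dummy keypoint sequence for a single person: shape (T, V, C) = (300, 17, 2).
# In practice these coordinates come from a 2D pose estimator such as HRNet.
skeleton = np.random.rand(T, V, C).astype(np.float32)
print(skeleton.shape)  # (300, 17, 2)
```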

Compared with RGB-based video understanding, skeleton-based action recognition has the following advantages: 1. As a representation closely tied to human actions, the skeleton is highly informative for video understanding; 2. Skeletons are a lightweight modality, so skeleton-based recognition methods are usually much lighter than methods based on other modalities; 3. Combined with a high-quality pose estimator, skeleton-based recognition can often achieve good accuracy and strong generalization with less training data.

At the same time, skeleton-based action recognition has the following disadvantages: 1. Not all actions can be recognized from keypoint sequences alone; some categories depend on context such as objects and scenes; 2. When high-quality keypoints are hard to obtain, the performance of skeleton-based recognition degrades substantially. In practice, one should choose the modality for video understanding according to the actual requirements.

Some problems of existing solutions, and the advantages and disadvantages of PoseC3D

Existing skeleton-based action recognition solutions fall mainly into two categories: GCN-based and CNN-based. This section discusses some of their problems.

Shared Issue: Skeleton Quality

As the input, the quality of the skeleton itself is extremely important to the final recognition accuracy, yet it has received relatively little attention. PoseC3D discusses this point; the main findings are: 1. Restricting ourselves to keypoints produced by pose estimators, using 2D pose estimation results as input is usually much better than using 3D pose estimation results or 2D-to-3D lifted results; 2. With the same kind of 2D pose as input, recognition accuracy varies with the quality of the pose estimator, but the differences are small. In PYSKL, we also use the 3D skeletons from Kinect and the 2D skeletons output by HRNet to train ST-GCN [3] and ST-GCN++ [2] (a simple variant of ST-GCN that we developed). With 2D pose as input, ST-GCN++, despite being a simple variant of ST-GCN, achieves top-three performance among all SOTA methods on three benchmarks (NTURGB+D XSub, NTURGB+D XView, NTURGB+D 120 XSet). Notably, although HRNet 2D pose wins on most of the evaluated benchmarks, it does not win on all of them: on NTURGB+D 120 XSub, for example, 2D pose performs worse than 3D pose. We believe the impact of skeleton type and quality on recognition performance still deserves much more study.

Another point worth noting: while we believe recognition accuracy increases monotonically with pose estimation quality, the relationship is by no means a simple linear one, as pointed out in PoseC3D. Moreover, we argue that in some cases even poor-quality pose estimates are sufficient for action recognition, as long as they contain patterns related to the target actions. For example, HRNet's pose estimation quality on the GYM dataset is actually not good, yet the estimated keypoints still support excellent results on the action recognition task.

GCN methods and their problems

As the mainstream approach to skeleton-based action recognition, GCN methods still have a series of practical problems that limit model performance and, to some extent, hinder fair comparison between methods. Taking the NTURGB+D dataset as an example, this section briefly describes these problems.

Preprocessing & Augmentation

Given a skeleton sequence, current GCN methods first preprocess it to obtain the model input. Preprocessing has two main parts: 1. In the spatial dimension, using the first frame as the reference, translate the skeleton so that its center point in the first frame lies at the origin, and rotate it so that the spine in the first frame aligns with the z-axis; 2. In the temporal dimension, to handle sequences of different lengths within a dataset, there are mainly the following solutions:

  • ST-GCN [3]: extends all sequences to the maximum length (the length of the longest sequence in the dataset) with zero padding.

  • AGCN [4]: extends all sequences to the maximum length with loop padding.

  • CTR-GCN [5]: performs no length processing during preprocessing; during data augmentation, a random crop extracts a subsequence, which is then interpolated to a fixed length.

The drawback of the first two schemes is that padding every sequence to the maximum length wastes computation. The third scheme can generate diverse training samples, but the interpolated skeletons deviate from the original data distribution, and a cropped subsequence may not cover the entire action.
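For concreteness, here is a minimal NumPy sketch of the three length-handling schemes above (our own illustrative code, not taken from any of the cited repositories; the crop-ratio range is an assumption):

```python
import numpy as np

def zero_pad(seq, max_len):
    """ST-GCN style: pad a (T, V, C) sequence with zero frames up to max_len."""
    out = np.zeros((max_len,) + seq.shape[1:], dtype=seq.dtype)
    out[:len(seq)] = seq
    return out

def loop_pad(seq, max_len):
    """AGCN style: repeat the sequence until it reaches max_len frames."""
    reps = -(-max_len // len(seq))  # ceiling division
    return np.concatenate([seq] * reps, axis=0)[:max_len]

def random_crop_interp(seq, out_len):
    """CTR-GCN style: crop a random subsequence, then resample it to out_len
    frames by linear interpolation over frame indices."""
    T = len(seq)
    crop_len = np.random.randint(max(1, T // 2), T + 1)  # crop-ratio range is assumed
    start = np.random.randint(0, T - crop_len + 1)
    crop = seq[start:start + crop_len]
    idx = np.linspace(0, crop_len - 1, out_len)
    lo = np.floor(idx).astype(int)
    hi = np.minimum(lo + 1, crop_len - 1)
    w = (idx - lo).reshape(-1, 1, 1)
    return (1 - w) * crop[lo] + w * crop[hi]
```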

In PYSKL, we improve on these practices. Specifically, we directly apply the Uniform Sampling from PoseC3D to GCN, sampling subsequences from the original skeleton sequence as input. This yields more diverse training samples (each of which still covers the entire action) and also enables test-time data augmentation.
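A minimal sketch of Uniform Sampling as we understand it from the PoseC3D paper (function and variable names are ours): the sequence is split into num_frames equal segments and one frame is drawn at random from each, so every sample spans the entire action.

```python
import numpy as np

def uniform_sample(seq, num_frames):
    """Split a (T, V, C) sequence into num_frames equal segments and draw one
    random frame from each segment (train-time behavior)."""
    T = len(seq)
    bounds = np.linspace(0, T, num_frames + 1).astype(int)
    idx = [np.random.randint(lo, hi) if hi > lo else min(lo, T - 1)
           for lo, hi in zip(bounds[:-1], bounds[1:])]
    return seq[idx]
```

At test time, one can instead pick a fixed frame per segment, or average predictions over several random draws, which is what makes test-time augmentation possible.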

Hyper Parameter Setting

In PYSKL, we also improve the hyperparameter settings for GCN training based on our experience training PoseC3D. The main changes are that we adopt a CosineAnnealing learning-rate schedule and use a stronger regularization term. With the improved hyperparameters, the performance of the GCN models improves substantially.
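As an illustration, a PyTorch setup along these lines might look as follows (the learning rate, weight-decay strength, and epoch count are placeholders, not the values used in PYSKL):

```python
import torch

# Stand-in for a GCN recognizer such as ST-GCN++; any nn.Module works here.
model = torch.nn.Linear(17 * 2, 60)

max_epochs = 16  # placeholder

# SGD with a stronger weight-decay regularization term (value is a placeholder).
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)

# Cosine-annealed learning rate over the full training run.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=max_epochs)

for epoch in range(max_epochs):
    # ... one training epoch over the skeleton data ...
    scheduler.step()
```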

CNN methods and their problems

CNN-based methods for skeleton action recognition fall into two categories: 2D-CNN and 3D-CNN. 2D-CNN methods such as PoTion [6] draw the keypoint sequence onto a single image via color coding and process it with a 2D-CNN. Their biggest problem is precisely this color coding: it compresses the temporal dimension and causes irreparable information loss.

(Figure: PoTion uses color coding to draw the keypoint sequence onto a single image)

As a 3D-CNN-based scheme, PoseC3D stacks keypoint heatmaps into a 3D voxel volume and processes it with a 3D-CNN (a sketch of this construction follows the list below). As a simple and clean solution, PoseC3D directly exploits the strong spatiotemporal modeling capability of 3D-CNNs and, while achieving good recognition accuracy, offers good robustness, scalability, and compatibility. At the same time, it has the following shortcomings:

  1. There is no model design specific to the characteristics of the skeleton modality, so there is still room to optimize recognition accuracy.

  2. Compared with GCN methods, it still requires more computation: with an R50-based 3D network, its compute cost is merely comparable to MS-G3D [7], a heavy GCN method, and exceeds that of lighter GCN methods.

  3. If the input is 3D keypoints, they must first be projected to 2D, which loses information. This deficiency may be remedied by follow-up work on Multi-View Projection + PoseC3D.
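For reference, here is a minimal sketch of the heatmap-volume construction mentioned above, as we read it from the paper (the spatial resolution and Gaussian width sigma are placeholder values; all names are ours):

```python
import numpy as np

def keypoints_to_volume(kpts, scores, H=56, W=56, sigma=0.6):
    """kpts: (T, V, 2) keypoint coordinates already scaled to the H x W grid;
    scores: (T, V) keypoint confidences.
    Returns a (V, T, H, W) stack of confidence-weighted Gaussian heatmaps,
    ready to be fed to a 3D-CNN."""
    T, V, _ = kpts.shape
    ys, xs = np.mgrid[0:H, 0:W]
    vol = np.zeros((V, T, H, W), dtype=np.float32)
    for t in range(T):
        for v in range(V):
            x, y = kpts[t, v]
            g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
            vol[v, t] = scores[t, v] * g
    return vol
```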

Open source implementation of PYSKL

Based on MMAction2 (https://github.com/open-mmlab/mmaction2), we developed PYSKL, a new codebase for skeleton-based action recognition. The first release supports three models: PoseC3D [1], ST-GCN [3], and ST-GCN++ [2]. PYSKL has the following characteristics:

  1. Complete models with solid implementations: PYSKL supports both 3D-CNN and GCN methods. For PoseC3D, the released models cover multiple datasets and backbone networks, and fully cover the Joint and Limb modalities for most datasets. For GCN-based methods, following the practice of AA-GCN [8], the released weights fully cover the four modalities Joint, Bone, Joint Motion, and Bone Motion. Users can easily reproduce the reported results with the released configs and weights. Meanwhile, PoseC3D and ST-GCN++, trained with our proposed good practices, achieve top-ranking performance on multiple benchmarks.

  2. Concise code: focusing on skeleton-based action recognition, PYSKL simplifies the codebase, retaining only the core functionality and removing redundant code. Its main directory contains fewer than 5,000 lines of code, less than one third of MMAction2.

  3. Easy to use: users can directly train and test with the pickle annotation files provided by PYSKL. We also provide tools for visualizing 2D/3D skeleton data.
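For example, an annotation pickle can be inspected with plain Python. The file name and field names below reflect our understanding of the PYSKL format and should be verified against the repository's documentation:

```python
import pickle

# File name is an example; PYSKL provides per-dataset annotation pickles.
with open('ntu60_hrnet.pkl', 'rb') as f:
    data = pickle.load(f)

# Assumed layout: 'split' maps split names to sample ids, and 'annotations'
# holds one dict per skeleton sequence.
anno = data['annotations'][0]
print(anno['label'])           # action class index
print(anno['keypoint'].shape)  # e.g. (num_persons, num_frames, num_keypoints, 2)
```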

(Figure: PoseC3D and ST-GCN++ implementations in PYSKL achieve good performance on NTURGB+D)

Link: https://github.com/kennymckormick/pyskl

References

[1] Revisiting skeleton-based action recognition: https://arxiv.org/abs/2104.13586

[2] PYSKL: https://github.com/kennymckormick/pyskl

[3] Spatial temporal graph convolutional networks for skeleton-based action recognition: https://scholar.google.com/citations%3Fview_op%3Dview_citation%26hl%3Den%26user%3DtAgSyxIAAAAJ%26citation_for_view%3DtAgSyxIAAAAJ%3Ad1gkVwhDpl0C

[4] Two-stream adaptive graph convolutional networks for skeleton-based action recognition: https://openaccess.thecvf.com/content_CVPR_2019/html/Shi_Two-Stream_Adaptive_Graph_Convolutional_Networks_for_Skeleton-Based_Action_Recognition_CVPR_2019_paper.html

[5] Channel-wise topology refinement graph convolution for skeleton-based action recognition: https://openaccess.thecvf.com/content/ICCV2021/html/Chen_Channel-Wise_Topology_Refinement_Graph_Convolution_for_Skeleton-Based_Action_Recognition_ICCV_2021_paper.html

[6] PoTion: Pose motion representation for action recognition: https://openaccess.thecvf.com/content_cvpr_2018/html/Choutas_PoTion_Pose_MoTion_CVPR_2018_paper.html

[7] Disentangling and unifying graph convolutions for skeleton-based action recognition: https://openaccess.thecvf.com/content_CVPR_2020/html/Liu_Disentangling_and_Unifying_Graph_Convolutions_for_Skeleton-Based_Action_Recognition_CVPR_2020_paper.html

[8] Skeleton-based action recognition with multi-stream adaptive graph convolutional networks: https://ieeexplore.ieee.org/abstract/document/9219176/
