[Paper reading notes] BlazePose: On-device Real-time Body Pose tracking

Paper address: https://arxiv.org/abs/2006.10204

Paper summary:

  The method in this paper runs in real time on a mobile phone CPU, exceeding 30 fps on a Pixel 2 phone.
  BlazePose borrows its network structure from stacked architectures such as Hourglass, on the premise that an encoder-decoder structure learns well. The network structure is shown in the following figure. After a first encoder, a decoder generates heatmap and offset prediction branches that are used for supervision, and a second encoder then performs coordinate regression. During training, the heatmap and coordinate regression branches are trained together, but the gradients of the coordinate regression branch are not back-propagated into the backbone and heatmap layers that precede it. During inference, the heatmap and offset branches are discarded and only the coordinate output is retained. Along with the coordinates, a visibility score is predicted for each keypoint as a confidence measure.
  BlazePose is also a top-down prediction network. To predict joints quickly, a tracker exploits the continuity of the pose and of the person's bounding region between frames, so the detector does not need to run on every frame.

  When implementing this method, pay attention to the various techniques mentioned in the paper and to the way the dataset is built.

Introduction

  Heatmap prediction scales to multiple people, but for a single person its time and memory costs are relatively high. Direct regression of coordinates is computationally cheap, but the learned coordinates are less accurate and often fail to resolve the underlying ambiguity.
  For the network design, the authors observe that the stacked Hourglass architecture brings a large improvement in quality even with a small number of parameters, so the Hourglass idea, an encoder-decoder network structure, is used to predict heatmaps for all joints. A second encoder is then attached to regress the coordinates of all joints directly. The intuition behind this design is that the heatmap branch can be discarded during inference, which keeps the model sufficiently lightweight.

Inference pipeline

  The detector or tracker first provides a region containing the person, and the keypoint coordinates are then predicted within that region. The tracker predicts the keypoint coordinates, the presence of a person in the current frame, and a refined region of interest for the current frame. When the tracker indicates that no person is present, the detection network is run again on the next frame.
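
  The control flow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `run_pipeline`, `detect`, `track` and the `Region` tuple layout are hypothetical names chosen for this example, and the two models are passed in as plain callables.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Region = Tuple[float, float, float, float]  # hypothetical ROI: (cx, cy, size, rotation)
Keypoints = List[Tuple[float, float]]


def run_pipeline(
    frames: Iterable[object],
    detect: Callable[[object], Optional[Region]],
    track: Callable[[object, Region], Tuple[Keypoints, bool, Region]],
) -> List[Optional[Keypoints]]:
    """Run the detector only when no region is carried over from tracking."""
    results: List[Optional[Keypoints]] = []
    roi: Optional[Region] = None
    for frame in frames:
        if roi is None:
            roi = detect(frame)        # slow path: first frame or person lost
            if roi is None:
                results.append(None)   # nobody found in this frame
                continue
        keypoints, present, refined = track(frame, roi)
        if present:
            results.append(keypoints)
            roi = refined              # reuse the refined region on the next frame
        else:
            results.append(None)
            roi = None                 # forces a detector run on the next frame
    return results
```

  With this structure the detector cost is paid only on the first frame and after the person is lost; every other frame runs the tracker alone.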

Human body detector

  Most human body detectors rely on NMS (non-maximum suppression) as a final post-processing step to reduce the candidate boxes. However, NMS works well only for rigid objects with few degrees of freedom and breaks down for highly articulated targets like humans, for example people waving or hugging, because several ambiguous boxes exceed the IoU threshold. To work around the non-rigidity caused by posture, a relatively rigid body part, such as the face or torso, is detected instead.
  Experimental observations show that, for the network, the strongest signal for the torso position is the face. The face is therefore chosen as the feature to detect, and the face detector is based on BlazeFace. The dataset and experiments assume that the face is always visible, as in AR applications. The face detector also predicts additional person-specific alignment parameters: the midpoint between the hips, the size of the circle that circumscribes the entire person, and the incline (the angle of the line connecting the mid-shoulder and mid-hip points). The incline is used in later processing to rotate the image upright.
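
  The NMS problem described in this section can be reproduced with a few lines of code. The sketch below is a generic greedy NMS, not the detector from the paper; the box coordinates, scores and the 0.5 threshold are made-up values for illustration. Two hugging people produce full-body boxes that overlap heavily, so one of them is suppressed, while their face boxes do not overlap and both survive.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: keep a box unless it overlaps a higher-scoring kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Illustrative values: two people hugging, side by side.
bodies = [(0.0, 0.0, 10.0, 20.0), (3.0, 0.0, 13.0, 20.0)]
faces = [(3.0, 0.0, 6.0, 3.0), (7.0, 0.0, 10.0, 3.0)]
scores = [0.9, 0.8]

kept_bodies = nms(bodies, scores)  # the second person is suppressed
kept_faces = nms(faces, scores)    # both faces are kept
```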

Topology

  The topology used in this paper is shown in the figure below. In addition to the COCO keypoints, a minimal number of keypoints on the face, hands and feet is added so that the rotation, size and position of the regions of interest for subsequent models can be estimated.

Dataset

  Unlike heatmap-based approaches, the tracking-based solution in this paper requires an initial pose alignment. The authors therefore restrict the dataset to cases where either the whole person is visible or the hip and shoulder keypoints can be annotated with confidence.
  To make the model robust to heavy occlusions that do not occur in the dataset, the authors apply substantial occlusion-simulating augmentation.
  The training set consists of 60k images showing a single person or a few people in common poses, and 25k images of a single person performing fitness exercises. All annotations were made by humans.
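
  The notes do not say how the occlusion-simulating augmentation is implemented. One common way to simulate occlusion, shown here purely as an assumption, is to overwrite a random rectangle of the image while leaving the keypoint labels unchanged, so the network has to infer the hidden joints. The function name `occlude`, the `max_fraction` limit and the constant fill value are hypothetical choices for this sketch.

```python
import random
from typing import List

Image = List[List[int]]  # grayscale image as rows of pixel values


def occlude(image: Image, rng: random.Random,
            max_fraction: float = 0.5, fill: int = 0) -> Image:
    """Return a copy of image with one random rectangle overwritten by fill."""
    height, width = len(image), len(image[0])
    # Rectangle size is at most max_fraction of each image dimension (assumed limit).
    occ_h = rng.randint(1, max(1, int(height * max_fraction)))
    occ_w = rng.randint(1, max(1, int(width * max_fraction)))
    top = rng.randint(0, height - occ_h)
    left = rng.randint(0, width - occ_w)
    out = [row[:] for row in image]  # copy so the input stays untouched
    for y in range(top, top + occ_h):
        for x in range(left, left + occ_w):
            out[y][x] = fill
    return out
```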

Network structure design

  The pose network operates on the person crop aligned with the scheme from the human body detector described above. The pose estimation network is shown in the figure below. It combines heatmap, offset and coordinate regression; as shown in Figure 4, the heatmap and offset losses are used only during training, and the corresponding output layers are removed for inference.

  The heatmap serves as lightweight supervision for an embedding, which the regression encoder network then uses to predict coordinates. The design takes its cue from Hourglass: a small encoder-decoder heatmap network is stacked with a subsequent regression encoder network.
  Skip connections are used extensively, but the gradients of the regression encoder are not propagated back to the heatmap-trained features. The authors found that this not only improves the heatmap predictions but also substantially increases the accuracy of the coordinate regression.
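
  The effect of blocking the gradient can be seen in a toy scalar model with hand-derived gradients. This is only an illustration of the idea, not the BlazePose network: the single shared weight `w`, the head weights `a` and `b`, and the squared-error losses are all assumptions made for this example.

```python
from typing import Dict


def toy_gradients(w: float, a: float, b: float, x: float,
                  target_h: float, target_r: float,
                  stop_gradient: bool = True) -> Dict[str, float]:
    """Hand-derived gradients of loss_h + loss_r for a scalar two-head model.

    feature          f = w * x   (stands in for the heatmap-trained features)
    heatmap head     h = a * f,  loss_h = (h - target_h) ** 2
    regression head  r = b * f,  loss_r = (r - target_r) ** 2
    """
    f = w * x
    h = a * f
    r = b * f
    grad_a = 2.0 * (h - target_h) * f
    grad_b = 2.0 * (r - target_r) * f
    grad_w = 2.0 * (h - target_h) * a * x       # path through the heatmap head
    if not stop_gradient:
        grad_w += 2.0 * (r - target_r) * b * x  # path through the regression head
    return {"w": grad_w, "a": grad_a, "b": grad_b}


blocked = toy_gradients(0.5, 2.0, 3.0, 1.0, target_h=2.0, target_r=0.0)
unblocked = toy_gradients(0.5, 2.0, 3.0, 1.0, target_h=2.0, target_r=0.0,
                          stop_gradient=False)
```

  With the gradient blocked, the update of `w` depends only on the heatmap loss, while the regression head weight `b` is still trained. In a real framework the same effect is obtained with an operation such as `jax.lax.stop_gradient` in JAX or `Tensor.detach()` in PyTorch, applied to the features fed into the regression encoder.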

Alignment and occlusion enhancement

  A pose prior is an important part of the solution. During augmentation and data preparation for training, the authors deliberately limit the supported ranges of angle, scale and translation.

  Based on the detection stage or the keypoints of the previous frame, the person is aligned so that the midpoint between the hips lies at the center of the square image fed to the neural network.
  The rotation is estimated from the line L connecting the mid-hip and mid-shoulder points, and the image is rotated so that L is parallel to the y-axis. The scale is estimated so that all body points fit inside a square bounding box circumscribing the body. On top of this, 10% scale and shift augmentations are applied so that the tracker can handle body movement and distorted alignment between frames.
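
  The alignment steps above can be written down as a small geometric computation. The sketch below is one possible reading of the description, under these assumptions: image coordinates with the y-axis pointing down, an upright pose meaning the mid-shoulder point lies directly above the mid-hip point, and a square centered on the mid-hip point. The names `rotate_about` and `alignment` are hypothetical, and `keypoints` must be non-empty.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def rotate_about(p: Point, center: Point, theta: float) -> Point:
    """Rotate p around center by theta radians."""
    dx, dy = p[0] - center[0], p[1] - center[1]
    c, s = math.cos(theta), math.sin(theta)
    return (center[0] + c * dx - s * dy, center[1] + s * dx + c * dy)


def alignment(mid_hip: Point, mid_shoulder: Point,
              keypoints: List[Point]) -> Tuple[Point, float, float]:
    """Return (center, rotation in radians, square side) for the input crop."""
    vx, vy = mid_shoulder[0] - mid_hip[0], mid_shoulder[1] - mid_hip[1]
    # Rotation that maps the hip-to-shoulder line L onto the direction (0, -1),
    # i.e. parallel to the y-axis with the shoulders above the hips.
    theta = -math.pi / 2 - math.atan2(vy, vx)
    rotated = [rotate_about(p, mid_hip, theta) for p in keypoints]
    # Smallest square centered on the mid-hip point that contains every point.
    half = max(max(abs(x - mid_hip[0]), abs(y - mid_hip[1])) for x, y in rotated)
    return mid_hip, theta, 2.0 * half


# Illustrative values: a person lying horizontally with the head to the right.
points = [(5.0, 5.0), (8.0, 5.0), (10.0, 5.0), (2.0, 5.0)]
center, theta, side = alignment((5.0, 5.0), (8.0, 5.0), points)
```

  During training, the 10% scale and shift augmentation would then be applied on top of `side` and `center`.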

Experiments

  The test data consists of two manually annotated datasets of 1000 images each, with one or two people in every image. The first, called the AR dataset, contains a wide variety of human poses in the wild. The second contains only yoga and fitness poses.
  The baseline for comparison is OpenPose. For consistency, only the 17 keypoints of the COCO topology are evaluated, since they are the common subset of BlazePose and OpenPose. The evaluation metric is PCK@0.2: a keypoint counts as correctly detected if its 2D Euclidean error is smaller than 20% of the person's torso size.
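
  For reference, the metric can be computed as in the following sketch. The function name `pck` is hypothetical, and the torso size is passed in as a number because the notes do not say how it is measured; the coordinates in the example are made up.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def pck(pred: List[Point], gt: List[Point],
        torso_size: float, alpha: float = 0.2) -> float:
    """Fraction of keypoints whose 2D error is below alpha * torso_size."""
    threshold = alpha * torso_size
    correct = sum(
        1 for p, g in zip(pred, gt)
        if math.hypot(p[0] - g[0], p[1] - g[1]) < threshold
    )
    return correct / len(gt)


# Illustrative values: torso size 10, so the error threshold is 2 pixels.
ground_truth = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
predicted = [(1.0, 0.0), (10.0, 3.0), (0.0, 10.0), (14.0, 10.0)]
score = pck(predicted, ground_truth, torso_size=10.0)  # errors are 1, 3, 0, 4
```

  In this example two of the four keypoints fall within the threshold, so the score is 0.5.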

  The paper trains two models of different capacity, BlazePose Full (6.9 MFlop, 3.5M Params) and BlazePose Lite (2.7 MFlop, 1.3M Params). On the AR dataset, BlazePose is slightly behind OpenPose, but BlazePose Full outperforms OpenPose on the yoga and fitness dataset. Note also that, depending on the required quality, BlazePose on a single mid-tier mobile phone CPU runs 25 to 75 times faster than OpenPose on a 20-core desktop CPU.

  Sample results:

Origin blog.csdn.net/qq_19784349/article/details/111238350