OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields Paper Intensive Reading


Summary

Real-time multi-person 2D pose estimation is a key component in enabling machines to understand people in images and videos. In this work, we propose a real-time method to detect the 2D poses of multiple people in an image. The proposed method uses a non-parametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image. This bottom-up system achieves high accuracy and real-time performance regardless of the number of people in the image. In previous work, PAFs and body part location estimates were refined simultaneously across training stages. We demonstrate that refining the PAFs alone, rather than refining both the PAFs and the body part locations, leads to a significant increase in both runtime performance and accuracy. We also present the first combined body and foot keypoint detector, based on an internally annotated foot dataset that we have publicly released. We show that the combined detector not only reduces inference time compared to running the two detectors sequentially, but also maintains the accuracy of each component individually. This work has culminated in the release of OpenPose, the first open-source real-time system for multi-person 2D pose detection, including body, foot, hand, and facial keypoints.

1 Introduction

In this paper, we consider a core component for obtaining a detailed understanding of people in images and videos: 2D pose estimation of the human body, or the problem of locating anatomical keypoints, or "parts". 2D pose estimation has mainly focused on finding the body parts of individuals. Inferring the poses of multiple people in an image presents a unique set of challenges. First, each image may contain an unknown number of people, at any position or scale. Second, interactions between people can induce complex spatial interference, due to contact, occlusion, or limb articulation, making the association of parts difficult. Third, runtime complexity tends to grow with the number of people in the image, making real-time performance a challenge.

A common approach is to run a person detector and perform single-person pose estimation for each detection. These top-down approaches directly leverage existing techniques for single-person pose estimation, but suffer from early commitment: if the person detector fails, as is prone to happen when people are in close proximity, there is no way to recover. Furthermore, their runtime is proportional to the number of people in the image: a single-person pose estimator is run for every detection. In contrast, bottom-up approaches are attractive because they offer robustness to early commitment and have the potential to decouple runtime complexity from the number of people in the image. However, bottom-up approaches do not directly use global contextual cues from other body parts and other people. Initial bottom-up approaches ([1], [2]) did not retain the efficiency gains, as the final parsing required expensive global inference, taking several minutes per image.

In this paper, we propose an efficient multi-person pose estimation method with competitive performance on several public benchmarks. We present the first bottom-up representation of association scores via Part Affinity Fields (PAFs), a set of 2D vector fields that encode the location and orientation of limbs over the image domain. We demonstrate that simultaneously inferring these bottom-up representations of detection and association encodes enough global context for a greedy parse to obtain high-quality results at a fraction of the computational cost.

An earlier version of this manuscript appeared in [3]. This version makes several new contributions. First, we demonstrate that PAF refinement is crucial for maximizing accuracy, while body part prediction refinement is much less important. We increase the network depth but remove the refinement stages for body parts (Sections 3.1 and 3.2). This refined network improves speed and accuracy by approximately 200% and 7%, respectively (Sections 5.2 and 5.3). Second, we present an annotated foot dataset with 15K human foot instances, which has been publicly released (Section 4.2), and we show that a combined model with body and foot keypoints can be trained, preserving the speed of the body-only model while maintaining its accuracy (Section 5.5). Third, we demonstrate the generality of our method by applying it to the task of vehicle keypoint estimation (Section 5.6). Finally, this work documents the release of OpenPose [4]. This open-source library is the first real-time system for multi-person 2D pose detection, including body, foot, hand, and facial keypoints (Section 4). We also include runtime comparisons with Mask R-CNN [5] and Alpha-Pose [6], demonstrating the computational advantage of our bottom-up approach (Section 5.3).

Figure 1: Top: multi-person pose estimation; body parts belonging to the same person are linked, including foot keypoints (big toe, little toe, and heel). Bottom left: Part Affinity Field (PAF) corresponding to the limb connecting the right elbow and wrist; color encodes orientation. Bottom right: a 2D vector in each pixel of every PAF encodes the position and orientation of the limb.

2. Related work

Single Person Pose Estimation

Traditional approaches to articulated human pose estimation combine local observations of body parts with inference over the spatial dependencies between them. Spatial models of articulated poses are either based on tree-structured graphical models [7], [8], [9], [10], [11], [12], [13], which parametrically encode the spatial relationships between adjacent parts following a kinematic chain, or on non-tree models [14], [15], [16], [17], [18], which augment the tree structure with additional edges to capture occlusion, symmetry, and long-range relationships. To obtain reliable local observations of body parts, convolutional neural networks (CNNs) have been widely used and have significantly improved the accuracy of body pose estimation [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32]. Tompson et al. [23] use a deep architecture with a graphical model whose parameters are learned jointly with the network. The work in [33] further uses a CNN to implicitly capture global spatial dependencies by designing networks with large receptive fields. The convolutional pose machine architecture proposed by Wei et al. [20] uses a multi-stage architecture based on the sequential prediction framework [34]; it iteratively incorporates global context to refine part confidence maps and preserves the multimodal uncertainty of previous iterations. Intermediate supervision is applied at the end of each stage to address the problem of vanishing gradients during training [35], [36], [37]. Newell et al. [19] also showed that intermediate supervision is beneficial in a stacked hourglass architecture. However, all of these methods assume a single person, where the location and scale of the person of interest are given.

Multi-Person Pose Estimation

For multi-person pose estimation, most methods [5], [6], [38], [39], [40], [41], [42], [43], [44] use a top-down strategy that first detects people and then independently estimates the pose of each person within each detected region. Although this strategy makes techniques developed for the single-person case directly applicable, it not only suffers from early commitment on person detection, but also fails to capture spatial dependencies across different people that require global reasoning. Some approaches have begun to consider interpersonal dependencies. Eichner et al. [45] extended pictorial structures to take into account a set of interacting people and depth ordering, but still required a person detector to initialize detection hypotheses. Pishchulin et al. [1] proposed a bottom-up approach that jointly labels part detection candidates and associates them with individuals, with pairwise scores regressed from the spatial offsets of detected parts. This approach does not rely on person detection; however, solving the proposed integer linear program over a fully connected graph is an NP-hard problem, so the average processing time for a single image is on the order of hours. Insafutdinov et al. [2] built on [1] with a stronger part detector based on ResNet [46] and image-dependent pairwise scores, and vastly improved the runtime with an incremental optimization approach, but the method still takes several minutes per image, with a limit of at most 150 part proposals. The pairwise representation used in [2] is an offset vector between every pair of body parts, which is difficult to regress precisely, so a separate logistic regression is required to convert the pairwise features into probability scores.

In earlier work [3], we proposed Part Affinity Fields (PAFs), a representation consisting of a set of flow fields that encodes unstructured pairwise relationships between body parts of a variable number of people. In contrast to [1] and [2], we can efficiently obtain pairwise scores from the PAFs without an additional training step. These scores are sufficient for a greedy parse to obtain high-quality results with real-time performance for multi-person estimation. Concurrent with this work, Insafutdinov et al. [47] further simplified their body part relationship graph for faster inference in the single-frame model and formulated articulated human tracking as spatiotemporal grouping of part proposals. More recently, Newell et al. [48] proposed associative embeddings, which can be viewed as tags representing the group each keypoint belongs to; keypoints with similar tags are grouped into individual people. The method in [49] detects individual keypoints and predicts their relative displacements, allowing a greedy decoding process to group keypoints into person instances. Kocabas et al. [50] proposed a pose residual network that takes keypoint and person detections as input and then assigns keypoints to detected person bounding boxes. Nie et al. [51] proposed to partition all keypoint detections using dense regressions from candidate keypoints to the centroids of people in the image.

In this work, we extend our earlier work [3]. We demonstrate that PAF refinement is critical and sufficient for high accuracy, and we remove the body part confidence map refinement while increasing the network depth. This results in a faster and more accurate model. We also present the first combined body and foot keypoint detector, created from an annotated foot dataset that we have publicly released. We demonstrate that combining the two detection problems not only reduces inference time compared to running them independently, but also maintains their respective accuracy. Finally, we introduce OpenPose, the first open-source library for real-time body, foot, hand, and facial keypoint detection.

3. Method

Figure 2: Overall pipeline. (a) Our method takes an entire image as input to a CNN, which jointly predicts (b) confidence maps for body part detection and (c) PAFs for part association. (d) A parsing step performs a set of bipartite matchings to associate body part candidates. (e) We finally assemble them into full-body poses for all people in the image.

Figure 2 illustrates the overall pipeline of our method. The system takes as input a color image of size w × h (Fig. 2a) and produces the 2D locations of anatomical keypoints for each person in the image (Fig. 2e). First, a feedforward network predicts a set of 2D confidence maps S of body part locations (Fig. 2b) and a set of 2D vector fields L of part affinity fields (PAFs), which encode the degree of association between parts (Fig. 2c). The set $\mathbf{S} = (\mathbf{S}_1, \mathbf{S}_2, \ldots, \mathbf{S}_J)$ has J confidence maps, one per part, where $\mathbf{S}_j \in \mathbb{R}^{w \times h}$, $j \in \{1 \ldots J\}$. The set $\mathbf{L} = (\mathbf{L}_1, \mathbf{L}_2, \ldots, \mathbf{L}_C)$ has C vector fields, one per limb, where $\mathbf{L}_c \in \mathbb{R}^{w \times h \times 2}$, $c \in \{1 \ldots C\}$. For clarity, we refer to part pairs as limbs, even though some pairs are not human limbs (for example, the face). Each image location in $\mathbf{L}_c$ encodes a 2D vector (Fig. 1). Finally, the confidence maps and PAFs are parsed by greedy inference (Fig. 2d) to output the 2D keypoints of all people in the image.

3.1. Network Architecture

Figure 3: Architecture of the multi-stage CNN. The first set of stages predicts PAFs L^t, while the last set predicts confidence maps S^t. The predictions of each stage, together with their corresponding image features, are concatenated for each subsequent stage. The convolutions of kernel size 7 from the original approach [3] are replaced by 3 layers of convolutions of kernel size 3, whose outputs are concatenated.

Our architecture, shown in Figure 3, iteratively predicts affinity fields that encode part-to-part association (shown in blue) and detection confidence maps (shown in beige). Following [20], the iterative prediction architecture refines the predictions over successive stages, t ∈ {1,...,T}, with intermediate supervision at each stage.

The network depth is increased relative to [3]. In the original approach, the network architecture included several 7x7 convolutional layers. In our current model, each 7x7 convolutional kernel is replaced by 3 consecutive 3x3 kernels, which preserves the receptive field while reducing computation: the number of operations for a 7x7 kernel is 2×7²−1 = 97, while for the 3 consecutive 3x3 kernels it is only 3×(2×3²−1) = 51. Furthermore, the output of each of the 3 convolutional kernels is concatenated, following an approach similar to DenseNet [52]. The number of nonlinearity layers is tripled, and the network can keep both lower-level and higher-level features. Sections 5.2 and 5.3 analyze the resulting improvements in accuracy and runtime speed, respectively.
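As an illustration, here is a minimal PyTorch sketch of such a triplet block; the module structure and activation choice are assumptions for illustration, not taken from the released OpenPose code:

```python
import torch
import torch.nn as nn

class ConvTriplet(nn.Module):
    """Three stacked 3x3 convolutions replacing one 7x7 convolution.

    The stack keeps the 7x7 receptive field with fewer operations
    (51 vs. 97 per output value) and triples the nonlinearities; the
    three outputs are concatenated, DenseNet-style [52].
    """
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.PReLU())
        self.conv2 = nn.Sequential(nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.PReLU())
        self.conv3 = nn.Sequential(nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.PReLU())

    def forward(self, x):
        y1 = self.conv1(x)
        y2 = self.conv2(y1)
        y3 = self.conv3(y2)
        return torch.cat([y1, y2, y3], dim=1)  # output has 3 * out_ch channels
```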

3.2. Simultaneous detection and association

The image is first analyzed by a CNN (initialized by the first 10 layers of VGG-19 [53] and fine-tuned), generating a set of feature maps F that is input to the first stage. At this stage, the network produces a set of part affinity fields (PAFs) $\mathbf{L}^1 = \phi^1(\mathbf{F})$, where $\phi^1$ refers to the CNN for inference at stage 1. In each subsequent stage, the predictions from the previous stage and the original image features F are concatenated and used to produce refined predictions,

$$\mathbf{L}^t = \phi^t(\mathbf{F}, \mathbf{L}^{t-1}), \quad \forall 2 \le t \le T_P, \tag{1}$$

where $\phi^t$ refers to the CNN for inference at stage t, and $T_P$ is the total number of PAF stages. After $T_P$ iterations, the process is repeated for the confidence map detection, starting from the most refined PAF prediction,

$$\mathbf{S}^{T_P} = \rho^t(\mathbf{F}, \mathbf{L}^{T_P}), \quad \forall t = T_P, \tag{2}$$

$$\mathbf{S}^t = \rho^t(\mathbf{F}, \mathbf{L}^{T_P}, \mathbf{S}^{t-1}), \quad \forall T_P < t \le T_P + T_C, \tag{3}$$

where $\rho^t$ refers to the CNN for inference at stage t, and $T_C$ is the total number of confidence map stages.
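As a rough sketch of this staged dataflow in Eqs. (1)-(3), assuming placeholder stage CNNs (e.g., stacks of the triplet blocks sketched in Section 3.1):

```python
import torch
import torch.nn as nn

class MultiStagePose(nn.Module):
    """Sketch of the staged PAF/confidence-map prediction of Eqs. (1)-(3).

    backbone, paf_stages, and cmap_stages are placeholder CNN modules.
    """
    def __init__(self, backbone, paf_stages, cmap_stages):
        super().__init__()
        self.backbone = backbone                       # VGG-19 front-end -> F
        self.paf_stages = nn.ModuleList(paf_stages)    # phi^1 ... phi^{T_P}
        self.cmap_stages = nn.ModuleList(cmap_stages)  # rho^{T_P} ... rho^{T_P + T_C}

    def forward(self, image):
        F = self.backbone(image)
        L = self.paf_stages[0](F)                      # L^1 = phi^1(F)
        paf_outs = [L]
        for phi in self.paf_stages[1:]:                # Eq. (1)
            L = phi(torch.cat([F, L], dim=1))
            paf_outs.append(L)
        S = self.cmap_stages[0](torch.cat([F, L], dim=1))  # Eq. (2)
        cmap_outs = [S]
        for rho in self.cmap_stages[1:]:               # Eq. (3)
            S = rho(torch.cat([F, L, S], dim=1))
            cmap_outs.append(S)
        return paf_outs, cmap_outs  # every stage kept for intermediate supervision
```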

This approach differs from [3], where both the PAF and confidence map branches were refined at every stage; as a result, the amount of computation per stage is halved. We empirically observe in Section 5.2 that refined affinity field predictions improve the confidence map results, while the converse does not hold. Intuitively, from the PAF channel output we can guess where the body parts are; but a set of body parts with no other information cannot be parsed into different people.

Figure 4: PAFs of the right forearm across stages. Although there is confusion between left and right body parts and limbs in the early stages, the estimates become increasingly refined in later stages through global inference.

Figure 4 shows the refinement of the affinity fields across stages. The confidence maps are predicted on top of the latest and most refined PAF predictions, resulting in barely noticeable differences between confidence map stages. To guide the network to iteratively predict PAFs of body parts in the first branch and confidence maps in the second branch, we apply a loss function at the end of each stage. We use an L2 loss between the estimated predictions and the groundtruth maps and fields. Here, we weight the loss functions spatially to address a practical problem: some datasets do not completely label all people. Specifically, the loss function of the PAF branch at stage $t_i$ and the loss function of the confidence map branch at stage $t_k$ are respectively:

$$f_{\mathbf{L}}^{t_i} = \sum_{c=1}^{C} \sum_{\mathbf{p}} \mathbf{W}(\mathbf{p}) \cdot \left\| \mathbf{L}_c^{t_i}(\mathbf{p}) - \mathbf{L}_c^{*}(\mathbf{p}) \right\|_2^2, \tag{4}$$

$$f_{\mathbf{S}}^{t_k} = \sum_{j=1}^{J} \sum_{\mathbf{p}} \mathbf{W}(\mathbf{p}) \cdot \left\| \mathbf{S}_j^{t_k}(\mathbf{p}) - \mathbf{S}_j^{*}(\mathbf{p}) \right\|_2^2, \tag{5}$$

where $\mathbf{L}_c^{*}$ is the groundtruth PAF, $\mathbf{S}_j^{*}$ is the groundtruth part confidence map, and W is a binary mask with W(p) = 0 when the annotation is missing at pixel p. The mask is used to avoid penalizing true positive predictions during training. The intermediate supervision at each stage addresses the vanishing gradient problem by replenishing the gradient periodically [20]. The overall objective is

$$f = \sum_{t=1}^{T_P} f_{\mathbf{L}}^{t} + \sum_{t=T_P+1}^{T_P+T_C} f_{\mathbf{S}}^{t}. \tag{6}$$
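A minimal sketch of the masked loss of Eqs. (4)-(6), assuming the stage outputs are collected as lists of tensors as in the sketch above:

```python
import torch

def masked_l2_loss(pred, target, mask):
    """Spatially weighted L2 loss of Eqs. (4)-(5).

    pred, target: (N, C, H, W) stage output and groundtruth maps/fields.
    mask: (N, 1, H, W) binary W(p), 0 where annotations are missing, so
    unlabeled people do not penalize true positive predictions.
    """
    return (mask * (pred - target) ** 2).sum()

def total_loss(paf_outs, cmap_outs, paf_gt, cmap_gt, mask):
    # Eq. (6): sum the intermediate losses over all PAF and confidence-map stages
    f = sum(masked_l2_loss(L, paf_gt, mask) for L in paf_outs)
    f += sum(masked_l2_loss(S, cmap_gt, mask) for S in cmap_outs)
    return f
```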

3.3. Confidence maps for part detection

To evaluate f_S in Eq. (6) during training, we generate the groundtruth confidence maps S* from the annotated 2D keypoints. Each confidence map is a 2D representation of the belief that a particular body part can be located at any given pixel. Ideally, if a single person appears in the image, a single peak should exist in each confidence map if the corresponding part is visible; if multiple people appear, there should be a peak corresponding to each visible part j of each person k.

We first generate individual confidence maps $\mathbf{S}_{j,k}^{*}$ for each person k. Let $\mathbf{x}_{j,k} \in \mathbb{R}^2$ be the groundtruth position of body part j of person k in the image. The value at location $\mathbf{p} \in \mathbb{R}^2$ in $\mathbf{S}_{j,k}^{*}$ is defined as

$$\mathbf{S}_{j,k}^{*}(\mathbf{p}) = \exp\left(-\frac{\|\mathbf{p} - \mathbf{x}_{j,k}\|_2^2}{\sigma^2}\right), \tag{7}$$

where σ controls the spread of the peak. The groundtruth confidence map to be predicted by the network is an aggregation of the individual confidence maps via a max operator,

$$\mathbf{S}_j^{*}(\mathbf{p}) = \max_k \mathbf{S}_{j,k}^{*}(\mathbf{p}). \tag{8}$$

We take the maximum of the confidence maps rather than their average so that the precision of nearby peaks remains distinct (the paper's inset figure contrasts the max and average aggregations of two nearby Gaussians). At test time, we predict the confidence maps and obtain body part candidates by performing non-maximum suppression.
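The following sketch generates the groundtruth confidence map of Eqs. (7)-(8) for one part type; array shapes and names are illustrative:

```python
import numpy as np

def gt_confidence_map(keypoints, height, width, sigma):
    """Groundtruth confidence map for one body part j, Eqs. (7)-(8).

    keypoints: list of (x, y) locations x_{j,k}, one per person k.
    Per-person Gaussians are aggregated with max (not average) so that
    the peaks of nearby people stay distinct.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    S = np.zeros((height, width), dtype=np.float32)
    for (x, y) in keypoints:
        d2 = (xs - x) ** 2 + (ys - y) ** 2
        S = np.maximum(S, np.exp(-d2 / sigma ** 2))  # Eq. (8): max operator
    return S
```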

3.4. Part affinity fields for part association

Figure 5: Part association strategies. (a) Body part detection candidates (red and blue dots) for two part types, and all connection candidates (gray lines). (b) Connection results using the midpoint representation (yellow dots): correct connections (black lines) and incorrect connections (green lines) that also satisfy the incidence constraint. (c) Results using PAFs (yellow arrows). By encoding position and orientation over the support of the limb, the PAF eliminates false associations.

Given a set of detected body parts (shown as red and blue dots in Figure 5a), how do we assemble them into the full-body poses of an unknown number of people? We need a confidence measure of the association for each pair of body part detections, i.e., that they belong to the same person. One possible way to measure the association is to detect an additional midpoint between each pair of parts on a limb and check its incidence between candidate part detections, as shown in Figure 5b. However, when people crowd together, as they are prone to do, these midpoints are likely to support false associations (shown as green lines in Figure 5b). Such false associations arise due to two limitations of the representation:

(1) it encodes only the position, not the orientation, of each limb; (2) it reduces the support region of a limb to a single point.

Part Affinity Fields (PAFs) address these limitations. They preserve both location and orientation information across the support region of the limb (as shown in Figure 5c). Each PAF is a 2D vector field for each limb, as shown in Fig. 1d. For each pixel in the area belonging to a particular limb, a 2D vector encodes the direction pointing from one part of the limb to the other. Each type of limb has a corresponding PAF joining its two associated body parts.

Consider a single limb as shown in the figure below. Let $\mathbf{x}_{j_1,k}$ and $\mathbf{x}_{j_2,k}$ be the groundtruth positions of body parts $j_1$ and $j_2$ from limb c of person k in the image. If a point p lies on the limb, the value at $\mathbf{L}_{c,k}^{*}(\mathbf{p})$ is a unit vector pointing from $j_1$ to $j_2$; for all other points, the vector is zero-valued.

To evaluate f_L in Eq. (6) during training, we define the groundtruth PAF, $\mathbf{L}_{c,k}^{*}$, at an image point p as

$$\mathbf{L}_{c,k}^{*}(\mathbf{p}) = \begin{cases} \mathbf{v} & \text{if } \mathbf{p} \text{ is on limb } c,k \\ \mathbf{0} & \text{otherwise.} \end{cases} \tag{9}$$

Here, $\mathbf{v} = (\mathbf{x}_{j_2,k} - \mathbf{x}_{j_1,k}) / \|\mathbf{x}_{j_2,k} - \mathbf{x}_{j_1,k}\|_2$ is the unit vector in the direction of the limb. The set of points on the limb is defined as those within a distance threshold of the line segment, i.e., those points p for which

$$0 \le \mathbf{v} \cdot (\mathbf{p} - \mathbf{x}_{j_1,k}) \le l_{c,k} \quad \text{and} \quad |\mathbf{v}_{\perp} \cdot (\mathbf{p} - \mathbf{x}_{j_1,k})| \le \sigma_l,$$

where the limb width $\sigma_l$ is a distance in pixels, the limb length is $l_{c,k} = \|\mathbf{x}_{j_2,k} - \mathbf{x}_{j_1,k}\|_2$, and $\mathbf{v}_{\perp}$ is the vector perpendicular to $\mathbf{v}$.

The groundtruth part affinity field averages the affinity fields of all people in the image,

$$\mathbf{L}_c^{*}(\mathbf{p}) = \frac{1}{n_c(\mathbf{p})} \sum_k \mathbf{L}_{c,k}^{*}(\mathbf{p}), \tag{10}$$

where $n_c(\mathbf{p})$ is the number of non-zero vectors at point p across all k people.
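A sketch of groundtruth PAF generation following Eqs. (9)-(10), vectorized with NumPy; names and shapes are illustrative:

```python
import numpy as np

def gt_paf(limbs, height, width, sigma_l):
    """Groundtruth PAF for one limb type c, following Eqs. (9)-(10).

    limbs: list of ((x1, y1), (x2, y2)) endpoint pairs, one per person k.
    Returns an (H, W, 2) field holding the limb's unit vector v on points
    inside its support region, averaged where several people overlap.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    L = np.zeros((height, width, 2), dtype=np.float32)
    count = np.zeros((height, width), dtype=np.int32)  # n_c(p)
    for (x1, y1), (x2, y2) in limbs:
        v = np.array([x2 - x1, y2 - y1], dtype=np.float32)
        length = np.linalg.norm(v)  # limb length l_{c,k}
        if length == 0:
            continue
        v /= length
        # on-limb test: 0 <= v.(p - x_j1) <= l_{c,k} and |v_perp.(p - x_j1)| <= sigma_l
        along = v[0] * (xs - x1) + v[1] * (ys - y1)
        across = np.abs(-v[1] * (xs - x1) + v[0] * (ys - y1))
        on_limb = (along >= 0) & (along <= length) & (across <= sigma_l)
        L[on_limb] += v
        count[on_limb] += 1
    hit = count > 0
    L[hit] /= count[hit][:, None]  # Eq. (10): average over overlapping people
    return L
```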

During testing, we measure the association between candidate part detections by computing the line integral over the corresponding PAF along the line segment connecting the candidate part locations. In other words, we measure the alignment of the predicted PAF with the candidate limb that would be formed by connecting the detected body parts. Specifically, for two candidate part locations $\mathbf{d}_{j_1}$ and $\mathbf{d}_{j_2}$, we sample the predicted part affinity field $\mathbf{L}_c$ along the line segment to measure the confidence in their association:

$$E = \int_{u=0}^{u=1} \mathbf{L}_c(\mathbf{p}(u)) \cdot \frac{\mathbf{d}_{j_2} - \mathbf{d}_{j_1}}{\|\mathbf{d}_{j_2} - \mathbf{d}_{j_1}\|_2} \, du, \tag{11}$$

where $\mathbf{p}(u)$ interpolates the positions of the two body parts $\mathbf{d}_{j_1}$ and $\mathbf{d}_{j_2}$,

$$\mathbf{p}(u) = (1-u)\,\mathbf{d}_{j_1} + u\,\mathbf{d}_{j_2}. \tag{12}$$

In practice, we approximate the integral by sampling and summing uniformly spaced values of u.
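A sketch of this sampled approximation of Eqs. (11)-(12), assuming the candidate locations fall inside the field:

```python
import numpy as np

def paf_score(paf, d1, d2, num_samples=10):
    """Association confidence E between two part candidates, Eqs. (11)-(12).

    paf: (H, W, 2) predicted part affinity field L_c for one limb type.
    d1, d2: (x, y) locations of the two candidate parts.
    """
    d1 = np.asarray(d1, dtype=float)
    d2 = np.asarray(d2, dtype=float)
    seg = d2 - d1
    norm = np.linalg.norm(seg)
    if norm == 0:
        return 0.0
    v = seg / norm  # unit vector of the candidate limb
    score = 0.0
    for u in np.linspace(0.0, 1.0, num_samples):
        x, y = d1 + u * seg                      # p(u), Eq. (12)
        score += paf[int(round(y)), int(round(x))] @ v
    return score / num_samples                   # sampled approximation of Eq. (11)
```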

3.5. Multi-person parsing using PAFs

Figure 6: Graph matching. (a) Original image with part detections. (b) K-partite graph. (c) Tree structure. (d) A set of bipartite graphs.

We perform non-maximum suppression on the detection confidence maps to obtain a discrete set of part candidate locations. For each part, we may have several candidates, due to multiple people in the image or false positives (Fig. 6b). These part candidates define a large set of possible limbs. We score each candidate limb using the line integral over the PAF, defined in Eq. (11). The problem of finding the optimal parse corresponds to a K-dimensional matching problem that is known to be NP-hard [54] (Fig. 6c). In this paper, we present a greedy relaxation that consistently produces high-quality matches. We speculate that the reason is that the pairwise association scores implicitly encode global context, due to the large receptive field of the PAF network.

Formally, we first obtain a set of body part detection candidates $\mathcal{D}_J$ for multiple people, where $\mathcal{D}_J = \{\mathbf{d}_j^m : \text{for } j \in \{1 \ldots J\}, m \in \{1 \ldots N_j\}\}$, with $N_j$ the number of candidates of part j, and $\mathbf{d}_j^m \in \mathbb{R}^2$ the location of the m-th detection candidate of body part j. These part detection candidates still need to be associated with other parts of the same person; in other words, we need to find the pairs of part detections that are in fact connected limbs. We define a variable $z_{j_1 j_2}^{mn} \in \{0, 1\}$ to indicate whether two detection candidates $\mathbf{d}_{j_1}^m$ and $\mathbf{d}_{j_2}^n$ are connected, and the goal is to find the optimal assignment for the set of all possible connections, $\mathcal{Z} = \{z_{j_1 j_2}^{mn} : \text{for } j_1, j_2 \in \{1 \ldots J\}, m \in \{1 \ldots N_{j_1}\}, n \in \{1 \ldots N_{j_2}\}\}$.

If we consider a single pair of parts $j_1$ and $j_2$ (e.g., neck and right hip) for the c-th limb, finding the optimal association reduces to a maximum weight bipartite graph matching problem [54]. This case is shown in Figure 5b. In this graph matching problem, the nodes of the graph are the body part detection candidates $\mathcal{D}_{j_1}$ and $\mathcal{D}_{j_2}$, and the edges are all the possible connections between pairs of detection candidates. Additionally, each edge is weighted by Eq. (11), the part affinity aggregate. A matching in a bipartite graph is a subset of edges chosen such that no two edges share a node. Our goal is to find a matching with maximum weight for the chosen edges,

$$\max_{\mathcal{Z}_c} E_c = \max_{\mathcal{Z}_c} \sum_{m \in \mathcal{D}_{j_1}} \sum_{n \in \mathcal{D}_{j_2}} E_{mn} \cdot z_{j_1 j_2}^{mn}, \tag{13}$$

$$\text{s.t.} \quad \forall m \in \mathcal{D}_{j_1}, \ \sum_{n \in \mathcal{D}_{j_2}} z_{j_1 j_2}^{mn} \le 1, \tag{14}$$

$$\forall n \in \mathcal{D}_{j_2}, \ \sum_{m \in \mathcal{D}_{j_1}} z_{j_1 j_2}^{mn} \le 1, \tag{15}$$

where $E_c$ is the overall weight of the matching for limb type c, $\mathcal{Z}_c$ is the subset of $\mathcal{Z}$ for limb type c, and $E_{mn}$ is the part affinity between parts $\mathbf{d}_{j_1}^m$ and $\mathbf{d}_{j_2}^n$ defined in Eq. (11). Eqs. (14) and (15) enforce that no two edges share a node, i.e., that no two limbs of the same type (e.g., left forearm) share a part. We can use the Hungarian algorithm [55] to obtain the optimal matching.
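As an illustration, the bipartite matching for one limb type can be solved with SciPy's Hungarian implementation; `paf_score` is the sampling function sketched in Section 3.4, and names are placeholders:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_limb(cands_j1, cands_j2, paf):
    """Maximum-weight bipartite matching for one limb type, Eqs. (13)-(15).

    cands_j1, cands_j2: candidate (x, y) locations of the two part types.
    Builds the weight matrix E_mn with paf_score() and solves the
    assignment exactly with the Hungarian algorithm.
    """
    E = np.array([[paf_score(paf, d1, d2) for d2 in cands_j2]
                  for d1 in cands_j1])
    rows, cols = linear_sum_assignment(-E)  # negate: maximize total affinity
    # keep only positively scored connections; z_mn may also remain 0
    return [(m, n, E[m, n]) for m, n in zip(rows, cols) if E[m, n] > 0]
```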

When it comes to finding the full-body poses of multiple people, determining Z is a K-dimensional matching problem. This problem is NP-hard [54], and many relaxations exist. In this work, we add two relaxations to the optimization, specialized to our domain. First, we choose a minimal number of edges to obtain a spanning tree skeleton of the human pose, rather than using the complete graph, as shown in Figure 6c. Second, we further decompose the matching problem into a set of bipartite matching subproblems, and determine the matching in adjacent tree nodes independently, as shown in Figure 6d. We show detailed comparison results in Section 5.1, which demonstrate that minimal greedy inference well approximates the global solution at a fraction of the computational cost. The reason is that the relationship between adjacent tree nodes is modeled explicitly by the PAFs, while internally the relationships between non-adjacent tree nodes are modeled implicitly by the CNN. This property arises because the CNN is trained with a large receptive field, and the PAFs of non-adjacent tree nodes also influence the predicted PAF.

With these two relaxations, the optimization simply decomposes into

$$\max_{\mathcal{Z}} E = \sum_{c=1}^{C} \max_{\mathcal{Z}_c} E_c. \tag{16}$$

We therefore obtain the limb connection candidates for each limb type independently using Eqs. (13)-(15). With all limb connection candidates, we can assemble the connections that share the same part detection candidates into the full-body poses of multiple people. Our optimization scheme over the tree structure is orders of magnitude faster than the optimization over the fully connected graph [1], [2].

Figure 7: Importance of redundant PAF connections. (a) Two different people are wrongly merged due to a wrong neck-nose connection. (b) The right ear-shoulder connection is more reliable, avoiding the wrong nose-neck connection.

Our current model also incorporates redundant PAF connections (e.g., between ears and shoulders, wrists and shoulders, etc.). This redundancy particularly improves the accuracy in crowded images, as shown in Figure 7. To handle these redundant connections, we slightly modify the multi-person parsing algorithm. While the original approach started from a root component, our algorithm sorts all possible pairwise connections by their PAF score. If a connection tries to connect two body parts that have already been assigned to different people, the algorithm recognizes that this would contradict a PAF connection with higher confidence, and the current connection is then ignored.
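A simplified sketch of this greedy, confidence-sorted parsing (limb-type bookkeeping omitted for brevity; names are illustrative):

```python
def assemble_people(connections):
    """Greedy parsing sketch for redundant PAF connections.

    connections: iterable of (score, part_a, part_b) over all limb types,
    where part_a/part_b are unique detection-candidate ids.
    """
    person_of = {}  # candidate id -> person index
    people = []     # person index -> set of candidate ids
    for score, a, b in sorted(connections, reverse=True):
        pa, pb = person_of.get(a), person_of.get(b)
        if pa is None and pb is None:   # start a new person from this connection
            people.append({a, b})
            person_of[a] = person_of[b] = len(people) - 1
        elif pb is None:                # extend person pa with part b
            people[pa].add(b)
            person_of[b] = pa
        elif pa is None:                # extend person pb with part a
            people[pb].add(a)
            person_of[a] = pb
        # else: both parts already assigned; if pa != pb, this lower-confidence
        # connection would merge two people and is ignored
    return people
```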

4. OpenPose

A growing number of computer vision and machine learning applications require 2D human pose estimation as input to their systems [56], [57], [58], [59], [60], [61], [62]. To help the research community boost their work, we have publicly released OpenPose [4], the first real-time multi-person system. See Figure 8 for an example of the whole system.

Figure 8: Output of OpenPose: real-time detection of body, foot, hand, and facial keypoints. OpenPose is robust against occlusions, including during human interaction.

4.1. System

Available 2D body pose estimation libraries, such as Mask R-CNN [5] or Alpha-Pose [6], require their users to implement most of the pipeline: their own frame reader (e.g., video, image, or camera stream), a display to visualize the results, output file generation (e.g., JSON or XML files), etc. In addition, existing facial and body keypoint detectors are not combined, requiring a different library for each purpose. OpenPose overcomes all of these problems. It can run on different platforms, including Ubuntu, Windows, Mac OSX, and embedded systems (e.g., Nvidia Tegra TX2). It also provides support for different hardware, such as CUDA GPUs, OpenCL GPUs, and CPU-only devices. Users can select an input from images, video, webcam, or IP camera streams. They can also choose whether to display the results or save them to disk, enable or disable each detector (body, foot, face, and hand), enable pixel coordinate normalization, control how many GPUs are used, skip frames for faster processing, etc.
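For reference, a minimal Python usage sketch based on the examples shipped with the OpenPose repository; the exact binding API (e.g., `op.VectorDatum`) varies across versions, and the paths are placeholders:

```python
# Minimal usage sketch following OpenPose's Python tutorial examples.
import cv2
import pyopenpose as op

params = {"model_folder": "models/", "face": True, "hand": True}
wrapper = op.WrapperPython()
wrapper.configure(params)
wrapper.start()

datum = op.Datum()
datum.cvInputData = cv2.imread("input.jpg")
wrapper.emplaceAndPop(op.VectorDatum([datum]))

print(datum.poseKeypoints)                 # (num_people, 25, 3): x, y, confidence
cv2.imwrite("rendered.jpg", datum.cvOutputData)
```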

OpenPose consists of three distinct modules: (a) body+foot detection, (b) hand detection [63], and (c) face detection. The core block is the combined body+foot keypoint detector (Section 4.2). It can alternatively use the original body-only models [3] trained on the COCO and MPII datasets. Based on the output of the body detector, facial bounding box proposals can be roughly estimated from some body part locations (in particular the ears, eyes, nose, and neck). Similarly, hand bounding box proposals are generated from the arm keypoints. This methodology inherits the problems of the top-down approaches discussed in Section 1. The hand keypoint detector algorithm is explained in further detail in [63], while the facial keypoint detector is trained in the same fashion as the hand keypoint detector. The library also includes 3D keypoint pose detection, performed by 3D triangulation with nonlinear Levenberg-Marquardt refinement [64] over the results of multiple synchronized camera views.

OpenPose outperforms all state-of-the-art methods in inference time while preserving high-quality results. It is able to run at about 22 FPS on a machine equipped with an Nvidia GTX 1080 Ti while maintaining high accuracy (Section 5.3). OpenPose has already been used by the research community for many vision and robotics topics, such as person re-identification [56], GAN-based video retargeting of human faces [57] and bodies [58], human-computer interaction [59], 3D pose estimation [60], and 3D human mesh model generation [61]. In addition, the OpenCV library [65] has included OpenPose and our PAF-based network architecture in its Deep Neural Network (DNN) module.

4.2. Extended foot keypoint detection

Figure 9: Foot keypoint analysis. (a) Foot keypoint annotations, consisting of the big toe, little toe, and heel. (b) Body-only model example in which the right ankle is incorrectly estimated. (c) Analogous body+foot model example, where the foot information helps predict the right ankle location.

Existing human pose datasets ([66], [67]) contain limited body part types. The MPII dataset [66] annotates ankles, knees, hips, shoulders, elbows, wrists, necks, torsos, and head tops, while COCO [67] also includes some facial keypoints. For both datasets, foot annotations are limited to the ankle position. However, graphics applications such as avatar retargeting or 3D human shape reconstruction ([61], [68]) require foot keypoints such as the big toe and heel. Without foot information, these approaches suffer from problems such as the candy-wrapper effect, floor penetration, and foot skating. To address these issues, a small subset of foot instances from the COCO dataset was labeled using the Clickworker platform [69]. It is split into 14K annotations from the COCO training set and 545 from the validation set. A total of 6 foot keypoints were labeled (see Figure 9a). We considered the 3D coordinates of the foot keypoints rather than the surface position. For instance, for the exact toe position, we labeled the area between the connection of the nail and skin, and also took depth into account by labeling the center of the toe rather than the surface.

Figure 10: Keypoint annotation configuration for the 3 datasets.

Using our dataset, we train a foot keypoint detection algorithm. A naive foot keypoint detector could be built by using a body keypoint detector to generate foot bounding box proposals and then training a foot detector on top of them. However, this approach suffers from the top-down problems described in Section 1. Instead, the same architecture previously described for body estimation is trained to predict both body and foot locations. Figure 10 shows the keypoint distribution of the three datasets (COCO, MPII, and COCO+foot). The body+foot model also incorporates an interpolated point between the hips that allows the legs to be connected even when the upper torso is occluded or out of the image. We find evidence that foot keypoint detection implicitly helps the network to more accurately predict some body keypoints, in particular leg keypoints such as the ankle locations. Figure 9b shows an example in which the body-only network was not able to predict the ankle location. By including foot keypoints during training, while maintaining the same body annotations, the algorithm properly predicts the ankle location in Figure 9c. We quantitatively analyze the accuracy difference in Section 5.5.

Paper links

2017: https://arxiv.org/pdf/1611.08050.pdf

2019: https://arxiv.org/abs/1812.08008

Code link

GitHub - CMU-Perceptual-Computing-Lab/openpose: OpenPose: Real-time multi-person keypoint detection library for body, face, hands, and foot estimation

Overall idea of the paper

1. The input image first passes through a convolutional feature extractor, producing F in the figure above. The paper describes it as the first 10 layers of VGG-19, initialized and fine-tuned.

2. A set of convolution stages then computes the PAFs, i.e., the relationships between keypoints. A loss is computed on this output.

3. The results of steps 1 and 2 are concatenated together.

4. The final convolution stages compute the confidence map of each joint, and a loss is computed on them as well.

Theory introduction

The five levels of AI understanding of people

AI recognition of people can be divided into five levels, in order:

1. Is there a person? -> object detection (YOLO, SSD, Faster-RCNN)

2. Where is the person? -> object localization & semantic segmentation (Mask-RCNN)

3. Who is this person? -> face identification (face recognition)

4. What pose is this person in at this moment? -> pose estimation

5. What is this person doing over the current period of time? -> action recognition (on a short video sequence)

Challenges of Pose Detection

The number of people in each image is unknown.

Interactions between people are very complex (e.g., contact and occlusion), which makes it difficult to group the limbs, i.e., to determine which parts belong to which person.

The more people in the image, the greater the computational complexity (the amount of computation is positively correlated with the number of people), which makes real-time performance difficult.

Multi-Person Pose Estimation

Multi-person pose estimation has two main directions: bottom-up and top-down. 1. Bottom-up methods: first detect all human body keypoints in the image, then assign those keypoints to the different person instances. 2. Top-down methods: first run a person detector over the image to find all person instances, then run single-person keypoint detection on each of them. This approach separates person detection from keypoint detection.

Greedy Algorithm

A greedy algorithm is one that always makes the choice that looks best at the moment when solving a problem. That is, instead of considering global optimality, it produces a solution that is only locally optimal in some sense.

NP-problem

NP stands for nondeterministic polynomial time. Informally, NP is the class of decision problems whose solutions can be verified in polynomial time; for NP-hard problems, such as the K-dimensional matching above, no polynomial-time algorithm is known.

PAFs

Part Affinity Fields — what are PAFs?

As shown in Figure 4, a PAF is essentially a heatmap (a vector field) generated along the connection between two joints.

Hungarian algorithm

My personal understanding of this algorithm, with a simple example:

The Hungarian algorithm is used in multi-object tracking, e.g., in DeepSORT. Suppose the first frame detects 5 targets and the second frame detects 6 targets; target tracking needs to match the targets across the two frames as well as possible. The Hungarian algorithm matches the 5 targets of the previous frame against the 6 targets of the next frame: it computes a cost for each pairing, and the matching scheme with the lowest total cost is taken as the most reasonable one. In short, the Hungarian algorithm finds the optimal matching between two sets of objects.
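A toy version of this tracking example using SciPy's Hungarian implementation (random costs stand in for real association distances):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# 5 targets in the previous frame, 6 detections in the current frame,
# random association costs as placeholders for real distances.
rng = np.random.default_rng(0)
cost = rng.random((5, 6))

rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm, minimizes total cost
for r, c in zip(rows, cols):
    print(f"target {r} -> detection {c} (cost {cost[r, c]:.2f})")
# one detection stays unmatched and would start a new track
```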


Release Notes

OpenPose actually has two papers, written by the same team and published in 2017 and 2019, respectively.

The network structure in the 2017 paper:

The network structure in the 2019 paper:

The 2019 version of the paper also clearly states that each 7x7 convolution kernel is replaced by 3 consecutive 3x3 kernels, and that the DenseNet structure is borrowed, as shown by the dotted lines in the second picture.

Reference links

Summary of the OpenPose 2019 paper - Sisyphus Blog - CSDN Blog

How to evaluate Carnegie Mellon University's open-source project OpenPose? - Zhihu

[Many pictures, instant understanding] Plain-language OpenPose, the most popular pose estimation network - Zhihu

[AI Understanding People] OpenPose: Real-time Multi-person 2D Pose Estimation | With video test and source code links - Zhihu

Code compilation explanation

Compilation of OpenPose and basic use of PyOpenPose:

Compilation of OpenPose and basic use of PyOpenPose_哔哩哔哩_bilibili


Origin blog.csdn.net/XDH19910113/article/details/125778366