[Illustrated quick read] A plain-language guide to OpenPose, the most popular pose estimation network

Foreword:

Recently, while developing a motion-counting app, I needed algorithms related to pose estimation, so I surveyed the field. Some classic pose estimation papers are shown in the figure below. Among them, OpenPose has been enormously influential: thanks to its open-source release, its papers, code, tutorials, documentation, and models are abundant, and many projects are built on top of it. Its bottom-up core algorithm, PAF, is also well worth studying. In this post I give a detailed explanation of OpenPose, striving to explain its core algorithm thoroughly in the plainest possible language.

Network Structure

The network structure of OpenPose is shown in the figure below:

First, image features are extracted by the VGG19 backbone, and then fed into the stage modules. The stages are modules connected in series, each with the same structure and function. Each stage splits into two branches: one generates the PCM and the other generates the PAF. The PCM and PAF of every stage contribute a loss term, and the final total loss is the sum of all of them.

Why are multiple stages needed? In theory the first stage could already output complete information, so why repeat it? The reason is that keypoints carry mutual semantic information. For example, stage 1 may detect only the eyes and miss the nose. Since the output of stage 1 is part of the input to stage 2, stage 2 can infer the position of the nose from the eyes, so each later stage can use the information extracted by the previous one to further refine the detections. This is very useful for keypoints that are hard to detect.

A more down-to-earth explanation is that keypoints differ in difficulty. Some, such as eyes and noses, have very distinctive visual features, while others vary greatly with clothing, occlusion, jewelry, and so on. The early stages therefore detect the easy keypoints, and the later stages detect the harder ones based on the keypoints found earlier. It is a process of gradual refinement.
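The stage-wise supervision described above can be sketched numerically. The snippet below is a minimal NumPy sketch, not OpenPose's actual training code: it sums an L2 loss over the PCM and PAF outputs of every stage, with illustrative channel counts (19 PCM, 38 PAF) and a made-up 46×46 feature-map size.

```python
import numpy as np

def stage_losses(pred_pcms, pred_pafs, gt_pcm, gt_paf):
    """Sum the L2 losses of every stage's PCM and PAF outputs.

    pred_pcms / pred_pafs: lists with one prediction per stage,
    each shaped like the corresponding ground truth.
    """
    total = 0.0
    for pcm, paf in zip(pred_pcms, pred_pafs):
        total += np.mean((pcm - gt_pcm) ** 2)   # PCM branch loss
        total += np.mean((paf - gt_paf) ** 2)   # PAF branch loss
    return total

# Toy example: 3 stages, 19 PCM channels and 38 PAF channels on a 46x46 grid.
rng = np.random.default_rng(0)
gt_pcm = rng.random((19, 46, 46))
gt_paf = rng.random((38, 46, 46))
preds_pcm = [gt_pcm + 0.1 * rng.standard_normal(gt_pcm.shape) for _ in range(3)]
preds_paf = [gt_paf + 0.1 * rng.standard_normal(gt_paf.shape) for _ in range(3)]
loss = stage_losses(preds_pcm, preds_paf, gt_pcm, gt_paf)
```

Summing the per-stage losses is what lets every stage receive a gradient directly, rather than only the last one.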

PCM in Plain Language

PCM stands for Part Confidence Map: it is essentially a heatmap of the keypoints, used to represent their locations. Assuming 18 human keypoints need to be output, the PCM will have 19 channels, with the last channel used for the background. In theory the background channel could be omitted, but outputting it has two advantages. First, it adds a supervisory signal, which helps the network learn. Second, the background output is fed into the next stage as input, giving that stage richer semantic information. The figure below is a schematic of inputting one image and outputting 19 PCM channels.

Although a keypoint in principle corresponds to a single pixel in a PCM, that is generally not how it is done. Instead, a Gaussian kernel is used to create the PCM ground truth. Without a Gaussian kernel, the pixels adjacent to the ground-truth point would be forced to be learned as negative samples, even though the receptive field of such a point (yellow dashed box) is nearly identical to the receptive field of the ground-truth point (red dashed box). This confuses the network: is there a keypoint here or not? Sometimes, due to labeling errors, the annotated point may even be less accurate than a neighboring "negative" pixel, confusing the network further. It is like being told sometimes that round things are apples and sometimes that round things are pears; how is the network supposed to learn? Using a Gaussian-kernel PCM ground truth alleviates this problem nicely: pixels adjacent to the labeled point are not simply treated as negatives but as positives with slightly lower confidence, falling off according to a Gaussian distribution, which is much more logical.
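The Gaussian ground-truth construction can be sketched as follows. This is a simplified NumPy sketch, not OpenPose's code: the channel layout (K keypoint channels plus one background channel) follows the description above, while `sigma`, the map size, and the keypoint coordinates are made-up values.

```python
import numpy as np

def gaussian_pcm(h, w, keypoints, sigma=7.0):
    """Build a (K+1)-channel PCM: one Gaussian heatmap per keypoint
    plus a background channel (1 - max over the keypoint channels)."""
    ys, xs = np.mgrid[0:h, 0:w]
    maps = []
    for (kx, ky) in keypoints:
        d2 = (xs - kx) ** 2 + (ys - ky) ** 2
        maps.append(np.exp(-d2 / (2 * sigma ** 2)))
    maps = np.stack(maps)                    # (K, h, w)
    background = 1.0 - maps.max(axis=0)      # extra supervisory channel
    return np.concatenate([maps, background[None]], axis=0)

# Two hypothetical keypoints on a 46x46 map -> 3 channels (2 + background).
pcm = gaussian_pcm(46, 46, [(10, 12), (30, 25)])
```

Each channel peaks at 1.0 on its labeled pixel and decays smoothly around it, so neighbors of the label are soft positives rather than hard negatives.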

 

PAF in Plain Language

PCM is used by most keypoint detection networks, and I expect many readers already understand it. PAF, on the other hand, is the core of OpenPose and its biggest distinguishing feature compared with other keypoint detection frameworks. PAF stands for Part Affinity Fields, which describe the affinity between different keypoints: joints belonging to the same person have high affinity, while joints of different people have low affinity. OpenPose is a bottom-up pose estimation network: it first detects all joints, regardless of whom they belong to, then determines which joints have high affinity with each other and groups those into the same person. As shown in the figure below, there are two person instances in the image, so two left eyes and two left ears are detected. How should they be paired? This is where PAF comes in. PAF describes the affinity between any two joints (for example, with 2 left eyes and 2 left ears detected, each left eye gets an affinity with every left ear, so there are 4 affinities in total). In the figure below, the yellower the color, the stronger the affinity between two keypoints. The affinity between left ear 1 and left eye 1 is clearly greater than that between left eye 1 and left ear 2, so left eye 1 is paired with left ear 1. Likewise, left ear 2 has greater affinity with left eye 2, so they are paired.

That covers what PAF does; how is it realized? It is actually not that complicated. For ease of description I will not use the notation and formulas from the paper here, but explain it in the most down-to-earth terms. OpenPose first pairs the joints by hand, as shown in the figure below, for a total of 19 pairs. You can think of these as 19 bones, each connecting two joints, with one PAF heatmap per bone. At first I did not understand how 18 joints could yield 19 bones; it became clear once I drew them all. Some joints lie on multiple bones, and OpenPose adds virtual bones between the ears and shoulders (the red connections in the figure below), so there are 19 bones in total, i.e., 19 PAF fields. Since each PAF value is a vector, which requires two scalars x and y, the PAF output has 19 × 2 = 38 channels.

How is a PAF expressed? Consider the PAF field for the bone connecting the left hip to the left knee. In the real world a bone has both length and width. The length is the distance between the two joints at its ends; the width is set by a hyperparameter. For example, if the bone's width is α, we can define the affinity vector to be nonzero for points inside the bone and zero for points outside it.

As shown in the figure above, the red point is inside the bone, so it gets a nonzero vector in the PAF field, while the green point is outside the bone, so it is represented by a zero vector. What vector do the meaningful (red) points get? OpenPose uses the unit vector pointing from one joint to the other, i.e., the unit-length version of the yellow vector in the figure. The final PAF field is therefore as shown below (the blue region is the zero-vector area):

If a pixel lies inside multiple bones, its PAF vector is the mean of all the corresponding vectors.
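The construction above can be sketched in NumPy. This is a simplified sketch under stated assumptions (a single limb, coordinates given as (x, y), the limb region defined by an along-axis projection plus a perpendicular-distance threshold); it is not OpenPose's actual implementation, and it omits the multi-limb averaging step just described.

```python
import numpy as np

def paf_field(h, w, j1, j2, limb_width=1.0):
    """Ground-truth PAF for one limb: the unit vector from joint j1 to
    joint j2 at every pixel inside the limb region, zero elsewhere.
    Returns an array of shape (2, h, w) holding the x and y components."""
    j1, j2 = np.asarray(j1, float), np.asarray(j2, float)
    v = j2 - j1
    length = np.linalg.norm(v)
    v = v / length                                 # unit direction of the bone
    ys, xs = np.mgrid[0:h, 0:w]
    rel = np.stack([xs - j1[0], ys - j1[1]])       # offset from j1, (2, h, w)
    along = v[0] * rel[0] + v[1] * rel[1]          # projection onto the axis
    perp = np.abs(v[1] * rel[0] - v[0] * rel[1])   # distance from the axis
    inside = (along >= 0) & (along <= length) & (perp <= limb_width)
    field = np.zeros((2, h, w))
    field[0][inside], field[1][inside] = v[0], v[1]
    return field

# Hypothetical horizontal bone from (3, 10) to (15, 10) on a 20x20 map.
paf = paf_field(20, 20, j1=(3, 10), j2=(15, 10), limb_width=1.5)
```

Pixels on the axis between the joints hold the unit vector (1, 0); pixels beyond the joints or farther than `limb_width` from the axis stay zero.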

 

Fast Calculation of Affinity

Given the PAF field, how do we compute the affinity between two keypoints? In theory it requires the integral of all the PAF vectors along the line between them. That is too expensive, so in practice the line is sampled uniformly. For example, the segment between the two joints can be divided evenly into 5 parts, and the PAF vectors at the 4 interior points are collected as the affinity estimate. Because every bone is sampled the same way, the relative ordering of the affinities is preserved, as shown in the figure below, and the computation is greatly reduced.

 

Keypoint Pairing

To be honest, I initially thought that with PCM and PAF in hand, pairing would be trivial: use the PCM to get, say, 3 left ears and 3 left eyes, then pair each keypoint with its highest-affinity partner according to the PAF. For example, as shown below:

But in practice, PAF is not that clean. Often many affinities are quite close, and simply pairing by the maximum does not work. For example, left ear 1 may have its greatest affinity with left eye 1, while left eye 1 has its greatest affinity with left ear 2. In such cases, naive maximum pairing runs into trouble.

In fact, this is a maximum bipartite matching problem from graph theory, so the Hungarian algorithm can be applied directly to find the maximum matching. The Hungarian algorithm itself is not hard to understand, and there are many good tutorials online; interested readers can look them up.
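As a sketch of this step, the snippet below uses SciPy's `linear_sum_assignment` (which solves the same assignment problem as the Hungarian algorithm) on a made-up affinity matrix where greedy row-by-row pairing fails but the global matching succeeds. The matrix values are illustrative, not from OpenPose.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Affinities between 3 candidate left ears (rows) and 3 left eyes (cols).
# Greedy pairing grabs ear0-eye0 (0.9) first, leaving ear1 with only 0.2,
# for a total of 0.9 + 0.2 + 0.7 = 1.8. The global optimum is
# ear0-eye1, ear1-eye0, ear2-eye2 with 0.8 + 0.85 + 0.7 = 2.35.
aff = np.array([
    [0.90, 0.80, 0.10],
    [0.85, 0.20, 0.10],
    [0.10, 0.10, 0.70],
])
rows, cols = linear_sum_assignment(aff, maximize=True)
pairs = list(zip(rows.tolist(), cols.tolist()))
```

The solver maximizes the total affinity over all one-to-one pairings, which is exactly the maximum-weight bipartite matching the text describes.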

(Readers who share these technical interests are welcome to add me on WeChat so we can learn and improve together: 15158106211)


Origin blog.csdn.net/cjnewstar111/article/details/115284760