Pose Estimation Evaluation Metrics


PCK

PCK (Percentage of Correct Keypoints) is rarely used nowadays.

$$PCK_i^k=\frac{\sum_i\delta\left(\frac{d_i}{d}\leq T_k\right)}{\sum_i 1}$$

It is the proportion of detected keypoints whose normalized distance to the corresponding ground truth is below a set threshold $T_k$. $T_k$ is a manually chosen threshold, typically swept over $T_k\in[0:0.01:0.1]$.

$i$ denotes the index of the joint.

$d_i$ denotes the Euclidean distance between the prediction of the $i$-th joint and its ground truth.

$d$ is a human-body scale factor, and its calculation differs across public datasets (for example, the PCKh variant used on MPII normalizes by the head segment size).

$\delta(\cdot)$ is the indicator function: $\delta(\cdot)=1$ if the condition holds, and $0$ otherwise.
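The definitions above can be sketched in a few lines of NumPy. This is a minimal illustration (not from the original post), assuming predictions and ground truth are `(N, K, 2)` arrays for `N` samples and `K` joints, with a per-sample scale factor `d`:

```python
import numpy as np

def pck(pred, gt, scale, thresh=0.05):
    """Percentage of Correct Keypoints, per joint.

    pred, gt : (N, K, 2) predicted / ground-truth keypoint coordinates.
    scale    : (N,) per-sample normalization factor d (head size, torso
               size, etc., depending on the dataset).
    thresh   : threshold T_k on the normalized distance.
    Returns an array of shape (K,) with the PCK of each joint.
    """
    # Euclidean distance of each joint, normalized by the body scale d
    dist = np.linalg.norm(pred - gt, axis=-1) / scale[:, None]  # (N, K)
    # delta(d_i / d <= T_k), averaged over samples
    return (dist <= thresh).mean(axis=0)
```

Averaging the returned vector gives the overall PCK at threshold `thresh`.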

OKS

OKS (Object Keypoint Similarity) is inspired by the IoU of object detection and measures the similarity between predicted keypoints and the ground truth.
$$OKS_p=\frac{\sum_i \exp\{-d_{pi}^2/2S_p^2\sigma_i^2\}\,\delta(v_{pi}>0)}{\sum_i\delta(v_{pi}>0)}$$
where $p$ denotes a person in the ground truth, and $i$ indexes that person's keypoints;

$d_{pi}$ denotes the Euclidean distance between the $i$-th detected keypoint of person $p$ and its ground truth: $d_{pi}=\sqrt{(x'_i-x_{pi})^2+(y'_i-y_{pi})^2}$, where $(x'_i,y'_i)$ is the detection result and $(x_{pi},y_{pi})$ is the ground truth.

$v_{pi}$ is the visibility flag of the keypoint. Following the COCO convention, $v_{pi}=0$ means the keypoint is not labeled, $v_{pi}=1$ means it is labeled but occluded, and $v_{pi}=2$ means it is labeled and visible.

$S_p$ is the scale factor of ground-truth person $p$, taken as the square root of the person's detection-box area: $S_p=\sqrt{wh}$, where $w,h$ are the width and height of the box.

$\sigma_i$ is the normalization factor of the $i$-th keypoint type, reflecting how hard that keypoint is to annotate: it is the standard deviation, measured over manually annotated samples, of the annotations relative to the true value. A larger $\sigma$ means that type of keypoint is harder to label consistently.

The normalization factors of the 17 keypoint types were computed over 5000 samples of the COCO dataset; the $\sigma$ values are: {nose: 0.026, eyes: 0.025, ears: 0.035, shoulders: 0.079, elbows: 0.072, wrists: 0.062, hips: 0.107, knees: 0.087, ankles: 0.089}, so these values can be used as constants. If the keypoint types you use are not among these, compute the factors separately.
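Putting the formula and the COCO sigmas together, here is a minimal sketch of OKS for a single person (an illustration under the definitions above, not an excerpt of the official COCO evaluation code, which uses segmentation area rather than box area for $S_p^2$):

```python
import numpy as np

# COCO sigmas expanded to the 17 keypoints in COCO order:
# nose, eyes (2), ears (2), shoulders (2), elbows (2), wrists (2),
# hips (2), knees (2), ankles (2)
COCO_SIGMAS = np.array([0.026, 0.025, 0.025, 0.035, 0.035,
                        0.079, 0.079, 0.072, 0.072, 0.062, 0.062,
                        0.107, 0.107, 0.087, 0.087, 0.089, 0.089])

def oks(pred, gt, vis, box_wh, sigmas=COCO_SIGMAS):
    """Object Keypoint Similarity for one person.

    pred, gt : (K, 2) predicted / ground-truth keypoint coordinates.
    vis      : (K,) visibility flags v_pi (0 = unlabeled).
    box_wh   : (w, h) of the person's box; S_p^2 = w * h.
    """
    d2 = np.sum((pred - gt) ** 2, axis=-1)      # squared distances d_pi^2
    s2 = box_wh[0] * box_wh[1]                  # S_p^2
    e = np.exp(-d2 / (2.0 * s2 * sigmas ** 2))  # per-keypoint similarity
    labeled = vis > 0                           # delta(v_pi > 0)
    return float(e[labeled].mean())
```

A perfect prediction yields OKS = 1; the score decays toward 0 as keypoints drift from the ground truth.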

AP (Average Precision)

For single-person pose estimation, each image contains only one human body, and AP is computed as:
$$AP=\frac{\sum_p\delta(oks_p>T)}{\sum_p 1}$$
That is, the fraction of images whose OKS exceeds the threshold $T$ ($T$ is chosen manually).

For multi-person pose estimation, an image contains $M$ targets, and suppose a total of $N$ people are predicted. The ground truth and predictions then form an $M\times N$ OKS matrix; taking the maximum of each row gives each target's OKS, so:
$$AP=\frac{\sum_m\sum_p\delta(oks_p>T)}{\sum_m\sum_p 1}$$
If a bottom-up method is used (first detect all keypoints, then group them into people), suppose an image contains $M$ people and $N$ people are predicted. Since the one-to-one correspondence between the $N$ predictions and the $M$ ground-truth people is unknown, the OKS between every ground-truth person and every predicted person must be computed, giving an $M\times N$ matrix. The maximum of each row is taken as that ground-truth person's OKS. Each ground-truth person then has a scalar OKS, and given a threshold $T$, AP is computed over all people in all images using the formula above.

MPJPE

For 3D pose estimation, the most commonly used metric is MPJPE (Mean Per Joint Position Error). As the name suggests, it is the average Euclidean distance between the predicted keypoints and the ground truth. However, keypoints are generally represented root-relative, i.e., coordinates are expressed relative to one keypoint chosen as the root node, and the error is usually computed in the camera coordinate system.
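A minimal NumPy sketch of root-relative MPJPE for one pose (the choice of joint 0 as root is an assumption; in practice the pelvis is commonly used):

```python
import numpy as np

def mpjpe(pred, gt, root=0):
    """Mean Per Joint Position Error in root-relative coordinates.

    pred, gt : (K, 3) predicted / ground-truth 3D joints (camera frame).
    root     : index of the root joint.
    """
    pred_rel = pred - pred[root]   # express joints relative to the root
    gt_rel = gt - gt[root]
    return float(np.linalg.norm(pred_rel - gt_rel, axis=-1).mean())
```

Note that because both poses are re-expressed relative to their roots, a global translation of the prediction does not change the error.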

P-MPJPE

P-MPJPE (Procrustes analysis MPJPE) is MPJPE based on Procrustes analysis: the output is first rigidly aligned to the ground truth (the optimal similarity transform of scale, rotation, and translation), and then MPJPE is computed.
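The alignment step can be implemented with the classic orthogonal-Procrustes (Kabsch) solution via SVD. A self-contained sketch, assuming single `(K, 3)` poses:

```python
import numpy as np

def p_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment (optimal similarity transform).

    Aligns pred to gt with the best scale, rotation and translation
    in the least-squares sense, then computes MPJPE.
    pred, gt : (K, 3) arrays of 3D joints.
    """
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g            # remove translation
    # Optimal rotation from the SVD of the cross-covariance matrix
    U, s, Vt = np.linalg.svd(p.T @ g)
    R = (U @ Vt).T
    if np.linalg.det(R) < 0:                 # guard against reflections
        Vt[-1] *= -1
        s[-1] *= -1
        R = (U @ Vt).T
    scale = s.sum() / (p ** 2).sum()         # optimal isotropic scale
    aligned = scale * p @ R.T + mu_g
    return float(np.linalg.norm(aligned - gt, axis=-1).mean())
```

By construction, any prediction that differs from the ground truth only by a similarity transform scores a P-MPJPE of (numerically) zero, so the metric isolates errors in the pose's internal structure.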


Origin blog.csdn.net/weixin_45755332/article/details/128593344