Author: ChaucerG
Source: Jizhi Shutong
This paper introduces YOLO-Pose, a novel heatmap-free joint detection and pose estimation method built on the YOLOv5 object detection framework. Existing heatmap-based two-stage methods are suboptimal because they are not trained end-to-end and rely on a surrogate L1 loss, which is not equivalent to maximizing the evaluation metric, Object Keypoint Similarity (OKS).
YOLO-Pose can be trained end-to-end and optimizes the OKS metric itself. The model learns to jointly detect the bounding boxes of multiple persons and their corresponding 2D poses in a single forward pass, surpassing the best results of both top-down and bottom-up approaches.
Unlike bottom-up methods, YOLO-Pose needs no post-processing to group detected keypoints into skeletons: each bounding box has an associated pose, so the keypoints are inherently grouped. Unlike top-down methods, it needs no multiple forward passes, since all poses are localized in one pass. YOLO-Pose achieves new state-of-the-art results among bottom-up methods on the COCO validation set (90.2% AP50) and test-dev set (90.3% AP50), without flip testing, multi-scale testing, or any other test-time augmentation. All experiments and results reported in this paper use no test-time augmentation, unlike conventional methods that rely on flip and multi-scale tests to improve performance.
2 YOLO-Pose Method
Like other bottom-up approaches, YOLO-Pose is a single-shot method. However, it does not use heatmaps. Instead, YOLO-Pose associates all of a person's keypoints with an anchor.
YOLO-Pose is based on the YOLOv5 object detection framework and can be extended to other frameworks; it has also been verified to a limited extent on YOLOX. Figure 2 illustrates the overall architecture for pose estimation.
2.1 Overview
YOLOv5 is an excellent detector in terms of both accuracy and complexity, so it is chosen as the base to build upon. YOLOv5 mainly targets 80-class COCO object detection: its box head predicts 85 elements per anchor, corresponding to the bounding box, the objectness score, and the confidence scores of the 80 classes. At each grid position there are 3 anchors of different shapes.
Human pose estimation can be viewed as single-class person detection, where each person has 17 associated keypoints and each keypoint has a location and a confidence: {x, y, conf}. The 17 keypoints associated with an anchor therefore contribute 51 elements in total.
So for each anchor, the keypoint head predicts the 51 keypoint elements and the box head predicts 6 elements. For an anchor with n keypoints, the overall prediction vector is defined as:

P = {Cx, Cy, W, H, box_conf, class_conf, Kx^1, Ky^1, Kconf^1, ..., Kx^n, Ky^n, Kconf^n}
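To make the layout concrete, here is a minimal sketch of how one anchor's prediction vector could be packed. The function and argument names are illustrative, not taken from the paper's code; only the element counts (6 box-head elements plus 3 per keypoint) come from the text above.

```python
def build_prediction_vector(box, obj_conf, cls_conf, keypoints):
    """Pack one anchor's prediction: 6 box-head elements + 3 per keypoint.

    box       -- (Cx, Cy, W, H)
    keypoints -- iterable of (x, y, conf); 17 entries for COCO persons
    """
    vec = [*box, obj_conf, cls_conf]        # 6 box-head elements
    for x, y, c in keypoints:               # 3 elements per keypoint
        vec.extend([x, y, c])
    return vec
```

With 17 COCO keypoints this yields 6 + 51 = 57 elements per anchor.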
Keypoint confidences are trained using the visibility flags of the keypoints. If a keypoint is visible or occluded, its ground-truth confidence is set to 1; if the keypoint is outside the field of view, the confidence is set to 0.
During inference, only keypoints with confidence greater than 0.5 are kept; all other predicted keypoints are masked out. The predicted keypoint confidence is not used for evaluation. However, since the network predicts all 17 keypoints for every detection, keypoints outside the field of view must be filtered out; otherwise, dangling low-confidence keypoints would produce deformed skeletons. Existing heatmap-based bottom-up methods do not need this step because keypoints outside the field of view are simply not detected in the first place.
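The filtering step above can be sketched in a few lines. This is a hypothetical helper, assuming keypoints are given as (x, y, conf) triples; the 0.5 threshold is the one stated in the text.

```python
def filter_keypoints(keypoints, conf_thr=0.5):
    """Keep keypoints whose predicted confidence exceeds the threshold;
    mask out the rest (e.g. joints outside the field of view)."""
    return [(x, y, c) if c > conf_thr else None
            for (x, y, c) in keypoints]
```

Masked entries (here `None`) are simply skipped when drawing the skeleton.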
YOLO-Pose uses CSP-darknet53 as the backbone and PANet to fuse features from the backbone's different scales. Next come detection heads at 4 different scales. Finally, two decoupled heads predict the boxes and the keypoints.
In this work, the complexity of YOLO-Pose is limited to 150 GMACS, and within this budget it achieves competitive results. Increasing the complexity further would narrow the gap to top-down methods, but this path is not pursued, as the focus is on real-time models.
2.2 Anchor based multi-person pose formulation
For a given image, the anchor matched to a person stores that person's entire 2D pose together with the bounding box. The bounding box coordinates are expressed relative to the anchor center, while the box dimensions are normalized by the anchor's height and width. Likewise, the keypoint positions are expressed relative to the anchor center; however, the keypoints are not normalized by the anchor's height and width. Both the keypoints and the box are predicted relative to the anchor center.
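The encoding above can be sketched as a small decode step. This is an illustrative sketch of the formulation as described (names and argument layout are assumptions, not the paper's code): the box size is scaled by the anchor dimensions, while keypoint offsets are not.

```python
def decode_box_and_kpts(anchor, box_pred, kpt_pred):
    """Recover absolute coordinates from anchor-relative predictions.

    anchor   -- (ax, ay, aw, ah): anchor center and dimensions
    box_pred -- (dx, dy, dw, dh): center offsets + scale factors
    kpt_pred -- iterable of (kx, ky): offsets from the anchor center,
                NOT normalized by the anchor dimensions
    """
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh = box_pred
    box = (ax + dx, ay + dy, dw * aw, dh * ah)          # size scaled by anchor
    kpts = [(ax + kx, ay + ky) for kx, ky in kpt_pred]  # plain offsets
    return box, kpts
```

Because the keypoints depend only on the anchor center, not its width and height, this formulation transfers directly to anchor-free detectors, as the next subsection notes.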
Since this formulation is independent of the anchor's width and height, YOLO-Pose can be easily extended to anchor-free object detection methods such as YOLOX and FCOS.
2.3 IoU Based Bounding-box Loss Function
Most object detectors optimize IoU-loss variants such as GIoU, DIoU, or CIoU loss rather than distance-based losses, because these losses are scale-invariant and directly optimize the evaluation metric itself. YOLO-Pose uses CIoU loss to supervise the bounding box. For the ground-truth bounding box matched to the k-th anchor at location (i, j) and scale s, the loss is defined as:

L_box = 1 - CIoU(Box_gt, Box_pred)
Here, Box_pred is the box predicted by the k-th anchor at location (i, j) and scale s. In YOLO-Pose there are 3 anchors per location, and prediction happens at 4 scales.
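A minimal pure-Python sketch of the CIoU loss follows, using the standard definition (IoU minus a normalized center-distance penalty minus an aspect-ratio consistency term). This is an illustration of the loss, not the paper's implementation; boxes are (cx, cy, w, h).

```python
import math

def _to_xyxy(b):
    cx, cy, w, h = b
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def _iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def ciou_loss(pred, gt):
    """CIoU loss for boxes in (cx, cy, w, h) format: 1 - CIoU."""
    pa, ga = _to_xyxy(pred), _to_xyxy(gt)
    iou = _iou(pa, ga)
    # squared distance between box centers
    rho2 = (pred[0] - gt[0]) ** 2 + (pred[1] - gt[1]) ** 2
    # squared diagonal of the smallest enclosing box
    cw = max(pa[2], ga[2]) - min(pa[0], ga[0])
    ch = max(pa[3], ga[3]) - min(pa[1], ga[1])
    c2 = cw ** 2 + ch ** 2 + 1e-9
    # aspect-ratio consistency term
    v = (4 / math.pi ** 2) * (math.atan(gt[2] / gt[3])
                              - math.atan(pred[2] / pred[3])) ** 2
    alpha = v / (1 - iou + v + 1e-9)
    return 1 - (iou - rho2 / c2 - alpha * v)
```

Unlike plain IoU loss, the center-distance term keeps the gradient informative even when the boxes do not overlap.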
2.4 Human Pose Loss Function Formulation
OKS is the most commonly used metric for evaluating keypoints. Traditionally, heatmap-based bottom-up methods use an L1 loss to detect keypoints. However, the L1 loss is not necessarily suited to obtaining optimal OKS, and it takes into account neither the scale of the target nor the type of keypoint. Since a heatmap is a probability map, OKS cannot be used as a loss in purely heatmap-based methods; OKS can serve as a loss function only when keypoint locations are regressed directly. Geng et al. used a scale-normalized L1 loss for keypoint regression, which is a step towards an OKS loss.
Therefore, the authors regress the keypoints directly relative to the anchor center, so that the evaluation metric itself can be optimized instead of a surrogate loss. Here, the concept of IoU loss is extended from boxes to keypoints.
For keypoints, the Object Keypoint Similarity (OKS) plays the role of the IoU. The OKS loss is inherently scale-invariant and weights some keypoints more than others. For example, keypoints on a person's head (eyes, nose, ears) are penalized more for the same localization error than keypoints on the body (shoulders, knees, hips, etc.).
Figure 2: YOLO-Pose architecture based on YOLOv5. The input image is passed through the CSP-darknet53 backbone, which generates feature maps at different scales {P3, P4, P5, P6}. PANet fuses these feature maps across scales; its outputs feed the detection heads, and each detection head finally branches into a box head and a keypoint head.
Unlike standard IoU loss, whose gradient vanishes when there is no overlap, the OKS loss never vanishes. In this respect, the OKS loss is more similar to DIoU loss.
The entire pose is stored with each bounding box. So if a GT bounding box matches an anchor at location (i, j) and scale s, the keypoints are predicted relative to that anchor's center. The OKS is computed for each keypoint separately and then summed over the labelled keypoints to give the final OKS loss, or keypoint IoU loss:

L_kpts = 1 - (Σ_n OKS_n · δ(v_n > 0)) / (Σ_n δ(v_n > 0)),  where OKS_n = exp(-d_n² / (2 s² k_n²))
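The OKS computation described above can be sketched in pure Python. This is an illustrative sketch, assuming keypoints as (x, y) pairs, per-keypoint visibility flags, the object scale s² (the person's area), and per-keypoint falloff constants k_n; unlabelled keypoints (visibility 0) are excluded, as in the COCO definition.

```python
import math

def oks(pred_kpts, gt_kpts, vis, area, sigmas):
    """Object Keypoint Similarity for one person instance.

    pred_kpts, gt_kpts -- lists of (x, y)
    vis    -- GT visibility flags (> 0 means labelled)
    area   -- object scale s^2 (area of the person)
    sigmas -- per-keypoint falloff constants k_n
    """
    num, den = 0.0, 0
    for (px, py), (gx, gy), v, k in zip(pred_kpts, gt_kpts, vis, sigmas):
        if v > 0:                       # only labelled keypoints count
            d2 = (px - gx) ** 2 + (py - gy) ** 2
            num += math.exp(-d2 / (2 * area * k ** 2 + 1e-9))
            den += 1
    return num / max(den, 1)

def oks_loss(pred_kpts, gt_kpts, vis, area, sigmas):
    return 1.0 - oks(pred_kpts, gt_kpts, vis, area, sigmas)
```

Because the error d_n is normalized by the object scale, the loss is scale-invariant, and smaller k_n values (head keypoints) make the exponential fall off faster, penalizing the same pixel error more.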
Corresponding to each keypoint, a confidence parameter is learned that indicates whether that keypoint exists for the person; the visibility flags of the keypoints serve as ground truth for this confidence.
The total loss combines the box, keypoint, keypoint-confidence, and classification terms, with hyperparameters λ_box, λ_kpts, λ_kpts_conf, and λ_cls used to balance the losses.
2.5 Test Time Augmentations
All SOTA methods for pose estimation rely on test-time augmentation (TTA) to improve performance. Flip testing and multi-scale testing are the two most common techniques. Flip testing increases the complexity by 2X, while multi-scale testing runs inference at three scales {0.5X, 1X, 2X}, increasing the complexity by 0.25X + 1X + 4X = 5.25X. With both flip and multi-scale testing, the complexity increases by 5.25X × 2 = 10.5X.
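The cost arithmetic above can be captured in a tiny helper (illustrative only): compute cost scales with the square of the image-side scale factor, since a 2X side means 4X the pixels and MACs.

```python
def tta_cost(scales=(0.5, 1.0, 2.0), flip=True):
    """Relative inference cost of multi-scale + flip TTA.

    Cost of one pass at scale s is s**2 (pixels/MACs grow quadratically);
    flip testing doubles every pass.
    """
    cost = sum(s ** 2 for s in scales)
    return cost * (2 if flip else 1)
```

For the scales in the text this gives 5.25X without flipping and 10.5X with it.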
Beyond the added computational complexity, preparing the augmented data is itself expensive. For example, flip testing requires flipping the image, which adds latency. Similarly, multi-scale testing requires a resize operation for each scale. Unlike CNN operations, these operations may not be hardware-accelerated and can therefore be very expensive. There is also an additional cost to fuse the outputs of the various forward passes. For embedded systems, achieving competitive results without any TTA is therefore essential.
Therefore, all YOLO-Pose results are reported without any TTA.
2.6 Keypoints Outside the Bounding Box
Top-down methods perform poorly under occlusion. One advantage of YOLO-Pose over top-down methods is that its keypoints are not constrained to lie within the predicted bounding box: if keypoints fall outside the box because of occlusion, they can still be identified correctly. In top-down approaches, pose estimation also fails whenever the person is not detected correctly. YOLO-Pose mitigates both challenges, occlusion and incorrect box detection, to some extent, as shown in Figure 3.
2.7 ONNX Export for Easy Deployability
All ops used in YOLO-Pose are part of standard deep learning libraries and are compatible with ONNX. Therefore, the entire model can be exported to ONNX, making it easy to deploy across platforms. The standalone ONNX model can be executed with ONNXRuntime, taking an image as input and inferring the bounding box and pose of each person in it. No top-down method can be exported end-to-end to an intermediate ONNX representation in this way.
3 Experimental Results
3.1 Ablation experiment
1. OKS Loss vs L1 Loss
2. Across Resolution
3. Quantization
The YOLOv5 model uses the sigmoid-weighted linear unit (SiLU) activation. Liu et al. observed that unbounded activation functions like SiLU or HardSwish are not quantization friendly, whereas models with ReLUX activations are robust to quantization because of their bounded range.
Therefore, the model was retrained with ReLU activations. We observed a 1-2% accuracy drop moving from SiLU to ReLU. We call these models YOLOv5_relu.
3.2 COCO results
4 References
[1] YOLO-Pose: Enhancing YOLO for Multi Person Pose Estimation Using Object Keypoint Similarity Loss