YOLOv5 strikes again | YOLO-Pose delivers a real-time, easy-to-deploy pose estimation model!


Author: ChaucerG

Source: Jizhi Shutong


This paper introduces YOLO-Pose, a novel heatmap-free approach to joint person detection and 2D pose estimation built on the YOLOv5 object detection framework.

Existing heatmap-based two-stage methods are suboptimal because they are not trained end-to-end and rely on a surrogate L1 loss, which is not equivalent to maximizing the evaluation metric, i.e. Object Keypoint Similarity (OKS).

YOLO-Pose can be trained end-to-end and optimizes the OKS metric itself. The model learns to jointly detect the bounding boxes of multiple persons and their corresponding 2D poses in a single forward pass, surpassing the best results of both top-down and bottom-up approaches.

YOLO-Pose requires none of the post-processing used by bottom-up methods to group detected keypoints into skeletons, since each bounding box already has an associated pose, giving an inherent grouping of keypoints. Unlike top-down methods, it also avoids multiple forward passes, because all poses are localized in a single pass.

YOLO-Pose achieves new state-of-the-art results among bottom-up methods on the COCO validation set (90.2% AP50) and test-dev set (90.3% AP50) without flip, multi-scale, or any other test-time augmentation. All experiments and results reported in the paper use no test-time augmentation, unlike traditional methods that rely on flip and multi-scale testing to boost performance.

1 YOLO-Pose method

Like other bottom-up methods, YOLO-Pose is a single-shot method. However, it does not use heatmaps. Instead, YOLO-Pose associates all of a person's keypoints with an anchor.

Figure 2: YOLO-Pose architecture based on YOLOv5. The input image passes through the CSP-darknet53 backbone, which produces feature maps at different scales {P3, P4, P5, P6}. PANet fuses these feature maps across multiple scales, and its outputs feed the detection heads. Each detection head finally branches into a Box Head and a Keypoint Head.

YOLO-Pose is based on the YOLOv5 object detection framework and can also be extended to other frameworks. It has also been validated to a limited extent on YOLOX. Figure 2 illustrates the overall architecture for pose estimation.

2.1 Overview

YOLOv5 is an excellent detector in terms of both accuracy and complexity, so it is chosen as the base to build upon. YOLOv5 mainly targets 80-class COCO object detection: the box head predicts 85 elements per anchor, corresponding to the bounding box, the objectness score, and the confidence scores of the 80 classes. Each grid location has 3 anchors of different shapes.

Human pose estimation can be viewed as single-class person detection, where each person has 17 associated keypoints and each keypoint is identified by a location and a confidence: (x, y, conf). So there are 51 elements in total for the 17 keypoints associated with one anchor.

Therefore, for each anchor, the keypoint head predicts 51 elements and the box head predicts 6 elements. For an anchor with n keypoints, the overall prediction vector is defined as:

P_v = {C_x, C_y, W, H, box_conf, class_conf, K1_x, K1_y, K1_conf, ..., Kn_x, Kn_y, Kn_conf}
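
As a rough illustration, here is a minimal sketch (not the official implementation; the names and shapes are assumptions based on the layout above) of slicing one anchor's 57-element prediction into its box, score, and keypoint parts:

```python
# Split one anchor's prediction vector into box, scores and keypoints,
# assuming the layout {cx, cy, w, h, box_conf, class_conf, (kx, ky, kconf) x 17}.
import numpy as np

NUM_KPTS = 17

def split_prediction(pred_vec: np.ndarray):
    """Split a single anchor's 57-d prediction into box, scores and keypoints."""
    assert pred_vec.shape[-1] == 6 + 3 * NUM_KPTS
    box = pred_vec[:4]                         # cx, cy, w, h
    box_conf, cls_conf = pred_vec[4], pred_vec[5]
    kpts = pred_vec[6:].reshape(NUM_KPTS, 3)   # per keypoint: x, y, confidence
    return box, box_conf, cls_conf, kpts

box, box_conf, cls_conf, kpts = split_prediction(np.random.rand(57))
```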

Keypoint confidences are trained using the visibility flags of the keypoints. If a keypoint is visible or occluded, its ground-truth confidence is set to 1; if the keypoint is outside the field of view, its confidence is set to 0.

At inference, only keypoints with confidence greater than 0.5 are kept; all other predicted keypoints are masked out. The predicted keypoint confidences are not used for evaluation. However, since the network predicts all 17 keypoints for every detection, keypoints outside the field of view must be filtered out; otherwise, low-confidence keypoints would produce deformed skeletons. Existing heatmap-based bottom-up methods do not need this step because keypoints outside the field of view are simply never detected.
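
A minimal sketch of this confidence-based masking, assuming keypoints are stored as (x, y, confidence) triplets as described above:

```python
# Mask out low-confidence keypoints at inference time.
import numpy as np

def mask_keypoints(kpts: np.ndarray, conf_thresh: float = 0.5) -> np.ndarray:
    """kpts: (17, 3) array of (x, y, confidence); low-confidence points are zeroed."""
    kpts = kpts.copy()
    kpts[kpts[:, 2] <= conf_thresh, :2] = 0.0   # hide keypoints outside the view
    return kpts
```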

YOLO-Pose uses CSP-darknet53 as the backbone and PANet to fuse backbone features from different scales. These feed detection heads at 4 different scales. Finally, each head splits into 2 decoupled heads for predicting boxes and keypoints.
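
The sketch below is schematic only; the channel counts and layer choices are illustrative assumptions, not the paper's exact configuration. It simply shows how one scale's features could branch into a decoupled box head and keypoint head:

```python
# Schematic decoupled box/keypoint head attached to one PANet output scale.
import torch
import torch.nn as nn

NUM_ANCHORS, NUM_KPTS = 3, 17

class DecoupledPoseHead(nn.Module):
    def __init__(self, in_ch: int = 256):
        super().__init__()
        # box head: (cx, cy, w, h, box_conf, class_conf) per anchor
        self.box_head = nn.Conv2d(in_ch, NUM_ANCHORS * 6, kernel_size=1)
        # keypoint head: (x, y, conf) per keypoint per anchor
        self.kpt_head = nn.Conv2d(in_ch, NUM_ANCHORS * NUM_KPTS * 3, kernel_size=1)

    def forward(self, feat: torch.Tensor):
        return self.box_head(feat), self.kpt_head(feat)

head = DecoupledPoseHead()
box_out, kpt_out = head(torch.randn(1, 256, 80, 80))
```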

In this work, the complexity of YOLO-Pose is limited to 150 GMACs, within which it achieves competitive results. Increasing the complexity further would close the gap to top-down methods, but this path is not pursued because the focus is on real-time models.

2.2 Anchor-based multi-person pose formulation

For a given image, the anchor matched to a person stores that person's entire 2D pose along with the bounding box. The bounding box coordinates are transformed relative to the anchor center, and the box dimensions are normalized by the anchor height and width. Likewise, keypoint positions are transformed relative to the anchor center; however, keypoints are not normalized by the anchor height and width. Both keypoints and boxes are thus predicted with respect to the anchor center.
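
A minimal sketch of this anchor-relative encoding, under the assumptions above (the function and variable names are illustrative, not taken from the paper):

```python
# Encode box and keypoint targets relative to an anchor.
import numpy as np

def encode_targets(box_xywh, kpts_xy, anchor_cx, anchor_cy, anchor_w, anchor_h):
    """box_xywh: (cx, cy, w, h); kpts_xy: (17, 2) absolute keypoint positions."""
    cx, cy, w, h = box_xywh
    box_t = np.array([cx - anchor_cx,            # box center offset from anchor center
                      cy - anchor_cy,
                      w / anchor_w,              # box size normalized by anchor size
                      h / anchor_h])
    kpt_t = kpts_xy - np.array([anchor_cx, anchor_cy])  # keypoints: offset only,
    return box_t, kpt_t                                  # not normalized by anchor size

box_t, kpt_t = encode_targets((120., 80., 60., 140.), np.zeros((17, 2)), 128., 96., 32., 64.)
```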

Since the keypoint formulation is independent of the anchor width and height, YOLO-Pose can easily be extended to anchor-free object detection methods such as YOLOX and FCOS.

2.3 IoU Based Bounding-box Loss Function

Most object detectors optimize distance-based variants of the IoU loss, such as GIoU, DIoU, or CIoU loss, because these losses are scale-invariant and directly optimize the evaluation metric itself. YOLO-Pose uses the CIoU loss to supervise the bounding box. For a ground-truth bounding box matched to the kth anchor at location (i, j) and scale s, the loss is defined as:

L_box(s, i, j, k) = 1 - CIoU(Box_gt(s, i, j, k), Box_pred(s, i, j, k))

where Box_pred(s, i, j, k) is the predicted box for the kth anchor at location (i, j) and scale s. In YOLO-Pose there are 3 anchors per location, and predictions are made at 4 scales.
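
As a rough illustration, the following sketch computes an IoU-based box loss using torchvision's CIoU loss; using this library function rather than the authors' own implementation is an assumption:

```python
# 1 - CIoU between predicted and ground-truth boxes, matching L_box above.
import torch
from torchvision.ops import complete_box_iou_loss

pred_boxes = torch.tensor([[10., 10., 50., 60.]])   # (x1, y1, x2, y2)
gt_boxes   = torch.tensor([[12., 14., 48., 58.]])

l_box = complete_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")
print(l_box)
```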

2.4 Human Pose Loss Function Formulation

OKS is the most commonly used metric for evaluating keypoints. Traditionally, heatmap-based bottom-up methods use an L1 loss to detect keypoints. However, the L1 loss is not necessarily suited to obtaining optimal OKS, and it takes into account neither the scale of the person nor the type of keypoint. Since a heatmap is a probability map, OKS cannot be used as a loss in purely heatmap-based methods; it can only be used as a loss function when keypoint locations are regressed directly. Geng et al. used a scale-normalized L1 loss for keypoint regression, which is a step towards an OKS loss.

Therefore, the authors regress keypoint locations directly with respect to the anchor center, so that the evaluation metric itself can be optimized instead of a surrogate loss. The concept of an IoU loss is thus extended from boxes to keypoints.

For keypoints, Object Keypoint Similarity (OKS) plays the role of the IoU. The OKS loss is inherently scale-invariant and weights some keypoints more than others: for example, keypoints on a person's head (eyes, nose, ears) are penalized more than keypoints on the body (shoulders, knees, hips, etc.).


Unlike the standard IoU loss, whose gradient vanishes when the boxes do not overlap, the OKS loss never vanishes; in this respect, the OKS loss is more similar to the DIoU loss.

The entire pose is stored with each bounding box. Therefore, if a ground-truth bounding box matches an anchor at location (i, j) and scale s, the keypoints are predicted relative to that anchor's center. OKS is computed separately for each keypoint and then summed to give the final OKS loss, or keypoint IoU loss:

L_kpts(s, i, j, k) = 1 - Σ_n OKS_n = 1 - [Σ_n exp(-d_n² / (2·s²·k_n²)) · δ(v_n > 0)] / [Σ_n δ(v_n > 0)]

where d_n is the distance between the nth predicted keypoint and its ground truth, k_n is the per-keypoint weight, s is the object scale, and δ(v_n > 0) is the visibility flag of the nth keypoint.
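
A minimal sketch of such an OKS-style keypoint loss for one matched anchor; the tensor shapes and the use of the box area for the scale term are simplifying assumptions, not the authors' exact code:

```python
# OKS-based keypoint loss for one matched anchor.
import torch

# Standard COCO per-keypoint sigmas used by the OKS metric
KPT_SIGMAS = torch.tensor([.26, .25, .25, .35, .35, .79, .79, .72, .72,
                           .62, .62, 1.07, 1.07, .87, .87, .89, .89]) / 10.0

def oks_loss(pred_kpts, gt_kpts, vis, area, eps=1e-9):
    """pred_kpts, gt_kpts: (17, 2); vis: (17,) visibility flags; area: object scale."""
    d2 = ((pred_kpts - gt_kpts) ** 2).sum(dim=-1)            # squared distances
    oks = torch.exp(-d2 / (2 * area * KPT_SIGMAS ** 2 + eps))
    oks = (oks * vis).sum() / (vis.sum() + eps)               # average over visible kpts
    return 1.0 - oks                                           # keypoint IoU loss
```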

For each keypoint, a confidence parameter is also learned that indicates whether that keypoint exists for the person. Here, the keypoint visibility flags serve as the ground truth:

L_kpts_conf(s, i, j, k) = Σ_n BCE(δ(v_n > 0), p_conf_n)

The total loss is summed over all scales, locations, and anchors:

L_total = Σ_{s,i,j,k} (λ_cls·L_cls + λ_box·L_box + λ_kpts·L_kpts + λ_kpts_conf·L_kpts_conf)

where the hyperparameters λ_cls, λ_box, λ_kpts, and λ_kpts_conf are used to balance the losses.
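
A minimal sketch of the keypoint-confidence BCE term and the weighted total loss; the weight values are placeholders, not the paper's settings:

```python
# Keypoint-confidence BCE term and weighted total loss.
import torch
import torch.nn.functional as F

def kpt_conf_loss(pred_conf_logits, vis):
    """pred_conf_logits, vis: (17,) tensors; vis is the 0/1 visibility flag."""
    return F.binary_cross_entropy_with_logits(pred_conf_logits, vis.float(), reduction="sum")

def total_loss(l_cls, l_box, l_kpts, l_kpts_conf,
               lam_cls=1.0, lam_box=1.0, lam_kpts=1.0, lam_conf=1.0):
    return lam_cls * l_cls + lam_box * l_box + lam_kpts * l_kpts + lam_conf * l_kpts_conf
```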

2.5 Test Time Augmentations

All SOTA methods for pose estimation rely on test-time augmentation (TTA) to improve performance. Flip testing and multi-scale testing are the two most common techniques. Flip testing increases complexity by 2X, while multi-scale testing runs inference at three scales {0.5X, 1X, 2X}, increasing complexity by 0.25X + 1X + 4X = 5.25X. Combining flip and multi-scale testing increases complexity by 5.25 x 2 = 10.5X.

In addition to increasing computational complexity, preparing augmented data is itself expensive. For example, in a flip test, the image needs to be flipped, which increases the latency of the system. Similarly, multi-scale testing requires a resizing operation for each scale. These operations can be very expensive because they may not be accelerated, unlike CNN operations. There is an additional cost to fuse the outputs of the various forward passes. For embedded systems, being able to get competitive results without any TTA is the most important thing.

Therefore, all YOLO-Pose results are reported without any TTA.

2.6 Keypoint Outside Bounding Box

Top-down methods perform poorly under occlusion. One advantage of YOLO-Pose over top-down methods is that keypoints are not constrained to lie within the predicted bounding box; therefore, keypoints that fall outside the box due to occlusion can still be identified correctly. In top-down approaches, pose estimation also fails whenever the person is not detected correctly. YOLO-Pose mitigates both of these challenges, occlusion and incorrect box detection, to some extent, as shown in Figure 3.

Figure 3: Examples where keypoints falling outside the detected bounding box are still recovered.

2.7 ONNX Export for Easy Deployability

All ops used in YOLO-Pose are standard deep learning ops and are compatible with ONNX, so the entire model can be exported to ONNX, making it easy to deploy across platforms. This standalone ONNX model can be executed with ONNX Runtime, taking an image as input and inferring the bounding box and pose of each person in it. No top-down approach can be exported end-to-end to an intermediate ONNX representation in this way.
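
A minimal sketch of running such an exported model with ONNX Runtime; the model file name, input tensor name, and input size are assumptions, not values from the paper:

```python
# Run a YOLO-Pose-style ONNX model with ONNX Runtime.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("yolo_pose.onnx")     # hypothetical exported model
input_name = sess.get_inputs()[0].name

# Dummy preprocessed image: NCHW float32, values in [0, 1]
img = np.random.rand(1, 3, 640, 640).astype(np.float32)

outputs = sess.run(None, {input_name: img})
# outputs would contain, per detection, the box, scores and 17 keypoints
print([o.shape for o in outputs])
```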

2 Experimental results

3.1 Ablation experiment

1. OKS Loss vs L1 Loss

[Table: ablation comparing OKS loss and L1 loss]

2. Across Resolutions

[Tables: accuracy across input resolutions]

3. Quantization operation

[Table: quantization results]

The YOLOv5 model uses the sigmoid-weighted linear unit (SiLU) activation. Liu et al. observed that unbounded activation functions like SiLU or HardSwish are not quantization friendly, whereas models with ReLUX activations are robust to quantization due to their bounded range.

Therefore, the model is retrained with ReLU activations. A 1-2% drop in accuracy is observed when moving from SiLU to ReLU. These models are called YOLOv5_relu.
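
A minimal sketch (not the repo's actual training script) of swapping SiLU activations for ReLU in a PyTorch model before retraining:

```python
# Replace every nn.SiLU module with nn.ReLU, recursing through submodules.
import torch.nn as nn

def silu_to_relu(model: nn.Module) -> nn.Module:
    for name, child in model.named_children():
        if isinstance(child, nn.SiLU):
            setattr(model, name, nn.ReLU(inplace=True))
        else:
            silu_to_relu(child)   # recurse into submodules
    return model
```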

3.2 COCO results

[Table: comparison with state-of-the-art methods on COCO]

3 Reference

[1] YOLO-Pose: Enhancing YOLO for Multi-Person Pose Estimation Using Object Keypoint Similarity Loss
