Machine Learning Notes - Understanding MediaPipe and Combining It with OpenCV for Human Pose Estimation

1. Overview of MediaPipe

        MediaPipe is an open-source, cross-platform framework developed by Google that provides customizable machine learning solutions for live and streaming media. It is a very lightweight framework, and its solutions can run in real time on CPUs across multiple platforms.

        MediaPipe supports many detection scenarios:

        Face detection, face mesh, gesture recognition, hand landmark recognition, human pose estimation (the 3D coordinates of 33 keypoints can be given), object detection and tracking, and so on.

        The following platforms and languages are also supported: Android, iOS, C++, Python, JavaScript (JS), and Coral.

        But not all features are available in every language.

        For details, please refer to the documentation: https://google.github.io/mediapipe/

2. Overview of Human Pose Estimation

1. Human pose estimation

        Human pose estimation represents the human body as a graphical skeleton, which helps in analyzing human activity. A skeleton is essentially a set of coordinates that describe a person's pose. Each joint is an individual coordinate called a keypoint or pose landmark, and the connections between keypoints are called pairs.
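
        As a concrete illustration of keypoints and pairs, here is a minimal sketch; the keypoint names and coordinates below are made up for the example.

# A skeleton as a set of named keypoints plus the pairs that connect them
# (the keypoint names and coordinates are invented for illustration)
keypoints = {
    "left_shoulder": (0.42, 0.30),
    "right_shoulder": (0.58, 0.30),
    "left_elbow": (0.35, 0.45),
}

# pairs: which keypoints are joined by a line when drawing the skeleton
pairs = [
    ("left_shoulder", "right_shoulder"),
    ("left_shoulder", "left_elbow"),
]

for a, b in pairs:
    print("draw a line from", keypoints[a], "to", keypoints[b])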

        With pose estimation, we are able to track human movements and activities in real-world space. This opens up a wide range of application possibilities. It is a powerful technique that helps to build complex applications very efficiently.

2. Application of Human Pose Estimation

(1) Estimation of human activities

        Pose estimation can be used to track human activities such as walking, running, sleeping, and drinking, which provides information about a person's behavior. Activity estimation can enhance security and surveillance systems.

(2) Movement transfer

        One of the most interesting applications of human pose estimation is motion transfer. In movies and games, we see 3D characters move their bodies like real humans or animals. By tracking human poses, 3D-rendered graphics can be animated to follow the movements of a human body.

(3) Robots

        Human pose estimation can be used to train the motion of robots. Instead of manually programming a robot to follow a specific path, a human pose skeleton can be used to train the robot's joint movements.

(4) Games

        Virtual reality (VR) games are very popular these days. In VR games, 3D poses are estimated by one or more cameras, and the game characters move according to the player's movements.

3. Pose estimation models

        There are three main types of pose estimation models:

        1. Kinematic model: a skeleton-based model that represents the human body.

        2. Planar model: a contour-based model that uses the contours around the human body to represent its shape.

        3. Volumetric model: volumetric models create a 3D mesh of the human body, representing both its shape and its appearance.

4. Categories of Pose Estimation

        1. 2D pose estimation: In 2D pose estimation, only the x and y coordinates of each landmark in the image are predicted. It does not provide any information about the skeleton's angles or the rotation or orientation of the object or body instance.

        2. 3D pose estimation: 3D pose estimation allows us to predict the spatial position of a person. It provides the x, y, and z coordinates for each landmark. With 3D pose estimation, we can determine the angle of each joint of the human skeleton.

        3. Rigid body pose estimation: Rigid body pose estimation is also called 6D pose estimation. In addition to the pose of the human body, it provides the rotation and orientation of the body instance.

        4. Single pose estimation: In a single pose estimation model, only one person's pose can be predicted in an image.

        5. Multi-pose estimation: In multi-pose estimation, multiple human poses can be predicted simultaneously in one image.

5. The process of pose estimation

        Pose estimation mainly uses deep learning solutions to predict human pose landmarks. It takes an image as input and provides pose landmarks as output for each instance.

        There are two main approaches to pose estimation, contrasted in the sketch after this list:

        1. Bottom-up: In this approach, every instance of each keypoint is first predicted across the whole image, and the keypoints are then grouped into final skeletons.

        2. Top-down: In the top-down approach, the objects/people are first detected in the given image, and landmarks are then predicted within each cropped object instance.
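
        The following sketch contrasts the two pipelines. All of the functions are hypothetical stand-in stubs rather than a real library API; only the control flow matters here.

def detect_keypoints(image):
    # stub: every instance of every keypoint type in the whole image,
    # as (name, person_id, (x, y)) tuples
    return [("nose", 0, (0.5, 0.2)), ("nose", 1, (0.8, 0.2))]

def group_into_skeletons(keypoints):
    # stub: group keypoints belonging to the same person into one skeleton
    skeletons = {}
    for name, person_id, xy in keypoints:
        skeletons.setdefault(person_id, {})[name] = xy
    return list(skeletons.values())

def detect_people(image):
    # stub: bounding boxes (x1, y1, x2, y2) of the people in the image
    return [(0, 0, 50, 100), (60, 0, 110, 100)]

def estimate_pose(crop):
    # stub: landmarks predicted inside one cropped person region
    return {"nose": (0.5, 0.2)}

def bottom_up(image):
    # predict all keypoints in the whole image, then group into skeletons
    return group_into_skeletons(detect_keypoints(image))

def top_down(image):
    # detect each person first, then predict landmarks in each crop
    skeletons = []
    for (x1, y1, x2, y2) in detect_people(image):
        crop = [row[x1:x2] for row in image[y1:y2]]
        skeletons.append(estimate_pose(crop))
    return skeletons

image = [[0] * 120 for _ in range(100)]  # dummy image as a 2D list
print(bottom_up(image))
print(top_down(image))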

6. Some popular pose estimation models

        1. OpenPose: OpenPose is one of the most popular multi-person human pose estimation methods. It is an open-source, real-time multi-person detection system with high-accuracy keypoints.

        2. DeepPose: DeepPose uses deep neural networks to estimate human pose. Its architecture captures the joints with convolutional layers, followed by fully connected layers that regress the joint coordinates.

        3. PoseNet: PoseNet is built on top of TensorFlow.js. It is a lightweight architecture that runs on mobile devices and in browsers.

3. Combining MediaPipe with OpenCV for Human Pose Estimation

1. Code reference

        MediaPipe uses TensorFlow Lite on the backend. A person's region of interest (ROI) is first located within the frame using a detector. The ROI-cropped frame is then used as input to predict the landmarks/keypoints within the ROI. The MediaPipe pose estimator detects a total of 33 keypoints.

        MediaPipe pose estimation is a single-person 3D pose estimator. It detects the x, y, and z coordinates of each landmark. The z coordinate is essentially depth information: it indicates how near or far a landmark is from the camera relative to the other landmarks.
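
        For example, the individual landmark coordinates can be read from results.pose_landmarks.landmark after a frame has been processed. A minimal sketch for a single image ('person.jpg' is a placeholder path for any image containing a person):

import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

# run the pose estimator once on a single image
with mp_pose.Pose(static_image_mode=True) as pose:
    image = cv2.imread('person.jpg')  # assumes the file exists
    results = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

if results.pose_landmarks:
    for idx, lm in enumerate(results.pose_landmarks.landmark):
        # x and y are normalized to [0, 1]; z is the relative depth
        # (smaller values mean closer to the camera)
        print(idx, mp_pose.PoseLandmark(idx).name, lm.x, lm.y, lm.z)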

        Because the OpenCV build I am using does not include GUI support, the recognized frames are saved to a folder; otherwise the results could be viewed in real time.

# TechVidvan Human pose estimator
# import necessary packages

import cv2
import mediapipe as mp

# initialize Pose estimator
mp_drawing = mp.solutions.drawing_utils
mp_pose = mp.solutions.pose

pose = mp_pose.Pose(
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5)

# create capture object
cap = cv2.VideoCapture('C:\\Users\\xiaomao\\Desktop\\123.mp4')

i = 0
while cap.isOpened():
    # read frame from capture object
    ret, frame = cap.read()
    if not ret:
        # end of video (or read error): stop the loop instead of
        # passing a None frame to cvtColor
        break

    # convert the frame from BGR (OpenCV default) to RGB (MediaPipe input)
    RGB = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

    # process the RGB frame to get the pose landmarks
    results = pose.process(RGB)
    print(results.pose_landmarks)

    # draw the detected skeleton on the frame
    # (draw_landmarks does nothing when no pose was detected)
    mp_drawing.draw_landmarks(
        frame, results.pose_landmarks, mp_pose.POSE_CONNECTIONS)

    # save the annotated frame to disk (this OpenCV build has no GUI)
    cv2.imwrite('C:\\Users\\zyh\\Desktop\\123456\\' + str(i) + '.jpg', frame)
    i = i + 1

pose.close()
cap.release()
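
        If the OpenCV build does include GUI support (the highgui module), the cv2.imwrite line in the loop can be replaced with a live preview instead; a minimal sketch of the replacement lines:

# replacement for the cv2.imwrite(...) line when highgui is available
cv2.imshow('Pose estimation', frame)    # show the annotated frame
if cv2.waitKey(1) & 0xFF == ord('q'):   # press 'q' to stop early
    break

        The cv2.waitKey(1) call is also what gives the window time to refresh between frames.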

2. Result preview

Origin: https://blog.csdn.net/bashendixie5/article/details/123508782