Motion capture and character model driving based on mediapipe

When I was working on this as an intern, I searched for a lot of material, but much of it was not explained thoroughly. Here I will summarize my insights after reading the code, including some points that I think are very important.

1. Difficulties of the task:

1. How to detect key points?

2. How to process the detected key point data?

3. How do the skeletal key point data correspond to the character model?

4. How to constrain the joints of the character model?

2. Solution:

1. Key point detection:

Mediapipe is used here. Its pose_landmark module outputs not only the coordinates of 2D key points but also the coordinates of 3D key points. The 2D coordinates take the upper-left corner of the image as the origin, while the 3D coordinates take the hips as the origin, i.e. the midpoint of left_hip and right_hip in Figure 1. If the 3D key points are used directly as input, there is no global translation or rotation. Since both the key points in 3D space and their 2D projections are known, the PnP algorithm can be used to solve for the camera pose, from which the position of the 3D key points relative to the camera can be recovered.

Figure 1
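
To make the two outputs concrete, here is a minimal sketch, assuming the legacy mp.solutions.pose Python API; the variable names are mine:

import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

cap = cv2.VideoCapture(0)
with mp_pose.Pose(static_image_mode=False) as pose:
    ok, frame = cap.read()
    if ok:
        results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.pose_landmarks:
            # 2D landmarks: x/y normalized by image width/height, origin at top-left
            lm_2d = [(p.x, p.y) for p in results.pose_landmarks.landmark]
            # 3D world landmarks: meters, origin at the midpoint of the hips
            lm_3d = [(p.x, p.y, p.z) for p in results.pose_world_landmarks.landmark]
cap.release()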

When I first encountered the world coordinate system, camera coordinate system, image coordinate system, and pixel coordinate system in 3D vision, I was confused. The PnP algorithm is essentially a transformation between these coordinate systems.

1.1. Coordinate systems in 3D vision:

1.1.1 World coordinate system:

The world coordinate system is the Xw-Yw-Zw coordinate system in Figure 2. Simply put, it is the real three-dimensional world we live in.

Figure 2

1.1.2 Camera coordinate system:

The Xc-Yc-Zc coordinate system in Figure 2. It is equivalent to the world coordinate system after an arbitrary rotation and translation, and can be understood as placing a camera somewhere in the real world.

1.1.3 Image coordinate system:

The UV coordinate system in Figure 2, which has no Z axis. It is equivalent to projecting points in the camera coordinate system onto a plane parallel to the Xc and Yc axes of the camera coordinate system.

1.1.4 Pixel coordinate system:

The XY coordinate system in Figure 2, where positions are measured in pixels as offsets relative to the origin.

1.2. Conversion between coordinate systems:

World coordinate system - camera coordinate system:

P_{c} = R \cdot P_{w} + T

where R is the rotation matrix and T the translation vector (the camera's extrinsic parameters).

Camera coordinate system - image coordinate system:

x = f \cdot X_{c} / Z_{c}, \quad y = f \cdot Y_{c} / Z_{c}

An f is used here, which represents the focal length of the camera, one of the camera's intrinsic parameters.

Image coordinate system - pixel coordinate system:

u = x / d_{x} + u_{0}, \quad v = y / d_{y} + v_{0}

where d_{x} and d_{y} are the physical size of one pixel and (u_{0}, v_{0}) is the principal point.

Given matching 3D points and their 2D projections, plus the camera intrinsics, the PnP algorithm can solve the pose of the camera coordinate system relative to the world coordinate system, i.e. the rotation matrix R and the translation matrix T.

The x and y of a point in the camera coordinate system projected onto the plane may not match the x and y of the detected 2D point. To keep the two consistent, keep the z coordinate in the camera coordinate system unchanged and adjust x and y directly so that the projections agree.
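
Putting the pieces together, the camera pose can be recovered with OpenCV's solvePnP. A sketch under stated assumptions: the intrinsics below are a rough guess (focal length ≈ image width, principal point at the image center), not calibrated values, and the helper name is mine. The normalized 2D landmarks must first be multiplied by the image width/height to get pixel coordinates.

import numpy as np
import cv2

def solve_global_pose(pts_3d, pts_2d_px, img_w, img_h):
    # Rough intrinsics: focal length ~ image width, principal point at center
    f = float(img_w)
    camera_matrix = np.array([[f, 0, img_w / 2.0],
                              [0, f, img_h / 2.0],
                              [0, 0, 1.0]])
    dist_coeffs = np.zeros(4)  # assume no lens distortion
    # pts_3d: hip-centered world landmarks; pts_2d_px: matching pixel coordinates
    ok, rvec, tvec = cv2.solvePnP(np.asarray(pts_3d, dtype=np.float64),
                                  np.asarray(pts_2d_px, dtype=np.float64),
                                  camera_matrix, dist_coeffs)
    R, _ = cv2.Rodrigues(rvec)  # rotation vector -> rotation matrix
    return R, tvec              # pose of the landmarks relative to the camera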

2. Data processing of key points:

First, why do the key point data need processing? Through observation, we find that when mediapipe detects key points on the human body, some of them jitter at high frequency. To eliminate the jitter and make the motion of the key points smoother, filtering is used. There are many filtering methods, such as moving average filtering, Kalman filtering, One Euro filtering, moving least squares, and so on. I will briefly introduce each of them:

2.1. Moving average filtering:

Set a sampling window and take the average within it. For example, with a window of 5 sampling points, the output is the average of the last five 3D coordinate points. The code is shown below:

import numpy as np

# Window size; adjust as needed
window_size = 5
# Queue holding the most recent frames of keypoints
keypoint_queue = []

# Process the 3D keypoints of each frame
def process_3d_keypoints(frame_3d_keypoints):
    # Append the new 3D keypoints to the queue
    keypoint_queue.append(frame_3d_keypoints)
    # If the queue exceeds the window size, drop the oldest element
    if len(keypoint_queue) > window_size:
        keypoint_queue.pop(0)
    # Compute the moving average over the buffered frames
    smoothed_keypoints = np.mean(keypoint_queue, axis=0)
    # Do something with the smoothed keypoints here, e.g. visualization
    return smoothed_keypoints

2.2. Kalman filtering:

In the Kalman filter there are two kinds of variables: the predicted state (the value computed from the model, representing the ideal value) and the actually measured observation (which contains error). The idea of Kalman filtering is to fuse the predicted state with the measured observation to obtain data close to the true value, use the result as the state for the next step, fuse it with the next observation, and so on.

import cv2
import numpy as np

class KalmanFilterWrapper:
    def __init__(self, input_dim, init_error, init_process_var, init_measure_var):
        self.input_dim = input_dim
        self.init_error = init_error
        self.init_process_var = init_process_var
        self.init_measure_var = init_measure_var
        self.kf, self.state_post = self._create_kalman_filter()

    def _create_kalman_filter(self):
        # State and measurement have the same dimension; the transition and
        # measurement models are identity matrices (each coordinate is assumed
        # to stay where it was, plus noise).
        kf = cv2.KalmanFilter(self.input_dim, self.input_dim)
        kf.transitionMatrix = np.eye(self.input_dim, dtype=np.float32)
        kf.measurementMatrix = np.eye(self.input_dim, dtype=np.float32)
        kf.processNoiseCov = self.init_process_var * np.eye(self.input_dim, dtype=np.float32)
        kf.measurementNoiseCov = self.init_measure_var * np.eye(self.input_dim, dtype=np.float32)
        kf.errorCovPost = self.init_error * np.eye(self.input_dim, dtype=np.float32)
        return kf, kf.statePost

    def filter(self, observation):
        # Fuse the current prediction with a new observation
        return self.kf.correct(observation)

    def predict(self):
        # Propagate the state one step forward
        return self.kf.predict()

The program above wraps a linear Kalman filter. The initial state is set to the key point coordinates from the first detection. The process variance represents the uncertainty of the system model assumed by the filter; the measurement variance represents the uncertainty of the sensor. If the system model is unreliable, increase the process variance; if the measurements are noisy, increase the measurement variance. The initial error and both variances must be tuned to the actual situation. input_dim means the key points are flattened into one dimension. Three Kalman filters are used here: one each for the body, the left hand, and the right hand. Some parts of the body are strongly correlated and some weakly, so filtering them separately works better.
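
A hypothetical usage sketch (the dimensions and variances are illustrative, not the original project's values): flatten the 33 body landmarks into one vector, predict, then correct with the new measurement.

import numpy as np

# Hypothetical: one filter for the 33 body landmarks, flattened to 33 * 3 = 99 dims
body_kf = KalmanFilterWrapper(input_dim=99, init_error=1.0,
                              init_process_var=1e-4, init_measure_var=1e-2)

def smooth_frame(keypoints_3d):                    # keypoints_3d: (33, 3) array
    # On the first frame, the state could be seeded with the raw detection
    # via body_kf.kf.statePost, as described above.
    body_kf.predict()                              # propagate the state forward
    z = keypoints_3d.reshape(-1, 1).astype(np.float32)
    smoothed = body_kf.filter(z)                   # fuse prediction and measurement
    return smoothed.reshape(33, 3)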

2.3. One Euro filtering:

The One Euro filter is a low-pass filter that removes noise in real time and needs only two parameters to be tuned: the minimum cutoff frequency and the speed coefficient. These two are usually adjusted together to strike a dynamic balance: lowering the minimum cutoff frequency greatly reduces the amplitude of the jitter but introduces lag, while increasing the speed coefficient reduces lag but weakens the de-jittering effect.

Here is the corresponding code:

import math

class OneEuroFilter:

    def __init__(self, t0, x0, dx0=0.0, min_cutoff=1.0, beta=0.0, d_cutoff=1.0):
        """Initialize the one euro filter."""
        # The parameters.
        self.min_cutoff = float(min_cutoff)
        self.beta = float(beta)
        self.d_cutoff = float(d_cutoff)

        # Previous values.
        self.x_prev = x0
        self.dx_prev = float(dx0)
        self.t_prev = t0

    def smoothing_factor(self, t_e, cutoff):
        r = 2 * math.pi * cutoff * t_e
        return r / (r + 1)

    def exponential_smoothing(self, a, x, x_prev):
        return a * x + (1 - a) * x_prev

    def filter_signal(self, t, x):
        """Compute the filtered signal."""
        t_e = t - self.t_prev

        # The filtered derivative of the signal.
        a_d = self.smoothing_factor(t_e, self.d_cutoff)
        dx = (x - self.x_prev) / t_e
        dx_hat = self.exponential_smoothing(a_d, dx, self.dx_prev)

        # The filtered signal.
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = self.smoothing_factor(t_e, cutoff)
        x_hat = self.exponential_smoothing(a, x, self.x_prev)

        # Memorize the previous values.
        self.x_prev = x_hat
        self.dx_prev = dx_hat
        self.t_prev = t

        return x_hat
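
A hypothetical usage sketch: since filter_signal uses only elementwise arithmetic and abs(), it can smooth a whole landmark array at once. Timestamps must strictly increase (t_e > 0).

import numpy as np

# Hypothetical usage: initialize with the first frame, then feed each new frame
t0, kps0 = 0.0, np.zeros((33, 3))           # first timestamp and landmarks
euro = OneEuroFilter(t0, kps0, min_cutoff=1.0, beta=0.05)

def smooth_frame(t, keypoints_3d):          # t: seconds, keypoints_3d: (33, 3)
    return euro.filter_signal(t, keypoints_3d)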

Figure: overall logic of the filtering code.

2.4. Moving Least Squares (MLS):

In the MLS method, a fitting curve is constructed near a group of nodes at different locations. Each node has its own set of coefficients a_j. The coefficients of each node consider only the neighboring sampling points, and the closer a sampling point is to the node, the larger its contribution.

from typing import List
import numpy as np

def mls_smooth_numpy(input_t: List[float], input_y: List[np.ndarray], query_t: float, smooth_range: float):
    # 1-D MLS: input_t: (N), input_y: (..., N), query_t: scalar
    if len(input_y) == 1:
        return input_y[0]
    # input_t: time differences relative to the query frame
    input_t = np.array(input_t) - query_t
    # input_y: stack the buffered frames along the last axis
    input_y = np.stack(input_y, axis=-1)
    broadcaster = (None,)*(len(input_y.shape) - 1)
    # w: weight of each sampling point; linearly decays to 0 outside smooth_range
    w = np.maximum(smooth_range - np.abs(input_t), 0)
    # input_t[broadcaster]: shape (1, ..., N); w[broadcaster]: shape (1, ..., N)
    coef = moving_least_square_numpy(input_t[broadcaster], input_y, w[broadcaster])
    return coef[..., 0]

def moving_least_square_numpy(x: np.ndarray, y: np.ndarray, w: np.ndarray):
    # 1-D MLS: x: (..., N), y: (..., N), w: (..., N)
    # p: the linear basis [1, x] evaluated at every sampling point
    p = np.stack([np.ones_like(x), x], axis=-2)             # (..., 2, N)
    M = p @ (w[..., :, None] * p.swapaxes(-2, -1))
    a = np.linalg.solve(M, (p @ (w * y)[..., :, None]))
    a = a.squeeze(-1)
    return a

The effect of MLS depends on the length of the processed data and on the weight assigned to each sampling point.
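
A hypothetical usage sketch: keep a short buffer of recent frames and query the smoothed value at the newest timestamp. The buffer length and smooth_range are illustrative; smooth_range should exceed the largest time difference so every buffered frame keeps a nonzero weight.

import numpy as np

times = [0.00, 0.03, 0.07, 0.10, 0.13]              # timestamps of buffered frames
frames = [np.random.rand(33, 3) for _ in times]     # buffered (33, 3) keypoints
smoothed = mls_smooth_numpy(times, frames, query_t=times[-1], smooth_range=0.2)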

2.5. Introducing the center of mass:

Although the filtering methods above smooth the individual key points, the body as a whole can still shake, which looks very unnatural during large movements. One solution I saw from an expert is to assign a weight to each key point and take the weighted sum to obtain an approximate center of mass of the body. The smoothing methods above are then applied to this center of mass, which makes its trajectory very smooth and stabilizes large movements of the whole body. The weight of each key point can be chosen according to its distance from the body's actual center of mass, with the constraint that all weights sum to 1: the closer a point is to the center of the body, the larger its weight (for example, left_hip and right_hip are close to the center of gravity, so their weights can be increased), and vice versa.
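
A sketch of the idea; the weights here are made up for illustration and would be tuned in practice (23/24 are mediapipe's left_hip/right_hip indices).

import numpy as np

NUM_LANDMARKS = 33
LEFT_HIP, RIGHT_HIP = 23, 24                # mediapipe pose indices

# Illustrative weights: the hips dominate, everything else shares the rest;
# all weights sum to 1.
weights = np.full(NUM_LANDMARKS, 0.5 / (NUM_LANDMARKS - 2))
weights[LEFT_HIP] = weights[RIGHT_HIP] = 0.25

def center_of_mass(keypoints_3d):           # keypoints_3d: (33, 3)
    return (weights[:, None] * keypoints_3d).sum(axis=0)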

3. Correspondence between skeletal key points and character model:

3.1. Key point correspondence:

Figure 3

Figure 3 shows the skeleton key points in Unity, but they do not correspond one-to-one to the key points in mediapipe. That is, some bones defined in Unity are not defined in mediapipe, and some points defined in mediapipe are not defined in Unity. So we have to calculate some of the values ourselves based on proportions:

Figure 4 

For example, mediapipe has no hips, spine, chest, upper chest, neck, etc., so the coordinates of these key points must be computed from the positions of existing bones. If I want to compute hips, I roughly measure the position of the model's hips relative to left_hip and right_hip, and then scale and offset based on the detected left_hip and right_hip. The same goes for the other bones. In addition, the 3D coordinates in mediapipe are relative to the hips, so in Unity only the offset of the hips needs to be computed to drive the translation of the body; all other joints rotate around their parent joints, which realizes the various movements of the human body.
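
A hedged sketch of the ratio-based derivation; the interpolation factors are illustrative and would be measured from the specific model.

import numpy as np

LEFT_HIP, RIGHT_HIP = 23, 24
LEFT_SHOULDER, RIGHT_SHOULDER = 11, 12      # mediapipe pose indices

def derive_extra_joints(kps):               # kps: (33, 3) world landmarks
    hips = 0.5 * (kps[LEFT_HIP] + kps[RIGHT_HIP])
    neck = 0.5 * (kps[LEFT_SHOULDER] + kps[RIGHT_SHOULDER])
    # Place spine/chest on the hips-neck segment; the ratios are illustrative
    spine = hips + 0.33 * (neck - hips)
    chest = hips + 0.66 * (neck - hips)
    return {"hips": hips, "spine": spine, "chest": chest, "neck": neck}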

3.2. Correspondence of bone length:

This is also a good idea. Everyone differs in height and build, so bone lengths differ, and the lengths detected by mediapipe differ accordingly. Likewise, the proportions of virtual character models differ (especially cartoon characters). To drive the character model well, the key points of each body part must be mapped onto the model, and the coupling between target key points must be considered as well: the body's bones are coupled, so the problem cannot be solved independently for a few parts; it must be optimized as a whole.

3.3. Bone rotation:

In Unity, you have to compute the coordinates of the key points and the angles formed between joints one by one. Quaternions are mainly used to represent the rotation of a bone relative to its parent bone. Why not use Euler angles? Because of the gimbal lock problem. Briefly: with three axes x, y, z, if you rotate 90 degrees around y, the x and z axes coincide, making it impossible to tell whether a subsequent rotation is around the x axis or the z axis. There are many solutions, and quaternions are one of them. For details, please see the link below:

https://blog.csdn.net/euphorias/article/details/123612227

Rotation in Unity mainly uses quaternions.

Inverse method:

Quaternion Inverse(Quaternion rotation)

Rotation method:

Quaternion LookRotation(Vector3 forward, [DefaultValue("Vector3.up")] Vector3 upwards);

The following finds the intermediate matrix:

The intermediate matrix is equivalent to the rotation required to bring the bone to the same orientation as its parent bone.

// Inverse of the initial look rotation of the hips
hip.Inverse = Quaternion.Inverse(Quaternion.LookRotation(forward));
// Intermediate matrix: the initial rotation expressed in the hips' local frame
hip.InverseRotation = hip.Inverse * hip.InitRotation;

Then the rotation of the bone is calculated, and finally the bone's final orientation is obtained through the intermediate matrix.

// Get the hips transform from the Humanoid avatar
root = animator.GetBoneTransform(HumanBodyBones.Hips);
// Intermediate rotation of the root relative to the detected forward direction
midRoot = Quaternion.Inverse(root.rotation) * Quaternion.LookRotation(forward);

Calculating the angles one by one as above is quite troublesome. The author of a project I found online instead wrote a solver using the optimizer in PyTorch, so the joint angles are not solved independently but as a whole. I think this is the essence of the whole project, so I will focus on explaining it.

3.4. Overall solution:

The author used the L-BFGS algorithm, the limited-memory version of the BFGS algorithm. It is well suited to large-scale numerical computation: it converges as fast as Newton's method, but it does not need to store the Hessian matrix the way Newton's method does, saving a great deal of space and computing resources.

3.4.1 Solving with Newton's method:

Newton's method to find roots:

a. Pick any point x1 on the x-axis and draw a vertical line through it to obtain the point (x1, f(x1)) on the curve.

b. Draw the tangent line to the function at (x1, f(x1)) and obtain the intersection point x2 of the tangent line with the x-axis.

c. Repeat steps a and b.

d. After repeated iteration, xn converges to a root of f(x).

Newton's method approaches the root by successive approximation. When the difference between x_{k-1} and x_k is smaller than a threshold, x_k can be taken as a root of f(x).

Using the derivative, the iteration for the root is x_{n} = x_{n-1} - f(x_{n-1}) / f'(x_{n-1}).

The essence of using Newton's method to find a stationary point: take the second-order Taylor expansion and set its derivative to 0, which gives x_{k+1} = x_{k} - f'(x_{k}) / f''(x_{k}).
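
Spelling out that step:

f(x) \approx f(x_{k}) + f'(x_{k})(x - x_{k}) + \frac{1}{2} f''(x_{k})(x - x_{k})^{2}

Setting the derivative with respect to x to zero gives f'(x_{k}) + f''(x_{k})(x - x_{k}) = 0, hence x = x_{k} - f'(x_{k}) / f''(x_{k}).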

In machine learning, x is not a scalar but a vector, f'(X_{k}) is also a vector, and f''(X_{k}) is a matrix, called the Hessian matrix. So the formula above becomes X_{k+1} = X_{k} - H_{k}^{-1} g_{k}, where g_{k} denotes the gradient f'(X_{k}) and H_{k}^{-1} denotes the inverse of the Hessian. But when X has many dimensions, the second derivatives are hard to compute and many iterations are required.

The BFGS algorithm emerged to solve the difficulty of computing H_{k}^{-1}. Its essence: approximate H_{k}^{-1} through iteration, then use the approximation in place of H_{k}^{-1}.

Approximation method:

D_{k+1} = \left[ I - \frac{S_{k} Y_{k}^{T}}{Y_{k}^{T} S_{k}} \right] D_{k} \left[ I - \frac{Y_{k} S_{k}^{T}}{Y_{k}^{T} S_{k}} \right] + \frac{S_{k} S_{k}^{T}}{Y_{k}^{T} S_{k}}

where I is the identity matrix, D_{k} corresponds to H_{k}^{-1}, S_{k} = X_{k+1} - X_{k}, Y_{k} = g_{k+1} - g_{k}, and g_{k} denotes the gradient of the original function.

Steps:

a. Pick an arbitrary x1; the gradient g1 of the loss function at x1 can then be computed, and the initial value of D1 is the identity matrix I.

b. x2 = x1 - D_{1} * g1, so x2 can be found, and from it g2.

c. s1 = x2 - x1, y1 = g2 - g1; substitute into the formula to find D2.

d. x3 = x2 - D2 * g2, so x3 can be found, and from it g3; substitute into the formula to find D3, then compute x4.

e. Repeat the steps above until convergence, e.g. until the change between iterations falls below a threshold; at that point D_{k} approximates H_{k}^{-1}.

L-BFGS algorithm:

At every step the BFGS algorithm must store the matrix D_{k}, which requires a large amount of memory and may make the computation infeasible. The L-BFGS algorithm therefore keeps only the most recent m iterations of information and discards the rest (which can be regarded as a further approximation of BFGS), greatly reducing the storage space required.

The above traces the step-by-step derivation of the L-BFGS algorithm from Newton's method: it stands in for Newton's method, optimized for speed and memory. The author chains all the bones of the body together, i.e. the movement of a parent bone affects its child bones, the child bones affect the grandchild bones, and so on, the influence being passed down the chain. L-BFGS is used here to optimize the Euler angles of every bone rotation jointly. The loss function is: the constraint on each bone's rotation Euler angles + a weak L2 regularization of the angles + preservation of bone length before and after rotation.
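
A sketch of the whole-body optimization, not the author's actual code: forward_kinematics, the joint limits, the joint count, and the loss weights below are placeholders I made up to show the structure, and the bone-length term is omitted for brevity.

import torch

num_joints = 24                                           # illustrative joint count
angles = torch.zeros(num_joints, 3, requires_grad=True)   # per-joint Euler angles
target_keypoints = torch.zeros(num_joints, 3)             # detected key points (placeholder)
lower_limits = torch.full((num_joints, 3), -1.5)          # illustrative limits (radians)
upper_limits = torch.full((num_joints, 3), 1.5)

def forward_kinematics(a):
    # Placeholder FK: a real implementation would rotate each bone around its
    # parent and return the resulting joint positions.
    return a

optimizer = torch.optim.LBFGS([angles], max_iter=20)

def closure():
    optimizer.zero_grad()
    joints = forward_kinematics(angles)
    loss = ((joints - target_keypoints) ** 2).sum()              # fit detected key points
    loss = loss + 1e-3 * (angles ** 2).sum()                     # weak L2 regularization
    loss = loss + (torch.relu(angles - upper_limits) ** 2).sum() # joint-limit penalty
    loss = loss + (torch.relu(lower_limits - angles) ** 2).sum()
    loss.backward()
    return loss

optimizer.step(closure)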

The author did this in Blender, while I used Unity. At the time I wondered whether bone rotation in Unity could also be optimized with the L-BFGS algorithm, but I tried various approaches without success. My idea then was to use Unity's localToWorld and worldToLocal matrices for the transformation, with p: parent, c: child, cc: grandchild.

4. Constraints on the joint rotation angles of the character model:

In Blender, the angle constraints are applied inside the L-BFGS optimization above: limits are defined for the rotation angles around the x, y, and z axes, and whenever these limits are exceeded, a corresponding penalty is added to suppress further change of that angle.

In Unity, you can first calculate the joint angle, clamp it to its limits, and then convert it into a quaternion. A sketch of the clamping step follows.
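
A minimal Python sketch with illustrative limits; in Unity the same logic would run in C# before building the quaternion.

import numpy as np

# Illustrative per-axis limits in degrees; real values depend on the joint
ELBOW_LIMITS = np.array([[-5.0, 145.0],    # x: (min, max)
                         [-10.0, 10.0],    # y
                         [-90.0, 90.0]])   # z

def clamp_euler(euler_xyz):
    """Clamp per-axis Euler angles before converting them to a quaternion."""
    return np.clip(euler_xyz, ELBOW_LIMITS[:, 0], ELBOW_LIMITS[:, 1])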

5. Code part:

Python part of the code: https://github.com/ykyk000

If there are any errors, please correct me!
