PICO's self-developed multi-modal tracking algorithm: new ideas for "controller miniaturization"

Authors: Zhang Tao, Lin Zeyi, Wen Chao, Zhao Yang

R&D background

As a tracked accessory of the headset, a VR controller has its spatial trajectory computed through the HMD's (head-mounted display's) inside-out optical tracking, which is combined with a 6-axis IMU to achieve 6DoF spatial positioning. Together with the controller's physical buttons, vibration motors, and joysticks, this gives users realistic, fine-grained haptic feedback, further enhancing the capability and immersion of virtual-reality human-computer interaction. It is also why controller-free solutions currently struggle to match the experience.

Mainstream VR controller tracking solutions today include optical tracking, self-tracking, and electromagnetic tracking.

[Figure: comparison of mainstream VR controller tracking solutions]

Thanks to its high accuracy, low power consumption, and low cost, optical tracking is currently the dominant VR controller tracking method. To keep the IR (infrared) LEDs from being easily occluded, the controller body usually carries a prominent raised tracking ring.

However, to follow the trend of miniaturization in VR devices, make controllers easier to carry, and offer a more natural way to interact, PICO removed the tracking ring and instead arranged a small number of IR LEDs within the limited area of the controller body.

Centaur multi-modal fusion algorithm architecture

A smaller controller and fewer IR LEDs also mean more frequent occlusion. Solving controller tracking under occlusion was the key problem facing the PICO R&D team.

Building on the team's accumulated expertise in optical tracking and bare-hand tracking, PICO proposed a novel neural-network-based multi-modal controller tracking architecture that fuses inertial measurement unit (IMU), optical sensor, and hand-image information. When the controller is occluded, bare-hand tracking provides more accurate observations; in turn, the controller provides accurate predictions for hand tracking. The two are deeply integrated and assist each other.

Bare-hand tracking

Because the controller occludes the hand, the hand's visual features are often weak, which frequently causes tracking failures. To address this, the bare-hand algorithm team proposed a novel bottom-up ("Down-Top") end-to-end 6DoF tracking algorithm. By exploiting global context across multiple views and time, it predicts the controller pose accurately and stably in a single pass, and promptly supplies a robust 6DoF pose whenever optical tracking of the controller fails.

1. Model background

Common bare-hand tracking algorithms use a Top-Down structure: a detection model first finds the hand's bounding box, which is then used to crop out the hand, as shown below:

[Figure: Top-Down pipeline, detect a bounding box, then crop the hand]

This structure achieves high accuracy, but in special poses such as arms lifted horizontally or hanging naturally, the small controller may be occluded or far from the camera, so the cropped hand region is badly blurred once enlarged, as shown below:

[Figures: blurred hand crops in the horizontal-lift and natural-hang poses]

In such cases the Top-Down structure struggles to locate the wrist point, so the solve fails and the controller is lost. A Down-Top structure, by contrast, lets PICO infer the wrist position from the arm, body, and other cues in the full image.

2. Model structure diagram
[Figure: Down-Top model structure]
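The article does not spell out the network internals, so below is a purely schematic PyTorch sketch of the bottom-up idea in the diagram above: every view and frame is encoded without any hand cropping, so arm and torso context is preserved, a recurrent layer aggregates the multi-view features over time, and a single head regresses the wrist 6DoF in one pass. All layer sizes and the position-plus-quaternion output are illustrative assumptions, not PICO's actual design.

```python
import torch
import torch.nn as nn

class DownTopPoseNet(nn.Module):
    """Schematic bottom-up ("Down-Top") 6DoF wrist regressor (hypothetical)."""
    def __init__(self, views=4, feat=64):
        super().__init__()
        self.backbone = nn.Sequential(            # shared per-image encoder
            nn.Conv2d(1, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat),
        )
        self.temporal = nn.GRU(views * feat, 128, batch_first=True)
        self.pose_head = nn.Linear(128, 7)        # 3D position + quaternion

    def forward(self, images):                    # (B, frames, views, 1, H, W)
        B, T, V, C, H, W = images.shape
        f = self.backbone(images.reshape(B * T * V, C, H, W))
        f = f.reshape(B, T, V * f.shape[-1])      # concatenate views per frame
        h, _ = self.temporal(f)                   # aggregate over time
        return self.pose_head(h[:, -1])           # pose for the latest frame

net = DownTopPoseNet()
pose = net(torch.randn(2, 3, 4, 1, 96, 96))      # 2 clips, 3 frames, 4 views
print(pose.shape)                                 # torch.Size([2, 7])
```

Because the whole image enters the network, a heavily blurred hand crop never occurs; the wrist can still be localized from body context.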
3. Evaluation results

Comparing the Top-Down and Down-Top model structures experimentally in scenarios such as horizontal lifting and natural hanging shows that the Down-Top scheme achieves a far higher detection rate at comparable accuracy: 36% -> 93%.

[Figure: evaluation results, Top-Down]

[Figure: evaluation results, Down-Top]

Fusion algorithm

1. New challenges

Traditional optical tracking relies on a prominent physical structure on the controller (the tracking ring) to guarantee that, whatever the holding angle and position, enough LED points remain observable to the positioning cameras. Once the 2D positions of several LEDs have been located in the image, a PnP (Perspective-n-Point) solve can be performed; aided by the controller's high-frame-rate IMU, accurate high-frequency positioning results are obtained, giving users a precise, smooth tracking experience (a minimal PnP sketch follows the figures below).

[Figure: PICO 4 controller]

[Figure: Quest 2 controller]
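To make the PnP step above concrete, here is a minimal sketch using OpenCV. The LED layout, intrinsics, and pose are invented for the example (the real constellation and calibration are not published); the 2D "detections" are synthesized by projecting a known ground-truth pose, so the solver's output can be checked against it.

```python
import numpy as np
import cv2

# Hypothetical coplanar LED positions in the controller frame (meters).
led_3d = np.array([[ 0.02,  0.01, 0.0],
                   [-0.02,  0.01, 0.0],
                   [ 0.02, -0.01, 0.0],
                   [-0.02, -0.01, 0.0]])

# Hypothetical pinhole intrinsics of one tracking camera.
K = np.array([[450.0, 0.0, 640.0],
              [0.0, 450.0, 360.0],
              [0.0,   0.0,   1.0]])
dist = np.zeros(5)  # assume an undistorted image

# Ground-truth pose, used only to synthesize the detected blob centers.
rvec_gt = np.array([0.1, -0.2, 0.05])
tvec_gt = np.array([0.05, -0.02, 0.4])
led_2d, _ = cv2.projectPoints(led_3d, rvec_gt, tvec_gt, K, dist)

# Recover the controller pose in the camera frame from the 2D-3D matches.
ok, rvec, tvec = cv2.solvePnP(led_3d, led_2d, K, dist,
                              flags=cv2.SOLVEPNP_ITERATIVE)
print(ok, tvec.ravel())  # ~ [0.05, -0.02, 0.4]
```

In the real pipeline this single-frame estimate is only one observation; it is fused with the IMU stream as described below.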

After removing the ring, however, the tracking algorithm faces much greater challenges. Since the LEDs can only be placed sparsely in a few areas of the controller body, they are fewer in number and more easily occluded, so the cameras often observe only a limited number of infrared LEDs, or even none. The algorithm is then left with pure IMU dead reckoning, which cannot provide stable, reliable positioning for long.

After several rounds of exploration and pre-research, the PICO algorithm team proposed a novel multi-modal fusion solution that combines inertial measurement unit (IMU), optical sensor, and hand-image information. Built on the complementarity of gesture recognition and controller optical tracking, it resolves the series of challenges described above. The team named it the Centaur multi-modal fusion algorithm.

2. Composition of Centaur multi-modal fusion algorithm

The Centaur multi-modal fusion algorithm fuses visual and inertial information to obtain an optimal estimate of the controller's pose and velocity, which it supplies to the upper application layer. Its composition is shown below:

[Figure: composition of the Centaur multi-modal fusion algorithm]

The functions of each module in the figure:

  • Multiple global-shutter IR cameras are arranged around the headset. Normal-exposure frames capture the features of the human hand, while low-exposure frames capture the positions of the controller's LEDs while suppressing most ambient-light interference.

  • An IMU module inside the controller provides acceleration and angular-rate information while the controller moves.

  • The 3-DOF module estimates the controller's rotation from IMU data alone (a minimal integration sketch follows this list).

  • The deep-learning-based gesture module (AI-based hand detection & tracking) accurately predicts the controller's pose by exploiting global context across multiple views and time.

  • The optical positioning module (LED detection/matching & pose estimation) uses priors such as the 3-DOF attitude and the LED layout on the controller to determine, through an intelligent matching mechanism, the correspondence between image blobs and LEDs, yielding a single-frame estimate of the controller pose.

  • The multi-frame fusion filter (Multi-State-ESKF) fuses the hand pose, controller IMU data, LED-based pose estimates, the LED matching relations, and other information to compute high-precision, high-frame-rate controller position, rotation, and velocity, which are published to the system interface for upper-layer applications.
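The article does not detail the 3-DOF module, so as a minimal sketch, here is plain gyro-only orientation propagation by quaternion integration; an actual 3DoF estimator would also use the accelerometer to correct roll/pitch drift. All values are illustrative.

```python
import numpy as np

def quat_mul(q, r):
    """Hamilton product of two quaternions stored as (w, x, y, z)."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def integrate_gyro(q, omega, dt):
    """Propagate orientation q by the body angular rate omega (rad/s) over dt."""
    angle = np.linalg.norm(omega) * dt
    if angle < 1e-12:
        return q
    axis = omega / np.linalg.norm(omega)
    dq = np.concatenate([[np.cos(angle / 2)], np.sin(angle / 2) * axis])
    q = quat_mul(q, dq)
    return q / np.linalg.norm(q)   # renormalize against numerical drift

# Example: a 1 kHz gyro stream turning at 90 deg/s about z for one second.
q = np.array([1.0, 0.0, 0.0, 0.0])
for _ in range(1000):
    q = integrate_gyro(q, np.array([0.0, 0.0, np.pi / 2]), 1e-3)
print(q)  # ~ [0.707, 0, 0, 0.707], i.e. a 90-degree yaw
```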

3. Tracking and Fusion

When the algorithm runs for the first time, or is in the 3DoF state, there is no temporal prior from continuous tracking, so a bootstrap-from-scratch initialization is required. With both LEDs and gesture information available, the initialization has been upgraded relative to traditional optical positioning: an LED-based initializer and a gesture-based initializer run side by side, and whichever first solves a correct initial state seeds the fusion filter with the controller's initial pose and velocity. This significantly improves the speed and success rate of initialization across holding postures (a minimal sketch of the "first solver wins" logic follows).
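A minimal sketch of that "first solver wins" bootstrap; `led_init` and `hand_init` are hypothetical stand-ins for the two initialization paths, each returning a pose and velocity or `None` on failure.

```python
class FusionFilter:
    """Toy stand-in for the fusion filter's seeding interface."""
    def reset(self, pose, velocity):
        self.pose, self.velocity = pose, velocity

def bootstrap(frame, led_init, hand_init, filt):
    # Try both initializers; whichever solves first seeds the filter.
    for init in (led_init, hand_init):
        result = init(frame)
        if result is not None:
            filt.reset(*result)
            return True
    return False          # remain in 3DoF until one path succeeds

filt = FusionFilter()
ok = bootstrap("frame0",
               lambda f: None,                           # LEDs fully occluded
               lambda f: (("R", "t"), (0.0, 0.0, 0.0)),  # gesture path solves
               filt)
print(ok, filt.pose)
```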

Once initialization is complete and the tracking state is entered, the algorithm proceeds as shown below:

[Figure: algorithm flow in the tracking state]
  • Step 1. When a new image frame arrives, use IMU data to propagate the historical frame states in the sliding window (inertial recursion), obtaining a predicted state for the new frame.

  • Step 2. From the predicted controller pose, predict where the controller's LEDs or the hand features should appear in the current image. The two frame types are handled as follows:

    • For normal-exposure frames: use the Down-Top network described above to directly obtain the 6DoF pose of the wrist joint, convert it into a controller pose via the controller-wrist alignment, and add it as a pose observation constraining the current frame.

    • For low-exposure frames: detect the 2D positions of LED blobs in the predicted region, match the predicted 2D point set against the detected one with nearest-neighbor matching (a gated nearest-neighbor sketch follows this list), solve PnP to estimate the controller pose, and add both the pose result and the 2D matches as observation factors constraining the current frame.

  • Step 3. The final fusion adopts a Multi-State ESKF scheme with a combination of loose and tight coupling, which significantly improves tracking quality while limiting computation and preserving stability.
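A minimal sketch of the gated nearest-neighbor matching in Step 2's low-exposure branch, with invented data: the predicted pose projects a hypothetical LED layout into the image, and each projection is greedily matched to the nearest detected blob within a pixel gate. The production matcher is described only as an "intelligent matching mechanism", so take this purely as an illustration.

```python
import numpy as np
import cv2

def match_leds(pred_pose, led_3d, detections_2d, K, gate_px=8.0):
    """Return (led_index, detection_index) pairs within the pixel gate."""
    rvec, tvec = pred_pose
    proj, _ = cv2.projectPoints(led_3d, rvec, tvec, K, None)
    proj = proj.reshape(-1, 2)                # predicted 2D LED positions
    matches, used = [], set()
    for i, p in enumerate(proj):
        d = np.linalg.norm(detections_2d - p, axis=1)
        j = int(np.argmin(d))
        if d[j] < gate_px and j not in used:  # gated greedy nearest neighbor
            matches.append((i, j))
            used.add(j)
    return matches

# Toy usage with a hypothetical 3-LED layout 0.4 m in front of the camera.
led_3d = np.array([[0.02, 0.0, 0.0], [-0.02, 0.0, 0.0], [0.0, 0.03, 0.0]])
K = np.array([[450.0, 0.0, 640.0], [0.0, 450.0, 360.0], [0.0, 0.0, 1.0]])
pred = (np.zeros(3), np.array([0.0, 0.0, 0.4]))
proj, _ = cv2.projectPoints(led_3d, *pred, K, None)
dets = proj.reshape(-1, 2) + np.random.normal(0.0, 1.0, (3, 2))  # noisy blobs
print(match_leds(pred, led_3d, dets, K))  # typically [(0, 0), (1, 1), (2, 2)]
```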

4. Centaur multi-modal fusion algorithm benefits
  • The figure below shows tracking while the controller is stationary and only 3 LEDs are visible. Multi-frame tight coupling is more accurate than single-frame loose coupling: tracking is more stable and fluctuations are significantly reduced:

    • Jitter in the raw optical observations is very noticeable; the ±3σ range is roughly 16 mm on the x-axis, 4 mm on the y-axis, and 25 mm on the z-axis. The test placed the controller directly in front of and below the headset, with three infrared LEDs exposed and held still, so the error along the depth directions (x, z) is significantly larger than the error orthogonal to depth (y).

    • The loosely coupled ESKF suppresses the optical jitter: the three-axis ranges shrink to roughly 6 mm (x), 2.5 mm (y), and 9 mm (z), but the estimated velocity still fluctuates by about 10 mm/s.

    • Multi-frame tight coupling performs best: the trajectory is visibly smoother, with jitter of roughly 2 mm (x), 1 mm (y), and 3 mm (z) and velocity jitter of about 3 mm/s, roughly a 3x improvement over the original filter's error metrics.

[Figure: position and velocity jitter under the three fusion schemes]
  • When the hand turns so that all LEDs are completely occluded, the algorithm falls back on fused gesture-positioning and IMU information to maintain the controller's tracking state and accuracy, enabling smooth switching and smooth operation across scenarios.

  • To verify the tracking quality, the PICO team also ran extreme tests with hardcore players. In sports, fitness, rhythm games, and other scenarios that demand rapid shaking of the controller, the PICO multi-modal fusion algorithm tracked the position and orientation of both the hand and the controller accurately and stably.

[Figure: PICO's small controller without a tracking ring]

Self-developed synchronized multi-camera system

Data collection and automatic annotation

The PICO Data Laboratory has built a multi-modal synchronized camera system that both yields large volumes of high-precision data and lays a solid foundation for technology and product R&D. The hardware comprises an industrial RGB camera array, a structured-light scanner, an optical motion-capture camera system, and a VR headset; the software covers point-cloud registration, spatio-temporal calibration, and automatic labeling of gestures and controllers. The collection and auto-labeling pipeline consists of pre-collection preparation and the data collection run, the latter divided into two stages.

[Figure: Left: synchronized camera system; Right: VR headset with light ball]

  • Preparation before collection

We use a structured-light scanner to obtain dense point clouds of the controller and IR light-ball surfaces, yielding the transform from the light ball to the controller model. We also bind a light ball to a tag calibration board and, by observing the board, obtain the sensor parameters, including those of the VR headset. To align the timelines of the sensors we use two methods: the first is an intrusive shared external clock signal; the second is to "dance" the headset rapidly and align the resulting headset trajectory with the trajectory of the light ball bound to it, in both time and space (a minimal time-offset sketch follows).
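The trajectory-based alignment can be illustrated with a small sketch: sample the speed magnitude of both trajectories at a common rate during the "dance" segment and find the lag that maximizes their normalized cross-correlation. The function names and rates are assumptions for illustration.

```python
import numpy as np

def estimate_time_offset(speed_a, speed_b, rate_hz):
    """Offset (s) such that speed_b(t) ~ speed_a(t + offset)."""
    a = (speed_a - speed_a.mean()) / speed_a.std()
    b = (speed_b - speed_b.mean()) / speed_b.std()
    corr = np.correlate(a, b, mode="full")       # score all integer lags
    lag = np.argmax(corr) - (len(b) - 1)
    return lag / rate_hz

# Toy check at 100 Hz: b is a copy of a delayed by 25 samples (0.25 s).
t = np.linspace(0.0, 10.0, 1000)
a = np.abs(np.sin(3.0 * t)) + 0.1 * np.sin(17.0 * t)
b = np.roll(a, 25)
print(estimate_time_offset(a, b, 100.0))  # ~ -0.25: b lags a by 0.25 s
```

Once the time offset is found, the spatial relationship between the two trajectories can then be solved, e.g., with a rigid point-set registration.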

[Figure: structured-light scanning and registration before acquisition]

[Figure: Stage 1, collecting the spatial relationship between the hand and the controller]

[Figure: Stage 2, controller tracking and gesture labeling]

  • Data collection

    • Step 1: with multi-view images as input, a self-developed hand-pose annotation algorithm obtains the keypoint locations. To keep the data highly accurate in this step, we propose a pose estimation algorithm based on a decoupled representation: we construct a 2D visual space and a 3D node space and refine the hand pose iteratively. To solve the cold-start problem of where annotation training data comes from, we also designed a multi-view self-supervision framework. The algorithms were published at ICCV 2023.

    • Step 2: after obtaining the hand poses observed from the different views, we fuse the multi-view information: triangulation inside a RANSAC loop yields the fused 3D hand pose (see the sketch after this list), which is then fine-tuned using the confidence of each hand keypoint.

    • Step 3: starting from the 3D hand keypoints of the previous step, we optimize the preceding sequence by jointly considering constraints such as bone positions, movement speed, hand-joint rotations, and collisions between the hand and the controller. At this point we have the hand keypoints and the relative pose between the hand and the controller.

  • Stage 1: the camera system simultaneously captures images from the industrial cameras and the VR headset cameras, along with the coordinates of the marker points recorded by the optical motion-capture cameras.

  • Stage 2: the subject keeps the hand pose relative to the controller fixed and waves the controller through different scenes to record its trajectory.
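The article names triangulation with RANSAC for Step 2 above; here is a minimal, simplified sketch for a single hand keypoint. Hypotheses come from pairs of views and are scored by reprojection error across all views; the production pipeline additionally weights by per-keypoint confidence. All names and camera setups are invented.

```python
import itertools
import numpy as np
import cv2

def ransac_triangulate(proj_mats, pts_2d, thresh_px=5.0):
    """Triangulate one 3D keypoint from N views, RANSAC-style (simplified).

    proj_mats: list of 3x4 projection matrices K @ [R|t], one per camera.
    pts_2d:    list of (2,) pixel observations of the same hand keypoint.
    """
    best_pt, best_inliers = None, -1
    for i, j in itertools.combinations(range(len(proj_mats)), 2):
        X = cv2.triangulatePoints(proj_mats[i], proj_mats[j],
                                  pts_2d[i].reshape(2, 1),
                                  pts_2d[j].reshape(2, 1))
        X = (X[:3] / X[3]).ravel()                # dehomogenize
        inliers = 0
        for P, uv in zip(proj_mats, pts_2d):      # score by reprojection
            x = P @ np.append(X, 1.0)
            if x[2] > 0 and np.linalg.norm(x[:2] / x[2] - uv) < thresh_px:
                inliers += 1
        if inliers > best_inliers:
            best_pt, best_inliers = X, inliers
    return best_pt, best_inliers

# Toy usage: three hypothetical cameras observing a point at (0, 0, 1).
K = np.array([[450.0, 0.0, 320.0], [0.0, 450.0, 240.0], [0.0, 0.0, 1.0]])
P = []
for tx in (-0.1, 0.0, 0.1):                       # small horizontal baseline
    Rt = np.hstack([np.eye(3), np.array([[tx], [0.0], [0.0]])])
    P.append(K @ Rt)
X_true = np.array([0.0, 0.0, 1.0, 1.0])
pts = [(Pi @ X_true)[:2] / (Pi @ X_true)[2] for Pi in P]
print(ransac_triangulate(P, pts))                 # ~ (0, 0, 1), 3 inliers
```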

From the light-ball-to-controller and hand-to-controller spatial relationships obtained in stage 1, together with the light-ball trajectory collected in stage 2, the trajectories of the gesture and the controller in the capture space can be recovered. Likewise, from the spatial relationship between the light ball and the headset cameras and the light-ball trajectory tracked in stage 2, the gesture and the controller can be projected into the headset cameras to produce data labels.
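That label projection is a chain of rigid transforms, as in this small sketch. The 4x4 matrices and intrinsics are invented; in practice they come from the calibration and tracking steps described above.

```python
import numpy as np

# Hypothetical 4x4 rigid transforms produced by the pipeline:
#   T_cam_ball : light-ball pose in a headset-camera frame (stage-2 tracking)
#   T_ball_hand: hand pose relative to the light ball (stage-1 result)
T_cam_ball = np.eye(4);  T_cam_ball[:3, 3] = [0.0, -0.1, 0.3]
T_ball_hand = np.eye(4); T_ball_hand[:3, 3] = [0.02, 0.0, -0.05]

# Chain the transforms: hand frame -> headset-camera frame.
T_cam_hand = T_cam_ball @ T_ball_hand

# Project one hand keypoint (homogeneous, hand frame) to get its 2D label.
K = np.array([[450.0, 0.0, 640.0], [0.0, 450.0, 360.0], [0.0, 0.0, 1.0]])
p_cam = T_cam_hand @ np.array([0.01, 0.02, 0.0, 1.0])
uv = (K @ p_cam[:3]) / p_cam[2]
print(uv[:2])  # pixel coordinates of the keypoint label, ~ [694, 216]
```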

Summary

The PICO R&D team is committed to delivering high-quality XR technology and product experiences to users worldwide. The miniaturized controller is an innovative, breakthrough step in the design of XR interaction solutions. PICO's self-developed Centaur multi-modal tracking algorithm not only made the technical breakthrough that allowed the "miniaturized controller" to ship, but also provides new ideas and possibilities for future human-computer interaction design.


Reprinted from: blog.csdn.net/ByteDanceTech/article/details/133191468