Track every pixel, anytime, anywhere: the "track everything" video algorithm that isn't afraid of occlusion is here


Tracking Everything Everywhere All at Once.

Reprinted from "Machine Heart" (机器之心)


Some time ago, Meta released the Segment Anything Model (SAM), which can generate a mask for any object in any image or video, prompting researchers in computer vision (CV) to exclaim that "CV no longer exists." A wave of "secondary creations" followed: a series of works combined segmentation with capabilities such as object detection and image generation, but most of this research was based on static images.

Now, a new study dubbed "Track Everything" proposes a motion estimation method for dynamic video that can accurately and completely track object trajectories.


The study was carried out jointly by researchers from Cornell University, Google Research, and UC Berkeley. They propose OmniMotion, a complete and globally consistent motion representation, together with a new test-time optimization method for estimating accurate, full-length motion for every pixel in a video.


  • Paper address: https://arxiv.org/abs/2306.05422

  • Project homepage: https://omnimotion.github.io/

After the study was shared on Twitter, it received more than 3,500 likes in a single day, and the work was widely praised.


The demos released with the study show that motion tracking works very well, for example tracking the trajectory of a jumping kangaroo:

[Demo GIF: tracking the trajectory of a jumping kangaroo]

The motion curve of a swing:

[Demo GIF: motion curve of a swing]

You can also view motion tracking interactively:

[Demo GIF: interactive motion tracking]

Even when an object is occluded, its trajectory can still be tracked, for example a dog passing behind a tree while running:

[Demo GIF: dog tracked through occlusion by a tree]

In computer vision, there are two commonly used motion estimation paradigms: sparse feature tracking and dense optical flow. Each has its shortcomings: sparse feature tracking cannot model the motion of every pixel, while dense optical flow cannot capture motion trajectories over long time spans.
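To make these limitations concrete, here is a minimal sketch (not taken from the paper) that tracks a single pixel by chaining pairwise dense optical flow with OpenCV's Farneback estimator: per-frame errors accumulate as the chain grows, and the chain has no way to recover a point once it is occluded.

```python
import cv2

def track_by_chaining(frames, x0, y0):
    """Track pixel (x0, y0) from frames[0] by chaining pairwise dense flow.

    frames: list of grayscale uint8 images. Illustrates the drawback described
    above: drift accumulates frame by frame, and occluded points are lost.
    """
    x, y = float(x0), float(y0)
    trajectory = [(x, y)]
    for prev, nxt in zip(frames[:-1], frames[1:]):
        # Farneback dense flow: flow[row, col] = (dx, dy) from prev to nxt
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        h, w = flow.shape[:2]
        col = int(round(min(max(x, 0), w - 1)))
        row = int(round(min(max(y, 0), h - 1)))
        dx, dy = flow[row, col]
        x, y = x + float(dx), y + float(dy)  # error accumulates at every step
        trajectory.append((x, y))
    return trajectory
```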

OmniMotion represents a video with a quasi-3D canonical volume and tracks every pixel through bijections between local space and canonical space. This representation guarantees global consistency, enables motion tracking even when objects are occluded, and can model any combination of camera and object motion. Experiments show that the proposed method substantially outperforms existing SOTA methods.
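As a rough illustration of how such a representation can be queried (a hedged sketch, not the authors' code: every callable below is a hypothetical placeholder), a pixel in frame i is lifted into frame i's local volume, mapped into the shared canonical volume, mapped back out into frame j's local volume, and finally projected to a 2D location in frame j:

```python
def track_pixel(p_i, i, j, lift_to_local_3d, local_to_canonical,
                canonical_to_local, composite_and_project):
    """Conceptual query of an OmniMotion-style representation.

    p_i: pixel coordinate in frame i; returns its location in frame j.
    All four callables are placeholder stand-ins for learned components.
    """
    # Sample points along the ray through p_i in frame i's local volume
    points_i = lift_to_local_3d(p_i, i)                 # (num_samples, 3)
    # Map the samples into the time-independent canonical volume
    points_canonical = local_to_canonical(points_i, i)
    # Map the same canonical points into frame j's local volume
    points_j = canonical_to_local(points_canonical, j)
    # Aggregate the samples and project back to 2D in frame j
    p_j, visible = composite_and_project(points_j)
    return p_j, visible
```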

Method overview

The method takes as input a collection of frames together with pairwise noisy motion estimates (e.g., optical flow fields) and uses them to form a complete, globally consistent motion representation of the entire video. An optimization process then allows the representation to be queried with any pixel in any frame, producing smooth, accurate motion trajectories across the whole video. Notably, the method can identify when points in a frame are occluded, and can even track points through occlusions.
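A hedged sketch of what such a test-time optimization might look like in PyTorch: the representation is fit to a single video by penalizing disagreement with the precomputed pairwise flow and with the observed colors. The loss terms, weights, and sampling interface below are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def fit_to_video(model, sample_batch, num_iters=20000, lr=1e-4,
                 w_flow=1.0, w_photo=1.0):
    """Test-time optimization sketch for one video (hypothetical interfaces).

    model(pixels_i, i, j) -> (pred_pixels_j, pred_colors_i): predicted
        correspondences in frame j and rendered colors for the queried pixels.
    sample_batch() -> (i, j, pixels_i, target_pixels_j, target_colors_i):
        random pixels, their noisy flow-based correspondences, observed colors.
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(num_iters):
        i, j, pixels_i, target_pixels_j, target_colors_i = sample_batch()
        pred_pixels_j, pred_colors_i = model(pixels_i, i, j)
        flow_loss = (pred_pixels_j - target_pixels_j).abs().mean()      # match input flow
        photo_loss = (pred_colors_i - target_colors_i).square().mean()  # match observed colors
        loss = w_flow * flow_loss + w_photo * photo_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```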

The OmniMotion representation

Traditional motion estimation methods, such as pairwise optical flow, lose track of objects when they become occluded. To provide accurate and consistent motion trajectories even under occlusion, this study proposes the global motion representation OmniMotion.

The research aims to track real-world motion accurately without explicit dynamic 3D reconstruction. OmniMotion represents the scene in a video as a canonical 3D volume, which is mapped into a local volume for each frame through local-canonical bijections. Each local-canonical bijection is parameterized as a neural network and captures camera and scene motion without separating the two. Under this formulation, a video can be viewed as renderings of the local volumes from a fixed, static camera.
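One standard way to parameterize such a bijection is with invertible coupling layers in the style of Real-NVP, conditioned on a per-frame latent code. The sketch below shows a single layer; stacking several of them (permuting which coordinate is updated) gives an expressive yet exactly invertible 3D mapping. Treat the architecture details as assumptions rather than the authors' exact network.

```python
import torch
import torch.nn as nn

class CouplingLayer(nn.Module):
    """One affine coupling layer: an exactly invertible 3D -> 3D map conditioned
    on a per-frame latent, so forward() and inverse() give local<->canonical."""

    def __init__(self, latent_dim=128, hidden=128):
        super().__init__()
        # Predict a scale and shift for the last coordinate from the other two
        # coordinates plus the frame latent; invertibility comes for free.
        self.net = nn.Sequential(
            nn.Linear(2 + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, x, latent):            # e.g. local -> canonical
        xy, z = x[..., :2], x[..., 2:]
        scale, shift = self.net(torch.cat([xy, latent], dim=-1)).chunk(2, dim=-1)
        return torch.cat([xy, z * torch.exp(scale) + shift], dim=-1)

    def inverse(self, x, latent):            # e.g. canonical -> local
        xy, z = x[..., :2], x[..., 2:]
        scale, shift = self.net(torch.cat([xy, latent], dim=-1)).chunk(2, dim=-1)
        return torch.cat([xy, (z - shift) * torch.exp(-scale)], dim=-1)
```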


Since OmniMotion does not explicitly distinguish between camera and scene motion, the resulting representation is not a physically accurate 3D scene reconstruction. The study therefore calls it a quasi-3D representation.

OmniMotion retains information about all scene points that project onto each pixel, along with their relative depth order, which allows points to be tracked even when they are temporarily occluded.
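A sketch of how that depth ordering can be used: samples along a query ray are alpha-composited from near to far (as in volume rendering), so the nearest opaque surface dominates the result, while an occluded point's sample still exists behind it with a smaller weight. The shapes and the exact compositing rule here are assumptions for illustration.

```python
import torch

def composite_along_ray(points_j, densities, colors):
    """Alpha-composite ray samples ordered near-to-far.

    points_j: (N, 3) corresponding sample locations in the target frame,
    densities: (N,) opacities, colors: (N, 3). Returns the expected
    corresponding point, the rendered color, and the per-sample weights.
    """
    alphas = 1.0 - torch.exp(-densities)                       # per-sample opacity
    ones = torch.ones(1, dtype=alphas.dtype)
    trans = torch.cumprod(torch.cat([ones, 1.0 - alphas + 1e-10])[:-1], dim=0)
    weights = alphas * trans                                   # contribution of each sample
    point_j = (weights[:, None] * points_j).sum(dim=0)         # expected correspondence
    color = (weights[:, None] * colors).sum(dim=0)             # rendered pixel color
    return point_j, color, weights
```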


Experiments and Results

Quantitative comparison

The researchers evaluated the proposed method on the TAP-Vid benchmark; the results are shown in Table 1. The method consistently achieves the best position accuracy, occlusion accuracy, and temporal coherence across different datasets. It handles different pairwise correspondence inputs, whether from RAFT or TAP-Net, and delivers consistent improvements over both baseline methods.
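For reference, TAP-Vid-style metrics of this kind can be computed roughly as follows (the pixel thresholds and the occlusion-accuracy definition are my reading of the benchmark, so treat this as an approximation rather than the official evaluation code):

```python
import numpy as np

def position_accuracy(pred_xy, gt_xy, gt_visible, thresholds=(1, 2, 4, 8, 16)):
    """Fraction of visible points within each pixel threshold of the ground
    truth, averaged over the thresholds (the '< delta' position metric)."""
    err = np.linalg.norm(pred_xy - gt_xy, axis=-1)[gt_visible]
    return float(np.mean([np.mean(err < t) for t in thresholds]))

def occlusion_accuracy(pred_occluded, gt_occluded):
    """Fraction of (point, frame) pairs whose binary occlusion flag is correct."""
    return float(np.mean(pred_occluded == gt_occluded))
```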

[Table 1: quantitative comparison on the TAP-Vid benchmark]

Qualitative comparison

As shown in Figure 3, the researchers qualitatively compared their method with baseline methods. The new method shows excellent identification and tracking during (long-term) occlusion events, providing plausible positions for points while they are occluded and handling large camera-motion parallax.

[Figure 3: qualitative comparison with baseline methods]

Ablation experiments and analysis

The researchers ran ablation experiments to validate their design decisions; the results are shown in Table 2.

[Table 2: ablation results]

In Figure 4, they show pseudo-depth maps generated by their model to illustrate the learned depth ordering.

[Figure 4: pseudo-depth maps generated by the model]

Note that these maps do not correspond to physical depth; however, they show that, using only photometric and optical-flow signals, the method can effectively determine the relative ordering of different surfaces, which is crucial for tracking through occlusions. More ablation experiments and analysis can be found in the supplementary material.
