Computer Vision Research Institute column
Tracking Everything Everywhere All at Once
Reposted from Machine Heart (机器之心)
A while ago, Meta released the Segment Anything Model (SAM), an AI model that can generate a mask for any object in any image or video, prompting some computer vision (CV) researchers to exclaim that "CV no longer exists." A wave of derivative work followed, stacking capabilities such as object detection and image generation on top of segmentation, but most of this research targeted static images.
Now, a new study dubbed "Tracking Everything" proposes a method for motion estimation in dynamic video that can track object trajectories accurately and completely.
The study was carried out jointly by researchers from Cornell University, Google Research, and UC Berkeley. They propose OmniMotion, a complete and globally consistent motion representation, together with a new test-time optimization method for accurate, complete motion estimation of every pixel in a video.
Paper address: https://arxiv.org/abs/2306.05422
Project homepage: https://omnimotion.github.io/
After the study was shared on Twitter, it received more than 3,500 likes in a single day, and the work was widely praised.
The demos released with the study show very good motion-tracking results, such as tracking the trajectory of a jumping kangaroo:
The motion curve of a swing:
You can also view motion tracking interactively:
Trajectories can be tracked even when the object is occluded, for example a dog passing behind a tree while running:
In computer vision, two motion-estimation approaches are commonly used: sparse feature tracking and dense optical flow. Each has its shortcomings: sparse feature tracking cannot model the motion of every pixel, while dense optical flow cannot capture motion trajectories over long time spans.
The proposed OmniMotion represents a video with a quasi-3D canonical volume and tracks every pixel through bijections between local space and canonical space. This representation guarantees global consistency, enables motion tracking even through occlusions, and can model any combination of camera and object motion. Experiments show that the method substantially outperforms existing SOTA methods.
Method overview
The method takes as input a collection of video frames together with pairwise noisy motion estimates (e.g., optical flow fields) and fuses them into a complete, globally consistent motion representation of the entire video. An optimization process then allows the representation to be queried with any pixel in any frame, producing smooth, accurate motion trajectories across the whole video. Notably, the method can identify when points become occluded and can even track points through occlusions.
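The core idea of fusing noisy pairwise estimates into one globally consistent answer can be illustrated with a toy problem. This is not the paper's method (which optimizes a neural representation): here each "flow" is just a noisy 1-D displacement between a pair of frames, and a small least-squares solve recovers a single trajectory that best agrees with all pairs at once. All values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: fuse noisy *pairwise* motion estimates (analogous to
# optical flow between frame pairs) into one globally consistent
# trajectory of a single 1-D point across T frames.
T = 6
true_pos = np.cumsum(rng.normal(0.0, 1.0, T))  # ground-truth trajectory

# Noisy displacement estimate for every ordered frame pair (i, j), i < j.
pairs = [(i, j) for i in range(T) for j in range(i + 1, T)]
flows = {(i, j): (true_pos[j] - true_pos[i]) + rng.normal(0.0, 0.05)
         for (i, j) in pairs}

# Least squares: find positions p such that p[j] - p[i] matches every
# noisy pairwise flow as well as possible.
A = np.zeros((len(pairs), T))
b = np.zeros(len(pairs))
for row, (i, j) in enumerate(pairs):
    A[row, i], A[row, j] = -1.0, 1.0
    b[row] = flows[(i, j)]

# Anchor the gauge freedom (the solution is defined up to a constant shift).
A = np.vstack([A, np.eye(T)[0]])
b = np.append(b, true_pos[0])

p, *_ = np.linalg.lstsq(A, b, rcond=None)
print("max error:", np.abs(p - true_pos).max())  # small: pairwise noise averages out
```

Because every frame participates in many pairwise constraints, independent flow errors largely cancel, which is the intuition behind consolidating noisy pairwise flow into a single consistent representation.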
OmniMotion Characterization
Traditional motion-estimation methods such as pairwise optical flow lose track of objects when they become occluded. To provide accurate, consistent motion trajectories even under occlusion, the study proposes the global motion representation OmniMotion.
The study aims to track real-world motion accurately without explicit dynamic 3D reconstruction. OmniMotion represents the scene in a video as a canonical 3D volume, which is mapped into a local volume for each frame through local-canonical bijections. Each bijection is parameterized as a neural network and captures camera and scene motion together, without separating the two. Under this formulation, the video can be viewed as renderings of the local volumes from a fixed static camera.
Because OmniMotion does not explicitly distinguish camera motion from scene motion, the resulting representation is not a physically accurate 3D reconstruction of the scene; the authors therefore call it a quasi-3D representation.
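The mechanics of tracking through a canonical space can be sketched with a toy substitute for the paper's bijections. OmniMotion parameterizes each frame's mapping as an invertible neural network; here each mapping T_i is just an invertible affine transform with made-up rotation and translation values, which is enough to show how composing T_i forward and T_j inverse yields frame-to-frame correspondence.

```python
import numpy as np

# Toy stand-in for OmniMotion's local-canonical bijections: each frame's
# mapping T_i is a simple invertible affine transform (hypothetical values),
# where the real method uses an invertible neural network.
def make_bijection(rotation_deg, translation):
    theta = np.deg2rad(rotation_deg)
    A = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0,            0.0,           1.0]])
    t = np.asarray(translation, dtype=float)
    forward = lambda x: A @ x + t                   # local frame -> canonical volume
    inverse = lambda u: np.linalg.solve(A, u - t)   # canonical volume -> local frame
    return forward, inverse

# Two frames with different (made-up) combined camera/scene motion.
T1_fwd, T1_inv = make_bijection(10.0, [0.2, 0.0, 0.1])
T2_fwd, T2_inv = make_bijection(25.0, [0.5, -0.1, 0.1])

# A 3D point lifted from a pixel in frame 1.
x1 = np.array([1.0, 2.0, 0.5])

# Frame-to-frame correspondence by composing through canonical space:
# x2 = T2^{-1}(T1(x1)) is the same scene point expressed in frame 2.
x2 = T2_inv(T1_fwd(x1))

# The round trip back to frame 1 recovers the original point (bijectivity),
# which is what keeps long-range tracks globally consistent.
x1_back = T1_inv(T2_fwd(x2))
print(np.allclose(x1, x1_back))  # True
```

Because every frame maps into the same canonical volume, any pair of frames can be related through it, rather than only adjacent frames as with chained optical flow.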
OmniMotion preserves information about all scene points projected onto each pixel, together with their relative depth order, which allows points to be tracked even when they are temporarily occluded.
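How relative depth order decides visibility can be illustrated with NeRF-style compositing weights along a single pixel's ray. This is only a sketch of the general volume-rendering recipe, with made-up sample depths and densities, not values from the paper: the nearest sufficiently opaque sample dominates the weights, while the samples behind it keep their 3D positions and so remain trackable while occluded.

```python
import numpy as np

# Three scene points behind one pixel, sorted near to far, with per-sample
# volume densities (all values hypothetical).
depths = np.array([1.0, 2.5, 4.0])
densities = np.array([5.0, 0.1, 3.0])   # the front point is nearly opaque
deltas = np.diff(depths, append=depths[-1] + 1.0)  # spacing between samples

# Standard volume-rendering weights: opacity times accumulated transmittance.
alphas = 1.0 - np.exp(-densities * deltas)               # per-sample opacity
transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
weights = transmittance * alphas                         # visibility weights

visible = int(np.argmax(weights))  # the nearest opaque sample wins
print("weights:", weights.round(3), "visible sample:", visible)
```

The occluded samples get near-zero weight for rendering, but their positions are still defined, which is what makes it possible to report a plausible location for a point while it is hidden.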
Experiments and Results
Quantitative comparison
The researchers evaluated their method on the TAP-Vid benchmark; the results are shown in Table 1. The method consistently achieves the best position accuracy, occlusion accuracy, and temporal coherence across the different datasets. It handles different pairwise correspondence inputs from RAFT and TAP-Net well and improves consistently over both baselines.
Qualitative comparison
As shown in Figure 3, the researchers qualitatively compared their method with the baselines. The new method identifies and tracks points through (long-term) occlusion events, provides plausible positions for points while they are occluded, and handles large camera-motion parallax.
Ablation experiments and analysis
The researchers ran ablation experiments to validate their design decisions; the results are shown in Table 2.
In Figure 4, they show pseudo-depth maps generated by their model to illustrate the learned depth ordering. These maps do not correspond to physical depth, but they show that, using only photometric and optical-flow signals, the method can effectively determine the relative ordering of different surfaces, which is crucial for tracking through occlusions. More ablation experiments and analysis can be found in the supplementary material.
© THE END
For reprinting, please contact this official account for authorization