Google's blockbuster new work OmniMotion: a model that tracks everything and isn't afraid of occlusion!


Tracking Everything Everywhere All at Once.

Some time ago, Meta released the Segment Anything Model (SAM), which can generate a mask for any object in any image or video, prompting researchers in computer vision (CV) to exclaim that "CV no longer exists." A wave of derivative work followed, combining segmentation with capabilities such as object detection and image generation, but most of that research was based on static images.

Now, a new study called "Tracking Everything" proposes a method for motion estimation in dynamic video that can accurately and completely track the trajectories of objects.


The study was jointly completed by researchers from Cornell University, Google Research, and UC Berkeley. They propose OmniMotion, a complete and globally consistent motion representation, together with a new test-time optimization method for accurate, complete motion estimation of every pixel in a video.


  • Paper address: https://arxiv.org/abs/2306.05422

  • Project homepage: https://omnimotion.github.io/

Some netizens shared the study on Twitter, where it received more than 3,500 likes in a single day, and the work was well received.


Judging from the demos released with the study, the motion tracking works very well, for example tracking the trajectory of a jumping kangaroo:

[GIF: tracking a jumping kangaroo]

The motion trajectories of a swing:

[GIF: motion trajectories of a swing]

You can also view motion tracking interactively:

[GIF: interactive motion tracking]

Even if an object is occluded, its trajectory can still be tracked, such as a dog running behind a tree:

[GIF: tracking a dog through occlusion by a tree]

In computer vision, two motion estimation methods are commonly used: sparse feature tracking and dense optical flow. Both have shortcomings: sparse feature tracking cannot model the motion of every pixel, while dense optical flow cannot maintain motion trajectories over long time spans.
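To see why chaining pairwise optical flow breaks down over long time spans, here is a minimal sketch (not from the paper): a point is carried forward frame by frame through per-frame flow fields, so small per-step errors compound, and a point that drifts out of view or gets occluded is simply lost. The synthetic flow fields and the `chain_flow` helper below are illustrative, not real RAFT outputs.

```python
# Minimal sketch (not from the paper): chaining pairwise optical flow to build a
# long-range track. Per-step errors compound, and a point that leaves the frame
# or gets occluded is simply lost -- the limitation OmniMotion targets.
import numpy as np

def chain_flow(flows, start_xy):
    """Follow a point through a list of per-frame flow fields of shape (H, W, 2)."""
    x, y = start_xy
    track = [(x, y)]
    for flow in flows:
        h, w, _ = flow.shape
        xi, yi = int(round(x)), int(round(y))
        if not (0 <= xi < w and 0 <= yi < h):
            break                      # the point left the frame; the track just ends
        dx, dy = flow[yi, xi]          # channel 0 = dx, channel 1 = dy in this toy layout
        x, y = x + dx, y + dy          # small per-step errors accumulate over time
        track.append((x, y))
    return track

# Toy data: three frames of roughly one pixel of rightward motion plus noise.
rng = np.random.default_rng(0)
flows = []
for _ in range(3):
    flow = 0.1 * rng.standard_normal((64, 64, 2))
    flow[..., 0] += 1.0                # horizontal displacement of about one pixel
    flows.append(flow)

print(chain_flow(flows, (10.0, 20.0)))
```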

The proposed OmniMotion uses a quasi-3D canonical volume to represent the video and tracks every pixel through bijections between local space and canonical space. This representation guarantees global consistency, enables motion tracking even when objects are occluded, and can model any combination of camera and object motion. Experiments show that the proposed method greatly outperforms existing SOTA methods.

Method overview

The method takes as input a collection of video frames together with pairs of noisy motion estimates (e.g., optical flow fields) and uses them to form a complete, globally consistent motion representation for the entire video. An optimization process then allows the representation to be queried with any pixel in any frame, producing smooth, accurate motion trajectories across the whole video. Notably, the method can identify when points in a frame are occluded, and can even track points through occlusions.
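As a rough illustration of what such a per-video, test-time optimization can look like, here is a hedged, self-contained sketch. The `TinyMotionModel` MLP and the synthetic noisy correspondences are illustrative stand-ins, not the paper's architecture or its actual flow and photometric losses.

```python
# A hedged, self-contained sketch of a per-video (test-time) optimization loop.
# `TinyMotionModel` and the synthetic correspondences are illustrative stand-ins.
import torch
import torch.nn as nn

class TinyMotionModel(nn.Module):
    """Maps a query pixel plus (frame i, frame j) indices to a pixel in frame j."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, pts, i, j):
        inp = torch.cat([pts, i[:, None], j[:, None]], dim=-1)
        return pts + self.net(inp)        # predict an offset from the query pixel

torch.manual_seed(0)
pts_i = torch.rand(256, 2)                # query pixels in frame i (normalized coords)
i_idx, j_idx = torch.zeros(256), torch.ones(256)
pts_j = pts_i + 0.05 + 0.01 * torch.randn(256, 2)   # noisy pairwise "flow" supervision

model = TinyMotionModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(200):                   # optimization is run per video, at test time
    pred_j = model(pts_i, i_idx, j_idx)
    loss = ((pred_j - pts_j) ** 2).mean() # agree with the noisy correspondences
    opt.zero_grad()
    loss.backward()
    opt.step()

# Once optimized, the representation can be queried with any pixel in any frame.
print(model(torch.tensor([[0.5, 0.5]]), torch.zeros(1), torch.ones(1)))
```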

OmniMotion representation

Traditional motion estimation methods, such as pairwise optical flow, lose track of objects when they become occluded. To provide accurate and consistent motion trajectories even under occlusion, the study proposes the global motion representation OmniMotion.

The research aims to accurately track real-world motion without explicit dynamic 3D reconstruction. OmniMotion represents the scene in a video as a canonical 3D volume that is mapped into a local volume for each frame through local-canonical bijections. Each local-canonical bijection is parameterized as a neural network and captures camera and scene motion together, without separating the two. Under this formulation, the video can be viewed as renderings of the local volumes from a fixed, static camera.

[Figure: the OmniMotion representation]
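The core mechanism can be sketched with a toy example: every frame i has an invertible map T_i from its local 3D volume to the shared canonical volume, so a scene point seen in frame i can be carried to frame j by composing T_i with the inverse of T_j. The affine maps below are toy stand-ins for the invertible neural networks the paper uses; this is a minimal sketch of the idea, not the actual implementation.

```python
# Minimal, self-contained sketch of the local <-> canonical idea described above.
# Each frame i has an invertible map T_i from its local 3D volume to the shared
# canonical volume, so a point moves from frame i to frame j as T_j^{-1}(T_i(x)).
# The affine maps here are toy stand-ins for the paper's invertible neural networks.
import numpy as np

class FrameBijection:
    """Toy invertible local -> canonical map: a rotation about z plus a translation."""
    def __init__(self, angle, translation):
        c, s = np.cos(angle), np.sin(angle)
        self.R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
        self.t = np.asarray(translation, dtype=float)

    def to_canonical(self, x_local):
        return self.R @ x_local + self.t

    def to_local(self, x_canonical):
        return self.R.T @ (x_canonical - self.t)   # exact inverse of to_canonical

# Two frames whose local volumes differ by some combined camera/scene motion.
T_i = FrameBijection(angle=0.0, translation=[0.0, 0.0, 0.0])
T_j = FrameBijection(angle=0.1, translation=[0.3, 0.0, 0.0])

x_i = np.array([1.0, 2.0, 5.0])            # a 3D point in frame i's local volume
x_canonical = T_i.to_canonical(x_i)        # lift into the shared canonical volume
x_j = T_j.to_local(x_canonical)            # drop back into frame j's local volume
print(x_j)                                 # the same scene point, expressed in frame j
```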

Since OmniMotion does not explicitly distinguish between camera and scene motion, the resulting representation is not a physically accurate 3D scene reconstruction; the study therefore calls it a quasi-3D representation.

OmniMotion preserves information about all scene points projected to each pixel, as well as their relative depth order, which allows points in the frame to be tracked even if they are temporarily occluded.

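Here is a toy sketch of the depth-ordering idea, assuming NeRF-style alpha compositing purely as an illustration: each sample along a pixel's ray has an opacity, the compositing weights decide which surface is visible, and an occluded sample keeps a well-defined position, so its track does not vanish when something passes in front of it. The densities below are made up for illustration.

```python
# Toy sketch of relative depth ordering via alpha compositing (assumed NeRF-style
# scheme, made-up densities). Compositing weights pick the visible surface, while
# occluded samples keep their positions, so tracking can continue through occlusion.
import numpy as np

depths = np.array([2.0, 5.0, 9.0])       # samples along one pixel's ray, near to far
density = np.array([0.1, 3.0, 2.0])      # the middle sample acts like a solid surface
alpha = 1.0 - np.exp(-density)           # per-sample opacity
transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
weights = transmittance * alpha          # compositing weights along the ray

print("visible surface depth:", depths[np.argmax(weights)])
print("occluded sample at depth", depths[2], "still exists with weight", weights[2])
```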

Experiments and results

Quantitative comparison

The researchers evaluated the proposed method on the TAP-Vid benchmark, with the results shown in Table 1. The method consistently achieves the best position accuracy, occlusion accuracy, and temporal consistency across the different datasets. It handles different pairwise correspondence inputs from RAFT and TAP-Net well, and provides consistent improvements over both baselines.

[Table 1: quantitative comparison on the TAP-Vid benchmark]

Qualitative comparison

As shown in Figure 3, the researchers qualitatively compared their method with the baselines. The new method shows excellent identification and tracking capability during (long) occlusion events, provides plausible positions for points while they are occluded, and handles large camera motion parallax.

[Figure 3: qualitative comparison with baseline methods]

Ablation experiments and analysis

The researchers used ablation experiments to verify the effectiveness of their design decisions; the results are shown in Table 2.

[Table 2: ablation results]

In Fig. 4, they show the pseudo-depth maps generated by their model to demonstrate the learned depth ranking.

[Figure 4: pseudo-depth maps produced by the model]

Note that these maps do not correspond to physical depth; however, they show that, using only photometric and optical flow signals, the new method can effectively determine the relative ordering of different surfaces, which is crucial for tracking through occlusions. More ablation experiments and analysis can be found in the supplementary material.

