Meta's latest open source! "Track everything" gets a full upgrade, with performance beyond OmniMotion!

Introduction

In recent months, the CV world has gone all-in on "everything" models. On April 5, Meta released Segment Anything, which can produce a mask for any object in any image. A wave of second-generation "everything" models quickly followed, such as SAM3D (segment everything in 3D scenes), SAMM (segment everything in medical images), SegGPT (segment everything in context), and Grounded Segment Anything (detect everything / generate everything). It really is one large model ruling an entire field. Then on June 8, Google proposed the "track everything" model OmniMotion, which performs accurate and complete motion estimation for every pixel in a video.

Just when it seemed the topic was settled, two days ago Meta open-sourced CoTracker: it can track any number of points in a video of any length, and new points can be added for tracking at any time. Its performance directly surpasses Google's OmniMotion; the pace of these big labs really is something. Today, let's walk through this work together. Note that the "tracking everything" discussed here is not object tracking but the tracking of specific points; readers interested in object tracking can refer to the Track Anything article.

Results showcase

Let's take a look at the specific effect first!
[Figure: CoTracker point-tracking demo]

Really silky smooth: points on almost every kind of moving target are tracked stably! Now let's see whether points sampled on a regular grid can also be tracked stably:
[Figure: tracking points sampled on a regular grid]

Compared with other SOTA methods, CoTracker holds up remarkably well, and it is not bothered by occlusion at all:
[Figures: comparison with other SOTA methods and tracking under occlusion]

In short, the results are excellent, and the code is already open source, so give it a try if you are interested; a rough usage sketch follows below. After that, let's look at the paper itself.
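
To give a concrete idea of what "trying it out" might look like, here is a minimal usage sketch in PyTorch. The torch.hub entry-point name, the query format, and the output shapes are assumptions based on the public facebookresearch/co-tracker repository and may differ between releases, so check the official README before running it.

```python
# Minimal usage sketch (assumed interface -- check the official README of
# facebookresearch/co-tracker, since entry-point names change between releases).
import torch

# Load a pretrained CoTracker model via torch.hub (entry-point name is an assumption).
model = torch.hub.load("facebookresearch/co-tracker", "cotracker2")

# Dummy video: batch of 1, 48 frames, 3 channels, 384 x 512 pixels.
video = torch.randn(1, 48, 3, 384, 512)

# Query points as (frame index t, x, y): points may start at any frame.
queries = torch.tensor([
    [0.0, 100.0, 150.0],   # track this pixel starting from frame 0
    [10.0, 200.0, 80.0],   # this point is added later, at frame 10
]).unsqueeze(0)            # -> shape (1, N, 3)

# pred_tracks: (1, T, N, 2) pixel coordinates; pred_visibility: (1, T, N) occlusion flags.
pred_tracks, pred_visibility = model(video, queries=queries)
```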

Abstract

Methods for video motion prediction either estimate the instantaneous motion of all points in a given video frame jointly using optical flow, or track the motion of individual points independently throughout the video. The latter holds even for powerful deep-learning methods that can track points through occlusions. Tracking points individually ignores the potentially strong correlations between points, for instance because they belong to the same physical object, which can hurt performance. In this paper, we propose CoTracker, an architecture that jointly tracks multiple points throughout a video. The architecture combines several ideas from the optical flow and tracking literature in a new, flexible and robust design. It is based on a Transformer network that models the correlation between different points over time via dedicated attention layers. The Transformer iteratively updates the estimates of multiple trajectories. It can be applied to very long videos in a sliding-window manner, for which we design an unrolled training loop. It can jointly track anywhere from one to several points and supports adding new points to track at any time. The result is a flexible and robust tracking algorithm that outperforms state-of-the-art methods on almost all benchmarks.

Algorithm analysis

At present there are two main families of motion-tracking methods. The first is optical flow, which directly estimates the instantaneous velocity of every point in a video frame but struggles with long-term motion (especially under occlusion or a low camera frame rate). The second is point tracking, which follows a finite set of points through continuous time but ignores the interactions between different points on the same object. CoTracker borrows from both: it uses a Transformer to model the correlation between points on the same object, and a sliding window to handle very long video sequences. The input to CoTracker is the video plus a variable number of track start positions, and the output is the full set of tracks. Note that the start positions can be anywhere in the frame and at any time in the sequence!

The paper does not contain a great deal of mathematical derivation, but the idea is very clever. Concretely, CoTracker first initialises each point's coordinates P by assuming the point is static, then uses a CNN to extract image features Q (downsampled to 1/32 of the original resolution to save GPU memory), and attaches a visibility flag v indicating whether the point is occluded. The token (P, v, Q) is then fed to the Transformer for correlation modelling, and the output token (P', Q') represents the updated position and image features.
[Figure: CoTracker architecture overview]
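
To make the data flow above a bit more concrete, here is a heavily simplified, illustrative sketch of the token-update idea. The tensor shapes, module sizes, and the single joint-attention encoder are assumptions chosen for readability; the real CoTracker uses a more elaborate design.

```python
# Illustrative sketch of the token-update idea described above (not the official code;
# tensor shapes, module sizes and the single joint-attention encoder are assumptions).
import torch
import torch.nn as nn

N, T, D = 64, 8, 128          # tracked points, window length, feature dimension

class TrackUpdater(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # One token per (point, frame): position (2) + visibility flag (1) + feature (dim).
        self.proj_in = nn.Linear(2 + 1 + dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.proj_out = nn.Linear(dim, 2 + dim)   # predicts updates for position and feature

    def forward(self, P, v, Q):
        n, t, _ = P.shape
        tokens = self.proj_in(torch.cat([P, v, Q], dim=-1))    # (n, t, dim)
        tokens = self.transformer(tokens.view(1, n * t, -1))   # attention over all point/frame tokens
        delta = self.proj_out(tokens).view(n, t, -1)
        return P + delta[..., :2], Q + delta[..., 2:]          # updated positions P' and features Q'

# Initialise tracks as static copies of the query location (as noted above),
# mark them visible, and use stand-in appearance features for the CNN output.
P = torch.zeros(N, T, 2)      # (x, y) per point per frame
v = torch.ones(N, T, 1)       # visibility flag
Q = torch.randn(N, T, D)      # stand-in for CNN appearance features

updater = TrackUpdater(D)
P, Q = updater(P, v, Q)       # one refinement step; the real model iterates several times
```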

The highlight of CoTracker is its use of sliding windows and iterative refinement to track long videos. For video sequences whose length T′ exceeds the maximum length T supported by the Transformer, the model is applied over semi-overlapping sliding windows; the number of windows is J = ⌈2T′/T − 1⌉, and M refinement iterations are performed in each window, so there are J·M iterations in total. Each iteration updates the point positions P and image features Q. Note that the visibility flag v is not updated inside the Transformer; it is computed from the results after the M iterations finish. So what does "unrolled" training mean? It refers to training through this chain of semi-overlapping windows. Because this scheme does not substantially increase the computational cost, CoTracker can in principle handle video sequences of arbitrary length, and unrolled training also makes it possible to track points that only appear later in the video! A small sketch of the window schedule follows below.
[Figure: sliding-window and unrolled training scheme]
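
The window schedule itself is easy to sketch. The snippet below only illustrates the bookkeeping implied by the formula above (windows of length T with stride T/2, M iterations each); the actual refinement and visibility computations are stand-ins.

```python
# Sketch of the sliding-window schedule described above (illustrative bookkeeping only;
# the refinement and visibility updates are stand-ins).
import math

def cotracker_style_schedule(video_len: int, window_len: int, M: int):
    """Semi-overlapping windows of length T (stride T/2), M refinement iterations each,
    for J * M iterations in total."""
    stride = window_len // 2
    J = math.ceil(2 * video_len / window_len - 1)   # number of windows
    for j in range(J):
        start = j * stride
        end = min(start + window_len, video_len)
        for m in range(M):
            # placeholder for one Transformer refinement of (P, Q) inside this window
            print(f"window {j}: frames [{start}, {end}), iteration {m}")
        # the visibility flags v would be updated here, after the M iterations,
        # rather than inside the Transformer loop

cotracker_style_schedule(video_len=24, window_len=8, M=4)   # J = 5 windows, 20 iterations
```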

CoTracker also uses a number of engineering tricks. For example, linear layers are applied at the network's input and output so that the computational complexity drops from O(N²T²) to O(N² + T²), where N is the total number of tracked points and T is the length of the video sequence.
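
To get a feel for what this reduction buys, here is a quick back-of-the-envelope comparison. The reading that the two costs correspond to handling all point-frame pairs jointly versus handling points and frames separately is an interpretation, and the concrete N and T values are borrowed from the training setup described below:

```python
# Back-of-the-envelope comparison of the two attention costs quoted above
# (constant factors ignored; N and T borrowed from the training setup below).
N, T = 256, 24                      # tracked points, frames per training clip

joint_cost = (N ** 2) * (T ** 2)    # O(N^2 T^2): all point-frame pairs attend to each other
split_cost = N ** 2 + T ** 2        # O(N^2 + T^2): points and frames handled separately

print(f"O(N^2 T^2)   ~ {joint_cost:,}")     # ~ 37,748,736
print(f"O(N^2 + T^2) ~ {split_cost:,}")     # ~ 66,112
print(f"reduction    ~ {joint_cost / split_cost:.0f}x")   # ~ 571x
```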

Experimental results

CoTracker is trained on the synthetic TAP-Vid-Kubric dataset and evaluated on four benchmarks that contain ground-truth trajectories: TAP-Vid-DAVIS, TAP-Vid-Kinetics, BADJA, and FastCapture. Training uses 11,000 pre-generated 24-frame video sequences, each containing 2,000 tracked points; during training, 256 points are sampled, mostly from foreground objects. The authors trained on 32 V100 GPUs (as expected, well beyond what ordinary people can afford). The evaluation is also interesting: since the datasets provide ground-truth trajectories, the authors additionally select many extra points in the video and track them jointly to verify the benefit of joint tracking. The results are excellent, directly surpassing the OmniMotion model that Google released in June!
[Tables: benchmark comparison with prior methods]

The authors also ran experiments to demonstrate the importance of the unrolled sliding window. Since the benchmarks used in this evaluation contain very long videos, the results also confirm that CoTracker scales well to long video sequences!
[Table: ablation of the unrolled sliding window]

Conclusion

Recently, major labs around the world have been releasing large models at a frantic pace. CoTracker is a genuinely novel "track everything" model. Its main innovations are that it models the correlations between different points on the same object and that it scales to very long video sequences, neither of which conventional optical flow or point-tracking methods can match, so for projects with long-term tracking requirements CoTracker is an excellent choice. What "everything" models will appear next? Match everything? Converse with everything? Estimate every pose? Let us wait and see.

Source: blog.csdn.net/limingmin2020/article/details/132292893