SUSTech black tech: erase characters from a video with one click, the savior of special effects artists is here!

Cressy | From Aofei Temple
QbitAI | WeChat official account QbitAI

This video segmentation model from Southern University of Science and Technology can track anything in a video.

It can not only "watch" but also "cut": removing a person from a video is an easy task for it.

As for operation, all you need to do is click the mouse a few times.

[GIF: removing a person from a video with a few clicks]

Special effects artists felt they had found a savior when they saw the news, with one bluntly saying that this will change the rules of the game in the CGI industry.

[Screenshot: a VFX artist's comment calling TAM a game changer for CGI]

The model is called TAM (Track Anything Model). Does the name remind you of Meta's image segmentation model SAM?

Indeed, TAM extends SAM to the video domain and lights up the skill tree of dynamic object tracking.

[GIF: TAM tracking a moving object]

Video segmentation models are not a new technology, but traditional segmentation models do little to ease the workload of humans.

The training data these models rely on must be manually annotated, and before use they even need to be initialized with mask parameters for the specific objects to be tracked.

The emergence of SAM provides a prerequisite for solving this problem - at least the initialization data no longer needs to be manually obtained.

Of course, TAM does not simply apply SAM frame by frame; it also needs to build the corresponding spatio-temporal relationships.

The team integrated SAM with a memory module called XMem.

You only need SAM to generate the initial mask on the first frame; XMem can then guide the subsequent tracking process.

There can also be many targets to track at once, as in the following animated "Along the River During the Qingming Festival" scene:

[GIF: multi-target tracking in an animated "Along the River During the Qingming Festival" scene]

Even if the scene changes, TAM's performance is not affected:

[GIF: tracking maintained across a scene change]

After trying it ourselves, we found that TAM provides an interactive user interface that is simple and friendly to operate.

[Screenshot: TAM's interactive user interface]

In terms of raw capability, TAM's tracking results are indeed good:

[GIF: TAM tracking result]

However, the accuracy of the removal function still needs improvement in some details.

[Screenshot: removal artifacts visible in some details]

From SAM to TAM

As mentioned above, TAM is built on SAM and adds a memory capability to establish spatio-temporal correspondence.

Specifically, the first step is to initialize the model with SAM's still-image segmentation capability.

With just one click, SAM can generate an initialization mask for the target object, replacing the cumbersome initialization process of traditional segmentation models.
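
For readers who want to see what this one-click initialization looks like in code, here is a minimal sketch built on Meta's open-source segment-anything package; the checkpoint path, frame file, and click coordinates are placeholders, not values from the TAM project:

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a pretrained SAM checkpoint (model type and path are placeholders).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# Feed the first frame of the video to SAM.
frame = cv2.cvtColor(cv2.imread("frame_0000.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(frame)

# A single positive click on the target object (pixel coordinates).
point_coords = np.array([[320, 240]])
point_labels = np.array([1])  # 1 = foreground click

masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,
)
init_mask = masks[np.argmax(scores)]  # keep the highest-scoring mask
```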

Once the initial mask is available, the process can be handed over to XMem for tracking with only semi-manual intervention, which greatly reduces the human workload.

[Figure: the TAM pipeline combining SAM and XMem]

During this process, a small number of human-given predictions are compared against XMem's output.

In practice, it becomes more and more difficult for XMem to produce accurate segmentation results as time goes on.

When the gap between the output and the expectation grows too large, the process enters a re-segmentation step, which is again completed by SAM.

After SAM's re-optimization, most outputs are fairly accurate, but some still need manual re-adjustment.
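
Putting these steps together, the tracking loop can be sketched roughly as follows. Note that `tracker`, `quality`, and `sam_refine` are hypothetical stand-ins for XMem's propagation, a mask-confidence check, and the SAM re-segmentation step; the threshold is an illustrative assumption, not a value from the paper:

```python
def track_video(frames, init_mask, tracker, sam_refine, quality, threshold=0.8):
    """Sketch of the propagate-then-refine loop described above."""
    masks = [init_mask]
    tracker.initialize(frames[0], init_mask)    # hypothetical XMem-style memory init
    for frame in frames[1:]:
        mask = tracker.propagate(frame)         # memory-based mask propagation
        if quality(mask) < threshold:           # output drifts too far from expectation
            mask = sam_refine(frame, mask)      # hand the frame back to SAM
            tracker.update_memory(frame, mask)  # hypothetical memory refresh
        masks.append(mask)
    return masks                                # a few results may still need manual fixes
```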

That is roughly how TAM works, and the object-removal capability mentioned at the beginning comes from combining TAM with E2FGVI.

E2FGVI is itself a video object-removal (inpainting) tool; with the support of TAM's precise segmentation, its work becomes more targeted.
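
Conceptually, the removal step just feeds TAM's per-frame masks into the inpainter. Here is a rough sketch; `e2fgvi_inpaint` is a hypothetical wrapper around an E2FGVI-style video inpainting model, and the mask dilation width is an illustrative choice:

```python
import cv2
import numpy as np

def remove_object(frames, masks, e2fgvi_inpaint, dilate_px=8):
    """Erase the tracked object by inpainting the masked regions."""
    kernel = np.ones((dilate_px, dilate_px), np.uint8)
    # Slightly dilate each mask so the inpainter also covers object edges.
    dilated = [cv2.dilate(m.astype(np.uint8), kernel) for m in masks]
    # The video inpainter fills the masked regions using surrounding frames.
    return e2fgvi_inpaint(frames, dilated)
```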

To test TAM, the team evaluated it using the DAVIS-16 and DAVIS-17 datasets.

5aeea9c62fb75f6c90a4890f5998163e.png

The results feel good intuitively, and the numbers back that up.

Although TAM does not require manually specified masks, its J (region similarity) and F (boundary accuracy) scores are very close to those of models initialized with manual masks.

On the DAVIS-2017 dataset, it even performs slightly better than STM.

Among methods with other initialization schemes, SiamMask's performance is no match for TAM at all;

Another method, MiVOS, does perform better than TAM, but then again it has gone through 8 rounds of interaction...

[Table: quantitative comparison of TAM with other methods]
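
For reference, J and F measure how well the predicted masks match the ground truth in region and boundary terms. Below is a simplified sketch of the two metrics; the official DAVIS evaluation scales the boundary tolerance with the image diagonal, so the fixed tolerance here is an illustrative assumption:

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def region_similarity(pred, gt):
    """J: intersection-over-union of the predicted and ground-truth masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0

def boundary_f_measure(pred, gt, tolerance=2):
    """F: precision/recall of boundary pixels within a small tolerance."""
    def boundary(mask):
        mask = mask.astype(bool)
        return mask & ~binary_erosion(mask)  # pixels eroded away form the contour
    pb, gb = boundary(pred), boundary(gt)
    struct = np.ones((2 * tolerance + 1, 2 * tolerance + 1), dtype=bool)
    precision = (pb & binary_dilation(gb, structure=struct)).sum() / max(pb.sum(), 1)
    recall = (gb & binary_dilation(pb, structure=struct)).sum() / max(gb.sum(), 1)
    return 2 * precision * recall / (precision + recall + 1e-8)
```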

Team Profile

TAM is from the Visual Intelligence and Perception (VIP) Laboratory of Southern University of Science and Technology.

The laboratory's research directions include text-image-audio multimodal learning, multimodal perception, reinforcement learning, and visual defect detection.

At present, the team has published more than 30 papers and obtained 5 patents.

The team is led by Zheng Feng, an associate professor at Southern University of Science and Technology, who received his Ph.D. from the University of Sheffield in the UK.

Paper address:
https://arxiv.org/abs/2304.11968
GitHub page:
https://github.com/gaomingqi/Track-Anything
Reference link:
https://twitter.com/bilawalsidhu/status/1650710123399233536?s=20

Source: blog.csdn.net/QbitAI/article/details/130497782