With just one photo, Douyin girls can dance


Reprinted from: Heart of the Machine

Animated human video generation has been trending recently. This time, the new framework from NUS and ByteDance not only produces natural and smooth results, but also far surpasses other methods in video fidelity.

9a26825a32b3df919acf8a3b241af3f5.png

Recently, Alibaba's research team proposed a method called Animate Anyone, which needs only a single photo of a person and, guided by skeletal animation, generates a natural animated video. However, the source code for that work has not yet been released.

74adfd51f91cde2266dc8a19b5a91f46.gif

Let Iron Man move.

In fact, one day before the Animate Anyone paper appeared on arXiv, the Show Lab at the National University of Singapore and ByteDance had jointly released a similar study. They proposed MagicAnimate, a diffusion-based framework designed to enhance temporal consistency, faithfully preserve the reference image, and improve animation fidelity. Moreover, the MagicAnimate project is open source: the inference code and a Gradio online demo have already been released.

ccd56a30b5d270573a8a1396a6bcbf9d.png

  • Paper: https://arxiv.org/pdf/2311.16498.pdf

  • Project page: https://showlab.github.io/magicanimate/

  • GitHub: https://github.com/magic-research/magic-animate

To achieve these goals, the researchers first develop a video diffusion model to encode temporal information. Then, to maintain appearance coherence across frames, they introduce a novel appearance encoder that preserves the intricate details of the reference image. Building on these two innovations, they further use a simple video fusion technique to ensure smooth transitions in long video animations.

Experimental results show that MagicAnimate outperforms baseline methods on both benchmarks. On the challenging TikTok dancing dataset in particular, it beats the strongest baseline by more than 38% in video fidelity.

Let's take a look at the animated results on the TikTok ladies below.

8f532ff8c11ffaa2a6195f9c8ef9038c.gif

13378a2ef74a30de4d3ccc93cd50316c.gif

b89b481a6a58cf35f6fb5ce7882565d7.gif

In addition to the dancing TikTok ladies, there is also Wonder Woman who "runs".

ea3b18b5232e4dde9ecc0c3da0964b46.gif

The Girl with a Pearl Earring and the Mona Lisa can both do yoga.

12fa6642140c84f0267e74dff5f301b6.gif

6c541ad16711e903c7eb7fafa1e24198.gif

Beyond single-person animation, multi-person dances can also be generated.

611b581784ebb6694ab61d329abab137.gif

dfc70a61cbafb24214f273da0d24fcfc.gif

Compared with other methods, MagicAnimate's results are clearly superior.

aa8239ac92e943cfb154bbdd3020b954.gif

Some netizens have already built a demo Space on Hugging Face, where an animated video can be generated in just a few minutes. However, that page now returns a 404.

8d9b728ce477868f796e4521b6cd010b.png

Source: https://twitter.com/gijigae/status/1731832513595953365

32ce8d098090846135ce6ab4caaaf489.gif

Next, we introduce the MagicAnimate method and experimental results.

Method overview

Given a reference image I_ref and a motion sequence p^{1:N} = [p^1, ..., p^N], where N is the number of frames, MagicAnimate aims to synthesize a continuous video I^{1:N} = [I^1, ..., I^N] in which every frame depicts the person from I_ref while following the corresponding pose p^i. Existing diffusion-based frameworks process each frame independently, ignoring the temporal consistency between frames, which causes a "flickering" problem in the generated animation.
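In other words, the task can be restated compactly as a conditional generation problem (notation follows the summary above, not necessarily the paper's exact symbols):

```latex
% Compact restatement of the animation objective, using the notation above.
\[
  \bigl(I_{\mathrm{ref}},\; p^{1:N} = [p^1, \dots, p^N]\bigr)
  \;\longmapsto\;
  I^{1:N} = [I^1, \dots, I^N],
  \qquad
  \text{where each } I^i \text{ shows the person of } I_{\mathrm{ref}} \text{ in pose } p^i .
\]
```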

To address this, the study builds a video diffusion model for temporal modeling by incorporating temporal attention blocks into the diffusion backbone network.
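Conceptually, such a temporal layer runs self-attention along the frame axis while treating each spatial location independently. Below is a minimal sketch of this idea; it is an illustration only, not the authors' implementation, and the tensor shapes are assumptions:

```python
# Minimal sketch of a temporal attention block: spatial positions are folded
# into the batch so that self-attention runs across frames only.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, height * width, channels)
        b, f, s, c = x.shape
        # fold spatial positions into the batch, attend across the frame axis
        x = x.permute(0, 2, 1, 3).reshape(b * s, f, c)
        h, _ = self.attn(self.norm(x), self.norm(x), self.norm(x))
        x = x + h  # residual connection
        return x.reshape(b, s, f, c).permute(0, 2, 1, 3)

# Example: 2 videos, 16 frames, an 8x8 latent grid, 320 channels
feats = torch.randn(2, 16, 64, 320)
print(TemporalAttention(320)(feats).shape)  # torch.Size([2, 16, 64, 320])
```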

In addition, existing works use a CLIP encoder to encode the reference image, but the authors argue that this cannot capture intricate details. The study therefore proposes a new appearance encoder that encodes I_ref into an appearance embedding y_a, on which the model is conditioned.
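One simple way to condition a denoising network on such an embedding is cross-attention from the U-Net features to the appearance tokens. The sketch below illustrates only that general idea; the token shapes and the injection mechanism are assumptions, not the paper's actual encoder design:

```python
# Minimal sketch: inject an appearance embedding y_a into U-Net features
# via cross-attention, so details of the reference image reach every layer.
import torch
import torch.nn as nn

class AppearanceCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, y_a: torch.Tensor) -> torch.Tensor:
        # x:   (batch, spatial_tokens, channels)  -- U-Net features
        # y_a: (batch, ref_tokens, channels)      -- appearance embedding
        h, _ = self.attn(self.norm(x), y_a, y_a)
        return x + h  # residual connection

x = torch.randn(2, 64, 320)    # 8x8 latent grid, 320 channels
y_a = torch.randn(2, 77, 320)  # appearance tokens (shape is illustrative)
print(AppearanceCrossAttention(320)(x, y_a).shape)  # torch.Size([2, 64, 320])
```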

The overall pipeline of MagicAnimate is shown in Figure 2 below. First, the appearance encoder embeds the reference image into the appearance embedding; then the target pose sequence is passed to a pose ControlNet to extract the motion conditions.

e9c19733db0eefee03ac8d7924743139.png

In practice, due to memory constraints, MagicAnimate processes long videos segment by segment. Thanks to the temporal modeling and the powerful appearance encoding, it maintains a high degree of temporal and appearance consistency across segments, but subtle discontinuities between segments remain. To alleviate this, the research team uses a simple video fusion method to improve transition smoothness.

As shown in Figure 2, MagicAnimate decomposes the whole video into overlapping segments and simply averages the predictions for the overlapping frames. Finally, the study also introduces an image-video joint training strategy to further enhance reference-image preservation and single-frame fidelity.
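A minimal sketch of that fusion step, under the assumption that overlapping predictions are simply averaged frame by frame (the segment length, overlap, and the fake denoiser below are illustrative, not the paper's settings):

```python
# Sketch of overlapping-segment video fusion: denoise each segment, then
# average predictions wherever segments overlap.
import numpy as np

def fuse_segments(num_frames, segment_len, overlap, denoise_segment):
    """denoise_segment(start, end) -> array of shape (end - start, H, W, C)."""
    acc = None
    count = np.zeros(num_frames)
    stride = segment_len - overlap
    for start in range(0, num_frames, stride):
        end = min(start + segment_len, num_frames)
        pred = denoise_segment(start, end)
        if acc is None:
            acc = np.zeros((num_frames,) + pred.shape[1:])
        acc[start:end] += pred          # accumulate per-frame predictions
        count[start:end] += 1           # how many segments covered each frame
        if end == num_frames:
            break
    return acc / count[:, None, None, None]  # average over overlaps

# Toy usage: a dummy "denoiser" that returns constant frames.
frames = fuse_segments(70, 16, 4, lambda s, e: np.ones((e - s, 8, 8, 3)))
print(frames.shape)  # (70, 8, 8, 3)
```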

Experiments and results

In the experiments, the researchers evaluated MagicAnimate on two datasets: TikTok and TED-talks. The TikTok dataset contains 350 dancing videos, and TED-talks contains 1,203 clips extracted from TED-talk videos on YouTube.

Let's look at the quantitative results first. Table 1 below compares MagicAnimate with the baseline methods on the two datasets. Table 1a shows that on the TikTok dataset, MagicAnimate surpasses all baselines on reconstruction metrics such as L1, PSNR, SSIM, and LPIPS.

Table 1b shows that on the TED-talks dataset, MagicAnimate is also better in terms of video fidelity, achieving the best FID-VID score (19.00) and FVD score (131.51).

cbf052c57281bfcbf82d3919e0cc7132.png
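For reference, the per-frame reconstruction metrics in Table 1a (L1, PSNR, SSIM) can be computed roughly as below. This is a sketch using scikit-image; LPIPS and the video-level FID-VID/FVD scores require dedicated pretrained models and are omitted:

```python
# Rough per-frame reconstruction metrics between generated and ground-truth frames.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_metrics(generated, ground_truth):
    """Both inputs: float arrays in [0, 1] with shape (num_frames, H, W, 3)."""
    l1, psnr, ssim = [], [], []
    for gen, gt in zip(generated, ground_truth):
        l1.append(np.abs(gen - gt).mean())
        psnr.append(peak_signal_noise_ratio(gt, gen, data_range=1.0))
        # channel_axis requires scikit-image >= 0.19
        ssim.append(structural_similarity(gt, gen, channel_axis=-1, data_range=1.0))
    return {"L1": float(np.mean(l1)),
            "PSNR": float(np.mean(psnr)),
            "SSIM": float(np.mean(ssim))}

# Toy check on slightly noisy data
gt = np.random.rand(4, 64, 64, 3)
gen = np.clip(gt + 0.02 * np.random.randn(*gt.shape), 0.0, 1.0)
print(frame_metrics(gen, gt))
```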

Turning to the qualitative results, the researchers show a comparison of MagicAnimate with the baseline methods in Figure 3 below. MagicAnimate achieves better fidelity and stronger background preservation, thanks to the appearance encoder that extracts detailed information from the reference image.

b6b2958be3612804260b6422dcd60439.png

The researchers also evaluated MagicAnimate's cross-identity animation and compared it with SOTA baseline methods, namely DisCo and MRAA. Specifically, they sampled two DensePose motion sequences from the TikTok test set and used these sequences to animate reference images from other videos.

Figure 1 below shows that MRAA cannot generalize to driving videos containing many unseen poses, while DisCo struggles to retain details of the reference image. In contrast, MagicAnimate demonstrates its robustness by faithfully animating the reference image under the given target motion.

a49c4116b36d11683d20fe6c6184ce56.png

Finally, there are the ablation experiments. To verify the effectiveness of the design choices in MagicAnimate, the researchers ran ablations on the TikTok dataset, covering temporal modeling, the appearance encoder, inference-stage video fusion, and image-video joint training, as shown in Table 2 and Figure 4 below.

1bf4ef5fa662a589284cd47a76ebf443.png

db093a38366777473657abfb3a77db7b.png

MagicAnimate also has broad application prospects. The researchers note that, despite being trained only on real human data, it generalizes to a variety of scenarios, including animating images from unseen domains, integration with text-to-image diffusion models, and multi-person animation.

791cd924ad10f06e48e4e7f874ba4c9c.png
