Analysis of the principles of video generation: from Gen2, Emu Video to PixelDance, SVD, Pika 1.0, W.A.L.T

Preface

Considering that text-to-video generation is starting to explode (November, for example, was an especially busy month for text-to-video):

  • On November 3, Runway’s Gen-2 released a milestone update, supporting 4K ultra-realistic, high-definition output (Runway was one of the developers of the earliest version of Stable Diffusion, while Stability AI developed the subsequent SD versions)
  • On November 16, Meta released its text-to-video model Emu Video
  • On November 18, ByteDance unexpectedly entered the race with PixelDance
  • On November 21, Stability AI, which develops and maintains the subsequent versions of Stable Diffusion, finally released its own generative video model: Stable Video Diffusion (SVD)

    picture

In addition, more than one B-end (enterprise) client has approached July, hoping we could help them develop a text-to-video application. Therefore, our first project team is planning to develop a text-to-video project after the AIGC model generation system, and ultimately connect it with text-to-3D and digital humans.

Of course, our company currently has three major project teams

  1. The first project team covers, in addition to the publicly released AIGC model generation system, text-to-image, text-to-video, text-to-3D, and digital humans
  2. The second project team covers the paper-review GPT (currently iterating on its second version, "July paper review GPT version 2: from Meta Nougat, GPT4 Review to Mistral, LongLoRA Llama"), as well as subsequent commercial AI-agent projects
  3. The third project team covers enterprise multi-document knowledge-base Q&A (currently working through various known issues)

Part 1 The iPhone moment of video generation: Runway releases Gen-1 and Gen-2 in succession

1.1 Gen-1: AI editing of existing 3D animations and mobile videos

In February 2023, Runway, which co-developed the initial version of Stable Diffusion, proposed the AI video-editing model Gen-1. Gen-1 can edit an existing video into the video we want: whether it is a rough 3D animation or a shaky clip shot on a phone, Gen-1 can upgrade it to an impressive result (the reason behind this is that Gen-1 is trained jointly on images and videos)

For example, starting from a shot of a few packaging boxes, Gen-1 can generate a video of a factory, turning the mundane into magic. It’s that simple.

1.1.1 How Gen-1 does it: add temporal layers to an image model and train jointly on images and videos

The paper corresponding to Gen-1 is "Structure and Content-Guided Video Synthesis with Diffusion Models". By the way, some articles confuse this paper with a Gen-2 paper; in fact, Runway has only published the Gen-1 paper, and no Gen-2 paper had been published as of the end of 2023. Please keep this in mind.

As shown in the figure below, Gen-1 is built on latent video diffusion models. Given the original input video (the middle part of the figure), a new video can be generated either via the text guidance in the upper part of the figure, or via the image guidance in the bottom part of the figure

How?

  • First of all, the ability to generate videos via text guidance builds directly on prior work on text-guided image generation (text-conditioned models such as DALL-E 2 and Stable Diffusion enable novice users to generate detailed imagery given only a text prompt as input). After all, latent diffusion models provide an efficient way to synthesize images in a perceptually compressed space
  • Secondly, by introducing temporal layers into a pre-trained image model and performing joint training on images and videos (i.e., training on a large-scale dataset of uncaptioned videos together with paired text-image data), the latent diffusion model is extended to video generation

    Gen-1 proposes a controllable, structure- and content-aware video diffusion model (We propose a controllable structure and content-aware video diffusion model)
    At the same time, editing is performed entirely at inference time: a video can be modified guided by an example image or by text, without additional per-video training or pre-processing (Editing is performed entirely at inference time without additional per-video training or pre-processing). Gen-1 opts to represent structure with monocular depth estimates and content with embeddings predicted by a pre-trained neural network (by the way, monocular depth estimation is a computer-vision technique that aims to infer the three-dimensional depth of a scene from two-dimensional images captured with a single camera)

  • Then it provides several control modes during the video generation process
    First, similar to image synthesis models, the model is trained so that it can infer the content of the video, such as its appearance or style, and match a user-provided image or text prompt
    Second, inspired by the diffusion process, an information obscuring process is applied to the structure representation so that one can select how strongly the model adheres to the given structure (we apply an information obscuring process to the structure representation to enable selecting of how strongly the model adheres to the given structure)
    Finally, the inference process is also adjusted: a custom guidance method inspired by classifier-free guidance enables control over the temporal consistency of generated clips, which amounts to a unified alignment of time, content, and structure

1.1.2 Detailed explanation of Gen-1's training and inference processes

The goal of the model is to preserve the structure of the video (structure generally refers to its geometric and dynamic characteristics, such as the shapes and positions of objects and their changes over time) while editing the content of the video (content generally refers to its appearance and semantic characteristics, such as the colors and styles of objects and the lighting of the scene)

To achieve this goal, we need to learn a generative model p(x \mid s, c) of videos x conditioned on a structure representation s and a content representation c. The structure representation s is inferred from the input video, and the content is then modified based on a text prompt c describing the edit, as shown below

  • In the training process on the left side of the above figure, the input video x is encoded to z_0 with a fixed encoder and diffused to z_t. A structure representation s is extracted by encoding depth maps obtained with MiDaS, and a content representation c is extracted by encoding one of the frames with CLIP (We extract a structure representation s by encoding depth maps obtained with MiDaS, and a content representation c by encoding one of the frames with CLIP). The model then learns to reverse the diffusion process in the latent space, with s concatenated to the latents and c provided with the help of cross-attention blocks
  • In the inference process on the right side of the figure above, the structure s of the input video is provided in the same way. To specify content via text, CLIP text embeddings are converted to image embeddings via a prior (To specify content via text, we convert CLIP text embeddings to image embeddings via a prior)
1.1.2.1 Review of latent diffusion models

The forward diffusion process of the diffusion model is defined as

q\left(x_{t} \mid x_{t-1}\right):=\mathcal{N}\left(x_{t} ; \sqrt{1-\beta_{t}} x_{t-1}, \beta_{t} \mathcal{I}\right)

Noise drawn from a normal distribution is slowly added to each sample x_{t-1} to obtain x_t. This forward diffusion process forms a Markov chain; the variance of the added noise is \beta_{t}, with t \in\{1, \ldots, T\}
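Composing these per-step Gaussians gives the well-known closed form q\left(x_{t} \mid x_{0}\right)=\mathcal{N}\left(x_{t} ; \sqrt{\bar{\alpha}_{t}} x_{0},\left(1-\bar{\alpha}_{t}\right) \mathcal{I}\right) with \bar{\alpha}_{t}=\prod_{s \leq t}\left(1-\beta_{s}\right). As a small illustration, here is a minimal PyTorch sketch (the schedule values and variable names are illustrative, not taken from the Gen-1 code) of sampling x_t directly from x_0:

    import torch

    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)                 # standard DDPM linear noise schedule
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)    # \bar{alpha}_t = prod_s (1 - beta_s)

    def q_sample(x0, t, noise=None):
        """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I)."""
        if noise is None:
            noise = torch.randn_like(x0)
        abar_t = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over the batch
        return abar_t.sqrt() * x0 + (1.0 - abar_t).sqrt() * noise

    # e.g. a batch of 4 RGB frames and 4 random timesteps
    x0 = torch.randn(4, 3, 64, 64)
    t = torch.randint(0, T, (4,))
    xt = q_sample(x0, t)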

As for the reverse process, it is defined according to the following formula

\begin{array}{c} p_{\theta}\left(x_{0}\right):=\int p_{\theta}\left(x_{0: T}\right) d x_{1: T} \\ p_{\theta}\left(x_{0: T}\right)=p\left(x_{T}\right) \prod_{t=1}^{T} p_{\theta}\left(x_{t-1} \mid x_{t}\right) \\ p_{\theta}\left(x_{t-1} \mid x_{t}\right):=\mathcal{N}\left(x_{t-1} ; \mu_{\theta}\left(x_{t}, t\right), \Sigma_{\theta}\left(x_{t}, t\right)\right) \end{array}

Here the variance \Sigma_{\theta}\left(x_{t}, t\right) is fixed, and we only need to learn the mean \mu_{\theta}\left(x_{t}, t\right). The loss function we need to optimize is

L:=\mathbb{E}_{t, q} \lambda_{t}\left\|\mu_{t}\left(x_{t}, x_{0}\right)-\mu_{\theta}\left(x_{t}, t\right)\right\|^{2}

which is eventually converted into the equivalent noise-prediction ("ε-prediction") form

L_{\text{simple}}:=\mathbb{E}_{t, x_{0}, \epsilon}\left\|\epsilon-\epsilon_{\theta}\left(x_{t}, t\right)\right\|^{2}

A friendly reminder: if you have any questions about the derivation of the diffusion model DDPM above, please refer to the second part of the article "The Origin of AI Painting Ability: From VAE, Diffusion Model DDPM, DETR to ViT/Swin Transformer", where each step of the DDPM derivation is worked out in detail

1.1.2.2 Spatio-temporal Latent Diffusion

In order to correctly model the distribution of video frames, the following work needs to be done

  1. The image architecture is extended by introducing temporal layers, which are only active for video inputs, while the autoencoder remains fixed and processes each frame in the video independently
    we extend an image architecture by introducing temporal layers, which are only active for video inputs. All other layers are shared between the image and video model. The autoencoder remains fixed and processes each frame in a video independently.
  2. UNet mainly consists of two modules: residual block and transformer block, which are extended to video by adding one-dimensional convolution across time and one-dimensional self-attention across time (we extend them to videos by adding both 1D convolutions across time and 1D self-attentions across time)

    In each residual block, as shown on the left side of the figure above, one temporal convolution is introduced after each 2D convolution (we introduce one temporal convolution after each 2D convolution)
    Similarly, as shown on the right side of the figure above, after each 2D transformer block there is a temporal 1D transformer block that mimics its spatial counterpart along the time axis, and learnable positional encodings of the frame index are fed into the temporal transformer blocks. This is equivalent to
    \rightarrow  adding a one-dimensional temporal convolution after each spatial convolution
    \rightarrow  adding a one-dimensional temporal attention layer after each spatial attention layer
  3. In the final implementation, an image is treated as a video with only a single frame, so that both cases are handled uniformly (see the sketch below)
    With batch size b, number of frames n, number of channels c, and spatial resolution h ✖️ w, a batched tensor of shape b × n × c × h × w is rearranged to (b · n) × c × h × w for spatial layers, to (b · h · w) × c × n for temporal convolutions, and to (b · h · w) × n × c for temporal self-attention
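To make this rearrangement concrete, here is a minimal sketch assuming PyTorch and einops (the shapes are illustrative, not Runway's actual sizes):

    import torch
    from einops import rearrange

    b, n, c, h, w = 2, 8, 64, 32, 32          # batch, frames, channels, height, width
    video = torch.randn(b, n, c, h, w)

    # spatial layers (2D conv / 2D attention) treat every frame as an independent image
    spatial_in = rearrange(video, "b n c h w -> (b n) c h w")

    # temporal 1D convolution mixes information across the n frames at every spatial position
    temporal_conv_in = rearrange(video, "b n c h w -> (b h w) c n")

    # temporal self-attention attends over the frame axis, with channels as token features
    temporal_attn_in = rearrange(video, "b n c h w -> (b h w) n c")

    print(spatial_in.shape, temporal_conv_in.shape, temporal_attn_in.shape)
    # torch.Size([16, 64, 32, 32]) torch.Size([2048, 64, 8]) torch.Size([2048, 8, 64])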

//To be updated

1.1.2.3 Representing Content and Structure

Diffusion models are well suited to modeling conditional distributions such as p(x \mid s, c). However, due to the lack of large-scale paired video-text datasets, training has to be restricted to uncaptioned video data

  1. In short, the goal is to edit a video based on a user-provided text prompt describing the desired edited video, but there is a problem: we have neither training data of triplets consisting of a video, its edit prompt, and the resulting output, nor even pairs of videos and text captions (Thus, while our goal is to edit an input video based on a text prompt describing the desired edited video, we have neither training data of triplets with a video, its edit prompt and the resulting output, nor even pairs of videos and text captions)
  2. Therefore, the representations of structure and content must be derived from the training video x itself, i.e., s=s(x), c=c(x), so the loss function becomes \lambda_{t}\left\|\mu_{t}\left(\mathcal{E}(x)_{t}, \mathcal{E}(x)_{0}\right)-\mu_{\theta}\left(\mathcal{E}(x)_{t}, t, s(x), c(x)\right)\right\|^{2}
  3. In contrast, during inference, the structure s(y) is derived from an input video y and the content c(t) from a text prompt t. The edited version x of y is obtained by sampling the generative model conditioned on s(y) and c(t):
    z \sim p_{\theta}(z \mid s(y), c(t)), \quad x=\mathcal{D}(z)
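As a schematic illustration of the inference equation above, the sketch below simply wires the pieces together; structure_of, clip_prior_from_text, sample_latent, and vae_decode are hypothetical stand-ins for the MiDaS-based structure encoder, the text-to-image-embedding prior, the latent diffusion sampler, and the autoencoder's decoder:

    # A schematic sketch of z ~ p_theta(z | s(y), c(t)), x = D(z); every callable here
    # is a hypothetical placeholder for the corresponding component described above.
    def edit_video(video_y, prompt_t, structure_of, clip_prior_from_text, sample_latent, vae_decode):
        s = structure_of(video_y)            # structure: encoded depth maps of the input video
        c = clip_prior_from_text(prompt_t)   # content: CLIP text embedding mapped to an image embedding
        z = sample_latent(s, c)              # sample the latent video diffusion model given (s, c)
        return vae_decode(z)                 # decode the latents back to pixel space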

At the level of content representation

  1. In order to infer the content representation from both a text input t and a video input x, they use CLIP image embeddings to represent content.
    For video inputs, an input frame is randomly selected during training. In addition, a prior model can be trained that allows image embeddings to be sampled from text embeddings; this also makes it possible to specify the edit via an image input rather than text.
  2. To be updated...

// To be updated

1.2 Gen-2 gets an epic upgrade - it can generate videos from scratch

Many readers had not even had time to try Gen-1 when, in March 2023, Runway launched the closed beta of Gen-2, and then officially released it to the public in June (Runway's introduction page for Gen-2: https://research.runwayml.com/gen2). Compared with Gen-1, Gen-2 got an epic upgrade: it can generate videos from scratch. If Stable Diffusion/Midjourney, released last year, were the representatives of text-to-image, then Gen-2 is the first representative of text-to-video

  1. When Gen-2 was first released, it could only generate 4-second videos. The free trial quota for each user was 105 seconds, enough for about 26 Gen-2 videos.
  2. In August, the maximum length of generated videos was increased from 4s to 18s.
  3. In September, a new Director Mode was added, which can control the position and movement speed of the camera.

1.2.1 8 modes of video generation based on Gen-2

  1. Text to Video
  2. Text + Image to Video
  3. Image to Video, for example, enter the following picture

    Gen-2 can generate the corresponding video based on the picture above.

    Gen2:Image to Video

  4. Stylization
  5. Storyboard
  6. Mask
  7. Render
  8. Customization

1.2.2 Gen-2 updates in November 2023: 4K ultra-high-definition video, and motion wherever you paint

Gen-2 launched two major updates in a row in November

  1. On November 3, Runway’s Gen-2 released a milestone update supporting 4K ultra-realistic, high-definition output.
  2. On November 21, the Motion Brush feature ("wherever you paint, it moves") was launched, marking an important milestone in the controllability of generative models.

    picture

Part 2 Meta releases generative video model: Emu Video

On November 16, Meta released the text-to-video model Emu Video, which supports flexible image editing (for example, turning a "rabbit" into a "rabbit playing a trumpet", and then into a "rabbit playing a rainbow-colored trumpet"), and also supports generating high-resolution videos from text and images (such as making the "trumpet-playing rabbit" dance happily)

So what is the principle behind it? In fact, it involves two tasks

  1. Flexible image editing is accomplished by a model called "Emu Edit". It supports free editing of images through text, including local and global editing, removing and adding backgrounds, color and geometry conversion, detection and segmentation, etc.
    In addition, it follows instructions precisely, ensuring that pixels in the input image that are not relevant to the instruction remain unchanged, for example when putting a skirt on an ostrich

    picture

  2. High-resolution videos are generated by a model called "Emu Video". Emu Video is a diffusion-based text-to-video model that can generate 512x512, 4-second high-resolution videos from text. Human evaluations suggest that Emu Video may score higher in generation quality and text fidelity than Runway's Gen-2 and Pika Labs' results. Here is what it produces:

    picture

In its official blog, Meta looked ahead to the application prospects of these two technologies, such as letting social media users generate their own animations and emoticons, or edit photos and images however they wish. As for generating animations/emoticons, Meta also mentioned this when it released the Emu model at the earlier Meta Connect conference (see: "Meta's ChatGPT is here: powered by Llama 2, with access to Bing search, demonstrated live by Zuckerberg")

picture

Next, we introduce these two models respectively.

2.1 Emu Edit: Precise image editing

2.1.1 Advantages compared to InstructPix2Pix: more accurate execution of instructions

The paper corresponding to Emu Edit is "Emu Edit: Precise Image Editing via Recognition and Generation Tasks", and its project page is: https://emu-edit.metademolab.com/

As stated in the paper, image editing is used by millions of people every day today. However, popular image editing tools either require considerable expertise and are time-consuming to use, or are very limited, offering only a set of predefined editing operations such as specific filters. Fortunately, today's instruction-based image editing (Instruction-based image editing) attempts to allow users to use natural language instructions to solve these limitations. For example, a user can provide an image to the model and instruct it with a command like "Dress the emu in a firefighter costume."

However, while instruction-based image editing models like InstructPix2Pix can be used to handle a variety of given instructions,

  1. They often have difficulty interpreting and executing instructions accurately
     

    By the way, InstructPix2Pix introduced an instruction-following image editing model: it leverages GPT-3 and Prompt-to-Prompt to generate a large synthetic dataset of instruction-based image edits, and uses this dataset to train an image editing model that can follow instructions
    Unlike InstructPix2Pix, which uses a synthetic dataset, MagicBrush builds a manually annotated, instruction-guided image editing dataset by asking humans to use an online image editing tool; fine-tuning InstructPix2Pix on this dataset improves its image editing capabilities

  2. In addition, these models have limited generalization capabilities and are usually unable to complete tasks that differ even slightly from those seen during training. For example, in the picture below, when asked to make a little rabbit play a rainbow-colored trumpet, other models either dye the rabbit in rainbow colors or simply generate a rainbow-colored trumpet

To solve these problems, Meta introduced Emu Edit, the first image editing model trained on such a diverse set of tasks. As mentioned earlier, Emu Edit can perform free-form editing according to instructions, covering tasks such as local and global editing, background removal and addition, color changes and geometric transformations, and detection and segmentation.

Unlike many of today's generative AI models, Emu Edit follows instructions exactly, ensuring that pixels in the input image unrelated to the instruction remain unchanged. For example, on the left side of the picture below, the user gives the instruction "remove the puppy from the grass", and the picture is almost unchanged apart from the removed object. Likewise, on the right side of the picture below, removing the text in the lower-left corner of the picture and then changing the background is also handled very well by Emu Edit:

picturepicture

2.1.2 Developing a 10-million-sample dataset covering 16 different tasks

Considering that the scale, diversity, and quality of existing datasets are limited, Meta developed a dataset containing 16 different tasks and 10 million synthetic samples to train this model. Each sample contains an input image, a description of the task to be performed (i.e., a text instruction), a target output image, and a task index. Specifically:

  1. Task List
    The 16 tasks are divided into three main categories: region-based editing, free-form editing, and vision tasks
    Region-based editing
    Local: substituting one object for another, or altering an object’s attributes (e.g., “make it smile”)
    Remove: erasing an object from the image
    Add: inserting a new object into the image
    Texture: altering an object’s visual characteristics without affecting its structure (e.g., painting over, filling, or covering an object)
    Background: changing the scene’s background
    Free-form editing
    Global: edit instructions that affect the entire image, or that cannot be described using a mask (e.g., “let’s see it in the summer”)
    Style: changing the style of an image
    Text Editing: text-related editing tasks such as adding, removing, or swapping text, and altering the text’s font and color
    Vision tasks
    Detect: identifying and marking a specific object within the image with a rectangular bounding box
    Segment: isolating and marking an object in the image
    Color: color adjustments such as sharpening and blurring
    Image-to-Image Translation: tasks involving bidirectional image-type conversion, such as sketch-to-image, depth map-to-image, normal map-to-image, pose-to-image, segmentation map-to-image, and so on
  2. Generation of text instructions
    To generate editing instructions, they utilize a dialogue-optimized 70-billion-parameter variant of Llama 2. Specifically, the LLM is given a task description, some task-specific exemplars, and a real image description
    To increase diversity, the exemplars are sampled and their order randomized. Given such input, the LLM is expected to output: (1) an editing instruction, (2) an output caption for the ideal output image, and (3) which objects should be updated or added to the original image
    The following is the prompt they designed (a usage sketch follows the function)
    # Note: `random`, `torch`, `choice`, `shuffle` and the list `few_shot_examples`
    # (the task-specific in-context examples) are assumed to be defined/imported, e.g.:
    import random
    from random import choice, shuffle
    import torch

    def get_content_instruction(new_prompt):
        # randomly pick the verb used to phrase the "add an object" instruction
        optional_verbs = choice(["include", "place", "position", "set", "incorporate", "alongside", 
                                 "give", "put", "insert", "together with", "with", "make", "integrate", 
                                 "have", "append", "make", "add", "include"])
    
        # system message #
        system_message = (
            f"<<SYS>>\n"
            "You are an assistant that only speaks JSON. Do not write normal text. The assistant answer is "
            "JSON with the following string fields: 'edit', 'edited object','output'. Here is the latest "
            "conversation between Assistant and User.\n"
            "<</SYS>>"
        )
    
        # introduction message #
        intro_message = (
            f"[INST]User: Hi, My job to take a given caption ('input') and to output the following: an "
            f"instruction for {optional_verbs} an object to the image ('edit'), the object to {optional_verbs} "
            "('edited object'), and the caption with the object ('output'). Please help me do it. "
            "I will give you the 'input', and you will help. When you reply, use the following format: "
            "{\"edit\": '<instruction>', 'edited object': '<object>', 'output': '<caption>'}[/INST]\n"
            "Assistant: Sure, I'd be happy to help! Please provide the actual input caption you'd like me to "
            f"read and I'll assist you with writing an instruction to {optional_verbs} an object to the "
            "image, writing the added object and writing the caption with the object."
        )
    
        # shuffling: keep a random 60% subset of the in-context examples, in random order #
        random.seed(torch.randint(1 << 32, ()).item())
        examples = list(few_shot_examples)
        shuffle(examples)
        examples = examples[:int(len(examples) * 0.6)]
        prompt = system_message + intro_message + "".join(examples)
    
        # add the test prompt #
        prompt += f"[INST]User: {new_prompt}[/INST]"
    
        return prompt
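    # A hypothetical usage sketch (the caption, the llama2_70b_chat call and the JSON
    # parsing below are illustrative, not part of the paper's released code):
    #   prompt = get_content_instruction("a brown dog sitting on a wooden bench")
    #   reply = llama2_70b_chat(prompt)   # the dialogue-optimized Llama 2 70B mentioned above
    #   sample = json.loads(reply)        # -> {"edit": ..., "edited object": ..., "output": ...}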
  3. Image Pairs Generation
    A crucial prerequisite when creating a pair of input and edited images is to guarantee that the two images differ only in the specified elements or locations, while remaining identical in all other respects
    Previous instruction-based image editing methods (e.g., InstructPix2Pix: Learning to Follow Image Editing Instructions) rely on Prompt-to-Prompt (P2P) to build an image-editing dataset. P2P injects the cross-attention maps from the input image generation into the edited image generation
    To support local edits, P2P additionally approximates a mask of the edited part based on the cross-attention maps, and constrains the edit to this local area
    P2P relies on word-to-word alignment between the input image caption and the edited image caption (e.g., "a cat riding a bicycle" and "a cat riding a car") to produce editing image pairs. However, when there is no word-to-word alignment, the resulting masks tend to be imprecise due to the reliance on cross-attention maps
    Furthermore, since word-to-word alignment is not a practical assumption in most image editing tasks, this approach often fails to preserve structure and identity. To address this challenge, Emu Edit proposes a mask extraction method that is applied before the editing process. The approach involves: (i) identifying the edited areas from the editing instruction via an LLM and creating the corresponding masks before image generation, and (ii) integrating these masks during the editing process to ensure seamless blending of the edited regions with the original image

  4. In addition, various techniques, including dilation and Gaussian blurring, are used to refine the masks (sketched below), and a comprehensive filtering approach is adopted to ensure the fidelity of the dataset (We employ a comprehensive filtering approach to ensure the fidelity of the dataset). This includes: (i) using the task predictor (Sec. 4.2 of the paper) to reassign samples whose instructions should belong to another task; (ii) applying CLIP filtering metrics [2]; (iii) employing structure-preserving filtering based on the L1 distance between the depth map of the input image and that of the edited image; and (iv) applying image detectors to validate the presence (in the Add task), absence (in the Remove task), or replacement (in the Local task) of elements, according to the objects specified in the instruction. This process filtered out 70% of the data, resulting in a final dataset of 10 million samples
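As a small illustration of the dilation and Gaussian-blur refinement mentioned above, here is a minimal OpenCV sketch (the kernel sizes are illustrative assumptions, not the values used by Meta):

    import cv2
    import numpy as np

    def refine_mask(mask, dilate_iter=2, blur_ksize=21):
        """Grow a binary edit mask slightly and soften its border for seamless blending."""
        mask = (mask > 0).astype(np.uint8) * 255
        kernel = np.ones((5, 5), np.uint8)
        mask = cv2.dilate(mask, kernel, iterations=dilate_iter)      # expand the edited region
        mask = cv2.GaussianBlur(mask, (blur_ksize, blur_ksize), 0)   # feather the boundary
        return mask.astype(np.float32) / 255.0                       # soft mask in [0, 1]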

2.1.3 Model architecture: based on a latent diffusion model, first pre-trained and then fine-tuned on thousands of annotated images

Emu is a two-stage approach that begins with a pre-training phase and concludes with a quality fine-tuning stage (The Emu model is a two-stage approach that begins with a pre-training phase and concludes with a quality fine-tuning stage). The key to this approach is that the fine-tuning dataset is relatively small, containing only a few thousand images, but these must be of extraordinary quality, which often requires human annotation

  1. Emu adopts a latent diffusion model architecture [High-Resolution Image Synthesis with Latent Diffusion Models] to support high-resolution image generation, and incorporates a 16-channel autoencoder with encoder E and decoder D
  2. A large U-Net ϵθ with 2.8 billion parameters θ, text embeddings from CLIP ViT-L [18] and T5-XXL [19], and a substantial pre-training dataset of 1.1 billion images facilitate the model's ability to learn complex semantics and finer details, with a noise-offset strategy contributing to high-contrast and aesthetically pleasing image generation
  3. To convert Emu into an instruction-based image editing model (Emu Edit), it is conditioned on the image to be modified cI and on the instruction cT. Given the encoded latent of an image z = E(x), the diffusion process generates a noisy latent zt where the noise level increases over timesteps t ∈ T. Emu Edit is trained to minimize the following optimization problem, where ϵ ∈ N(0, 1) is the noise added by the diffusion process and y = (cT, cI, x) is a triplet of instruction, input image, and target image from the dataset. In practice, the weights of Emu Edit are initialized with the weights of Emu

    \min _{\theta} \mathbb{E}_{y, \epsilon, t}\left[\left\|\epsilon-\epsilon_{\theta}\left(z_{t}, t, E\left(c_{I}\right), c_{T}\right)\right\|_{2}^{2}\right]

2.1.4 Two keys to training: multi-task training, and task embedding vectors fused via cross-attention and added to the timestep embedding

There are two main keys to training methods:

  • First, they developed unique data management pipelines for each of the 16 tasks. Meta found that training a single model on all tasks produced better results than training expert models independently on each task. And as the number of training tasks increases, the performance of Emu Edit will also increase.
  • Secondly, in order to effectively handle a variety of tasks, the concept of learned task embeddings is introduced to guide the generation process in the direction of the correct generation task
    Second, to process this wide array of tasks effectively, we introduce the concept of learned task embeddings, which are used to steer the generation process toward the correct generative task.

    Specifically, for each task a unique task embedding vector is learned and integrated into the model via cross-attention interactions, and also added to the timestep embeddings (we learn a unique task embedding vector, and integrate it into the model through cross-attention interactions, and by adding it to the timestep embeddings)

    They demonstrate that the learned task embeddings significantly enhance the model’s ability to accurately infer appropriate editing intent from free-form instructions and perform correct edits
    During this process, the model weights are kept unchanged and only one task embedding is updated to adapt to the new task. Experiments show that Emu Edit can quickly adapt to new tasks, such as super-resolution

The following focuses on explaining the learned task embedding (Learned Task Embedding)


To guide the generation process in the right direction, they learn an embedding vector for each task in the dataset

  • During training, given a sample in the dataset, the task index i is used to fetch the task's embedding vector v_i from an embedding table, and it is optimized jointly with the model weights (we use the task index, i, to fetch the task's embedding vector, vi, from an embedding table, and optimize it jointly with the model weights)
  • Specifically, the task embedding is introduced as an additional condition to the U-Net ϵθ: it is integrated into the U-Net via cross-attention interactions, and added to the timestep embeddings (Concretely, we integrate the task embedding into the U-Net via cross-attention interactions, and by adding it to the timestep embeddings)

The optimization problem is updated to

\min _{\theta, v_{1}, \ldots, v_{k}} \mathbb{E}_{\hat{y}, \epsilon, t}\left[\left\|\epsilon-\epsilon_{\theta}\left(z_{t}, t, E\left(c_{I}\right), c_{T}, v_{i}\right)\right\|_{2}^{2}\right]

where k is the total number of tasks in the dataset, and \hat{y}=\left(c_{I}, c_{T}, x, i\right) is the quadruple of input image, input instruction text, target image, and task index from the dataset
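Below is a minimal PyTorch sketch of this task-conditioned objective. The unet, vae_encode, text_encode, and scheduler arguments are hypothetical stand-ins, and feeding the image condition in via channel concatenation is an assumption on my part; only the embedding table and the loss follow the formula above:

    import torch
    import torch.nn as nn

    K, d = 16, 1280                      # number of tasks, assumed embedding width
    task_table = nn.Embedding(K, d)      # v_1 ... v_K, optimized jointly with the model weights

    def emu_edit_style_loss(unet, vae_encode, text_encode, scheduler, x, c_I, c_T, task_idx):
        """One training step of the task-conditioned denoising objective (schematic)."""
        z0 = vae_encode(x)                                                 # latent of the target image
        t = torch.randint(0, scheduler.num_steps, (z0.shape[0],), device=z0.device)
        eps = torch.randn_like(z0)
        zt = scheduler.add_noise(z0, eps, t)                               # forward diffusion
        v_i = task_table(task_idx)                                         # fetch the task embedding
        eps_hat = unet(torch.cat([zt, vae_encode(c_I)], dim=1),            # image condition (assumed concat)
                       t, text_encode(c_T), task_emb=v_i)                  # v_i also added to the t-embedding inside
        return ((eps - eps_hat) ** 2).mean()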

// To be updated

2.2 Emu Video: first generate an image, then generate a video from the image and text

2.2.1 EMU VIDEO:Factorizing Text-to-Video Generation by Explicit Image Conditioning

Large-scale text-to-image models are trained on web-scale image-text pairs and can generate high-quality, diverse images, but there are problems

  1. While these models can be further adapted to text-to-video (T2V) generation by using video-text pairs, video generation still lags behind image generation in terms of quality and diversity
    Compared with image generation, video generation is more challenging because it requires modeling a higher-dimensional spatio-temporal output space while still being conditioned only on a text prompt. Furthermore, existing video-text datasets are typically an order of magnitude smaller than image-text datasets
  2. The mainstream mode of video generation is to use a diffusion model to generate all video frames at once. In stark contrast, in NLP, long sequence generation is formulated as an autoregressive problem: predict the next word conditional on the previously predicted word
    \rightarrow  Therefore, the conditioning signal for each subsequent prediction gradually becomes stronger. The researchers hypothesized that strengthening the conditioning signal is also important for high-quality video generation, since video is itself a time series. However, autoregressive decoding with diffusion models is challenging, because such a model needs multiple iterations just to generate a single frame

Therefore, Meta researchers proposed EMU VIDEO. Their paper is "EMU VIDEO: Factorizing Text-to-Video Generation by Explicit Image Conditioning", and the project page is https://emu-video.metademolab.com/. It strengthens the conditioning of diffusion-based text-to-video generation through an explicit intermediate image generation step

Specifically, they decompose the text-to-video problem into two sub-problems:

  1. According to the input text prompt p, generate an image I
  2. Then use stronger conditioning, namely the generated image and the text, to generate the video V
    Intuitively, giving the model a starting image as well as the text makes video generation easier, because the model only needs to predict how the image will evolve in the future
    And, to condition the model F on the image, they zero-pad the image temporally and concatenate it with a binary mask (indicating which frames are zero-padded) and the noisy input (see the sketch after this list)

    So which text-to-image model is used for initialization? They use the text-to-image U-Net architecture for the model and initialize all spatial parameters with the pretrained T2I model. The model uses both a frozen T5-XL and a frozen CLIP text encoder to extract features from the text prompt, with a separate cross-attention layer in the U-Net attending to each text feature. After initialization, the model contains 2.7B frozen spatial parameters and 1.7B learned temporal parameters
    The model uses both a frozen T5-XL [15] and a frozen CLIP [58] text encoder to extract features from the text prompt. Separate cross-attention layers in the U-Net attend to each of the text features. After initialization, our model contains 2.7B spatial parameters which are kept frozen, and 1.7B temporal parameters that are learned

    Since video-text datasets are much smaller than image-text datasets, the researchers also initialize their text-to-video model from a weight-frozen pretrained text-to-image (T2I) model
    They also identified key design decisions (adjusted noise schedules for diffusion, and multi-stage training), which allow them to directly generate 512px high-resolution videos without needing the deep cascade of models used in prior work
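To make the image conditioning described in the list above concrete, here is a minimal PyTorch sketch of the temporal zero-padding and binary-mask concatenation (the tensor layout and channel ordering are my assumptions, not Meta's exact implementation):

    import torch

    def build_image_condition(noisy_latents, first_frame_latent):
        """noisy_latents: (B, T, C, H, W); first_frame_latent: (B, C, H, W)."""
        B, T, C, H, W = noisy_latents.shape
        # zero-pad the starting image across time: only frame 0 carries the image latent
        image_cond = torch.zeros_like(noisy_latents)
        image_cond[:, 0] = first_frame_latent
        # binary mask marking which frames carry the conditioning image (the rest are zero-padded)
        mask = torch.zeros(B, T, 1, H, W, device=noisy_latents.device)
        mask[:, 0] = 1.0
        # concatenate along the channel axis: (B, T, 2C + 1, H, W)
        return torch.cat([noisy_latents, image_cond, mask], dim=2)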

Let's look at more details

  1. We initialize F with a pretrained text-to-image model to ensure that it can generate images on initialization
    Since the spatial layers are initialized from the pretrained T2I model and kept frozen, the model retains the conceptual and stylistic diversity learned from large image-text datasets and uses it to generate I. This comes at no additional training cost, unlike approaches such as Imagen Video that jointly fine-tune on image and video data to maintain such style (Since the spatial layers are initialized from a pretrained T2I model and kept frozen, our model retains the conceptual and stylistic diversity learned from large image-text datasets, and uses it to generate I. This comes at no additional training cost unlike approaches that do joint finetuning on image and video data to maintain such style)
    Of course, many direct T2V methods [such as Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models, or Make-A-Video: Text-to-Video Generation without Text-Video Data] are also initialized from a pretrained T2I model and keep the spatial layers frozen. However, they do not employ this image-based factorization and thus do not retain the quality and diversity of the T2I model

    Next, F only needs to be trained to solve the second step, namely inferring the video conditioned on the text prompt and the starting frame
    This is done by sampling the starting frame I from a video and requiring the model to predict the T frames using both the text prompt p and the image I as conditioning, thereby training F on video-text pairs
  2. Since a latent diffusion model is used, the video V is first converted frame by frame into a latent space X ∈ R^{T×C×H×W} using an image autoencoder, which reduces the spatial dimensions
    The latent space can be converted back to pixel space using the autoencoder's decoder. The T frames of the video are noised independently to produce the noised input Xt, which the diffusion model is trained to denoise (The T frames of the video are noised independently to produce the noised input Xt, which the diffusion model is trained to denoise)
  3. The pretrained T2I model is used to initialize the latent diffusion model F
    As in "1.1.2.2 Spatio-temporal Latent Diffusion" above, new learnable temporal parameters are added:
    \rightarrow  a one-dimensional temporal convolution after each spatial convolution
    \rightarrow  a one-dimensional temporal attention layer after each spatial attention layer
    The original spatial convolution and attention layers are applied independently to each of the T frames and kept frozen

    The pretrained T2I model is already text-conditioned; combined with the image conditioning described above, F is conditioned on both text and image (The pretrained T2I model is already text conditioned and combined with the image conditioning described above, F is conditioned on both text and image)

The ultimate benefits of doing so are

  • Unlike methods that generate videos directly from text, their factorized approach explicitly generates an image at inference time, which allows them to easily preserve the visual diversity, style, and quality of the text-to-image model, as shown in the figure below
    This allows EMU VIDEO to surpass direct T2V methods even with the same training data, compute, and trainable parameters.

    picture

  • And, through a multi-stage training approach, the quality of text-to-video generation can be greatly improved.

    picture

2.2.2 How to extend the duration of generated video

As can be seen from the demo shown, EMU VIDEO can already support 4-second video generation. In the paper, they also explore ways to increase the duration of videos

The authors show that, with a small architectural modification, they can condition the model on T frames and extend the video. They therefore trained a variant of EMU VIDEO to generate 16 future frames conditioned on the "past" 16 frames. When extending a video, they use a future text prompt different from the original one; the effect is shown in Figure 7 of the paper. They found that the extended videos are faithful both to the original video and to the future text prompt.

Part 3 PixelDance: The generated video is extremely dynamic

On November 18, ByteDance unexpectedly entered the race with PixelDance

  • Generating videos with high consistency and rich dynamics, so that the video content can truly move, is currently the biggest challenge in the field of video generation.
  • In this regard, the latest research result, PixelDance, has taken a critical step: the dynamics of its generated results are significantly better than those of other existing models, which has attracted the industry's attention

3.1 Two video generation modes of PixelDance

On the official website (https://makepixelsdance.github.io), PixelDance provides two different video generation modes

3.1.1 Basic mode: generate videos through guidance pictures + text descriptions

The first is Basic Mode: the user only needs to provide a guide image + a text description, and PixelDance can generate a highly consistent and highly dynamic video. The guide image can be a real photo or an image generated with an existing text-to-image model.
Judging from the showcased results, PixelDance handles realistic, animated, two-dimensional, and fantasy styles alike. Character movements, facial expressions, camera control, and special-effect motions can all be handled very well by PixelDance

picture

3.1.2 Advanced Magic Mode: Generate cool shots through two guide pictures + text description

The second is the advanced Magic Mode, which gives users more room for imagination and creativity. In this mode, the user provides two guide images + a text description, which makes it possible to generate more difficult and cooler special-effects shots.

picture

In addition, the official website also showcases a 3-minute short story film entirely produced using PixelDance

  1. Using PixelDance, each scene and corresponding actions can be created according to a story envisioned by the user. Whether it is a real scene (such as Egypt, the Great Wall, etc.) or an imaginary scene (such as an alien planet), PixelDance can generate videos with rich details and action, even various special effects shots.
  2. Moreover, the image of the protagonist Mr. Polar Bear’s black top hat and red bow tie is well maintained in different scenes. Long video generation is no longer a matter of piecing together weakly relevant short video clips.

Achieving such outstanding video generation results does not rely on complex datasets or large-scale model training: PixelDance achieved the above effects with only a 1.5B-parameter model trained on the public WebVid-10M dataset.

In addition, the user can also use a simple sketch as the last frame of the video to guide the video generation process (we take the image sketch as an example and finetune PixelDance with an image sketch [49] as the last frame instruction)

3.2 Principle analysis of PixelDance and interpretation of its paper

3.2.1 PixelDance: a latent diffusion model conditioned on <text instruction, first-frame instruction, last-frame instruction>

The ByteDance team proposed PixelDance in the paper "Make Pixels Dance: High-Dynamic Video Generation" (paper: https://arxiv.org/abs/2311.10982, demo: https://makepixelsdance.github.io). It is very readable; a key reason, in my view, is that it was written by Chinese authors, and it is worth reading again and again

The paper pointed out the reason why video generation is difficult to achieve good results: compared with image generation, video generation has the characteristics of a significantly larger feature space and significantly greater action diversity. This makes it difficult for existing video generation methods to learn effective time-domain action information. Although the generated videos have high picture quality, their dynamics are very limited.

PixelDance is a video generation method based on the latent diffusion model, conditioned on <text, first frame, last frame> instructions

  • Text instructions are encoded by a pre-trained text encoder and integrated with cross-attention into a diffusion model
  • Image instructions are encoded using a pretrained VAE encoder and concatenated with perturbed video latents or Gaussian noise as the input to the diffusion model
    The image instructions are encoded with a pretrained VAE encoder and concatenated with either perturbed video latents or Gaussian noise as the input to the diffusion model
  • For the first-frame instruction, the ground-truth first frame is employed during training, making the model adhere to the first-frame instruction strictly at inference, which maintains continuity between consecutive video clips. At inference, this instruction can be conveniently obtained from a T2I model [32] or directly provided by the user

But how should the last-frame instruction be handled? Since it plays a different role than the first frame, they developed three techniques for it

  1. First, the last frame instruction is randomly selected from the last three (ground-truth) frames of the video clip during training
    thelast three (ground-truth) frames of a video clip
  2. Second, we introduce noise to the instruction to mitigate the reliance on the instruction and improve the robustness of the model
    Second, we introduce noise to the instruction to mitigate the reliance on the instruction and promote the robustness of model

    which is equivalent to perturbing the encoded latents c^image of the image instructions with noise
  3. Third, the last-frame instruction is randomly dropped with a certain probability (e.g., 25%) during training. Correspondingly, they propose a simple yet effective inference sampling strategy (sketched below).
    we randomly drop the last frame instruction with a certain probability, e.g. 25%, in training.

    During the first τ denoising steps, the last-frame instruction is used to guide the video generation towards the desired ending status. Then, during the remaining steps, the instruction is dropped, allowing the model to generate more temporally coherent video. The impact of the last-frame instruction can be adjusted by τ
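Below is a schematic sketch of this inference strategy; the denoise_step callable and its argument layout are hypothetical, and only the logic of using the last-frame condition for the first τ steps and then dropping it follows the paper:

    def sample_with_last_frame_guidance(denoise_step, z_T, c_text, c_first, c_last, num_steps, tau):
        """Reverse diffusion loop; the last-frame instruction is only applied for the first tau steps."""
        z = z_T
        for step, t in enumerate(reversed(range(num_steps))):
            use_last = step < tau                      # early steps: steer towards the desired ending
            z = denoise_step(z, t, c_text, c_first, c_last if use_last else None)
        return z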

The paper mentioned that the reason why long videos are not easy to generate is that long videos require seamless transitions between consecutive video clips and long-term consistency of scenes and characters.


There are generally two methods:

  1. Autoregressive methods [15, 22, 41] use sliding windows to generate new segments conditioned on the previous segment
    1) autoregressive methods [15, 22, 41] employ a sliding window to generate a new clip conditioned on the previous clip

    However, autoregressive methods are susceptible to quality degradation because errors accumulate over time
  2. Hierarchical methods [9, 15, 17, 53] generate sparse frames first and then interpolate intermediate frames
    hierarchical methods [9, 15, 17, 53] generate sparse frames first , then interpolate intermediate frames

    As for hierarchical methods, they require long videos for training, which are difficult to obtain due to frequent shot changes in online videos
    In addition, the challenge is compounded by having to generate temporally coherent frames across larger time intervals, which often results in lower-quality initial frames and makes it hard for the subsequent interpolation to achieve good results

Finally, to generate long videos, PixelDance is trained to strictly follow the first-frame instruction, and the last frame of the preceding video clip is used as the first-frame instruction for generating the subsequent clip (the last frame from the preceding clip is used as the first frame instruction for generating the subsequent clip)

3.2.2 PixelDance architecture: a 2D UNet extended with temporal layers and text instructions, plus image-instruction injection

PixelDance adopts the widely used 2D UNet as the diffusion model, which is constructed with a series of spatial downsampling layers followed by a series of spatial upsampling layers with inserted skip connections (We take the widely used 2D UNet as diffusion model, which is constructed with a series of spatial downsampling layers followed by a series of spatial upsampling layers with inserted skip connections)

  1. Specifically, it is built from two basic blocks, namely the 2D convolution block and the 2D attention block. The 2D UNet is extended to a 3D variant by inserting temporal layers: a 1D convolution layer along the temporal dimension after each 2D convolution layer, and a 1D attention layer along the temporal dimension after each 2D attention layer (the same recipe described in "1.1.2.2 Spatio-temporal Latent Diffusion" above)
  2. The model can be trained jointly with images and videos to maintain high-fidelity generative capabilities in spatial dimensions. For image input, 1D temporal operations are disabled
    Bi-directional self-attention is used in all temporal attention layers. The text instruction is encoded with a pre-trained CLIP text encoder [30], and the embedding c^text is injected through cross-attention layers in the UNet, with the hidden states as queries and c^text as keys and values

  3. Image Instruction Injection
    Image instructions for both the first and last frames are incorporated in conjunction with the text instruction, and ground-truth video frames are used as the instructions during training (We incorporate image instructions for both the first and last frames in conjunction with text instruction. We utilize ground-truth video frames as the instructions in training.)
    Given the image instructions of the first and last frames, denoted as \left\{\mathbf{I}^{\text {first }}, \mathbf{I}^{\text {last }}\right\}, they are first encoded with the VAE into the input space of the diffusion model, yielding \left\{\mathbf{f}^{\text {first }}, \mathbf{f}^{\text {last }}\right\}, where \mathbf{f} \in \mathbb{R}^{C \times H \times W}

    To inject the instructions without losing temporal position information, the final image condition is then constructed as (To inject the instructions without loss of the temporal position information, the final image condition is then constructed as):
    \mathbf{c}^{\text {image }}=\left[\mathbf{f}^{\text {first }}, \mathrm{PADs}, \mathbf{f}^{\text {last }}\right] \in \mathbb{R}^{F \times C \times H \times W}

    where \operatorname{PADs} \in \mathbb{R}^{(F-2) \times C \times H \times W}. The condition \mathbf{c}^{\text {image }} and the noise latent z_t are concatenated along the channel dimension to serve as the input to the diffusion model (see the sketch below)
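A minimal PyTorch sketch of this construction (shapes follow the notation above; the function and variable names are mine):

    import torch

    def build_pixeldance_condition(z_t, f_first, f_last):
        """z_t: (B, F, C, H, W) noisy video latents; f_first, f_last: (B, C, H, W) encoded image instructions."""
        B, F, C, H, W = z_t.shape
        c_image = torch.zeros_like(z_t)          # [f_first, PADs, f_last] along the frame axis
        c_image[:, 0] = f_first
        c_image[:, -1] = f_last
        # concatenate the condition and the noisy latents along the channel dimension -> (B, F, 2C, H, W)
        return torch.cat([z_t, c_image], dim=2)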

3.2.3 Data processing and training details

Finally, they trained the video diffusion model on WebVid-10M, which contains about 10M short video clips with an average duration of 18 seconds and a resolution of 336 × 596. Unfortunately, WebVid-10M has two problems:

  1. While each video is associated with a paired text, the text only provides a rough description that is weakly relevant to the video content.
  2. Another annoying issue with WebVid-10M is the watermark on all videos, which results in the watermark being present in all generated videos

Therefore, they extend the training data with another 500K self-collected watermark-free video clips depicting real-world entities such as humans, animals, objects, and landscapes, paired with coarse-grained text descriptions. Although this is only a modest proportion of the data, training on this dataset together with WebVid-10M ensures that PixelDance can produce watermark-free videos as long as the image instructions are free of watermarks (we surprisingly found that combining this dataset with WebVid-10M for training ensures that PixelDance is able to generate watermark-free videos if the image instructions are free of watermarks)

PixelDance is trained jointly on "Video-Text Dataset" and "Image-Text Dataset" (PixelDance is trained jointly on video-text dataset and image-text dataset), specifically

  1. For video data, 16 consecutive frames are randomly sampled at 4 fps for each video. Following previous work (Imagen Video: High Definition Video Generation with Diffusion Models), LAION-400M is used as the image-text dataset, and image-text data are utilized every 8 training iterations
  2. The weights of the pretrained text encoder and VAE are frozen during training. They adopt DDPM with T = 1000 timesteps for training. The model is first trained at 256×256 resolution for 200K iterations on 32 A100 GPUs with a batch size of 192
  3. The model is then fine-tuned for another 50K iterations at a higher resolution, incorporating the ϵ-prediction training objective from DDPM (Denoising Diffusion Probabilistic Models)

3.2.4 Model evaluation and effect display

Specifically, they use the existing T2I model Stable Diffusion V2.1 to obtain the first-frame instruction, and then generate videos given the text and first-frame instructions

Following previous work [7, 44], they randomly select one prompt per example to generate a total of 2,990 videos for evaluation, and compute the Fréchet Video Distance (FVD) [40] and CLIP similarity (CLIPSIM) [47] on the MSR-VTT dataset

  1. FID and FVD measure the distribution distance between the generated video and real data
  2. IS evaluates the quality of the generated video
  3. CLIPSIM estimates the similarity between the generated video and the corresponding text (a minimal sketch of how it can be computed is given after this list)
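CLIPSIM is usually computed as the average CLIP image-text similarity over the generated frames. Below is a minimal sketch using the Hugging Face transformers CLIP model; the checkpoint choice and frame handling are assumptions for illustration, not the exact evaluation code used in the paper.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clipsim(frames, prompt: str) -> float:
    """Average cosine similarity between the prompt and each generated frame.

    frames: list of PIL.Image frames taken from the generated video.
    """
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).mean().item()
```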

The zero-shot evaluation results on MSR-VTT and UCF-101 are shown in the two tables below

  1. Compared with other T2V methods on MSR-VTT, PixelDance achieves state-of-the-art results in FVD and CLIPSIM, demonstrating its superior ability to generate high-quality videos that are better aligned with the text prompts
  2. It is worth noting that PixelDance's FVD score of 381 greatly exceeds the previous state of the art, ModelScope [43]; on the UCF-101 benchmark its FVD is 550, and PixelDance outperforms other models across the metrics, including IS, FID, and FVD

As mentioned before, their video generation approach uses three different kinds of instructions: text, first-frame, and last-frame instructions

  1. The first-frame instruction significantly improves video quality by providing finer visual detail. Furthermore, it is key to generating multiple continuous video clips. With text and first-frame instructions, PixelDance is able to generate more action-rich videos than existing models, as shown in the image below (a sketch of chaining clips this way is given after this list)

  2. The last-frame instruction, which depicts the ending state of the video clip, provides additional control over video generation. Additionally, a natural shot transition can be generated using the last-frame instruction (the last sample below)
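As a concrete illustration of how the first-frame instruction enables multiple continuous clips, the sketch below chains clips by reusing the last generated frame of one clip as the first-frame instruction of the next. Here generate_clip is a hypothetical wrapper around a first-frame-conditioned video model; PixelDance does not expose such an API.

```python
def generate_long_video(generate_clip, prompt: str, first_frame, num_clips: int = 4):
    """Chain short clips into a longer video.

    generate_clip(prompt, first_frame) -> list of frames (hypothetical wrapper
    around a first-frame-conditioned video model such as PixelDance).
    """
    all_frames = []
    current_first = first_frame
    for _ in range(num_clips):
        clip = generate_clip(prompt, current_first)
        all_frames.extend(clip)
        current_first = clip[-1]  # last frame of this clip seeds the next one
    return all_frames
```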

// To be updated

Part 4 Stable Video Diffusion (SVD)

4.1 Stability AI releases generative video model Stable Video Diffusion (SVD)

On November 21, Stability AI, which develops and maintains the subsequent versions of Stable Diffusion, finally released its own generative video model, Stable Video Diffusion (SVD). It supports text-to-video and image-to-video generation, as well as turning a single view of an object into multiple views, i.e., 3D synthesis:

4.2 Three steps of SVD training: image pre-training, video pre-training, and video fine-tuning

The paper corresponding to SVD is: "Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets"

The paper identifies three steps for training SVD:

  1. Text-to-image pretraining (image pretraining), i.e., a 2D text-to-image diffusion model, such as the SDXL work: Improving Latent Diffusion Models for High-Resolution Image Synthesis
  2. Video pretraining on a relatively large but low-resolution video dataset (video pretraining on a large dataset at low resolution)
    The authors collected an initial dataset of long videos, which forms the basis of the video pretraining stage. To avoid cuts and fades leaking into synthesized videos, they apply a cut-detection pipeline in a cascaded manner at three different FPS levels (a rough illustration of cut detection is given after this list). The left part of Figure 2 in the paper shows why this is needed: after applying the cut-detection pipeline, they obtain a significantly higher number (∼4×) of clips, indicating that many video clips in the unprocessed dataset contain cuts beyond those recorded in the metadata.

    Next, they annotate each clip with three different synthetic captioning methods: first, they use the image captioner CoCa [103] to annotate the mid-frame of each clip and use V-BLIP [104] to obtain a video-based caption. Finally, they generate a third description of the clip via LLM-based summarization of the first two captions.

    The resulting initial dataset, which they call the Large Video Dataset (LVD), consists of 580M annotated video-clip/caption pairs, amounting to 212 years of content.
  3. Video fine-tuning on a small but high-quality, high-resolution video dataset (high-resolution video finetuning on a much smaller dataset with higher-quality videos)
    Specifically, they draw on training techniques from latent image diffusion modeling [12, 60] and increase the resolution of the training examples. Moreover, they use a small fine-tuning dataset comprising 250K pre-captioned video clips of high visual fidelity
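The paper's cascaded cut-detection pipeline is not released; as a rough illustration of what shot/cut detection looks like in practice, the PySceneDetect library can split a video into shots as sketched below (library choice, threshold, and file name are assumptions, not the authors' actual setup).

```python
# pip install scenedetect[opencv]
from scenedetect import detect, ContentDetector

# Detect shot boundaries by thresholding frame-to-frame content changes.
scene_list = detect("raw_clip.mp4", ContentDetector(threshold=27.0))

# Each entry is a (start, end) timecode pair for one contiguous shot;
# long videos would then be split into these shots before captioning.
for start, end in scene_list:
    print(f"shot: {start.get_timecode()} -> {end.get_timecode()}")
```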

In short, SVD is based on Stable Diffusion 2.1. The proposed curation scheme is first applied to a large video dataset comprising roughly 600 million samples, on which a strong pretrained text-to-video base model is trained

Then the base model is fine-tuned on a smaller, high-quality dataset for high-resolution downstream tasks

  • such as text-to-video (see image below, top row)
  • and image-to-video, where a sequence of frames is predicted from a single conditioning image (see image below, middle row); a minimal inference sketch for this image-to-video mode is given after this list
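For reference, the released SVD image-to-video weights can be run with Hugging Face diffusers roughly as follows; the snippet follows diffusers' documented usage, but treat it as a sketch whose model id, dtype, and helper functions may need adjusting to your environment.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load the image-to-video SVD checkpoint (fp16 to fit on a single GPU).
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16, variant="fp16",
)
pipe.to("cuda")

# Condition on a single image and predict a short sequence of frames.
image = load_image("conditioning_frame.png").resize((1024, 576))
frames = pipe(image, decode_chunk_size=8).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```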

Part 5 Pika Labs: Launch of movie special effects-level video generation model Pika 1.0

5.1 The startup of two Stanford PhD students: Pika 1.0

Demi Guo, then a doctoral student at Stanford, participated in Runway's first AI Film Festival last year, found the tools from Runway and Adobe Photoshop hard to use, and her team's work did not win an award, which led to the series of events that followed

  1. In April this year, Guo decided to drop out of Stanford to build easier-to-use AI video tools, and Pika was born (official website: https://pika.art); co-founder Chenlin Meng soon joined
    Of the two, one participated in the research behind AlphaFold 2, and the other is the second author of the DDIM paper

  2. Since its founding, Pika has grown to 500,000 users, who create millions of videos every week
    This explosive growth has drawn the interest of Silicon Valley investors, allowing Pika to raise US$55 million across three rounds of financing (the first two rounds were led by former GitHub CEO Nat Friedman, and the latest US$35 million Series A round was led by Lightspeed Venture Partners)
    Encouragingly, the valuation of this team of only 4 people has already exceeded US$200 million
  3. On November 29 this year, Pika 1.0 was officially released, opening up enormous space for producing 3D animation, anime, cartoons, movies, and more

Pika 1.0 can not only smoothly generate a video based on text and pictures, but also switch between motion and stillness in an instant:

picture

Moreover, the editability is very strong. You can specify any element in the video and quickly "change" it in one sentence:

picture

In addition, the videos generated by Pika 1.0 are quite beautiful. For example, the following video in Hayao Miyazaki's style, which I myself watched two or three times, ^_^

To summarize, the new features of Pika 1.0 include:

  1. Text-to-video / image-to-video: enter a few lines of text or upload an image to create a short, high-quality video through AI
  2. Video-to-video style conversion: convert existing videos into different styles, including different characters and objects, while maintaining the structure of the video
  3. Expand: expand the canvas or aspect ratio of a video, e.g. from TikTok's 9:16 format to widescreen 16:9; the AI model predicts the content beyond the original video boundary, effectively out-painting the missing regions
  4. Change: Use AI to edit video content, such as changing clothes, adding another character, changing the environment, or adding props
  5. Extend: Use AI to extend the length of existing video clips
  6. New web interface: Pika will be available on Discord and the web

5.2 Pika 1.0 technical details: DreamPropeller accelerates score-distillation-based text-to-3D generation

In the past, score-distillation-based methods such as DreamFusion and ProlificDreamer produced high-quality text-to-3D generations, but the running time could be as long as 10 hours.

In a recent paper, "DreamPropeller: Supercharge Text-to-3D Generation with Parallel Sampling", researchers from Stanford and Pika jointly proposed DreamPropeller, an acceleration method for score distillation that can speed up existing methods by up to 4.7×

The overall architecture of DreamPropeller is shown in the figure below

  1. At the beginning of iteration k, a window of 3D shapes (shown in green) of dimension D is initialized; these shapes are dispatched to p GPUs, which compute the SDS/VSD gradients of the shapes in parallel
  2. These gradients are then gathered for the rollout using the rule in Eq. (9) of the paper, and used to update the shapes:
    \theta_{\tau}^{k}=h^{\dagger}\left(s\left(\theta_{\tau-1}^{k-1}\right), h^{\dagger}\left(s\left(\theta_{\tau-2}^{k-1}\right), \ldots h^{\dagger}\left(s\left(\theta_{0}^{k-1}\right), \theta_{0}^{k-1}\right)\right)\right)
  3. The resulting shapes (in orange) for iteration k + 1 are compared with those from iteration k. The window is slid forward until the error at a time step is no smaller than a threshold e, which is adaptively updated using the mean/median error of the window

    In addition, in the case of VSD, the researchers optionally keep independent copies of the LoRA diffusion model on all GPUs, which are updated independently without extra communication. A highly simplified sketch of the parallel rollout idea follows below.
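To make the window/rollout mechanics more concrete, here is a highly simplified, single-process sketch of the idea: gradients for the whole window are computed from the previous iteration's states (this is the part DreamPropeller distributes across p GPUs), the window is rolled out with those gradients, and converged positions are slid past. compute_sds_grad and the plain gradient-descent update are hypothetical stand-ins for the real SDS/VSD gradient and optimizer.

```python
def dreampropeller_sketch(theta0, compute_sds_grad, lr=0.01, window=8,
                          tol=1e-3, total_steps=1000):
    """Simplified parallel-rollout loop in the spirit of DreamPropeller.

    theta0:           initial 3D shape parameters (a torch tensor).
    compute_sds_grad: hypothetical function returning an SDS/VSD gradient
                      for a given shape (the part run in parallel on p GPUs).
    """
    # Guesses for the next `window` optimization states (all start at theta0).
    states = [theta0.clone() for _ in range(window + 1)]
    done = 0
    while done < total_steps:
        # 1) Gradients for every window position, evaluated at the
        #    previous iteration's states (computed in parallel in the paper).
        grads = [compute_sds_grad(s) for s in states[:-1]]
        # 2) Rollout: sequentially apply the gathered gradients (analogue of Eq. (9)).
        new_states = [states[0]]
        for g in grads:
            new_states.append(new_states[-1] - lr * g)
        # 3) Slide the window past positions whose states stopped changing.
        shift = 0
        for old, new in zip(states[1:], new_states[1:]):
            if (new - old).norm() < tol:
                shift += 1
            else:
                break
        shift = max(shift, 1)  # the first rollout state is exact, so always advance
        done += shift
        states = new_states[shift:] + [new_states[-1].clone() for _ in range(shift)]
    return states[0]
```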

The following is a visual comparison with other models. It can be seen that the method using DreamPropeller can achieve the same high-quality generation with shorter running time.

Quantitative evaluation on 30 prompts from the DreamFusion gallery (runtime in seconds): the new method achieves competitive quality while being more than 4× faster

// To be updated


Part 6 W.A.L.T: Using Transformer for Diffusion Models

In mid-December 2023, researchers from Stanford University, Google, and Georgia Institute of Technology proposed the Window Attention Latent Transformer, abbreviated as W.A.L.T

This method successfully integrates the Transformer architecture into latent video diffusion models. Stanford professor Fei-Fei Li is also one of the authors of the paper.

// To be updated


references

  1. A new breakthrough in video generation: PixelDance, easily presenting complex movements and cool special effects
  2. Is it the end of the world for directors to make blockbusters in one sentence? Runway releases text generation video model Gen-2, science fiction Japanese two-dimensional
  3. November 2023 runway Gen2 update
    Gen-2 subverts AI-generated videos! One sentence produces a 4K high-definition blockbuster in seconds, netizens: Completely changing the rules of the game
    The text-generating video tool has received a major update. How powerful is Runway Gen-2?
  4. The Meta version of ChatGPT is here: Llama 2 powered, with Bing search access, a live demo by Zuckerberg, and the introduction of the text-to-image model Emu
  5. Meta generative AI continuous amplification moves: video generation surpasses Gen-2, animated emoticons can be customized as you like
  6. Stanford’s beautiful PhD entrepreneurship project is a hit! AI video generation has become a hit since its debut, and it raised US$55 million in half a year
    The video Pika 1.0 by a Stanford Chinese PhD student is a hit! The 4-person company is valued at US$200 million, with an OpenAI co-founder participating in the investment
  7. Pika 1.0 first test: Netizens are the first to experience the movie-level explosion effect, and the technical details behind it are disclosed for the first time
  8. ..

Create, modify, and improve records

  1. 11.28, read Runway's Gen-1 paper word for word and improved the first part of this article
    Considered adding a new research direction: text-to-video
    We will release a series of interpretation blogs, open classes, courses, and commercial projects/solutions around text-to-video one by one
  2. 11.29, started reading the Emu Edit paper and EMU VIDEO paper released by Meta, and improved the third part of this article.
  3. 11.30, based on the Stable Video Diffusion paper, improved the fourth part of this article
    and updated the fifth part related to Pika 1.0
  4. 12.1, by reading the PixelDance paper "Make Pixels Dance: High-Dynamic Video Generation", started improving the third part
    I have to say, this paper is very well written
  5. 12.2, by reviewing Meta's Emu Edit paper again, a new section "2.1.2 Developing a 10 million-scale data set covering 16 different tasks" is added to supplement the description of its data set.
  6. 12.3, polished details throughout the text and formed the first draft of this article
    This article will be continuously revised and improved over the next half month, and any important new developments in these models or technologies will also be added
  7. 12.9, added a new section: "5.2 Pika 1.0 technical details: DreamPropeller accelerates score-distillation-based text-to-3D generation"
  8. 12.15, added: "Part 6 W.A.L.T: Using Transformer for Diffusion Model", but the specific content needs to be improved..
