CoDi: Any-to-Any Generation, Unifying Multiple Modalities

Summary


Paper link: https://arxiv.org/pdf/2305.11846.pdf
We present Composable Diffusion (CoDi), a novel generative model capable of generating any combination of output modalities, such as language, image, video, or audio, from any combination of input modalities. Unlike existing generative AI systems, CoDi can generate multiple modalities in parallel, and its input is not limited to a subset of modalities like text or images. Despite the absence of training datasets for many combinations of modalities, we propose to align modalities in both the input and output space. This allows CoDi to freely condition on any combination of inputs and generate any group of modalities, even if they are not present in the training data. CoDi employs a novel composable generation strategy that builds a shared multimodal space by bridging alignment in the diffusion process, enabling the synchronized generation of intertwined modalities, such as temporally aligned video and audio. Highly customizable and flexible, CoDi achieves strong joint-modality generation quality, and outperforms or matches the unimodal state of the art for single-modality synthesis. The project page with demos and code is at https://codi-gen.github.io/.

1 Introduction

In recent years, powerful cross-modal models have emerged that can generate one modality from another, such as text-to-text [6, 37], text-to-image [13, 19, 22, 41, 44], or text-to-audio [23, 33].

However, these models are limited in their real-world applicability, where multiple modalities coexist and interact. Although one can chain modality-specific generative models in a multi-step generation pipeline, the generation power of each step remains inherently limited, and a serial, multi-step process can be cumbersome and slow. Moreover, independently generated unimodal streams will be inconsistent and misaligned when stitched together in a post-processing fashion (e.g., synchronized video and audio). The development of a comprehensive and versatile model that can generate any combination of modalities from any set of input conditions has been eagerly anticipated, as it would more accurately capture the multimodal nature of the world and human comprehension, seamlessly consolidate information from a wide range of sources, and enable strong immersion in human-AI interactions (for example, by generating coherent video, audio, and text descriptions at the same time).

To achieve this goal, we propose Composable Diffusion (CoDi), the first model capable of simultaneously processing and generating arbitrary combinations of modalities, as shown in Fig. 1. Training a model to take any mixture of input modalities and flexibly generate any mixture of outputs presents significant computational and data requirements, as the number of combinations of input and output modalities grows exponentially. Moreover, aligned training data for many groups of modalities is scarce or even nonexistent, making it infeasible to train with all possible input-output combinations. To address this challenge, we propose to align multiple modalities in both the input conditioning (Section 3.2) and the generation diffusion step (Section 3.4). Furthermore, the proposed "Bridging Alignment" strategy for contrastive learning (Section 3.2) allows us to efficiently model the exponential number of input-output combinations with a linear number of training objectives.

Building a model with any-to-any generation capacity and strong generation quality requires comprehensive model design and training on diverse data resources. CoDi is therefore built in a composable way. In the first stage of training, a latent diffusion model (LDM) is trained for each modality, i.e., text, image, video, and audio; these models can be trained independently and in parallel, ensuring high single-modality generation quality using widely available modality-specific data (Section 3.3). For conditional cross-modality generation, such as generating an image from an audio+language prompt, the input modalities are projected into a shared feature space (Section 3.2), and the output LDM attends to the combination of input features. This multimodal conditioning mechanism prepares the diffusion model to condition on any modality or combination of modalities without directly training for such settings.

The second stage of training enables the model to handle many-to-many generation strategies that involve simultaneously generating arbitrary combinations of output modalities. To the best of our knowledge, CoDi is the first AI model with this capability. This is achieved by adding a cross-attention module to each diffuser and an environment encoder V to project the latent variables of different LDMs into a shared latent space (Section 3.4). Next, we freeze the parameters of the LDMs and train only the cross-attention parameters and V. Since the environment encoders of different modalities are aligned, an LDM can cross-attend to any group of co-generated modalities by interpolating their representations output by V. This enables CoDi to seamlessly generate any group of modalities without training on all possible generation combinations, reducing the number of training objectives from exponential to linear.

We demonstrate the any-to-any generation capability of CoDi, including single-to-single modality generation, multi-condition generation, and the novel capacity of jointly generating multiple modalities. For example, CoDi can generate synchronized video and audio given a text input prompt, or generate video given prompt image and audio. We also quantitatively evaluate CoDi on eight multimodal datasets. CoDi exhibits excellent generation quality across a variety of scenarios, with synthesis quality on par with or even better than single-to-single modality SOTA, e.g., for audio generation and audio captioning.

2. Related work

Diffusion models (DMs) learn the data distribution by denoising and recovering the original data. The Deep Diffusion Process (DDP) [45] adopts a sequence of reversible diffusion steps to model image probability distributions. It uses a reversible encoder to map the input image to a latent space and a decoder to map the latent variables to an output image. The Denoising Diffusion Probabilistic Model (DDPM) [20] uses a cascade of diffusion processes to gradually increase the complexity of the probability density function model. At each step, the model adds noise to the input image and estimates the corresponding noise level using an autoregressive model. This allows the model to capture the dependencies between adjacent pixels and generate high-quality images. Score-based generative models (SOG) [46] use score functions to model the diffusion process. [40] generates high-fidelity images conditioned on CLIP representations of text prompts. The Latent Diffusion Model (LDM) [41] uses a VAE to encode the input into a latent space, reducing the modeling dimension and improving efficiency. The motivation is that image compression to a semantic space can be handled by the diffusion model, while the perceptual space is handled by the autoencoder. Video diffusion models build on image diffusers by incorporating temporal modeling modules and cascaded model architectures to generate temporally consistent frames [14, 19, 21, 44]. Diffusion models have also been applied to other domains, such as generating audio from text and visual prompts [23, 33].

In recent years, multimodal modeling techniques have developed rapidly, with researchers striving to build a unified representation of multiple modalities with a single model for more comprehensive cross-modal understanding. Vision Transformers [11], with diverse model architectures and training techniques, have been applied to various downstream tasks such as visual question answering and image captioning. Multimodal encoders have also proven successful in the vision-language [1, 8, 57], video-audio [47], and video-speech-language [55, 56] domains. Aligning data from different modalities is an active research area [12, 38], with promising applications in cross-modal retrieval and building unified multimodal representations [33, 35, 41].

3. Method

3.1. Preliminary: Latent Diffusion Model

Diffusion models (DM) are a class of generative models that learn the data distribution p(x) by simulating the diffusion of information over time. During training, random noise is iteratively added to x while the model learns to denoise the examples. For inference, the model denoises data points sampled from a simple distribution such as a Gaussian distribution. The Latent Diffusion Model (LDM) [41] learns the distribution of the latent variable z corresponding to x, which significantly reduces the computational overhead by reducing the data dimensionality.

In LDM, an autoencoder is first trained to reconstruct $\boldsymbol{x}$, i.e., $\hat{\boldsymbol{x}}=D(E(\boldsymbol{x}))$, where $E$ and $D$ denote the encoder and decoder, respectively. Based on a variance schedule $\beta_{1}, \ldots, \beta_{T}$, the latent variable $\boldsymbol{z}=E(\boldsymbol{x})$ is iteratively diffused over time steps $t$, i.e., $q\left(\boldsymbol{z}_{t} \mid \boldsymbol{z}_{t-1}\right)=\mathcal{N}\left(\boldsymbol{z}_{t} ; \sqrt{1-\beta_{t}}\, \boldsymbol{z}_{t-1}, \beta_{t} \boldsymbol{I}\right)$ [20, 45].

The forward process allows $\boldsymbol{z}_{t}$ to be randomly sampled in closed form at any time step [20, 45]: $\boldsymbol{z}_{t}=\alpha_{t} \boldsymbol{z}+\sigma_{t} \boldsymbol{\epsilon}$, with $\boldsymbol{\epsilon} \sim \mathcal{N}(0, \boldsymbol{I})$, $\alpha_{t}:=1-\beta_{t}$, and $\sigma_{t}:=1-\prod_{s=1}^{t} \alpha_{s}$. The diffuser learns to denoise $\left\{\boldsymbol{z}_{t}\right\}$ to recover $\boldsymbol{z}$. Following the reparameterization method proposed in [20], the denoising training objective can be expressed as [41]:

$$\mathcal{L}_{D}=\mathbb{E}_{\boldsymbol{z}, \boldsymbol{\epsilon}, t}\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_{\theta}\left(\boldsymbol{z}_{t}, t, C(\boldsymbol{y})\right)\right\|_{2}^{2} \tag{1}$$

In data generation, denoising is realized by sampling from the reparameterized Gaussian:
$$p\left(\boldsymbol{z}_{t-1} \mid \boldsymbol{z}_{t}\right)=\mathcal{N}\left(\boldsymbol{z}_{t-1} ; \frac{1}{\sqrt{\alpha_{t}}}\left(\boldsymbol{z}_{t}-\frac{\beta_{t}}{\sqrt{\sigma_{t}}} \boldsymbol{\epsilon}_{\theta}\right), \beta_{t} \boldsymbol{I}\right) \tag{2}$$

In $\mathcal{L}_{D}$, the diffusion time step $t \sim \mathcal{U}[1, T]$; $\boldsymbol{\epsilon}_{\theta}$ is a denoising model with a UNet backbone parameterized by $\theta$; $\boldsymbol{y}$ denotes the conditional variable that can be used to control generation; and $C$ is the prompt encoder. The conditioning mechanism is implemented by first featurizing $\boldsymbol{y}$ into $C(\boldsymbol{y})$, and then conditioning the UNet $\boldsymbol{\epsilon}_{\theta}$ on $C(\boldsymbol{y})$ via cross-attention, as described in [41]. Distinct from previous works, our model can be conditioned on any combination of text, image, video, and audio. Details are presented in the following section.
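To make Eq. (1) concrete, below is a minimal PyTorch-style sketch of one training step of a conditional LDM under the notation above. The names `unet`, `prompt_encoder`, `alphas`, and `sigmas` are illustrative placeholders, not the paper's code.

```python
import torch
import torch.nn.functional as F

def ldm_training_step(unet, prompt_encoder, z, y, alphas, sigmas, T):
    """One denoising training step for Eq. (1): predict the added noise.

    z      : clean latents E(x), shape (B, C, H, W)
    y      : conditioning input (e.g., tokenized text)
    alphas : per-step scales alpha_t precomputed from the beta schedule
    sigmas : per-step noise scales sigma_t
    """
    B = z.shape[0]
    t = torch.randint(0, T, (B,), device=z.device)   # t ~ U[1, T] (0-indexed here)
    eps = torch.randn_like(z)                         # eps ~ N(0, I)

    a_t = alphas[t].view(B, 1, 1, 1)
    s_t = sigmas[t].view(B, 1, 1, 1)
    z_t = a_t * z + s_t * eps                         # closed-form forward diffusion

    cond = prompt_encoder(y)                          # C(y)
    eps_pred = unet(z_t, t, cond)                     # epsilon_theta(z_t, t, C(y))
    return F.mse_loss(eps_pred, eps)                  # || eps - eps_theta ||_2^2
```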

3.2 Composable multimodal conditioning

To enable our model to be conditioned on any combination of input/prompt modalities, we align the prompt encoders of text, image, video, and audio (denoted by $C_{t}$, $C_{i}$, $C_{v}$, and $C_{a}$, respectively) to project the input from any modality into the same space. Multimodal conditioning can then be conveniently achieved by interpolating the representations of each modality $m$: $C\left(x_{t}, x_{i}, x_{v}, x_{a}\right)=\sum_{m} \alpha_{m} C(m)$ for $m \in \{x_{t}, x_{i}, x_{v}, x_{a}\}$, where $\sum_{m} \alpha_{m}=1$. Through simple weighted interpolation of the aligned embeddings, we enable models trained with a single conditioning (i.e., only one input) to perform zero-shot multi-conditioning (i.e., with multiple inputs). This process is illustrated in Fig. 2(a)(2).
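As a rough illustration of this weighted interpolation, the sketch below assumes the four prompt encoders already map into a shared embedding space; the encoder objects and weights are placeholders rather than the authors' implementation.

```python
import torch

def compose_condition(encoders, inputs, weights):
    """Interpolate aligned prompt embeddings: C = sum_m alpha_m * C_m(x_m).

    encoders : dict modality -> aligned prompt encoder (C_t, C_i, C_v, C_a)
    inputs   : dict modality -> raw prompt for that modality (any subset)
    weights  : dict modality -> alpha_m; expected to sum to 1 over the inputs
    """
    total = sum(weights[m] for m in inputs)
    cond = None
    for m, x in inputs.items():
        emb = encoders[m](x)                   # C_m(x_m), shape (B, D)
        term = (weights[m] / total) * emb      # renormalize so the alphas sum to 1
        cond = term if cond is None else cond + term
    return cond

# e.g., condition on text and audio with equal weight:
# cond = compose_condition({"text": C_t, "audio": C_a},
#                          {"text": text_prompt, "audio": audio_prompt},
#                          {"text": 0.5, "audio": 0.5})
```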

Simultaneously optimizing all four prompt encoders in a combinatorial manner is computationally heavy, with $\mathcal{O}\left(n^{2}\right)$ modality pairs. Furthermore, for certain pairs of modalities, well-aligned paired datasets are limited or unavailable, e.g., image-audio pairs. To address this challenge, we propose a simple and effective technique called "Bridging Alignment" to efficiently align the conditional encoders. As shown in Fig. 2(a)(1), we choose the text modality as the "bridging" modality because it is ubiquitously present in paired data, such as text-image, text-video, and text-audio pairs. We begin with the pretrained text-image paired encoder of CLIP [38]. We then train the audio and video prompt encoders on audio-text and video-text paired datasets using contrastive learning, with the weights of the text and image encoders frozen.
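A minimal sketch of the contrastive (InfoNCE-style) objective for bridging a new modality encoder to the frozen CLIP text encoder is shown below; the batch layout and temperature are assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def bridging_alignment_loss(audio_encoder, frozen_text_encoder, audio, text, tau=0.07):
    """Contrastively align a trainable audio prompt encoder to the frozen CLIP text space.

    Positive pairs sit on the diagonal of the batch similarity matrix.
    """
    a = F.normalize(audio_encoder(audio), dim=-1)             # trainable, (B, D)
    with torch.no_grad():
        t = F.normalize(frozen_text_encoder(text), dim=-1)    # frozen CLIP text embeddings

    logits = a @ t.T / tau                                     # (B, B) scaled cosine similarities
    labels = torch.arange(a.shape[0], device=a.device)
    # symmetric cross-entropy over audio->text and text->audio directions
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
```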

In this way, all four modalities are aligned in the feature space. As shown in Section 5.2, CoDi can effectively exploit and combine the complementary information present in any combination of modalities to produce more accurate and comprehensive outputs, and the generation quality remains high regardless of the number of prompt modalities. As we will discuss in subsequent sections, we continue to apply Bridging Alignment to align the latent spaces of the LDMs of different modalities for joint multimodal generation.

3.3. Composable Diffusion

Training an end-to-end any-to-any model requires extensive learning on diverse data resources. The model also needs to maintain generation quality for all synthesis flows. To address these challenges, CoDi is designed to be composable and integrative: models specific to individual modalities can be built independently and then smoothly integrated later. Specifically, we start by independently training the image, video, audio, and text LDMs. These diffusion models then efficiently learn to attend across modalities for joint multimodal generation via a novel mechanism called "latent alignment" (Section 3.4).

Image Diffusion Model. The image LDM follows the same structure as Stable Diffusion 1.5 [41] and is initialized with the same weights. Reusing the weights transfers the knowledge and exceptional generation fidelity of Stable Diffusion, trained on large-scale high-quality image datasets, to CoDi.

Video Diffusion Model. To model the temporal properties of videos while maintaining visual generation quality, we build the video diffuser by extending the image diffuser with temporal modules. Specifically, we insert pseudo-temporal attention [13] before the residual blocks. However, we argue that pseudo-temporal attention only enables video frames to globally attend to one another by flattening the pixels (height and width dimensions) to the batch dimension, resulting in a lack of cross-frame interaction between local pixels. We argue that this leads to the common temporal-inconsistency issue in video generation, where the position, shape, color, etc. of objects can be inconsistent across generated frames. To address this problem, we propose to adapt latent shift [2], which performs temporal-spatial shifts on latent features in accordance with temporal attention. We divide the video by the hidden dimension into k = 8 chunks, and for each chunk i = 0 to 7, we shift its temporal dimension forward by i positions. Further details will be provided in the appendix.
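The latent-shift idea can be illustrated as follows: split the hidden channels into k = 8 chunks and shift the i-th chunk forward by i steps along the time axis. This is a simplified reading of the description above, not the authors' implementation; in particular, a circular roll is used here for brevity, whereas the actual method may pad and truncate instead.

```python
import torch

def temporal_latent_shift(z, k=8):
    """Shift video latent features along the time dimension, chunk-wise over channels.

    z : video latents of shape (B, C, T, H, W)
    Channel chunk i is shifted forward by i positions in time, so each frame's
    features mix information from neighboring frames.
    """
    chunks = torch.chunk(z, k, dim=1)              # split channels into k blocks
    shifted = [torch.roll(c, shifts=i, dims=2)     # roll block i by i steps in time
               for i, c in enumerate(chunks)]
    return torch.cat(shifted, dim=1)
```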

Audio Diffusion Model. To enable flexible cross-modality attention in joint generation, the audio diffuser is designed to have an architecture similar to the vision diffusers, where the mel-spectrogram can naturally be viewed as an image with 1 channel. We encode the mel-spectrogram of the audio into a compressed latent space with a VAE encoder. In audio synthesis, a VAE decoder maps the latent variable to a mel-spectrogram, and a vocoder generates the audio sample from the mel-spectrogram. We use the audio VAE from [33] and the vocoder from [27].
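A hedged illustration of the audio synthesis path described above (latent → mel-spectrogram → waveform); `vae_decoder` and `vocoder` are generic callables, not the interfaces of the specific VAE [33] or vocoder [27].

```python
import torch

def synthesize_audio(z_audio: torch.Tensor, vae_decoder, vocoder) -> torch.Tensor:
    """Decode an audio latent produced by the audio diffuser into a waveform."""
    mel = vae_decoder(z_audio)   # (B, 1, n_mels, time): mel-spectrogram as a 1-channel image
    waveform = vocoder(mel)      # (B, num_samples): audio samples from the mel-spectrogram
    return waveform
```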

Text Diffusion Model. The VAE of the text LDM is OPTIMUS [29], whose encoder and decoder are [9] and GPT-2 [39], respectively. For the denoising UNet, unlike in image diffusion, the 2D convolutions in the residual blocks are replaced by 1D convolutions [53].

3.4. Joint Multimodal Generation by Latent Alignment

The final step is to enable cross-attention between diffusion flows in joint generation, i.e., when generating two or more modalities simultaneously. This is achieved by adding cross-modal attention sublayers to the UNet $\boldsymbol{\epsilon}_{\theta}$ (Fig. 2(b)(2)). Specifically, consider a diffusion model of modality A that cross-attends to another modality B. Let the latent variables of modalities $m_{A}$ and $m_{B}$ at diffusion step $t$ be denoted as $\boldsymbol{z}_{t}^{A}$ and $\boldsymbol{z}_{t}^{B}$, respectively. The proposed "Latent Alignment" technique is such that a modality-specific environment encoder $V_{B}$ first projects $\boldsymbol{z}_{t}^{B}$ into a shared latent space across modalities. Then, in each layer of the UNet of modality A, a cross-attention sublayer attends to $V_{B}\left(\boldsymbol{z}_{t}^{B}\right)$. For the diffusion model of modality A, the training objective in Eq. (1) now becomes:
$$\mathcal{L}_{\text{Cross}}^{A}=\mathbb{E}_{\boldsymbol{z}, \boldsymbol{\epsilon}, t}\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_{\theta_{c}}\left(\boldsymbol{z}_{t}^{A}, V_{B}\left(\boldsymbol{z}_{t}^{B}\right), t, C(\boldsymbol{y})\right)\right\|_{2}^{2}$$
where $\theta_{c}$ denotes the weights of the cross-attention modules in the UNet.

The training objective of A+B joint generation is then $\mathcal{L}_{\text{Cross}}^{A}+\mathcal{L}_{\text{Cross}}^{B}$. The $V(\cdot)$ of different modalities are trained to be aligned with contrastive learning. Since $\boldsymbol{z}_{t}^{A}$ and $\boldsymbol{z}_{t}^{B}$ at any time step can be sampled in closed form via the diffusion process in Section 3.1, the contrastive learning can conveniently be trained together with $\mathcal{L}_{\text{Cross}}$. The purpose of $V$ is to achieve the generation of any combination of modalities (in polynomial number) by training on only a linear number of joint-generation tasks. For example, if we independently train the joint generation of modalities A, B and of modalities B, C, then we have $V_{A}\left(\boldsymbol{z}_{t}^{A}\right)$, $V_{B}\left(\boldsymbol{z}_{t}^{B}\right)$, and $V_{C}\left(\boldsymbol{z}_{t}^{C}\right)$ aligned. Therefore, CoDi can seamlessly achieve the joint generation of modalities A and C without any additional training. Moreover, such a design automatically allows the joint generation of modalities A, B, and C concurrently: the UNet of A can cross-attend to the interpolation of $V_{B}\left(\boldsymbol{z}_{t}^{B}\right)$ and $V_{C}\left(\boldsymbol{z}_{t}^{C}\right)$, although CoDi has not been trained on such a task.
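The sketch below illustrates the joint-generation objective $\mathcal{L}_{\text{Cross}}^{A}+\mathcal{L}_{\text{Cross}}^{B}$: each UNet predicts its own noise while cross-attending to the other modality's latent projected by its environment encoder V. The module call signatures (e.g., the `context` argument) are assumptions for illustration, not the actual interface.

```python
import torch
import torch.nn.functional as F

def diffuse(z, eps, t, alphas, sigmas):
    """Closed-form forward diffusion z_t = alpha_t * z + sigma_t * eps (Section 3.1)."""
    a = alphas[t].view(-1, 1, 1, 1)
    s = sigmas[t].view(-1, 1, 1, 1)
    return a * z + s * eps

def joint_generation_step(unet_A, unet_B, V_A, V_B, z_A, z_B, cond, alphas, sigmas, T):
    """One training step of A+B joint generation: L_Cross^A + L_Cross^B.

    Only the cross-attention layers and the environment encoders V are trainable;
    the LDM backbones are assumed to be frozen outside this function.
    """
    B = z_A.shape[0]
    t = torch.randint(0, T, (B,), device=z_A.device)
    eps_A, eps_B = torch.randn_like(z_A), torch.randn_like(z_B)
    zt_A = diffuse(z_A, eps_A, t, alphas, sigmas)
    zt_B = diffuse(z_B, eps_B, t, alphas, sigmas)

    # each UNet denoises its own latent while cross-attending to the other
    # modality's latent projected into the shared space by V
    pred_A = unet_A(zt_A, t, cond, context=V_B(zt_B))
    pred_B = unet_B(zt_B, t, cond, context=V_A(zt_A))

    return F.mse_loss(pred_A, eps_A) + F.mse_loss(pred_B, eps_B)
```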

As shown in Fig. 2(b)(3), we follow a design similar to "Bridging Alignment" when training the joint generation: (1) We first train the cross-attention weights in the image and text diffusers, as well as their environment encoders V, on text-image paired data. (2) We freeze the weights of the text diffuser and train the environment encoder and cross-attention weights of the audio diffuser on text-audio paired data. (3) Finally, we freeze the audio diffuser and its environment encoder, and train the joint generation of the video modality on audio-video paired data. As shown in Section 5.3, although trained on only three paired joint-generation tasks (i.e., text+audio, text+image, and video+audio), CoDi is capable of simultaneously generating various combinations of modalities unseen in training, e.g., the joint image-text-audio generation in Fig. 5.

4. Experiments

4.1. Training objectives and datasets

We list the training tasks of CoDi in Table 1, including single-modality synthesis, joint multimodal generation, and contrastive learning for aligning prompt encoders. Table 1 summarizes the datasets, tasks, number of samples, and domains. The datasets come from the following domains: image + text (e.g., images with captions), audio + text (e.g., audio with descriptions), audio + video (e.g., videos with sound), and video + text (e.g., videos with descriptions). As one may notice, the language modality appears in most datasets and domains. This echoes the idea of using text as the bridging modality to enable inference and generation of new, unseen combinations, such as audio and image bridged by text, as described in Sections 3.2 and 3.4. Due to space limitations, more details on the training datasets are given in Appendix C, model architecture details in Appendix A.1, and training details in Appendix B.

Image + text. We use the recently developed large-scale image-captioning dataset LAION-400M [42]. This image-text paired data allows us to train on the tasks text→image, image→text, and the joint generation of image and text. For the joint generation task, we propose to train with text→image+text, where the prompt text is the truncated image caption and the output text is the original caption. Since the conditioning information is incomplete, the text and image diffusers need to learn to attend to each other through the joint generation process.

Audio + text. We curated a new dataset, Freesound 500K, by crawling 500K audio samples together with tags and descriptions from the Freesound website. We also use AudioSet [42], with 2 million human-labeled 10-second sound clips from YouTube videos, and AudioCaps [24], with 46K audio-text pairs derived from the AudioSet dataset. Audio samples are clipped into 10-second segments for training. The paired audio + text data enables us to train text→audio, audio→text, text→audio+text generation, and audio-text contrastive learning. Similar to image + text joint generation, in text→audio+text the text prompt is the truncated text and the output is the original text.

Video. We use the following diverse, high-quality video datasets to train video generation and the video prompt encoder: WebVid [4], a large-scale dataset of web videos with descriptions, and HD-Villa-100M [54], which contains high-resolution (at least 720p) YouTube videos. We use WebVid for the text→video and video-text contrastive learning tasks, and HD-Villa-100M for image→video generation, with the middle frame as the input image.

Audiovisual. Web videos are a naturally aligned data resource for audio and video. However, many existing datasets, e.g., ACAV100M [28], focus mainly on videos of human speech rather than natural sounds. We therefore leverage the sound-oriented datasets AudioSet and SoundNet [3] for joint audio-video generation. For image→audio+video, we use the middle frame of the target video as the input prompt image. We also train the model to generate audio with the middle frame as the prompt input, i.e., image→audio.

5. Evaluation results

In this section, we evaluate the quality of model generation under different settings, including single modality generation, multi-condition generation, and multi-output joint generation. We provide quantitative benchmarks on the evaluation dataset as well as qualitative visualizations.

5.1 Single-modality generation results

We first present example demonstrations in Figure 3, showing various single-to-single modality generations. We then evaluate the synthesis quality of unimodal generation for text, image, video, and audio. CoDi achieves SOTA on audio captioning and audio generation, as shown in Table 6 and Table 4. Notably, CoDi, a diffusion-based model, shows for the first time in this field performance comparable to autoregressive-transformer-based SOTA on image captioning (Table 5). CoDi is the first diffusion-model-based result for video captioning (Table 7). On image and video generation, CoDi performs competitively with the state of the art (Table 2 and Table 3). This gives us a strong starting point for multi-condition and multi-output generation, which will be presented in Sections 5.2 and 5.3.

We demonstrated in Section 3.2 that CoDi is capable of integrating representations from different modalities during generation. We therefore first present a demonstration of multi-condition generation, shown in Figure 4.

5.2. Multi-condition generation results

For quantitative evaluation, we focus on multiple-input image synthesis, since the evaluation metric in this case (FID) does not require modality-specific inputs such as text. We test several input combinations, including text+image, text+audio, image+audio, text+video, and the three-input combination text+audio+image. We test on the validation set of AudioCaps [24], since all four modalities are present in this dataset. The prompt image input is the middle frame of the video. As shown in Table 8, CoDi achieves high image-generation quality given various groups of input modalities. We also test several input combinations with video as the output, including text, text+audio, image+image, and text+audio+image. We also test on MSR-VTT [24], since all four modalities are present in this dataset. Likewise, the prompt image input is the middle frame of the video. As shown in Table 9, CoDi achieves high similarity between the generated video and the ground-truth text given various groups of input modalities. Moreover, our model does not require training on multi-condition generation such as text+audio or text+image. Through Bridging Alignment and the composable multimodal conditioning proposed in Section 3.2, our model trained with single conditioning can perform zero-shot inference with multiple conditions.


5.3. Multi-output joint generation results

For joint multimodal generation, we first show high-quality multimodal-output joint generation demos, as shown in Figure 5. For quantitative evaluation, there is no existing evaluation metric, since ours is the first model that can simultaneously generate across all four modalities. Therefore, we propose the following metric SIM, which quantifies the coherence and consistency between two generated modalities via the cosine similarity of their embeddings:
$$\operatorname{SIM}(A, B)=\cos\left(C_{A}(A), C_{B}(B)\right)$$

where A and B are two generated modalities, and $C_{A}$ and $C_{B}$ are aligned encoders that project A and B into the same space. We use the prompt encoders described in Section 3.2. This metric computes the cosine similarity of the embeddings of the two modalities produced by the contrastively learned prompt encoders; thus, a higher value indicates that the generated modalities are more consistent and aligned.
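A minimal sketch of the SIM metric as defined above: embed each generated modality with its aligned prompt encoder and take the cosine similarity, averaged over a batch of generated pairs.

```python
import torch
import torch.nn.functional as F

def sim(gen_A, gen_B, C_A, C_B):
    """SIM(A, B) = cos(C_A(A), C_B(B)), averaged over a batch of generated pairs."""
    a = F.normalize(C_A(gen_A), dim=-1)     # aligned embedding of generated modality A
    b = F.normalize(C_B(gen_B), dim=-1)     # aligned embedding of generated modality B
    return (a * b).sum(dim=-1).mean()       # batch-mean cosine similarity
```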

To demonstrate the effectiveness of joint generation, suppose the prompt modality is P. We compare SIM(A, B) for A and B generated separately versus jointly, i.e., {P→A, P→B} versus {P→A+B}. The benchmark is the validation set of AudioCaps [24]. We test the following settings: audio→image+text, image→audio+text, text→video+audio, image→video+audio, audio→video+text, audio→text+video+image, and text→video+image+audio, where the image prompt is the middle frame of the video clip. As shown in Table 10, joint generation (similarity shown on the right side of "/") consistently outperforms independent generation (similarity shown on the left side of "/").

6 Conclusion

In this paper, we present Composable Diffusion (CoDi), a groundbreaking multimodal generative model capable of processing and simultaneously generating text, image, video, and audio. Our approach enables the synergistic generation of high-quality, coherent outputs spanning various modalities from various combinations of input modalities. Through extensive experiments, we demonstrate CoDi's remarkable ability to flexibly generate single or multiple modalities from a wide range of inputs. Our work marks a significant step towards more engaging and holistic human-computer interaction, laying a solid foundation for future research in generative artificial intelligence. See Appendix D for a discussion of limitations and broader impacts.
