Is the Stable Diffusion architecture about to be replaced? A detailed look at Google's latest killer model, VideoPoet

Diffusion Models video generation: blog summary

Foreword: Video generation has long been dominated by Stable Diffusion. Most methods add temporal layers to an image-pretrained Stable Diffusion model so that it can learn motion. Models such as CoDi (see "[NeurIPS 2023] Multi-modal joint video generation large model CoDi") have tried to break out of this architecture, but none of them has had much impact on the industry. Recently Google stepped in with VideoPoet, a decoder-only video generation model, and it is a real trump card! Huawei once released a decoder-only model (and was widely mocked for it), but it is gradually becoming clear how much potential there is in encoding text, audio, video, and other modalities into tokens and modeling them together. Will next year's ChatGPT-5 take this form too?



Origin blog.csdn.net/qq_41895747/article/details/135211141