Foreword: The recent text-to-video paper Show-1 achieved first place in the FVD and CLIPSIM metrics on the MSR-VTT evaluation dataset, and second place in the FID metric. By combining a pixel-based VDM with a latent-space VDM in a hybrid model for text-to-video generation, it achieves strong generation metrics while greatly reducing inference resource consumption. This blog explains the paper and its code in detail.
Table of contents