Grid Diffusion Models for Text-to-Video Generation

This paper has been withdrawn by Taegyeong Lee

Authors: Taegyeong Lee and 2 other authors


Abstract: Recent advances in diffusion models have significantly improved text-to-image generation. However, generating videos from text is more challenging than generating images, due to the much larger datasets and higher computational cost required. Most existing video generation methods use either a 3D U-Net architecture that models the temporal dimension or autoregressive generation. Both approaches require large paired datasets and incur far higher computational costs than text-to-image generation. To tackle these challenges, we propose a simple yet effective grid diffusion model for text-to-video generation that requires neither a temporal dimension in the architecture nor a large paired text-video dataset. By representing a video as a single grid image, our method generates high-quality video with a fixed amount of GPU memory regardless of the number of frames. Additionally, since our method reduces video generation to image generation, various image-based techniques carry over to video, such as text-guided video manipulation derived from image manipulation. Our proposed method outperforms existing methods in both quantitative and qualitative evaluations, demonstrating its suitability for real-world video generation.
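To make the grid representation concrete, here is a minimal sketch (not the authors' code; the function names, NumPy implementation, and 4x4 layout are illustrative assumptions) of how T frames can be tiled into one grid image and recovered losslessly. This round trip is what lets an ordinary image diffusion model denoise an entire clip at a memory footprint set by the grid size, not the frame count.

```python
# Illustrative sketch of the video <-> grid-image mapping described
# in the abstract (assumed layout; not the paper's implementation).
import numpy as np

def video_to_grid(frames: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Tile frames of shape (T, H, W, C) into one (rows*H, cols*W, C) image."""
    t, h, w, c = frames.shape
    assert t == rows * cols, "frame count must fill the grid exactly"
    # (rows, cols, H, W, C) -> (rows, H, cols, W, C) -> (rows*H, cols*W, C)
    return (frames.reshape(rows, cols, h, w, c)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(rows * h, cols * w, c))

def grid_to_video(grid: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Inverse of video_to_grid: recover (T, H, W, C) frames from the grid."""
    gh, gw, c = grid.shape
    h, w = gh // rows, gw // cols
    return (grid.reshape(rows, h, cols, w, c)
                .transpose(0, 2, 1, 3, 4)
                .reshape(rows * cols, h, w, c))

# Round-trip check: 16 frames of 64x64 RGB become one 256x256 grid image,
# so an image model sees a fixed 256x256 input regardless of clip length.
frames = np.random.rand(16, 64, 64, 3).astype(np.float32)
grid = video_to_grid(frames, rows=4, cols=4)
assert grid.shape == (256, 256, 3)
assert np.array_equal(grid_to_video(grid, 4, 4), frames)
```

Because the mapping is a pure reshape with no learned parameters, any image-space operation (e.g., text-guided image editing) applied to the grid acts on all frames jointly, which is the basis of the video-manipulation claim above.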

Submission history

From: Taegyeong Lee
[v1]
Sat, 30 Mar 2024 03:50:43 UTC (47,320 KB)
[v2]
Mon, 30 Dec 2024 05:52:33 UTC (1 KB) (withdrawn)


