UniCtrl: Improving the Spatiotemporal Consistency of Text-to-Video Diffusion Models via Training-Free Unified Attention Control
by Tian Xia and 2 other authors
Abstract: Video diffusion models have been developed for video generation, usually integrating text and image conditioning to enhance control over the generated content. Despite this progress, ensuring consistency across frames remains a challenge, particularly when text prompts are the control condition. To address this problem, we introduce UniCtrl, a novel plug-and-play method that improves the spatiotemporal consistency and motion diversity of videos generated by text-to-video models, without additional training and regardless of the underlying model. UniCtrl ensures semantic consistency across frames through cross-frame self-attention control, while enhancing motion quality and spatiotemporal consistency through motion injection and spatiotemporal synchronization. Our experimental results demonstrate UniCtrl's efficacy across a range of text-to-video models, confirming both its effectiveness and its universality.
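To make the core idea of cross-frame self-attention control concrete, below is a minimal PyTorch sketch in which every frame attends to the keys and values of a shared anchor frame (frame 0), so spatial content stays semantically aligned across frames. The tensor layout, the function name `cross_frame_self_attention`, and the choice of frame 0 as anchor are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def cross_frame_self_attention(q, k, v, num_frames):
    """Illustrative cross-frame self-attention control (not the paper's code).

    q, k, v: (batch * num_frames, seq_len, dim) projections taken from a
    self-attention layer of a text-to-video diffusion U-Net.
    Returns an attention output of the same shape.
    """
    bf, seq_len, dim = q.shape
    b = bf // num_frames
    # Make the frame axis explicit: (batch, frames, seq, dim).
    k = k.view(b, num_frames, seq_len, dim)
    v = v.view(b, num_frames, seq_len, dim)
    # Broadcast the anchor frame's keys/values to all frames, so each
    # frame's queries attend to the same spatial content.
    k = k[:, :1].expand(-1, num_frames, -1, -1).reshape(bf, seq_len, dim)
    v = v[:, :1].expand(-1, num_frames, -1, -1).reshape(bf, seq_len, dim)
    # Standard scaled dot-product attention with the shared keys/values.
    return F.scaled_dot_product_attention(q, k, v)

# Example: 2 videos x 8 frames, 16 spatial tokens, 64-dim heads (all illustrative).
q = torch.randn(2 * 8, 16, 64)
k = torch.randn(2 * 8, 16, 64)
v = torch.randn(2 * 8, 16, 64)
out = cross_frame_self_attention(q, k, v, num_frames=8)  # -> (16, 16, 64)
```

In practice a sketch like this would be dropped into an existing attention-processor hook so the base model's weights stay untouched, which is what makes the approach training-free and plug-and-play.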
Submission history
From: Xuweiyi Chen
[v1] Mon, 4 Mar 2024 18:58:11 UTC (6,671 KB)
[v2] Tue, 5 Mar 2024 13:58:02 UTC (6,671 KB)
[v3] Wed, 6 Mar 2024 10:46:41 UTC (6,672 KB)
[v4] Sun, 10 Nov 2024 05:34:31 UTC (18,000 KB)