Long Story Short: Story-level Video Understanding from 20K Short Films

AmazUtah_NLP at SemEval-2024 Task 9: A MultiChoice Question Answering System for Commonsense Defying Reasoning


View a PDF of the paper titled Long Story Short: Story-level Video Understanding from 20K Short Films, by Ridouane Ghermi and 3 other authors

View PDF
HTML (experimental)

Abstract:Recent developments in vision-language models have significantly advanced video understanding. Existing datasets and tasks, however, have notable limitations. Most datasets are confined to short videos with limited events and narrow narratives. For example, datasets with instructional and egocentric videos often depict activities of one person in a single scene. Although existing movie datasets offer richer content, they are often limited to short-term tasks, lack publicly available videos, and frequently encounter data leakage issues given the use of subtitles and other information about commercial movies during LLM pretraining. To address the above limitations, we propose Short-Films 20K (SF20K), the largest publicly available movie dataset. SF20K is composed of 20,143 amateur films and offers long-term video tasks in the form of multiple-choice and open-ended question answering. Our extensive analysis of SF20K reveals minimal data leakage, emphasizes the need for long-term reasoning, and demonstrates the strong performance of recent VLMs. Finally, we show that instruction tuning on the SF20K-Train set substantially improves model performance, paving the way for future progress in long-term video understanding.

Submission history

From: Ridouane Ghermi [view email]
[v1]
Fri, 14 Jun 2024 17:54:54 UTC (12,924 KB)
[v2]
Fri, 10 Jan 2025 10:36:58 UTC (34,848 KB)



Source link
lol

By stp2y

Leave a Reply

Your email address will not be published. Required fields are marked *

No widgets found. Go to Widget page and add the widget in Offcanvas Sidebar Widget Area.