Tarsier: Recipes for Training and Evaluating Large Video Description Models

AmazUtah_NLP at SemEval-2024 Task 9: A MultiChoice Question Answering System for Commonsense Defying Reasoning


View a PDF of the paper titled Tarsier: Recipes for Training and Evaluating Large Video Description Models, by Jiawei Wang and 3 other authors

View PDF
HTML (experimental)

Abstract:Generating fine-grained video descriptions is a fundamental challenge in video understanding. In this work, we introduce Tarsier, a family of large-scale video-language models designed to generate high-quality video descriptions. Tarsier employs CLIP-ViT to encode frames separately and then uses an LLM to model temporal relationships. Despite its simple architecture, we demonstrate that with a meticulously designed two-stage training procedure, the Tarsier models exhibit substantially stronger video description capabilities than any existing open-source model, showing a $+51.4%$ advantage in human side-by-side evaluation over the strongest model. Additionally, they are comparable to state-of-the-art proprietary models, with a $+12.3%$ advantage against GPT-4V and a $-6.7%$ disadvantage against Gemini 1.5 Pro. When upgraded to Tarsier2 by building upon SigLIP and Qwen2-7B, it further improves significantly with a $+4.8%$ advantage against GPT-4o. Besides video description, Tarsier proves to be a versatile generalist model, achieving new state-of-the-art results across nine public benchmarks, including multi-choice VQA, open-ended VQA, and zero-shot video captioning. Our second contribution is the introduction of a new benchmark — DREAM-1K (this https URL) for evaluating video description models, consisting of a new challenging dataset featuring videos from diverse sources and varying complexity, along with an automatic method specifically designed to assess the quality of fine-grained video descriptions. We make our models and evaluation benchmark publicly available at this https URL.

Submission history

From: Jiawei Wang [view email]
[v1]
Sun, 30 Jun 2024 09:21:01 UTC (4,401 KB)
[v2]
Tue, 24 Sep 2024 04:41:08 UTC (4,403 KB)



Source link
lol

By stp2y

Leave a Reply

Your email address will not be published. Required fields are marked *

No widgets found. Go to Widget page and add the widget in Offcanvas Sidebar Widget Area.