I Know About “Up”! Enhancing Spatial Reasoning in Visual Language Models Through 3D Reconstruction

Authors: Zaiqiao Meng, Hao Zhou, Yifang Chen

Abstract: Visual Language Models (VLMs) are essential for various tasks, particularly visual reasoning tasks, due to their robust multi-modal information integration, visual reasoning capabilities, and contextual awareness. However, existing VLMs' visual spatial reasoning capabilities are often inadequate, struggling even with basic tasks such as distinguishing left from right. To address this, we propose the ZeroVLM model, designed to enhance the visual spatial reasoning abilities of VLMs. ZeroVLM employs Zero-1-to-3, a 3D reconstruction model, to obtain different views of the input images and incorporates a prompting mechanism to further improve visual spatial reasoning. Experimental results on four visual spatial reasoning datasets show that ZeroVLM achieves up to 19.48% accuracy improvement, which indicates the effectiveness of its 3D reconstruction and prompting mechanisms.
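
The abstract describes two ingredients: extra views of the input image obtained from a 3D reconstruction model, and a prompting mechanism. The following is a minimal Python sketch of how such a pipeline could be wired together, not the authors' implementation. The names `synthesize_view`, `ask_vlm`, `build_spatial_prompt`, and `answer_with_views`, the prompt wording, and the azimuth offsets are all hypothetical; the abstract does not specify the viewpoints, the prompt text, or how views are passed to the VLM.

```python
from typing import Any, Callable, List, Sequence

# Hypothetical azimuth offsets (degrees) for the synthesized views.
# The abstract does not say which viewpoints are used.
DEFAULT_AZIMUTHS = (-30.0, 30.0)


def build_spatial_prompt(question: str, num_views: int) -> str:
    """Compose a prompt stating that several views of one scene are provided.

    The wording is an illustrative assumption, not the paper's prompt.
    """
    lines = [
        f"You are given {num_views} images of the same scene.",
        "Image 1 is the original; the others are synthesized from other viewpoints.",
        "Answer using the spatial relations as seen in Image 1.",
        f"Question: {question}",
    ]
    return "\n".join(lines)


def answer_with_views(
    image: Any,
    question: str,
    synthesize_view: Callable[[Any, float], Any],
    ask_vlm: Callable[[List[Any], str], str],
    azimuths: Sequence[float] = DEFAULT_AZIMUTHS,
) -> str:
    """Query a VLM with the original image plus synthesized novel views.

    `synthesize_view` stands in for a novel-view model such as Zero-1-to-3,
    and `ask_vlm` stands in for any multi-image VLM call. Both are supplied
    by the caller, so no specific library API is assumed here.
    """
    views = [image] + [synthesize_view(image, az) for az in azimuths]
    prompt = build_spatial_prompt(question, len(views))
    return ask_vlm(views, prompt)
```

Passing the view generator and the VLM as callables keeps the sketch independent of any particular model or library, so either component can be swapped without changing the pipeline.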

Submission history

From: Hao Zhou
[v1] Fri, 19 Jul 2024 09:03:30 UTC (1,899 KB)
[v2] Thu, 12 Sep 2024 11:17:46 UTC (1,976 KB)


