SJTU:Spatial judgments in multimodal models towards unified segmentation through coordinate detection

By stp2y, December 9, 2024


[Submitted on 3 Dec 2024 (v1), last revised 6 Dec 2024 (this version, v2)]

Authors: Joongwon Chae and 2 other authors


Abstract: Despite significant advances in vision-language understanding, implementing image segmentation within multimodal architectures remains a fundamental challenge in modern artificial intelligence systems. Existing vision-language models, which primarily rely on backbone architectures or CLIP-based embedding learning, demonstrate inherent limitations in fine-grained spatial localization and operational capabilities. This paper introduces SJTU: Spatial Judgments in Multimodal Models – Towards Unified Segmentation through Coordinate Detection, a framework that leverages spatial coordinate understanding to bridge vision-language interaction and precise segmentation, enabling accurate target identification through natural language instructions. The framework presents an approach for integrating segmentation techniques with vision-language models through spatial inference in multimodal space. By utilizing normalized coordinate detection for bounding boxes and transforming them into actionable segmentation outputs, we establish a connection between spatial and language representations in multimodal architectures. Experimental results demonstrate superior performance across benchmark datasets, achieving IoU scores of 0.5958 on COCO 2017 and 0.6758 on Pascal VOC. Testing on a single NVIDIA RTX 3090 GPU with 512×512 resolution images yields an average inference time of 7 seconds per image, demonstrating the framework’s effectiveness in both accuracy and practical deployability. The project code is available at this https URL
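
The abstract mentions normalized bounding-box coordinates and IoU scores but does not spell out the coordinate format or the segmentation step. The following is a minimal sketch, not the authors' code, assuming boxes are given as (x_min, y_min, x_max, y_max) with each value in [0, 1]. The function names are hypothetical, and `box_to_mask` simply fills the rectangle as a stand-in for whatever segmentation model would consume the box in practice.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # assumed order: x_min, y_min, x_max, y_max
PixelBox = Tuple[int, int, int, int]
Mask = List[List[bool]]                   # mask[y][x], True where the object is


def denormalize_box(box: Box, width: int, height: int) -> PixelBox:
    """Map a box with coordinates in [0, 1] to integer pixel coordinates."""
    def clamp(v: float) -> float:
        # Keep slightly out-of-range model outputs inside the image.
        return min(max(v, 0.0), 1.0)

    x0, y0, x1, y1 = (clamp(v) for v in box)
    return (round(x0 * width), round(y0 * height),
            round(x1 * width), round(y1 * height))


def box_to_mask(box: PixelBox, width: int, height: int) -> Mask:
    """Placeholder segmentation: mark every pixel inside the box as foreground."""
    x0, y0, x1, y1 = box
    return [[x0 <= x < x1 and y0 <= y < y1 for x in range(width)]
            for y in range(height)]


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two equally sized binary masks."""
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += int(p and t)
            union += int(p or t)
    return inter / union if union else 0.0


if __name__ == "__main__":
    # A 512x512 image, matching the resolution quoted in the abstract.
    pixel_box = denormalize_box((0.25, 0.25, 0.75, 0.75), 512, 512)
    print(pixel_box)
```

Under these assumptions, the reported IoU figures would correspond to averaging `mask_iou` between predicted and ground-truth masks over a dataset; the exact evaluation protocol is not described in the abstract.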

Submission history

From: Joongwon Chae
[v1] Tue, 3 Dec 2024 16:53:58 UTC (13,462 KB)
[v2] Fri, 6 Dec 2024 07:08:56 UTC (13,786 KB)

