The Wallpaper is Ugly: Indoor Localization using Vision and Language




arXiv:2410.03900v1
Abstract: We study the task of locating a user in a mapped indoor environment using natural language queries and images from the environment.
Building on recent pretrained vision-language models, we learn a similarity score between text descriptions and images of locations in the environment.
This score allows us to identify locations that best match the language query, estimating the user’s location.
Our approach generalizes to environments, text, and images that were not seen during training. One model, a finetuned CLIP, outperformed humans in our evaluation.
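
The core idea is to score each mapped location's image against the user's text query with a vision-language model and return the highest-scoring location. Below is a minimal sketch of that matching step, assuming an off-the-shelf CLIP checkpoint from Hugging Face rather than the authors' finetuned model; the location names, image paths, and query are illustrative only.

```python
# Sketch: rank images of mapped locations against a natural-language query
# using a stock CLIP model ("openai/clip-vit-base-patch32") as a stand-in
# for the paper's finetuned model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Hypothetical map: one representative image per known indoor location.
location_images = {
    "kitchen": "map/kitchen.jpg",
    "lobby": "map/lobby.jpg",
    "conference_room": "map/conference_room.jpg",
}
query = "I'm standing next to a red couch under ugly floral wallpaper."

images = [Image.open(path) for path in location_images.values()]
inputs = processor(text=[query], images=images,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text has shape (num_texts, num_images); higher means a
# better text-image match, so the argmax is the estimated location.
scores = outputs.logits_per_text[0]
best = scores.argmax().item()
print("Estimated location:", list(location_images)[best])
```

Because the similarity is computed per image, unseen environments only require collecting new location images; no retraining of the scoring model is implied by this sketch.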



Source link

By stp2y
